Testing the trajectory difference in a semi-parametric longitudinal model

Feiyang Niu; Jianhui Zhou; Thu H Le; Jennie Z Ma

doi:10.1177/0962280215584109

Author manuscript; available in PMC: 2016 Nov 13.

Published in final edited form as: Stat Methods Med Res. 2015 May 13;26(3):1519–1531. doi: 10.1177/0962280215584109


PMCID: PMC4644124 NIHMSID: NIHMS696144 PMID: 25972495

Abstract

Motivated by a genetic investigation of the progressive decline of renal function in a clinical trial of kidney disease, we develop a practical test for evaluating the group difference in trajectories under a semi-parametric modeling framework. For the temporal patterns or trajectories of longitudinal data, B-splines are used to approximate the function non-parametrically. Such approximation asymptotically converts the problem of testing trajectory difference into a significance test of regression coefficients that can be simply estimated by generalized estimating equations. To select the optimal number of inner knots for B-splines, a cross-validation procedure is performed using the criterion of generalized residual sum of squares. The newly proposed test successfully detects the significant difference of underlying genetic impact on the progression of renal disease, which is not captured by the parametric approach.

Keywords: B-spline, longitudinal data, generalized estimating equation, semi-parametric, trajectory testing

1 Introduction

With the development of modern medical statistics, longitudinal data have been extensively analyzed using parametric models, either marginal or mixed effects models, in many biomedical applications. These parametric approaches for longitudinal data allow us not only to evaluate the effects of population-level risk factors while accounting for subject-specific effects,¹ but also to model the temporal patterns and test for group difference in trajectories over time. For example, in studying patients with hypertensive kidney disease, researchers are interested in comparing the effects of antihypertensive drugs to prevent progressive decline of renal function, which leads to kidney failure. The objective is to test the difference in the patterns of renal function decline over time.² In the parametric settings, response trajectory patterns are usually modeled parametrically as polynomials of time, and group difference in trajectory patterns is commonly tested by the Group-by-Time interaction.³

While parametric models are highly useful, they require a specific form for the mean function of response over time, which can be challenging to specify in practice.⁴ Moreover, questions often arise about the adequacy of the model assumption and the potential impact of model mis-specification on the analysis.⁵ Alternatively, non-parametric or semi-parametric regression methods for longitudinal data offer more flexible modeling options by relaxing the restrictions of parametric models on the temporal trend. Ruppert et al.⁶ give abundant examples of semi-parametric regression based on penalized regression splines and mixed models, as well as examples of likelihood ratio tests of curve difference for cross-sectional data. In particular, the implementation of smoothing methods in standard software, such as R, has stimulated the application of non-parametric and/or semi-parametric modeling to clinical studies.^7–9 However, the direct implementation of non-parametric and/or semi-parametric modeling of longitudinal data does not provide a test for group differences in trajectories over time. To our knowledge, ready-to-use statistical tests are available only for testing group differences at a specific time point, not for testing group trajectory differences, or longitudinal response profile differences. The lack of such a ready-to-use test in trajectory analysis may limit the broader application of non-parametric and semi-parametric methods in clinical studies.

In this paper, we study a semi-parametric model with a non-parametric time effect on trajectory for longitudinal data and develop a statistical test that can ascertain the group trajectory differences. Our test can be readily applied to testing trajectory differences over the entire follow-up period, or over a subinterval of interest. Our modeling and statistical test development are motivated by a genetic investigation of the progressive decline in renal function in the African American Study of Kidney Disease and Hypertension (AASK). The AASK study was a multicenter randomized clinical trial to test the effectiveness of three antihypertensive medications and two levels of blood pressure (BP) control on the progression of hypertensive kidney disease in a 3 × 2 factorial design.¹⁰ It included 1094 African Americans aged 18 to 70 years with hypertensive kidney disease, measured by glomerular filtration rates (GFR) of 20 to 65 mL/min per 1.73m². Between February 1995 and September 1998, patients were randomly assigned to one of two mean arterial pressure (MAP) goals, 102 to 107 mm-Hg or ≤ 92 mm-Hg, and to initiate treatment with an angiotensin-converting enzyme (ACE) inhibitor, a β-blocker, or a dihydropyridine calcium channel blocker. Details on the original AASK study design and conduct are available elsewhere.^2,10,11 The objective of our genetic study is to evaluate the association of the glutathione S-transferase-μ1 (GSTM1) gene with kidney function, using the data available in the AASK Study. Specifically, the researchers are interested in whether patients with the GSTM1-null allele would have a more accelerated progression of kidney disease than those with the GSTM1-active variant.¹²

2 Methodology

In a general longitudinal study, suppose that there are m subjects, that n_i observations are collected at times $\{0 \leq t_{i j}\}_{j = 1}^{n_{i}}$ for the ith subject, and that we have $n = \sum_{i = 1}^{m} n_{i}$ observations in total. Let Y_ij and Z_ij = (z_ij1, z_ij2, . . . , z_ijq)^T denote respectively the measurement of response and the q covariates, such as age, gender etc., for the ith subject at time t_ij, where i = 1, 2, . . . , m and j = 1, 2, . . . , n_i. Suppose subjects are distributed over K groups, and the interest is to test the differences in the longitudinal patterns among these K groups. To accurately characterize the longitudinal patterns and avoid model mis-specification, we consider the time effect non-parametrically in the following semi-parametric model,

Y_{i j} = Z_{i j}^{T} α + f_{0} (t_{i j}) + \sum_{k = 1}^{K - 1} I (g_{i} = k) δ_{k} (t_{i j}) + ∊_{i j},

(1)

for i = 1, 2, . . . , m and j = 1, 2, . . . , n_i, where α is a q-dimensional coefficient vector corresponding to the q covariates; I(g_i = k) is the indicator function for group g_i, which is a categorical variable indicating the group to which the ith subject belongs; f₀ is an unspecified smooth function of time depicting the trajectory of response for subjects in Group K (the reference group), and the δ_k's are the trajectory differences in response between the reference group and the kth group, k = 1, 2, . . . , K – 1; and ∊_ij is the random error. We assume that the maximum follow-up lengths are equal among the K groups, and denote this common length by T. The semi-parametric model (1) has been studied in Lin and Carroll,¹³ He et al.,¹⁴ and Leng et al.,¹⁵ among others, who mainly focus on the inference of the parametric part. Here, we focus on the inference of the nonparametric part. By taking advantage of the local property of the B-spline basis functions, our method has the flexibility of testing the trajectory difference on the overall time interval or a specific subinterval. Note that both f₀ and the δ_k's are adjusted for the covariates Z_ij. Thus, the trajectory differences between the kth group and the reference group K can be detected by testing whether the functions δ_k(t) are zero, i.e.,

δ_{k} (t) = 0,

(2)

for any t ∈ [0, T] and k = 1, 2, . . . , K – 1.

We approximate the smooth functions f₀ and δ_k's by B-splines.¹⁶ Given the spline order r and a partition 0 = t₀ < t₁ < ··· < t_s < t_{s+1} = T, where s is the number of interior knots, we denote the normalized B-spline basis functions by $\{B_{l}\}_{l = 1}^{L}$ for L = s + r. The functions f₀ and δ_k are then approximated by

\begin{matrix} f_{0} (t_{i j}) = & \sum_{l = 1}^{L} β_{0 l} B_{l} (t_{i j}) + e_{0} (t_{i j}) \\ δ_{k} (t_{i j}) = & \sum_{l = 1}^{L} β_{k l} B_{l} (t_{i j}) + e_{k} (t_{i j}), \end{matrix}

(3)

for k = 1, 2, . . . , K – 1, where e₀(t) and e_k(t)'s are approximation error functions that converge to zero uniformly at a certain rate with the increasing number of knots and sample size. In practice, low-order basis functions and equally spaced quantile inner knots are often used. The advantages of B-splines are mainly attributable to their efficient computation and accurate approximation with a small number of knots.¹⁷ For simplicity, we use the same set of B-spline basis functions to approximate f₀ and δ_k. Combining (1) and (3) together, we have the model

Y_{i j} = Z_{i j}^{T} α + \sum_{l = 1}^{L} β_{0 l} B_{l} (t_{i j}) + \sum_{k = 1}^{K - 1} \sum_{l = 1}^{L} I (g_{i} = k) β_{k l} B_{l} (t_{i j}) + ∊_{i j}^{'},

(4)

where $∊_{i j}^{'} = ∊_{i j} + e_{0} (t_{i j}) + \sum_{k = 1}^{K - 1} I (g_{i} = k) e_{k} (t_{i j})$, which converges to ∊_ij uniformly.
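The basis functions B_l in (3) and (4) can be evaluated with the standard Cox-de Boor recursion. The following is a minimal sketch, not the authors' implementation; the function names `clamped_knots` and `bspline_basis` are hypothetical, and the clamped knot vector (boundary knots repeated r times) is an assumption that yields L = s + r basis functions summing to one on [0, T].

```python
def clamped_knots(inner_knots, t_max, order):
    """Full knot vector on [0, t_max]: each boundary knot repeated `order` times."""
    return [0.0] * order + [float(k) for k in inner_knots] + [float(t_max)] * order


def bspline_basis(t, knots, order):
    """Evaluate all L = len(knots) - order B-spline basis functions at time t."""
    last = knots[-1]
    # Order-1 (piecewise-constant) functions: indicator of each knot span.
    # The right endpoint is assigned to the last non-empty span.
    b = []
    for i in range(len(knots) - 1):
        lo, hi = knots[i], knots[i + 1]
        inside = (lo <= t < hi) or (t == last and lo < hi == last)
        b.append(1.0 if inside else 0.0)
    # Cox-de Boor recursion: raise the order one step at a time.
    for k in range(2, order + 1):
        nxt = []
        for i in range(len(knots) - k):
            left_den = knots[i + k - 1] - knots[i]
            right_den = knots[i + k] - knots[i + 1]
            left = (t - knots[i]) / left_den * b[i] if left_den > 0 else 0.0
            right = (knots[i + k] - t) / right_den * b[i + 1] if right_den > 0 else 0.0
            nxt.append(left + right)
        b = nxt
    return b


# Quadratic splines (r = 3) with one inner knot at 20 months on [0, 60].
knots = clamped_knots([20], 60, order=3)
row = bspline_basis(33.0, knots, order=3)  # one row of the matrix X_i
print(len(row), sum(row))
```

Stacking such rows over the visit times t_i1, . . . , t_{in_i} of one subject gives the matrix X_i used in the estimation step.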

2.1 Estimation and hypothesis testing

Let β₀ = (β₀₁, β₀₂, . . . , β_0L)^T, β_k = (β_k1, β_k2, . . . , β_kL)^T, for k = 1, 2, . . . , K – 1, and $∊_{i}^{'} = {(∊_{i 1}^{'}, ∊_{i 2}^{'}, \dots, ∊_{i n_{i}}^{'})}^{T}$, and denote

Z_{i} = {(\begin{matrix} Z_{i 1}^{T} \\ Z_{i 2}^{T} \\ ⋮ \\ Z_{i n_{i}}^{T} \end{matrix})}_{n_{i} \times q} and X_{i} = {(\begin{matrix} B_{1} (t_{i 1}) & B_{2} (t_{i 1}) & \dots & B_{L} (t_{i 1}) \\ B_{1} (t_{i 2}) & B_{2} (t_{i 2}) & \dots & B_{L} (t_{i 2}) \\ ⋮ & ⋮ & ⋱ & ⋮ \\ B_{1} (t_{i n_{i}}) & B_{2} (t_{i n_{i}}) & \dots & B_{L} (t_{i n_{i}}) \end{matrix})}_{n_{i} \times L} .

Model (4) can thus be rewritten as

Y_{i} = Z_{i} α + X_{i} β_{0} + \sum_{k = 1}^{K - 1} I (g_{i} = k) X_{i} β_{k} + ∊_{i}^{'},

(5)

for i = 1, 2, . . . , m, where Y_i = (Y_i1, Y_i2, . . . , Y_{in_i})^T are the repeated measurements of response on the ith subject. Then the generalized estimating equations (GEE) method, developed by Liang and Zeger,¹⁸ can be used to estimate the parameters in (5). The application of B-splines within the GEE framework can be found in Hua and Zhang¹⁹ and He et al.¹⁴ For simplicity of notation, we let $β = {(α^{T}, β_{0}^{T}, β_{1}^{T}, \dots, β_{K - 1}^{T})}^{T}$ and $U_{i} = (Z_{i}, X_{i}, I (g_{i} = 1) X_{i}, \dots, I (g_{i} = K - 1) X_{i})$. The GEE estimator $\hat{β} = {({\hat{α}}^{T}, {\hat{β}}_{0}^{T}, {\hat{β}}_{1}^{T}, \dots, {\hat{β}}_{K - 1}^{T})}^{T}$ is obtained by solving the following equations

\sum_{i = 1}^{m} U_{i}^{T} Σ_{i}^{- 1} (Y_{i} - U_{i} β) = 0,

(6)

where $Σ_{i} = A_{i}^{1 / 2} R A_{i}^{1 / 2}$ is the working covariance matrix of Y_i, A_i is the diagonal matrix of marginal variances of Y_i, and R is the working correlation matrix. The asymptotic variance of $\hat{β}$ is estimated by the sandwich formula in the GEE framework.
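As a worked illustration, which is not spelled out in the text, consider the case in which the working covariance matrices are held fixed at estimates $\hat{Σ}_{i}$. Because model (5) is linear in β, equation (6) then has a closed-form, generalized-least-squares type solution, and the sandwich estimator takes the usual Liang and Zeger form. The symbol M below is introduced here only for convenience; in practice the correlation and variance parameters in $\hat{Σ}_{i}$ are themselves estimated, so the two steps are typically iterated.

```latex
M = \sum_{i = 1}^{m} U_{i}^{T} \hat{Σ}_{i}^{-1} U_{i}, \qquad
\hat{β} = M^{-1} \sum_{i = 1}^{m} U_{i}^{T} \hat{Σ}_{i}^{-1} Y_{i},

\hat{Ω} = M^{-1} \left[ \sum_{i = 1}^{m} U_{i}^{T} \hat{Σ}_{i}^{-1}
  (Y_{i} - U_{i} \hat{β}) (Y_{i} - U_{i} \hat{β})^{T}
  \hat{Σ}_{i}^{-1} U_{i} \right] M^{-1}.
```

The middle factor uses the empirical residual cross-products, which is what keeps $\hat{Ω}$ consistent even when the working correlation R is mis-specified.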

By approximating the functions f₀(t) and δ_k(t) in Model (1) through B-splines, we essentially convert the problem of testing the group differences on the time effect, i.e., trajectories, over the entire follow-up period into the following hypothesis testing

H_{0} : β_{k} = \underline{0} versus H_{1} : β_{k} \neq \underline{0},

(7)

for k = 1, 2, . . . , K – 1. Given the asymptotic distribution of $\hat{β}$ , the vector ${\hat{β}}_{k}$ asymptotically has the following distribution

{\hat{β}}_{k} \sim N (β_{k}, {\hat{Ω}}_{k}),

where ${\hat{Ω}}_{k}$ is the submatrix of $\hat{Ω}$, the asymptotic covariance matrix of $\hat{β}$ estimated by the sandwich formula of GEE. A Wald-type test statistic,

W^{2} = {\hat{β}}_{k}^{'} {\hat{Ω}}_{k}^{- 1} {\hat{β}}_{k},

(8)

has an asymptotic χ² distribution with L degrees of freedom (df) under H₀, where L is the length of ${\hat{β}}_{k}$.
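Once W² is available, the p-value is the upper tail of the χ² distribution. The sketch below uses only the Python standard library and a textbook recurrence for the regularized incomplete gamma function; the function name `chi2_sf` is hypothetical. As a check, it is applied to the χ² statistics and degrees of freedom reported in Table 2.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        k = 2
        tail = math.exp(-half)              # closed form for df = 2
    else:
        k = 1
        tail = math.erfc(math.sqrt(half))   # closed form for df = 1
    # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        tail += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return tail


# Statistics reported in Table 2 for the overall and long-term tests.
print(round(chi2_sf(8.484, 4), 4))
print(round(chi2_sf(8.425, 3), 4))
```

Both values agree with the tabulated p-values up to rounding of the reported statistics.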

Besides testing the group differences for the overall trajectories, another advantage of our method is its ability to test the group differences of the time effect or trajectories over a subinterval of the follow-up period. This ability is attributable to the locality property of the B-spline basis functions. Indeed, once the inner knots are allocated, the function estimate for δ_k on a subinterval is completely determined by a consecutive subset of parameters in ${\hat{β}}_{k}$, namely those whose basis functions are nonzero on that subinterval. That is, this subset of parameters is uniquely identified according to the locality property. Therefore, the group differences over the subinterval of interest can be detected by testing whether the identified subset of parameters in β_k are simultaneously zero as in (8). An example is shown in Section 3.2 below.
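The locality property can be made concrete: with a knot vector τ, the lth basis function of order r is supported on [τ_l, τ_{l+r}], so the coefficients relevant to a subinterval are those whose support overlaps it. The following sketch assumes a clamped knot vector, and the function name `active_coefficients` is hypothetical.

```python
def active_coefficients(knots, order, a, b):
    """1-based indices l of basis functions B_l that are nonzero somewhere on (a, b)."""
    n_basis = len(knots) - order
    active = []
    for l in range(n_basis):
        lo, hi = knots[l], knots[l + order]  # support of B_{l+1} is [lo, hi]
        if lo < b and hi > a:
            active.append(l + 1)
    return active


# Quadratic splines (r = 3), one inner knot at 20 months, follow-up [0, 60].
knots = [0, 0, 0, 20, 60, 60, 60]
print(active_coefficients(knots, 3, 20, 60))
```

For this configuration the subinterval [20, 60] involves the second, third and fourth coefficients, which is why the long-term test in Section 3.2 has 3 degrees of freedom.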

2.2 Knots selection

To implement the method developed in Section 2.1, a key practical issue is to specify data-adaptive B-spline basis functions. In non-parametric regression, low-order splines are usually preferred, such as linear, quadratic, or cubic splines with the spline order r taking the value two, three, or four respectively. In our study, quadratic B-spline basis functions (i.e., r = 3) are used in order to keep the model sufficiently flexible yet less complicated, and inner knots are selected as equally spaced quantile knots as frequently suggested in the literature. The number of inner quantile knots s is selected by 10-fold cross validation using the generalized residual sum of squares (GRSS) as the selection criterion (see Section 3 for an example). The GRSS is defined as

GRSS (s) = \sum_{i = 1}^{m} {(Y_{i} - {\hat{Y}}_{i})}^{'} {\hat{Σ}}_{i}^{- 1} (Y_{i} - {\hat{Y}}_{i}),

(9)

where ${\hat{Y}}_{i} = U_{i} \hat{β}$ , i = 1, 2, . . . , m, are fitted values for the ith subject and ${\hat{Σ}}_{i}$ is the estimated working covariance matrix based on the GEE estimates. The GRSS criterion balances both bias and variation in estimating the functions f₀(t_ij) and δ_k(t_ij) using B-spline approximation. The final selection for number of inner knots will be the model that minimizes the GRSS criterion.
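The knot placement and the final selection step can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, "equally spaced quantile knots" is read as the j/(s + 1) quantiles of the pooled visit times, and the interpolation rule of `statistics.quantiles(..., method="inclusive")` is one choice among several. The cross-validated GRSS values themselves would come from refitting the GEE model for each candidate s, which is not shown here.

```python
import statistics


def quantile_inner_knots(visit_times, s):
    """Place s inner knots at the j/(s+1) quantiles (j = 1..s) of pooled visit times."""
    if s == 0:
        return []
    return statistics.quantiles(visit_times, n=s + 1, method="inclusive")


def select_number_of_knots(cv_grss):
    """Return the candidate s with the smallest cross-validated GRSS.

    `cv_grss` maps each candidate number of inner knots to its GRSS value.
    """
    return min(cv_grss, key=cv_grss.get)


# Hypothetical pooled visit schedule: every 3 months from 0 to 60.
times = list(range(0, 61, 3))
print(quantile_inner_knots(times, 1))  # a single inner knot sits at the median
```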

3 Analysis of real data

3.1 Background

As described in Section 1, 1094 African American patients with chronic kidney disease were enrolled into the AASK study between 1995 and 1998. Most of them were followed up to September 2001, and those on the calcium channel blocker arm were terminated in September 2000 based on the recommendation of the data and safety monitoring board.^2,10 Beyond the baseline measurements such as age and gender, the longitudinal kidney function, measured by GFR in mL/min per 1.73m², was assessed at baseline, three, six, and every three months thereafter. Approximately 850 patients who remained available in 2002 further consented for the AASK Genomics Study. A total of 692 patients with valid DNA samples were successfully genotyped and included in our analysis. Our genetic study aimed to evaluate the association of GSTM1 gene with the progression of the kidney disease. Earlier studies have demonstrated that patients who carry GSTM1 null allele, GSTM1(0), have increased risk of cardiovascular disease, and this is thought to be due to the reduced ability or inability to handle oxidative stress and the resultant cellular damage.

In this study, we would like to evaluate whether patients with the GSTM1(0) allele have more accelerated GFR decline than those with the GSTM1-active variant. Thus we are interested in testing whether the GFR declining patterns or trajectories over time are significantly different between patients with GSTM1(0) and those without. Detailed patient characteristics for the 692 patients in the study cohort by GSTM1 genotype have been reported previously (Table 1 in Chang et al.¹²). Briefly, about 27% of patients were classified in the GSTM1(0) group and 73% in the GSTM1-active group. The mean age for the study cohort was 54.2 ± 10.6 years, 59% were male, and 51% had a history of cardiovascular disease. Patients were distributed evenly in the randomization factors (blood pressure control level and antihypertensive drug group). These baseline characteristics were included in our semi-parametric model.

Table 1.

Model comparison.

	P-value¹	QIC²	N. of params³

Semi-parametric model (1)	0.0753	28088	18
Marginal models
Linear model	0.1307	28095	12
Piecewise linear model (breakpoint at 20 months)	0.1691	28098	14
Quadratic model	0.2102	28104	14
Mixed effects models
Linear model	0.1701	28102	13
Piecewise linear model (breakpoint at 20 months)	0.3126	28114	15
Quadratic model	0.3649	28120	15


¹ P-value for the trajectory difference between the two genetic groups.

² QIC: Pan's quasi-likelihood under the independence model criterion.

³ N. of params: number of parameters in the model.

3.2 Results

To capture the true decline of renal function in our analysis, we model the change in GFR from the baseline measure instead of the actual GFR. To adjust for potential confounding factors, patient baseline characteristics such as age, gender, MAP, history of cardiovascular disease, blood pressure control level and antihypertensive drug group are included as the covariates Z_ij in model (1). The first-order autoregressive correlation structure (AR(1)) is assumed as the working correlation matrix for GEE among the repeated measurements of GFR response within the same subject. To estimate the parameters α, β₀, . . . , β_{K–1} in model (4), we utilize the data up to 60 months, where sufficient patients with GFR measurements are observed. Thus the overall follow-up period is from randomization (time 0) to 60 months. Let s denote the number of inner knots placed within [0, 60]; we select its value based on the best model fit by 10-fold cross validation with the GRSS criterion. Specifically, we let s vary from 0 to 10. The value of GRSS for the model with s inner knots is calculated as in (9) and shown in Figure 1. The best model is the one with s = 1 inner knot, which places the knot at the median of all visit time points across subjects.

Figure 1.

Cross-validation for selecting number of inner knots s. The y-axis represents the model selection criterion of GRSS, and the x-axis is the number of inner knots placed within [0, 60] months. The solid red point indicates where the minimum GRSS is attained.

Therefore, in the study of the genetic effect of GSTM1, we use quadratic B-spline basis functions with one inner knot to fit the longitudinal change in GFR. Figure 2 shows the observed and fitted time effects or trajectories for the two genetic groups, adjusted for patient baseline characteristics. The two fitted functions for the change in GFR from baseline clearly show decreasing trends, indicating deterioration of kidney function over the entire follow-up period. Also, for the patients in both genetic groups, there are apparent curvatures in the trajectories for the change in GFR. Such non-linear trajectory patterns, which would not be captured easily in a parametric model, demonstrate the advantage of the semi-parametric modeling approach. In addition, the trajectories for the change in GFR in the two groups apparently decline at different rates. The GSTM1(0) group declines more slowly in the first 22 months, crosses the GSTM1-active group at approximately 22 months, and then declines faster afterwards.

Figure 2.

Observed and fitted curves of changes in GFR from baseline in GSTM1(0) Group (black solid circles for observed and black solid line for fitted) and GSTM1-active Group (red hollow circles for observed and red dashed line for fitted). Each circle represents the GFR change averaged at that particular scheduled time point. The y-axis represents the fitted change in GFR from baseline, and the x-axis is the follow-up time in months.

To test the trajectory difference between the two groups over the follow-up period, we apply the hypothesis testing under the semi-parametric framework as proposed in Section 2. The results in Table 1 show that the GSTM1(0) group is marginally different in GFR declining pattern from the GSTM1-active group over [0, 60] months, with p-value 0.0753. Meanwhile, six commonly used parametric models, i.e. linear marginal/mixed effects models, piecewise linear marginal/mixed effects models, and quadratic marginal/mixed effects models, have also been applied to the data. The linear marginal model depicts the GFR change trajectory as a linear function of time, and the linear mixed effects model additionally includes a random intercept and a random time slope. Both models are adjusted for the same set of baseline characteristic covariates as the semi-parametric model. The piecewise linear marginal/mixed effects models and the quadratic marginal/mixed effects models are specified in a similar fashion. The p-values are obtained by testing the significance of interaction terms between genotype group and time. None of these parametric models detected a significant GFR trajectory difference between the GSTM1(0) group and the GSTM1-active group, as summarized in Table 1.

The fitted changes in GFR by our semi-parametric model clearly reveal different patterns of kidney function decline between the two groups after 20 months. We further test whether the two trajectories are significantly different after 20 months, i.e., the long-term effect. As the optimal inner knot is at 20 months, the subinterval from 20 to 60 months corresponds to the B-spline coefficients β₁₂, β₁₃ and β₁₄ in model (4) by the locality property as discussed in Section 2.1. The testing results in Table 2 show that the trajectory of change in GFR for the GSTM1(0) group is indeed significantly different from that in the GSTM1-active group for the long-term follow-up period (p-value = 0.038).

Table 2.

Hypothesis testing for genetic impact on progression of GFR.

	GSTM1(0) Group vs GSTM1-active Group
	χ² statistic	df	P-value

Trajectory test for overall effect ([0, 60] months)	8.484	4	0.0753
Trajectory test for long-term effect ([20, 60] months)	8.425	3	0.0380


Model diagnostics is an important aspect in data analysis for evaluating model goodness-of-fit, for detecting outliers, and for identifying influential observations. In the absence of likelihood ratio tests, model diagnostic tools are needed for the GEE approach where correlated data generally arise.²⁰ By extending Akaike's information criterion to the GEE framework, Pan²¹ proposed to use the quasi-likelihood constructed under the working independence model with the naive and robust covariance estimates of estimated regression coefficients. In our study, we applied Pan's quasi-likelihood under the independence model criterion (QIC) to our semi-parametric model as well as to the six parametric models. Our semi-parametric model achieved the minimum QIC as shown in Table 1. This further confirms the better goodness-of-fit of our semi-parametric model over those parametric models.

In our semi-parametric model, some baseline characteristics have significant effects on the response, as presented in Table 3. Specifically, older age and higher baseline GFR are apparently associated with greater decline of GFR. Compared with ACE inhibitor, patients on β-blocker have a greater decline in GFR, while those on calcium channel blocker actually have an increase in the change of GFR. Our results on these covariate effects are consistent with previous reports.^12,22

Table 3.

Estimation results for factors adjusted in semi-parametric model.

Factor	Estimate	Standard error	P value
Age	−0.0851	0.0292	0.0036
Gender (Male vs. Female)	0.5213	0.6452	0.4190
Baseline GFR	−0.1048	0.0208	<0.0001
Baseline MAP	−0.0328	0.0183	0.0726
History of cardiovascular disease	0.2930	0.6065	0.6291
Drug group (β-blocker vs. ACE inhibitor)	−1.4368	0.6507	0.0272
Drug group (calcium channel blocker vs. ACE inhibitor)	2.0190	0.8899	0.0233
Blood pressure goal (102-107 vs. ≤92 mm-Hg)	−0.1947	0.5986	0.7450


To further explore the complexity of the study, we also investigated GFR trajectory differences among genotype × treatment groups and among genotype × blood pressure goal groups. Patients in GSTM1(0) and β-blocker and those in GSTM1(0) and low blood pressure goal are, respectively, considered as the reference group in each of the two analyses. Fitted GFR change curves are presented in Figure 3. Hypothesis testing results on progression of GFR change are summarized in Table 4. The results show that GFR trajectories decline at variable rates with different shapes for the combinations of genotype with antihypertensive drug and for the combinations of genotype with blood pressure goal. Indeed, patients in the GSTM1(0) × β-blocker group (the reference group) had the steepest GFR decline. The GFR trajectories in the calcium channel blocker group at either GSTM1 level as well as in the GSTM1-active × ACE inhibitor group appear to decline more slowly, and they are significantly different from that of the reference group. Similarly, patients with the GSTM1-active genotype apparently have slower deterioration in kidney function, and their trajectories are significantly different from that in the GSTM1(0) group. Although the trajectories for patients with GSTM1(0) seemingly differ for the two blood pressure goals, the difference is not statistically significant.

Figure 3.

(a): Fitted curves of changes in GFR among Genotype (solid line for GSTM1(0) Group and dashed line for GSTM1-active Group) × Treatment (red for ACE inhibitor, black for β-blocker and green for calcium channel blocker); (b): Fitted curves of changes in GFR among Genotype (solid line for GSTM1(0) Group and dashed line for GSTM1-active Group) × Blood pressure goal (black for low BP goal and red for usual BP goal).

Table 4.

Hypothesis testing for genotype × treatment and genotype × blood pressure goal on progression of GFR

	Overall trajectory test ([0, 60] months)			Long-term trajectory test ([20, 60] months)
	χ² statistic	df	P-value	χ² statistic	df	P-value
Genotype × Antihypertensive Drug
Reference group: GSTM1(0) × β-blocker
GSTM1(0) × ACE inhibitor	4.502	4	0.3423	4.493	3	0.2129
GSTM1(0) × calcium channel blocker	25.156	4	<0.0001	24.840	3	<0.0001
GSTM1-active × ACE inhibitor	9.324	4	0.0535	9.107	3	0.0279
GSTM1-active × β-blocker	0.911	4	0.9230	0.905	3	0.8244
GSTM1-active × calcium channel blocker	12.140	4	0.0163	11.768	3	0.0082
Genotype × Blood pressure goal
Reference group: GSTM1(0) × Low BP goal
GSTM1(0) × Usual BP goal	9.473	4	0.0503	6.528	3	0.0886
GSTM1-active × Low BP goal	8.641	4	0.0707	8.537	3	0.0361
GSTM1-active × Usual BP goal	7.954	4	0.0933	7.948	3	0.0471


3.3 Simulation

Our proposed test for trajectory difference relies on the Wald-type test statistic defined in (8). In this section, we conduct a simulation study to investigate the finite sample performance of the proposed test based on the AASK study data. Specifically, a total of 500 patients are generated with the same set of baseline covariates included in our semi-parametric model; i.e., for the ith patient,

y_{i j} = Z_{i j}^{T} α + f_{0} (t_{i j}) + f_{1} (t_{i j}) I (g_{i} = \text{GSTM1-active}) + ∊_{i j},

(10)

for i = 1, 2, . . . , 500 and j = 1, 2, . . . , n_i, where n_i is a random integer from 6 to 12; Z_ij is a 7-dimensional vector of baseline characteristic covariates including age, gender, MAP, baseline GFR, history of cardiovascular disease, blood pressure control level and antihypertensive drug group; α is the coefficient vector corresponding to the 7 baseline covariates and is specified to be the same as the one estimated from the real data in Section 3.2; f₀(·) is the underlying true GFR change curve for subjects in the GSTM1(0) group and f₁(·) is the trajectory difference in GFR change between the GSTM1(0) and GSTM1-active groups, both of which are specified to be similar to the two estimated GFR change curves in Section 3.2; the t_ij's are uniformly generated from 0 to T = 60; I(·) is the indicator function for the genetic group membership of the ith subject; ∊_i = (∊_i1, ∊_i2, . . . , ∊_{in_i})^T is the random error vector having AR(1) correlation structure with σ = 13.3 and ρ = 0.86, the same as those estimated from the data.
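The visit times and error vector for one simulated subject can be generated as in the sketch below. It is an illustration only: the function names are hypothetical, and the AR(1) correlation is assumed to be indexed by visit order (correlation ρ^|j−k| between the jth and kth visits) rather than by elapsed time, a detail the text does not specify.

```python
import math
import random


def simulate_visit_times(rng, t_max=60.0):
    """Draw n_i in {6, ..., 12} and that many sorted uniform visit times on [0, t_max]."""
    n_i = rng.randint(6, 12)  # both endpoints inclusive
    return sorted(rng.uniform(0.0, t_max) for _ in range(n_i))


def ar1_errors(n_obs, sigma, rho, rng):
    """Draw a stationary AR(1) error vector with marginal SD sigma and lag-1 correlation rho."""
    eps = [sigma * rng.gauss(0.0, 1.0)]
    innov_sd = sigma * math.sqrt(1.0 - rho * rho)  # keeps the marginal variance at sigma^2
    for _ in range(1, n_obs):
        eps.append(rho * eps[-1] + innov_sd * rng.gauss(0.0, 1.0))
    return eps


rng = random.Random(2015)  # arbitrary seed for reproducibility
times = simulate_visit_times(rng)
errors = ar1_errors(len(times), sigma=13.3, rho=0.86, rng=rng)
print(len(times), len(errors))
```

The response y_ij is then obtained by adding these errors to the covariate effect and the chosen curves f₀ and f₁ evaluated at the simulated times.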

Our proposed semi-parametric model is applied to the simulated data with three different working correlation structures: AR(1) (the true structure), compound symmetry (CS), and independence. The test for trajectory difference in the GFR change is based on the Wald test at the 5% significance level. Power of the test is reported as the proportion of the M = 500 simulation runs in which the curve difference is detected. Furthermore, the goodness-of-fit of estimation is evaluated by the weighted average squared error (WASE) defined as

WASE = \frac{1}{N} \sum_{i = 1}^{n} \sum_{j = 1}^{n_{i}} \frac{{(f_{k} (t_{i j}) - {\hat{f}}_{k} (t_{i j}))}^{2}}{range (f_{k} (t))},

(11)

where $N = \sum_{i = 1}^{n} n_{i}$ and $range (f_{k} (t)) = \max_{t \in [0, T]} f_{k} (t) - \min_{t \in [0, T]} f_{k} (t)$ . Both testing power and WASE are summarized in Table 5. In addition, boxplots of WASE are provided in Figure 4.
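A direct transcription of (11) is sketched below. The function name `wase` and the grid-based evaluation of the range of f_k over [0, T] are assumptions made for illustration; `times` is the pooled list of all visit times t_ij.

```python
def wase(f_true, f_hat, times, t_max, grid_size=601):
    """Weighted average squared error of f_hat against f_true over pooled visit times."""
    # Approximate max and min of f_true over [0, t_max] on an equally spaced grid.
    grid = [t_max * g / (grid_size - 1) for g in range(grid_size)]
    values = [f_true(t) for t in grid]
    f_range = max(values) - min(values)
    squared_error = sum((f_true(t) - f_hat(t)) ** 2 for t in times)
    return squared_error / (len(times) * f_range)


# Toy check: a linear decline of 6 units over 60 months, estimate off by 3 everywhere.
true_curve = lambda t: -t / 10.0
shifted = lambda t: -t / 10.0 + 3.0
print(wase(true_curve, shifted, [0.0, 15.0, 30.0, 45.0, 60.0], t_max=60.0))
```

In the toy check every squared error is 9 and the range is 6, so the result is 9/6 = 1.5 up to floating-point rounding.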

Table 5.

Simulation results on testing and estimation for trajectory difference with 500 runs.

		WASE
Working correlation structure	Power	mean	sd
AR(1)	0.9460	0.6274	0.2880
Compound Symmetry	0.9180	0.9576	0.3285
Independence	0.8580	1.1518	0.5159


Figure 4.

Boxplots of WASE of the fitted semi-parametric model with three different working correlation structures over M = 500 simulation runs.

Out of the 500 runs, 94.6% successfully detected the GFR change trajectory difference between the two genetic groups for AR(1) working correlation, 91.8% for compound symmetry, and 85.8% for working independence. The numerical results in Table 5 show that accounting for the within-cluster correlation improves both the power of the test and the estimation precision.

4 Discussion

In this work, we use a semi-parametric model to capture the non-linear decline of longitudinal renal function and test the trajectory difference over the follow-up period between the two genotype groups. While the existing methods can only test the difference at an individual time point, our interest is in testing the functional effects over the entire follow-up period or a subinterval of it. By utilizing B-spline approximation, we effectively convert the problem of testing trajectory differences over time into a linear model for repeated measurements. Thus our testing method can be easily implemented by testing the selected parameters estimated by GEE, using existing commonly used statistical packages. Our method has the flexibility of testing the trajectory differences over the entire follow-up period, or a subinterval of interest. In addition, the semi-parametric model can accommodate various trajectory patterns, which is particularly attractive and important in flexible model fitting, and thus avoids the potential pattern mis-specification of the parametric model setting. However, to estimate each trajectory function reliably over the testing period, the B-spline approximation method requires that the maximum follow-up lengths be equal among the groups under comparison and that patient visit time points be scattered evenly within each group. Thus, our method is more efficient when the patient visit times are less skewed, though the longitudinal data are not required to be balanced.

In our analysis, we first selected the number of knots for the B-splines, and the trajectory differences were tested using the model with the selected knots. Therefore, our analysis did not account for the uncertainty introduced by model selection. Hjort and Claeskens²³ studied the post-selection inference issues. Following the data splitting idea in Wasserman and Roeder²⁴ and Meinshausen, Meier, and Bühlmann²⁵ to address this issue, we explored selecting the knots using a random sample of 30% of the subjects and estimating the selected model for inference using the remaining 70% of the subjects. The p-values for the overall and long-term effects of the GSTM1 gene are 0.0783 and 0.0388, respectively, which are very close to the results in Table 2 using the full data set.

With our method, true underlying trajectory differences that could be missed in parametric modeling are more likely to be detected, as shown in our clinical example. The method can also be used to test the trajectory interaction between two or more factors by considering the combined levels of these factors, as shown in Figure 3 and Table 4. In addition, we plan to test the interaction effect of the GSTM1 and APOL1 risk variants, as a recent study found that African Americans with the APOL1 risk variants experience faster progression of chronic kidney disease and have a significantly increased risk of kidney failure.²⁶

Acknowledgement

We are most grateful to the two referees and the editor for their constructive comments and suggestions. We thank the AASK trial participants and Ancillary Studies Committee for granting access to the DNA and Trial data. This work was supported in part by the National Institutes of Health Grant R01 DK094907 to T. H. Le.

Contributor Information

Feiyang Niu, Department of Statistics, University of Virginia, Charlottesville, VA 22904 USA.

Jianhui Zhou, Department of Statistics, University of Virginia, Charlottesville, VA 22904 USA.

Thu H. Le, Division of Nephrology, Department of Medicine, University of Virginia, Charlottesville, VA 22908 USA.

Jennie Z. Ma, Division of Biostatistics, Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908 USA.

References

1. Fitzmaurice GM, Laird NM, Ware JH. Applied Longitudinal Analysis. John Wiley and Sons; New York: 2004.
2. Wright JT, Jr, Bakris G, Greene T, et al. Effect of blood pressure lowering and antihypertensive drug class on progression of hypertensive kidney disease: results from the AASK trial. Journal of the American Medical Association. 2002;288:2421–2431. doi: 10.1001/jama.288.19.2421.
3. Singer JD. Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics. 1998;23:323–355.
4. Lin DY, Ying Z. Semiparametric and nonparametric regression analysis of longitudinal data. Journal of the American Statistical Association. 2001;96:103–113.
5. Hoover DR, Rice JA, Wu CO, Yang LP. Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika. 1998;85:809–822.
6. Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; Cambridge: 2003.
7. Ngo L, Wand MP. Smoothing with Mixed Model Software. Journal of Statistical Software. 2004;9:1–54.
8. Durban M, Harezlak J, Wand MP, Carroll RJ. Simple fitting of subject-specific curves for longitudinal data. Statistics in Medicine. 2005;24:1153–1167. doi: 10.1002/sim.1991.
9. Chen J, Johnson BA, Wang XQ, O'Quigley J, Isaac M, Zhang D, Liu L. Trajectory analyses in alcohol treatment research. Alcohol Clin Exp Res. 2012;36:1442–1448. doi: 10.1111/j.1530-0277.2012.01748.x.
10. Gassman JJ, Greene T, Wright JT, Jr, et al. Design and statistical aspects of the African American Study of Kidney Disease and Hypertension (AASK). Journal of the American Society of Nephrology. 2003;14:S154–S165. doi: 10.1097/01.asn.0000070080.21680.cb.
11. Agodoa LY, Appel L, Bakris GL, et al. African American Study of Kidney Disease and Hypertension (AASK) Study Group. Effect of ramipril vs amlodipine on renal outcomes in hypertensive nephrosclerosis: a randomized controlled trial. Journal of the American Medical Association. 2001;285:2719–2728. doi: 10.1001/jama.285.21.2719.
12. Chang J, Ma JZ, Zeng Q, et al. Loss of GSTM1, a NRF2 target, is associated with accelerated progression of Hypertensive Kidney Disease in the African American Study of Kidney Disease (AASK). American Journal of Physiology - Renal Physiology. 2013;304:F348–F355. doi: 10.1152/ajprenal.00568.2012.
13. Lin X, Carroll RJ. Semiparametric regression for clustered data using generalized estimation equations. Journal of the American Statistical Association. 2001;96:1045–1056.
14. He X, Fung WK, Zhu Z. Robust estimation in generalized partial linear models for clustered data. Journal of the American Statistical Association. 2005;100:1176–1184.
15. Leng C, Zhang W, Pan J. Semiparametric mean-covariance regression analysis for longitudinal data. Journal of the American Statistical Association. 2010;105:181–193.
16. Schumaker LL. Spline Functions: Basic Theory. Wiley-Interscience; New York: 1981.
17. He X, Shen L. Linear regression after spline transformation. Biometrika. 1997;84:474–481.
18. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
19. Hua L, Zhang Y. Spline-based semiparametric projected generalized estimating equation method for panel count data. Biostatistics. 2012;13:440–454. doi: 10.1093/biostatistics/kxr028.
20. Oh S, Carriere KC, Park T. Model diagnostic plots for repeated measures data using the generalized estimating equations approach. Computational Statistics & Data Analysis. 2008;53:222–232.
21. Pan W. Akaike's information criterion in generalized estimating equations. Biometrics. 2001;57:120–125. doi: 10.1111/j.0006-341x.2001.00120.x.
22. Rowe JW, Andres R, Tobin JD, Norris AM, Shock NW. The effect of age on creatinine clearance in men: a cross-sectional and longitudinal study. The Journals of Gerontology. 1976;31:155–163. doi: 10.1093/geronj/31.2.155.
23. Hjort NL, Claeskens G. Frequentist model average estimators. Journal of the American Statistical Association. 2003;98:879–899.
24. Wasserman L, Roeder K. High dimensional variable selection. The Annals of Statistics. 2009;36:1567–1594. doi: 10.1214/08-aos646.
25. Meinshausen N, Meier L, Bühlmann P. P-values for high-dimensional regression. Journal of the American Statistical Association. 2009;104:1671–1681.
26. Parsa A, Kao WHL, Xie D, et al. APOL1 risk variants, race, and progression of chronic kidney disease. The New England Journal of Medicine. 2013;369:2183–2196. doi: 10.1056/NEJMoa1310345.
