Author manuscript; available in PMC 2020 Sep 24.
Published in final edited form as: Biometrics. 2017 Dec 18;74(3):1072–1081. doi: 10.1111/biom.12837

Detecting Treatment Differences in Group Sequential Longitudinal Studies with Covariate Adjustment

Neal O Jeffries 1,*, James F Troendle 1, Nancy L Geller 1
PMCID: PMC7515605  NIHMSID: NIHMS1621526  PMID: 29265179

Summary.

In longitudinal studies comparing two treatments over a series of common follow-up measurements, there may be interest in determining whether there is a treatment difference at any follow-up period when the treatment effect may be non-monotone over time. To evaluate this question, Jeffries and Geller (2015) examined a number of clinical trial designs that allow adaptive choice of the follow-up time exhibiting the greatest evidence of treatment difference in a group sequential testing setting with Gaussian data. The methods are applicable when a small number of measurements are taken at prespecified follow-up periods. Here, we test the intersection null hypothesis of no difference at any follow-up time versus the alternative that there is a difference for at least one follow-up time. Results of Jeffries and Geller (2015) are extended by considering a broader range of modeled data and the inclusion of covariates using generalized estimating equations. Testing procedures are developed to determine a set of follow-up times exhibiting a treatment difference while accounting for multiplicity in follow-up times and interim analyses.

Keywords: Generalized estimating equations, Generalized linear models, Group sequential design, Longitudinal analysis

1. Introduction

Clinical trials that regularly record measurements over time may be used to determine whether differences exist between treatment arms during the follow-up periods. It may be of interest to know which period, if any, shows the most evidence of a difference and/or which of a number of potential periods show a difference. As an example of the first question, consider a potential therapy for which there is little prior knowledge of how long after the intervention is given its benefit will be most apparent. The second question may address how long any benefit lasts. As a clinical example, we examine quality of life measurements for a standard and an experimental intervention in heart failure. We are particularly interested in settings where an intervention's effect may be non-monotonic and not easily summarized by a single measure such as the slope or area under the curve (AUC). Additionally, there may be interest in stopping the trial early if an interim analysis reveals that important differences exist; the methodology incorporates this option.

Group sequential analysis for longitudinal studies is not new. Armitage, Stratton, and Worthington (1985) compared the cumulative sum of normally distributed longitudinal measurements taken at a common set of equally spaced follow-up times. Entry was assumed to occur simultaneously and the authors developed models incorporating autocorrelated within-person error terms and provided approximate values for adjusted significance levels. This work was broadened by Geary (1988) who developed a four-parameter model that included the Armitage, Stratton, and Worthington (1985) model as a submodel but retained the same restrictive assumptions of normality, simultaneous entry, and common set of follow-up times.

Lee and DeMets (1991) extended this work with a linear mixed model approach that allowed for staggered entry and a different number of follow-up times among individuals. The change over time was assumed to follow a simple parametric pattern, for example, linear or quadratic growth. Wu and Lan (1992) and Kittelson, Sharples, and Emerson (2005) used generalizations of area under the response curve formed by an individual’s responses as a summary measure instead of the sum or fitted slope parameter. Gange and DeMets (1996) developed group sequential testing for the generalized estimating equations setting and this work was extended to general covariate settings by Jennison and Turnbull (1997).

Jeffries and Geller (2015) presented a number of adaptive/flexible designs for a group sequential setting for normally distributed data in a two arm randomized trial without using summary measures or prescribing a specific parametric form for the responses over follow-up times. That work is expanded here by considering a generalized estimating equation approach that allows for more general response models and the inclusion of covariates. We compare a GEE-based method using the distribution of a max statistic to more conventional approaches for detecting a difference between longitudinal profiles with group sequential testing. In addition, we present an approach to determine which follow-up times show differences at the interim and/or final planned analysis that controls familywise error in the strong sense.

2. Model Description

Consider a trial that randomizes up to a predetermined number of participants, N, to either a control or an experimental arm, with accrual occurring over a broad time period. Further, the study collects follow-up measures of an outcome of interest on K occasions, for example, every 6 months for 3 years so that K = 6. In addition, suppose M analyses are conducted (1 final analysis and M − 1 interim analyses) in which treatment differences between the two arms will be assessed at the K different follow-up periods. Let δk parameterize the true difference between the experimental and control arms at the kth follow-up period. Depending upon the data structure, δk could represent the difference in mean responses, the log of an odds ratio, a function of regression model coefficients, or some other measure of difference. Let Zk(Nmk) denote an asymptotically normally distributed test statistic used to test the null hypothesis H0k : δk ≤ 0, where Nmk ≤ N is the total number of individuals providing data for the test of treatment difference at the kth follow-up time in the mth analysis. Initially, we consider one-sided tests where a higher response is desirable. Zk(Nmk) could arise from a number of testing approaches, for example, a t-test, a comparison of proportions, or a contrast from a regression model. We assume that if δk = 0 then Zk(Nmk) has a standard normal distribution; otherwise, if δk > 0 (< 0) then Zk(Nmk) is still normally distributed but with E{Zk(Nmk)} > 0 (< 0).

The intersection null hypothesis of interest is H0 = ⋂k=1,…,K H0k : δk ≤ 0. Jeffries and Geller (2015) presented a number of approaches for testing the intersection null in this situation with one interim and a final analysis. Here, we focus on an extension of one method presented there.

Let s(t) denote a spending function for allocating Type I error, where 0 ≤ t ≤ 1 denotes the study's information time. It is anticipated that as many as N participants may provide data for K follow-up measurements. For the first interim analysis define α(1) = s(t1), where t1 is the total number of follow-up measurements observed at the interim analysis divided by the total number of follow-up measurements expected if the trial is not stopped early. Let N1k, k = 1, …, K, denote the number of observed measurements for the kth follow-up time at the first interim analysis. The interim analysis test statistics Z1(N11), …, ZK(N1K) are then available; let Σ(1) denote their true K × K correlation matrix. Then for Z*(1) = maxk Zk(N1k) we can find b*(1) such that

\[
P_{H_0}\!\left(Z^{*(1)} > b^{*(1)}\right) \le P_{H_{00}}\!\left(Z^{*(1)} > b^{*(1)}\right)
= 1 - \int_{-\infty}^{b^{*(1)}}\!\!\cdots\!\int_{-\infty}^{b^{*(1)}} f_K\{z_1(N_{11}), \ldots, z_K(N_{1K});\, \Sigma^{(1)}\}\, dz_1(N_{11}) \cdots dz_K(N_{1K}) = \alpha(1) \tag{1}
\]

where fK{·; Σ(1)} denotes the multivariate normal density with mean vector 0 and known correlation matrix Σ(1). Here, H00 corresponds to the intersection hypothesis ⋂k=1,…,K H0k : δk = 0. In practice Σ(1) is estimated at the time of the interim analysis, say by Σ̂(1), and the multivariate normal integration (Genz et al., 2012) may be embedded within a root-finding function to find b*(1) given α(1) and the estimated correlation. Then, we reject the intersection null hypothesis H0 = ⋂k H0k : δk ≤ 0 at the first interim analysis if Z*(1) > b*(1).
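To make the boundary computation concrete, the following is a minimal sketch in R of solving equation (1) for b*(1) using the mvtnorm package cited above. The function and object names (find_b1, Sigma1_hat) are ours, not the paper's, and the illustration borrows the spending function s(t) = 0.05t³ used in Section 3.

```r
## Sketch: solve equation (1) for the first-stage boundary b*(1).
## Under H00, P(max_k Z_k > b) = 1 - P(Z_1 <= b, ..., Z_K <= b), so we
## root-find the b at which this tail probability equals alpha(1).
library(mvtnorm)

find_b1 <- function(Sigma1_hat, alpha1) {
  K <- nrow(Sigma1_hat)
  excess <- function(b) {
    1 - pmvnorm(lower = rep(-Inf, K), upper = rep(b, K),
                corr = Sigma1_hat)[1] - alpha1
  }
  uniroot(excess, interval = c(0, 10))$root
}

## Illustration with K = 3 and an exchangeable correlation of 0.5:
Sigma1_hat <- matrix(0.5, 3, 3); diag(Sigma1_hat) <- 1
b1 <- find_b1(Sigma1_hat, alpha1 = 0.05 * 0.62^3)  # s(t) = 0.05 t^3 at t1 = 0.62
```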

If we do not reject the intersection null hypothesis at the first interim analysis, then a second analysis is conducted in which we evaluate test statistics Zk(N2k), k = 1, …, K, where N2k indicates the number of observations available for the kth follow-up time at the second analysis. As before, calculate t2 as the total number of observations available at the second analysis divided by the total number of observations expected if the trial continues to its planned conclusion, and set α(2) = s(t2), the cumulative amount of Type I error spent through the second analysis. Now reject the same intersection null hypothesis if Z*(2) > b*(2), where Z*(2) = maxk Zk(N2k) and b*(2) satisfies

\[
\begin{aligned}
1 - \alpha(2) &= P_{H_{00}}\!\left(Z^{*(1)} < b^{*(1)},\, Z^{*(2)} < b^{*(2)}\right) \\
&= \int_{-\infty}^{b^{*(2)}}\!\!\cdots\!\int_{-\infty}^{b^{*(2)}}\int_{-\infty}^{b^{*(1)}}\!\!\cdots\!\int_{-\infty}^{b^{*(1)}} f_{2K}\{z_1(N_{11}), \ldots, z_K(N_{1K}), z_1(N_{21}), \ldots, z_K(N_{2K});\, \Sigma^{(2)}\} \\
&\qquad\times dz_1(N_{11}) \cdots dz_K(N_{1K})\, dz_1(N_{21}) \cdots dz_K(N_{2K}),
\end{aligned} \tag{2}
\]

or equivalently,

\[
\begin{aligned}
\alpha(2) - \alpha(1) &= P_{H_{00}}\!\left(Z^{*(1)} < b^{*(1)},\, Z^{*(2)} > b^{*(2)}\right) \\
&= \int_{b^{*(2)}}^{\infty}\!\!\cdots\!\int_{b^{*(2)}}^{\infty}\int_{-\infty}^{b^{*(1)}}\!\!\cdots\!\int_{-\infty}^{b^{*(1)}} f_{2K}\{z_1(N_{11}), \ldots, z_K(N_{1K}), z_1(N_{21}), \ldots, z_K(N_{2K});\, \Sigma^{(2)}\} \\
&\qquad\times dz_1(N_{11}) \cdots dz_K(N_{1K})\, dz_1(N_{21}) \cdots dz_K(N_{2K})
\end{aligned} \tag{3}
\]

where Σ(2) is the true 2K × 2K correlation matrix of the test statistics from the first and second analyses and f2K{·; Σ(2)} denotes the multivariate normal density with mean vector 0 and correlation matrix Σ(2). Given α(2), an estimate of Σ(2), and the b*(1) computed at the first analysis, one can again use a root-finding function and multivariate integration to determine b*(2). The methodology extends to multiple interim analyses; in general, given α(1), …, α(m), b*(1), …, b*(m−1), and an estimate of the correlation matrix Σ(m), one can compute a threshold b*(m) for maxk Zk(Nmk) using integration like that in equation (2) which satisfies

\[
P_{H_{00}}\!\left(Z^{*(1)} < b^{*(1)}, \ldots, Z^{*(m-1)} < b^{*(m-1)},\, Z^{*(m)} > b^{*(m)}\right) = \alpha(m) - \alpha(m-1).
\]
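A corresponding sketch for the second-stage boundary, again with hypothetical object names, implements equations (2)-(3) by subtracting the 2K-dimensional orthant probability from the first-stage one:

```r
## Sketch: solve equations (2)-(3) for b*(2) given the 2K x 2K estimated
## correlation matrix Sigma2_hat, the first-stage boundary b1, and the
## cumulative spending levels alpha1 < alpha2.
find_b2 <- function(Sigma2_hat, b1, alpha1, alpha2) {
  K <- nrow(Sigma2_hat) / 2
  excess <- function(b2) {
    p_stage1 <- pmvnorm(upper = rep(b1, K), corr = Sigma2_hat[1:K, 1:K])[1]
    p_both   <- pmvnorm(upper = c(rep(b1, K), rep(b2, K)), corr = Sigma2_hat)[1]
    (p_stage1 - p_both) - (alpha2 - alpha1)  # P(Z*(1) < b1, Z*(2) > b2) - increment
  }
  uniroot(excess, interval = c(0, 10))$root
}
```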

In Figure 1, we illustrate the ideas and notation in a setting with 6, 9, and 12 month follow-up periods and a maximum sample size of N = 400. Accrual occurs uniformly and an interim analysis is planned after data from the first 17 months of study time are available. At study month 17, 6 month follow-up data are available for the subjects enrolled during the first 11 months, so N11 = 183 and Z1(N11) = Z1(183). The test statistic based on the 9 month follow-up data is denoted Z2(133), as about 133 individuals are expected to provide 9 month data at the interim analysis. Similarly, the test statistic at the interim analysis for 12 month data is Z3(83). Z*(1) = max{Z1(183), Z2(133), Z3(83)} is calculated, and if it is sufficiently large one concludes there is significant evidence against the null hypothesis of no positive treatment difference at any follow-up period. In Section 4, we consider which follow-up periods show sufficient evidence of a treatment difference, taking into account multiple comparisons.
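The interim sample sizes quoted above can be checked with a small back-of-envelope calculation; the 24 month accrual window below is our assumption, chosen because it reproduces the counts in the text.

```r
## Expected number with lag-month follow-up data at a given analysis month,
## assuming uniform accrual of N = 400 over 24 months (assumed window).
N <- 400; accrual_months <- 24
n_avail <- function(lag, analysis_month) {
  round(N * min(accrual_months, max(0, analysis_month - lag)) / accrual_months)
}
sapply(c(6, 9, 12), n_avail, analysis_month = 17)  # 183 133 83
```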

Figure 1. Temporal availability of test statistics.

If Z*(1) does not exceed the b*(1) threshold, then accrual continues and a second test statistic Z*(2) = max{Z1(400), Z2(400), Z3(400)} is computed and compared to b*(2). Web Appendix B describes a straightforward extension of these ideas for testing two-sided hypotheses, that is, H0k : δk = 0, k = 1, …, K.

2.1. GEE Context

Here, we present the testing procedures within the context of generalized estimating equations (GEE) with one interim and a final analysis, although the results may be generalized to multiple interim analyses. The notation is similar to the development in Liang and Zeger (1986). Let Yik, k = 1, …, K, i = 1, …, N, denote the kth follow-up measurement of interest from the ith individual randomized. We assume that marginally Yik has a generalized linear model structure with density/mass function

\[
f(y_{ik}; \beta, \phi) = \exp\!\left[\left\{y_{ik}\theta_{ik} - a(\theta_{ik}) + b(y_{ik})\right\}/\phi\right]
\]

where θik = h(ηik), ηik = xikTβ, and xikT denotes the transpose of the p × 1 vector xik. Standard assumptions yield E(Yik) = a′(θik) and Var(Yik) = ϕa″(θik), where ′ and ″ denote first and second derivatives with respect to θik and ϕ is a scale parameter. Here h(·) connects the linear predictor ηik = xikTβ to θik. It is to be understood that the expectations and variances in this GEE context are conditional upon the covariates xik.
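As a concrete special case of this setup (our illustration): a Gaussian outcome with identity link corresponds to

\[
\theta_{ik} = \eta_{ik} = x_{ik}^T\beta, \qquad a(\theta_{ik}) = \theta_{ik}^2/2, \qquad \phi = \sigma^2,
\]

so that E(Yik) = a′(θik) = θik and Var(Yik) = ϕa″(θik) = σ², with b(yik) = −yik²/2 up to a normalizing term in ϕ.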

The GEE approach is well suited to modeling correlated data from an individual. Let ki ≤ K denote the number of observed follow-up measures for the ith individual when an analysis is conducted. Define Yi = (Yi1, …, Yiki)T, Xi = (xi1, xi2, …, xiki)T as a ki × p covariate matrix, and μi = {a′(θi1), …, a′(θiki)}T. The within-person variability is modeled by

\[
V_i = \operatorname{Var}(Y_i) = \phi\, A_i^{1/2} R_i(\gamma) A_i^{1/2}
\]

where Ai is a ki × ki diagonal matrix with a″(θik) on the diagonal and Ri(γ) is a modeled correlation matrix parameterized by γ. Vi is the "working" covariance matrix. With a working assumption of independence, Ri is the identity matrix. Alternatively, with a small set of follow-up times, for example K = 3, it is reasonable to use an unstructured correlation matrix R with elements rk1k2 that can be estimated. In this case, γ corresponds to the K(K − 1)/2 off-diagonal correlations rk1k2.

Given working covariance parameters estimated by ϕ̂ and γ̂, the GEE estimates of β are defined as solutions to the following estimating equations:

\[
\sum_{i=1}^{N} D_i^T \widehat{V}_i^{-1}\{Y_i - \mu_i(\beta)\} = 0, \qquad \text{where } D_i = \frac{\partial \mu_i(\beta)}{\partial \beta}.
\]

Although different working covariance structures yield different β̂ estimates, Liang and Zeger (1986) show that, under general conditions and assuming the mean structure is correctly specified, the different β̂ are consistent and

\[
\sqrt{N}\,(\widehat{\beta} - \beta) \approx \left(\frac{1}{N}\sum_{i=1}^{N} D_i^T V_i^{-1} D_i\right)^{-1} \frac{1}{\sqrt{N}}\sum_{i=1}^{N} D_i^T V_i^{-1}\{Y_i - \mu_i(\beta)\} \tag{4}
\]

The RHS of (4) has an asymptotic multivariate Gaussian distribution with mean vector 0 and variance–covariance matrix

\[
\left(\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} D_i^T V_i^{-1} D_i\right)^{-1}
\times \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} D_i^T V_i^{-1}\operatorname{Cov}(Y_i)\,V_i^{-1} D_i
\times \left(\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} D_i^T V_i^{-1} D_i\right)^{-1} \tag{5}
\]

This robust sandwich variance–covariance matrix depends on the assumed form of Vi and can be estimated given ϕ̂, γ̂, and β̂. The square roots of the diagonal elements of this matrix (divided by N), evaluated at the estimated parameters, yield the estimated robust standard errors reported in GEE output.

In the context of a two arm randomized clinical trial with an experimental and control arm and data recorded over K follow-up periods, we specify the following model

\[
\eta_{ik} = x_{ik}^T\beta = \beta_{C1} I\{k{=}1\} + \cdots + \beta_{CK} I\{k{=}K\} + W_{i0}^T\beta_0 + \beta_{D1} I\{k{=}1, \mathrm{Arm}{=}\mathrm{Experimental}\} + \cdots + \beta_{DK} I\{k{=}K, \mathrm{Arm}{=}\mathrm{Experimental}\} \tag{6}
\]

Here the I{·} are indicator variables designating the follow-up period and whether the ith person randomized received the experimental rather than the control treatment; βCk, k = 1, …, K, denotes the effect for the control arm at the kth measurement, and βDk, k = 1, …, K, denotes the treatment-control difference at the kth follow-up time. Wi0 denotes a vector of baseline covariates measured prior to randomization and β0 are the associated coefficients. Our intersection null hypothesis is H0 = ⋂k H0k : βDk ≤ 0; here βDk corresponds to δk in the previous section. This formulation assumes the follow-up measurements are taken at approximately the same common set of K time points, although no parametric assumptions are made regarding the shape of the response profile over time, for example, linear or quadratic response.
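Model (6) can be fit directly with standard GEE software. The following is a minimal sketch using the geepack package in R; the package choice, the long-format variable names (y, period, arm, age, y0, id), and the use of age and baseline outcome as the Wi0 covariates are our assumptions, echoing the simulation setup of Section 3.

```r
## Sketch: fit model (6) by GEE with an unstructured working correlation.
## Data are assumed in long format, one row per (id, period), sorted by id.
library(geepack)

dat$period <- factor(dat$period)              # follow-up period k = 1, ..., K
fit <- geeglm(y ~ 0 + period + age + y0 + period:arm,
              id = id, waves = as.integer(period),
              data = dat, family = gaussian, corstr = "unstructured")
## "0 + period" yields the betaC_k terms; "period:arm" (arm = 1 for the
## experimental group) yields the treatment-control differences betaD_k.
```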

We allow for an interim analysis in which N11 ≥ N12 ≥ ⋯ ≥ N1K individuals have provided data for each of the follow-up periods and they are partitioned over the treatment and control arms so that each βCk and βDk can be estimated.

In Jeffries and Geller (2015) a covariance function was derived to estimate the K × K covariance/correlation matrix associated with the βDk estimates for normally distributed data without covariates. Here, we can use the output from GEE to extract the relevant K × K portion of the estimated covariance matrix and convert it into the required estimated correlation matrix, Σ̂(1). The information time for this interim analysis can be taken as the fraction of the eventual total expected number of responses that are observed at the time of the interim analysis.

We define Z*(1) = maxk Zk(N1k), where Zk(N1k) = β̂Dk(1)/ŝeDk(1) and the β̂Dk(1) and ŝeDk(1) terms are the estimated coefficients and estimated standard errors available from the GEE output. To estimate Σ(1), first let Ω̂(1) denote the p × p estimated variance–covariance matrix for the coefficients. Then the estimated correlations between Zk1(N1k1) and Zk2(N1k2) can be obtained from the K × K submatrix of

\[
\{\operatorname{diag}(\widehat{\Omega}^{(1)})\}^{-1/2}\; \widehat{\Omega}^{(1)}\; \{\operatorname{diag}(\widehat{\Omega}^{(1)})\}^{-1/2}
\]

corresponding to the β̂Dk terms, where {diag(Ω̂(1))}−1/2 denotes a diagonal matrix with ψ(x) = x−1/2 applied to the diagonal elements of Ω̂(1). Using this Σ̂(1), b*(1) is computed from equation (1) and the intersection null hypothesis H0 = ⋂k H0k : βDk ≤ 0 is rejected if Z*(1) > b*(1).
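Continuing the fitting sketch above, the max statistic and Σ̂(1) can be extracted from the fitted object; the component names below (the geese vbeta slot, the ":arm" coefficient labels) follow from our hypothetical model fit and are not from the paper's code.

```r
## Sketch: robust vcov of all p coefficients, then the K x K block for the
## treatment-difference terms, its correlation matrix, and Z*(1).
Omega1_hat <- fit$geese$vbeta                     # robust (sandwich) vcov estimate
idx        <- grep(":arm$", names(coef(fit)))     # positions of the betaD_k terms
Sigma1_hat <- cov2cor(Omega1_hat)[idx, idx]       # estimated correlation of Z_1, ..., Z_K
z_stats    <- coef(fit)[idx] / sqrt(diag(Omega1_hat)[idx])
Z_star_1   <- max(z_stats)                        # reject H_0 if Z_star_1 > b*(1)
```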

For the case of M = 2, that is, only one interim and a final analysis, if we fail to reject at the interim analysis, then we continue to full accrual of data and conduct a similar analysis at the end of the study using

\[
\{Z_1(N_{21}), \ldots, Z_K(N_{2K})\} = \left(\widehat{\beta}_{D1}^{(2)}\big/\widehat{se}_{D1}^{(2)},\ \ldots,\ \widehat{\beta}_{DK}^{(2)}\big/\widehat{se}_{DK}^{(2)}\right)
\]

where the (2) superscripts and “2” subscripts in the N2k terms indicate these quantities are based on all the data available at the end of the study. In the absence of missing data N2k = N for all k.

To find an appropriate cutoff at the end of the study, we need the correlation matrix, Σ(2), for all 2K variables {Z1(N11), …, ZK(N1K), Z1(N21), …, ZK(N2K)}. Note that although Σ(1) has dimension K × K, Σ(2) has dimension 2K × 2K. The upper left K × K submatrix of Σ(2) is estimated by the Σ̂(1) obtained at the interim analysis. The lower right K × K submatrix of Σ(2) is estimated from GEE output at the final analysis in the same way Σ(1) was estimated, that is, as the relevant K × K submatrix of

\[
\{\operatorname{diag}(\widehat{\Omega}^{(2)})\}^{-1/2}\; \widehat{\Omega}^{(2)}\; \{\operatorname{diag}(\widehat{\Omega}^{(2)})\}^{-1/2}
\]

where Ω̂(2) is the estimated variance–covariance matrix from the second analysis.

Now consider the entries of Σ(2) in rows 1 through K and columns K + 1 through 2K. A general expression for determining Cov{Zk1(N1k1), Zk2(N2k2)} can be obtained by first writing the approximation in equation (4) for the interim and final analyses:

\[
\sqrt{N^{(m)}}\left(\widehat{\beta}^{(m)} - \beta\right) \approx \left\{\frac{1}{N^{(m)}}\sum_{i=1}^{N^{(m)}} (D_i^{(m)})^T (V_i^{(m)})^{-1} D_i^{(m)}\right\}^{-1} \frac{1}{\sqrt{N^{(m)}}}\sum_{i=1}^{N^{(m)}} (D_i^{(m)})^T (V_i^{(m)})^{-1}\left\{Y_i^{(m)} - \mu_i^{(m)}(\beta)\right\}
\]

where (m) = (1) and (m) = (2) correspond to quantities determined at the interim and final analysis, respectively. N(m) is the maximum number of observations at any of the follow-up periods for the mth analysis; it will typically correspond to Nm1, the number of observations at the first follow-up period in the mth analysis. The asymptotic methods and assumptions that show (5) is the appropriate variance–covariance matrix for (4) can be used to show that, asymptotically,

\[
\begin{aligned}
\operatorname{Cov}&\left\{\sqrt{N^{(1)}}\,(\widehat{\beta}^{(1)} - \beta),\ \sqrt{N^{(2)}}\,(\widehat{\beta}^{(2)} - \beta)\right\} \\
&= \left\{\lim_{N^{(1)}\to\infty}\frac{1}{N^{(1)}}\sum_{i=1}^{N^{(1)}} (D_i^{(1)})^T (V_i^{(1)})^{-1} D_i^{(1)}\right\}^{-1} \\
&\quad\times \lim\,\frac{1}{\sqrt{N^{(1)}N^{(2)}}}\sum_{i=1}^{N^{(1)}} (D_i^{(1)})^T (V_i^{(1)})^{-1} \operatorname{Cov}(Y_i^{(1)}, Y_i^{(2)})\,(V_i^{(2)})^{-1} D_i^{(2)} \\
&\quad\times \left\{\lim_{N^{(2)}\to\infty}\frac{1}{N^{(2)}}\sum_{i=1}^{N^{(2)}} (D_i^{(2)})^T (V_i^{(2)})^{-1} D_i^{(2)}\right\}^{-1}
\end{aligned} \tag{7}
\]

where N(1)/N(2) is considered a fixed fraction less than 1 as both terms go to ∞. An estimate of this matrix can be computed with a moderate amount of programming using the output from the interim and final analyses. The resulting estimated covariance matrix can be converted into an estimated correlation matrix by appropriate division by its diagonal elements. Details of these computations are presented in Web Appendix C.

Hence, one can construct Σ̂(2), a 2K × 2K estimated correlation matrix for Z1(N11), …, ZK(N1K), Z1(N21), …, ZK(N2K). Using Σ̂(2), the appropriate value of b*(2) can be calculated using (2), and the intersection null hypothesis is rejected at the final analysis if Z*(2) = maxk Zk(N2k) > b*(2). Although the development in this section was written with M = 2 analyses, the generalization to more than one interim analysis plus a final analysis is straightforward.

As noted in Liang and Zeger (1986), this procedure works when data are missing completely at random, as is the case for an interim analysis where missing data arise solely because not enough follow-up time has elapsed for some individuals. These authors also note other instances in which a weaker missing at random assumption may be sufficient, for example, if the assumed form of the working correlation matrix R is correct (as would be the case with an unstructured correlation matrix) with Gaussian or binary outcomes.

3. Simulations

Simulations for Type I error and power are based on the following data generation model:

\[
\begin{aligned}
Y_{ik} &= \beta_{C1} I\{k{=}1\} + \beta_{C2} I\{k{=}2\} + \beta_{C3} I\{k{=}3\} + \beta_{\mathrm{Age}}\,\mathrm{Age}_i \\
&\quad + \beta_{D1} I\{k{=}1, \mathrm{Treat}{=}\mathrm{Exper}\} + \beta_{D2} I\{k{=}2, \mathrm{Treat}{=}\mathrm{Exper}\} + \beta_{D3} I\{k{=}3, \mathrm{Treat}{=}\mathrm{Exper}\} + \epsilon_{ik}, \quad k \in \{1,2,3\}, \text{ and} \\
Y_{i0} &= \beta_{\mathrm{Age}}\,\mathrm{Age}_i + \epsilon_{i0}
\end{aligned} \tag{8}
\]

where Agei and Yi0 correspond to baseline age and the baseline value of Y for the ith person. Age is transformed to have a standard normal distribution. Within-person correlation was driven by correlation in the ϵik terms, which are Gaussian with mean 0 and pairwise covariance Cov(ϵij, ϵik) = exp(−|Tj − Tk|/15), where T0 = 0, T1 = 6, T2 = 9, and T3 = 12 months. Accrual occurred randomly in both arms at a uniform rate over an 18 month period. The outcome measure was assessed at baseline and at 6 months, 9 months, and 1 year after baseline. In all simulations βCk = 0, while the βDk varied according to the simulation scenario. One interim analysis was conducted when 50% of the randomized individuals had provided a measurement for the 3rd follow-up period, that is, N1K = 0.5N. A final analysis was conducted if results of the interim analysis were not significant.
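For reference, a condensed sketch of this data-generation scheme (function and variable names are ours; the full simulation code is in the supplementary R files):

```r
## Sketch of data-generation model (8): Gaussian errors with
## Cov(eps_ij, eps_ik) = exp(-|T_j - T_k|/15), uniform accrual over 18 months.
simulate_trial <- function(N, betaD = c(0, 0, 0), betaAge = 0) {
  Tm        <- c(0, 6, 9, 12)                  # baseline plus follow-up months
  Sigma_eps <- exp(-abs(outer(Tm, Tm, "-")) / 15)
  age   <- rnorm(N)                            # standardized baseline age
  arm   <- rbinom(N, 1, 0.5)                   # 1 = experimental arm
  entry <- runif(N, 0, 18)                     # uniform accrual in months
  eps   <- mvtnorm::rmvnorm(N, sigma = Sigma_eps)
  Y <- betaAge * age + eps                     # betaC_k = 0 in all scenarios
  Y[, 2:4] <- Y[, 2:4] + outer(arm, betaD)     # add betaD_k for treated subjects
  data.frame(id = seq_len(N), age, arm, entry,
             Y0 = Y[, 1], Y1 = Y[, 2], Y2 = Y[, 3], Y3 = Y[, 4])
}
```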

A number of testing procedures were evaluated, each designed for a 5% error rate:

  (a) A Bonferroni adjustment approach in which the change from baseline to the kth follow-up period is the outcome measure, tested via a two group t-test. The p-value threshold for each of the 3 tests at the interim analysis is s(0.62)/3, where s(t) = 0.05t³ is the spending function and approximately 62% of the expected responses are observed given the stated accrual patterns and follow-up times. This spending function was chosen for simplicity but is otherwise arbitrary. The p-value threshold at the final analysis is {0.05 − s(0.62)}/3.

  (b) A max t-test approach as described in Jeffries and Geller (2015). This approach uses the change from baseline to the kth follow-up period as the outcome measure and employs a two group t-test. The thresholds for significance at the interim and final analyses are based on the same ideas used to determine b*(1) and b*(2) here, but a different correlation matrix is required. No use is made of covariate information.

  (c) A GEE approach based on Z*(m) = maxk β̂Dk(m)/ŝeDk(m), m = 1, 2, using (6) with Yi0 as the only covariate.

  (d) A GEE approach based on Z*(m) = maxk β̂Dk(m)/ŝeDk(m), m = 1, 2, with baseline age and Yi0 as covariates.

  (e) A GEE approach based on the model in (d) in which the Zk(Nmk) = β̂Dk(m)/ŝeDk(m), k = 1, …, K, are sequentially monitored, each at a Bonferroni-corrected alpha level 0.05/K. This approach includes baseline age and Yi0 as covariates but does not use the distribution of Z*(m) = maxk β̂Dk(m)/ŝeDk(m).

Table 1 shows there is some Type I error inflation for smaller sample sizes in the non-Bonferroni approaches. The inflation can be reduced, in some cases substantially, by using a t distribution instead of a multivariate normal distribution when finding thresholds as in equations (1) and (2). As the sample size increases the error inflation dissipates. Otherwise, each approach shows appropriate Type I error control across the range of scenarios, although the Bonferroni based methods are conservative, as expected. The strong agreement in numerical values across rows reflects that the same random number seeds were used to generate the data, although some slight residual variation across rows still arises from the use of Monte Carlo simulation in the multivariate integration process (Genz et al., 2012) and occasional convergence difficulties. (Convergence problems occur for less than 0.01% of the simulations; in these tables a failed convergence for one method suppresses that simulation's results for all 5 methods.) Aside from variation due to convergence problems, results for the t-test methods (approaches (a) and (b)) should vary only by sample size and by whether a normal or t distribution was used. The GEE-based methods additionally vary by the working correlation structure assumption and by whether an Age coefficient is included.

Table 1.

Type I error simulations

Max enroll per arm Working corr. βAge (a) Bonferroni (b) max t-test (c) GEE w/o age covariate (d) GEE w/age covariate (e) GEE w/age and Bonferroni correction
100 Indep 0 0.03223 0.05264 0.05385 0.05438 0.04277
100(t dist) Indep 0 0.02920 0.04916 0.05041 0.05052 0.04086
100 Unstruc 0 0.03222 0.05263 0.05370 0.05396 0.04145
100(t dist) Unstruc 0 0.02919 0.04915 0.05026 0.05047 0.03963
100 Indep 0.30 0.03223 0.05265 0.05380 0.05437 0.04277
100(t dist) Indep 0.30 0.02920 0.04916 0.05041 0.05052 0.04086
100 Unstruc 0.30 0.03222 0.05263 0.05370 0.05396 0.04145
100(t dist) Unstruc 0.30 0.02919 0.04915 0.05026 0.05047 0.03963
200 Indep 0 0.03122 0.05088 0.05211 0.05219 0.04045
200(t dist) Indep 0 0.02989 0.04896 0.05029 0.05031 0.03944
200 Unstruc 0 0.03122 0.05088 0.05216 0.05190 0.03915
200(t dist) Unstruc 0 0.02989 0.04896 0.04988 0.05015 0.03817
200 Indep 0.30 0.03122 0.05088 0.05271 0.05218 0.04045
200(t dist) Indep 0.30 0.02989 0.04896 0.05029 0.05031 0.03944
200 Unstruc 0.30 0.03122 0.05088 0.05216 0.05190 0.03915
200(t dist) Unstruc 0.30 0.02989 0.04896 0.04988 0.05015 0.03817
400 Indep 0 0.03131 0.05120 0.05221 0.05243 0.04013
400(t dist) Indep 0 0.03057 0.05025 0.05116 0.05132 0.03964
400 Unstruc 0 0.03131 0.05120 0.05226 0.05255 0.03915
400(t dist) Unstruc 0 0.03057 0.05025 0.05137 0.05159 0.03862
400 Indep 0.30 0.03131 0.05120 0.05198 0.05241 0.04013
400(t dist) Indep 0.30 0.03057 0.05030 0.05087 0.05132 0.03964
400 Unstruc 0.30 0.03131 0.05120 0.05195 0.05252 0.03915
400(t dist) Unstruc 0.30 0.03057 0.05033 0.05114 0.05160 0.03862

Note: Each scenario/row was based on 100,000 simulations. The standard error for the estimated Type I error is approximately 0.0007. Scenarios differ by the number enrolled, whether a multivariate normal or t distribution was used to determine the thresholds, whether the working correlation used an independence assumption or an unstructured framework, and the value of the βAge coefficient. βDk = 0 for k = 1, 2, 3. When a multivariate t distribution was used, the common degrees of freedom were based on the number of observations available at the interim analysis.

Table 2 shows power for different scenarios, and here we see important differences. The approaches based on t-tests rather than GEE models suffer a loss of power. The t-tests are based on differences of follow-up values from baseline, that is, Yi0 is subtracted from the follow-up values, whereas the modeling approaches use Yi0 as a RHS covariate. These power differences are what would be expected when comparing change-score approaches to analysis of covariance. Predictably, the t-test approaches (a) and (b) show no changes in power across the βAge values. The GEE approach without Age in the model shows deteriorating performance as the magnitude of the Age effect increases, reflecting greater misspecification. When the Age effect is 0, models (c) and (d) are essentially the same. The power of the GEE models with Age as a covariate (methods (d) and (e)) does not change as the effect of Age increases; examination of the data generation model shows that increasing βAge will not change the estimates or standard error estimates of the βDk values. All approaches show increasing power with increasing values of βD3. Method (d) generally shows superior power and the benefit of including relevant covariate information.

Table 2.

Power for 5 approaches, 10,000 simulations

Working corr. structure βAge βD3 (a) Bonferroni (b) max t-test (c) GEE w/o age covariate (d) GEE w/age covariate (e) GEE w/age and Bonferroni correction
Unstr 0 0.20 0.3794 0.4702 0.5863 0.5867 0.5326
Unstr 0.50 0.20 0.3794 0.4702 0.5690 0.5870 0.5326
Unstr 1.00 0.20 0.3794 0.4702 0.5230 0.5868 0.5326
Unstr 1.50 0.20 0.3794 0.4702 0.5031 0.5868 0.5327
Unstr 0 0.25 0.5628 0.6547 0.7773 0.7769 0.7364
Unstr 0.50 0.25 0.5628 0.6547 0.7518 0.7769 0.7364
Unstr 1.00 0.25 0.5628 0.6547 0.7125 0.7769 0.7364
Unstr 1.50 0.25 0.5629 0.6547 0.6922 0.7768 0.7365
Unstr 0 0.30 0.7344 0.8047 0.9061 0.9062 0.8806
Unstr 0.50 0.30 0.7345 0.8048 0.8868 0.9062 0.8806
Unstr 1.00 0.30 0.7344 0.8048 0.8571 0.9062 0.8806
Unstr 1.50 0.30 0.7344 0.8048 0.8406 0.9063 0.8806
Indep 0 0.20 0.3794 0.4702 0.5846 0.5848 0.5358
Indep 0.50 0.20 0.3794 0.4702 0.5549 0.5847 0.5358
Indep 1.00 0.20 0.3794 0.4702 0.5203 0.5847 0.5358
Indep 1.50 0.20 0.3794 0.4702 0.5012 0.5847 0.5358
Indep 0 0.25 0.5628 0.6547 0.7746 0.7740 0.7355
Indep 0.50 0.25 0.5628 0.6547 0.7489 0.7743 0.7355
Indep 1.00 0.25 0.5628 0.6547 0.7081 0.7743 0.7355
Indep 1.50 0.25 0.5628 0.6547 0.6871 0.7742 0.7355
Indep 0 0.30 0.7344 0.8047 0.9040 0.9040 0.8808
Indep 0.50 0.30 0.7344 0.8048 0.8853 0.9040 0.8808
Indep 1.00 0.30 0.7344 0.8048 0.8539 0.9040 0.8808
Indep 1.50 0.30 0.7344 0.8048 0.8363 0.9040 0.8808

Note: Each scenario/row was based on 10,000 simulations. The standard error for estimated power is bounded by 0.005. A normal distribution (rather than a t distribution) was used to calculate threshold values.

The differences arising from an independence versus an unstructured working correlation are minor, except at the interim analysis. As mentioned in Liang and Zeger (1986), differences arising from working correlation assumptions may be smaller for balanced data, while unbalanced data will lead to larger differences. Unbalanced data arise at the interim analysis because individuals have 1, 2, or 3 follow-up observations. When data are balanced (e.g., at the final analysis when everyone has 3 follow-up observations) there is no appreciable difference between the independence and unstructured approaches. See Web Appendix A for results showing these interim analysis results and further simulations with smaller sample sizes, M = 3 analyses, K = 5 follow-up periods, and compound symmetry dictating the true correlation between ϵij and ϵik, j, k ≥ 0, in equation (8). The results suggest some care must be taken to evaluate the robustness of estimates to differences in working correlation assumptions and that the use of the t distribution may be overly conservative when sample sizes are small at the interim analysis.

The results show that utilization of a model with baseline outcomes can substantially increase power over an unmodeled approach and suggest the benefit is likely greater still if other important covariates are included.

4. Determining which Follow-Up Periods Show Differences

Thus far the methodology has focused on determining if any follow-up period shows a difference. However, there may be interest in determining which set of follow-up periods show a difference and doing so in a way that accommodates multiplicity concerns—concerns related to the number of follow-up periods as well as the number of interim and final analyses.

In the one-sided testing context of K follow-up times and M sequential analyses with cumulative alpha thresholds α(m), m = 1, …, M, we have boundary thresholds b*(m) satisfying

\[
\begin{aligned}
P_{H_{00}}\!\left(Z^{*(1)} > b^{*(1)}\right) &= \alpha(1), \\
P_{H_{00}}\!\left(Z^{*(1)} < b^{*(1)}, \ldots, Z^{*(m-1)} < b^{*(m-1)},\, Z^{*(m)} > b^{*(m)}\right) &= \alpha(m) - \alpha(m-1).
\end{aligned}
\]

For one-sided testing, consider test procedure A defined as follows: let m′ satisfy Z*(1) < b*(1), …, Z*(m′−1) < b*(m′−1), Z*(m′) > b*(m′). Reject H0k : δk ≤ 0 for all k with Zk(Nm′k) > b*(m′), that is, reject H0k for all corresponding test statistics exceeding the first crossed boundary.
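In code form, Procedure A is a few lines; this sketch (our own, operating on an M × K matrix of statistics and a length-M vector of boundaries) returns the rejected follow-up times at the first boundary crossing:

```r
## Sketch of Test Procedure A: at the first analysis m' whose max statistic
## crosses b*(m'), reject H_0k for every k with Z_k(N_{m'k}) > b*(m').
procedure_A <- function(Z, b) {        # Z: M x K matrix, b: length-M boundaries
  for (m in seq_along(b)) {
    if (max(Z[m, ]) > b[m]) {
      return(list(stage = m, rejected = which(Z[m, ] > b[m])))
    }
  }
  list(stage = NA_integer_, rejected = integer(0))  # no boundary ever crossed
}
```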

Lemma 1. Testing Procedure A controls familywise error in the strong sense.

The proof of this lemma is given in Appendix A. A similar lemma for two-sided testing, with proof, appears as Lemma 2 in Web Appendix B.

5. Application: SOLVD Trial

The Studies of Left Ventricular Dysfunction (SOLVD) treatment trial was a double-blind, randomized, placebo-controlled trial to assess the effect of enalapril (an ACE inhibitor vasodilator) on mortality in a heart failure population (The SOLVD Investigators, 1991). A quality of life survey was administered at baseline and at 6 weeks, 1 year, and 2 years after baseline (Rogers et al., 1994). The survey assessed each participant's overall general health self-perception, recorded on a 5-point scale (recoded so that higher scores indicate better quality). Table 3 shows the mean and standard deviation of the self-assessment score at the various time points. Attention is restricted to those who completed the survey at all three follow-up time points. The New York Heart Association (NYHA) heart failure score is a four-point measure of the degree of heart failure and was obtained as a baseline measurement in the study. Here, we use it as a baseline covariate in modeling the general health score, denoted Yik:

\[
\begin{aligned}
Y_{ik} &= \beta_{C1} I\{t_k{=}6\,\mathrm{wk}\} + \beta_{C2} I\{t_k{=}1\,\mathrm{yr}\} + \beta_{C3} I\{t_k{=}2\,\mathrm{yr}\} \\
&\quad + \beta_{D1} I\{t_k{=}6\,\mathrm{wk}, \mathrm{Treat}{=}\mathrm{enalapril}\} + \beta_{D2} I\{t_k{=}1\,\mathrm{yr}, \mathrm{Treat}{=}\mathrm{enalapril}\} + \beta_{D3} I\{t_k{=}2\,\mathrm{yr}, \mathrm{Treat}{=}\mathrm{enalapril}\} \\
&\quad + \beta_{N}\,\mathrm{NYHA}_{i0} + \beta_0 Y_{i0} + \epsilon_{ik}
\end{aligned}
\]

where the ϵik follow a Gaussian distribution.

Table 3.

Summary statistics for General Health Self-Assessment in SOLVD. Higher scores indicate better self-reported general health. Only participants with complete data over three follow-up periods are counted here.

Placebo, NPlac = 514 Enalapril, NEnal = 537
Mean Std dev. Mean Std dev.
Baseline 2.53 0.91 2.54 0.96
6 Weeks 2.63 0.94 2.73 0.94
1 Year 2.75 0.92 2.76 0.96
2 Years 2.70 0.94 2.76 0.94

Enrollment in SOLVD occurred over a 34 month period. Here, we present an interim analysis occurring when approximately 25% of the 2 year outcome data are available. This corresponds to about 60% of the 1 year data and 91% of the 6 week data. This analysis is illustrative, that is, it was not performed as part of the SOLVD study, and the availability of the data at the interim analysis follows from assuming uniform accrual and entry in the order of the study ID number in the publicly available data (see Web Appendix D for the SOLVD data source). The t-statistics at the interim analysis were (t6wk, t1yr, t2yr) = (1.59, 0.80, 1.70). For the Bonferroni approach (a) the corresponding threshold was 2.74, and the threshold for approach (b), based on the maximum of the t-statistics, was 2.73. Consequently, neither approach (a) nor (b) reached significance (at a 5% level). GEE models were computed with and without the NYHA covariate. The corresponding z-statistics for the three follow-up periods without the NYHA information were (z6wk, z1yr, z2yr) = (1.91, 0.97, 1.04), while those with the covariate were (2.05, 1.00, 1.02). For both models, the threshold for the maximum of the z-statistics was 2.73. The test statistics for method (e) are those used in method (d), and the threshold value is 2.74. Consequently, none of the three GEE methods reached their thresholds for significance at the interim analysis.

A second analysis used all data available at the end of the study. The t-statistics were (t6wk, t1yr, t2yr) = (1.74, −0.07, 0.66). The Bonferroni and max t-statistic thresholds were 2.21 and 2.09, respectively, so neither approach reached significance. The z-statistics for the GEE model without NYHA were (2.06, 0.06, 0.89), and those for the model with the NYHA covariate were (2.15, 0.11, 0.90). The threshold for models (c) and (d) was 2.11 and the threshold for method (e) was 2.16. Consequently, only approach (d), the GEE approach using the NYHA information and incorporating correlation between follow-up time periods, led to a rejection of the intersection null hypothesis at the 5% level of significance and the conclusion that a significant improvement exists for the six week measure of self-perceived general health.

6. Discussion

We presented a flexible approach for analyzing longitudinal data in a group sequential setting that is especially suited for non-monotone treatment differences over time. The use of indicator variables in equation (6) allows the model to capture patterns of treatment differences that are not easily expressed by simple parametric functions or summary measures like AUC.

The approach allows for covariates, which increases power relative to change-score models and approaches that do not employ covariates. The method uses existing software and is therefore relatively simple to implement. The approach should be generalizable to other settings with covariates, such as mixed-effects models. In addition, we have shown that a procedure that rejects the null hypothesis of no treatment difference for all follow-up periods with test statistics exceeding the boundary threshold maintains familywise error in the strong sense.

It is noteworthy that the test statistics available at an interim analysis are based on different amounts of data; earlier follow-up periods will typically have more observations. If the same magnitude of positive treatment difference exists for each of the K follow-up periods, that is, δk = δ > 0 for all k = 1, …, K, the larger observed sample sizes for the earlier follow-up periods will tend to produce larger test statistics. Consequently, there may be a tendency for interim analyses to indicate differences at earlier follow-up periods than would be observed in an analysis at the planned study conclusion with all follow-up data available. This may be undesirable in some instances, for example, if there is interest in knowing how long a treatment difference lasts. This effect could be removed by basing interim test statistics only on a common set of individuals (e.g., those who have been in the study long enough to reach the Kth follow-up period), but this has the disadvantage of not using all available data.

Although the method was presented as if each interim analysis requires sufficient accrual so that there are some data for all K follow-up periods, that need not be the case. For example, the first interim analysis does not require that some individuals have reached the last follow-up period. In this case not all analyses will involve K test statistics; however, the notation and computations are easily altered for this situation.

Among the limitations of the approach is the assumption that the timing of planned follow-up measurements is the same for each individual; this similarity makes it easy to model non-parametric patterns with indicator functions. Also, it may be possible to sharpen the boundaries of the procedure that controls familywise error in the strong sense using ideas from Marcus, Peritz, and Gabriel (1976); further work will explore this possibility. However, such a closed testing approach will entail nontrivial computational burdens if K is not small. As is often the case, designs that focus attention on the most extreme test statistics may produce parameter estimates that are subject to selection bias. Future work will explore how to address these restrictions and concerns.

7. Supplementary Materials

Web Appendix A (referenced in Section 3), Web Appendix B (referenced in Sections 2 and 4), Web Appendix C (referenced in Section 2), Web Appendix D (referenced in Section 5), and R code for conducting simulations in Section 3 are available with this article at the Biometrics website on Wiley Online Library.


Acknowledgements

This work utilized the resources of the NIH HPC Biowulf cluster. The authors are employees of the National Heart, Lung, and Blood Institute. The views expressed in this article are the authors’ and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; National Institutes of Health; or the United States Department of Health and Human Services.

Appendix A

Lemma 1. Test Procedure A controls FWE strongly at level α(M).

Proof. Let τ = {i1, …, iv} denote the subset of {1, …, K} for which the one-sided null hypotheses are true, that is, H0iu : δiu ≤ 0 holds for u = 1, …, v. Let Hτ denote the corresponding intersection null hypothesis Hτ = ⋂u=1,…,v H0iu : δiu ≤ 0. Let δ = (δ1, …, δK). Under Hτ the v elements of δ corresponding to τ are non-positive and the remaining elements of δ are positive. Define δ′ to have kth component δ′k = min(δk, 0), so that δ′ has non-positive elements. The only components that differ between δ and δ′ correspond to the K − v follow-up times not represented in Hτ.

Recall from Section 2 that H00 corresponds to δk = 0 for all k = 1, …, K. Denote the mean of Zk(Nmk) by Ek(m), a function of δk, Nmk, and possibly nuisance parameters, such that the sign of Ek(m) matches the sign of δk (and if one is zero, then the other is zero). Further define Uk(m) = Zk(Nmk) − Ek(m), so that U1(1), …, UK(1), …, U1(M), …, UK(M) is multivariate normal of dimension MK with zero means and the same correlation structure as the corresponding Zk(Nmk) values.

By definition Test Procedure A rejects a hypothesis k′ if

\[
Z_{k'}(N_{1k'}) > b^{*(1)}, \quad \text{or}
\]
\[
Z^{*(1)} < b^{*(1)}, \ldots, Z^{*(m-1)} < b^{*(m-1)},\, Z^{*(m)} > b^{*(m)} \ \text{ and } \ Z_{k'}(N_{mk'}) > b^{*(m)} \text{ for some } m > 1.
\]

For each m = 1, …, M define Z*τ(m) = max{Zi1(Nmi1), …, Ziv(Nmiv)}, that is, the maximal z-statistic among the corresponding true hypotheses at analysis stage m. Then a Type I error occurs if and only if Z*τ(m′) > b*(m′), where m′ denotes the first analysis at which Z*(m′) > b*(m′). We want to show Pδ(Any Type I Error for Procedure A) ≤ α(M).

\[
\begin{aligned}
P_{\delta}&(\text{Any Type I Error for Procedure A}) \\
&= P_{\delta}\!\left[\left\{Z^{*\tau(1)} > b^{*(1)}\right\} \cup \bigcup_{m=2}^{M}\left\{Z^{*(1)} < b^{*(1)}, \ldots, Z^{*(m-1)} < b^{*(m-1)},\, Z^{*\tau(m)} > b^{*(m)}\right\}\right] \\
&\le P_{\delta}\!\left\{\bigcup_{m=1}^{M}\left(Z^{*\tau(m)} > b^{*(m)}\right)\right\}
= P_{\delta'}\!\left\{\bigcup_{m=1}^{M}\left(Z^{*\tau(m)} > b^{*(m)}\right)\right\}
\quad \text{(because } \delta \text{ and } \delta' \text{ agree on components } \{\delta_{i_1}, \ldots, \delta_{i_v}\}\text{)} \\
&\le P_{\delta'}\!\left\{\bigcup_{m=1}^{M}\left(Z^{*(m)} > b^{*(m)}\right)\right\}
= P_{\delta'}\!\left(\bigcup_{m=1}^{M}\left[\max\{Z_1(N_{m1}), \ldots, Z_K(N_{mK})\} > b^{*(m)}\right]\right) \\
&= P_{\delta'}\!\left(\bigcup_{m=1}^{M}\left[\max\{U_1^{(m)} + E_1^{(m)}, \ldots, U_K^{(m)} + E_K^{(m)}\} > b^{*(m)}\right]\right)
\le P_{\delta'}\!\left(\bigcup_{m=1}^{M}\left[\max\{U_1^{(m)}, \ldots, U_K^{(m)}\} > b^{*(m)}\right]\right)
\quad \text{(because } E_k^{(m)} \le 0\text{)} \\
&= P_{H_{00}}\!\left(\bigcup_{m=1}^{M}\left\{Z^{*(m)} > b^{*(m)}\right\}\right)
= P_{H_{00}}\!\left(Z^{*(1)} > b^{*(1)}\right) + \sum_{m=2}^{M} P_{H_{00}}\!\left(Z^{*(1)} < b^{*(1)}, \ldots, Z^{*(m-1)} < b^{*(m-1)},\, Z^{*(m)} > b^{*(m)}\right) \\
&\le \alpha(M).
\end{aligned}
\]

This demonstrates the lemma.

References

  1. Armitage P, Stratton IM, and Worthington HV (1985). Repeated significance tests for clinical trials with a fixed number of patients and variable follow-up. Biometrics 41, 353.
  2. Gange SJ and DeMets DL (1996). Sequential monitoring of clinical trials with correlated responses. Biometrika 83, 157.
  3. Geary DN (1988). Sequential testing in clinical trials with repeated measurements. Biometrika 75, 311.
  4. Genz A, Bretz F, Miwa T, Mi X, Leisch F, Scheipl F, et al. (2012). mvtnorm: Multivariate Normal and t Distributions. R package version 0.9-9992. URL http://CRAN.R-project.org/package=mvtnorm.
  5. Jeffries N and Geller NL (2015). Longitudinal clinical trials with adaptive choice of follow-up time. Biometrics 71, 469.
  6. Jennison C and Turnbull BW (1997). Group-sequential analysis incorporating covariate information. Journal of the American Statistical Association 92, 1330.
  7. Kittelson JM, Sharples K, and Emerson SS (2005). Group sequential clinical trials for longitudinal data with analyses using summary statistics. Statistics in Medicine 24, 2457.
  8. Lee JW and DeMets DL (1991). Sequential comparison of changes with repeated measures data. Journal of the American Statistical Association 86, 757.
  9. Liang KY and Zeger SL (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13.
  10. Marcus R, Peritz E, and Gabriel KR (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655.
  11. Rogers WJ, Johnstone DE, Yusuf S, Weiner DH, Gallagher P, Bittner VA, et al. (1994). Quality of life among 5025 patients with left ventricular dysfunction randomized between placebo and enalapril: The Studies of Left Ventricular Dysfunction. Journal of the American College of Cardiology 23, 393.
  12. The SOLVD Investigators (1991). Effect of enalapril on survival in patients with reduced left ventricular ejection fractions and congestive heart failure. The New England Journal of Medicine 325, 293.
  13. Wu MC and Lan KKG (1992). Sequential monitoring for comparison of changes in a response variable in clinical studies. Biometrics 48, 765.
