Summary.
In longitudinal studies comparing two treatments over a series of common follow-up measurements, there may be interest in determining whether a treatment difference exists at any follow-up period, particularly when the treatment effect over time may be non-monotone. To evaluate this question, Jeffries and Geller (2015) examined a number of clinical trial designs that allowed adaptive choice of the follow-up time exhibiting the greatest evidence of treatment difference in a group sequential testing setting with Gaussian data. The methods are applicable when a small number of measurements are taken at prespecified follow-up periods. Here, we test the intersection null hypothesis of no difference at any follow-up time versus the alternative that there is a difference for at least one follow-up time. Results of Jeffries and Geller (2015) are extended by considering a broader range of modeled data and the inclusion of covariates using generalized estimating equations. Testing procedures are developed to determine the set of follow-up times exhibiting a treatment difference while accounting for multiplicity in follow-up times and interim analyses.
Keywords: Generalized estimating equations, Generalized linear models, Group sequential design, Longitudinal analysis
1. Introduction
Clinical trials that regularly record measurements over time may be used to determine if there exist differences between treatment arms during the follow-up periods. It may be of interest to know which, if any, period shows the most evidence of a difference and/or which of a number of potential periods show a difference. As an example of the first question, consider a potential therapy for which there is little prior knowledge as to how long after the intervention is given the benefit may be most apparent. The second question may address how long any benefit may last. As a clinical example, we examine quality of life measurements for a standard and an experimental intervention in heart failure. We are particularly interested in settings where an intervention's effect may be non-monotonic and not easily summarized by a single measure such as the slope or area under the curve (AUC). Additionally, there may be interest in stopping the trial early if an interim analysis reveals that important differences exist, and the methodology incorporates this option.
Group sequential analysis for longitudinal studies is not new. Armitage, Stratton, and Worthington (1985) compared the cumulative sum of normally distributed longitudinal measurements taken at a common set of equally spaced follow-up times. Entry was assumed to occur simultaneously, and the authors developed models incorporating autocorrelated within-person error terms and provided approximate values for adjusted significance levels. This work was broadened by Geary (1988), who developed a four-parameter model that included the Armitage, Stratton, and Worthington (1985) model as a submodel but retained the same restrictive assumptions of normality, simultaneous entry, and a common set of follow-up times.
Lee and DeMets (1991) extended this work with a linear mixed model approach that allowed for staggered entry and a different number of follow-up times among individuals. The change over time was assumed to follow a simple parametric pattern, for example, linear or quadratic growth. Wu and Lan (1992) and Kittelson, Sharples, and Emerson (2005) used generalizations of area under the response curve formed by an individual’s responses as a summary measure instead of the sum or fitted slope parameter. Gange and DeMets (1996) developed group sequential testing for the generalized estimating equations setting and this work was extended to general covariate settings by Jennison and Turnbull (1997).
Jeffries and Geller (2015) presented a number of adaptive/flexible designs for a group sequential setting for normally distributed data in a two-arm randomized trial without using summary measures or prescribing a specific parametric form for the responses over follow-up times. That work is expanded here by considering a generalized estimating equations (GEE) approach that allows for more general response models and the inclusion of covariates. We compare a GEE-based method using the distribution of a max statistic to more conventional approaches for detecting a difference between longitudinal profiles with group sequential testing. In addition, we present an approach to determine which follow-up times show differences at the interim and/or final planned analysis that protects familywise error in the strong sense.
2. Model Description
Consider a trial that randomizes up to a predetermined number of participants, N, to either a control or experimental arm, with accrual occurring over a broad time period. Further, the study collects follow-up measures of an outcome of interest on K occasions, for example, every 6 months for 3 years so that K = 6. In addition, suppose M analyses are conducted (1 final analysis and M − 1 interim analyses) in which treatment differences between the two arms will be assessed at the K different follow-up periods. Let δk parameterize the true difference between the experimental and control arms at the kth follow-up period. Depending upon the data structure, δk could represent the difference in mean responses, the log of an odds ratio, a function of regression model coefficients, or other measures of difference. Let Zk(Nmk) denote an asymptotically normally distributed test statistic used to test the null hypothesis H0k : δk ≤ 0. Here, Nmk ≤ N is the total number of individuals providing data for the test of treatment difference at the kth follow-up time for the mth analysis. Initially, we consider one-sided tests where a higher response is desirable. Zk(Nmk) could arise from a number of testing approaches, for example, a t-test, a comparison of proportions, or a contrast from a regression model. We assume that if δk = 0 then Zk(Nmk) has a standard normal distribution; otherwise, if δk > 0 (< 0), then Zk(Nmk) is still normally distributed but E{Zk(Nmk)} > 0 (< 0).
The intersection null hypothesis of interest is H0 = ⋂k H0k : δk ≤ 0. Jeffries and Geller (2015) presented a number of approaches for testing the intersection null in this situation with one interim and a final analysis. Here, we focus on an extension of a method presented there.
Let s(t) denote a spending function for allocating Type I error where 0 ≤ t ≤ 1 denotes the study's information time. It is anticipated that as many as N participants may provide data for K follow-up measurements. For the first interim analysis define α(1) = s(t1) where t1 = the total number of follow-up measurements observed at the interim analysis divided by the total number of follow-up measurements expected if the trial is not stopped early. Let N1k, k = 1, …, K denote the number of observed measurements for the kth follow-up time at the first interim analysis. The interim analysis test statistics Z1(N11), …, ZK(N1K) are available; let Σ(1) denote the true K × K correlation matrix for Z1(N11), …, ZK(N1K). Then, given α(1), we can find a threshold c(1) such that
α(1) = 1 − ∫_{−∞}^{c(1)} ⋯ ∫_{−∞}^{c(1)} fK{z, Σ(1)} dz,    (1)
where fK{·, Σ(1)} denotes a multivariate normal distribution with mean vector 0 and known correlation matrix Σ(1). Here, H00 corresponds to the intersection hypothesis with δk = 0 for all k = 1, …, K. In practice Σ(1) is estimated, say by Σ̂(1), at the time of the interim analysis and the multivariate normal integration (Genz et al., 2012) may be embedded within a root-finding function to find c(1) given α(1) and the estimated correlation. Then, we reject the intersection null hypothesis H0 = ⋂k H0k : δk ≤ 0 at the first interim analysis if maxk Zk(N1k) ≥ c(1).
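As a concrete illustration of this computation, the following R sketch finds c(1) with the mvtnorm package (Genz et al., 2012) cited above. The correlation matrix Sigma1 is a hypothetical stand-in for the estimated Σ̂(1), and the spending function s(t) = 0.05t³ with t1 = 0.62 is borrowed from the simulation settings of Section 3.

```r
library(mvtnorm)

## Hypothetical estimated correlation of Z_1, Z_2, Z_3 at the interim look
Sigma1 <- matrix(c(1.0, 0.6, 0.4,
                   0.6, 1.0, 0.6,
                   0.4, 0.6, 1.0), nrow = 3)
alpha1 <- 0.05 * 0.62^3          # s(t) = 0.05 t^3 evaluated at t1 = 0.62

## c(1) solves P{max_k Z_k < c(1) | H00} = 1 - alpha(1); this is the
## equicoordinate quantile of a mean-zero multivariate normal, which
## qmvnorm obtains by embedding pmvnorm in a root finder.
c1 <- qmvnorm(1 - alpha1, tail = "lower.tail", corr = Sigma1)$quantile
```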
If we do not reject the intersection null hypothesis at the first interim analysis then a second analysis is conducted where we evaluate test statistics Zk(N2k), k = 1, …, K, where N2k indicates the number of observations available for the kth follow-up time at the second analysis. As before, calculate t2 as the total number of observations available at the second analysis divided by the total number of expected observations if the trial continues to its planned conclusion, and α(2) = s(t2), the cumulative amount of Type I error spent through the second analysis. Now reject the same intersection null hypothesis if maxk Zk(N2k) ≥ c(2), where c(1) is as computed at the first analysis and c(2) satisfies
Pr{ maxk Zk(N1k) < c(1), maxk Zk(N2k) ≥ c(2) | H00 } = α(2) − α(1),    (2)

which may be written as

∫_{−∞}^{c(1)} ⋯ ∫_{−∞}^{c(1)} ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f2K{z, Σ(2)} dz − ∫_{−∞}^{c(1)} ⋯ ∫_{−∞}^{c(1)} ∫_{−∞}^{c(2)} ⋯ ∫_{−∞}^{c(2)} f2K{z, Σ(2)} dz = α(2) − α(1),    (3)

where the first K integrals in each term correspond to the interim statistics and the last K to the final statistics,
and Σ(2) is the true 2K × 2K correlation matrix of the test statistics from the first and second analyses and f2K{·, Σ(2)} denotes a multivariate normal distribution with mean vector 0 and correlation matrix Σ(2). Given α(2), an estimate of Σ(2), and c(1) computed at the first analysis, one can again use a root-finding function and multivariate integration to determine c(2). The methodology could be extended to multiple interim analyses; in general, given α(1), …, α(m), c(1), …, c(m−1), and an estimate of the correlation matrix Σ(m), one can compute a threshold c(m) for maxk Zk(Nmk) using integration like that in equation (2) which satisfies

Pr{ maxk Zk(N1k) < c(1), …, maxk Zk(Nm−1,k) < c(m−1), maxk Zk(Nmk) ≥ c(m) | H00 } = α(m) − α(m−1).
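Continuing the sketch above, the second-stage threshold can be found by root-finding over the 2K-dimensional integral in (2) and (3). Here Sigma2 denotes a hypothetical 2K × 2K estimated correlation matrix and c1 the first-stage threshold from the earlier sketch.

```r
## Solve alpha(2) - alpha(1) = P{interim max < c1}
##   - P{interim max < c1, final max < c2} for c2, under H00.
find_c2 <- function(alpha2, alpha1, c1, Sigma2, K) {
  interim_below <- pmvnorm(lower = rep(-Inf, K), upper = rep(c1, K),
                           corr = Sigma2[1:K, 1:K])[1]
  joint_below <- function(c2)
    pmvnorm(lower = rep(-Inf, 2 * K),
            upper = c(rep(c1, K), rep(c2, K)), corr = Sigma2)[1]
  uniroot(function(c2) interim_below - joint_below(c2) - (alpha2 - alpha1),
          interval = c(0, 10))$root
}
```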
In Figure 1, we illustrate the ideas and notation in a setting with 6, 9, and 12 month follow-up periods and a maximum sample size of N = 400. Accrual occurs uniformly and an interim analysis is planned after data from the first 17 months of study time are available. At study month 17, 6 month follow-up data are available for the subjects enrolled during the first 11 months, so N11 = 183 and Z1(N11) = Z1(183). The test statistic based on the 9 month follow-up data is denoted Z2(133) as about 133 individuals are expected to provide 9 month data at the interim analysis. Similarly, the test statistic at the interim analysis for 12 month data is Z3(83). The statistic maxk Zk(N1k) is calculated and, if sufficiently large, one concludes there is significant evidence against the null hypothesis of no positive treatment difference at any follow-up period. In Section 4, we consider which follow-up periods show sufficient evidence of a treatment difference, taking into account multiple comparisons.
Figure 1.
Temporal availability of test statistics.
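The counts in Figure 1 can be reproduced under uniform accrual; the 24 month accrual period below is an assumption chosen to be consistent with the stated values (the text states only that accrual is uniform).

```r
## Expected numbers of k-th follow-up measurements at study month 17,
## assuming N = 400 enrolled uniformly over 24 months (an assumed
## accrual length consistent with the counts quoted in the text).
N <- 400; accrual_months <- 24; look_month <- 17
followup <- c(6, 9, 12)
round(N * pmax(0, pmin(1, (look_month - followup) / accrual_months)))
## [1] 183 133  83
```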
If maxk Zk(N1k) does not exceed the threshold c(1) then accrual continues and the second statistic maxk Zk(N2k) is computed and compared to c(2). Web Appendix B describes a straightforward extension of these ideas for testing two-sided hypotheses, that is, H0 = ⋂k H0k : δk = 0.
2.1. GEE Context
Here, we present the testing procedures within the context of generalized estimating equations (GEE) with one interim and a final analysis, although the results may be generalized to multiple interim analyses. The notation is similar to the development in Liang and Zeger (1986). Let Yik, k = 1, …, K, i = 1, …, N denote the kth follow-up measurement of interest from the ith individual randomized. We assume that marginally Yik has a generalized linear model structure with density/mass function

f(yik; θik, ϕ) = exp[{yik θik − a(θik)}/ϕ + b(yik, ϕ)],
where θik = h(ηik), ηik = xik^T β, and xik^T denotes the transpose of the p × 1 vector xik. Standard assumptions yield E(Yik) = a′(θik) and Var(Yik) = ϕ a″(θik), where ′ and ″ denote first and second derivatives with respect to θik and ϕ is a scale parameter. Here h(·) connects the linear predictor ηik = xik^T β to θik. It is to be understood that the expectations and variances in this GEE context are conditional upon the covariates xik.
The GEE approach is well suited to modeling correlated data from an individual. Let ki ≤ K denote the number of observed follow-up measures for the ith individual when an analysis is conducted. Define Yi = (Yi1, …, Yiki)^T, Xi as the ki × p covariate matrix with rows xik^T, and θi = (θi1, …, θiki)^T. The within person variability is modeled by

Vi = ϕ Ai^{1/2} Ri(γ) Ai^{1/2},
where Ai is a ki × ki diagonal matrix with a″(θik) on the diagonal and Ri(γ) is a modeled correlation matrix parameterized by γ. Vi is the "working" covariance matrix. With a working assumption of independence, Ri is the identity matrix. Alternatively, with a small set of follow-up times, for example K = 3, it is reasonable to use an unstructured correlation matrix R with elements that can be estimated. In this case, γ corresponds to the K(K − 1)/2 off-diagonal correlations ρjk, 1 ≤ j < k ≤ K.
Given working covariance parameters estimated by γ̂ and ϕ̂, the GEE estimates β̂ are defined as solutions to the estimating equations

∑_{i=1}^N Di^T Vi^{−1} {Yi − a′(θi)} = 0,

where Di = ∂a′(θi)/∂β is the ki × p matrix of derivatives of the mean vector with respect to β.
Although different working covariance structures yield different estimates, it is shown in Liang and Zeger (1986) that, under general conditions and assuming the mean structure is correctly specified, the different β̂ are consistent and
N^{1/2}(β̂ − β) ≈ { N^{−1} ∑_{i=1}^N Di^T Vi^{−1} Di }^{−1} { N^{−1/2} ∑_{i=1}^N Di^T Vi^{−1} (Yi − a′(θi)) }.    (4)
The RHS of (4) has an asymptotic multivariate Gaussian distribution with mean vector 0 and variance–covariance matrix
lim_{N→∞} { N^{−1} ∑_{i=1}^N Di^T Vi^{−1} Di }^{−1} { N^{−1} ∑_{i=1}^N Di^T Vi^{−1} Cov(Yi) Vi^{−1} Di } { N^{−1} ∑_{i=1}^N Di^T Vi^{−1} Di }^{−1}.    (5)
This robust sandwich variance–covariance matrix depends on the assumed form of Vi and can be estimated given β̂, γ̂, and ϕ̂. The square root of the diagonal of this matrix (divided by N), using the estimated parameters, yields the estimated robust standard errors in GEE output.
In the context of a two arm randomized clinical trial with an experimental and control arm and data recorded over K follow-up periods, we specify the following model
ηik = ∑_{j=1}^K βCj I{j = k} + Ei ∑_{j=1}^K βDj I{j = k} + β0^T Wi0.    (6)
Here the I{·} are indicator variables designating the follow-up period, Ei = 1 if the ith person randomized receives the experimental treatment as opposed to the control treatment (Ei = 0 otherwise), βCk, k = 1, …, K, denotes the effect for the control arm at the kth measurement, and βDk, k = 1, …, K, denotes the treatment–control difference at the kth follow-up time. Wi0 denotes a vector of baseline covariates measured prior to randomization and β0 is the vector of associated coefficients. Our intersection null hypothesis is H0 = ⋂k H0k : βDk ≤ 0; here βDk corresponds to δk in the previous section. This formulation does assume the follow-up measurements are taken at approximately the same common set of K time points, although no parametric assumptions are made regarding the shape of the response profile over time, for example, linear or quadratic response.
We allow for an interim analysis in which N11 ≥ N12 ≥ ⋯ ≥ N1K individuals have provided data for each of the follow-up periods and they are partitioned over the treatment and control arms so that each βCk and βDk can be estimated.
In Jeffries and Geller (2015) a covariance function was derived to estimate the K × K covariance/correlation matrix associated with the βDk estimates for normally distributed data without covariates. Here, we can use the output from GEE to extract the relevant K × K portion of the estimated covariance matrix and convert that into the required estimated correlation matrix, Σ̂(1). The information time for this interim analysis can be taken as the fraction of the eventual total expected number of responses that are observed at the time of the interim analysis.
We define Zk(N1k) = β̂Dk/se(β̂Dk), where β̂Dk and se(β̂Dk) are the estimated coefficients and estimated standard errors available from the GEE output. To estimate Σ(1), first let V̂(1) denote the p × p estimated variance–covariance matrix for the coefficients. Then the estimated correlations among the Zk(N1k) can be obtained from the K × K submatrix of

Ψ(V̂(1)) V̂(1) Ψ(V̂(1))

corresponding to the β̂Dk terms, where Ψ(V̂(1)) denotes a diagonal matrix with ψ(x) = x^{−1/2} applied to the diagonal elements of V̂(1). Using this Σ̂(1), c(1) is computed from equation (1) and the intersection null hypothesis H0 = ⋂k H0k : βDk ≤ 0 is rejected if maxk Zk(N1k) ≥ c(1).
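The following sketch shows one way these quantities can be extracted from standard software. It assumes the geepack package (where fit$geese$vbeta holds the robust coefficient covariance), a long-format data frame dat, and hypothetical variable names (y, period, trt, y0, id) in the Gaussian setting of Section 3.

```r
library(geepack)

## Fit model (6): period-specific control effects, period-specific
## treatment differences, and the baseline outcome as a covariate.
fit <- geeglm(y ~ 0 + factor(period) + factor(period):trt + y0,
              id = id, waves = period, data = dat,
              family = gaussian, corstr = "unstructured")

Vbeta <- fit$geese$vbeta                  # robust (sandwich) covariance
idx   <- grep(":trt", names(coef(fit)))   # positions of the beta_Dk terms
z     <- coef(fit)[idx] / sqrt(diag(Vbeta)[idx])   # Z_k(N_1k)
Sigma1_hat <- cov2cor(Vbeta[idx, idx])    # K x K estimated correlation
```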
For the case of M = 2, that is, only one interim and a final analysis, if we fail to reject at the interim analysis, then we continue to full accrual of data and conduct a similar analysis at the end of the study using

Zk(N2k) = β̂Dk^{(2)}/se(β̂Dk^{(2)}), k = 1, …, K,
where the (2) superscripts and “2” subscripts in the N2k terms indicate these quantities are based on all the data available at the end of the study. In the absence of missing data N2k = N for all k.
To find an appropriate cutoff at the end of the study, we need the correlation matrix, Σ(2), for all 2K variables {Z1(N11), …, ZK(N1K), Z1(N21), …, ZK(N2K)}. Note that although Σ(1) has dimension K × K, Σ(2) has dimension 2K × 2K. The upper left K × K submatrix of Σ(2) is estimated by Σ̂(1) obtained at the interim analysis. The lower right K × K submatrix of Σ(2) is estimated by GEE output obtained from the final analysis in the same way Σ(1) was estimated, that is, by the relevant K × K submatrix of

Ψ(V̂(2)) V̂(2) Ψ(V̂(2)),
where V̂(2) is the estimated variance–covariance matrix from the second analysis.
Now consider the entries of Σ(2) in rows 1 through K and columns K + 1 through 2K. A more general expression for determining these cross-analysis correlations can be obtained by first writing the approximation in equation (4) for the interim and final analyses:

N(m)^{1/2}(β̂^{(m)} − β) ≈ { N(m)^{−1} ∑_{i=1}^{N(m)} Di^{(m)T} Vi^{(m)−1} Di^{(m)} }^{−1} { N(m)^{−1/2} ∑_{i=1}^{N(m)} Di^{(m)T} Vi^{(m)−1} (Yi^{(m)} − a′(θi^{(m)})) },
where the superscript (m) = (1) or (2) indicates quantities determined at the interim or final analysis, respectively. N(m) corresponds to the maximum number of observations at any of the follow-up periods for the mth analysis; it will typically equal Nm1, the number of observations at the first follow-up period in the mth analysis. The asymptotic methods and assumptions that show (5) is the appropriate variance–covariance matrix for (4) can be used to show that, asymptotically,
Cov{ N(1)^{1/2}(β̂^{(1)} − β), N(2)^{1/2}(β̂^{(2)} − β) } → B(1)^{−1} Λ B(2)^{−1},    (7)

where B(m) = lim N(m)^{−1} ∑_{i=1}^{N(m)} Di^{(m)T} Vi^{(m)−1} Di^{(m)}, Λ = lim (N(1) N(2))^{−1/2} ∑_{i=1}^{N(1)} Di^{(1)T} Vi^{(1)−1} Cov(Yi^{(1)}, Yi^{(2)}) Vi^{(2)−1} Di^{(2)}, and the sum in Λ runs over the N(1) individuals contributing to both analyses,
where we consider N(1)/N(2) a fixed fraction < 1 as both quantities go to ∞. An estimate of this matrix can be computed with a moderate amount of programming and the output from the interim and final analyses. The resulting estimated covariance matrix can be converted into an estimated correlation matrix by appropriate division by the corresponding diagonal elements. Details of these computations are presented in Web Appendix C.
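As one plausible implementation of a plug-in estimate for (7) as reconstructed here (the paper's exact computation is in Web Appendix C), the sketch below assumes per-subject lists of derivative matrices D, working covariances V, and residual vectors S from each analysis, with subjects 1, …, N1 common to both analyses.

```r
## Plug-in estimate of the cross-analysis covariance in (7).
cross_cov <- function(D1, V1, S1, D2, V2, S2) {
  N1 <- length(D1); N2 <- length(D2)
  B1 <- Reduce(`+`, lapply(seq_len(N1), function(i)
          t(D1[[i]]) %*% solve(V1[[i]]) %*% D1[[i]])) / N1
  B2 <- Reduce(`+`, lapply(seq_len(N2), function(i)
          t(D2[[i]]) %*% solve(V2[[i]]) %*% D2[[i]])) / N2
  ## Only the N1 subjects contributing to both analyses enter the middle term
  Lam <- Reduce(`+`, lapply(seq_len(N1), function(i)
          t(D1[[i]]) %*% solve(V1[[i]]) %*% S1[[i]] %*% t(S2[[i]]) %*%
          solve(V2[[i]]) %*% D2[[i]])) / sqrt(N1 * N2)
  solve(B1) %*% Lam %*% solve(B2)
}
```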
Hence, one can construct Σ̂(2), a 2K × 2K estimated correlation matrix for Z1(N11), …, ZK(N1K), Z1(N21), …, ZK(N2K). Using the appropriate values, c(2) can be calculated using (2) and the intersection null hypothesis is rejected at the final analysis if maxk Zk(N2k) ≥ c(2). Although the development in this section was written with M = 2 analyses, the generalization to more than one interim analysis is straightforward.
As noted in Liang and Zeger (1986), this procedure works when data are missing completely at random as is the case for an interim analysis where missing data arise solely because not enough follow-up time has elapsed for some individuals. These authors also note other instances in which a weaker missing at random assumption may be sufficient, for example, if the assumed form of the working correlation matrix R is correct (as would be the case with an unstructured correlation matrix) with Gaussian or binary outcomes.
3. Simulations
Simulations for type I error and power are based on the following data generation model:
Yik = βAge Agei + βCk + βDk Ei + ϵik, k = 1, 2, 3, with Yi0 = βAge Agei + ϵi0,    (8)
where Agei and Yi0 correspond to baseline age and the baseline value of Y for the ith person. Age is transformed to have a standard normal distribution. Within person correlation was driven by correlation in the ϵik terms, which are Gaussian with mean 0 and pairwise covariance Cov(ϵij, ϵik) = exp(−|Tj − Tk|/15), where T0 = 0, T1 = 6, T2 = 9, and T3 = 12 months. Accrual occurred randomly in both arms at a uniform rate over an 18 month period. The outcome measure was assessed at baseline and at 6 months, 9 months, and 1 year after baseline. In all simulations, βCk = 0 while the βDk varied according to the simulation scenario. One interim analysis was conducted when 50% of the randomized individuals provided a measurement for the 3rd follow-up period, that is, N1K = 0.5N. A final analysis was also conducted if results for the interim analysis were not significant.
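A minimal sketch of one simulated trial under model (8) as reconstructed above is given below; the simple Bernoulli 1:1 treatment assignment and the helper name simulate_trial are illustrative choices, and mvtnorm supplies the correlated errors.

```r
library(mvtnorm)

simulate_trial <- function(N, betaAge, betaD = c(0, 0, 0.3)) {
  Tk  <- c(0, 6, 9, 12)                         # months, T0 = baseline
  Sig <- exp(-abs(outer(Tk, Tk, "-")) / 15)     # Cov(eps_ij, eps_ik)
  eps <- rmvnorm(N, sigma = Sig)
  age <- rnorm(N)                               # standardized age
  trt <- rbinom(N, 1, 0.5)                      # illustrative 1:1 assignment
  Y   <- betaAge * age + eps +
         outer(trt, c(0, betaD))                # beta_Dk * E_i, beta_Ck = 0
  data.frame(id = seq_len(N), entry = runif(N, 0, 18), age = age, trt = trt,
             y0 = Y[, 1], y6 = Y[, 2], y9 = Y[, 3], y12 = Y[, 4])
}
```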
A number of testing procedures were evaluated, each designed for a 5% error rate:
(a) A Bonferroni adjustment approach in which the change from baseline to the kth follow-up period is the outcome measure and tested via a two group t-test. The p-value threshold for each of the 3 tests at the interim analysis is s(0.62)/3, where s(t) = 0.05t^3 is the spending function and approximately 62% of the expected responses are observed given the stated accrual patterns and follow-up times. This spending function was chosen for simplicity but is otherwise arbitrary. The p-value threshold at the final analysis is given by {0.05 − s(0.62)}/3. A short sketch of these threshold computations follows this list.
(b) A max T test approach as described in Jeffries and Geller (2015). This approach uses the change from baseline to the kth follow-up period as the outcome measure and employs a two group t-test. The thresholds for significance for an interim and final analysis are based on the same ideas used to determine c(1) and c(2) here, but a different correlation matrix is required. No use is made of covariate information.
(c) A GEE approach based on model (6) with Yi0 as the only covariate.
(d) A GEE approach based on model (6) with baseline age and Yi0 as covariates.
(e) A GEE approach based on the model in (d) in which the individual statistics Zk(Nmk) are sequentially monitored, each at a Bonferroni-corrected alpha level 0.05/K. This approach includes baseline age and Yi0 as covariates but does not use the distribution of maxk Zk(Nmk).
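The Bonferroni thresholds in approach (a) follow directly from the spending function; a short sketch using the values stated above:

```r
## Per-test p-value thresholds for the Bonferroni approach (a), using
## s(t) = 0.05 t^3, interim information time t1 = 0.62, and K = 3 tests.
s <- function(t) 0.05 * t^3
K <- 3
p_interim <- s(0.62) / K                # interim, per test
p_final   <- (0.05 - s(0.62)) / K       # final, per test
c(z_interim = qnorm(1 - p_interim),     # one-sided z-score boundaries
  z_final   = qnorm(1 - p_final))
```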
Table 1 shows there is some Type I error inflation for smaller sample sizes in the non-Bonferroni approaches. The inflation can be reduced, in some cases substantially, by using a t distribution instead of a multivariate normal distribution when finding thresholds as in equations (1) and (2). As the sample size increases the error inflation dissipates. Otherwise, each approach shows appropriate Type I error control across the range of scenarios although the Bonferroni based methods are conservative, as expected. The strong agreement in numerical values across rows reflects that the same random number seeds were used to generate the data although some slight residual variation across rows still arises from the use of Monte Carlo simulation in the multivariate integration process (Genz et al., 2012) and occasional convergence difficulties. (Convergence problems occur for less than 0.01% of the simulations; in these tables a failed convergence for one method suppresses that simulation’s results for all 5 methods.) Aside from variation from convergence problems, results for the t-test methods (approaches (a) and (b)) should only vary by sample size and whether a normal or t distribution was used. Methods using the GEE will vary in addition by the working correlation structure assumption and whether an Age coefficient is included or not.
Table 1.
Type I error simulations
Max enroll per arm | Working corr. | βAge | (a) Bonferroni | (b) t-test | (c) GEE w/o age covariate | (d) GEE w/age covariate | (e) GEE w/age and Bonferroni correction |
---|---|---|---|---|---|---|---|
100 | Indep | 0 | 0.03223 | 0.05264 | 0.05385 | 0.05438 | 0.04277 |
100(t dist) | Indep | 0 | 0.02920 | 0.04916 | 0.05041 | 0.05052 | 0.04086 |
100 | Unstruc | 0 | 0.03222 | 0.05263 | 0.05370 | 0.05396 | 0.04145 |
100(t dist) | Unstruc | 0 | 0.02919 | 0.04915 | 0.05026 | 0.05047 | 0.03963 |
100 | Indep | 0.30 | 0.03223 | 0.05265 | 0.05380 | 0.05437 | 0.04277 |
100(t dist) | Indep | 0.30 | 0.02920 | 0.04916 | 0.05041 | 0.05052 | 0.04086 |
100 | Unstruc | 0.30 | 0.03222 | 0.05263 | 0.05370 | 0.05396 | 0.04145 |
100(t dist) | Unstruc | 0.30 | 0.02919 | 0.04915 | 0.05026 | 0.05047 | 0.03963 |
200 | Indep | 0 | 0.03122 | 0.05088 | 0.05211 | 0.05219 | 0.04045 |
200(t dist) | Indep | 0 | 0.02989 | 0.04896 | 0.05029 | 0.05031 | 0.03944 |
200 | Unstruc | 0 | 0.03122 | 0.05088 | 0.05216 | 0.05190 | 0.03915 |
200(t dist) | Unstruc | 0 | 0.02989 | 0.04896 | 0.04988 | 0.05015 | 0.03817 |
200 | Indep | 0.30 | 0.03122 | 0.05088 | 0.05271 | 0.05218 | 0.04045 |
200(t dist) | Indep | 0.30 | 0.02989 | 0.04896 | 0.05029 | 0.05031 | 0.03944 |
200 | Unstruc | 0.30 | 0.03122 | 0.05088 | 0.05216 | 0.05190 | 0.03915 |
200(t dist) | Unstruc | 0.30 | 0.02989 | 0.04896 | 0.04988 | 0.05015 | 0.03817 |
400 | Indep | 0 | 0.03131 | 0.05120 | 0.05221 | 0.05243 | 0.04013 |
400(t dist) | Indep | 0 | 0.03057 | 0.05025 | 0.05116 | 0.05132 | 0.03964 |
400 | Unstruc | 0 | 0.03131 | 0.05120 | 0.05226 | 0.05255 | 0.03915 |
400(t dist) | Unstruc | 0 | 0.03057 | 0.05025 | 0.05137 | 0.05159 | 0.03862 |
400 | Indep | 0.30 | 0.03131 | 0.05120 | 0.05198 | 0.05241 | 0.04013 |
400(t dist) | Indep | 0.30 | 0.03057 | 0.05030 | 0.05087 | 0.05132 | 0.03964 |
400 | Unstruc | 0.30 | 0.03131 | 0.05120 | 0.05195 | 0.05252 | 0.03915 |
400(t dist) | Unstruc | 0.30 | 0.03057 | 0.05033 | 0.05114 | 0.05160 | 0.03862 |
Note: Each scenario/row was based on 100,000 simulations. The standard error for estimated Type I error is approximately 0.0007. Scenarios differ by the number enrolled, whether a multivariate normal or t distribution was used to determine the thresholds, whether the working correlation used independence assumption or an unstructured framework, and the value of the βAge coefficient. βDk = 0 for k = 1, 2, 3. When a multivariate t distribution was used, the common degrees of freedom was based on the number of observations available for the interim analysis.
Table 2 shows power for different scenarios, and we see important differences here. The approaches based on t-tests rather than GEE models suffer a loss of power. The t-tests are based on differences of follow-up values from baseline, that is, Yi0 is subtracted from follow-up values, whereas the modeling approaches use Yi0 as a RHS covariate. These power differences are what would be expected when comparing change score approaches to analysis of covariance. Predictably, the t-test approaches (a) and (b) show no changes in power across the βAge values. The GEE approach without Age in the model shows deteriorating performance as the magnitude of the Age effect increases, reflecting increasing misspecification. When the Age effect is 0, models (c) and (d) are essentially the same. The power of the GEE models with Age as a covariate (methods (d) and (e)) does not change as the effect of Age increases; examination of the data generation model shows that increasing βAge will not change the estimates or estimated standard errors of the βDk values. All approaches show increasing power with increasing values of βD3. Method (d) generally shows superior power, demonstrating the benefit of including relevant covariate information.
Table 2.
Power for 5 approaches, 10,000 simulations
Working corr. structure | βAge | βD3 | (a) Bonferroni | (b) t-test | (c) GEE w/o age covariate | (d) GEE w/age covariate | (e) GEE w/age and Bonferroni correction |
---|---|---|---|---|---|---|---|
Unstr | 0 | 0.20 | 0.3794 | 0.4702 | 0.5863 | 0.5867 | 0.5326 |
Unstr | 0.50 | 0.20 | 0.3794 | 0.4702 | 0.5690 | 0.5870 | 0.5326 |
Unstr | 1.00 | 0.20 | 0.3794 | 0.4702 | 0.5230 | 0.5868 | 0.5326 |
Unstr | 1.50 | 0.20 | 0.3794 | 0.4702 | 0.5031 | 0.5868 | 0.5327 |
Unstr | 0 | 0.25 | 0.5628 | 0.6547 | 0.7773 | 0.7769 | 0.7364 |
Unstr | 0.50 | 0.25 | 0.5628 | 0.6547 | 0.7518 | 0.7769 | 0.7364 |
Unstr | 1.00 | 0.25 | 0.5628 | 0.6547 | 0.7125 | 0.7769 | 0.7364 |
Unstr | 1.50 | 0.25 | 0.5629 | 0.6547 | 0.6922 | 0.7768 | 0.7365 |
Unstr | 0 | 0.30 | 0.7344 | 0.8047 | 0.9061 | 0.9062 | 0.8806 |
Unstr | 0.50 | 0.30 | 0.7345 | 0.8048 | 0.8868 | 0.9062 | 0.8806 |
Unstr | 1.00 | 0.30 | 0.7344 | 0.8048 | 0.8571 | 0.9062 | 0.8806 |
Unstr | 1.50 | 0.30 | 0.7344 | 0.8048 | 0.8406 | 0.9063 | 0.8806 |
Indep | 0 | 0.20 | 0.3794 | 0.4702 | 0.5846 | 0.5848 | 0.5358 |
Indep | 0.50 | 0.20 | 0.3794 | 0.4702 | 0.5549 | 0.5847 | 0.5358 |
Indep | 1.00 | 0.20 | 0.3794 | 0.4702 | 0.5203 | 0.5847 | 0.5358 |
Indep | 1.50 | 0.20 | 0.3794 | 0.4702 | 0.5012 | 0.5847 | 0.5358 |
Indep | 0 | 0.25 | 0.5628 | 0.6547 | 0.7746 | 0.7740 | 0.7355 |
Indep | 0.50 | 0.25 | 0.5628 | 0.6547 | 0.7489 | 0.7743 | 0.7355 |
Indep | 1.00 | 0.25 | 0.5628 | 0.6547 | 0.7081 | 0.7743 | 0.7355 |
Indep | 1.50 | 0.25 | 0.5628 | 0.6547 | 0.6871 | 0.7742 | 0.7355 |
Indep | 0 | 0.30 | 0.7344 | 0.8047 | 0.9040 | 0.9040 | 0.8808 |
Indep | 0.50 | 0.30 | 0.7344 | 0.8048 | 0.8853 | 0.9040 | 0.8808 |
Indep | 1.00 | 0.30 | 0.7344 | 0.8048 | 0.8539 | 0.9040 | 0.8808 |
Indep | 1.50 | 0.30 | 0.7344 | 0.8048 | 0.8363 | 0.9040 | 0.8808 |
Note: Each scenario/row was based on 10,000 simulations. The standard error for estimated power is bounded by 0.005. A normal distribution (rather than a t distribution) was used to calculate threshold values.
The differences arising from an independence versus an unstructured working correlation are minor, except at the interim analysis. As mentioned in Liang and Zeger (1986), differences arising from working correlation assumptions may be smaller for balanced data; unbalanced data will lead to larger differences. Unbalanced data arise at the interim analysis because individuals have 1, 2, or 3 follow-up observations. When data are balanced (e.g., the final analysis when everyone has 3 follow-up observations) there is no appreciable difference between the independence and unstructured approaches. See Web Appendix A for results showing these interim analysis results and further simulations with smaller sample sizes, M = 3 analyses, K = 5 follow-up periods, and compound symmetry dictating the true correlation between ϵij and ϵik, j, k ≥ 0 in equation (8). The results suggest some care must be taken to evaluate the robustness of estimates to differences in working correlation assumptions and that the use of the t distribution may be overly conservative when sample sizes are small at the interim analysis.
The results show that utilization of a model with baseline outcomes can substantially increase power over an unmodeled approach and suggest the benefit is likely greater still if other important covariates are included.
4. Determining which Follow-Up Periods Show Differences
Thus far the methodology has focused on determining if any follow-up period shows a difference. However, there may be interest in determining which set of follow-up periods show a difference and doing so in a way that accommodates multiplicity concerns—concerns related to the number of follow-up periods as well as the number of interim and final analyses.
In the one-sided testing context of K follow-up times and M sequential analyses with cumulative alpha thresholds α(m), m = 1, …, M, we have boundary thresholds c(1), …, c(M) satisfying

Pr{ maxk Zk(N1k) < c(1), …, maxk Zk(Nm−1,k) < c(m−1), maxk Zk(Nmk) ≥ c(m) | H00 } = α(m) − α(m−1), m = 1, …, M,

with α(0) = 0.
For one-sided testing, consider Test Procedure A defined as follows. Let m′ denote the first analysis at which maxk Zk(Nm′k) ≥ c(m′). Reject H0k : δk ≤ 0 for all k with Zk(Nm′k) ≥ c(m′), that is, reject H0k for all corresponding test statistics exceeding the first crossed boundary.
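A compact sketch of the procedure: Z is a hypothetical M × K matrix of observed statistics (rows indexed by analysis) and cthr the vector of boundary thresholds c(1), …, c(M).

```r
## Test Procedure A: stop at the first boundary crossing and reject every
## H_0k whose statistic at that analysis meets or exceeds the boundary.
procedure_A <- function(Z, cthr) {
  crossed <- which(apply(Z, 1, max) >= cthr)
  if (length(crossed) == 0) return(integer(0))  # nothing rejected
  m_prime <- min(crossed)                       # first crossed boundary
  which(Z[m_prime, ] >= cthr[m_prime])          # rejected follow-up times k
}
```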
Lemma 1. Testing Procedure A controls familywise error in the strong sense.
Proof of this lemma is shown in Appendix A. A similar lemma can be constructed for two-sided testing; it is stated and proved as Lemma 2 in Web Appendix B.
5. Application: SOLVD Trial
The Studies of Left Ventricular Dysfunction (SOLVD) treatment trial was a double-blind, randomized, placebo-controlled trial to assess the effect of enalapril (an ACE inhibitor vasodilator) on mortality in a heart failure population (The SOLVD Investigators, 1991). A quality of life survey was administered at baseline, 6 weeks, 1 year, and 2 years after baseline (Rogers et al., 1994). The survey assessed each participant's overall general health self-perception and was recorded on a 5 point scale (recoded so that higher scores indicate better quality). Table 3 shows the mean and standard deviation of the self-assessment score at the various time points. Attention is restricted to those who completed the survey at all three follow-up time points. The New York Heart Association (NYHA) heart failure score is a four point measure of the degree of heart failure and was obtained as a baseline measurement in the study. Here, we use it as a baseline covariate in modeling the general health score, denoted Yik:

Yik = βCk + βDk Ei + β0 NYHAi + ϵik, k = 1, 2, 3,
where the ϵik follow a Gaussian distribution.
Table 3.
Summary statistics for General Health Self-Assessment in SOLVD. Higher scores indicate better self-reported general health. Only participants with complete data over three follow-up periods are counted here.
Time point | Placebo mean (NPlac = 514) | Placebo std dev. | Enalapril mean (NEnal = 537) | Enalapril std dev. |
---|---|---|---|---|
Baseline | 2.53 | 0.91 | 2.54 | 0.96 |
6 Weeks | 2.63 | 0.94 | 2.73 | 0.94 |
1 Year | 2.75 | 0.92 | 2.76 | 0.96 |
2 Year | 2.70 | 0.94 | 2.76 | 0.94 |
Enrollment in SOLVD occurred over a 34 month period. Here, we present an interim analysis occurring when approximately 25% of the 2 year outcome data are available. This corresponds to about 60% of 1 year, and 91% of 6 week data. This analysis is illustrative, that is, it was not performed as part of the SOLVD study, and the availability of the data at the interim analysis follows from assuming uniform accrual and entry in the order of the study ID number in the publicly available data (see Web Appendix D for the SOLVD data source). The t-statistics for the interim analysis were (t6wk, t1yr, t2yr) = (1.59, 0.80, 1.70). For the Bonferroni approach (a) the corresponding threshold was 2.74 and the threshold for the approach based on the maximum of the t-statistic was 2.73. Consequently neither the (a) nor (b) approach based on t-statistics reached significance (at a 5% level). GEE models were computed with and without the NYHA covariate. The corresponding z–statistics for the three follow-up periods without the NYHA information were (z6wk, z1yr, z2yr) = (1.91, 0.97, 1.04) while those with the covariates were (2.05, 1.00, 1.02). For both models, the threshold for the maximum of the z–statistics was 2.73. The test statistics for method (e) are those used in method (d) and the threshold value is 2.74. Consequently, none of the three GEE methods reached their thresholds for significance at the interim analysis.
A second analysis used all data available at the end of the study. The t-statistics were (t6wk, t1yr, t2yr) = (1.74, −0.07, 0.66). The Bonferroni and max t-statistic thresholds were 2.21 and 2.09, respectively, so neither approach reached significance. The z–statistics for the GEE model without NYHA were (2.06, 0.06, 0.89) and those for the model with the NYHA covariate were (2.15, 0.11, 0.90). The threshold for models (c) and (d) was 2.11 and the threshold for method (e) was 2.16. Consequently, only approach (d), the GEE approach using the NYHA information and incorporating correlation between follow-up time periods, leads to rejection of the intersection null hypothesis at the 5% level of significance and the conclusion that a significant improvement exists for the six week measure of self-perceived general health.
6. Discussion
We presented a flexible approach for analyzing longitudinal data in a group sequential setting that is especially suited for non-monotone treatment differences over time. The use of indicator variables in equation (6) allows the model to capture patterns of treatment differences that are not easily expressed by simple parametric functions or summary measures like AUC.
The approach allows for covariates, which increases power relative to change score models and approaches that do not employ covariates. The method uses existing software and is therefore relatively simple to implement. The approach should be generalizable to other settings with covariates, such as mixed-effects models. In addition, we have shown that a procedure that rejects the null hypothesis of no treatment difference for all follow-up periods with test statistics exceeding the boundary threshold will maintain familywise error in the strong sense.
It is noteworthy that the test statistics available at an interim analysis are based on different amounts of data, and earlier follow-up periods will typically have more observations. If the same magnitude of positive treatment difference exists for each of the K follow-up periods, that is, δk = δ for all k = 1, …, K, the larger observed sample size for the earlier follow-up periods will tend to produce larger test statistics. Consequently, interim analyses under this approach may tend to indicate differences at earlier follow-up periods than would be observed in an analysis at the planned study conclusion with all follow-up data available. This may be undesirable in some instances, for example, if there is interest in knowing how long a treatment difference lasts. This effect could be removed by basing interim test statistics only on a common set of individuals (e.g., those who have been in the study long enough to reach the Kth follow-up period), but this has the disadvantage of not using all available data.
Although the method was presented as if the interim analyses require sufficient accrual so that there are some data for all K follow-up periods, that does not need to be the case. For example, the first interim analysis does not require that some individuals reach the last follow-up period. In this case not all the analyses will involve K test statistics, however the notation and computations are easily altered for this situation.
Among the limitations of the approach is the assumption that the timing of planned follow-up measurements is the same for each individual; such similarity makes it easy to model nonparametric patterns with indicator functions. Also, it may be possible to sharpen the boundaries for the procedure that controls familywise error in the strong sense using ideas from Marcus, Peritz, and Gabriel (1976); further work will explore this possibility. However, this closed testing approach will entail nontrivial computational burdens if K is not small. As is often the case, designs that focus attention on the most extreme test statistics may produce parameter estimates that are subject to selection bias. Future work will explore how to address these restrictions and concerns.
7. Supplementary Materials
Web Appendix A (referenced in Section 3), Web Appendix B (referenced in Sections 2 and 4), Web Appendix C (referenced in Section 2), Web Appendix D (referenced in Section 5), and R code for conducting simulations in Section 3 are available with this article at the Biometrics website on Wiley Online Library.
Acknowledgements
This work utilized the resources of the NIH HPC Biowulf cluster. The authors are employees of the National Heart, Lung, and Blood Institute. The views expressed in this article are the authors’ and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; National Institutes of Health; or the United States Department of Health and Human Services.
Appendix A
Lemma 1. Test Procedure A controls FWE strongly with level ≤ α(M).
Proof. Let τ = {i1, …, iv} denote the subset of {1, …, K} for which the one-sided null hypotheses are true, that is, δiu ≤ 0 for u = 1, …, v. Let Hτ denote the corresponding intersection null hypothesis ⋂u=1,…,v H0iu. Let δ = (δ1, …, δK). Under Hτ the v elements of δ corresponding to τ are non-positive and the remaining elements of δ are positive. Define δ* to have kth component min(δk, 0) so that δ* has only non-positive elements. The only components that differ between δ and δ* correspond to the K − v follow-up times not represented in Hτ.
Recall from Section 2 that H00 corresponds to δk = 0 for all k = 1, …, K. We denote by μk(Nmk) the mean of Zk(Nmk) for k = 1, …, K. μk(Nmk) is a function of δk, Nmk, and possible nuisance parameters, such that the sign of μk(Nmk) matches the sign of δk (and if one is zero, then the other is zero). Further define Zk*(Nmk) = Zk(Nmk) − μk(Nmk), so that the collection {Zk*(Nmk)} is multivariate normal of dimension M × K with zero means and the same correlation structure as the corresponding Zk(Nmk) values.
By definition, Test Procedure A rejects a hypothesis k′ if

Zk′(Nm′k′) ≥ c(m′), where m′ = min{m : maxk Zk(Nmk) ≥ c(m)}.
For each m = 1, …, M define Wm = maxk∈τ Zk(Nmk), that is, the maximal z–statistic among the corresponding true hypotheses at analysis stage m. Then a Type I error occurs if and only if Wm′ ≥ c(m′), where m′ denotes the first analysis in which maxk Zk(Nm′k) ≥ c(m′). We want to show Pr{Wm′ ≥ c(m′)} ≤ α(M). Because no boundary is crossed before m′, and because Zk(Nmk) ≤ Zk*(Nmk) for k ∈ τ (the means μk(Nmk) are non-positive there),

Pr{Type I error} ≤ Pr{Wm ≥ c(m) for some m}
≤ Pr{maxk Zk*(Nmk) ≥ c(m) for some m}
= ∑m=1,…,M Pr{maxk Zk*(N1k) < c(1), …, maxk Zk*(Nm−1,k) < c(m−1), maxk Zk*(Nmk) ≥ c(m)}
= ∑m=1,…,M {α(m) − α(m−1)} = α(M),

where the last two equalities follow because the Zk*(Nmk) have the H00 distribution used to define the boundaries. This demonstrates the lemma. □
References
- Armitage P, Stratton IM, and Worthington HV (1985). Repeated significance tests for clinical trials with a fixed number of patients and variable follow-up. Biometrics 41, 353–.
- Gange SJ and DeMets DL (1996). Sequential monitoring of clinical trials with correlated responses. Biometrika 83, 157–.
- Geary DN (1988). Sequential testing in clinical trials with repeated measurements. Biometrika 75, 311–.
- Genz A, Bretz F, Miwa T, Mi X, Leisch F, Scheipl F, et al. (2012). mvtnorm: Multivariate Normal and t Distributions. R package version 0.9-9992. URL http://CRAN.R-project.org/package=mvtnorm.
- Jeffries N and Geller NL (2015). Longitudinal clinical trials with adaptive choice of follow-up time. Biometrics 71, 469–.
- Jennison C and Turnbull BW (1997). Group-sequential analysis incorporating covariate information. Journal of the American Statistical Association 92, 1330–.
- Kittelson JM, Sharples K, and Emerson SS (2005). Group sequential clinical trials for longitudinal data with analyses using summary statistics. Statistics in Medicine 24, 2457–.
- Lee JW and DeMets DL (1991). Sequential comparison of changes with repeated measures data. Journal of the American Statistical Association 86, 757–.
- Liang KY and Zeger SL (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22.
- Marcus R, Peritz E, and Gabriel KR (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655–660.
- Rogers WJ, Johnstone DE, Yusuf S, Weiner DH, Gallagher P, Bittner VA, et al. (1994). Quality of life among 5025 patients with left ventricular dysfunction randomized between placebo and enalapril: The Studies of Left Ventricular Dysfunction. Journal of the American College of Cardiology 23, 393–.
- The SOLVD Investigators (1991). Effect of enalapril on survival in patients with reduced left ventricular ejection fractions and congestive heart failure. The New England Journal of Medicine 325, 293–302.
- Wu MC and Lan KKG (1992). Sequential monitoring for comparison of changes in a response variable in clinical studies. Biometrics 48, 765–.