Abstract
Principal surrogate (PS) endpoints are relatively inexpensive and easy to measure study outcomes that can be used to reliably predict treatment effects on clinical endpoints of interest. Few statistical methods for assessing the validity of potential PSs utilize time-to-event clinical endpoint information and to our knowledge none allow for the characterization of time-varying treatment effects. We introduce the time-dependent and surrogate-dependent treatment efficacy curve, , and a new augmented trial design for assessing the quality of a biomarker as a PS. We propose a novel Weibull model and an estimated maximum likelihood method for estimation of the curve. We describe the operating characteristics of our methods via simulations. We analyze data from the Diabetes Control and Complications Trial, in which we find evidence of a biomarker with value as a PS.
Keywords: Case–control study, Causal inference, Clinical trials, Principal stratification, Survival analysis, Treatment efficacy curve, Weibull model
1. Introduction
A valid principal surrogate (PS) endpoint can be used as a primary outcome for evaluating and comparing treatments in phase I–II trials, and for predicting phase III treatment effects without requiring large efficacy trials to directly assess clinical treatment effects. Frangakis and Rubin (2002) introduced the principal stratification framework and a definition of a PS. Since then, alternative definitions of and criteria for assessing a PS have been suggested and several methods for evaluation have been developed (e.g., Taylor and others, 2005; Follmann, 2006; Gilbert and Hudgens, 2008; Li and others, 2010; Wolfson and Gilbert, 2010; Huang and Gilbert, 2011; Zigler and Belin, 2012; Huang and others, 2013).
There are two distinct strategies for assessing the value of a biomarker as a PS. Both strategies quantify the prediction of the treatment effect on the clinical outcome; however, one bases the prediction on the treatment effect on the candidate PS and the other on the candidate PS under active treatment. The strength of either predictive association can be displayed via the causal effect predictiveness (CEP) function, which is a surface if treatment effect on the surrogate is considered or a marginal curve if the surrogate under active treatment is considered. An example of a marginal CEP curve is the surrogate-dependent treatment efficacy curve, TE(s) (Gilbert and Hudgens, 2008). We will focus on the active treatment PS strategy for our methods, but we will discuss both strategies in detail outlining when they are equivalent.
To our knowledge, only one method of PS evaluation under any strategy allows for time-to-event clinical endpoints subject to right censoring (Qin and others, 2008); however, time-constancy of treatment effects was assumed via a proportional hazards model. Time-varying treatment efficacy occurs in many trials, for example, Duerr and others (2012). We propose to accommodate both the use of time-to-event data and the potential for time-varying treatment efficacy by introducing the time-dependent and surrogate-dependent marginal treatment efficacy curve, . We propose an estimated maximum likelihood (EML) method for estimating the curve via a novel parameterization of the conditional Weibull distribution.
In Section 2, we outline our notation, introduce the time-dependent risk estimands, and give some assumptions that are helpful for identifying these estimands. In Section 3, we outline the strategies for classifying the value of a biomarker as a PS and discuss when they are equivalent. In Section 4, we outline previously suggested augmented trial designs and introduce a new augmentation to aid in evaluation. In Section 5, we introduce a Weibull model for the risk estimands and outline our suggested procedure for its use in evaluating time-varying treatment efficacy. In Section 6, we outline the results of a simulation study of the methods; in Section 7, we give the results of an analysis of the Diabetes Control and Complications Trial (DCCT) using our proposed methods. In Section 8, we discuss some potential limitations of our methods and make suggestions for future research. A table of acronyms can be found in Appendix B of the supplementary material available at Biostatistics online.
2. Notation and time-dependent risk estimands
2.1. Notation
Let Z be the treatment indicator, 0 for control/placebo and 1 for treatment. Let W be a baseline measurement taken prior to randomization. In the principal stratification framework of Frangakis and Rubin (2002), we use potential outcomes, where all post-randomization measures are considered under either treatment arm for each individual. Let be the potential time from randomization to clinical event for individual i had s/he received treatment . Let be the indicator of , where is the potential censoring time. Let . Let be the candidate surrogate under treatment arm .
The candidate surrogate S is measured at a fixed time point after randomization. If is less than or equal to , is undefined. Subjects with are excluded from the analysis cohort. Let R be the indicator that is observed. We assume that the observed and potential outcomes ¡inlinegif¿Ti(1),Ci(0),Ci(1),Δi(0),Δi(1)}¡/inlinegif¿, are independently and identically distributed. Let and be the joint cumulative distribution function (CDF) of and W and the conditional cdf of given W, respectively. Let denote the observed variables for subject i.
2.2. Time-dependent risk estimands
In the time-to-event setting there are many ways to define risk. One could define the marginal potential time-dependent and surrogate-dependent risks using the conditional cdf for T, , by
(2.1) |
where the subscript of a function indicates the level of for the potential outcomes. Contrasts in these conditional risks measure a causal effect of treatment assignment on failure time in subgroups defined by . We define one such contrast . This form of directly extends the surrogate-dependent treatment efficacy curve of Gilbert and Hudgens (2008), , to the time-dependent setting. One could also define the risks based on the hazard function, , where is the conditional probability density function (pdf) of T, assuming it exists.
Comparisons of and are non-causal, as outlined in Hernán (2010), because the hazard-based risk estimand in the treatment arm is conditional on a different potential set, , than is the hazard-based risk estimand in the control arm, . Comparisons based on the hazards are still of interest in this setting as they are comparisons over subgroups defined by levels of (Qin and others, 2008). There are also some settings, such as rare events, when and do not differ greatly, making this much less of a concern. We use both the hazard-based TE curve, , and the CDF-based TE curve, , to illustrate our purposed methods. The flexible parameterization of the Weibull model, given in Assumption A8, allows for the characterization of a variety of time-varying TE when risk is hazard-based; the CDF-based TE curve is always time-dependent and does not illustrate the nuances of time-dependent risk as well as the hazard-based TE. The CDF-based TE curve is useful for definitive evaluation of a biomarker as a PS because it is a causal contrast of risks over the treatment arms. Figure 1 illustrates the difference between the and curves.
Following Qin and others (2008), Assumptions A1–A4 reduce the number of missing potential outcomes and help identify the risk estimands from the observed data.
A1: Stable Unit Treatment Value Assumption (SUTVA) and Consistency.
A2: Ignorable Treatment Assignment.
A3: Equal individual clinical risk up to time : if and only if .
A4: Random censoring: for .
Assumptions A1–A3 have been used and discussed previously in the literature (Gilbert and Hudgens, 2008; Gilbert and others, 2008; Qin and others, 2008). A relaxed version of Assumption A4, independence of and conditional on , is sufficient for identification of the risk estimands. However, Assumption A4 as stated is needed to account for censoring in the manner we have prescribed in our model, given as Assumption A8. Assumption A3 is not fully testable and will be violated in some trials. We continue to make A3 here because it is plausible for our motivating example, in which there are no failure events prior to . Assumptions A1–A4 imply that the conditional distribution of , given , equals that for T given for .
3. Definition of a PS
A PS was first defined in Frangakis and Rubin (2002) as a biomarker S such that causal treatment effects on the clinical outcome only exist when causal treatment effects exist for S. Building on Frangakis and Rubin (2002), Joffe and Greene (2009) characterized a PS as an intermediate endpoint such that treatment effect on the surrogate can be used to reliably predict treatment effect on the clinical endpoint. Joffe and Greene (2009) and Frangakis and Rubin (2002) stated criteria that can be used to assess biomarkers as PS using the joint risk estimands, (Gilbert and Hudgens, 2008), where is the clinical event indicator. Frangakis and Rubin (2002) give the criterion that a biomarker is a PS if for all , which Gilbert and Hudgens (2008) called average causal necessity (ACN). Subsequent work added a second criterion that a good PS should have a risk contrast over the arms of the trial that varies widely in . Together these criteria satisfy the Joffe and Greene (2009) definition of a PS.
Gilbert and Hudgens (2008) and Wolfson and Gilbert (2010) note that for a biomarker to be useful as a surrogate it need only help group subjects by TE levels. This led to a utility-driven alternative strategy for assessing biomarkers as PS, such that a biomarker has value as a PS if the potential treatment effect on the biomarker under active treatment assignment can be used to reliably predict the treatment effect on the clinical endpoint. A biomarker can be assessed under active treatment based on the marginal risks defined given in Equation (2.1).
To assess a biomarker under the active treatment, only a marginal version of the second criterion is needed to establish that a biomarker has value as a PS. The marginal risk criterion is that varies widely in . Biomarkers satisfying this criterion are useful for evaluating future treatments, with the objective to move more treatment recipients to the range where treatment is highly effective. As noted in Gilbert and Hudgens (2008), in the special case where all of the equal some constant c, termed constant biomarker (CB), the joint second criterion and the marginal second criterion are equivalent and ACN can be evaluated based on . Our methods are based on the marginal risk estimands, and thus can be used to assess biomarkers under active treatment in all cases and under both arms when CB holds. For either evaluation strategy greater variation in the TE function over the range of or , marginal or joint, suggests increasing value as a PS.
4. Augmented trial design
The marginal risk estimands condition on , which is missing for all placebo recipients in a standard clinical trial. Follmann (2006) outlined two vaccine trial design augmentations for inferring for placebo recipients, baseline immunogenicity predictor (BIP), as coined in Gilbert and Hudgens (2008), and closeout placebo vaccination (CPV). A useful BIP, which we will call W, is highly correlated with and is easily measurable at baseline. Under CPV, placebo recipients uninfected and uncensored at the end of the follow-up period for infection, , are vaccinated, and their immune response biomarker, , is measured at time after vaccination. BIP and CPV trial augmentation can be used in combination or separately.
Although the BIP and CPV trial design augmentations were originally proposed in terms of vaccine trials, they can easily be generalized to the clinical trial setting. The concept of a BIP is easily extended to be any baseline measurement(s) that are predictive of the candidate PS. A BIP under this definition is not a priori considered irrelevant to the clinical outcome and therefore should be considered for inclusion in the model for outcome on a case by case basis. Similarly, the concept of CPV can be extended such that non-active treatment arm subjects who are not censored and do not have an observed clinical event prior to closeout, closeout, are given treatment and then followed until trial closeout plus , when the candidate surrogate is measured. In order to replace missing values with the closeout measurement , we adapt the assumptions made by Qin and others (2008).
A5: Time constancy of the true immune response at time , : for placebo recipients with closeout and , , and , where and are iid random errors with mean zero;
A6: No infections during the close-out period, ,
where is the indicator of the clinical event during the close-out period of duration . Assumption A5 is not fully testable, but may be plausible when the follow-up time is not long relative to the age of the subjects and when environmental factors are unlikely to change . An obvious testable implication of A6 is that no subjects undergoing CPV should be observed to have an event before is measured. Some deviations from A5 and A6 may be acceptable and sensitivity analysis can be performed to evaluate the influence of such deviations.
We propose an additional trial augmentation; we will refer to this augmentation as the baseline surrogate measure (BSM). The BSM augmentation is defined simply as measuring the biomarker of interest at baseline; this can be useful for multiple purposes. Under Assumption A7, given below, the BSM measurement can replace missing values and the difference biomarker satisfies Case CB . Even when A7 does not hold, the BSM is useful as a potentially highly correlated BIP. Assumption A7 is no change in the biomarker from baseline to in a non-active treatment arm. This is stated formally as follows:
A7: .
Assumption A7 has testable hypotheses that the distribution of in the control arm has point mass at zero; violations of this assumption are easily observable. When a measurement error is present, the measurements of may differ from zero for many subjects and A7 still holds. If it is believed that a measurement error is present and non-systematic, one can test for evidence against A7 by testing the null . Both tests allow for some quantification of the plausibility of A7; however, subject-matter knowledge is just as important when determining if A7 is plausible. This augmentation is available in our motivating data set and greater discussion of Assumption A7 and its implications for candidate PS evaluation can be found in Section 7.
5. Weibull structural risk model
We assume a Weibull model for the conditional pdf of T given Z, which parameterizes both the scale, , and shape, , components of the Weibull model with treatment Z and potential surrogate . Specifically, our Weibull assumption A8 states that
A8:
where is the parameterized conditional Weibull survivor function for the treatment arm , and the conditional hazard function is given by
(5.1) |
for . Given A8, the conditional likelihood of the observed data can be written as
(5.2) |
We use a parametric form for the joint cdf of and W in our simulations, , which implies a form for , and we estimate using maximum likelihood. The choice of the estimated form of should be based on the trial data and can be tailored to the particular type of and W data observed. The model can be of any form which integration is feasible. Huang and Gilbert (2011) use a semiparametric model for and this can also be used with the Weibull model, as we demonstrate in the DCCT example. Regardless of the choice of model for , Monte-Carlo integration is suggested over numerical integration to reduce computational burden. Once an estimate for is obtained, we can use it in the likelihood above to obtain an estimated likelihood over . We can then maximize for estimates of . This general approach of EML was introduced by Pepe and Fleming (1991) and used by Follmann (2006) and Gilbert and Hudgens (2008) among others.
We state a result for the identifiability of in Appendix A of the supplementary material available at Biostatistics online. Using the identifiability result for and following the proof in Gilbert and Hudgens (2008), it can be shown that the observed estimated likelihood has a unique maximum given the data from a BIP-augmented trial, provided that there are observed failures. This result also holds in a CPV augmented trial design, following the proof of Proposition 3 in Wolfson (2009). Given that the appropriate assumptions hold, A1–A4 and A8 for BIP and A1–A6 and A8 for CPV, as well as implicit conditioning on , the unique solutions to the EML imply identification of the causal estimands of interest. Given the identifiability of and data augmentations, the EML estimators are also consistent for for consistent (Pepe and Fleming, 1991). However, the asymptotic distributional results of Pepe and Fleming (1991) for general EML estimators do not carry over to our setting, due to the zero probability of observing in infected placebo recipients. We suggest using the bootstrap for variance estimation and inference.
We refer to model A8 as the time-dependent hazard TE Weibull model. The conditional risks, , can be expressed as functions of the coefficients for any of the conditional risk forms of interest (i.e. based on a hazard, cumulative hazard or cdf). Although the model is depicted without the inclusion of additional baseline that variables, baseline variables such as the BIP, W, can be included in the scale term if it is believed they may be associated with outcome. Figure 1 depicts some example , , and curves.
5.1. Evaluating surrogate value under the Weibull model
We propose a three-step process for evaluating a potential PS using the above estimation method.
Step 1: Fit the time-dependent-hazard TE Weibull model via EML; determine the EML estimates by maximizing , where is defined in Equation (5.2).
Step 2: Test for time-varying conditional hazard-based treatment efficacy, : , by testing .
Step 3: If the data support , fit the time-independent hazard TE Weibull model, outlined below and in Appendix A of supplementary material available at Biostatistics online. Use estimates from this model for figures and inference on surrogate quality. If the data support rejection of , use the time-dependent-hazard TE model estimates for figures and inference on surrogate quality.
The testable null hypotheses of interest are as follows:
: , Null equivalent ;
: , Null equivalent ;
: , Null equivalent ;
: , Null equivalent ;
: , Null equivalent
We suggest Wald tests for all nulls. If we fail to reject , we use a simpler model for inference that fully characterizes the scale component and only allows for terms in the shape component that will affect hazard-based risk equally for the two treatment groups (parameterization outlined in Appendix A of supplementary material available at Biostatistics online). We refer to this model as the time-independent hazard TE Weibull model and place stars on the model coefficients to distinguish them from their time-dependent hazard TE counterparts. This model can again be fit via EML and can also accommodate the inclusion of baseline variables in the scale parameter.
The justifications for the coefficient equivalents of and are not conceptually difficult but require some algebra and are given in Appendix A of supplementary material available at Biostatistics online. The CDF-based TE is always time-dependent and nulls based on it are always a subset of the null space of the hazard-based TE tests for the same model. For this reason, we suggested testing both the hazard-based TE null and the CDF-based TE null, accounting for multiple testing, to evaluate a biomarker as having any value as a PS. Sequential testing is also suggested; if the data do not support rejection of , the appropriate tests for any surrogate value are and . Similarly, when data do support rejection of , the appropriate tests for any surrogate value are and .
If case CB holds for S or , via BSM and Assumption A7, then assessment of ACN can be made using the marginal risk estimands as parameterized by the Weibull models; tight confidence intervals about that include c for all t of interest are support for ACN. Further discussion of the different strategies of PS evaluation can be found in the discussion section and above in Section 3.
The time-dependent hazard model allows for time variation of in many forms, as illustrated in Figure 1. If the data support rejection of the null hypothesis , the most comprehensive way to evaluate the time variation is to plot the estimated for a range of values for several different time points of interest. In addition, one can plot the estimated for a range of time points and less than the longest follow-up time, for several different values. These plots provide a clear visual indication of the surrogate value of S as well as the meaning of any significant time variation; an example of this type of plot can be seen Figure S1 in Appendix B of the supplementary material available at Biostatistics online. Hypothesis tests can be used to provide inference about the nature of the time dependence depicted by the curve. Some suggested coefficient-based hypothesis tests are outlined in Appendix A of supplementary material available at Biostatistics online.
6. Simulation
Simulated data follow a 1:1 randomized, two-arm trial with 2000 subjects per treatment arm using the various case–control sampling designs for CPV and BIP. Suppose that the conditional cdf of T, given and Z, follows a Weibull model and that follows a bivariate normal model with correlation . Information lost to drop-out occurs completely at random, and occurs at a rate of 5% per year. Event times are censored at 3 years post , at which time the trials have 50% TE on average, with an average of 104 treatment group infections and 208 placebo group infections over the 1000 simulated trials. This follows the HIV vaccine trial design proposed in Gilbert and others (2011).
We investigate Weibull models for T given and Z under 7 different scenarios. We investigate three different PS quality levels which characterize time-independent hazard-based TE curves: a high quality surrogate, a marginal quality surrogate, and a useless surrogate. We call these the time independent scenarios. We also consider four scenarios with differing amounts of time dependence in . We investigate a high-quality surrogate and a marginal-quality surrogate with time dependence in alone and a high-quality surrogate and a marginal-quality surrogate under time-dependence that is both associated with the surrogate quality and with . We refer to this as the multiple time-dependent scenario, labeled as “Multi Time-dep” in Tables 1, 2 and 3, and all four settings with time-dependence as the time-dependent scenarios.
Table 1.
Time Indep. |
Time-dep. |
Multi Time-dep. |
Time Indep. |
Time-dep. |
Multi Time-dep. |
|||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Estimand | No Val. | Some Val. | High Val. | Some | High | Some | High | No | Some | High | Some | High | Some | High |
Full sampling and | 1:5 and full | |||||||||||||
0.50 | 1.10 | 0.90 | 0.20 | 0.00 | 0.20 | 0.70 | 0.40 | 1.10 | 0.80 | 0.10 | 0.20 | 0.10 | 0.60 | |
0.10 | 0.40 | 0.10 | 1.50 | 0.10 | 1.40 | 0.10 | 0.10 | 0.40 | 0.10 | 1.60 | 0.00 | 1.70 | 0.10 | |
0.50 | 1.10 | 0.90 | 0.40 | 0.30 | 0.40 | 0.40 | 0.50 | 1.10 | 0.80 | 0.20 | 0.00 | 0.80 | 0.60 | |
0.40 | 0.40 | 0.10 | 1.40 | 0.00 | 0.00 | 0.20 | 0.10 | 0.40 | 0.10 | 1.50 | 0.30 | 0.40 | 0.00 | |
No and full | full and 1:5 | |||||||||||||
0.70 | 1.20 | 0.70 | 0.20 | 0.40 | 0.50 | 0.70 | 0.70 | 1.20 | 0.70 | 0.20 | 0.40 | 0.50 | 0.70 | |
1.00 | 0.50 | 0.10 | 1.90 | 0.30 | 1.20 | 0.10 | 1.00 | 0.50 | 0.10 | 1.90 | 0.30 | 1.20 | 0.10 | |
0.60 | 1.10 | 0.70 | 0.30 | 0.50 | 0.40 | 0.30 | 0.60 | 1.10 | 0.70 | 0.30 | 0.50 | 0.40 | 0.30 | |
0.80 | 0.40 | 0.00 | 0.80 | 0.40 | 0.10 | 0.20 | 0.80 | 0.40 | 0.00 | 0.80 | 0.40 | 0.10 | 0.20 | |
1:5 and | no and 1:5 | |||||||||||||
0.70 | 1.10 | 0.70 | 0.10 | 0.30 | 0.50 | 0.70 | 0.50 | 1.10 | 0.60 | 0.00 | 0.10 | 0.30 | 0.60 | |
0.10 | 0.50 | 0.00 | 1.60 | 0.20 | 1.60 | 0.10 | 0.10 | 0.30 | 0.10 | 1.40 | 0.20 | 1.30 | 0.10 | |
0.60 | 1.10 | 0.70 | 0.30 | 0.50 | 0.40 | 0.30 | 0.60 | 1.10 | 0.60 | 0.40 | 0.50 | 0.30 | 0.30 | |
0.00 | 0.50 | 0.00 | 1.30 | 0.20 | 0.40 | 0.10 | 0.00 | 0.30 | 0.10 | 1.20 | 0.00 | 0.30 | 0.10 |
Average bias over the 1000 simulations less than 1 Monte Carlo standard error in all cases.
Table 2.
Time indep. |
Time-dep. |
Multi time-dep. |
|||||
---|---|---|---|---|---|---|---|
Null | No val. | Some val. | High val. | Some | High | Some | High |
Full sampling | |||||||
0.05 | 0.04 | 0.05 | 0.37 | 0.42 | 0.59 | 0.53 | |
0.04 | 0.05 | 0.09 | 0.41 | 0.50 | 0.84 | 0.74 | |
0.05 | 0.45 | 1.00 | 0.34 | 0.90 | 0.43 | 0.92 | |
¶ | 0.06 | 0.30 | 1.00 | 0.26 | 0.88 | 0.48 | 0.94 |
0.06 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |
0.05 | 0.86 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |
1:5 case:control sampling | |||||||
0.08 | 0.05 | 0.10 | 0.40 | 0.48 | 0.83 | 0.73 | |
§¶ | 0.06 | 0.43 | 1.00 | 0.25 | 0.81 | 0.43 | 0.90 |
0.07 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |
No sampling | |||||||
0.08 | 0.08 | 0.09 | 0.40 | 0.49 | 0.84 | 0.75 | |
§¶ | 0.06 | 0.45 | 1.00 | 0.26 | 0.86 | 0.45 | 0.90 |
0.06 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Proportional hazards test based on the Cox model. Test of based on a joint Wald test of . Test of based on a Wald test of . ¶ Test of based on a Wald test . §¶ The model-specific test of surrogate value based on the hazard, test §in the time-independent case and test ¶ in the time-dependent.Test based on a Wald test of . Test based on a Wald test of . The model-specific test of surrogate value based on the CDF, test in the time-independent case and test # in the time-dependent.
Table 3.
Time indep. |
Time-dep. |
Multi time-dep. |
|||||
---|---|---|---|---|---|---|---|
Null | No val. | Some val. | High val. | Some | High | Some | High |
Full sampling | |||||||
0.08 | 0.05 | 0.09 | 0.39 | 0.47 | 0.82 | 0.73 | |
§¶ | 0.05 | 0.39 | 0.99 | 0.21 | 0.78 | 0.35 | 0.88 |
0.06 | 0.87 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |
1:5 case:control sampling | |||||||
0.07 | 0.05 | 0.08 | 0.38 | 0.46 | 0.82 | 0.72 | |
§¶ | 0.05 | 0.42 | 1.00 | 0.22 | 0.79 | 0.37 | 0.86 |
0.06 | 0.86 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |
No sampling | |||||||
0.08 | 0.05 | 0.09 | 0.39 | 0.47 | 0.83 | 0.72 | |
§¶ | 0.05 | 0.43 | 1.00 | 0.21 | 0.78 | 0.36 | 0.85 |
0.06 | 0.87 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Test of based on a joint Wald test of =. § Test of based on a Wald test of . ¶ Test of based on a Wald test . §¶ The model-specific test of surrogate value based on the hazard, test §in the time-independent case and test ¶ in the time-dependent.Test based on a Wald test of . Test based on a Wald test of . The model-specific test of surrogate value based on the CDF, test in the time-independent case and test # in the time-dependent.
We also consider six different types of case–control sampling of / all for . The six case–control sampling scenarios considered are broken into two groups of three to consider the issues of case–control sampling of and separately. First, we consider case–control sampling of , measuring for all treated subjects. Case–control sampling of refers to obtaining measurements from a random sample of non-active treatment subjects for whom . We consider 1:5 case–control sampling of , no sampling of , and sampling of all non-active treatment subjects with closeout. We then investigate the effects of subsampling of , by holding case–control sampling of at 1:5 and again varying sampling of between 1:5 case–control, no sampling, and all non-active treatment subjects with . Case–control sampling of is the same as that for , with the addition of obtaining measurements for all treated subjects with observed events at closeout, .
Table 1 displays the percent bias for various points on the and curves for each of the seven surrogate types and 6 sampling scenarios. We find that the Weibull EML estimation method has satisfactory performance in terms of minimal percent bias for points on both the and curves and with average bias less than one Monte Carlo standard error in all cases. In addition, all model coefficient estimates have minimal bias, with all estimates of mean bias well within one Monte Carlo standard error (results not shown).
We display in Tables 2 and 3 the results from Wald tests of – and – for each of the seven surrogate scenarios and 6 sampling scenarios; Monte Carlo standard errors are used in the Wald tests due to the computational burden of the bootstrap. We also display the power of a test of proportional hazards (PH), , using a Cox model containing treatment alone based on the Schoenfeld residuals (Grambsch and Therneau, 1994). We find that with full sampling the Cox-based PH test has lower power to reject than the Weibull model-based test. We find that the test of has power ranging from 0.37 to 0.84 and nearly correct type 1 error.
We also display in Table 2 the results for the tests of the nulls and , for all of the surrogate scenarios and two of the sampling scenarios based on the time-dependent hazard Weibull model. We find that both tests have adequate power and correct size in the time-independent hazard scenarios. Tests of the nulls and have noticeably less power than the tests of and in the truly time-independent hazards scenarios; this justifies reverting to the simpler model when there is no evidence of time-dependent hazards. In the time-dependent hazard setting, the tests of and have power ranging from (0.21–1.00). The CDF-based TE test of any surrogate value, testing null , has markedly better power than is all cases. This is also true in the time-independent hazard scenarios, where has markedly better power than is all cases.
The non-hierarchical power and type 1 error rate of testing and over the various sampling scenarios can be seen in Tables 2 and 3. To evaluate type 1 error and power, the entire suggested hierarchical testing procedure for assessing surrogate value was followed for all full sample simulations. Under this procedure we found that correct type 1 error and power was approximately the same as a power-of--weighted average of and . For example, the power for the hierarchical procedure for the high-quality surrogate with multiple types of time dependence is 0.935, which is almost exactly for that scenario. This was similarly found for null hypotheses and , suggesting that the hierarchical procedure maintains the correct size in all scenarios.
Power to reject all nulls declines from full sampling to case–control sampling of . This decline is much more noticeable in the tests of surrogate quality; this is not surprising given that the coefficients involved in testing are associated with . It is clear from a comparison of Tables 2 and 3 that subsampling of has a greater impact on power than subsampling of . In some of our simulation scenarios there exists a paradox of reduced power with increased sampling of for a fixed level of sampling; this paradox was first observed in Gilbert and others (2011). The paradox is a characteristic of the EML estimator as explained in Huang and others (2013). For weaker BIP, adding has been shown in previous works to improve power compared with BIP alone (Gilbert and others, 2011).
The simulations for full sampling were repeated for lower-quality BIPs with (Table S1 in Appendix B of supplementary material available at Biostatistics online). As expected, power decreases rapidly and bias increases slowly as decreases. Hence, highly predictive BIPs are essential for accurate and reasonably precise PS evaluation for EML-based methods; this was also found to be true in previous works (Follmann, 2006; Gilbert and Hudgens, 2008; Huang and Gilbert, 2011). Based on our simulation results, the 0.701 correlation in our motivating example is adequate for unbiased and reasonably precise estimation via the Weibull EML method.
07. DCCT example
The DCCT enrolled 1441 persons with type 1 diabetes from 1983 to 1989 to determine the effects of intensive diabetes therapy on long-term complications of diabetes. Participants in DCCT were randomly assigned to intensive diabetes therapy aimed at lowering glucose concentrations as close as safely possible to the normal range or to conventional therapy aimed at preventing hyperglycemic symptoms. One of the outcomes of the DCCT, nephropathy (damage to the kidneys), is the leading cause of death and dialysis in the young with type 1 diabetes, particularly those with poorly controlled glucose levels. Nephropathy is often defined by a high albumin excretion rate, as micro-albuminuria (defined as an albumin excretion rate ) is the best non-invasive indicator of kidney damage. The trial ended early in 1993 due to overwhelming evidence of treatment efficacy, with an average of 6.5 years of follow-up; the estimated adjusted mean risk of micro-albuminuria was reduced by 56%, P-value 0.01 (DCCT/EDIC Research, 2011).
The current study includes all participants who were free from micro-albuminuria at baseline (); baseline micro-albuminuria was balanced over the arms of the trial. The difference in log-transformed hemoglobin A1C (HBA1C) measurements from baseline to year 1 is the candidate PS. The event of interest is the onset of persistent micro-albuminuria, which is defined as having two consecutive albumin excretion rate measurements . Right censoring occurs due to drop-out or due to the end of the trial in 1993. No subject had an event prior to year post-randomization. All subjects had the BIP measured, which was defined as a linear combination of the BSM measurement, age, BMI and smoking status fit via linear regression to change in HBA1C using the Akaike information criterion. The estimated Spearman correlation between the BIP and the candidate PS is (0.7). We use a linear combination of baseline variables here to demonstrate that a set of weaker BIPs can be combined to form a higher-quality BIP; when fitting a linear combination, care should be taken to only include variables that truly improve predictive power and to avoid overfitting.
We fit the time-dependent hazard Weibull model to these data assuming a semiparametric location-scale model for and assuming a parametric normal model. We found that the results were nearly identical, but the parametric model was more efficient. We found only marginal evidence of time dependence (P-values ), and for parsimony we use the time-independent hazard model as the main model for analysis. Figure S1 in Appendix B of the supplementary material (available at Biostatistics online) depicts the time-dependent hazard model for the parametric analysis. Under the time-independent hazard Weibull model we find evidence to support that 1-year change in HBA1C is a high-quality surrogate as the P-values for testing both and are for both the semiparametric and the parametric analysis. Figure 2 illustrates the estimated and curves for the time-independent hazard model, assuming a location-scale model for . Figure 2 in Appendix B of the supplementary material (available at Biostatistics online) depicts these same results for the parametric analysis. We also ran the analysis adjusting for the BIP. The adjusted models were very similar to the unadjusted, with all P-values within the same range as the unadjusted analysis and very similar TE curves.
There is evidence to suggest that Assumption A7 does not hold if there is no measurement error. As there is information to suggest the presence of measurement error in these measurements of HBA1C, we also test the null and find that this suggests no evidence against A7; (P-value 0.863). Figure 3 in Appendix B of the supplementary material (available at Biostatistics online) depicts the observed association between one year change in HBA1C and the clinical outcome separately by treatment arm. The estimated is 0.13 (95% CI 1.27, 0.458) for the semiparametric analysis; the P-value for is 0.228. This is consistent with but not supportive of ACN. This is not surprising as treatment continued for 9 years after the candidate surrogate was measured. However, 1-year change in HBA1C under active treatment still strongly modifies the and curves. Therefore, HBA1C reduction at year 1 is a good target for treatments in this setting.
8. Discussion
PSs are important endpoints for Phase I and II trials. Few PS evaluation methods allow for a time-to-event clinical endpoint with right-censoring and, to our knowledge, none allow for or characterize the time-dependent effects of the treatment. There is evidence of time-varying treatment effects in many treatment and vaccine efficacy trials. Methods of PS evaluation that do not allow for or characterize time-varying effects may classify potential PS as high-quality ignoring their lack of durability or dismiss high-quality surrogates in trials that have rapidly waning TE.
The time-dependent hazard TE Weibull model allows for the characterization of the time-varying treatment effects in the time-to-event setting. The EML method is an adequate means to estimate the parameters of the time-dependent hazard TE Weibull model, allowing for flexible modeling of the PS given BIP distribution. The EML estimators perform well when there is a highly predictive BIP, but the need for a highly correlated BIP is a limitation of EML estimation. When a highly correlated BIP is available, EML is consistent and relatively efficient without requiring as was recently suggested by Zigler and Belin (2012).
The CPV argumentation does not seem to materially improve power with EML estimation. This suggests that full likelihood should be considered as an alternative to EML when CPV is available, (Follmann, 2006). Huang and others (2013) develop a pseudoscore method for PS evaluation that improves efficiency over EML methods when CPV is available. However, in cases where CPV is not available, EML methods and pseudoscore methods perform similarly and an extension of the pseudoscore method to time-to-event has yet to be developed.
There are two concepts of what makes a biomarker useful as a PS. For one, the quality of a biomarker as a PS can be measured by the degree of variation in the marginal treatment efficacy curve over the biomarker under the active treatment and for the other evaluation of the biomarker under both trial arms is required. When the CB holds, these concepts are equivalent. We have proposed a BSM trial augmentation plus assumption, A7, that increases the number of trials where CB is likely to hold for some candidate PS. The suggested BSM augmentation is likely feasible when the candidate surrogate of interest is not difficult to measure at baseline. In trials with adequate augmentation, our methods can be useful for evaluating biomarkers under active treatment as surrogates for time-to-event clinical endpoints regardless of the validity of assumption CB and under both concepts of principal surrogacy when the CB holds.
Supplementary Material
supplementary Material is available at http://biostatistics.oxfordjournals.org.
Funding
The research of Dr E.E.G. was partially supported by the National Institute Of Allergy And Infectious Diseases (NIAID) of the National Institutes of Health (NIH) under award numbers R37AI054165 and R37AI032042. The research of Dr P.B.G. was partially supported under NIAID NIH grant number R37AI054165. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Supplementary Material
Acknowledgements
The authors are grateful to the DCCT Research Group for releasing their data to the public domain. Conflict of Interest: None declared.
References
- DCCT/EDIC Research, Group. Intensive diabetes therapy and glomerular filtration rate in type 1 diabetes. The New England Journal of Medicine. 2011;365(25):2366–2376. doi: 10.1056/NEJMoa1111732. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Duerr A., Huang Y., Buchbinder S., Coombs R. W., Sanchez J., del Rio C., Casapia M., Santiago S., Gilbert P. B., Corey L. Extended follow-up confirms early vaccine-enhanced risk of HIV acquisition and demonstrates waning effect over time among participants in a randomized trial of recombinant adenovirus HIV vaccine (Step study) Journal of Infectious Diseases. 2012;206(2):258–266. doi: 10.1093/infdis/jis342. and others. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Follmann D. Augmented designs to assess immune response in vaccine trials. Biometrics. 2006;62(4):1161–1169. doi: 10.1111/j.1541-0420.2006.00569.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Frangakis C. E., Rubin D. B. Principal stratification in causal inference. Biometrics. 2002;58(1):21–29. doi: 10.1111/j.0006-341x.2002.00021.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gilbert P. B., Grove D., Gabriel E. E., Huang Y., Gray G., Hammer S. M., Buchbinder S. P., Kublin J., Corey L., Self S. G. A sequential phase 2b trial design for evaluating vaccine efficacy and immune correlates for multiple HIV vaccine regimens. Statistical Communications in Infectious Diseases. 2011;3(1) doi: 10.2202/1948-4690.1037. [Epub ahead of print] [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gilbert P. B., Hudgens M. G. Evaluating candidate principal surrogate endpoints. Biometrics. 2008;64(4):1146–1154. doi: 10.1111/j.1541-0420.2008.01014.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gilbert P. B., Qin L., Self S. G. Evaluating a surrogate endpoint at three levels, with application to vaccine development. Statistics in Medicine. 2008;27(23):4758–4778. doi: 10.1002/sim.3122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grambsch P. M., Therneau T. M. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika. 1994;81(3):515–526. [Google Scholar]
- Hernán M. The hazards of hazard ratios. Epidemiology. 2010;21(1):13–15. doi: 10.1097/EDE.0b013e3181c1ea43. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huang Y., Gilbert P. B. Comparing biomarkers as principal surrogate endpoints. Biometrics. 2011;67(4):1442–1451. doi: 10.1111/j.1541-0420.2011.01603.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huang Y., Gilbert P. B., Wolfson J. Design and estimation for evaluating principal surrogate markers in vaccine trials. Biometrics. 2013;69(2):301–309. doi: 10.1111/biom.12014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Joffe M. M., Greene T. Related causal frameworks for surrogate outcomes. Biometrics. 2009;65(2):530–538. doi: 10.1111/j.1541-0420.2008.01106.x. [DOI] [PubMed] [Google Scholar]
- Li Y., Taylor J. M. G., Elliott M. R. A Bayesian approach to surrogacy assessment using principal stratification in clinical trials. Biometrics. 2010;66(2):523–531. doi: 10.1111/j.1541-0420.2009.01303.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pepe M. S., Fleming T. R. A nonparametric method for dealing with mismeasured covariate data. Journal of the American Statistical Association. 1991;86(413):108–113. [Google Scholar]
- Qin L., Gilbert P. B., Follmann D., Dongfeng L. Assessing surrogate endpoints in vaccine trials with case-cohort sampling and the Cox model. Annals of Applied Statistics. 2008;2(1):386–407. doi: 10.1214/07-AOAS132. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Taylor J. M. G., Wang Y., Thibaut R. Counterfactual links to the proportion of treatment effect explained by a surrogate marker. Biometrics. 2005;61(4):1102–1111. doi: 10.1111/j.1541-0420.2005.00380.x. [DOI] [PubMed] [Google Scholar]
- Wolfson J. Statistical methods for identifying surrogate endpoints in vaccine trials [Doctor of Philosophy Dissertation] 2009 University of Washington, Department of Biostatistics. [Google Scholar]
- Wolfson J., Gilbert P. B. Statistical identifiability and the surrogate endpoint problem, with application to vaccine trials. Biometrics. 2010;66(4):1153–1161. doi: 10.1111/j.1541-0420.2009.01380.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zigler C. M., Belin T. R. A Bayesian approach to improved estimation of causal effect predictiveness for a principal surrogate endpoint. Biometrics. 2012;68:922–932. doi: 10.1111/j.1541-0420.2011.01736.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.