Biostatistics (Oxford, England). 2012 Mar 6;13(4):609–624. doi: 10.1093/biostatistics/kxs003

A unified procedure for meta-analytic evaluation of surrogate end points in randomized clinical trials

James Y Dai 1,*, James P Hughes 2
PMCID: PMC3616754  PMID: 22394448

Abstract

The meta-analytic approach to evaluating surrogate end points assesses the predictiveness of treatment effect on the surrogate toward treatment effect on the clinical end point based on multiple clinical trials. Definition and estimation of the correlation of treatment effects were developed in linear mixed models and later extended to binary or failure time outcomes on a case-by-case basis. In a general regression setting that covers nonnormal outcomes, we discuss in this paper several metrics that are useful in the meta-analytic evaluation of surrogacy. We propose a unified 3-step procedure to assess these metrics in settings with binary end points, time-to-event outcomes, or repeated measures. First, the joint distribution of estimated treatment effects is ascertained by an estimating equation approach; second, the restricted maximum likelihood method is used to estimate the means and the variance components of the random treatment effects; finally, confidence intervals are constructed by a parametric bootstrap procedure. The proposed method is evaluated by simulations and applications to 2 clinical trials.

Keywords: Causal inference, Meta-analysis, Surrogacy

1. INTRODUCTION

Surrogate end points are of great interest in randomized clinical trials when the clinically meaningful end point is expensive to measure or takes a long time to occur. For a posttreatment intermediate outcome to qualify as a surrogate end point, it should reliably predict the treatment effect on the clinical end point. While few surrogates have been established (Fleming and DeMets, 1996), there is clearly a need for identifying surrogate markers to accelerate evaluation of new therapies and interventions (Rolan, 1997).

Statistical evaluation of surrogate end points has accumulated a large body of literature in the past 20 years. See, for example, Weir and Walley (2006), for a comprehensive review. In a landmark paper, Prentice (1989) defined a surrogate end point as “a response variable for which a test of the null hypothesis of no relationship to the treatment groups under comparison is also a valid test of the corresponding null hypothesis based on the true endpoint.” He proposed that surrogates should capture the full net effect of the treatment and should correlate with the true end points. While complete mediation is often hard to attain, Freedman and others (1992) developed a metric denoted as “proportion of treatment effect explained” (PTE), which is 1 minus the ratio of the treatment effect adjusted for surrogates and the net treatment effect, to quantify the degree of validity of a candidate surrogate end point. Concern has been expressed that the PTE measure is highly variable and may be outside of the range of 0 to 1.
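As a compact restatement of the PTE definition above, write $\beta$ for the net (unadjusted) treatment effect on the clinical end point and $\beta_S$ for the treatment effect after adjusting for the surrogate. Both symbols are labels introduced here for illustration, not notation taken from the cited papers.

$$\mathrm{PTE} = 1 - \frac{\beta_S}{\beta}$$

Because PTE is a ratio of two estimated effects, it becomes unstable when $\beta$ is close to 0, and nothing constrains it to lie in $[0, 1]$, which is the concern noted above.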

Frangakis and Rubin (2002) noted that “statistical surrogates,” such as the PTE measure, involve assessing the treatment effect given the observed postrandomization surrogate markers and thus do not have a causal interpretation due to potential selection bias (Rosenbaum, 1984). They proposed a new definition of surrogacy, “principal surrogate,” based on the potential outcome framework (Rubin, 1974, Rubin, 1978). In essence, if one can identify individual-level treatment effects on the true outcome and individual-level treatment effects on the surrogate end point, it is sensible to evaluate surrogacy by assessing the association between these 2 sets of treatment effects. While individual-level treatment effects are generally not identifiable, Gilbert and Hudgens (2008) developed an estimation procedure by incorporating a baseline covariate(s) that predicts the unobserved potential outcomes in HIV vaccine trials. Li and others (2010) proposed a Bayesian estimation method to evaluate the probabilities of counterfactuals when both clinical and surrogate end points are binary.

Another thread of statistical evaluation of surrogacy is the meta-analytical approach, motivated by the lack of power to evaluate surrogacy in any single trial. In particular, Daniels and Hughes (1997) first considered Bayesian random-effects models to assess the association between treatment effects on the potential surrogate and the clinical outcome in multiple trials, using Markov chain Monte Carlo techniques. When both end points are normally distributed, Buyse and others (2000) recommended 2 coefficients of determination, one at the trial level ($R^2_{\text{trial}}$) and the other at the individual level ($R^2_{\text{indiv}}$). Both quantities can be estimated from meta-analysis using linear mixed models. Gail and others (2000) considered prediction accuracy of treatment effect on the clinical end point when a new trial comes in with only the surrogate end point measured.

For the nonnormal outcomes that are abundant in clinical trials, for example, binary and time-to-event end points, estimation of $R^2_{\text{trial}}$ is not immediate due to modeling and computational difficulties in inferring variance components in a random-effects model with nonnormal outcomes. Several methods have been proposed, mostly dealing with special cases where the surrogate end point and the clinical end point have the same form. Notably, Renard and others (2002) proposed generalized linear mixed models (GLMMs) with latent variables in the binary–binary setting; for 2 failure time end points, Burzykowski and others (2001) proposed a two-stage procedure with copula models for the association of marginal survival functions. On the other hand, similar efforts have been devoted to extending the definition of the individual-level surrogacy, $R^2_{\text{indiv}}$, to nonnormal settings. From an information theory perspective, Alonso and Molenberghs (2006) unified several measures previously proposed for the individual-level surrogacy, including a summary correlation measure $R^2_{\Lambda}$ for repeated measurements (Alonso and others, 2006) and a general scaled likelihood reduction factor (Alonso and others, 2004).

Existing methods for nonnormal data require specialized models, where available, to handle the correlation between the surrogate end point and the clinical end point in the same participant, for example, copula models for 2 failure time outcomes, so that different settings require different models. None of these methods can model a surrogate end point and a clinical end point that have different forms, for example, a repeated binary surrogate and a failure time clinical end point, as in our data example in Section 5.1. In this article, we propose a unifying 3-step procedure for a broad range of outcome forms, including binary outcomes, failure time outcomes, and repeated measures. The joint distribution of the treatment effects is first obtained by an estimating equation approach, thereby circumventing the difficulty in modeling the correlation of clinical and surrogate end points; then variance components of between-trial treatment effects are estimated by a restricted maximum likelihood (REML) method; and finally, a parametric bootstrap procedure is employed to obtain their confidence intervals (CIs). Compared to existing methods, this 3-step procedure provides maximal flexibility for diverse forms of clinical and surrogate end points.

For each component of our proposed procedure, it is useful to emphasize its novelty relative to existing works. First, our proposed estimating equation approach is more general than that proposed by Gail and others (2000) since we consider the treatment effect directly in estimating equations rather than developing separate marginal models in the treatment arm and in the control arm. Semiparametric models, such as the Cox proportional hazard model widely used in randomized clinical trials, can thus be used instead of imposing distributional assumptions on survival times. Daniels and Hughes (1997) proposed a Bayesian random-effects model that also accommodates general forms of outcomes. However, in their approach the within-trial covariance between the 2 treatment effects has to be obtained by a nonparametric bootstrap procedure, whereas the estimating equation approach yields this covariance directly. Second, the REML estimation of between-trial variance components is an adaptation of the bivariate random-effects model in the meta-analysis literature (van Houwelingen and others, 2002), which is easily implementable. Third, the CIs based on our proposed parametric bootstrap procedure provide better coverage for the estimated correlation than likelihood-based CIs (see Section 4), such as those from the two-stage estimation procedure in Burzykowski and others (2001), because the latter can be overly optimistic when only a limited number of trials is available.

In Section 2, we define and discuss several metrics for the meta-analytical evaluation of surrogacy in a general regression setting that encompasses nonnormal outcomes. We describe the 3-step estimation procedure in Section 3, present simulation results in Section 4 and 2 data examples in Section 5, and close with a short discussion.

2. METRICS OF SURROGACY FOR PREDICTION

At the core of the requirement for a surrogate end point is that “knowing the effect of treatment on the surrogate allows prediction of the effect of treatment on the more clinically relevant outcome” (Joffe and Greene, 2009) when a new trial comes in with only the surrogate end point measured. The meta-analytic evaluation of surrogate end points assesses empirically the predictability of treatment effects on the clinical end point by the surrogate in a series of studies, assuming that the new study employs the same type of treatments or drugs as the set of completed trials (Daniels and Hughes, 1997, Buyse and others, 2000, Gail and others, 2000). This amounts to quantifying the correlation of the 2 sets of treatment effects in multiple clinical trials. Buyse and others (2000) defined $R^2_{\text{trial}}$ and $R^2_{\text{indiv}}$ in the context of linear mixed models, where $R^2_{\text{trial}}$ is the coefficient of determination for the 2 sets of trial-level treatment effects and $R^2_{\text{indiv}}$ is the squared correlation between the surrogate and the clinical end points after adjustment for the trial effects and the treatment effect. We now generalize these concepts to any regression model that yields asymptotically linear estimators of treatment effects.

Consider data from $J$ trials, $j = 1,\dots,J$, the $j$th of which contains $n_j$ participants. For the $i$th subject in the $j$th trial, $i = 1,\dots,n_j$, a clinical end point $Y_{ij}$, a surrogate end point $X_{ij}$, and a binary treatment assignment $Z_{ij}$ are measured. Let $\mathcal{T}_{1j}$ be the effect of $Z$ on $X$ in the $j$th trial, and similarly, define $\mathcal{T}_{2j}$ to be the effect of $Z$ on $Y$ in the $j$th trial. For outcomes that can be parameterized in a generalized linear model (GLM), the following models are assumed for the treatment effects:

$$g_1\{E(X_{ij} \mid Z_{ij})\} = \alpha_{0j} + \mathcal{T}_{1j} Z_{ij}, \qquad (2.1)$$

$$g_2\{E(Y_{ij} \mid Z_{ij})\} = \gamma_{0j} + \mathcal{T}_{2j} Z_{ij}, \qquad (2.2)$$

where $g_1$ and $g_2$ are link functions, for example, the logit function for a binary outcome, and $\alpha_{0j}$ and $\gamma_{0j}$ define the intercepts in the $j$th trial. For failure time outcomes, Cox proportional hazard models can be constructed with trial-specific baseline hazards.

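We hypothesize that $(\mathcal{T}_{1j}, \mathcal{T}_{2j})$ are random effects and that their joint distribution is bivariate normal,

$$\begin{pmatrix} \mathcal{T}_{1j} \\ \mathcal{T}_{2j} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} \right). \qquad (2.3)$$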

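Because of randomization, we are able to estimate these trial-averaged treatment effects consistently by $\hat{\mathcal{T}}_{1j}$ and $\hat{\mathcal{T}}_{2j}$, the asymptotically linear estimators (ALEs) described in Section 3.1. Conditional on $(\mathcal{T}_{1j}, \mathcal{T}_{2j})$, the joint distribution of $(\hat{\mathcal{T}}_{1j}, \hat{\mathcal{T}}_{2j})$ can be derived,

$$\begin{pmatrix} \hat{\mathcal{T}}_{1j} \\ \hat{\mathcal{T}}_{2j} \end{pmatrix} \Bigg| \begin{pmatrix} \mathcal{T}_{1j} \\ \mathcal{T}_{2j} \end{pmatrix} \sim N\left( \begin{pmatrix} \mathcal{T}_{1j} \\ \mathcal{T}_{2j} \end{pmatrix}, \frac{1}{n_j} \begin{pmatrix} \Sigma_{11j} & \Sigma_{12j} \\ \Sigma_{12j} & \Sigma_{22j} \end{pmatrix} \right), \qquad (2.4)$$

where $\Sigma_{ll'j}$, $l, l' = 1, 2$, are elements of the asymptotic covariance matrix of $(\hat{\mathcal{T}}_{1j}, \hat{\mathcal{T}}_{2j})$. Marginally, the variance of $(\hat{\mathcal{T}}_{1j}, \hat{\mathcal{T}}_{2j})$ consists of the between-trial variability in (2.3) and the within-trial variability in (2.4). The latter decreases with sample size in each trial, while the former is not affected by sample size. Note that, unlike Buyse and others (2000), we do not treat the intercepts $\alpha_{0j}$ and $\gamma_{0j}$ as random effects, in order to reduce computational complexity for a nonnormal regression model. This should have little impact on the estimation of treatment effects since, by randomization, treatment effects are independent of the estimators of baseline parameters.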

Now suppose a new trial with sample size $n_{J+1}$ has the surrogate end point measured but not the clinical end point. We assume that the pair of treatment effects for the new trial follows the bivariate normal distribution in (2.3). Suppose we estimate the treatment effect on the surrogate as $\hat{\mathcal{T}}_{1,J+1}$ and its estimated variability as $\hat\Sigma_{11,J+1}/n_{J+1}$, where $\Sigma_{11,J+1}$ is the asymptotic variance. The bivariate random variables $(\mathcal{T}_{2,J+1}, \hat{\mathcal{T}}_{1,J+1})$ are asymptotically normal with mean $(\mu_2, \mu_1)$, variances $(\sigma_2^2,\ \sigma_1^2 + \Sigma_{11,J+1}/n_{J+1})$, and covariance $\sigma_{12}$. This leads to the expected treatment effect on the clinical end point and its variability,

$$E(\mathcal{T}_{2,J+1} \mid \hat{\mathcal{T}}_{1,J+1}) = \mu_2 + \frac{\sigma_{12}}{\sigma_1^2 + \Sigma_{11,J+1}/n_{J+1}} \left(\hat{\mathcal{T}}_{1,J+1} - \mu_1\right), \qquad \mathrm{Var}(\mathcal{T}_{2,J+1} \mid \hat{\mathcal{T}}_{1,J+1}) = \sigma_2^2 - \frac{\sigma_{12}^2}{\sigma_1^2 + \Sigma_{11,J+1}/n_{J+1}}.$$

The within-trial variability $\Sigma_{11,J+1}/n_{J+1}$ decreases with $n_{J+1}$ and is presumably much smaller than the between-trial variability, if heterogeneity of treatment effects across trials is evident. To accurately predict the treatment effect on the clinical end point in the new trial, it is thus of interest to quantify 3 metrics from the $J$ existing trials:

  1. $\rho^2 = \sigma_{12}^2 / (\sigma_1^2 \sigma_2^2)$, the squared correlation between 2 sets of true treatment effects in the $J$ trials. The higher this squared correlation is, the more accurate the prediction of the treatment effect on the clinical end point could be. In linear models, $\rho^2$ is the $R^2_{\text{trial}(r)}$ in the reduced random-effects model proposed by Buyse and others (2000), though $\rho^2$ is defined in a general regression setting that covers nonnormal outcomes. For a good surrogate end point, we need $\rho^2$ to be well over 0.5, ideally close to 1.

  2. $\lambda_1 = \sigma_{12} / \sigma_1^2$, the slope of regressing $\mathcal{T}_{2j}$ on $\mathcal{T}_{1j}$. It corresponds to the parameter β in Daniels and Hughes (1997), and in a single trial, it reduces to the relative effect proposed in Buyse and Molenberghs (1998). $\lambda_1$ can be viewed as a summary of dose correspondence between 2 sets of treatment effects. If there is a causal relationship between the surrogate and the clinical end points, one expects a bigger effect on the surrogate to be accompanied by a bigger effect on the clinical end point. This echoes the use of dose–response as evidence for causation in observational studies (Breslow and Day, 1980). Ghosh and others (2010) recently discussed the connection of the relative effect and the causal effect of the surrogate on the clinical outcome. Certainly the value of $\lambda_1$ depends on the scales of $\mathcal{T}_{1j}$ and $\mathcal{T}_{2j}$, for example, $\mathcal{T}_{1j}$ may be in the linear scale and $\mathcal{T}_{2j}$ may be in the scale of log odds ratio or log hazard ratio.

  3. $\lambda_0 = \mu_2 - \lambda_1 \mu_1$, the expected treatment effect on the clinical end point when the treatment effect on the surrogate is 0. It corresponds to the parameter α in Daniels and Hughes (1997). Though the emphasis in predicting the treatment effect on the clinical end point is on the accuracy of such prediction ($R^2_{\text{trial}}$), it is informative to know whether the predicted treatment effect on the clinical end point is 0 when the treatment effect on the surrogate is 0, as it has ties to Prentice's definition of surrogacy, as well as the “causal necessity” condition in principal surrogacy (Gilbert and Hudgens, 2008). In meta-analytical literature, a similar intercept measure has been discussed in Daniels and Hughes (1997).

The second and the third metrics are complementary to the first one in making a prediction of the treatment effect on the clinical end points, knowing the treatment effect on the surrogate. A good surrogate would have a high $\rho^2$ so that one can accurately make the prediction. It should also yield a small $\lambda_0$ so that no effect on the surrogate suggests no effect on the clinical end point. A large $\lambda_1$ is not required since its value is scale dependent but could be used for comparative purposes when several candidate surrogates are available and standardized to the same scale. These criteria essentially assess the association between 2 sets of treatment effects averaged across a number of trials. They are useful for the prediction, but they do not generally have a causal interpretation for the effect of the surrogate on the clinical end points at the individual level.
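The three metrics and the prediction for a new trial can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function names and the parameter values are hypothetical, and the between-trial parameters are treated as known rather than estimated.

```python
def surrogacy_metrics(mu1, mu2, s11, s22, s12):
    """Trial-level metrics from the parameters of the bivariate normal in (2.3).

    s11 and s22 are the between-trial variances, s12 the between-trial covariance.
    """
    rho2 = s12 ** 2 / (s11 * s22)   # squared correlation of the true effects
    lam1 = s12 / s11                # slope of the clinical effect on the surrogate effect
    lam0 = mu2 - lam1 * mu1         # expected clinical effect at zero surrogate effect
    return rho2, lam1, lam0


def predict_clinical_effect(t1_hat, within_var, mu1, mu2, s11, s22, s12):
    """Conditional mean and variance of the clinical effect in a new trial,
    given the estimated surrogate effect t1_hat whose within-trial variance
    is within_var."""
    total_var = s11 + within_var    # marginal variance of the estimated surrogate effect
    mean = mu2 + s12 / total_var * (t1_hat - mu1)
    var = s22 - s12 ** 2 / total_var
    return mean, var


# Hypothetical values: means (0.5, 0.4), equal variances 0.25, correlation 0.8.
params = dict(mu1=0.5, mu2=0.4, s11=0.25, s22=0.25, s12=0.2)
rho2, lam1, lam0 = surrogacy_metrics(**params)
pred_mean, pred_var = predict_clinical_effect(1.0, 0.05, **params)
print(rho2, lam1, lam0, pred_mean, pred_var)
```

With these values the intercept is 0 and the slope equals the correlation, because the two between-trial variances are equal. Increasing `within_var` shrinks the prediction toward the overall mean and widens its variance.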

Although our focus in this paper is on the trial-level surrogacy, it is useful to discuss the implication of the models (2.1)–(2.4) for the individual-level surrogacy. Buyse and Molenberghs (1998) suggested the association between the surrogate and the clinical end points after adjustment for the treatment effect, or $R^2_{\text{indiv}}$ in linear models (Buyse and others, 2000), as an individual-level surrogacy measure. It is not required for prediction when a new trial comes in with only the surrogate measured, but a high $R^2_{\text{indiv}}$ does improve prediction when the new trial has the clinical end point partially collected (Li and Taylor, 2010). Extending the concept of $R^2_{\text{indiv}}$ to repeated measures and nonnormal outcomes has motivated a series of works that are unified by information theory (Alonso and others, 2004, Alonso and others, 2006, Alonso and Molenberghs, 2006). Under our framework based on (2.3) and (2.4), it is straightforward to generalize $R^2_{\text{indiv}}$ to the correlation of the estimated treatment effects. Define $r_j = \Sigma_{12j} / \sqrt{\Sigma_{11j}\Sigma_{22j}}$ to be the asymptotic correlation of $(\hat{\mathcal{T}}_{1j}, \hat{\mathcal{T}}_{2j})$ conditional on $(\mathcal{T}_{1j}, \mathcal{T}_{2j})$. It is easy to show that in the linear models, $r_j^2$ is equivalent to $R^2_{\text{indiv}}$, the squared residual correlation of $X$ and $Y$ after adjusting for the treatment. Thus, $r_j^2$ extends this concept to the nonnormal settings. The estimation of $r_j^2$, though not a focus of this article, can be easily attained by the estimating equation approach we describe later in Section 3.

3. A 3-STEP PROCEDURE TO ASSESS λ0, λ1 AND ρ

We start from the models in (2.1)–(2.4). For a series of estimated treatment effects $(\hat{\mathcal{T}}_{1j}, \hat{\mathcal{T}}_{2j})$, $j = 1,\dots,J$, there are 2 levels of variability: the within-trial variability from the estimation (2.4) and the between-trial variability (2.3). In the procedure we propose, we first estimate the treatment effects consistently in each trial, along with the within-trial variability, that is, the asymptotic distribution of the estimated treatment effects. We then use an expectation–maximization (EM) algorithm to compute REML estimates for $(\mu_1, \mu_2, \sigma_1^2, \sigma_{12}, \sigma_2^2)$, and hence $\lambda_0$, $\lambda_1$, and $\rho$.

3.1. Within-trial joint distribution of $(\hat{\mathcal{T}}_{1j}, \hat{\mathcal{T}}_{2j})$

Denote by $\alpha$ the complete set of parameters in the regression model (2.1) and by $\gamma$ the complete set of parameters in the regression model (2.2); $\mathcal{T}_{1j}$ and $\mathcal{T}_{2j}$ are contained in $\alpha$ and $\gamma$, respectively. The estimators for $\alpha$ and $\gamma$ are often formulated as ALEs (Newey and Powell, 1990, Robins and others, 1994). Let $n = \sum_j n_j$; an estimator $\hat\alpha$ is asymptotically linear if $\sqrt{n}(\hat\alpha - \alpha) = n^{-1/2} \sum_j \sum_{i=1}^{n_j} B_{1ij} + o_p(1)$. The function $B_1$ is referred to as the “influence function” of $\hat\alpha$ in the sense of Casella and Berger (2002). The influence function $B_2$ of an ALE for $\gamma$ is defined similarly. An ALE can be obtained by solving a system of estimating equations that are sums of $n$ independent score contributions. Let $\sum_j \sum_{i=1}^{n_j} U_{1ij} = 0$ be the set of estimating equations solved for $\alpha$, and let $\sum_j \sum_{i=1}^{n_j} U_{2ij} = 0$ be the set of estimating equations solved for $\gamma$. Let $A_1 = -E[\partial U_{1ij}/\partial\alpha]$ and $A_2 = -E[\partial U_{2ij}/\partial\gamma]$. Thus, the influence functions can be written as $B_{1ij} = A_1^{-1} U_{1ij}$ and $B_{2ij} = A_2^{-1} U_{2ij}$.

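The random vectors $U_{1ij}$ and $U_{2ij}$ are i.i.d. with zero mean, but for the same $i$ and $j$, $U_{1ij}$ and $U_{2ij}$ are correlated. The joint distribution of $\hat\alpha$ and $\hat\gamma$ is established using the central limit theorem, Slutsky's theorem, and the Cramer–Wold device:

$$\sqrt{n} \begin{pmatrix} \hat\alpha - \alpha \\ \hat\gamma - \gamma \end{pmatrix} \to N\left( 0, \begin{pmatrix} E(B_{1ij} B_{1ij}^{T}) & E(B_{1ij} B_{2ij}^{T}) \\ E(B_{2ij} B_{1ij}^{T}) & E(B_{2ij} B_{2ij}^{T}) \end{pmatrix} \right) \text{ in distribution}, \qquad (3.5)$$

where $E(B_{lij} B_{l'ij}^{T})$, $l, l' = 1, 2$, are $2J \times 2J$ submatrices of the full covariance matrix. The limiting distribution of $(\hat{\mathcal{T}}_{1j}, \hat{\mathcal{T}}_{2j})$ can be retrieved from the joint distribution of $(\hat\alpha, \hat\gamma)$.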

This estimating equation approach generalizes the marginal score approach in Gail and others (2000). In particular, Gail and others (2000) start the estimation from the marginal distribution of $Y$ and $X$ in the treatment arm and the control arm, separately. Thus, a distributional assumption is needed for $Y$ and $X$, for example, a Weibull distribution for time-to-event outcomes. Our approach first estimates the marginal treatment effects separately. The joint distribution of the treatment effects is then obtained by estimating equation theory, so the approach is applicable to a more general class of models, for example, a GLM, correlated data modeled by generalized estimating equations (GEEs), and censored time-to-event data fitted by a Cox proportional hazard model. The linearization of score functions and regularity conditions in each of these models can be found in the respective literature (McCullagh, 1983, Liang and Zeger, 1986, Lin and Wei, 1989). Among these, the estimating functions and influence functions for a logistic regression or GEEs are immediate to compute, while the linearization of score functions for a Cox model requires extra calculation; see, for example, Theorem 2.1 of Lin and Wei (1989).
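For the binary–binary case, the within-trial step can be sketched as below for a single trial. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the data are simulated, and the way the clinical end point is made to depend on the surrogate is an assumption chosen only to induce within-subject correlation. Each logistic model is fitted separately, and the per-subject influence values of the two treatment-effect estimators are then cross-multiplied to obtain their joint covariance.

```python
import numpy as np


def expit(v):
    return 1.0 / (1.0 + np.exp(-v))


def logistic_influence(outcome, z, n_iter=25):
    """Fit logit P(outcome = 1) = a + t * z by Newton-Raphson.

    Returns the estimates (a, t) and the per-subject influence values
    A^{-1} U_i, one row per subject."""
    X = np.column_stack([np.ones(len(z)), z])
    theta = np.zeros(2)
    for _ in range(n_iter):
        p = expit(X @ theta)
        U = X * (outcome - p)[:, None]                   # score contributions U_i
        A = (X * (p * (1 - p))[:, None]).T @ X / len(z)  # minus the mean derivative of U_i
        theta = theta + np.linalg.solve(A, U.mean(axis=0))
    p = expit(X @ theta)
    U = X * (outcome - p)[:, None]
    A = (X * (p * (1 - p))[:, None]).T @ X / len(z)
    B = np.linalg.solve(A, U.T).T                        # influence values B_i
    return theta, B


def within_trial_cov(x, y, z):
    """Estimated treatment effects on the surrogate x and the clinical end
    point y, with the estimated covariance matrix of the two estimators."""
    (_, t1), B1 = logistic_influence(x, z)
    (_, t2), B2 = logistic_influence(y, z)
    infl = np.column_stack([B1[:, 1], B2[:, 1]])         # treatment-effect components
    n = len(z)
    asym_cov = infl.T @ infl / n                         # covariance of sqrt(n)(T_hat - T)
    return np.array([t1, t2]), asym_cov / n


# Simulated single trial (all values hypothetical).
rng = np.random.default_rng(1)
n = 2000
z = rng.integers(0, 2, n).astype(float)
x = (rng.random(n) < expit(-0.5 + 0.5 * z)).astype(float)
y = (rng.random(n) < expit(-1.0 + 0.4 * z + 0.8 * x)).astype(float)
t_hat, S = within_trial_cov(x, y, z)
print(t_hat, S)
```

Because a logistic model with a single binary covariate is saturated, the diagonal entries of `S` coincide with the familiar sum-of-reciprocal-cell-counts variance of a log odds ratio, which gives a quick sanity check. The off-diagonal entry is the quantity that the separate marginal fits alone would not provide.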

3.2. REML estimates of λ0, λ1, and ρ

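Considering both the between-trial variability (2.3) and the within-trial variability (2.4), the marginal distribution of $(\hat{\mathcal{T}}_{1j}, \hat{\mathcal{T}}_{2j})$, when sample sizes within trials are large, can be approximated as

$$\begin{pmatrix} \hat{\mathcal{T}}_{1j} \\ \hat{\mathcal{T}}_{2j} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, D + \Sigma_j \right),$$

where $D$ is the between-trial covariance matrix in (2.3) and $\Sigma_j$ is the within-trial covariance matrix in (2.4).

The estimation of $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \sigma_{12})$ is achieved by maximum (approximate) likelihood of the $J$ pairs $(\hat{\mathcal{T}}_{1j}, \hat{\mathcal{T}}_{2j})$, assuming $\Sigma_j$ is known. The likelihood is approximate in the sense that it is based on the estimated asymptotic joint distribution of $(\hat{\mathcal{T}}_{1j}, \hat{\mathcal{T}}_{2j})$, not the probability of observing individual data. This strategy is commonly used in meta-analysis (van Houwelingen and others, 2002).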

To proceed to estimation, it is convenient to express the model as a bivariate random-effects model with $J$ independent observations, $T_j = \beta + b_j + u_j$, $j = 1,\dots,J$, where $\beta$ is the fixed-effects parameter, $b_j$ is the random effect, and $u_j$ is the estimation error. All these quantities are bivariate vectors: $T_j = (\hat{\mathcal{T}}_{1j}, \hat{\mathcal{T}}_{2j})^{T}$, $\beta = (\mu_1, \mu_2)^{T}$, $b_j = (b_{1j}, b_{2j})^{T}$, $u_j = (\varepsilon_{1j}, \varepsilon_{2j})^{T}$. The random effects $b_j$ follow the bivariate normal distribution $\mathcal{N}(0, D)$ implied by (2.3), and they are independent of the estimation errors $u_j$, which have the zero-mean normal distribution with the covariance matrix $\Sigma_j$ in (2.4). Let $V_j = D + \Sigma_j$ denote the variance of $T_j$. If we view $T_j$ as observables, the model resembles a random-effects model for longitudinal data, except that we assume that the distributions of $\varepsilon_{1j}$ and $\varepsilon_{2j}$ are known. Maximum likelihood (ML) estimates and REML estimates can be computed by an EM algorithm (Laird and Ware, 1982, Laird, 1982), treating $b_j$ as missing data. REML estimates of variance components are generally preferable to ML estimates, particularly when the sample size is small. We next describe the EM algorithm to estimate $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \sigma_{12})$.

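If we observed $b_j$, it would be straightforward to estimate the variance components by $\hat D = J^{-1} \sum_{j=1}^{J} b_j b_j^{T}$. Thus, in the E-step, we compute the expectation of this sufficient statistic given the current parameter estimates. In particular, for the REML estimates this expectation is computed as follows:

$$E\Big( \sum_{j=1}^{J} b_j b_j^{T} \,\Big|\, T_1,\dots,T_J \Big) = \sum_{j=1}^{J} \Big\{ \hat b_j \hat b_j^{T} + D - D V_j^{-1} D + D V_j^{-1} \Big( \sum_{k=1}^{J} V_k^{-1} \Big)^{-1} V_j^{-1} D \Big\},$$

which includes the extra variability from estimating the fixed-effects parameters $\mu_1$ and $\mu_2$ (the last term in braces). In the M-step, the fixed-effects estimates and the random-effects estimates are updated,

$$\hat\beta = \Big( \sum_{j=1}^{J} V_j^{-1} \Big)^{-1} \sum_{j=1}^{J} V_j^{-1} T_j, \qquad \hat b_j = D V_j^{-1} (T_j - \hat\beta),$$

as well as the variance components $\hat\sigma_1^2$, $\hat\sigma_2^2$, and $\hat\sigma_{12}$, which are the elements of $\hat D = J^{-1} E( \sum_j b_j b_j^{T} \mid T_1,\dots,T_J )$. From these estimates, we compute $\hat\lambda_0$, $\hat\lambda_1$, and $\hat\rho$.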
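The EM iteration for this model can be sketched in a few lines of NumPy. The sketch follows the standard Laird and Ware (1982) REML updates for a random-intercept model with known within-trial covariances; it is not the authors' code, and the function name `reml_em`, the starting value, the stopping rule, and the simulated inputs are all assumptions made for illustration.

```python
import numpy as np


def reml_em(T, S, n_iter=2000, tol=1e-10):
    """REML estimates for T_j ~ N(beta, D + S_j) with known S_j.

    T has shape (J, 2) and S has shape (J, 2, 2)."""
    T = np.asarray(T, dtype=float)
    S = np.asarray(S, dtype=float)
    J = T.shape[0]
    D = np.cov(T.T)                                # crude starting value
    for _ in range(n_iter):
        W = np.linalg.inv(D + S)                   # V_j^{-1} for every trial
        C = np.linalg.inv(W.sum(axis=0))           # covariance of the weighted mean
        beta = C @ np.einsum("jab,jb->a", W, T)    # fixed effects
        DW = D @ W                                 # D V_j^{-1}, shape (J, 2, 2)
        b = np.einsum("jab,jb->ja", DW, T - beta)  # predicted random effects
        # Conditional covariance of b_j, including the REML term for beta.
        cond_var = D - DW @ D + DW @ C @ DW.transpose(0, 2, 1)
        D_new = (b.T @ b + cond_var.sum(axis=0)) / J
        converged = np.max(np.abs(D_new - D)) < tol
        D = D_new
        if converged:
            break
    W = np.linalg.inv(D + S)
    beta = np.linalg.solve(W.sum(axis=0), np.einsum("jab,jb->a", W, T))
    lam1 = D[0, 1] / D[0, 0]
    lam0 = beta[1] - lam1 * beta[0]
    rho = D[0, 1] / np.sqrt(D[0, 0] * D[1, 1])
    return beta, D, (lam0, lam1, rho)


# Simulated inputs (hypothetical values): 200 trials with a common S_j.
rng = np.random.default_rng(0)
J = 200
D_true = np.array([[0.25, 0.2], [0.2, 0.25]])
S = np.tile(np.array([[0.02, 0.005], [0.005, 0.02]]), (J, 1, 1))
T = (rng.multivariate_normal([0.5, 0.4], D_true, size=J)
     + rng.multivariate_normal([0.0, 0.0], S[0], size=J))
beta_hat, D_hat, (lam0_hat, lam1_hat, rho_hat) = reml_em(T, S)
print(beta_hat, D_hat, lam0_hat, lam1_hat, rho_hat)
```

When every trial shares the same within-trial covariance, as in this toy input, the REML solution has a closed form (the sample covariance of the estimated effects minus the common within-trial covariance), which is a convenient check on the iteration. With unequal $\Sigma_j$ no closed form is available and the iteration is needed.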

3.3. Parametric bootstrap

The variances for the estimates of fixed-effects parameters and variance components can be computed by inverting the information matrix, though the small number of trials or subgroups may render this approach inadequate. The profile likelihood approach typically yields a better coverage probability, though it assumes an asymptotic $\chi^2$ distribution for the likelihood ratio, and its computation is algebraically cumbersome for the compound parameters we have, for example, the correlation. We propose a parametric bootstrap procedure, which first samples $J$ pairs of treatment effects from the bivariate normal distribution

$$\begin{pmatrix} \mathcal{T}_{1j}^{*} \\ \mathcal{T}_{2j}^{*} \end{pmatrix} \sim N\left( \begin{pmatrix} \hat\mu_1 \\ \hat\mu_2 \end{pmatrix}, \hat D + \hat\Sigma_j \right), \qquad j = 1,\dots,J,$$

and then reestimates $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \sigma_{12})$ by the procedure in Section 3.2, assuming $\hat\Sigma_j$ is known. Here $\hat D$ denotes the estimated between-trial covariance matrix with elements $\hat\sigma_1^2$, $\hat\sigma_2^2$, and $\hat\sigma_{12}$. The 3 metrics we defined in Section 2 are computed correspondingly. The 2.5% and 97.5% quantiles of 1000 bootstrap samples are used as the lower bound and the upper bound of the 95% CI.
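The resampling loop itself is short, as the sketch below suggests. To keep the example self-contained, the refitting step uses a simple method-of-moments estimator (`moment_fit`) as a stand-in for the REML procedure of Section 3.2; that substitution, the function names, and the input values are assumptions for illustration only. In practice the `fit` argument would be the REML routine.

```python
import numpy as np


def metrics(beta, D):
    """(lambda0, lambda1, rho) from a mean vector and a 2x2 covariance."""
    lam1 = D[0, 1] / D[0, 0]
    lam0 = beta[1] - lam1 * beta[0]
    rho = np.clip(D[0, 1] / np.sqrt(D[0, 0] * D[1, 1]), -1.0, 1.0)
    return np.array([lam0, lam1, rho])


def moment_fit(T, S):
    """Stand-in estimator (NOT REML): sample mean, and sample covariance
    minus the average within-trial covariance."""
    return T.mean(axis=0), np.cov(T.T) - S.mean(axis=0)


def parametric_bootstrap(beta_hat, D_hat, S_hat, fit, n_boot=1000, seed=0):
    """Percentile 95% intervals for (lambda0, lambda1, rho)."""
    rng = np.random.default_rng(seed)
    J = S_hat.shape[0]
    L = np.linalg.cholesky(D_hat + S_hat)          # one Cholesky factor per trial
    draws = np.empty((n_boot, 3))
    for r in range(n_boot):
        noise = rng.standard_normal((J, 2))
        # T*_j ~ N(beta_hat, D_hat + S_hat_j)
        T_star = beta_hat + np.einsum("jab,jb->ja", L, noise)
        draws[r] = metrics(*fit(T_star, S_hat))    # refit with S_hat treated as known
    lower, upper = np.percentile(draws, [2.5, 97.5], axis=0)
    return lower, upper


# Hypothetical point estimates for 20 trials.
beta_hat = np.array([0.5, 0.4])
D_hat = np.array([[0.25, 0.2], [0.2, 0.25]])
S_hat = np.tile(np.array([[0.02, 0.005], [0.005, 0.02]]), (20, 1, 1))
lower, upper = parametric_bootstrap(beta_hat, D_hat, S_hat, moment_fit)
print(lower, upper)
```

The correlation is clipped to $[-1, 1]$ because the moment estimator, unlike the EM iteration, does not keep the covariance estimate inside the parameter space.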

4. SIMULATIONS

We evaluate the performance of the point estimates and the bootstrap CIs in small samples by simulations. We assume that both the surrogate and the clinical end points are binary. There are $k$ trials with heterogeneity in treatment effects, each of which has equal sample size. We generate the individual-level data on treatment and surrogate and clinical end points in 2 steps. In the first step, the treatment effects on the surrogate and the treatment effects on the clinical end point among the $k$ trials are generated from a bivariate normal distribution with mean $(0.5, 0.4)$, variance $(\sigma^2, \sigma^2)$, and covariance $\sigma_{12}$. In the second step, the treatment is randomly assigned to each participant with probability 0.5. The probabilities of the clinical end point and the surrogate are generated by the logistic functions in (2.1) and (2.2), using varying intercepts in each trial and the treatment effects.
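The two-step data generation can be sketched as follows. Several details are not specified in the text and are assumptions of this sketch, flagged in the comments: the distribution of the trial-specific intercepts, and drawing the surrogate and the clinical end point independently given treatment. The function name `simulate_trials` is hypothetical.

```python
import numpy as np


def simulate_trials(k, n, sigma2, rho, seed=0):
    """Simulate k trials of n // k participants each with binary end points."""
    rng = np.random.default_rng(seed)
    # Step 1: trial-level treatment effects, bivariate normal with
    # mean (0.5, 0.4), common variance sigma2, and covariance rho * sigma2.
    cov = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    effects = rng.multivariate_normal([0.5, 0.4], cov, size=k)
    m = n // k
    trials = []
    for t1, t2 in effects:
        # ASSUMPTION: intercepts drawn uniformly on (-1, 0); the text only
        # says that intercepts vary across trials.
        a0, g0 = rng.uniform(-1.0, 0.0, size=2)
        # Step 2: randomize treatment with probability 0.5.
        z = rng.integers(0, 2, size=m)
        # ASSUMPTION: surrogate and clinical end point are drawn
        # independently given treatment (no within-subject correlation).
        x = (rng.random(m) < 1 / (1 + np.exp(-(a0 + t1 * z)))).astype(int)
        y = (rng.random(m) < 1 / (1 + np.exp(-(g0 + t2 * z)))).astype(int)
        trials.append({"z": z, "x": x, "y": y})
    return effects, trials


effects, trials = simulate_trials(k=10, n=2000, sigma2=0.25, rho=0.8)
print(effects.shape, len(trials), trials[0]["x"][:10])
```

Each simulated trial can then be passed to the within-trial estimation step, and the resulting pairs of estimated effects to the REML step.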

To study the performance of the 3 estimators, we vary 4 attributes of the total variability: the number of trials $k$, which represents the amount of information for estimating between-trial variability; the total sample size $n$, which for a fixed $k$ controls the precision of the within-trial estimators; the between-trial variability $\sigma^2$, which measures the magnitude of heterogeneity of treatment effects; and the between-trial covariance $\sigma_{12}$, which determines the size of $\rho$. In Table 1, we evaluate the bias, the 95% coverage probability, and the type I error or power performance.

Table 1.

The small-sample performance of the 3 estimators in 1000 simulations. In each simulation, $k$ trials were generated, each of which has $n/k$ participants. The treatment effects for the surrogate and for the clinical end point among $k$ trials are generated by the bivariate normal distribution with mean (0.5, 0.4), variance ($\sigma^2$, $\sigma^2$), and correlation $\rho$. The numbers in bold font are the type I errors for respective parameters. CP, coverage probability

| | k = 10, n = 2000 | | | k = 10, n = 8000 | | | k = 20, n = 2000 | | | k = 20, n = 8000 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | $\lambda_0$ | $\lambda_1$ | $\rho$ | $\lambda_0$ | $\lambda_1$ | $\rho$ | $\lambda_0$ | $\lambda_1$ | $\rho$ | $\lambda_0$ | $\lambda_1$ | $\rho$ |
| $\sigma^2 = 0.25$, $\rho = 0.0$ | | | | | | | | | | | | |
| Bias | 0.003 | 0.051 | 0.024 | 0.005 | 0.028 | 0.015 | 0.029 | 0.007 | 0.029 | 0.016 | 0.007 | 0.001 |
| Likelihood 95% CP | 0.956 | 0.952 | 0.867 | 0.891 | 0.911 | 0.844 | 0.980 | 0.981 | 0.940 | 0.937 | 0.943 | 0.907 |
| Bootstrap 95% CP | 0.953 | 0.970 | 0.970 | 0.912 | 0.940 | 0.940 | 0.973 | 0.981 | 0.981 | 0.951 | 0.964 | 0.964 |
| Type I error/power | 0.168 | **0.030** | **0.030** | 0.322 | **0.060** | **0.060** | 0.171 | **0.019** | **0.019** | 0.469 | **0.036** | **0.036** |
| $\sigma^2 = 0.25$, $\rho = 0.4$ | | | | | | | | | | | | |
| Bias | −0.022 | 0.071 | 0.001 | −0.005 | 0.022 | 0.005 | −0.029 | 0.121 | 0.029 | 0.006 | 0.034 | 0.016 |
| Likelihood 95% CP | 0.964 | 0.948 | 0.873 | 0.906 | 0.902 | 0.832 | 0.978 | 0.970 | 0.942 | 0.947 | 0.944 | 0.892 |
| Bootstrap 95% CP | 0.963 | 0.984 | 0.983 | 0.920 | 0.935 | 0.935 | 0.966 | 0.981 | 0.978 | 0.972 | 0.966 | 0.950 |
| Type I error/power | 0.064 | 0.082 | 0.082 | 0.120 | 0.213 | 0.213 | 0.078 | 0.099 | 0.099 | 0.147 | 0.279 | 0.279 |
| $\sigma^2 = 0.25$, $\rho = 0.8$ | | | | | | | | | | | | |
| Bias | −0.009 | 0.019 | −0.046 | −0.012 | 0.021 | 0.008 | −0.011 | 0.018 | −0.067 | −0.006 | 0.018 | −0.002 |
| Likelihood 95% CP | 0.977 | 0.950 | 0.936 | 0.946 | 0.925 | 0.823 | 0.986 | 0.958 | 0.977 | 0.972 | 0.946 | 0.876 |
| Bootstrap 95% CP | 0.953 | 0.962 | 0.953 | 0.953 | 0.970 | 0.970 | 0.976 | 0.977 | 0.991 | 0.970 | 0.970 | 0.984 |
| Type I error/power | **0.047** | 0.296 | 0.296 | **0.053** | 0.747 | 0.747 | **0.023** | 0.229 | 0.229 | **0.030** | 0.893 | 0.893 |
| $\sigma^2 = 0.5$, $\rho = 0.0$ | | | | | | | | | | | | |
| Bias | 0.002 | 0.066 | 0.031 | 0.003 | 0.020 | 0.010 | 0.034 | 0.006 | $-3 \times 10^{-4}$ | 0.023 | 0.002 | $-5 \times 10^{-4}$ |
| Likelihood 95% CP | 0.941 | 0.923 | 0.853 | 0.903 | 0.905 | 0.848 | 0.967 | 0.962 | 0.904 | 0.944 | 0.945 | 0.907 |
| Bootstrap 95% CP | 0.936 | 0.944 | 0.944 | 0.899 | 0.941 | 0.941 | 0.957 | 0.953 | 0.953 | 0.940 | 0.965 | 0.965 |
| Type I error/power | 0.189 | **0.056** | **0.056** | 0.303 | **0.059** | **0.059** | 0.276 | **0.047** | **0.047** | 0.477 | **0.035** | **0.035** |
| $\sigma^2 = 0.5$, $\rho = 0.4$ | | | | | | | | | | | | |
| Bias | 0.009 | 0.059 | 0.007 | 0.003 | 0.013 | 0.008 | −0.066 | 0.113 | 0.027 | 0.010 | 0.011 | 0.005 |
| Likelihood 95% CP | 0.956 | 0.937 | 0.856 | 0.903 | 0.900 | 0.835 | 0.978 | 0.934 | 0.904 | 0.944 | 0.941 | 0.898 |
| Bootstrap 95% CP | 0.941 | 0.952 | 0.953 | 0.904 | 0.926 | 0.926 | 0.963 | 0.937 | 0.956 | 0.957 | 0.961 | 0.960 |
| Type I error/power | 0.074 | 0.159 | 0.159 | 0.129 | 0.223 | 0.223 | 0.114 | 0.181 | 0.181 | 0.169 | 0.294 | 0.294 |
| $\sigma^2 = 0.5$, $\rho = 0.8$ | | | | | | | | | | | | |
| Bias | −0.019 | 0.043 | 0.008 | −0.007 | 0.009 | 0.001 | −0.059 | 0.083 | −0.016 | −0.002 | 0.016 | −0.001 |
| Likelihood 95% CP | 0.982 | 0.940 | 0.848 | 0.948 | 0.914 | 0.846 | 0.992 | 0.922 | 0.914 | 0.974 | 0.947 | 0.877 |
| Bootstrap 95% CP | 0.967 | 0.982 | 0.994 | 0.930 | 0.948 | 0.920 | 0.943 | 0.937 | 0.986 | 0.961 | 0.964 | 0.953 |
| Type I error/power | **0.033** | 0.576 | 0.576 | **0.070** | 0.807 | 0.807 | **0.057** | 0.590 | 0.590 | **0.039** | 0.966 | 0.966 |

The biases of the 3 estimators appear to be generally small and go to 0 as the sample size increases. For $n = 2000$ and $k = 10$, the largest bias is 0.071, for $\hat\lambda_1$ when $\sigma^2 = 0.25$ and $\rho = 0.4$. The biases of $\hat\rho$ and $\hat\lambda_0$ are much smaller. When $n = 8000$, so that the within-trial variability is much smaller, the biases of all 3 estimators are negligible. A similar pattern is observed for the scenarios when $k = 20$. The number of trials, the magnitude of heterogeneity, and the size of $\rho$ have little influence on the bias performance.

We compare the 95% coverage probabilities computed by 2 methods: one is the proposed bootstrap method and the other is the likelihood-based method used in Burzykowski and others (2001), which inverts the information matrix of the estimated variance components and uses the delta method. In almost every setting, the likelihood-based method provides a smaller coverage for the correlation than the nominal value, while the coverages for $\lambda_0$ and $\lambda_1$ are much better behaved. The poor coverage for the correlation provided by the likelihood-based method is likely caused by the constraint $\rho^2 \le 1$ as well as a small number of trials. For the bootstrap method, the 95% coverage probabilities are generally acceptable, sometimes conservative, given the random variation of 1000 simulations, except in a few settings for $\lambda_0$, for example, ($\sigma^2 = 0.5$, $\rho = 0$, $n = 8000$, $k = 10$) and ($\sigma^2 = 0.5$, $\rho = 0.4$, $n = 8000$, $k = 10$). The main factor that impacts the performance of the coverage probability appears to be the number of trials: a larger $k$ generally provides a better coverage probability, as it should provide more information about the between-trial variance components.

The type I error of testing the hypothesis that a parameter is 0 appears to be satisfactory; see the numbers highlighted by the bold font in Table 1. Not surprisingly, the larger the effect size of $\rho$, the higher power one would get for testing $\rho = 0$ or $\lambda_1 = 0$. For the same effect size, the between-trial variability appears to have a major influence on the power, particularly when the sample size is small per trial. The reason is that a larger between-trial variability will make it easier to separate between-trial and within-trial variability. The increment of the sample size acts similarly by reducing the within-trial variability. A larger number of trials will generally help to estimate the between-trial variability when the total sample size is fixed, except in one setting where $\sigma^2 = 0.25$, $\rho = 0.8$, $n = 2000$. In this case, the increase of the number of trials is not able to offset the decrease of sample size in each trial.

In this particular example where both the surrogate and the clinical end points are binary, we also evaluate the potential of the likelihood approach using a GLMM. There are a number of procedures developed to avoid numerical integration involved in evaluating the likelihood in the presence of random effects. Among them, the popular penalized quasi-likelihood (PQL) method was chosen here since it is readily available in the S/R function glmmPQL() and it has been discussed in the meta-analysis context (Breslow and Clayton, 1993, van Houwelingen and others, 2002). Table 2 shows the percentage of converged simulations, the bias, and the 95% coverage probability among the converged simulations for the estimators computed by glmmPQL(). The simulation setup is exactly the same as that in Table 1. The convergence performance is poor when the sample size per trial is 100 or 200, so that it is hard to judge the bias and the coverage probability in these scenarios. The performance is much improved with the sample size per trial reaching 400 or 800, perhaps more realistic in meta-analysis, and the biases for all 3 estimators appear to be small. However, the performance of the 95% coverage probability is not satisfactory, with coverage below the nominal level under a majority of simulation settings. This is particularly the case for the variance estimators of $\hat\rho$ and $\hat\lambda_1$. These observations are consistent with the previous reports on difficulties of PQL for mixed models for binary data (Rodríguez and Goldman, 1995, Engel, 1998).

Table 2.

The small-sample performance of the 3 estimators computed by the glmmPQL() function in R. The simulation settings are exactly the same as those in Table 1. The bias and 95% coverage probability were evaluated among those simulations that attained convergence. CP, coverage probability

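| Setting | Statistic | k = 10, n = 2000: λ₀ | k = 10, n = 2000: λ₁ | k = 10, n = 2000: ρ | k = 10, n = 8000: λ₀ | k = 10, n = 8000: λ₁ | k = 10, n = 8000: ρ | k = 20, n = 2000: λ₀ | k = 20, n = 2000: λ₁ | k = 20, n = 2000: ρ | k = 20, n = 8000: λ₀ | k = 20, n = 8000: λ₁ | k = 20, n = 8000: ρ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| σ² = 0.25, ρ = 0.0 | % converged | 23.8 | | | 93.3 | | | 6.3 | | | 89.3 | | |
| | Bias | 0.032 | −0.006 | −0.008 | −0.001 | 0.014 | 0.010 | 0.057 | 0.007 | 6e-5 | 0.020 | 0.012 | 0.0001 |
| | 95% CP | 0.912 | 0.937 | 0.866 | 0.879 | 0.817 | 0.752 | 0.937 | 0.984 | 0.984 | 0.907 | 0.821 | 0.785 |
| σ² = 0.25, ρ = 0.4 | % converged | 16.6 | | | 89.3 | | | 4.7 | | | 71.4 | | |
| | Bias | 0.104 | −0.197 | −0.205 | −0.016 | 0.048 | 0.022 | 0.266 | −0.369 | −0.376 | −0.006 | 0.033 | 0.011 |
| | 95% CP | 0.892 | 0.813 | 0.903 | 0.908 | 0.802 | 0.739 | 0.617 | 0.596 | 0.979 | 0.940 | 0.959 | 0.814 |
| σ² = 0.25, ρ = 0.8 | % converged | 3 | | | 70.2 | | | 1.3 | | | 15.0 | | |
| | Bias | 0.178 | −0.383 | −0.471 | −0.043 | 0.081 | 0.072 | 0.471 | −0.799 | −0.800 | −0.005 | 0.005 | −0.005 |
| | 95% CP | 0.733 | 0.633 | 1.000 | 0.967 | 0.721 | 0.474 | 0.077 | 0.154 | 1.000 | 0.993 | 0.873 | 0.880 |
| σ² = 0.5, ρ = 0.0 | % converged | 59.4 | | | 99.1 | | | 33.5 | | | 99.8 | | |
| | Bias | 0.017 | 0.021 | 0.016 | 0.001 | 0.012 | 0.008 | 0.063 | −0.004 | −0.014 | 0.026 | 0.0002 | −0.001 |
| | 95% CP | 0.914 | 0.872 | 0.808 | 0.905 | 0.843 | 0.782 | 0.907 | 0.916 | 0.866 | 0.929 | 0.861 | 0.832 |
| σ² = 0.5, ρ = 0.4 | % converged | 45.5 | | | 97.9 | | | 21.5 | | | 98.0 | | |
| | Bias | 0.051 | −0.053 | −0.073 | −0.009 | 0.029 | 0.006 | 0.138 | −0.157 | −0.167 | −0.011 | 0.054 | 0.034 |
| | 95% CP | 0.925 | 0.846 | 0.831 | 0.933 | 0.839 | 0.779 | 0.907 | 0.916 | 0.866 | 0.956 | 0.855 | 0.809 |
| σ² = 0.5, ρ = 0.8 | % converged | 10.5 | | | 85.6 | | | 1.0 | | | 47.7 | | |
| | Bias | 0.089 | −0.130 | −0.162 | −0.021 | 0.046 | 0.022 | 0.250 | −0.231 | −0.279 | −0.009 | 0.017 | 0.007 |
| | 95% CP | 0.924 | 0.848 | 0.905 | 0.929 | 0.861 | 0.832 | 0.800 | 0.800 | 0.900 | 0.996 | 0.904 | 0.847 |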

5. DATA EXAMPLES

5.1. The EXPLORE study

As alluded to by Joffe and Greene (2009), if there is strong prior knowledge of subgroups among which treatment effects differ, we could use these subgroups as units of meta-analysis, based on the same conditions we elaborated for clinical centers. These analyses do not yield definitive evidence for evaluating prediction for a new trial, yet they provide useful information on surrogacy when only a single trial is available. In HIV prevention trials, behavioral risk groups are often considered to have heterogeneous prevention effects. We show an example from an HIV prevention trial below.

The EXPLORE study was a multisite two-arm randomized controlled phase IIb trial designed to test the effect of a behavioral intervention in preventing acquisition of HIV infection among 4295 men who have sex with men in the United States (The EXPLORE Study Team, 2004). In addition to the primary HIV infection end point, secondary end points on behavioral risk-taking were assessed at twice-yearly visits, including serodiscordant unprotected receptive anal (SDURA) intercourse, serodiscordant unprotected anal (SDUA) intercourse, and unprotected anal (UA) intercourse. The HIV infection rate was 18.2% (95% CI −4.7 to 36.0) lower in the intervention arm than in the standard arm (The EXPLORE Study Team, 2004).

One of the planned secondary analyses is to evaluate self-reported changes in risk behaviors as a correlate of protection against acquisition of HIV. The primary end point is censored time-to-infection, and the surrogate is a longitudinally collected behavior measure. Using our proposed methods, we assessed the correlation of 2 sets of treatment effects, one on time to HIV acquisition and the other on self-reported UA, SDUA, or SDURA, in subgroups defined by all combinations of alcohol consumption, depression, and noninjection-drug use (8 subgroups). These potential effect modifiers were prespecified before the primary analysis (The EXPLORE Study Team, 2004). Figure 1 shows the scatter plot of treatment effects on HIV acquisition in a Cox proportional hazards model against treatment effects on SDURA, the behavior measure most sensitive to the treatment among the 3, in a GEE model. The area of a circle is proportional to the size of the subgroup.

Fig. 1.

The EXPLORE study: the x-axis is the treatment effect (log odds ratio) on SDURA intercourse; the y-axis is the treatment effect (log hazard ratio) on time to HIV acquisition. Eight subgroups were formed based on alcohol consumption, depression, and noninjection-drug use. The sizes of circles are proportional to sample sizes in subgroups.

From Figure 1, there seems to be a linear pattern between the 2 sets of treatment effects, except for one outlying group, so SDURA might be a promising surrogate for HIV acquisition. However, our estimation yields ρ = 0.54 with 95% CI [−0.95, 0.96], λ₁ = 3.82 with 95% CI [−4.86, 5.22], and λ₀ = 0.5 with 95% CI [−1.13, 0.85]. None of these metrics is statistically significant. The CIs for ρ and λ₁ are unacceptably wide, possibly because the sample size within subgroups is insufficient or the heterogeneity of treatment effects is indeed small. If we perform a global test of varying treatment effects among the 8 subgroups, neither end point yields a significant p-value. This result suggests that there is not enough evidence to support SDURA as a surrogate end point. Apart from sample size, the lack of evidence may also stem from reporting bias in behavioral risk or merely from the fact that a single binary measure is not adequate to cover the entire profile of behavioral risk-taking in the past half-year.
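The text does not specify which global test of varying treatment effects was used. One common choice, shown here only as an assumed stand-in, is Cochran's Q statistic computed from subgroup-level estimates and their standard errors. The subgroup values below are hypothetical and are not the EXPLORE estimates.

```python
def cochran_q(estimates, std_errors):
    """Cochran's Q statistic for homogeneity of effects and its
    degrees of freedom (number of groups minus 1)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    return q, len(estimates) - 1


# Hypothetical log hazard ratios and standard errors for 8 subgroups.
log_hr = [-0.30, -0.10, -0.45, 0.05, -0.20, -0.35, 0.10, -0.15]
se = [0.25, 0.30, 0.40, 0.35, 0.28, 0.45, 0.50, 0.33]

q, df = cochran_q(log_hr, se)
# 14.07 is the 0.95 quantile of a chi-square distribution with 7 degrees
# of freedom; it applies only to this 8-subgroup example.
print(f"Q = {q:.2f} on {df} df; exceeds 14.07: {q > 14.07}")
```

With wide subgroup standard errors, Q tends to stay small even when the point estimates differ, which matches the interpretation that limited within-subgroup sample size may hide heterogeneity.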

5.2. A meta-analysis of advanced ovarian cancer trials

We also applied our method to the data from a meta-analysis of 4 randomized multicenter trials in advanced ovarian cancer (Ovarian Cancer Meta-analysis Project, 1991). These 4 trials compared 2 treatment modalities: cyclophosphamide plus cisplatin versus cyclophosphamide plus adriamycin plus cisplatin. The clinical end point is the survival time and the surrogate is the progression-free survival time. The data set was later updated to contain a minimum follow-up of 10 years in all trials (Ovarian Cancer Meta-analysis Project, 1998). Strong correlation (0.95) is observed between the survival time and the progression-free survival time.

A total of 50 clinical centers are available for analysis, including 2 smaller trials that are treated as 2 separate centers. These centers are used as units of the proposed meta-analysis. To allow within-center treatment effects to be estimated with reasonable precision, we first drop the centers with fewer than 3 patients on either treatment arm, then collapse the centers that have small sample sizes and similar treatment effects for progression-free survival time. The minimum sample size of a “center” was set to 50. We thus used 12 “centers” in the analysis, with sample sizes ranging from 54 to 274. The rationale for collapsing smaller sites is twofold: first, the method relies on a likelihood approximated by the asymptotic normal distribution; second, small sites yield large within-center variability, which leads to a lack of precision in estimating between-center variability.
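The exact collapsing rule is not spelled out, so the following Python sketch shows just one plausible way to implement the two steps. It assumes each center carries a preliminary treatment effect estimate for the surrogate, keeps centers that already meet the minimum size as they are, and greedily merges the remaining small centers in order of that estimate. The field names and the greedy rule are hypothetical.

```python
def center_size(c):
    return c["n_control"] + c["n_treated"]


def collapse_centers(centers, min_per_arm=3, min_size=50):
    """Drop centers with too few patients on either arm, then merge small
    centers with neighbouring effect estimates until each merged group
    has at least min_size patients (illustrative greedy rule)."""
    kept = [c for c in centers
            if c["n_control"] >= min_per_arm and c["n_treated"] >= min_per_arm]
    groups = [[c] for c in kept if center_size(c) >= min_size]
    small = sorted((c for c in kept if center_size(c) < min_size),
                   key=lambda c: c["effect"])
    merged, current, size = [], [], 0
    for c in small:
        current.append(c)
        size += center_size(c)
        if size >= min_size:
            merged.append(current)
            current, size = [], 0
    if current:
        # Leftover centers are folded into the last merged group if one
        # exists; otherwise they stay together as an undersized group.
        if merged:
            merged[-1].extend(current)
        else:
            merged.append(current)
    return groups + merged


# Hypothetical centers, for illustration only.
centers = [
    {"name": "A", "n_control": 2, "n_treated": 10, "effect": 0.1},
    {"name": "B", "n_control": 40, "n_treated": 40, "effect": 0.0},
    {"name": "C", "n_control": 10, "n_treated": 10, "effect": -0.3},
    {"name": "D", "n_control": 15, "n_treated": 15, "effect": -0.2},
    {"name": "E", "n_control": 12, "n_treated": 13, "effect": 0.4},
    {"name": "F", "n_control": 14, "n_treated": 14, "effect": 0.5},
]
for g in collapse_centers(centers):
    print([c["name"] for c in g], sum(center_size(c) for c in g))
```

Grouping by similarity of the surrogate effect, rather than at random, is meant to avoid averaging away the between-center heterogeneity that the analysis depends on.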

Figure 2 shows the treatment effects on the progression-free survival time and the treatment effects on the survival time among the 12 “centers.” There is clearly a linear trend between the 2 sets of effects. The correlation is high (0.989) with a narrow 95% CI [0.896, 0.999]. The slope is almost 1, with its 95% CI excluding 0. The intercept does not differ significantly from 0. These results suggest that the progression-free survival time is a valid surrogate for the survival time.
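To make the three metrics concrete, the sketch below computes a correlation, slope, and intercept from group-level effect estimates. It deliberately swaps in a simple unweighted method-of-moments estimator for the between-group covariance, not the restricted maximum likelihood step of the proposed procedure, and it assumes that λ₁ and λ₀ denote the slope and intercept of the clinical effect regressed on the surrogate effect. The function and argument names are hypothetical.

```python
import math


def moment_surrogacy_metrics(s, t, var_s, var_t, cov_st):
    """s, t: estimated effects on the surrogate and clinical end points.
    var_s, var_t, cov_st: within-group sampling variances and covariances.
    Returns None when a between-group variance estimate is not positive."""
    k = len(s)
    mean_s = sum(s) / k
    mean_t = sum(t) / k
    # Sample covariance matrix of the estimated effects (denominator k - 1).
    s11 = sum((a - mean_s) ** 2 for a in s) / (k - 1)
    s22 = sum((b - mean_t) ** 2 for b in t) / (k - 1)
    s12 = sum((a - mean_s) * (b - mean_t) for a, b in zip(s, t)) / (k - 1)
    # Remove the average within-group sampling covariance to approximate
    # the between-group covariance of the true effects.
    d11 = s11 - sum(var_s) / k
    d22 = s22 - sum(var_t) / k
    d12 = s12 - sum(cov_st) / k
    if d11 <= 0 or d22 <= 0:
        return None
    slope = d12 / d11
    return {
        "rho": d12 / math.sqrt(d11 * d22),
        "lambda1": slope,
        "lambda0": mean_t - slope * mean_s,
    }
```

A moment estimator of this kind can return a correlation outside [−1, 1] or no estimate at all when the within-group noise dominates, which is one practical reason to prefer a likelihood-based fit with constrained variance components.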

Fig. 2.

The ovarian cancer trials: the x-axis is the treatment effect (log hazard ratio) on time to progression or death; the y-axis is the treatment effect (log hazard ratio) on time to death. Twelve groups were formed by collapsing small clinical sites to have minimal within-group sample size 50. The sizes of circles are proportional to sample sizes in the groups.

Burzykowski and others (2001) assessed the surrogacy of the progression-free survival time, using copula models to describe the correlation between the survival time and the progression-free survival time. They considered a 2-step estimating procedure, though they used the Fisher information of the likelihood to compute the CI. They also excluded centers with fewer than 3 patients per arm, though they did not collapse small sites. In our analysis, if we do not collapse small centers, the CIs are markedly wider, for instance, ρ̂ = 0.67 with 95% CI [−0.21, 0.99], whereas their analysis yields R²_trial = 0.95 [0.76, 1.14] in one of their models. The discrepancy may come from several sources: our estimating equation approach differs from, though is more general than, the copula models they used; and we used a parametric bootstrap to obtain the CI, which is perhaps better suited to a limited number of groups than Wald CIs based on the Fisher information. The latter approach can yield an upper CI bound above 1.
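The last point can be illustrated with a small Python sketch that contrasts a Wald interval with a percentile interval from a parametric bootstrap. The bootstrap here is heavily simplified: it draws standardized group-level effects from a bivariate normal distribution with the estimated correlation and ignores within-group estimation error, so it is not the full procedure proposed in this paper. The standard error, number of groups, and function names are hypothetical.

```python
import math
import random


def wald_interval(estimate, se, z=1.96):
    return estimate - z * se, estimate + z * se


def sample_corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))


def bootstrap_corr_interval(rho_hat, k, n_boot=2000, seed=1):
    """Percentile interval from a simplified parametric bootstrap."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        xs, ys = [], []
        for _ in range(k):
            x = rng.gauss(0.0, 1.0)
            e = rng.gauss(0.0, 1.0)
            xs.append(x)
            ys.append(rho_hat * x + math.sqrt(1.0 - rho_hat ** 2) * e)
        reps.append(sample_corr(xs, ys))
    reps.sort()
    return reps[int(0.025 * n_boot)], reps[int(0.975 * n_boot) - 1]


print(wald_interval(0.95, 0.10))             # upper limit exceeds 1
print(bootstrap_corr_interval(0.95, k=12))   # stays within [-1, 1]
```

Because every bootstrap replicate is itself a correlation, the percentile interval cannot leave the parameter space, whereas the symmetric Wald interval can when the estimate is close to the boundary.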

6. DISCUSSION

The main contribution of our paper is that the proposed estimation method, based on estimating equations for marginal treatment effects, circumvents the difficulty of modeling the within-subject correlation of binary and survival outcomes in the meta-analytic approach. We propose a parametric bootstrap procedure to construct CIs, which is particularly suited to the small numbers of subgroups or trials involved in meta-analysis. Such a benefit can also be attained by applying a Bayesian approach like that of Daniels and Hughes (1997) to trial-level treatment effect estimates after the first step of our procedure.

One challenge to the meta-analytic evaluation of surrogacy is to have a number of previous trials on the study drug or “similar drugs” that presumably act in the same biological pathway through the candidate surrogate. If there is significant heterogeneity among study populations, and a new trial is likely to be conducted in the same centers, it may be useful to use these clinical centers as units of meta-analysis, admitting that they do not provide the same level of evidence on surrogacy as multiple trials. In HIV prevention trials funded by National Institutes of Health clinical trial networks, study sites can be quite heterogeneous, for example, sites in the United States and sites in Africa, representing different sexual risk groups and regional epidemics, and these sites are likely to be used again in the next prevention trial. In this context, it remains of interest to evaluate the potential of clinical sites as the units of the meta-analysis.

FUNDING

Fred Hutchinson Cancer Research Center Faculty Development Fund to J.Y.D.; grants from the National Institutes of Health (R01 AI089341, U01 AI068615, U01AI46749).

Acknowledgments

We are grateful to the editor and reviewers for constructive comments. We thank Marc Buyse for providing the ovarian cancer data and Peter Gilbert for helpful discussion. We also thank Lin Gu for the help in simulation codes. Conflict of Interest: None declared.

Two grants used for the funding of this paper were originally missed from the funding information. They have now been added.

References

  1. Alonso A, Geys H, Molenberghs G. A unifying approach for surrogate marker validation based on Prentice's criteria. Statistics in Medicine. 2006;25:205–221. doi: 10.1002/sim.2315.
  2. Alonso A, Molenberghs G. Surrogate marker evaluation from an information theory perspective. Biometrics. 2006;63:180–186. doi: 10.1111/j.1541-0420.2006.00634.x.
  3. Alonso A, Molenberghs G, Burzykowski T, Renard D, Geys H, Shkedy Z, Tibaldi F, Abrahantes J, Buyse M. Prentice's approach and the meta-analytic paradigm: a reflection on the role of statistics in the evaluation of surrogate endpoints. Biometrics. 2004;60:724–728. doi: 10.1111/j.0006-341X.2004.00222.x.
  4. Breslow NE, Clayton DG. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association. 1993;88:9–25.
  5. Breslow NE, Day NE. Statistical Methods in Cancer Research. Volume 1. The Analysis of Case-Control Studies. Lyon: IARC Scientific Publications; 1980.
  6. Burzykowski T, Molenberghs G, Buyse M, Geys H, Renard D. Validation of surrogate end points in multiple randomized clinical trials with failure time end points. Applied Statistics. 2001;50:405–422.
  7. Buyse M, Molenberghs G. The validation of surrogate endpoints in randomization experiments. Biometrics. 1998;54:1014–1029.
  8. Buyse M, Molenberghs G, Burzykowski T, Renard D, Geys H. The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics. 2000;1:49–67. doi: 10.1093/biostatistics/1.1.49.
  9. Casella G, Berger RL. Statistical Inference. Pacific Grove, CA: Duxbury; 2002.
  10. Daniels MJ, Hughes MD. Meta-analysis for the evaluation of potential surrogate markers. Statistics in Medicine. 1997;16:1965–1982. doi: 10.1002/(sici)1097-0258(19970915)16:17<1965::aid-sim630>3.0.co;2-m.
  11. Engel B. A simple illustration of the failure of PQL, IRREML and APHL as approximate ML methods for mixed models for binary data. Biometrical Journal. 1998;40:141–154.
  12. Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Annals of Internal Medicine. 1996;125:605–613. doi: 10.7326/0003-4819-125-7-199610010-00011.
  13. Frangakis CE, Rubin DB. Principal stratification in causal inference. Biometrics. 2002;58:21–29. doi: 10.1111/j.0006-341x.2002.00021.x.
  14. Freedman LS, Graubard BI, Schatzkin A. Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine. 1992;11:167–178. doi: 10.1002/sim.4780110204.
  15. Gail MH, Pfeiffer R, Van Houwelingen HC, Carroll RJ. On meta-analytic assessment of surrogate outcomes. Biostatistics. 2000;1:231–246. doi: 10.1093/biostatistics/1.3.231.
  16. Ghosh D, Elliott MR, Taylor JMG. Links between analysis of surrogate endpoints and endogeneity. Statistics in Medicine. 2010;29:2869–2879. doi: 10.1002/sim.4027.
  17. Gilbert PB, Hudgens MG. Evaluating candidate principal surrogate endpoints. Biometrics. 2008;64:1146–1154. doi: 10.1111/j.1541-0420.2008.01014.x.
  18. Joffe MM, Greene T. Related causal frameworks for surrogate outcomes. Biometrics. 2009;65:530–538. doi: 10.1111/j.1541-0420.2008.01106.x.
  19. Laird NM. Computation of variance components using the EM algorithm. Journal of Statistical Computation and Simulation. 1982;14:295–303.
  20. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974.
  21. Li Y, Taylor JMG. Predicting treatment effects using biomarker data in a meta-analysis of clinical trials. Statistics in Medicine. 2010;29:1875–1889. doi: 10.1002/sim.3931.
  22. Li Y, Taylor JMG, Elliott MR. A Bayesian approach to surrogacy assessment using principal stratification in clinical trials. Biometrics. 2010;66:523–531. doi: 10.1111/j.1541-0420.2009.01303.x.
  23. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
  24. Lin DY, Wei LJ. The robust inference for the Cox proportional hazard models. Journal of American Statistical Association. 1989;84:1074–1078.
  25. McCullagh P. Quasi-likelihood functions. Annals of Statistics. 1983;11:59–67.
  26. Newey WK, Powell J. Efficient estimation of linear and type I censored regression models under conditional quantile restrictions. Econometric Theory. 1990;6:295–317.
  27. Ovarian Cancer Meta-analysis Project. Cyclophosphamide plus cisplatin versus cyclophosphamide, doxorubicin, and cisplatin chemotherapy of ovarian carcinoma: a meta-analysis. Journal of Clinical Oncology. 1991;9:1668–1674. doi: 10.1200/JCO.1991.9.9.1668.
  28. Ovarian Cancer Meta-analysis Project. Cyclophosphamide plus cisplatin versus cyclophosphamide, doxorubicin, and cisplatin chemotherapy of ovarian carcinoma: a meta-analysis. Classic Papers and Current Comments. 1998;3:237–243.
  29. Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine. 1989;8:431–440. doi: 10.1002/sim.4780080407.
  30. Renard D, Geys H, Molenberghs G, Burzykowski T, Buyse M. Validation of surrogate endpoints in multiple randomized clinical trials with discrete outcomes. Biometrical Journal. 2002;44:921–935.
  31. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of American Statistical Association. 1994;89:846–866.
  32. Rodríguez G, Goldman N. An assessment of estimation procedures for multilevel models with binary responses. Journal of the Royal Statistical Society A. 1995;158:73–89.
  33. Rolan P. The contribution of clinical pharmacology surrogates and models to drug development - a critical appraisal. British Journal of Clinical Pharmacology. 1997;44:219–225. doi: 10.1046/j.1365-2125.1997.t01-1-00583.x.
  34. Rosenbaum PR. The consequences of adjustment for a concomitant variable that has been affected by treatment. The Journal of the Royal Statistical Society, Series A. 1984;47:656–666.
  35. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology. 1974;66:688–701.
  36. Rubin DB. Bayesian inference for causal effects: the role of randomization. The Annals of Statistics. 1978;6:34–58.
  37. The EXPLORE Study Team. Effects of a behavioural intervention to reduce acquisition of HIV infection among men who have sex with men: the explore randomised controlled study. Lancet. 2004;364:41–50. doi: 10.1016/S0140-6736(04)16588-4.
  38. van Houwelingen HC, Arends LR, Stijnen T. Advanced methods in meta-analysis: multivariate approach and meta-regression. Statistics in Medicine. 2002;21:589–624. doi: 10.1002/sim.1040.
  39. Weir CJ, Walley RJ. Statistical evaluation of biomarkers as surrogate endpoints: a literature review. Statistics in Medicine. 2006;25:183–203. doi: 10.1002/sim.2319.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
