Author manuscript; available in PMC: 2024 Jun 1.
Published in final edited form as: Biometrics. 2021 Dec 7;79(2):799–810. doi: 10.1111/biom.13600

Testing for Heterogeneity in the Utility of a Surrogate Marker

Layla Parast 1, Tianxi Cai 2, Lu Tian 3
PMCID: PMC9170832  NIHMSID: NIHMS1756809  PMID: 34874550

Abstract

In studies that require long-term and/or costly follow-up of participants to evaluate a treatment, there is often interest in identifying and using a surrogate marker to evaluate the treatment effect. While several statistical methods have been proposed to evaluate potential surrogate markers, available methods generally do not account for or address the potential for a surrogate to vary in utility or strength by patient characteristics. Previous work examining surrogate markers has indicated that there may be such heterogeneity i.e., that a surrogate marker may be useful (with respect to capturing the treatment effect on the primary outcome) for some subgroups, but not for others. This heterogeneity is important to understand, particularly if the surrogate is to be used in a future trial to replace the primary outcome. In this paper, we propose an approach and estimation procedures to measure the surrogate strength as a function of a baseline covariate W and thus, examine potential heterogeneity in the utility of the surrogate marker with respect to W. Within a potential outcome framework, we quantify the surrogate strength/utility using the proportion of treatment effect on the primary outcome that is explained by the treatment effect on the surrogate. We propose testing procedures to test for evidence of heterogeneity, examine finite sample performance of these methods via simulation, and illustrate the methods using AIDS clinical trial data.

Keywords: heterogeneity, kernel methods, nonparametric methods, potential outcomes, surrogate marker, treatment effect

1. Introduction

For many clinical outcomes, randomized clinical trials to evaluate the effectiveness of a treatment often require measuring a primary outcome that is expensive, invasive and/or requires long-term follow-up of participants. In such settings, there is substantial interest in identifying and using surrogate markers - measurements or outcomes measured at an earlier time or with less cost that are predictive of the primary clinical outcome of interest - to evaluate the treatment effect. Several statistical methods have been proposed to evaluate potential surrogate markers including parametric and nonparametric methods (Prentice, 1989; Freedman et al., 1992; Lin et al., 1997; Wang and Taylor, 2002; Parast et al., 2016), methods within a principal stratification framework (Gilbert and Hudgens, 2008; Joffe and Greene, 2009; Conlon et al., 2014; Huang and Gilbert, 2011), and methods for a meta-analytic setting (Daniels and Hughes, 1997; Renard et al., 2002; Burzykowski et al., 2001, 2005) i.e., where information from multiple trials is available.

However, currently available methods generally do not account for or address the potential for a surrogate to vary in strength or utility by certain patient characteristics. Previous work examining surrogate markers in clinical trials has indicated that there may be such heterogeneity in the utility of a surrogate marker (Lin et al., 1993; Cohen and Lindsell, 2012; Wang-Lopez et al., 2015; Spieker and Huang, 2017). That is, a surrogate marker may be useful (with respect to capturing the treatment effect on the primary outcome) for some subgroups, but not useful for others. This heterogeneity is important to understand, particularly if the surrogate is to be used in a future trial to potentially replace the primary outcome (Parast et al., 2019; Price et al., 2018). With respect to heterogeneity in the average treatment effect itself, there has certainly been an extensive amount of work done to provide approaches to assess and test for such heterogeneity (Crump et al., 2008; Willke et al., 2012; Wager and Athey, 2018). However, heterogeneity in the utility of a surrogate marker is more complex as it involves assessing heterogeneity in not only the average treatment effect on the primary outcome, but also potential heterogeneity in both the treatment effect on the surrogate and the relationship between the surrogate and the primary outcome. To our knowledge, there are no methods to assess and rigorously test for such potential heterogeneity with respect to a baseline covariate in the surrogate marker setting.

Our goal is to develop methods to examine and test for heterogeneity in the strength of a surrogate marker. As a measure of surrogate strength, we focus on the proportion of the treatment effect on the primary outcome that is explained by the treatment effect on the surrogate marker, denoted as RS, within a potential outcome framework (Freedman et al., 1992; Wang and Taylor, 2002; Parast et al., 2016). While limitations of this metric have been discussed, we focus on it due to its widespread use in practice when examining a surrogate within a single study and thus, methods to assess heterogeneity within this context would likely be most useful in this area (Lin et al., 1997; VanderWeele, 2013; Inker et al., 2016; Agyemang et al., 2018; Sprenger et al., 2020). As an example of potential heterogeneity, consider the use of change in CD4 cell count as a surrogate marker for plasma HIV-1 RNA, which is of interest because RNA is relatively expensive to obtain (Calmy et al., 2007). In our application in this paper to AIDS clinical trial data, we show that the proportion of the treatment effect on RNA that is explained by CD4 varies significantly by baseline CD4 level, ranging from 50–70% for lower baseline CD4 counts and from 10–20% for higher baseline CD4 counts. If CD4 were to be used in the future to make inference about the treatment effect on RNA without regard to these differences by baseline CD4, such inference could lead to inaccurate conclusions about the treatment effect.

In this paper, we first propose an approach and estimation procedures to measure the surrogate strength as a function of a baseline covariate W and thus, examine potential heterogeneity in the utility of the surrogate marker with respect to W. We then propose testing procedures, both an omnibus test and a trend-based test, to test for evidence of heterogeneity. We examine the performance of these methods using a simulation study and illustrate the methods using data from an AIDS clinical trial. We focus on a continuous W but additionally propose and illustrate methods for settings where W is discrete.

2. Notation, Setting and Assumptions

2.1. Notation and Setting

Let Y denote the primary outcome, S denote the surrogate marker, Z denote the treatment indicator where treatment is randomized and Z ∈ {0, 1} (i.e., treatment vs. control), and W denote a single continuous baseline covariate of interest. We use potential outcomes notation where each person has potential outcomes {Y(1), Y(0), S(1), S(0)}, where Y(g) is the outcome when Z = g and S(g) is the surrogate when Z = g. Importantly, a potential outcomes framework is useful here in order to understand what assumptions will be required (see Section 2.2).

Throughout, we focus on assessing the utility of the surrogate marker using the proportion of treatment effect explained quantity. To define this quantity, we first define the overall treatment effect as:

$$\Delta = E(Y(1) - Y(0)) = E(Y(1)) - E(Y(0)).$$

Following Wang and Taylor (2002) and Parast et al. (2016), the “residual” treatment effect is defined as

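$$\begin{aligned}
\Delta_S &= \int E(Y(1) - Y(0) \mid S(1) = S(0) = s)\, dF_{S(0)}(s) && (1)\\
&= \int E(Y(1) \mid S(1) = s)\, dF_{S(0)}(s) - \int E(Y(0) \mid S(0) = s)\, dF_{S(0)}(s) && (2)
\end{aligned}$$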

where FS(0)(·) is the marginal cumulative distribution function of S(0) and we have made the assumption that Y(1) ⊥ S(0) | S(1) and Y(0) ⊥ S(1) | S(0) for identifiability. This quantity, ΔS, captures the leftover treatment effect on the primary outcome, after accounting for the treatment effect on the surrogate marker. Informally, it reflects the expected treatment effect if the surrogate marker distribution were forced to be equal in both groups, where we have selected the reference distribution for the surrogate marker to be the distribution in the control group. The proportion of the treatment effect on the primary outcome that is explained by the treatment effect on S is then defined as RS = 1 − ΔS/Δ.

2.2. Assumptions

Throughout, we require the following assumptions which parallel assumptions that are often required when evaluating surrogate markers in general (Wang and Taylor, 2002; Taylor et al., 2005; Parast et al., 2017):

  • (C1) μ1(s, w) is monotone increasing in s, where μg(s, w) = E(Y(g) | S(g) = s, W = w)

  • (C2) P(S(1) > s | W = w) ≥ P(S(0) > s | W = w) for all s and w

  • (C3) μ1(s, w) ≥ μ0(s, w) for all s and w

  • (C4) S is a continuous random variable with finite support over an interval [a, b] and S(0) | W = w and S(1) | W = w have the same support.

  • (C5) Y(1) ⊥ S(0) | S(1), W and Y(0) ⊥ S(1) | S(0), W

Assumption (C1) implies that the surrogate marker is positively associated with the primary outcome; (C2) implies that there is a non-negative treatment effect on the surrogate marker for the subgroup of patients with the same covariate W; and (C3) implies that there is a non-negative effect of treatment on the primary outcome beyond that on the surrogate marker in the subgroup of patients with the same covariate W. These assumptions guard against a surrogate paradox situation (i.e., when the treatment has a positive effect on the surrogate, the surrogate and primary outcome are positively associated, but the treatment in fact has a negative effect on the primary outcome) within any subgroup of patients with the same covariate (VanderWeele, 2013). Without loss of generality, Assumptions (C2) and (C3) are stated under the assumption that higher values for the surrogate and the primary outcome are “better”; if in fact, lower values were “better”, these assumptions should be adjusted to reflect non-positive treatment effects. Assumption (C4) is needed for our kernel-based estimation approach. In general, Assumptions (C1)-(C3) can be effectively examined empirically. It can be difficult to verify Assumption (C4) with moderate sample sizes. Assumption (C5) is not directly testable from observed data. However, it is only required for ensuring that the defined “residual” treatment effect as a function of W has the desired causal interpretation. We discuss these assumptions further in Web Appendix A.

3. Assessing Heterogeneity

3.1. Approach

Assume there is interest in examining heterogeneity with respect to the baseline covariate W. Our goal is to define and estimate RS and thus, ΔS and Δ, as a function of W. Let

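$$\Delta(w) = E(Y(1) \mid W = w) - E(Y(0) \mid W = w), \quad \text{and}$$

$$\begin{aligned}
\Delta_S(w) &= \int E(Y(1) - Y(0) \mid S(1) = S(0) = s, W = w)\, dF_{S(0)}(s \mid w)\\
&= \int \mu_1(s, w)\, dF_{S(0)}(s \mid w) - \int \mu_0(s, w)\, dF_{S(0)}(s \mid w)
\end{aligned}$$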

where FS(g)(· | w) is the cumulative distribution function of S(g) given W = w and the second equality follows from Assumption (C5). Then we may define RS(w) = 1 − ΔS(w)/Δ(w) as a measure of surrogate strength for individuals with the same baseline covariate W = w. In the next section we propose a nonparametric estimation procedure to estimate each of these quantities so that they may be examined as a function of w.

It is interesting to consider what types of relationships between Y, S, and W would imply heterogeneity in the utility of the surrogate. When W is associated with Y but the association does not differ by treatment group, and W is not associated with S, one would not expect there to be heterogeneity. However, when these same associations hold, but the association between W and Y does differ by treatment group, one would expect heterogeneity; that is, even when W is not associated with S itself, treatment effect heterogeneity oftentimes results in heterogeneity in the utility of the surrogate. When the complexities of these associations increase i.e., W is associated with S and this association may or may not differ by treatment group, one would also expect heterogeneity but the level of heterogeneity (e.g. large differences vs. small differences) is difficult to determine partly due to the construction of our measure of interest as a ratio. We gain more insight into such settings in our simulation study in Section 5.

Remark 1.

When considering use of the surrogate in a future study, it is of interest to note that the average residual treatment effect, E{ΔS(W)}, is not the same as ΔS in (2) when there is heterogeneity in the utility of the surrogate. The former measures the treatment effect on the surrogate as E{μ1(S(1), W)} − E{μ1(S(0), W)}, while the latter measures it as $E\{\tilde{\mu}_1(S(1))\} - E\{\tilde{\mu}_1(S(0))\}$, where $\tilde{\mu}_g(s) = E(Y(g) \mid S(g) = s)$, and when there is heterogeneity, $\mu_1(S(g), W) \neq \tilde{\mu}_1(S(g))$. This will be important to consider if one is using the surrogate marker to estimate and test for a treatment effect on the primary outcome in a future study; we further discuss the impact of heterogeneity on testing in a future study in Section 7.

3.2. Nonparametric estimation

We propose a nonparametric estimation method for Δ(w), ΔS(w), and RS(w) involving two-dimensional smoothing over (S, W). The observed data consist of {Ygi, Sgi, Wgi} for person i in treatment group g; let ng denote the number of individuals in treatment group g. First, since the average treatment effect Δ can be estimated simply using $\hat{\Delta} = n_1^{-1}\sum_{i=1}^{n_1} Y_{1i} - n_0^{-1}\sum_{i=1}^{n_0} Y_{0i}$, we propose to estimate Δ(w) as

$$\hat{\Delta}(w) = \hat{\mu}_1(w) - \hat{\mu}_0(w)$$

where

$$\hat{\mu}_g(w) = \frac{\sum_{i=1}^{n_g} K_{h_g}(W_{gi} - w)\, Y_{gi}}{\sum_{i=1}^{n_g} K_{h_g}(W_{gi} - w)}, \quad g = 0, 1,$$

K(·) is a smooth symmetric density function with finite support, Kh(·) = K(·/h)/h, and h1 and h0 are bandwidths, which may be data dependent. To avoid a need for bias correction in subsequent statistical inference, we utilize undersmoothing and select all bandwidths throughout to be of order $O(n^{-\epsilon})$, ϵ ∈ (1/5, 1/2), where n = n1 + n0. Here we assume that $\pi_j = \lim_{n \to \infty} n_j/n \in (0, 1)$, j = 0, 1.
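The kernel-weighted means above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the Epanechnikov kernel is one possible choice of a symmetric density with finite support, and the function names (`epanechnikov`, `k_h`, `mu_hat`, `delta_hat`) and the toy data are assumptions introduced here.

```python
import numpy as np

def epanechnikov(u):
    """Symmetric kernel density supported on [-1, 1] (an assumed choice of K)."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def k_h(u, h):
    """Scaled kernel K_h(u) = K(u / h) / h."""
    return epanechnikov(np.asarray(u, dtype=float) / h) / h

def mu_hat(w, W, Y, h):
    """Kernel-weighted mean of Y among observations with W near w."""
    weights = k_h(np.asarray(W, dtype=float) - w, h)
    total = weights.sum()
    if total == 0.0:
        raise ValueError("no observations within bandwidth h of w")
    return float(np.sum(weights * np.asarray(Y, dtype=float)) / total)

def delta_hat(w, W1, Y1, W0, Y0, h1, h0):
    """Estimate Delta(w) as mu_hat_1(w) - mu_hat_0(w)."""
    return mu_hat(w, W1, Y1, h1) - mu_hat(w, W0, Y0, h0)

# Hypothetical toy data: the outcome is 5 under treatment and 2 under control,
# so the conditional treatment effect is 3 at every w.
W1 = np.linspace(0.0, 2.0, 41)
W0 = np.linspace(0.0, 2.0, 31)
Y1 = np.full(W1.shape, 5.0)
Y0 = np.full(W0.shape, 2.0)
print(delta_hat(1.0, W1, Y1, W0, Y0, h1=0.3, h0=0.3))
```

With a finite-support kernel, the denominator is zero when no observation lies within one bandwidth of w, which is why the sketch raises an error in that case instead of returning a silent NaN.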

For the residual treatment effect, noting that without W, the quantity ΔS can be estimated by

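$$\hat{\Delta}_S = n_0^{-1}\sum_{i=1}^{n_0}\left\{\frac{\sum_{j=1}^{n_1} K_{\tilde{h}}(S_{1j} - S_{0i})\, Y_{1j}}{\sum_{j=1}^{n_1} K_{\tilde{h}}(S_{1j} - S_{0i})}\right\} - n_0^{-1}\sum_{i=1}^{n_0} Y_{0i}, \qquad (3)$$

with an appropriate smoothing bandwidth $\tilde{h}$, we propose to estimate ΔS(w) using two-dimensional smoothing as

$$\hat{\Delta}_S(w) = \hat{\mu}_{10}(w) - \hat{\mu}_0(w),$$

where $\hat{\mu}_{10}(w) = \int \hat{\mu}_1(s, w)\, d\hat{F}_{S(0)}(s \mid w)$,

$$\hat{F}_{S(0)}(s \mid w) = \frac{\sum_{i=1}^{n_0} K_{h_2}(W_{0i} - w)\, I(S_{0i} \le s)}{\sum_{i=1}^{n_0} K_{h_2}(W_{0i} - w)},$$

$$\text{and} \quad \hat{\mu}_1(s, w) = \frac{\sum_{i=1}^{n_1} K_{h_3}(S_{1i} - s)\, K_{h_4}(W_{1i} - w)\, Y_{1i}}{\sum_{i=1}^{n_1} K_{h_3}(S_{1i} - s)\, K_{h_4}(W_{1i} - w)}$$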

are nonparametric smoothed estimators of the conditional cumulative distribution of S(0) given W = w, and the conditional expectation of Y(1) given (S(1), W) = (s, w), respectively. Similarly, all bandwidths h2, h3, and h4 are undersmoothed to eliminate the need for bias correction. Finally, we define a nonparametric estimate of RS(w) as $\hat{R}_S(w) = 1 - \hat{\Delta}_S(w)/\hat{\Delta}(w)$. In Web Appendix B, we propose parallel estimation procedures and an omnibus test for the case when W is discrete.
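The two-dimensional smoothing step can be sketched as follows. Because the estimated conditional distribution of S(0) given W = w is a weighted empirical distribution, the integral defining the plug-in mean reduces to a weighted average of the smoothed treated-group mean evaluated at the observed control surrogate values. This is a hedged sketch only: it swaps in a Gaussian kernel (which does not have the finite support the text requires) purely to keep every denominator positive, and the function names and the `bw` dictionary of bandwidths are hypothetical.

```python
import numpy as np

def gauss_kh(u, h):
    """Gaussian K_h(u) = K(u / h) / h; a stand-in for a finite-support kernel."""
    u = np.asarray(u, dtype=float)
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def nw_mean(w, W, Y, h):
    """Kernel-weighted mean of Y at W = w."""
    k = gauss_kh(np.asarray(W, dtype=float) - w, h)
    return float(np.sum(k * np.asarray(Y, dtype=float)) / np.sum(k))

def delta_s_hat(w, S1, W1, Y1, S0, W0, Y0, bw):
    """Estimate Delta_S(w) = mu_10_hat(w) - mu_0_hat(w).

    bw is a dict with keys 'h0', 'h2', 'h3', 'h4' (assumed layout).
    """
    S1, W1, Y1 = (np.asarray(a, dtype=float) for a in (S1, W1, Y1))
    S0, W0, Y0 = (np.asarray(a, dtype=float) for a in (S0, W0, Y0))
    kw1 = gauss_kh(W1 - w, bw["h4"])                     # shape (n1,)
    ks = gauss_kh(S1[None, :] - S0[:, None], bw["h3"])   # shape (n0, n1)
    mu1_at_s0 = (ks @ (kw1 * Y1)) / (ks @ kw1)           # mu_1_hat(S_0i, w)
    jumps = gauss_kh(W0 - w, bw["h2"])
    jumps = jumps / jumps.sum()                          # jumps of F_hat(. | w)
    mu10 = float(np.sum(jumps * mu1_at_s0))
    return mu10 - nw_mean(w, W0, Y0, bw["h0"])

def r_s_hat(w, S1, W1, Y1, S0, W0, Y0, bw):
    """Estimate R_S(w) = 1 - Delta_S_hat(w) / Delta_hat(w); bw also needs 'h1'."""
    d_s = delta_s_hat(w, S1, W1, Y1, S0, W0, Y0, bw)
    d = nw_mean(w, W1, Y1, bw["h1"]) - nw_mean(w, W0, Y0, bw["h0"])
    return 1.0 - d_s / d

# Hypothetical toy data with a two-point surrogate and every W equal to 1.
S1 = np.array([0.0, 0.0] + [10.0] * 8)
S0 = np.array([0.0, 10.0] * 5)
W1 = np.ones(10)
W0 = np.ones(10)
Y1 = 1.0 + S1      # treated outcome: surrogate plus a direct effect of 1
Y0 = S0.copy()     # control outcome: surrogate only
bw = {"h0": 0.5, "h1": 0.5, "h2": 0.5, "h3": 0.1, "h4": 0.5}
print(delta_s_hat(1.0, S1, W1, Y1, S0, W0, Y0, bw))
print(r_s_hat(1.0, S1, W1, Y1, S0, W0, Y0, bw))
```

In the toy data the treatment raises the outcome by 1 directly and by 3 through the surrogate, so the residual effect is 1 out of a total effect of 4.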

Remark 2.

This estimator of the residual treatment effect depends on consistent estimators of μ1(s, w), FS(0)(s | w), and μg(w), g = 0, 1. One could alternatively replace the estimation of μ1(s, w) with the estimation of a quantity akin to the “propensity score” fS(0)(s | w)/fS(1)(s | w), where fS(g)(s | w) = dFS(g)(s | w)/ds, g = 0, 1. Specifically, ΔS(w) could alternatively be estimated by

$$\frac{\sum_{i=1}^{n_1} K_{\bar{h}}(W_{1i} - w)\, \hat{r}_{01}(S_{1i} \mid w)\, Y_{1i}}{\sum_{i=1}^{n_1} K_{\bar{h}}(W_{1i} - w)} - \hat{\mu}_0(w),$$

where $\hat{r}_{01}(s \mid w)$ is a consistent estimator of $r_{01}(s \mid w) = f_{S(0)}(s \mid w)/f_{S(1)}(s \mid w)$ and $\bar{h}$ is a smoothing bandwidth. The nonparametric estimation of r01(s | w) still requires two-dimensional smoothing over both s and w.

Remark 3.

If the dimension of covariate W is greater than one, denoted as the vector W in this remark, multi-dimensional smoothing utilized within the nonparametric estimation approach for $\hat{\Delta}_S(w)$ may not work well when the sample size is not very large due to the curse of dimensionality. As an alternative, we propose a set of semiparametric models to allow for assessment of heterogeneity. Specifically, one could consider assuming the following: (1) a varying coefficient model for $\mu_1(s, w) = g_Y\{\beta_1(s)^{\top}\bar{w}\}$, where $\bar{w} = (1, w^{\top})^{\top}$, gY(·) is a known, strictly increasing link function and β1(s) is the unknown function of s. This model allows the effect of W on Y to vary over s. Let the estimator of β1(s) and μ1(s, w) be denoted by $\hat{\beta}_1(s)$ and $g_Y(\hat{\beta}_1(s)^{\top}\bar{w})$, respectively; (2) a general transformation model for the cumulative distribution function of S(0) | W = w, $F_{S(0)}(s \mid w) = g_S^{-1}[g_S\{F_0(s)\} + \gamma^{\top}w]$, where gS(·) is a given link function and F0(·) is an unknown baseline distribution function. Let the estimator of FS(0)(s | w) be denoted by $\hat{F}_{S(0)}(s \mid w) = g_S^{-1}[g_S\{\hat{F}_0(s)\} + \hat{\gamma}^{\top}w]$, where $(\hat{F}_0(\cdot), \hat{\gamma})$ is the consistent estimator of (F0(·), γ). The residual treatment effect, ΔS(w), can then be estimated by

$$\int g_Y(\hat{\beta}_1(s)^{\top}\bar{w})\, d\hat{F}_{S(0)}(s \mid w) - \int g_Y(\hat{\beta}_0(s)^{\top}\bar{w})\, d\hat{F}_{S(0)}(s \mid w),$$

and the surrogacy RS(w) can be estimated accordingly. Note that even when W is a scalar, a flexible semiparametric model could alternatively be used to model E(Y(1) | S(1) = s, W = w), if desired. For example, one may consider the additive model μ1(s, w) = gY {βS(s) + γw(w)}, where gY (·) is a given link function. The advantage of such an alternative is that the associated inference would not involve two-dimensional smoothing, and thus could more feasibly be used in settings with smaller sample sizes.

3.3. Inference and Variance Estimation

In Web Appendix C, we show that under mild regularity conditions Δ^S(w) is a consistent estimator of ΔS(w) and that as n → ∞,

$$\sqrt{nh}\begin{pmatrix}\hat{\Delta}_S(w) - \Delta_S(w)\\ \hat{\Delta}(w) - \Delta(w)\end{pmatrix} \to N(0, \Sigma_\Delta(w)),$$

assuming that hi/h ∈ [rL, rU], i = 0, 1, 2, 3, 4 for finite positive constants rL and rU. It then follows that $\hat{R}_S(w)$ is a consistent estimator of RS(w) and, by the delta method, $\sqrt{nh}\{\hat{R}_S(w) - R_S(w)\}$ also converges weakly to a mean zero normal distribution with variance $\sigma_R^2(w)$. The variance-covariance matrix ΣΔ(w) as well as closed form variance estimates are given in Web Appendix C. Using these estimates, one may construct 95% confidence intervals for Δ(w), ΔS(w), and RS(w) using a normal approximation. We examine the performance of the variance estimates and confidence interval construction in our simulation study in Section 5.
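The delta-method step can be written out explicitly. The following is a sketch under assumed notation: the labels σSS(w), σSΔ(w) and σΔΔ(w) for the entries of ΣΔ(w) are introduced here for illustration only and are not taken from the paper, whose closed-form expressions appear in Web Appendix C.

```latex
% Assumed labelling of the entries of \Sigma_\Delta(w), ordered as (\Delta_S, \Delta):
\Sigma_\Delta(w) =
\begin{pmatrix}
\sigma_{SS}(w) & \sigma_{S\Delta}(w)\\
\sigma_{S\Delta}(w) & \sigma_{\Delta\Delta}(w)
\end{pmatrix},
\qquad
R_S(w) = 1 - \frac{\Delta_S(w)}{\Delta(w)}.

% Gradient of R_S with respect to (\Delta_S, \Delta):
\nabla R_S(w) = \left(-\frac{1}{\Delta(w)},\ \frac{\Delta_S(w)}{\Delta(w)^2}\right)^{\top}.

% Delta-method variance:
\sigma_R^2(w) = \nabla R_S(w)^{\top}\, \Sigma_\Delta(w)\, \nabla R_S(w)
= \frac{\sigma_{SS}(w)}{\Delta(w)^2}
- \frac{2\,\Delta_S(w)\,\sigma_{S\Delta}(w)}{\Delta(w)^3}
+ \frac{\Delta_S(w)^2\,\sigma_{\Delta\Delta}(w)}{\Delta(w)^4}.

% Pointwise 95% Wald interval based on the normal approximation:
\hat{R}_S(w) \pm 1.96\, \frac{\hat{\sigma}_R(w)}{\sqrt{nh}}.
```

The variance grows quickly as Δ(w) approaches zero, which is one reason the proportion explained can be unstable where the conditional treatment effect is small.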

It is also possible to construct a simultaneous confidence band for RS(w) over a given interval [wa, wb] within the support of W. We describe this procedure and provide justification for the validity of this proposed confidence band in Web Appendix D. We illustrate this confidence band in our simulation study (Section 5) and AIDS application (Section 6).

4. Testing Procedures

4.1. Omnibus Test

While our aim in Section 3 was to develop methods to assess potential heterogeneity, our goal in this section is to formally test for the presence of heterogeneity in the utility of the surrogate marker. That is, we wish to test the null hypothesis:

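$$H_0: R_S(\cdot) \text{ is constant within } [w_a, w_b], \text{ i.e., } \exists\, \tau \text{ s.t. } \forall\, w \in [w_a, w_b],\ R_S(w) = \tau,$$

$$H_A: R_S(\cdot) \text{ is not constant within } [w_a, w_b], \text{ i.e., } \forall\, \tau,\ \exists\, w \in [w_a, w_b] \text{ s.t. } R_S(w) \neq \tau.$$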

To achieve this, we also let (h1, h0) = (h4, h2) and consider a supremum type test statistic

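$$T = \sup_{w \in [w_a, w_b]} \frac{\sqrt{nh}\,\left|\hat{\Delta}_S(w) - (1 - \hat{\tau})\hat{\Delta}(w)\right|}{\hat{\sigma}_D(w)},$$

where $\hat{\sigma}_D^2(w)$ is a consistent estimator of the variance of $\sqrt{nh}\{\hat{\Delta}_S(w) - (1 - \hat{\tau})\hat{\Delta}(w)\}$, and

$$\hat{\tau} = 1 - \frac{\int_{w_a}^{w_b} \hat{\Delta}_S(w)\, dw}{\int_{w_a}^{w_b} \hat{\Delta}(w)\, dw}.$$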

Under the null hypothesis that RS(w) = τ :

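$$\begin{aligned}
&\hat{\Delta}_S(w) - \hat{\Delta}(w)(1 - \hat{\tau}) && (4)\\
&= \{\hat{\Delta}_S(w) - \Delta_S(w)\} - \{\hat{\Delta}(w) - \Delta(w)\}\{1 - R_S(w)\} + \Delta(w)(\hat{\tau} - \tau), && (5)\\
&= \{\hat{\Delta}_S(w) - \Delta_S(w)\} - \{\hat{\Delta}(w) - \Delta(w)\}(1 - \tau) && (6)\\
&\quad - \frac{\Delta(w)}{\int_{w_a}^{w_b} \Delta(w)\, dw}\left(\int_{w_a}^{w_b} \{\hat{\Delta}_S(w) - \Delta_S(w)\}\, dw - (1 - \tau)\int_{w_a}^{w_b} \{\hat{\Delta}(w) - \Delta(w)\}\, dw\right), && (7)
\end{aligned}$$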

whose distribution can be approximated by the conditional distribution of a stochastic process Z*(w), defined in Web Appendix E, given the observed data. Therefore, we may generate a large number of Z*(w) by repeatedly simulating the $U_{gi}$ from independent N(0, 1) and let

$$T^* = \sup_{w \in [w_a, w_b]} \left|\frac{\sqrt{nh}\, Z^*(w)}{\hat{\sigma}_D(w)}\right|,$$

where $\hat{\sigma}_D^2(w)$ is the empirical variance of the simulated $\sqrt{nh}\, Z^*(w)$. After obtaining B realizations of T*, $\{T_b^*, b = 1, \ldots, B\}$, the p-value for testing the constant surrogacy over W can be approximated by $B^{-1}\sum_{b=1}^{B} I(T_b^* \ge t_{obs})$, where $t_{obs}$ is the observed test statistic T. In Web Appendix E, we provide justification for the validity of this proposed testing procedure. The term (7) and its counterpart in Z*(w) are of order $O_p(n^{-1/2})$ and asymptotically negligible, but we include them to account for the variance of $\hat{\tau}$ in finite samples. While an advantage of this test is that it is omnibus, it may lack power for detecting a specific alternative. In the following section, we propose an alternative testing approach.
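The resampling p-value can be sketched as below, assuming the perturbed process realizations are already available. The construction of Z*(w) itself is given only in Web Appendix E and is not reproduced here, so `z_star` (a B by G array of simulated values on a grid of w) is a hypothetical input, as are the function names. The sketch also assumes that the same pointwise standard deviation, computed from the simulated process, is used to scale both the observed and the simulated statistics, and it replaces the supremum over [wa, wb] with a maximum over the grid.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal approximation of the integral of y over the grid x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def tau_hat(grid, delta_s, delta):
    """tau_hat = 1 - int Delta_S_hat(w) dw / int Delta_hat(w) dw."""
    return 1.0 - trapezoid(delta_s, grid) / trapezoid(delta, grid)

def omnibus_pvalue(grid, delta_s, delta, z_star, n, h):
    """Approximate p-value of the supremum-type test on a grid.

    z_star: (B, G) array of simulated Z*(w); assumed to be supplied.
    """
    delta_s = np.asarray(delta_s, dtype=float)
    delta = np.asarray(delta, dtype=float)
    scaled = np.sqrt(n * h) * np.asarray(z_star, dtype=float)
    sigma_d = scaled.std(axis=0, ddof=1)          # pointwise empirical SD
    tau = tau_hat(grid, delta_s, delta)
    resid = np.abs(delta_s - (1.0 - tau) * delta)
    t_obs = np.max(np.sqrt(n * h) * resid / sigma_d)
    t_star = np.max(np.abs(scaled) / sigma_d, axis=1)
    return float(np.mean(t_star >= t_obs))

# Hypothetical inputs: a constant proportion explained (0.5) on the grid,
# with standard normal draws standing in for the simulated process.
rng = np.random.default_rng(1)
grid = np.linspace(0.25, 1.75, 16)
delta = 12.0 + 3.0 * grid
delta_s = 0.5 * delta
z_star = rng.normal(size=(200, grid.size))
print(tau_hat(grid, delta_s, delta))
print(omnibus_pvalue(grid, delta_s, delta, z_star, n=3500, h=0.1))
```

When the estimated residual effect is exactly proportional to the estimated total effect, the observed statistic is zero and every simulated statistic exceeds it, so the p-value equals one.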

4.2. Trend test

In some settings, it may be of interest to consider an alternative hypothesis in which RS(w) is monotone increasing or decreasing in w. In such a case, we propose to consider the following test statistic:

$$S = \sqrt{n}\int_{w_a}^{w_b} \hat{R}_S(w)\left(w - \frac{w_b + w_a}{2}\right) dw.$$

Under the null,

$$S = \sqrt{n}\int_{w_a}^{w_b} \{\hat{R}_S(w) - R_S(w)\}\left(w - \frac{w_b + w_a}{2}\right) dw,$$

because RS(w) is constant and the centered weight integrates to zero over [wa, wb]. This quantity converges weakly to a mean zero Gaussian distribution, whose variance can be approximated by the conditional variance of

$$S^* = \sqrt{n}\int_{w_a}^{w_b} D^*(w)\left(w - \frac{w_b + w_a}{2}\right) dw,$$

given the observed data, where the process D*(w) is defined in Web Appendix D. In practice, we may generate a large number of S* and calculate its empirical variance, denoted by $\hat{\sigma}_S^2$. The p-value for this “trend” test can then be approximated by $P(|N(0, 1)| \ge |S|/\hat{\sigma}_S)$.
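A grid-based version of the trend statistic and its two-sided normal p-value can be sketched as follows. The simulated values of S* depend on the process D*(w) from Web Appendix D, which is not reproduced here, so `s_star` is a hypothetical input and the function names are assumptions; the integral is approximated with the trapezoidal rule.

```python
import math
import numpy as np

def trapezoid(y, x):
    """Trapezoidal approximation of the integral of y over the grid x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def trend_statistic(grid, r_hat, n):
    """S = sqrt(n) * int R_S_hat(w) * (w - (w_a + w_b) / 2) dw on a grid."""
    grid = np.asarray(grid, dtype=float)
    mid = 0.5 * (grid[0] + grid[-1])
    return math.sqrt(n) * trapezoid(np.asarray(r_hat, dtype=float) * (grid - mid), grid)

def trend_pvalue(s_obs, s_star):
    """Two-sided p-value P(|N(0, 1)| >= |S| / sigma_S_hat).

    s_star: simulated realizations of S*; assumed to be supplied.
    """
    sigma_s = float(np.std(np.asarray(s_star, dtype=float), ddof=1))
    return math.erfc(abs(s_obs) / sigma_s / math.sqrt(2.0))

# A flat curve carries no trend, so the statistic is numerically zero.
grid = np.linspace(0.0, 2.0, 2001)
flat = np.full(grid.shape, 0.5)
print(trend_statistic(grid, flat, n=3500))

# An increasing curve R(w) = w gives a positive statistic.
print(trend_statistic(grid, grid, n=1))
```

The centered weight makes the statistic insensitive to the overall level of the curve and sensitive only to a systematic increase or decrease across the interval.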

5. Simulation Study

The goals of this simulation study were to (a) examine the finite sample performance of the proposed estimators $\hat{R}_S(w)$ and $\hat{\Delta}_S(w)$ in terms of bias, standard error estimation, and coverage of the confidence intervals, (b) evaluate the performance of the proposed confidence band from Section 3.3 in terms of coverage, and (c) examine the properties of the proposed tests for heterogeneity with respect to power and Type 1 error. To this end, we examined 6 simulation settings covering both a continuous and discrete W. For all settings, α = 0.05, and results are summarized over 500 replications; we examine all settings with (n1, n0) = (2000, 1500) and (500, 500). In simulation setting 1, W1 ~ U(0, 2), W0 ~ U(0, 2), S1 ~ N(6, 4), S0 ~ N(5, 1), Y1|S1, W1 = 3 + 6S1 + 4W1 + N(0, 9), and Y0|S0, W0 = 2 + 5S0 + W0 + N(0, 9) such that Sg ⊥ Wg, and Yg depends on both Sg and Wg where the association differs by treatment group. In this setting, Δ(w) = 12 + 3w, ΔS(w) = 6 + 3w and RS(w) = 2/(4 + w). In simulation setting 2, W1 ~ U(0, 2), W0 ~ U(0, 2), S1 ~ N(6, 4) + W1, S0 ~ N(5, 1) + W0, Y1|S1, W1 = 3 + 6S1 + 4W1 + N(0, 9), and Y0|S0, W0 = 2 + 5S0 + W0 + N(0, 9) such that Sg depends on Wg with the same association in the two groups, and, like setting 1, Yg depends on both Sg and Wg, where the association differs by treatment group. In this setting, Δ(w) = 12 + 4w, ΔS(w) = 6 + 4w and RS(w) = 3/(6 + 2w). In simulation setting 3, W1 ~ U(0, 2), W0 ~ U(0, 2), S1 ~ N(6, 4) + 2W1, S0 ~ N(5, 1) + W0, Y1|S1, W1 = 3 + 6S1 + 4W1 + N(0, 9), and Y0|S0, W0 = 2 + 5S0 + W0 + N(0, 9) such that Sg depends on Wg, Yg depends on both Sg and Wg, and these associations differ by treatment group. In this setting, Δ(w) = 12 + 10w, ΔS(w) = 6 + 4w and RS(w) = (3 + 3w)/(6 + 5w). Finally, in simulation setting 4, data are generated such that there is no heterogeneity, W1 ~ U(0, 2), W0 ~ U(0, 2), S1 ~ N(6, 4), S0 ~ N(5, 1), Y1|S1, W1 = 3 + 6S1 + N(0, 9), and Y0|S0, W0 = 2 + 5S0 + N(0, 9). In this setting, Δ(w) = 12, ΔS(w) = 6 and RS(w) = 1/2. With respect to our bandwidth selection, we let $h_1 = h_4 = 2 \times 1.06 \times \min(\sigma_{W_1}, \mathrm{IQR}_1/1.34)\, n_1^{-2/5}$ and $h_2 = h_0 = 1.06 \times \min(\sigma_{W_0}, \mathrm{IQR}_0/1.34)\, n_0^{-2/5}$, where $\sigma_{W_j}$ and $\mathrm{IQR}_j$ were the empirical standard deviation and inter-quartile range of Wj, respectively (Scott, 1992); we discuss bandwidth selection further in Section 7. The number of resampling iterations/realizations used to construct the 95% confidence band and conduct hypothesis testing was B = 500.
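To make setting 1 and the bandwidth rule concrete, the following sketch generates one data set and evaluates the stated true curves. It is an illustration under assumptions: N(a, b) is read here as a normal distribution with mean a and variance b, and the function names, the `scale` argument, and the seed are hypothetical choices introduced for this example.

```python
import numpy as np

def simulate_setting1(n1, n0, rng):
    """Draw one data set from simulation setting 1 (variance reading assumed)."""
    W1 = rng.uniform(0.0, 2.0, n1)
    W0 = rng.uniform(0.0, 2.0, n0)
    S1 = rng.normal(6.0, 2.0, n1)    # N(6, 4): standard deviation 2 (assumption)
    S0 = rng.normal(5.0, 1.0, n0)    # N(5, 1)
    Y1 = 3.0 + 6.0 * S1 + 4.0 * W1 + rng.normal(0.0, 3.0, n1)   # N(0, 9) noise
    Y0 = 2.0 + 5.0 * S0 + W0 + rng.normal(0.0, 3.0, n0)
    return {"W1": W1, "S1": S1, "Y1": Y1, "W0": W0, "S0": S0, "Y0": Y0}

def truth_setting1(w):
    """True Delta(w), Delta_S(w) and R_S(w) as stated for setting 1."""
    w = np.asarray(w, dtype=float)
    delta = 12.0 + 3.0 * w
    delta_s = 6.0 + 3.0 * w
    return delta, delta_s, 1.0 - delta_s / delta

def rule_of_thumb_bandwidth(w, scale=1.0):
    """scale * 1.06 * min(sd, IQR / 1.34) * n^(-2/5), an undersmoothed rule."""
    w = np.asarray(w, dtype=float)
    sd = np.std(w, ddof=1)
    q75, q25 = np.percentile(w, [75, 25])
    return float(scale * 1.06 * min(sd, (q75 - q25) / 1.34) * w.size ** (-2.0 / 5.0))

rng = np.random.default_rng(2021)
data = simulate_setting1(2000, 1500, rng)
h1 = rule_of_thumb_bandwidth(data["W1"], scale=2.0)   # h1 = h4
h0 = rule_of_thumb_bandwidth(data["W0"])              # h2 = h0
print(h1, h0)
print(truth_setting1([0.4, 0.8, 1.2, 1.6]))
```

The exponent of 2/5 on the sample size shrinks the bandwidth faster than the usual 1/5 rate, which is the undersmoothing described in Section 3.2.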

Settings 5 and 6 investigate settings with a discrete W. In simulation setting 5, W1 and W0 ∈ {0, 1, 2} with equal probability, S1 ~ N(6, 9) + 2W1, S0 ~ N(3.4, 1) + W0, and Y1|S1, W1 = 3 + 6S1 + 4W1 + N(0, 9), and Y0|S0, W0 = 2 + 5S0 + 2W0 + N(0, 9). In this setting, Δ(w) = 20 + 9w, ΔS(w) = 4.4 + 3w and RS(w) = (15.6 + 6w)/(20 + 9w). Finally, in setting 6, W1 and W0 ∈ {0, 1, 2} with equal probability, S1 ~ N(6, 4), S0 ~ N(5, 1), and Y1|S1, W1 = 3 + 6S1 + N(0, 9), and Y0|S0, W0 = 2 + 5S0 + N(0, 9). In this setting, Δ(w) = 12, ΔS(w) = 6 and RS(w) = 1/2 implying no heterogeneity.

Table 1 summarizes the performance of the proposed estimation method for ΔS(w) and RS(w) at w = 0.4, 0.8, 1.2 and 1.6, representing the 20%, 40%, 60%, and 80% quantiles of Wj, j = 0, 1, respectively, in settings 1, 2, 3 and 4 over 500 replications when n1 = 2000 and n0 = 1500. Specifically, Table 1 includes the empirical bias, the empirical standard error of the estimators, average standard error estimates, and empirical coverage level of the Wald-type confidence intervals. Figure 1 also illustrates the observed performance of the point estimate and standard error estimates from setting 1. Figure 2 illustrates the pointwise confidence intervals and confidence band for RS(w) for a single iteration from setting 1. These results show good performance in terms of small bias, standard error estimates close to their empirical counterparts, and coverage levels close to the nominal level.

Table 1:

Simulation results for estimation of ΔS(w) and RS(w) in Settings 1, 2, 3, and 4 when n1 = 2000 and n0 = 1500; ESE = empirical standard error, ASE = average standard error; Coverage = coverage of the 95% confidence intervals

Setting 1
                       ΔS(w)                              RS(w)
w            0.400   0.800   1.200   1.600      0.400   0.800   1.200   1.600
Truth        7.200   8.400   9.600  10.800      0.455   0.417   0.385   0.357
Estimate     7.294   8.476   9.664  10.872      0.443   0.408   0.377   0.351
Bias         0.094   0.076   0.064   0.072     −0.011  −0.009  −0.007  −0.006
ESE          0.423   0.451   0.438   0.413      0.051   0.047   0.046   0.044
ASE          0.430   0.430   0.430   0.431      0.051   0.048   0.046   0.044
Coverage     0.952   0.934   0.938   0.954      0.960   0.966   0.954   0.946

Setting 2
                       ΔS(w)                              RS(w)
w            0.400   0.800   1.200   1.600      0.400   0.800   1.200   1.600
Truth        7.600   9.200  10.800  12.400      0.441   0.395   0.357   0.326
Estimate     7.695   9.277  10.866  12.474      0.430   0.386   0.350   0.320
Bias         0.095   0.077   0.066   0.074     −0.011  −0.009  −0.007  −0.006
ESE          0.422   0.451   0.437   0.413      0.050   0.045   0.044   0.041
ASE          0.430   0.430   0.431   0.431      0.050   0.047   0.044   0.041
Coverage     0.950   0.934   0.932   0.952      0.958   0.968   0.950   0.942

Setting 3
                       ΔS(w)                              RS(w)
w            0.400   0.800   1.200   1.600      0.400   0.800   1.200   1.600
Truth        7.600   9.200  10.800  12.400      0.525   0.540   0.550   0.557
Estimate     7.728   9.348  10.960  12.627      0.515   0.531   0.542   0.549
Bias         0.128   0.148   0.160   0.227     −0.010  −0.009  −0.008  −0.008
ESE          0.438   0.473   0.498   0.529      0.039   0.030   0.025   0.023
ASE          0.439   0.452   0.465   0.481      0.038   0.030   0.025   0.022
Coverage     0.944   0.918   0.916   0.896      0.948   0.942   0.948   0.928

Setting 4 (no heterogeneity)
                       ΔS(w)                              RS(w)
w            0.400   0.800   1.200   1.600      0.400   0.800   1.200   1.600
Truth        6.000   6.000   6.000   6.000      0.500   0.500   0.500   0.500
Estimate     6.093   6.075   6.064   6.073      0.488   0.490   0.491   0.491
Bias         0.093   0.075   0.064   0.073     −0.012  −0.010  −0.009  −0.009
ESE          0.423   0.452   0.436   0.413      0.053   0.051   0.053   0.052
ASE          0.430   0.430   0.430   0.431      0.053   0.052   0.052   0.052
Coverage     0.952   0.934   0.944   0.954      0.960   0.968   0.960   0.944

Figure 1:

Estimation in Setting 1 summarized over 500 replications when n1 = 2000 and n0 = 1500: (a) estimate vs. truth for ΔS(w), (b) average standard error (ASE) vs. empirical standard error (ESE) for ΔS(w), (c) estimate vs. truth for RS(w), and (d) ASE vs. ESE for RS(w)

Figure 2:

Pointwise confidence intervals and confidence band for RS(w) for a single iteration from Setting 1 when n1 = 2000 and n0 = 1500

Table 2 presents the empirical power of the two proposed tests (the omnibus test and trend test) for heterogeneity in settings 1, 2, and 3, and the empirical type 1 error of these two tests in setting 4, when the null was true, when n1 = 2000 and n0 = 1500. The type 1 error rate was maintained at the 0.05 level in setting 4. The power of the trend test was higher than that for the omnibus test, as expected. Table 2 also presents the empirical coverage level of the 95% confidence band for RS(w), w ∈ [0.25, 1.75]. These results show that the empirical coverage level of the confidence band was satisfactory in all four settings.

Table 2:

Power/type 1 error and confidence band coverage in Settings 1, 2, 3, and 4 when n1 = 2000 and n0 = 1500

                            Omnibus test   Trend test   Confidence band coverage for RS(w), w ∈ [0.25, 1.75]
Setting 1 (Power)                  0.282        0.648   0.952
Setting 2 (Power)                  0.392        0.806   0.950
Setting 3 (Power)                  0.138        0.270   0.926
Setting 4 (Type 1 error)           0.046        0.058   0.944

Table 3 summarizes estimation performance in the discrete case, for settings 5 and 6, when n1 = 2000 and n0 = 1500. Similar to settings 1 through 4, the resulting bias is small, standard error estimates are close to their empirical counterparts, and coverage levels are close to the nominal level. The power of the proposed test in setting 5 was 0.784; the type 1 error rate of the proposed test in setting 6 was 0.042, again close to α = 0.05.

Table 3:

Simulation results for estimation of ΔS(w) and RS(w) in Settings 5 and 6 where W is discrete when n1 = 2000 and n0 = 1500; ESE = empirical standard error, ASE = average standard error; Coverage = coverage of the 95% confidence intervals

Setting 5
                   ΔS(w)                      RS(w)
w                0       1       2          0       1       2
Truth        4.400   7.400  10.400      0.780   0.745   0.726
Estimate     4.460   7.482  10.531      0.777   0.742   0.723
Bias        −0.060  −0.082  −0.131      0.003   0.003   0.004
ESE          0.249   0.294   0.347      0.014   0.012   0.011
ASE          0.254   0.289   0.352      0.014   0.011   0.010
Coverage     0.944   0.940   0.930      0.964   0.940   0.920

Setting 6 (no heterogeneity)
                   ΔS(w)                      RS(w)
w                0       1       2          0       1       2
Truth        6.000   6.000   6.000      0.500   0.500   0.500
Estimate     6.033   6.021   6.029      0.496   0.497   0.496
Bias        −0.033  −0.021  −0.029      0.004   0.003   0.004
ESE          0.211   0.216   0.207      0.025   0.025   0.027
ASE          0.211   0.212   0.211      0.026   0.026   0.026
Coverage     0.950   0.944   0.954      0.964   0.964   0.942

Simulation results for all settings with a smaller sample size n1 = n0 = 500 are provided in Web Appendix F. Overall, the results from this simulation study illustrate good performance of the proposed estimation and testing procedures in finite samples.

6. Application

We use our proposed procedures to examine potential heterogeneity in the utility of a surrogate marker in the AIDS Clinical Trials Group 320 Study (Hammer et al., 1997). This study was a randomized, double-blind, placebo-controlled trial that compared a three-drug regimen with a two-drug regimen in HIV-infected patients with a CD4 cell count of 200 or less per cubic millimeter and at least three months of prior zidovudine therapy. Results showed better performance for the three-drug regimen with respect to progression to AIDS, death, change in CD4, and change in plasma HIV-1 RNA.

Our primary outcome of interest is the change in plasma HIV-1 RNA from baseline to 24 weeks and the surrogate marker of interest is change in CD4 cell count from baseline to 24 weeks. CD4 is of interest as a surrogate marker here because RNA is relatively expensive to measure (Calmy et al., 2007). Our analytic sample included 418 individuals randomized to the three-drug regimen group and 412 individuals randomized to the two-drug regimen group. The average change in log RNA from baseline to 24 weeks was −2.15 and −0.55 (log10 copies/ml) in the three-drug group and the two-drug group, respectively, resulting in a treatment effect of −1.60 with a p-value < 0.0001 using a two-sample t-test. The estimated residual treatment effect was $\hat{\Delta}_S = -0.93$, and the proportion of treatment effect explained was $\hat{R}_S = 41.5\%$.

We first examine the extent to which the proportion of treatment effect explained by change in CD4 count depends on the baseline CD4 count. Both the omnibus test and trend test reject the null hypothesis with p < 0.001, providing evidence of significant heterogeneity in the utility of the change in CD4 count as a surrogate for change in RNA with respect to baseline CD4. Figure 3 shows the estimates of Δ(w), ΔS(w) and RS(w) as a function of baseline CD4, as well as pointwise confidence intervals for each and the confidence band for RS(w). These estimates reflect a decrease in the proportion of treatment effect explained with increasing baseline CD4, with RS(w) ranging from 0.5 to 0.7 for lower baseline CD4 counts and from 0.1 to 0.2 for higher baseline CD4 counts.

Figure 3:

Estimates of $\Delta(w)$, $\Delta_S(w)$, and $R_S(w)$ from the AIDS clinical trial with baseline CD4 as the baseline covariate of interest, change in log RNA from baseline to 24 weeks as the primary outcome, and change in CD4 from baseline to 24 weeks as the surrogate marker.

Second, we examine race/ethnicity (White vs. non-White) as the discrete baseline covariate of interest, applying the methods described in Web Appendix B. The test for heterogeneity results in a p-value of 0.23, indicating no significant evidence of heterogeneity in the utility of the surrogate with respect to race/ethnicity. For illustrative purposes only (see Web Appendix A), we report the estimates within each subgroup. Among White participants, $\hat{\Delta}(\text{White}) = -1.83$ (95% CI: −2.05, −1.61), $\hat{\Delta}_S(\text{White}) = -1.14$ (95% CI: −1.43, −0.85), and $\hat{R}_S(\text{White}) = 0.38$ (95% CI: 0.25, 0.51). Among non-White participants, $\hat{\Delta}(\text{non-White}) = -1.31$ (95% CI: −1.53, −1.09), $\hat{\Delta}_S(\text{non-White}) = -0.66$ (95% CI: −0.92, −0.40), and $\hat{R}_S(\text{non-White}) = 0.50$ (95% CI: 0.33, 0.66). The point estimates suggest that the surrogate strength is weaker among White participants than among non-White participants, although this difference is not statistically significant.
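For intuition about why the subgroup difference is not significant, a rough back-of-the-envelope comparison can be made from the reported confidence intervals alone. The sketch below is not the test described in Web Appendix B: it assumes the two subgroup estimates are independent and approximately normal, recovers standard errors from the 95% interval widths, and applies a simple two-sample Wald statistic. The function names are hypothetical.

```python
import math

def se_from_ci(lower, upper, z_crit=1.96):
    """Approximate standard error implied by a symmetric 95% interval."""
    return (upper - lower) / (2 * z_crit)

def wald_two_group(est_a, se_a, est_b, se_b):
    """Two-sided Wald test for the difference of two independent estimates."""
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p

# Reported subgroup estimates of the proportion explained and their CIs.
se_white = se_from_ci(0.25, 0.51)
se_nonwhite = se_from_ci(0.33, 0.66)

z, p = wald_two_group(0.50, se_nonwhite, 0.38, se_white)
print(f"z = {z:.2f}, approximate p = {p:.2f}")
```

This crude approximation gives a p-value of about 0.26, in the same neighborhood as the reported 0.23 but not identical to it, as expected given the different construction.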

7. Discussion

In this paper, we proposed an approach and estimation procedures to examine potential heterogeneity in the utility of the surrogate marker with respect to a baseline covariate. In addition, we developed testing procedures, both an omnibus test and a trend-based test, to test for evidence of heterogeneity. Our simulation study demonstrated satisfactory finite sample performance of our proposed methods. Our exploratory analysis of an AIDS clinical trial illustrated the use of the procedures with both a discrete baseline covariate, race/ethnicity, and a continuous baseline covariate, CD4 cell count. An R package implementing the methods proposed here, named hetsurr, is available on GitHub (see Web Appendix G).

The presence of heterogeneity has important implications for use of the surrogate marker in a future study. First, the heterogeneity information provided by our proposed methods can be used to identify a region of interest, i.e., a subset of values of W for which the surrogate is especially strong. This identified region could be used to inform future trial design/recruitment or to inform further investigation into the treatment mechanisms (see Web Appendix A). Second, when there is evidence of heterogeneity, use of that surrogate marker in a new study population that has a different participant mix than the original study (used to evaluate the utility of the surrogate marker) may lead to inaccurate conclusions about the treatment effect if the heterogeneity is not taken into account. For example, Parast et al. (2019) proposed an approach to test for a treatment effect earlier using a surrogate marker; however, we expect that if there is heterogeneity, this testing procedure may not perform well. If the surrogate is going to be used to test for a treatment effect, certain transportability assumptions are generally made about the new study compared to the existing study (Price et al., 2018; Wang et al., 2020), and these assumptions will likely not hold in the presence of surrogacy heterogeneity. Future work on the implications of heterogeneity with respect to using the surrogate as a replacement for the primary outcome in future studies would be useful.
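One simple way to operationalize a region of interest is to scan a grid of covariate values and keep the contiguous stretches where the lower confidence limit for the proportion explained stays above a chosen threshold. The sketch below illustrates that idea only; the threshold, the function name, and the grid values are hypothetical and are not results from the study.

```python
def regions_above(grid, lower_bounds, threshold):
    """Return (start, end) pairs of consecutive grid points whose
    lower confidence limit is at least `threshold`."""
    regions = []
    start = None
    prev = None
    for w, lb in zip(grid, lower_bounds):
        if lb >= threshold:
            if start is None:
                start = w  # open a new region
            prev = w
        elif start is not None:
            regions.append((start, prev))  # close the current region
            start = None
    if start is not None:
        regions.append((start, prev))
    return regions

# Hypothetical grid of baseline covariate values and lower limits.
grid = [0, 50, 100, 150, 200]
lower = [0.55, 0.60, 0.45, 0.20, 0.10]

print(regions_above(grid, lower, threshold=0.40))
```

Using the lower confidence limit rather than the point estimate is a conservative design choice; a region chosen this way is exploratory and would still need confirmation in independent data.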

In practice, we recommend that the process of examining heterogeneity in the utility of the surrogate marker follow guidance for subgroup analyses in clinical trials. That is, subgroups to be examined should ideally be pre-specified, and the interpretation of surrogate strength for certain subgroups should be considered in light of all subgroups that are tested, i.e., after appropriate multiple testing adjustment (Wang et al., 2007); see Web Appendix A for further discussion.
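As one example of a multiple testing adjustment, the sketch below applies the Holm step-down procedure to a set of heterogeneity-test p-values, one per baseline covariate examined. The choice of Holm is an assumption made here for illustration (the text does not prescribe a specific procedure), and the p-values are hypothetical.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity across the ordered p-values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical p-values from heterogeneity tests on three covariates.
raw = [0.01, 0.04, 0.03]
print(holm_adjust(raw))
```

A covariate would then be flagged only if its adjusted p-value falls below the chosen significance level.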

Our proposed methods have some limitations. First, it is difficult or even impossible to verify some of the assumptions enumerated in (C1)-(C5). Potential violations of these assumptions should be carefully considered; discussions with clinical experts about the validity of these assumptions may shed light on whether they are reasonable in a particular clinical setting (see Web Appendix A). In addition, recent work suggesting numerical approaches to examine sensitivity to violations of such assumptions may be useful (Elliott et al., 2015). Second, we require the selection of several bandwidth parameters, and there are many options one could choose from (see Web Appendix G). Results may be sensitive to the bandwidth selected, especially with a moderate sample size. Third, in this paper, we focus on a surrogate that is either continuous or discrete, a primary outcome that is either continuous or binary, and a treatment effect measured by the mean difference. Our methods cannot easily accommodate a setting where the surrogate and/or primary outcome are time-to-event outcomes subject to censoring. Lastly, while the nonparametric two-dimensional smoothing approach is robust in the sense that no stringent model assumptions are required, the method does require a relatively large sample size. In settings with smaller sample sizes, a parametric or semiparametric version of the framework presented here, such as the approach described in Remark 3 of Section 3.2, could be considered, though such an approach would rely strongly on the parametric specifications.
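A basic way to gauge bandwidth sensitivity is to rerun the analysis over a small grid of bandwidths centered on a reference value. The sketch below builds such a grid from the standard normal-reference rule of thumb for a one-dimensional Gaussian kernel, h = 1.06 · sd · n^(−1/5). This rule is used here only as a convenient starting point and is not claimed to be the bandwidth choice in Web Appendix G; the multipliers, function names, and data are hypothetical.

```python
import statistics

def normal_reference_bandwidth(values):
    """Rule-of-thumb bandwidth 1.06 * sd * n^(-1/5) for a Gaussian kernel."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    return 1.06 * statistics.stdev(values) * n ** (-1 / 5)

def bandwidth_grid(values, multipliers=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Candidate bandwidths as multiples of the reference bandwidth."""
    h0 = normal_reference_bandwidth(values)
    return [m * h0 for m in multipliers]

# Hypothetical baseline covariate values (not study data).
baseline_w = [12, 35, 48, 60, 75, 90, 110, 140, 165, 190]

h0 = normal_reference_bandwidth(baseline_w)
grid = bandwidth_grid(baseline_w)
print(round(h0, 2), [round(h, 2) for h in grid])
```

If the estimated curve or the test conclusions change materially across the grid, that is a signal to report the sensitivity alongside the main results.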

Supplementary Material

supinfo

Acknowledgements

Support for this research was provided by National Institutes of Health grant R01DK118354.

We are grateful to the AIDS Clinical Trial Group (ACTG) for providing the AIDS data.

Footnotes

Supporting Information

Web Appendices referenced in Sections 3, 4, 5, and 7 are available in the Supplementary Materials. In addition, a zip file containing code to replicate the simulation study and AIDS example is available with this paper at the Biometrics website on Wiley Online Library.

Data Availability Statement

The data from the ACTG 320 study used in this paper are publicly available upon request from the AIDS Clinical Trial Group: https://actgnetwork.org/submit-a-proposal/.

References

  1. Agyemang E, Magaret AS, Selke S, Johnston C, Corey L, and Wald A (2018). Herpes simplex virus shedding rate: surrogate outcome for genital herpes recurrence frequency and lesion rates, and phase 2 clinical trials end point for evaluating efficacy of antivirals. The Journal of Infectious Diseases 218, 1691–1699.
  2. Burzykowski T, Molenberghs G, and Buyse M (2005). The evaluation of surrogate endpoints. Springer.
  3. Burzykowski T, Molenberghs G, Buyse M, Geys H, and Renard D (2001). Validation of surrogate end points in multiple randomized clinical trials with failure time end points. Journal of the Royal Statistical Society: Series C (Applied Statistics) 50, 405–422.
  4. Calmy A, Ford N, Hirschel B, Reynolds SJ, Lynen L, Goemaere E, De La Vega FG, Perrin L, and Rodriguez W (2007). HIV viral load monitoring in resource-limited regions: optional or necessary? Clinical Infectious Diseases 44, 128–134.
  5. Cohen RM and Lindsell CJ (2012). When the blood glucose and the HbA1c don’t match: turning uncertainty into opportunity. Diabetes Care 35, 2421–2423.
  6. Conlon AS, Taylor JM, and Elliott MR (2014). Surrogacy assessment using principal stratification when surrogate and outcome measures are multivariate normal. Biostatistics 15, 266–283.
  7. Crump RK, Hotz VJ, Imbens GW, and Mitnik OA (2008). Nonparametric tests for treatment effect heterogeneity. The Review of Economics and Statistics 90, 389–405.
  8. Daniels MJ and Hughes MD (1997). Meta-analysis for the evaluation of potential surrogate markers. Statistics in Medicine 16, 1965–1982.
  9. Elliott MR, Conlon AS, Li Y, Kaciroti N, and Taylor JM (2015). Surrogacy marker paradox measures in meta-analytic settings. Biostatistics 16, 400–412.
  10. Freedman LS, Graubard BI, and Schatzkin A (1992). Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine 11, 167–178.
  11. Gilbert PB and Hudgens MG (2008). Evaluating candidate principal surrogate endpoints. Biometrics 64, 1146–1154.
  12. Hammer SM, Squires KE, Hughes MD, Grimes JM, Demeter LM, Currier JS, Eron JJ Jr, Feinberg JE, Balfour HH Jr, Deyton LR, et al. (1997). A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less. New England Journal of Medicine 337, 725–733.
  13. Huang Y and Gilbert PB (2011). Comparing biomarkers as principal surrogate endpoints. Biometrics 67, 1442–1451.
  14. Inker LA, Mondal H, Greene T, et al. (2016). Early change in urine protein as a surrogate end point in studies of IgA nephropathy: an individual-patient meta-analysis. American Journal of Kidney Diseases 68, 392–401.
  15. Joffe MM and Greene T (2009). Related causal frameworks for surrogate outcomes. Biometrics 65, 530–538.
  16. Lin D, Fischl MA, and Schoenfeld D (1993). Evaluating the role of CD4-lymphocyte counts as surrogate endpoints in human immunodeficiency virus clinical trials. Statistics in Medicine 12, 835–842.
  17. Lin D, Fleming T, De Gruttola V, et al. (1997). Estimating the proportion of treatment effect explained by a surrogate marker. Statistics in Medicine 16, 1515–1527.
  18. Parast L, Cai T, and Tian L (2017). Evaluating surrogate marker information using censored data. Statistics in Medicine 36, 1767–1782.
  19. Parast L, Cai T, and Tian L (2019). Using a surrogate marker for early testing of a treatment effect. Biometrics 75, 1253–1263.
  20. Parast L, McDermott MM, and Tian L (2016). Robust estimation of the proportion of treatment effect explained by surrogate marker information. Statistics in Medicine 35, 1637–1653.
  21. Prentice RL (1989). Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine 8, 431–440.
  22. Price BL, Gilbert PB, and van der Laan MJ (2018). Estimation of the optimal surrogate based on a randomized trial. Biometrics 74, 1271–1281.
  23. Renard D, Geys H, Molenberghs G, Burzykowski T, and Buyse M (2002). Validation of surrogate endpoints in multiple randomized clinical trials with discrete outcomes. Biometrical Journal 44, 921–935.
  24. Scott D (1992). Multivariate density estimation. Wiley, New York.
  25. Spieker AJ and Huang Y (2017). A method to address between-subject heterogeneity for identification of principal surrogate markers in repeated low-dose challenge HIV vaccine studies. Statistics in Medicine 36, 4071–4080.
  26. Sprenger T, Kappos L, Radue E-W, Gaetano L, Mueller-Lenke N, Wuerfel J, Poole EM, and Cavalier S (2020). Association of brain volume loss and long-term disability outcomes in patients with multiple sclerosis treated with teriflunomide. Multiple Sclerosis Journal 26, 1207–1216.
  27. Taylor JM, Wang Y, and Thiébaut R (2005). Counterfactual links to the proportion of treatment effect explained by a surrogate marker. Biometrics 61, 1102–1111.
  28. VanderWeele TJ (2013). Surrogate measures and consistent surrogates. Biometrics 69, 561–565.
  29. Wager S and Athey S (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113, 1228–1242.
  30. Wang R, Lagakos SW, Ware JH, Hunter DJ, and Drazen JM (2007). Statistics in medicine—reporting of subgroup analyses in clinical trials. New England Journal of Medicine 357, 2189–2194.
  31. Wang X, Parast L, Tian L, and Cai T (2020). Model-free approach to quantifying the proportion of treatment effect explained by a surrogate marker. Biometrika 107, 107–122.
  32. Wang Y and Taylor JM (2002). A measure of the proportion of treatment effect explained by a surrogate marker. Biometrics 58, 803–812.
  33. Wang-Lopez Q, Chalabi N, Abrial C, Radosevic-Robin N, Durando X, Mouret-Reynier M-A, Benmammar K-E, Kullab S, et al. (2015). Can pathologic complete response (pCR) be used as a surrogate marker of survival after neoadjuvant therapy for breast cancer? Critical Reviews in Oncology/Hematology 95, 88–104.
  34. Willke RJ, Zheng Z, Subedi P, Althin R, and Mullins CD (2012). From concepts, theory, and evidence of heterogeneity of treatment effects to methodological approaches: a primer. BMC Medical Research Methodology 12, 185.
