Author manuscript; available in PMC: 2023 Jun 12.
Published in final edited form as: Stat Med. 2022 Nov 13;42(1):68–88. doi: 10.1002/sim.9602

Using a Surrogate with Heterogeneous Utility to Test for a Treatment Effect

Layla Parast 1, Tianxi Cai 2, Lu Tian 3
PMCID: PMC10259671  NIHMSID: NIHMS1902812  PMID: 36372072

Summary

The primary benefit of identifying a valid surrogate marker is the ability to use it in a future trial to test for a treatment effect with shorter follow-up time or less cost. However, previous work has demonstrated potential heterogeneity in the utility of a surrogate marker. When such heterogeneity exists, existing methods that use the surrogate to test for a treatment effect while ignoring this heterogeneity may lead to inaccurate conclusions about the treatment effect, particularly when the patient population in the new study has a different mix of characteristics than the study used to evaluate the utility of the surrogate marker. In this paper, we develop a novel test for a treatment effect using surrogate marker information that accounts for heterogeneity in the utility of the surrogate. We compare our testing procedure to a test that uses primary outcome information (gold standard) and a test that uses surrogate marker information, but ignores heterogeneity. We demonstrate the validity of our approach and derive the asymptotic properties of our estimator and variance estimates. Simulation studies examine the finite sample properties of our testing procedure and demonstrate when our proposed approach can outperform the testing approach that ignores heterogeneity. We illustrate our methods using data from an AIDS clinical trial to test for a treatment effect using CD4 count as a surrogate marker for RNA.

Keywords: heterogeneity, hypothesis test, nonparametric methods, surrogate marker, treatment effect

1 |. INTRODUCTION

There has been a substantial growth in clinical and methodological research on identifying and using valid surrogate markers in the past few decades. A valid surrogate marker is a biological measurement that can be used as a replacement for a primary outcome of interest in a clinical study. Many statistical methods have been proposed to evaluate and validate surrogate markers using a wide variety of innovative methodological approaches.1,2,3,4,5 The primary benefit of identifying a valid surrogate marker is the ability to use it in a future trial to test for a treatment effect with less required follow-up time or less cost. For example, the U.S. Food and Drug Administration announced in 2020 that a surrogate marker that could be measured earlier than COVID-19 infection could be used to assess the vaccine efficacy in preventing infection,6 thus potentially allowing for earlier identification of effective vaccines.

Several statistical methods have been proposed in recent years to assess the treatment effect on the primary outcome based on surrogate marker information. For example, Parast et al. (2019)7 proposed a nonparametric approach to test for a treatment effect in a time-to-event outcome setting based on a surrogate marker measured at an earlier time point utilizing information about the relationship between the surrogate marker and primary outcome obtained from a prior study. Chen et al. (2020)8 suggested a model-based approach that uses surrogate information to make interim decisions about whether to drop a treatment arm or stop a trial for futility. Price et al. (2018)9 defined an optimal surrogate that optimally predicts a primary outcome and proposed super-learner and targeted super-learner based estimation procedures. Athey et al. (2019)10 proposed to combine multiple surrogate markers to predict a long term outcome and estimate a treatment effect, and explicitly characterized the difference between the treatment effect estimated based on the primary outcome versus the surrogate combination.

Previous clinical and methodological work has demonstrated potential heterogeneity in the utility of a surrogate marker, i.e., a surrogate marker may be more useful (with respect to capturing the treatment effect on the primary outcome) for some subgroups than for others.11 Parast et al. (2021)12 offer a nonparametric estimation procedure and formal test for heterogeneity of surrogate utility with respect to a baseline covariate. When such heterogeneity exists, existing methods that use the surrogate to test for a treatment effect while ignoring this heterogeneity may lead to inaccurate conclusions about the treatment effect, particularly when the patient population in the current study has a different mix of characteristics than the prior study (used to evaluate the utility of the surrogate marker).

For example, in the simulation study in this paper, we examine a setting where the estimated treatment effect based on the primary outcome is 33.7 (standard error [SE] = 1.6); applying the testing approach of Parast et al. (2019),7 which uses surrogate marker information but does not account for heterogeneity, the estimated treatment effect on the primary outcome is 39.2 (SE = 3.5). The approach of Parast et al. (2019)7 guarantees that the treatment effect based on the surrogate will be a lower bound for the true treatment effect on the primary outcome under certain conditions. However, these conditions may be violated when there is heterogeneity in the utility of the surrogate, which leads to this type of situation where the estimated treatment effect using the surrogate is much higher than that using the primary outcome. The approach we propose in this paper, which incorporates heterogeneity, produces a treatment effect estimate that retains the lower bound property, with power similar to that of the test using the primary outcome. While we focus on heterogeneity with respect to a continuous baseline covariate, we provide a motivational example in Appendix A where there is heterogeneity with respect to a discrete covariate, gender. In this example, the surrogate marker is strong among males (explaining 99% of the treatment effect on the primary outcome) but weaker among females (explaining 67%). In a new study where the distribution of gender is 95% female and 5% male and the treatment effect on the primary outcome is 38.95, using the surrogate marker and accounting for heterogeneity in surrogacy produces an estimated treatment effect on the primary outcome equal to 17.95, while ignoring heterogeneity produces an estimate of 44.5, again failing to provide a lower bound on the true treatment effect. In contrast, if we consider a future study where the distribution of gender is 5% female and 95% male, the treatment effect on the primary outcome is 74.05, while the treatment effect using the surrogate and accounting for heterogeneity is 71.05 and the treatment effect not accounting for heterogeneity is 44.5, indicating a potential loss in power to detect a treatment effect when heterogeneity is ignored.

In this paper, we develop a novel test for a treatment effect using surrogate marker information that accounts for heterogeneity in the utility of the surrogate. We compare our testing procedure to a test that uses primary outcome information only (gold standard) and a test that uses surrogate marker information, but ignores heterogeneity. We demonstrate the validity of our testing procedure and derive the asymptotic properties of our estimator and variance estimates. A simulation study is used to examine the finite sample properties of our testing procedure and demonstrate when our proposed approach can outperform the testing approach that ignores heterogeneity. In particular, we demonstrate examples where the test of Parast et al. (2019)7 provides an incorrect estimate with respect to the treatment effect. We illustrate our approach using data from an AIDS clinical trial to test for a treatment effect using CD4 count as a surrogate marker for plasma HIV-1 RNA.

2 |. TESTING PROCEDURE

2.1 |. Notation and Setting

We focus on a setting where we are currently conducting a study to examine the effect of a treatment on a primary outcome of interest, denoted by $Y$, and we additionally have data available from a prior study. We assume that this prior study was used to examine the strength of the surrogate, denoted by $S$, and heterogeneity in the utility of the surrogate, and that it has measurements of the same $Y$ and $S$ as the current study. Let $Z$ denote the treatment indicator, where treatment is randomized and $Z \in \{0, 1\}$ (i.e., treatment vs. control), and let $W$ denote a baseline covariate such that $S$ has been shown to have heterogeneous utility with respect to this covariate. Without loss of generality, we take $W$ to be continuous; all proposed procedures can easily accommodate a discrete $W$ as well. We focus on a setting with heterogeneity with respect to a single baseline covariate $W$; in Section 3.3, we discuss an extension to multiple $W$. In addition, we assume we are in a setting where either $S$ is measured earlier than $Y$ or $S$ is measured at the same time as $Y$ but is less expensive, invasive or burdensome, and there is no censoring or missing data. Throughout this paper, we quantify surrogate strength/utility using the proportion of the treatment effect on the primary outcome explained by the treatment effect on the surrogate marker.13,3,5 We use potential outcomes notation where each person has potential outcomes $Y^{(1)}, Y^{(0)}, S^{(1)}, S^{(0)}$, where $Y^{(g)}$ is the outcome when $Z = g$ and $S^{(g)}$ is the surrogate when $Z = g$. Observed data from the current study are denoted by $\mathcal{D} = \{(Y_{gi}, S_{gi}, W_{gi}),\ i = 1, \dots, n_g;\ g = 0, 1\}$, where $n_g$ denotes the number of individuals in treatment group $g$.

The goal in the current study is to test for a treatment effect on the primary outcome quantified as

$$H_0: \Delta \equiv E(Y^{(1)} - Y^{(0)}) = E(Y^{(1)}) - E(Y^{(0)}) = 0.$$

Our aim is to leverage information from the prior study to test $H_0$ using surrogate marker information in order to reduce study follow-up time, costs, and/or participant burden, i.e., making inference on $\Delta$ without using $\{Y_{gi},\ i = 1, \dots, n_g;\ g = 0, 1\}$. We use a superscript $p$ to denote “prior” when referring to data or quantities from the prior study. For example, we denote observed data from the prior study by $\mathcal{D}^p = \{(Y_{gi}^p, S_{gi}^p, W_{gi}^p),\ i = 1, \dots, n_g^p;\ g = 0, 1\}$, where $n_g^p$ is the sample size of treatment group $g$.

2.2 |. Assumptions

Given that our setting rests on the existence of a valid surrogate marker, we first define S to be a valid surrogate marker for Y if the following conditions hold:

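  • (C1) $E(Y^{(0)} \mid S^{(0)} = s, W = w)$ is a monotone function of $s$;

  • (C2) $P(S^{(1)} > s \mid W = w) \ge P(S^{(0)} > s \mid W = w)$ for all $s$ and $w$;

  • (C3) $E(Y^{(1)} \mid S^{(1)} = s, W = w) \ge E(Y^{(0)} \mid S^{(0)} = s, W = w)$ for all $s$ and $w$;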

  • (C4) A large proportion of the treatment effect on the primary outcome can be explained by the treatment effect on the surrogate marker for all w.

Assumptions (C1)-(C3) are parallel to those required in Wang and Taylor (2002)3 and Parast et al. (2017)14 and protect against the surrogate paradox situation.15 Assumption (C1) implies that the surrogate marker is either “positively” or “negatively” related to the time of the primary outcome, (C2) implies that there is a positive treatment effect on the surrogate marker, and (C3) implies that there is a non-negative residual treatment effect beyond that on the surrogate marker. Assumptions (C1)-(C3) together guarantee that $E(Y^{(1)} \mid W = w) \ge E(Y^{(0)} \mid W = w)$ for all $w$ in the support of $W$ (see Appendix B). Lastly, (C4) states that the proportion of the treatment effect explained by the surrogate marker must be large and guarantees the strength of the surrogate marker of interest for all individuals in the study. While this is somewhat vague, there is no agreed upon value that signifies a “large” proportion, though previous work has tended to view values of 0.6-0.75 or higher as large.16,13,17 If the existing heterogeneity is such that the surrogate is strong for some $w$ and weak for other $w$, it should not be used as a replacement of the primary outcome for all individuals in a future study. Instead, one may consider using the surrogate as a replacement only among those with a $W$ where the surrogate is strong; we discuss this further in the Discussion.

In order to ensure that the proposed test statistic, to be described in Section 2.3, has a reasonable interpretation with respect to $\Delta$, we also require:

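  • (C5) $E(Y^{(0)} \mid S^{(0)} = s, W = w) = E(Y^{(0p)} \mid S^{(0p)} = s, W^p = w)$ for all $s$ and $w$;

  • (C6) $E(Y^{(0p)} \mid S^{(0p)} = s, W^p = w)$ is estimable for any $(s, w) \in \Omega_J$, where $\Omega_J$ is the common compact support for both $(S^{(g)}, W^{(g)})$ in $g = 0, 1$.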

Assumption (C5) implies that in the control groups, the current study and the prior study share the same conditional expectation for $Y$ given $S$ and $W$. This assumption is reasonable when, for example, the control condition in both studies is the same, such as “usual care.” Importantly, such an assumption is not required to hold for the treatment groups, and it relaxes the requirement that the distribution of $Y$ conditional on $S$ be transportable from the prior to the current study. Even so, this assumption is admittedly very strong and needs to be carefully considered before using this approach; however, any testing procedure that attempts to borrow information from a prior study to test a hypothesis in a future study is going to require some type of strong transportability assumption. If there is reason to believe that such transportability between studies is not appropriate, then the prior study should not be considered for informing the future study. Assumption (C6) ensures that we can approximate $E(Y^{(0)} \mid S^{(0)} = s, W^{(0)} = w)$ for all observed pairs of $S^{(g)}$ and $W^{(g)}$, $g = 0, 1$, in the current study. We discuss robustness to these assumptions as well as additional assumptions needed for a causal interpretation in Appendix B.

2.3 |. Proposed Testing Procedure

Recall that our aim is to take advantage of information from the prior study to test H0 using surrogate marker information such that this test accounts for known heterogeneity in the utility of the surrogate marker. To achieve this goal we note that Δ can be expressed as:

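$$\Delta = E(Y^{(1)}) - E(Y^{(0)}) = \int \Delta(w)\, dF_W(w) = \int \left[ \int \mu_1(s, w)\, dF^{(1)}(s \mid w) \right] dF_W(w) - \int \left[ \int \mu_0(s, w)\, dF^{(0)}(s \mid w) \right] dF_W(w) \tag{1}$$

where $\mu_g(s, w) \equiv E(Y^{(g)} \mid S^{(g)} = s, W = w)$, $F^{(g)}(s \mid w) \equiv F_{S^{(g)} \mid W}(s \mid w)$ is the conditional cumulative distribution function of $S^{(g)}$ given $W = w$, and $F_W(w)$ is the cumulative distribution function of $W$. In expressing $\Delta$ as (1), we have simply used a conditional expectation to incorporate $S$ and $W$ into our expression. Expressing $\Delta$ in this way motivates the following earlier treatment effect definition: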

$$\Delta_H = \int \left[ \int \mu_0(s, w)\, dF^{(1)}(s \mid w) \right] dF_W(w) - \int \left[ \int \mu_0(s, w)\, dF^{(0)}(s \mid w) \right] dF_W(w) \tag{2}$$
$$= \int \mu_0^p(s, w)\, dF^{(1)}(s, w) - \int \mu_0^p(s, w)\, dF^{(0)}(s, w) \tag{3}$$

where $F^{(g)}(s, w)$ is the cumulative distribution function of $(S^{(g)}, W)$ in the current study. The only change in going from (1) to (2) is that we have replaced $\mu_1(s, w)$ with $\mu_0(s, w)$ in the first term, which will ensure that this quantity provides a lower bound on the treatment effect. In the second equality, (3), we replace $\mu_0(s, w)$ with $\mu_0^p(s, w)$, which follows from Assumption (C5). The expression (3) is now a quantity that only involves $\mu_0^p(s, w)$, which is the conditional risk in the prior study, and the distribution of $S$ and $W$ in the current study. Importantly, the expression does not involve $Y$ from the current study at all. In practice, $\mu_0^p(s, w)$ is unknown and must be replaced with an estimate, $\hat{\mu}_0^p(s, w)$, which we describe in Section 3.1. Because of this, we define the following earlier average treatment effect quantity, where the tilde notation makes the dependence on information from the prior study explicit:

$$\tilde{\Delta}_H = \int \hat{\mu}_0^p(s, w)\, dF^{(1)}(s, w) - \int \hat{\mu}_0^p(s, w)\, dF^{(0)}(s, w) = E\left\{ \hat{\mu}_0^p(S^{(1)}, W) - \hat{\mu}_0^p(S^{(0)}, W) \mid \mathcal{D}^p \right\}.$$

This quantity, $\tilde{\Delta}_H$, measures the treatment effect on a transformation of the surrogate marker and baseline covariate, i.e., the difference between $\hat{\mu}_0^p(S^{(1)}, W)$ and $\hat{\mu}_0^p(S^{(0)}, W)$. First, due to randomization, $W$ has the same distribution in the two treatment groups and $\tilde{\Delta}_H$ has an appealing causal interpretation reflecting the treatment effect on the surrogate marker. Second, $\tilde{\Delta}_H$ represents the part of the treatment effect on the primary outcome explained by the surrogate marker and is an approximation to $\Delta_H$, which is the quantity of our primary interest. Under the null hypothesis of no average treatment effect on the primary outcome, there will also be no average treatment effect in any subgroup of patients with $W = w$ (see Appendix B). Under the null, Assumptions (C1)-(C3) imply that $S^{(1)} \mid W = w$ has the same distribution as $S^{(0)} \mid W = w$ for all $w$ in the support of $W$, and thus $\tilde{\Delta}_H = 0$. Therefore, we may formally define our test statistic for $H_0$ based on the early average treatment effect as $Z_H = \sqrt{n}\, \hat{\Delta}_H / \hat{\sigma}_H$, where $\hat{\Delta}_H$ is a root-$n$ consistent estimate of $\tilde{\Delta}_H$ and $\hat{\sigma}_H^2$ is the estimated variance of $\sqrt{n}(\hat{\Delta}_H - \tilde{\Delta}_H)$. We reject $H_0$ when $|Z_H|$ is large. In Section 3, we propose robust procedures to construct $\hat{\Delta}_H$ and $\hat{\sigma}_H$. This is a valid test for both the null $H_{0H}: \tilde{\Delta}_H = 0$ and the null $H_0: \Delta = 0$.

One important merit of constructing the test statistic based on an estimator of $\tilde{\Delta}_H$ is that this earlier average treatment effect is, with probability tending to one, no larger than the treatment effect we would obtain using the true conditional expectations within each treatment group. That is, $P(\tilde{\Delta}_H \le \Delta) \to 1$ and thus $\tilde{\Delta}_H$ is a conservative measure of the average treatment effect, $\Delta$. Importantly, this early treatment effect and associated test account for heterogeneity in the utility of the surrogate by explicitly utilizing a conditional mean function that depends on $W$. In the following section we describe other tests that may be considered; in our numerical studies, we compare our approach with these alternatives.
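The lower-bound property can be checked in one line at the population level. The following sketch subtracts (2) from (1); the second terms of the two expressions are identical and cancel, and the remaining integrand is non-negative by Assumption (C3). It is a reading aid rather than a full proof: it works with the true $\mu_0$ and does not address the estimation error in the prior-study fit, which is why the statement in the text holds only with probability tending to one.

```latex
\begin{aligned}
\Delta - \Delta_H
  &= \int \Big[ \int \mu_1(s,w)\, dF^{(1)}(s \mid w) \Big] dF_W(w)
   - \int \Big[ \int \mu_0(s,w)\, dF^{(1)}(s \mid w) \Big] dF_W(w) \\
  &= \int\!\!\int \big\{ \mu_1(s,w) - \mu_0(s,w) \big\}\, dF^{(1)}(s \mid w)\, dF_W(w)
  \;\ge\; 0 .
\end{aligned}
```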

2.4 |. Alternative Testing Approaches

We consider two alternative tests that would be reasonable options for testing H0 in this setting. The first quite obvious approach is simply to assume the primary outcome is measured in the current study and use primary outcome information to estimate Δ and conduct a t-test of H0:Δ=0. This reflects the gold standard as it directly tests the hypothesis we are interested in. Importantly though, the whole point of this setting is to provide a way to not have to measure the primary outcome. We include this option so that we can compare to this gold standard.
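As a point of reference, the gold-standard comparison can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of Appendix C: the function name `primary_outcome_test` is hypothetical, the standard error uses unpooled sample variances, and the normal critical value is used as a large-sample stand-in for the t quantile.

```python
import numpy as np
from statistics import NormalDist

def primary_outcome_test(Y1, Y0, alpha=0.05):
    """Two-sample test of H0: Delta = 0 using the primary outcome only.

    Hypothetical helper for illustration; assumes independent arms and
    uses a normal approximation to the t reference distribution.
    """
    Y1 = np.asarray(Y1, dtype=float)
    Y0 = np.asarray(Y0, dtype=float)
    delta_hat = Y1.mean() - Y0.mean()                 # difference in means
    se = np.sqrt(Y1.var(ddof=1) / len(Y1) + Y0.var(ddof=1) / len(Y0))
    z = delta_hat / se                                # standardized effect
    crit = NormalDist().inv_cdf(1 - alpha / 2)        # two-sided critical value
    return delta_hat, se, z, bool(abs(z) > crit)
```

The returned ratio `z` corresponds to the effect size (point estimate divided by its standard error) that the simulation study reports for each test.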

The second alternative test we examine is one which uses information from the prior study about the relationship between the surrogate and the primary outcome, but does not account for heterogeneity. This test is an extension of a test proposed in Parast et al. (2019)7 which was developed for the time-to-event outcome setting. Our description of it here, for a non-survival setting, is new and will be useful in practice for those analyzing a non-survival study in a setting with no heterogeneity in the utility of the surrogate. Similar to our proposed test, but without regard for $W$, we note that $\Delta = \int \mu_1(s)\, dF^{(1)}(s) - \int \mu_0(s)\, dF^{(0)}(s)$, where $\mu_g(s) = E(Y^{(g)} \mid S^{(g)} = s)$, which motivates the following earlier treatment effect definition:

$$\Delta_P = \int \mu_0(s)\, dF^{(1)}(s) - \int \mu_0(s)\, dF^{(0)}(s) = \int \mu_0^p(s)\, dF^{(1)}(s) - \int \mu_0^p(s)\, dF^{(0)}(s)$$

where $\mu_0^p(s) \equiv E(Y^{(0p)} \mid S^{(0p)} = s)$. Since $\mu_0^p(s)$ is unknown, we approximate $\Delta_P$ with

$$\tilde{\Delta}_P = \int \hat{\mu}_0(s)\, dF^{(1)}(s) - \int \hat{\mu}_0(s)\, dF^{(0)}(s) = \int \hat{\mu}_0^p(s)\, dF^{(1)}(s) - \int \hat{\mu}_0^p(s)\, dF^{(0)}(s),$$

where $\hat{\mu}_0^p(s)$ is a consistent estimator of $\mu_0^p(s)$. As with the proposed test, this early treatment effect quantity replaces $\mu_g(s)$ with $\hat{\mu}_0(s)$ for both treatment groups, which will ensure it is a lower bound on $\Delta$ under certain conditions. This test, however, requires the assumption that $\hat{\mu}_0^p(s) \approx \mu_0^p(s) = \mu_0(s)$, i.e., that this conditional expectation in the control group is the same in the current study as in the prior study. It is important to note that this assumption may not hold when there is heterogeneity in the utility of the surrogate marker. To test $H_0: \Delta = 0$, we instead test $H_{0P}: \tilde{\Delta}_P = 0$ and define the test statistic for $H_{0P}$ based on the early treatment effect as $Z_P = \sqrt{n}\, \hat{\Delta}_P / \hat{\sigma}_P$, where $\hat{\Delta}_P$ is a root-$n$ consistent estimate of $\tilde{\Delta}_P$ and $\hat{\sigma}_P^2$ is the estimated variance of $\sqrt{n}(\hat{\Delta}_P - \tilde{\Delta}_P)$. We reject $H_{0P}$ and $H_0$ when $|Z_P|$ is large.

In Appendix C, we discuss estimation and testing for $\Delta$ using the primary outcome, propose estimation procedures to obtain $\hat{\Delta}_P$ and $\hat{\sigma}_P$, and discuss why we do not consider directly testing the surrogate. Intuitively, we would expect that both our proposed test and this test based on $\tilde{\Delta}_P$ should work well when there is no heterogeneity. When there is heterogeneity, we expect that the test based on $\tilde{\Delta}_P$ (or even $\Delta_P$) could lead to erroneous conclusions about the treatment effect and/or have less power than the proposed test.

3 |. ESTIMATION AND INFERENCE

3.1 |. Estimation of Proposed Δ~H

For our proposed testing procedure, we first define

$$\hat{\mu}_0^p(s, w) = \frac{\sum_{i=1}^{n_0^p} K_{h_2}(S_{0i}^p - s)\, K_{h_3}(W_{0i}^p - w)\, Y_{0i}^p}{\sum_{i=1}^{n_0^p} K_{h_2}(S_{0i}^p - s)\, K_{h_3}(W_{0i}^p - w)}, \quad \text{and}$$
$$\hat{m}_g(w; \mu(\cdot, \cdot)) = \frac{\sum_{i=1}^{n_g} K_{h_g}(W_{gi} - w)\, \mu(S_{gi}, W_{gi})}{\sum_{i=1}^{n_g} K_{h_g}(W_{gi} - w)},$$

as nonparametric smoothed estimators of the conditional expectation of $Y^{(0)}$ given $(S^{(0)}, W) = (s, w)$ in the prior study, and the conditional expectation of $\mu(S^{(g)}, W)$ given $W = w$ and a bivariate function $\mu(\cdot, \cdot)$ in the current study, respectively. Here, $K_h(\cdot) = K(\cdot / h)/h$, $K(\cdot)$ is a smooth symmetric density function with finite support, $h_0, h_1, h_2, h_3$ are specified bandwidths which may be data dependent, and $n_0^p$ denotes the sample size of group $Z = 0$ in the prior study. We utilize undersmoothing and select all bandwidths throughout to be of order $O(n^{-\epsilon})$, $\epsilon \in (1/4, 1/2)$, where $n = n_1 + n_0$, to eliminate the asymptotic bias and thereby avoid the need for bias correction in subsequent statistical inference.
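The two smoothers above are Nadaraya-Watson-type estimators and can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions: the Epanechnikov kernel is our choice of a symmetric density with finite support (the text does not name a kernel), the function names `mu0p_hat` and `m_hat` are hypothetical, and the bandwidths are passed in by the caller rather than selected by any particular rule.

```python
import numpy as np

def epanechnikov(u):
    """Symmetric density supported on [-1, 1] (an assumed kernel choice)."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def k_h(u, h):
    """Scaled kernel K_h(u) = K(u / h) / h."""
    return epanechnikov(u / h) / h

def _ratio(numer, denom):
    # Return numer / denom, or NaN where no observation falls in the window.
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, numer / safe, np.nan)

def mu0p_hat(s, w, S0p, W0p, Y0p, h2, h3):
    """Product-kernel estimate of E(Y | S = s, W = w) from prior-study controls."""
    s = np.atleast_1d(s).astype(float)
    w = np.atleast_1d(w).astype(float)
    # wts[j, i] is the weight of prior observation i at evaluation point j
    wts = k_h(S0p[None, :] - s[:, None], h2) * k_h(W0p[None, :] - w[:, None], h3)
    return _ratio(wts @ Y0p, wts.sum(axis=1))

def m_hat(w, Wg, mu_vals, hg):
    """Kernel estimate of E{mu(S, W) | W = w} within one arm of the current study.

    mu_vals holds mu(S_gi, W_gi) for the observations in that arm.
    """
    w = np.atleast_1d(w).astype(float)
    wts = k_h(Wg[None, :] - w[:, None], hg)
    return _ratio(wts @ mu_vals, wts.sum(axis=1))
```

Evaluation points with no prior-study observations inside the kernel window return NaN, which is one simple way to flag a violation of the common-support requirement in Assumption (C6).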

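A very straightforward estimate of $\tilde{\Delta}_H$ would be

$$n_1^{-1} \sum_{i=1}^{n_1} \hat{\mu}_0^p(S_{1i}, W_{1i}) - n_0^{-1} \sum_{i=1}^{n_0} \hat{\mu}_0^p(S_{0i}, W_{0i}) \tag{4}$$

which simply takes our estimated conditional mean function from the prior study and applies it to data in the current study. However, it is possible for us to improve upon this estimator in terms of efficiency. To do this, we note that

$$\tilde{\Delta}_H = E\left[ E\left( \hat{\mu}_0^p(S^{(1)}, W) \mid W \right) \right] - E\left[ E\left( \hat{\mu}_0^p(S^{(0)}, W) \mid W \right) \right] \approx E\left[ \hat{m}_1(W; \hat{\mu}_0^p) \right] - E\left[ \hat{m}_0(W; \hat{\mu}_0^p) \right],$$

and thus we now consider an estimate of $\tilde{\Delta}_H$ as

$$n_1^{-1} \sum_{i=1}^{n_1} \hat{m}_1(W_{1i}; \hat{\mu}_0^p) - n_0^{-1} \sum_{i=1}^{n_0} \hat{m}_0(W_{0i}; \hat{\mu}_0^p), \tag{5}$$

which is asymptotically equivalent to (4). Note that this estimate only uses $S^{(g)}$ and $W$ data from the current study (no $Y$ data from the current study) and $\hat{\mu}_0^p(s, w)$, which in turn depends on $(S^{(0p)}, W^p, Y^{(0p)})$ data in group $Z = 0$ from the previous study.

While either (4) or (5) would be a consistent estimate of $\tilde{\Delta}_H$, we utilize the fact that the distributions of $W$ from the two treatment arms are identical due to randomization and construct the estimator:

$$\hat{\Delta}_H = \frac{1}{n_1 + n_0} \left\{ \left[ \sum_{i=1}^{n_0} \hat{m}_1(W_{0i}; \hat{\mu}_0^p) + \sum_{i=1}^{n_1} \hat{m}_1(W_{1i}; \hat{\mu}_0^p) \right] - \left[ \sum_{i=1}^{n_0} \hat{m}_0(W_{0i}; \hat{\mu}_0^p) + \sum_{i=1}^{n_1} \hat{m}_0(W_{1i}; \hat{\mu}_0^p) \right] \right\}. \tag{6}$$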

We show in Appendix D that (6) improves upon the efficiency of (5). Essentially, $\hat{\Delta}_H$ is equivalent to an augmented version of the simple estimator (described below), taking advantage of the independence of $W$ and treatment, since treatment was randomized.

In Appendix D we show that conditional on $\hat{\mu}_0^p(\cdot, \cdot)$, $\hat{\Delta}_H$ is a consistent estimate of $\tilde{\Delta}_H$, and that $\sqrt{n}\{\hat{\Delta}_H - \tilde{\Delta}_H\}$ weakly converges to a mean zero normal distribution as $n \to \infty$. A consistent estimate of the conditional variance of $\hat{\Delta}_H$ given the prior study, $\sigma_H^2$, can be obtained as

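$$\hat{\sigma}_H^2 = \frac{1}{n_1^2} \sum_{i=1}^{n_1} \left( \tilde{S}_{1i} - \pi_0 \hat{m}_1(W_{1i}; \hat{\mu}_0^p) - \pi_1 \hat{m}_0(W_{1i}; \hat{\mu}_0^p) - \pi_1 \hat{\Delta}_H \right)^2 + \frac{1}{n_0^2} \sum_{i=1}^{n_0} \left( \tilde{S}_{0i} - \pi_0 \hat{m}_1(W_{0i}; \hat{\mu}_0^p) - \pi_1 \hat{m}_0(W_{0i}; \hat{\mu}_0^p) + \pi_0 \hat{\Delta}_H \right)^2$$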

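where $\pi_g = n_g / n$ and $\tilde{S}_{gi} = \hat{\mu}_0^p(S_{gi}, W_{gi})$. Our testing procedure uses the test statistic $Z_H = \hat{\Delta}_H / \hat{\sigma}_H$ and rejects the null hypothesis when $|Z_H| > \Phi^{-1}(1 - \alpha/2)$. As $n_0^p \to \infty$, $\tilde{\Delta}_H - \Delta_H = o_p(1)$ and $\tilde{\Delta}_H$ can be viewed as a consistent estimator of $\Delta_H$. More importantly, under Assumptions (C1), (C2), (C3) and (C5), $P(\tilde{\Delta}_H \le \Delta) \to 1$ as $n \to \infty$, indicating that the test for $\tilde{\Delta}_H = 0$ is a valid test for $\Delta = 0$ with probability approaching 1 as the sample size of the prior study increases to infinity.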
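Putting the pieces of this section together, the estimator (6), its variance estimate, and the resulting test can be sketched as below. This is a hedged illustration, not the authors' implementation: `surrogate_test` and its helpers are hypothetical names, the fitted prior-study function is passed in as a vectorized callable `mu0p`, the Epanechnikov kernel is an assumed choice, and the residuals in the variance step are written so that each has mean approximately zero within its arm, which is our reading of the variance formula. The sketch also assumes every covariate value has neighbours from both arms within the bandwidth.

```python
import numpy as np
from statistics import NormalDist

def _kern(u):
    # Epanechnikov kernel: symmetric density on [-1, 1] (illustrative choice)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def _nw(x_eval, x_obs, y_obs, h):
    # One-dimensional Nadaraya-Watson smoother evaluated at x_eval
    wts = _kern((x_obs[None, :] - x_eval[:, None]) / h)
    return (wts @ y_obs) / wts.sum(axis=1)

def surrogate_test(mu0p, S1, W1, S0, W0, h1, h0, alpha=0.05):
    """Test H0 using only (S, W) from the current study.

    mu0p: callable (s, w) -> fitted prior-study control mean, vectorized.
    Returns (delta_hat, se, z, reject).
    """
    n1, n0 = len(S1), len(S0)
    n = n1 + n0
    pi1, pi0 = n1 / n, n0 / n
    St1, St0 = mu0p(S1, W1), mu0p(S0, W0)      # transformed surrogates S-tilde
    W_all = np.concatenate([W1, W0])
    m1_all = _nw(W_all, W1, St1, h1)           # m-hat_1 at every W, both arms
    m0_all = _nw(W_all, W0, St0, h0)           # m-hat_0 at every W, both arms
    delta = float(np.mean(m1_all - m0_all))    # estimator (6)
    # Arm-specific residuals, centered so each has mean close to zero
    r1 = St1 - pi0 * m1_all[:n1] - pi1 * m0_all[:n1] - pi1 * delta
    r0 = St0 - pi0 * m1_all[n1:] - pi1 * m0_all[n1:] + pi0 * delta
    se = float(np.sqrt(np.sum(r1**2) / n1**2 + np.sum(r0**2) / n0**2))
    z = delta / se
    reject = bool(abs(z) > NormalDist().inv_cdf(1 - alpha / 2))
    return delta, se, z, reject
```

Note that no primary outcome data from the current study enter the function; the prior study contributes only through `mu0p`, which is treated as fixed.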

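Remark. The efficiency of the simple estimator

$$n_1^{-1} \sum_{i=1}^{n_1} \hat{m}_1(W_{1i}; \hat{\mu}_0^p) - n_0^{-1} \sum_{i=1}^{n_0} \hat{m}_0(W_{0i}; \hat{\mu}_0^p) \approx n_1^{-1} \sum_{i=1}^{n_1} \hat{\mu}_0^p(S_{1i}, W_{1i}) - n_0^{-1} \sum_{i=1}^{n_0} \hat{\mu}_0^p(S_{0i}, W_{0i}),$$

can be improved by considering the fact that $E\{m(W_{1i}; \hat{\mu}_0^p)\} = E\{m(W_{0i}; \hat{\mu}_0^p)\}$ for any transformation $m(\cdot)$ due to randomization. Specifically, one may consider a new class of consistent estimators indexed by $m(\cdot): R \to R$,

$$\left\{ n_1^{-1} \sum_{i=1}^{n_1} \left[ \hat{\mu}_0^p(S_{1i}, W_{1i}) - m(W_{1i}; \hat{\mu}_0^p) \right] - n_0^{-1} \sum_{i=1}^{n_0} \left[ \hat{\mu}_0^p(S_{0i}, W_{0i}) - m(W_{0i}; \hat{\mu}_0^p) \right] \right\}.$$

The optimal choice of $m(\cdot)$ minimizing the asymptotic variance is

$$m_{opt}(w) = \pi_0 E\left( \hat{\mu}_0^p(S_1, w) \mid W_1 = w \right) + \pi_1 E\left( \hat{\mu}_0^p(S_0, w) \mid W_0 = w \right).$$

In practice, $m_{opt}(w)$ can be consistently estimated by $\hat{m}_{opt}(w) = \pi_0 \hat{m}_1(w; \hat{\mu}_0^p) + \pi_1 \hat{m}_0(w; \hat{\mu}_0^p)$. Denote the resulting estimator of $\tilde{\Delta}_H$ by

$$\hat{\Delta}_H^{AUG} = n_1^{-1} \sum_{i=1}^{n_1} \left[ \hat{\mu}_0^p(S_{1i}, W_{1i}) - \hat{m}_{opt}(W_{1i}; \hat{\mu}_0^p) \right] - n_0^{-1} \sum_{i=1}^{n_0} \left[ \hat{\mu}_0^p(S_{0i}, W_{0i}) - \hat{m}_{opt}(W_{0i}; \hat{\mu}_0^p) \right].$$

In Appendix D we show that conditional on $\hat{\mu}_0^p(\cdot, \cdot)$, $\hat{\Delta}_H^{AUG}$ is a consistent estimate of $\tilde{\Delta}_H$ and that $\sqrt{n}(\hat{\Delta}_H^{AUG} - \tilde{\Delta}_H)$ weakly converges to a mean zero normal distribution as $n \to \infty$. The conditional variance of $\hat{\Delta}_H^{AUG} \mid \hat{\mu}_0^p(\cdot, \cdot)$, $\sigma_{AUG}^2$, can be consistently estimated by

$$\hat{\sigma}_{AUG}^2 = \frac{1}{n_1^2} \sum_{i=1}^{n_1} \left[ \hat{\mu}_0^p(S_{1i}, W_{1i}) - \hat{m}_1(W_{1i}; \hat{\mu}_0^p) \right]^2 + \frac{1}{n_0^2} \sum_{i=1}^{n_0} \left[ \hat{\mu}_0^p(S_{0i}, W_{0i}) - \hat{m}_0(W_{0i}; \hat{\mu}_0^p) \right]^2 + \frac{\pi_1^2}{n_1^2} \sum_{i=1}^{n_1} \left[ \hat{m}_1(W_{1i}; \hat{\mu}_0^p) - \hat{m}_0(W_{1i}; \hat{\mu}_0^p) - \hat{\Delta}_H \right]^2 + \frac{\pi_0^2}{n_0^2} \sum_{i=1}^{n_0} \left[ \hat{m}_1(W_{0i}; \hat{\mu}_0^p) - \hat{m}_0(W_{0i}; \hat{\mu}_0^p) - \hat{\Delta}_H \right]^2.$$

In Appendix D, we show that $\hat{\Delta}_H^{AUG}$ is asymptotically equivalent to our proposed $\hat{\Delta}_H$ and $\hat{\sigma}_H / \hat{\sigma}_{AUG} = 1 + o_p(1)$.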

3.2 |. Inference

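To construct a confidence interval for $\tilde{\Delta}_H$ we use our estimated variance $\hat{\sigma}_H^2$ and define a $100(1 - \alpha)\%$ confidence interval as $\hat{\Delta}_H \pm z_{1 - \alpha/2}\, \hat{\sigma}_H$, where $z_{1 - \alpha/2}$ is the $(1 - \alpha/2)$ quantile of the standard normal distribution. We examine the empirical performance of our proposed estimation procedure, variance estimation, confidence interval construction, and testing procedure in Section 4.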

It is important to note that we consider the prior study, the study from which we estimate the conditional mean function $\hat{\mu}_0^p(s, w)$, as fixed. This is a reasonable assumption given that in practice, there is truly some previously conducted prior study which one is using to inform testing in the current study. However, one could argue that this prior study should be considered random and that all inference should be derived as such. In such a case, our point estimate $\hat{\Delta}_H$ would remain the same but the standard error estimation and confidence interval construction would be more complex.

3.3 |. Multiple Baseline Covariates

While in this paper we focus only on heterogeneity with respect to a single baseline covariate, it may be the case that there is heterogeneity with respect to multiple baseline covariates. In such a case, one still can consider a straightforward estimator for the treatment effect using surrogate marker and baseline covariates:

$$n_1^{-1} \sum_{i=1}^{n_1} \hat{\mu}_{0m}^p(S_{1i}, \mathbf{W}_{1i}) - n_0^{-1} \sum_{i=1}^{n_0} \hat{\mu}_{0m}^p(S_{0i}, \mathbf{W}_{0i})$$

where $\hat{\mu}_{0m}^p(s, \mathbf{w})$ is an estimator of $\mu_0(s, \mathbf{w}) \equiv E(Y^{(0)} \mid S^{(0)} = s, \mathbf{W} = \mathbf{w})$ and $\mathbf{W}$ is a baseline covariate vector of interest (including an intercept term, with a slight abuse of notation). The difficulty is that fully nonparametric estimation of $\mu_0(s, \mathbf{w})$ will likely be infeasible for practical sample sizes with a vector $\mathbf{W}$ of moderate dimension, e.g., $\ge 3$. In such a case, one may be willing to consider a parametric or semi-parametric model. For example, an estimator can be obtained based on a simple regression model $\mu_0(s, \mathbf{w}) = g_Y(\beta_0 s + \beta_1^\top \mathbf{w})$, where $g_Y(\cdot)$ is a known, strictly increasing link function and $\beta_0$ and $\beta_1$ are unknown regression coefficients to be estimated based on the prior study. Alternatively, one could consider a more flexible varying coefficient model for $\mu_0^p(s, \mathbf{w})$ such as $\mu_0(s, \mathbf{w}) = g_Y(B(s)^\top \mathbf{w})$, where $B(s) = (\beta_1(s), \beta_2(s), \dots, \beta_L(s))^\top$ and each $\beta_l(s)$ is an unknown smooth function of $s$ to be estimated nonparametrically. This modeling approach would allow complex interactions between $S$ and $\mathbf{W}$. Here, we use the additional subscript $m$ in $\hat{\mu}_{0m}^p(\cdot, \cdot)$ to emphasize the fact that this estimator of $\mu_0(\cdot, \cdot)$ will now be fully or partially dependent on model assumptions, i.e., model-based. Certainly, given this model dependence, robustness (or lack thereof) to model misspecification would need to be carefully considered when using this approach in practice.
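The simplest parametric version of this idea can be sketched as follows. The sketch assumes an identity link $g_Y$, so that the regression coefficients can be fit by ordinary least squares on the prior-study control arm; the function names `fit_linear_mu0` and `plug_in_effect` are hypothetical, and a non-identity link would require a generalized linear model fit instead.

```python
import numpy as np

def fit_linear_mu0(S0p, W0p, Y0p):
    """Fit mu_0(s, w) = beta_0 * s + beta_1' w on prior-study controls.

    W0p is an (n, q) array whose first column is the intercept.
    Identity link is an assumption made for this illustration.
    """
    X = np.column_stack([S0p, W0p])
    beta, *_ = np.linalg.lstsq(X, Y0p, rcond=None)

    def mu(S, W):
        # Apply the fitted control-arm mean function to new (S, W) data
        return np.column_stack([S, W]) @ beta

    return mu

def plug_in_effect(mu, S1, W1, S0, W0):
    """Model-based analogue of the straightforward plug-in estimator."""
    return float(mu(S1, W1).mean() - mu(S0, W0).mean())
```

Because the fitted function is applied to both arms of the current study, any misspecification of the linear form carries over directly to the estimated effect, which is the robustness concern raised above.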

4 |. SIMULATION STUDY

4.1 |. Simulation Goals and Setup

The two main goals of our simulation study were: 1) to examine the finite sample properties of our estimation procedure for $\tilde{\Delta}_H$ in terms of bias, accuracy of our variance calculation, and coverage of constructed confidence intervals, and 2) to compare testing results based on the three different testing quantities: $\hat{\Delta}$ (using the primary outcome, gold standard) vs. $\hat{\Delta}_P$ (using the surrogate marker, ignoring heterogeneity) vs. $\hat{\Delta}_H$ (using the surrogate marker, accounting for heterogeneity). For the testing results, we focus on the point estimates themselves, the resulting effect sizes (point estimate/standard error estimate), and power. Importantly, when there is heterogeneity, we do not necessarily aim to demonstrate improved power with our proposed approach but rather to demonstrate settings where the testing procedure using $\hat{\Delta}_P$ (using the surrogate marker, ignoring heterogeneity) can be incorrect.

To achieve these goals, we examined eight simulation settings. For all settings, results were summarized over 500 replications; we examined all settings with $(n_1^p, n_0^p) = (1000, 800)$ (sample sizes in prior study) and $(n_1, n_0) = (300, 300)$ (sample sizes in current study). All simulation settings were also repeated with $(n_1^p, n_0^p) = (300, 300)$ (sample sizes in prior study) and $(n_1, n_0) = (300, 300)$; results were similar and are not shown here. In setting 1, we generated data such that there was heterogeneity in the utility of the surrogate with respect to a baseline covariate and the distribution of this baseline covariate was different in the current study compared to the prior study. Specifically, in the prior study, which is fixed in all simulations, $W_{1i}^p \sim U(0, 10)$, $W_{0i}^p \sim U(0, 10)$, $S_{1i}^p \sim \text{gamma}(\text{shape} = 2.78, \text{scale} = 2.78)$, and $S_{0i}^p \sim \text{gamma}(\text{shape} = 2.5, \text{scale} = 2.5)$. We then generate the outcomes from:

$$Y_{1i}^p = I(W_{1i}^p < 5)(3.5 + 5 S_{1i}^p) + I(W_{1i}^p \ge 5)(16 S_{1i}^p) + N(0, 16),$$
$$Y_{0i}^p = I(W_{0i}^p < 5)(3.2 + 4 S_{0i}^p) + I(W_{0i}^p \ge 5)(15.95 S_{0i}^p) + N(0, 16),$$

where throughout $N(a, b)$ indicates a normal distribution with mean $a$ and variance $b$. The motivation behind this setup was (a) to generate a surrogate marker where higher values are desirable and the surrogate level tends to be higher in the treated group, and (b) to generate an outcome where the surrogate marker is positively associated with the outcome but this association is stronger in magnitude in the treated group, reflecting residual treatment effect beyond the surrogate marker. In addition, to induce heterogeneity, we generate data such that the treatment effect on the primary outcome and the association between primary outcome and surrogate marker depend on whether the covariate is less than or greater than 5. With this setup, there was statistically significant heterogeneity in surrogacy based on the test for heterogeneity proposed by Parast et al. (2021); the estimated proportion of treatment effect explained by the surrogate marker was 0.52 for $W_{gi}^p < 5$ and 0.95 for $W_{gi}^p \ge 5$, $g \in \{0, 1\}$. In this setting, $(S_{gi}, Y_{gi}) \mid W_{gi}$ in the current study was generated the same as in the prior study, but $W_{1i}$ and $W_{0i}$ were generated from a $U(0, 4)$ distribution, which is different from the prior study. Note that for all patients in the current study, the surrogate strength is not very strong and thus, we would expect that using the surrogate but ignoring heterogeneity will lead to an overestimation of the treatment effect. While the variability of the primary outcome, $Y_{gi}$, is large in both treatment groups, the size of the treatment effect is large as well. For example, in this setting, our results will show that the average estimated treatment effect on the outcome in the current study is 14.10, and the empirical power of testing the treatment effect is 100% using the primary outcome only.
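The data-generating mechanism for setting 1 can be sketched with NumPy as below. This follows our reading of the distributions and outcome equations stated above; the helper names `outcome` and `simulate_arm` and the random seed are hypothetical choices made only for illustration.

```python
import numpy as np

def outcome(S, W, arm, rng):
    """Primary outcome for one arm; the noise term is N(0, 16), i.e. sd = 4."""
    noise = rng.normal(0.0, 4.0, size=len(S))
    if arm == 1:
        mean = np.where(W < 5, 3.5 + 5.0 * S, 16.0 * S)
    else:
        mean = np.where(W < 5, 3.2 + 4.0 * S, 15.95 * S)
    return mean + noise

def simulate_arm(n, arm, w_low, w_high, rng):
    """Draw (Y, S, W) for n individuals in the given treatment arm."""
    k = 2.78 if arm == 1 else 2.5            # gamma shape and scale are equal
    W = rng.uniform(w_low, w_high, size=n)
    S = rng.gamma(shape=k, scale=k, size=n)
    Y = outcome(S, W, arm, rng)
    return Y, S, W

rng = np.random.default_rng(2022)            # arbitrary seed for the sketch
# Prior study: W ~ U(0, 10), with 1000 treated and 800 control individuals
prior = {g: simulate_arm(n, g, 0.0, 10.0, rng) for g, n in ((1, 1000), (0, 800))}
# Current study: same mechanism given W, but W ~ U(0, 4), 300 per arm
current = {g: simulate_arm(300, g, 0.0, 4.0, rng) for g in (1, 0)}
```

Under these equations, every individual in the current study has $W < 5$, so the population treatment effect is $3.5 + 5(2.78^2) - \{3.2 + 4(2.5^2)\} \approx 13.94$, in line with the average estimate of 14.10 reported for this setting.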

In setting 2, $W_{gi}^p$ and $Y_{gi}^p \mid S_{gi}^p, W_{gi}^p$ in the prior study were generated exactly the same as in setting 1, but $S_{1i}^p \sim \text{gamma}(\text{shape} = 2.66, \text{scale} = 2.66)$ and $S_{0i}^p \sim \text{gamma}(\text{shape} = 2.5, \text{scale} = 2.5)$. The motivation behind this change in the distributions for the surrogate marker is that we aimed to make the treatment effect on both the primary outcome and surrogate marker smaller than in setting 1, in order to explore how the various tests performed when less power would be expected. As in setting 1, there was significant heterogeneity in surrogacy with the estimated proportion of treatment effect explained by the surrogate being 0.39 for $W_{gi}^p < 5$ and 0.90 for $W_{gi}^p \ge 5$. The current study was generated the same as the prior study except that $W_{1i}$ and $W_{0i}$ were generated from a $U(6, 10)$ distribution. In contrast to setting 1, for all patients in the current study, the surrogate is strong and thus, we would expect that using the surrogate but ignoring heterogeneity will lead to an underestimation of the treatment effect. With respect to the size of the treatment effect and empirical power in this setting, our results will show that the average treatment effect on the outcome in the current study is 13.34, and the empirical power of testing the treatment effect is 69% using the primary outcome only.

In setting 3, $(W_{gi}^p, S_{gi}^p)$ in the prior study were generated as in setting 2, but $Y_{1i}^p = I(W_{1i}^p < 5)(3.5 + 5 \times 7) + I(W_{1i}^p \geq 5)\,16 S_{1i}^p + N(0,16)$ and $Y_{0i}^p = I(W_{0i}^p < 5)(3.2 + 4 \times 6.25) + I(W_{0i}^p \geq 5)\,15.95 S_{0i}^p + N(0,16)$. The motivation behind this change in the distributions for $Y$ was to explicitly make the surrogate useless among those with $W_{gi}^p < 5$, i.e., a more extreme version of setting 2. As expected, there was significant surrogacy heterogeneity with the treatment effect on the surrogate marker not explaining any of the treatment effect on the primary outcome among patients with $W_{gi}^p < 5$, and explaining the majority of the treatment effect on the primary outcome among patients with $W_{gi}^p \geq 5$ (proportion explained 0.92). Similar to setting 2, the current study was generated the same as the prior study except that $W_{1i}$ and $W_{0i}$ were generated from a $U(6,10)$ distribution and thus, we expect a potentially larger gain in power using our proposed approach (though again, this is not our primary goal). With respect to the size of the treatment effect and empirical power in this setting, our results will show that the average treatment effect on the primary outcome in the current study is 13.34, and the empirical power of testing the treatment effect is 69% using the primary outcome only, parallel to setting 2.

In setting 4, the prior study was generated exactly the same as in setting 1, and the current study was generated exactly the same as the prior study, i.e., $W_{1i}$ and $W_{0i}$ were generated from a $U(0,10)$ distribution. Here, even though there is heterogeneity as described above for setting 1, since the covariate distribution is the same in prior and current studies, we expect the tests ignoring vs. accounting for heterogeneity to produce similar results. With respect to the size of the treatment effect and empirical power in this setting, our results will show that the average treatment effect on the primary outcome in the current study is 19.12, and the empirical power of testing the treatment effect is 96% using the primary outcome only.

In setting 5, data were generated such that there is no heterogeneity. Specifically, in the prior study, $W_{1i}^p \sim U(0,10)$, $W_{0i}^p \sim U(0,10)$, $S_{1i}^p \sim \mathrm{gamma}(\mathrm{shape}=2.78, \mathrm{scale}=2.78)$, $S_{0i}^p \sim \mathrm{gamma}(\mathrm{shape}=2.5, \mathrm{scale}=2.5)$, $Y_{1i}^p = 3.5 + 5 S_{1i}^p + N(0,1)$, and $Y_{0i}^p = 3.2 + 4 S_{0i}^p + N(0,1)$, independent of the baseline covariate. The proportion of the treatment effect explained by the surrogate in the prior study was 0.47, which is homogeneous in the study population. Data from the current study were distributed the same as for the prior study. The purpose of this setting was to examine how the tests perform when there is no heterogeneity and no difference in distribution from the prior study to the current study. With respect to the size of the treatment effect and empirical power in this setting, our results will show that the average treatment effect on the outcome in the current study is 13.90, and the empirical power of testing the treatment effect is 100% using the primary outcome only.
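
As an illustration of the setting 5 data-generating mechanism, the following Python sketch simulates one dataset and compares the simulated treatment effect on the primary outcome with its analytic value (about 13.94, in line with the average of 13.90 reported above). The function names, the per-arm sample size, and the seed are hypothetical choices made for this sketch and are not taken from the simulation study. Note that $N(0,1)$ is parameterized by its variance, whereas NumPy's `normal` expects a standard deviation.

```python
import numpy as np

def simulate_setting5(n_per_arm, rng):
    """Draw (W, S, Y) for the treated (g=1) and control (g=0) arms under setting 5."""
    # (gamma shape, gamma scale, outcome intercept, outcome slope) per arm
    params = {1: (2.78, 2.78, 3.5, 5.0), 0: (2.5, 2.5, 3.2, 4.0)}
    data = {}
    for g, (shape, scale, intercept, slope) in params.items():
        w = rng.uniform(0.0, 10.0, size=n_per_arm)  # baseline covariate; S and Y do not depend on it
        s = rng.gamma(shape=shape, scale=scale, size=n_per_arm)
        # N(0, 1) has variance 1, so the standard deviation passed to NumPy is 1
        y = intercept + slope * s + rng.normal(0.0, 1.0, size=n_per_arm)
        data[g] = {"W": w, "S": s, "Y": y}
    return data

def true_delta_setting5():
    """Analytic E[Y(1)] - E[Y(0)], using E[gamma] = shape * scale."""
    return (3.5 + 5.0 * 2.78 * 2.78) - (3.2 + 4.0 * 2.5 * 2.5)

rng = np.random.default_rng(2022)  # arbitrary seed, for reproducibility only
sim = simulate_setting5(n_per_arm=200_000, rng=rng)
delta_hat = sim[1]["Y"].mean() - sim[0]["Y"].mean()
print(round(true_delta_setting5(), 3), round(delta_hat, 3))
```

The large per-arm sample size is used only so that the simulated difference in means sits close to the analytic value; it is not the sample size of the simulation study.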

In setting 6, data are generated similar to setting 1 but with lower variability in the primary outcome resulting in a much larger effect size. In the prior study, $W_{1i}^p \sim U(0,10)$, $W_{0i}^p \sim U(0,10)$, $S_{1i}^p \sim \mathrm{gamma}(\mathrm{shape}=3, \mathrm{scale}=3)$, and $S_{0i}^p \sim \mathrm{gamma}(\mathrm{shape}=2.1, \mathrm{scale}=2.2)$. For $W_{1i}^p < 5$ and $W_{0i}^p < 5$, $Y_{1i}^p = 3.5 + 5 S_{1i}^p + N(0,1)$ and $Y_{0i}^p = 1 + 3 S_{0i}^p + N(0,1)$, respectively. For $W_{1i}^p \geq 5$ and $W_{0i}^p \geq 5$, $Y_{1i}^p = 16 S_{1i}^p + N(0,1)$ and $Y_{0i}^p = 15.8 S_{0i}^p + N(0,1)$, respectively. There was substantial heterogeneity in the utility of the surrogate with the proportion of treatment effect explained by the surrogate being 0.67 for $W_{gi}^p < 5$ and 0.98 for $W_{gi}^p \geq 5$. In the current study, $S$ and $Y$ were generated the same as in the prior study, but $W_{1i}$ and $W_{0i}$ were generated from a $U(0,4)$ distribution. As in setting 1, since the surrogate strength is not very strong in the current study, we would expect that using the surrogate but ignoring heterogeneity will lead to an overestimation of the treatment effect. With respect to the size of the treatment effect and empirical power in this setting, our results will show that the average treatment effect on the outcome in the current study is 33.70, and the empirical power of testing the treatment effect is 100% using the primary outcome only.

Settings 7 and 8 reflect a null treatment effect setting and we include them so that we may examine the empirical Type 1 error rate. In both settings, data from the prior study are generated as $W_{gi}^p \sim U(0,10)$, $S_{gi}^p \sim \mathrm{gamma}(\mathrm{shape}=2.5, \mathrm{scale}=2.5)$, and $Y_{gi}^p = 3.2 + 4 S_{gi}^p + N(0,16)$ for $g = 0, 1$. That is, there is neither a treatment effect on the surrogate marker nor a treatment effect on the primary outcome, and $S_{gi}$ and $Y_{gi}$ are positively associated. In setting 7, data in the current study are generated exactly as in the prior study. In setting 8, data in the current study are generated such that $(S_{gi}, Y_{gi}) \mid W_{gi}$ are generated the same as in the prior study, but $W_{gi} \sim U(0,4)$, $g \in \{0,1\}$, i.e., the distribution of the baseline covariate is different in the current study. The purpose of setting 8 is to specifically examine estimation and testing when there is no treatment effect and no heterogeneity, but the current study does have a different patient population compared to the prior study. In both settings, the true treatment effect on the primary outcome is 0 and the empirical Type 1 error of the test using the primary outcome is 0.06. In both settings, there is no empirical evidence that $S$ is an “informative” surrogate marker, and no empirical evidence of heterogeneity in surrogacy, as expected.

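With respect to our bandwidth selection, we let $h_0 = 1.06 \times \min(\sigma_{W_0}, \mathrm{IQR}_0/1.34)\, n_0^{-2/5}$ and $h_1 = 1.06 \times \min(\sigma_{W_1}, \mathrm{IQR}_0/1.34)\, n_1^{-2/5}$, where $\sigma_{W_g}$ and $\mathrm{IQR}_g$ were the empirical standard deviation and inter-quartile range of $W_g$; $h_2 = 2 \times 1.06 \times \min(\sigma_{S_0^p}, \mathrm{IQR}_1/1.34)\,(n_0^p)^{-2/5}$ and $h_3 = 2 \times 1.06 \times \min(\sigma_{W_0^p}, \mathrm{IQR}_2/1.34)\,(n_0^p)^{-2/5}$, where $\sigma_{S_0^p}$ and $\mathrm{IQR}_1$ were the empirical standard deviation and inter-quartile range of $S_0^p$, respectively, and $\sigma_{W_0^p}$ and $\mathrm{IQR}_2$ were the empirical standard deviation and inter-quartile range of $W_0^p$; and $h_4 = 1.06 \times \min(\sigma_{S_0^p}, \mathrm{IQR}_1/1.34)\,(n_0^p)^{-0.31}$.18,7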
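
The bandwidth rule above can be sketched in a few lines of Python. This is a minimal illustration that assumes the sample-size exponents are negative (for example $-2/5$), so that the bandwidth shrinks as the sample grows; the function name `rule_of_thumb_bandwidth` and its arguments are hypothetical and do not come from the authors' software.

```python
import numpy as np

def rule_of_thumb_bandwidth(x, exponent=-2.0 / 5.0, multiplier=1.0):
    """Return multiplier * 1.06 * min(sd, IQR / 1.34) * n ** exponent."""
    x = np.asarray(x, dtype=float)
    n = x.size
    sd = np.std(x, ddof=1)                 # empirical standard deviation
    q75, q25 = np.percentile(x, [75, 25])
    iqr = q75 - q25                        # inter-quartile range
    return multiplier * 1.06 * min(sd, iqr / 1.34) * n ** exponent
```

Under these assumptions, `rule_of_thumb_bandwidth(w0)` would play the role of $h_0$ for a control-arm covariate vector `w0`, and `rule_of_thumb_bandwidth(s0_prior, multiplier=2.0)` the role of $h_2$ for the prior-study control surrogates.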

4.2 |. Simulation Results

Table 1 shows estimation results for $\hat\Delta_H$ for all settings, using our proposed estimation procedure. We examine bias and coverage with respect to both $\tilde\Delta_H$ (fixed prior study) and $\Delta_H$. These results demonstrate good performance with minimal bias, average standard error estimates that are close to the empirical standard error, and coverage of the confidence intervals close to the nominal value of 95%.

TABLE 1.

Estimation results from the simulation study using the proposed procedure to estimate $\tilde\Delta_H$; note that settings 7 and 8 are null settings with no treatment effect; bias and coverage are examined with respect to $\tilde\Delta_H$ (prior study fixed) and $\Delta_H$; Bias~ = bias with respect to $\tilde\Delta_H$, quantified as $|\hat\Delta_H - \tilde\Delta_H|/\tilde\Delta_H$ except for settings 7 and 8 where it is quantified without dividing by $\tilde\Delta_H$; Bias = bias with respect to $\Delta_H$, quantified as $|\hat\Delta_H - \Delta_H|/\Delta_H$ except for settings 7 and 8 where it is quantified without dividing by the truth; ESE = empirical standard error; ASE = average standard error (average of the square root of the closed form variance estimate); Cov~ = coverage of 95% confidence intervals with respect to $\tilde\Delta_H$; Cov = coverage of 95% confidence intervals with respect to $\Delta_H$

Estimate Bias Bias~ ESE ASE Cov Cov~

Setting 1 6.32 0.07 0.05 1.82 1.79 0.96 0.96
Setting 2 12.53 0.05 0.07 5.39 5.22 0.94 0.94
Setting 3 12.52 0.05 0.07 5.39 5.22 0.94 0.94
Setting 4 14.72 0 0.05 4.12 4.13 0.96 0.95
Setting 5 5.75 0.03 0.04 1.38 1.4 0.95 0.95
Setting 6 12.97 0.01 0.02 1.05 1.27 0.98 0.98
Setting 7 −0.03 0.03 0.16 1.31 1.25 0.94 0.94
Setting 8 −0.03 0.03 0.16 1.31 1.26 0.94 0.94

Table 2 shows results from testing using Δ^,Δ^P, and Δ^H. In setting 1 where there is heterogeneity and the distribution of W in the current study is different from the prior study, results show that Δ^P overestimates the treatment effect and thus, does not retain the lower boundedness property. In contrast, our approach using Δ^H does not overestimate the treatment effect. The power using Δ^H is smaller than that using Δ^, but this is expected since the data generation in this setting is such that the population in the current study is composed largely of individuals where the surrogate marker is not very strong. In setting 2 where there is again heterogeneity and the distribution of W in the current study is different from the prior study, results show that both Δ^P and Δ^H are less than Δ^, but Δ^H is much closer to Δ^ and has power equivalent to that using Δ^. This, again, is what was expected since the data generation in this setting is such that the population in the current study is composed largely of individuals where the surrogate marker is strong. In setting 3, which is similar to setting 2 but we have made the data more extreme with the surrogate being useless for those with W<5, results show a larger departure in Δ^P from Δ^, and a larger decrease in power for Δ^P compared to Δ^H. In setting 4 where there is heterogeneity but the distribution of W in both the prior study and the current study is the same, we see similar point estimates for Δ^P and Δ^H but a slightly higher standard error and lower power for Δ^H. This indicates that in some settings, we may pay a price in terms of power and efficiency when we use the approach that accounts for heterogeneity when it is not necessary. In setting 5, where there is no heterogeneity, we see similar performance for Δ^P and Δ^H. 
In setting 6, where we have a very large treatment effect on the primary outcome, there is heterogeneity and the distribution of W in the current study is different from the prior study, results show that, as expected, Δ^P overestimates the treatment effect and does not retain the lower boundedness property, as in setting 1. In settings 7 and 8, where there is no treatment effect, results show that all three testing procedures perform well with an estimated treatment effect close to zero and Type 1 error rate close to 0.05. We additionally examined the efficiency gain comparing our proposed estimator to the simple estimator in (4); indeed, we did observe efficiency gains using our proposed estimator, quantified by the ratio of the estimated standard error using our proposed estimate to that using the simple estimate, that ranged from 0.79-0.98 across settings.

TABLE 2.

Testing results from the simulation study comparing testing results based on the three different testing quantities: Δ^ (using the primary outcome, gold standard) vs. Δ^P (using the surrogate marker, ignoring heterogeneity) vs. Δ^H (using the surrogate marker, accounting for heterogeneity); ESE = empirical standard error, ASE = average standard error (average of the square root of the closed form variance estimate), Effect size = estimate divided by the estimated standard error (i.e., square root of the closed form variance estimate), Power/Type 1 error = proportion of replications for which the test rejects the null i.e., p-value of the test is <0.05

Setting 1

Estimate ESE ASE Effect size Power

Δ 14.10 1.64 1.65 8.55 1.00
ΔP 14.53 3.61 3.65 3.99 0.98
ΔH 6.32 1.82 1.79 3.62 0.95

Setting 2

Estimate ESE ASE Effect size Power

Δ 13.34 5.54 5.42 2.47 0.69
ΔP 7.64 3.38 3.31 2.31 0.64
ΔH 12.53 5.39 5.22 2.39 0.67

Setting 3

Estimate ESE ASE Effect size Power

Δ 13.34 5.54 5.42 2.47 0.69
ΔP 6.00 2.81 2.76 2.18 0.58
ΔH 12.52 5.39 5.22 2.39 0.67

Setting 4

Estimate ESE ASE Effect size Power

Δ 19.12 5.17 5.20 3.68 0.96
ΔP 14.64 3.66 3.66 4.01 0.98
ΔH 14.72 4.12 4.13 3.56 0.95

Setting 5

Estimate ESE ASE Effect size Power

Δ 13.90 1.64 1.65 8.43 1.00
ΔP 5.77 1.38 1.38 4.18 0.99
ΔH 5.75 1.38 1.40 4.09 0.99

Setting 6

Estimate ESE ASE Effect size Power

Δ 33.70 1.61 1.60 21.08 1.00
ΔP 39.12 3.51 3.50 11.18 1.00
ΔH 12.97 1.05 1.27 10.23 1.00

Setting 7

Estimate ESE ASE Effect size Type 1 error

Δ −0.05 1.39 1.35 −0.04 0.06
ΔP −0.03 1.31 1.27 −0.02 0.06
ΔH −0.03 1.31 1.25 −0.02 0.06

Setting 8

Estimate ESE ASE Effect size Type 1 error

Δ −0.05 1.37 1.33 −0.04 0.06
ΔP −0.03 1.31 1.27 −0.02 0.06
ΔH −0.03 1.31 1.26 −0.02 0.06

In summary, results from this simulation study show 1) good finite sample performance of our estimation and inference procedures for $\Delta_H$, 2) a potential slight loss in power when using the proposed $\hat\Delta_H$ compared to $\hat\Delta_P$ when accounting for heterogeneity is not needed, and 3) a potential for inaccurate conclusions and/or loss in power when $\hat\Delta_P$ is used instead of the proposed $\hat\Delta_H$ when accounting for heterogeneity is needed.

5 |. APPLICATION

We apply our proposed approach to test for a treatment effect based on a heterogeneous surrogate using data from two distinct AIDS clinical trials, the AIDS Clinical Trials Group (ACTG) 320 Study and the ACTG 193A Study.19,20 These data are publicly available upon request from the AIDS Clinical Trial Group21. We consider the ACTG 320 Study as our prior study and the ACTG 193A Study as our current study. The ACTG 320 study was conducted among HIV-infected patients with a CD4 cell count of 200 or less per cubic millimeter and was a randomized, double-blind trial that compared a two-drug regimen (two nucleoside reverse transcriptase inhibitors [NRTI]) with a three-drug regimen (two NRTIs plus indinavir). There were a total of 830 participants, with 412 in the two-drug regimen group and 418 in the three-drug regimen group. The ACTG 193A study was a randomized, double-blind trial conducted among HIV-infected patients with a CD4 cell count of 50 or less per cubic millimeter. We focus on the comparison of a two-drug regimen (NRTIs) with a three-drug regimen (two NRTIs plus nevirapine). There were a total of 657 participants, with 327 in the two-drug regimen group and 330 in the three-drug regimen group. Our primary outcome Y is the change in plasma HIV-1 RNA from baseline to 24 weeks; our surrogate marker S is change in CD4 cell count from baseline to 24 weeks, as CD4 is relatively less expensive to measure compared to RNA.22 Both Y and S are available in ACTG 320 while only S is available in the publicly available data of ACTG 193A. Previous work has demonstrated significant heterogeneity in the utility of S with respect to W , baseline CD4 count, with the surrogate strength being stronger among those with a lower baseline CD4 count and weaker among those with a higher baseline CD4 count12 as shown in Figure 1. 
We aim to use our proposed method to test for a treatment effect on RNA using CD4 count as a surrogate marker, accounting for the known heterogeneity in the utility of the surrogate which was demonstrated in the prior study.

FIGURE 1.


Estimated proportion of the treatment effect on the primary outcome (change in RNA) explained by the treatment effect on the surrogate marker (change in CD4), denoted as RS, as a function of baseline CD4

In Figure 2 we show the distribution of the baseline covariate, baseline CD4, in the prior study compared to the current study. Clearly, the current study is composed of a different participant population with lower CD4 counts due to the study eligibility criteria. In Figure 1, we also see that the surrogate is strongest in this subgroup. Using our proposed approach, we obtain a treatment effect estimate of Δ^H=0.10 (standard error [SE]=0.03) with a p-value <0.001. Note that since lower plasma HIV-1 RNA is better, a negative change in RNA indicates a beneficial treatment effect for the three-drug regimen. Using the approach that does not account for heterogeneity, we obtain a treatment effect estimate closer to the null, but still significant: Δ^P=0.07(SE=0.02),p<0.001. That is, while the overall conclusion regarding the treatment effect based on the surrogate would be significant using either test, our proposed test provides a treatment effect point estimate that is larger in magnitude. This is expected since the surrogate strength is greater in this subgroup that makes up the current study, and our proposed approach takes advantage of that information.

FIGURE 2.


Distribution of baseline CD4 in current study vs. prior study

6 |. DISCUSSION

For settings where it is known that the strength of a surrogate marker varies by a certain baseline characteristic, we have proposed an approach and estimation procedures to appropriately test for a treatment effect using only the surrogate marker, accounting for this known heterogeneity. We demonstrated good finite sample performance of our estimation procedure and showed that our proposed testing procedure can outperform an approach that does not account for heterogeneity. An R package implementing the methods proposed here, named hettest, is available at https://github.com/laylaparast/hettest.

While we largely focus, specifically in the numerical studies, on settings where the distribution of $W$ in the current study differs from that in the prior study, it is still possible for a test based on $\hat\Delta_P$, i.e., ignoring heterogeneity, to provide inaccurate results about the treatment effect when there is heterogeneity in the utility of the surrogate and $W$ is distributed the same in the two studies; we provide an example in Appendix E.

In the presence of heterogeneity, both the treatment effect and the utility of the surrogate marker may depend on $W$. While we focus exclusively on the average treatment effect in this paper, it may be of interest to test for a treatment effect based on alternative summaries that account for such heterogeneity. For example, one may define $\Delta_w = E(Y^{(1)} \mid W^{(1)} = w) - E(Y^{(0)} \mid W^{(0)} = w)$ and the subgroup specific earlier treatment effect $\Delta_H(w) = \int \mu_0^p(s,w)\,dF^{(1)}(s \mid w) - \int \mu_0^p(s,w)\,dF^{(0)}(s \mid w)$. Then we may test for a treatment effect based on $S$ by examining a functional of $\Delta_H(w)$ such as $\sup_w \Delta_H(w)$ or $\int \Delta_H(w)\,dw$, the area under the curve produced by $\Delta_H(w)$. Such alternative summaries of the treatment effect across a baseline covariate, $W$, are not unique to the surrogate marker setting as they have been extensively discussed in the general heterogeneous treatment effect literature.23,24 However, these alternative summaries may also prove useful in the heterogeneous surrogate setting and may offer new insights over simply looking at the average treatment effect.
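
To make the two functionals mentioned above concrete, the following Python sketch computes the supremum and the area under the curve from values of the subgroup-specific earlier treatment effect evaluated on a grid of covariate values. It assumes that such a curve has already been estimated by some other means; the function name and inputs are hypothetical, and the area is approximated with the trapezoidal rule.

```python
import numpy as np

def summarize_delta_h_curve(w_grid, delta_h_values):
    """Return (sup, area) of a curve given on an increasing grid of w values."""
    w = np.asarray(w_grid, dtype=float)
    d = np.asarray(delta_h_values, dtype=float)
    if np.any(np.diff(w) <= 0):
        raise ValueError("w_grid must be strictly increasing")
    sup_value = d.max()                                   # maximum over the grid
    area = np.sum(0.5 * (d[1:] + d[:-1]) * np.diff(w))    # trapezoidal rule
    return sup_value, area
```

Both summaries are only as fine as the grid: the maximum over grid points can miss a peak between points, so a dense grid would be preferable in practice.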

Importantly, we require Assumptions (C1) – (C4) and in practice, they may be violated. Specifically, if the existing heterogeneity is such that the surrogate is not strong or, worse, the treatment effects on the surrogate marker and primary endpoint are in different directions for some $w$, the surrogate should not be used as a replacement of the primary outcome for all individuals in a future study. Instead, one may consider using the surrogate as a replacement only among those with a $w$ where Assumptions (C1) – (C4) hold. To achieve this, one could first identify a region of interest where the surrogacy is sufficiently strong, e.g., $\Omega_w$ such that the conditional average treatment effect on the primary endpoint satisfies $\Delta(w) \geq \delta_0 > 0$ and the proportion explained by the surrogate for $W = w$, $R_S(w) = \Delta_H(w)/\Delta(w)$, is between 0.50 and 1.0, and then apply the proposed testing procedure that replaces $Y$ with $S$ for testing the average treatment effect in the subpopulation $\Omega_w$. If one is interested in studying the average treatment effect in the entire study population, one may combine the proposed test statistic with a new but simple test statistic measuring the strength of the treatment effect based on actual primary endpoints $Y$ for patients in the complement of $\Omega_w$. Such a hybrid approach has the potential to reduce costs if $S$ is less costly to measure than $Y$ and/or reduce the follow-up time needed for those in $\Omega_w$ if $S$ is measured earlier than $Y$. Though not exactly within this context, previous work has explored the potential for auxiliary information (including but not limited to surrogate markers) to improve efficiency when testing for a treatment or intervention effect.25,26 While this is beyond the scope of this paper, further work on this topic within the framework of a heterogeneous surrogate is warranted.
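
The region-selection idea described above can be sketched as follows. The Python function below flags grid values of $w$ at which the conditional treatment effect is at least a positive threshold and the proportion explained lies between 0.50 and 1.0. It assumes that estimates of the conditional treatment effect and of the subgroup-specific earlier treatment effect are already available on a common grid; the function name, argument names, and default cut-offs are hypothetical.

```python
import numpy as np

def surrogate_region(delta_w, delta_h_w, delta_min, r_low=0.5, r_high=1.0):
    """Boolean mask of grid points where the surrogate is deemed sufficiently strong."""
    if delta_min <= 0:
        raise ValueError("delta_min must be positive")
    delta = np.asarray(delta_w, dtype=float)
    delta_h = np.asarray(delta_h_w, dtype=float)
    large_effect = delta >= delta_min            # conditional effect is at least delta_min
    ratio = np.zeros_like(delta)
    # proportion explained, computed only where the denominator is safely positive
    ratio[large_effect] = delta_h[large_effect] / delta[large_effect]
    return large_effect & (ratio >= r_low) & (ratio <= r_high)
```

The returned mask could then be used to restrict the surrogate-based test to participants whose covariate value falls in the selected region, with the primary outcome used for the remainder.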

Our proposed approach has some limitations. First, if the current study includes participants with $w$ values outside the observed distribution in the prior study, our approach will not be able to obtain $\hat\mu_0^p(s,w)$ for that $w$ without extrapolation. In such a case, when there is observed heterogeneity in the prior study, use of the surrogate marker to test for a treatment effect in the current study should likely be limited to those with $w$ contained in the prior study. Second, given our use of kernel smoothing, we require a relatively large sample size. Robust nonparametric methods for surrogate markers are lacking in general for small sample size settings; future work in this area is needed. Lastly, we require several assumptions, outlined in Section 2.2, which are generally untestable though they may be empirically explored using the observed data. These assumptions are needed for identifiability, to ensure the lower-boundedness property of $\Delta_H$ (i.e., $\Delta_H \leq \Delta$), and to guard against the surrogate paradox, which occurs when the surrogate and outcome are positively associated, the treatment has a positive effect on the surrogate, but the treatment in fact has a negative effect on the outcome.15 The surrogate paradox is especially of concern here as our primary goal is to make a conclusion about the treatment effect on the primary outcome based on information about the surrogate marker. While these assumptions are strong, they are more likely to hold than the parallel assumptions required for $\Delta_P$ to be valid,7 owing to the additional conditioning on $W$. Further work on methods that allow for more relaxed assumptions and/or that allow one to assess sensitivity to violations of these assumptions would be useful.27

Supplementary Material

supplementary

ACKNOWLEDGEMENTS

Support for this research was provided by National Institutes of Health grant R01DK11835. We are grateful to the AIDS Clinical Trial Group for providing the AIDS clinical trial data.

APPENDIX

APPENDIX A

Discrete Example

Let $Y$ denote the primary outcome and $S$ denote the surrogate marker. We use potential outcomes notation where each person has potential outcomes $\{Y^{(1)}, Y^{(0)}, S^{(1)}, S^{(0)}\}$, where $Y^{(g)}$ and $S^{(g)}$ are the outcome and surrogate when the patient receives treatment $g$. Our main quantity of interest is the treatment effect on the primary outcome quantified as $\Delta \equiv E(Y^{(1)} - Y^{(0)}) = E(Y^{(1)}) - E(Y^{(0)})$. The earlier treatment effect incorporating $S$ information is defined in the main text as

$$\Delta_P = \int \mu_0^p(s)\,dF^{(1)}(s) - \int \mu_0^p(s)\,dF^{(0)}(s) \qquad (1)$$

where $\mu_0^p(s) \equiv E(Y^{(0p)} \mid S^{(0p)} = s)$. In this example, we will have heterogeneity in the utility of the surrogate with respect to gender. Consider our prior study, which we refer to as Study A in this example and which is shown in Figure 1. The Study A sample is 50% female and 50% male. For all individuals, $(S^{(1)}, S^{(0)})$ are independent of gender, and $\{E(S^{(1)}), E(S^{(0)})\} = (10, 5)$. For females, $E(Y^{(1)} \mid S^{(1)} = s) = 3 + 5s$ and $E(Y^{(0)} \mid S^{(0)} = s) = 1 + 3s$. It can be shown that for females, $\Delta = 53 - 16 = 37$ and $\Delta_P = 15$. The proportion of the treatment effect on the primary outcome that is explained by the surrogate among females is thus $15/37 = 41\%$, which would not be considered strong surrogacy. For males, $E(Y^{(1)} \mid S^{(1)} = s) = 15s$ and $E(Y^{(0)} \mid S^{(0)} = s) = 14.8s$. It can be shown that for males, $(\Delta, \Delta_P) = (76, 74)$ and the proportion explained by the surrogate marker is 97%, representing strong surrogacy.

To calculate $\Delta_P$ for a future study, let’s consider the conditional mean that is central to this calculation, $\mu_0^p(s) = E(Y^{(0p)} \mid S^{(0p)} = s)$, where the superscript $p$ indicates that this is referring to the prior study, i.e., Study A. In this example, this would be $\mu_0^p(s) = 0.5 \times (1 + 3s) + 0.5 \times 14.8s = 8.9s + 0.5$. Now assume our current study is Study B shown in Figure 1, which is 95% female and 5% male. Importantly, the joint distributions of $\{Y^{(1)}, Y^{(0)}, S^{(1)}, S^{(0)}\}$ in males and females remain as described above; the only difference is the distribution of gender. The treatment effect $\Delta$ in this new study is $0.95 \times 37 + 0.05 \times 76 = 38.95$. If one were to calculate $\Delta_P$ not accounting for this known heterogeneity in the utility of the surrogate, the quantity obtained would be $\Delta_P = (8.9 \times 10 + 0.5) - (8.9 \times 5 + 0.5) = 44.5$, recalling that $E(S^{(1)}) = 10$ and $E(S^{(0)}) = 5$ for all individuals in both studies. However, using our proposed approach which does account for heterogeneity, we use $\Delta_H$ as the earlier treatment effect, defined in the main text as:

$$\Delta_H = \int \mu_0^p(s,w)\,dF^{(1)}(s,w) - \int \mu_0^p(s,w)\,dF^{(0)}(s,w).$$

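Thus, $\Delta_H = 95\% \times (1 + 3 \times 10) + 5\% \times (14.8 \times 10) - 95\% \times (1 + 3 \times 5) - 5\% \times (14.8 \times 5) = 17.95$. Therefore $\Delta_H < \Delta < \Delta_P$, and $\Delta_P$ no longer retains the property of providing a lower bound on the treatment effect on $Y$.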
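
The arithmetic of this discrete example can be reproduced with a short Python sketch. Because every conditional mean in the example is linear in the surrogate, expectations can be obtained by plugging in the surrogate means of 10 and 5. The function name `discrete_example` and its arguments are hypothetical and serve only to illustrate the calculation.

```python
def discrete_example(frac_female_current, frac_female_prior=0.5, es1=10.0, es0=5.0):
    """Return (Delta, Delta_P, Delta_H) for a current study with the given female fraction."""
    # (intercept, slope) of E[Y(g) | S(g) = s] within each gender
    y1 = {"female": (3.0, 5.0), "male": (0.0, 15.0)}
    y0 = {"female": (1.0, 3.0), "male": (0.0, 14.8)}
    current = {"female": frac_female_current, "male": 1.0 - frac_female_current}
    prior = {"female": frac_female_prior, "male": 1.0 - frac_female_prior}

    def line(coef, s):
        return coef[0] + coef[1] * s

    # true treatment effect on Y in the current study
    delta = sum(current[k] * (line(y1[k], es1) - line(y0[k], es0)) for k in current)
    # pooled prior-study control regression, ignoring gender (S is independent of gender)
    pooled = tuple(sum(prior[k] * y0[k][j] for k in prior) for j in (0, 1))
    delta_p = line(pooled, es1) - line(pooled, es0)
    # gender-specific control regressions, weighted by the current-study gender mix
    delta_h = sum(current[k] * (line(y0[k], es1) - line(y0[k], es0)) for k in current)
    return delta, delta_p, delta_h

print(discrete_example(0.95))  # Study B: 95% female
```

For a study that is 95% female, the three returned values agree with 38.95, 44.5, and 17.95 up to floating-point rounding. Since the pooled regression comes from the prior study and the surrogate means do not depend on gender, the value ignoring heterogeneity does not change with the gender mix of the current study.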

Now we consider a study, labeled Study C in Figure 1, which is 95% male and 5% female. Using similar calculations, we can show that $\Delta = 74.05$, $\Delta_P = 44.5$, and $\Delta_H = 71.05$. Thus, in this case, $\Delta_H$ provides a better lower bound for $\Delta$, and the test based on $\Delta_H$ is expected to be more powerful than that based on $\Delta_P$. The discrete case, as illustrated in this example, is relatively straightforward in terms of how to go about calculating the needed quantities separately by group and appropriately accounting for the different distribution in the new study. The continuous baseline covariate case, however, is more complex, and Appendix E presents an example such that even if the prior and current studies have the same distribution for covariates, $\Delta_P$ may still fail to be a valid lower bound for $\Delta$.

APPENDIX B

As noted in the main text, Assumptions (C1) – (C3) together guarantee that $E(Y^{(1)} \mid W = w) \geq E(Y^{(0)} \mid W = w)$ for all $w$ in the support of $W$. This result is due to the derivation:

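$$
\begin{aligned}
\Delta(w) &= E(Y^{(1)} \mid W = w) - E(Y^{(0)} \mid W = w) \\
&= \int_s E(Y^{(1)} \mid S^{(1)} = s, W = w)\,dF^{(1)}(s \mid w) - \int_s E(Y^{(0)} \mid S^{(0)} = s, W = w)\,dF^{(0)}(s \mid w) \\
&\geq \int_s E(Y^{(0)} \mid S^{(0)} = s, W = w)\,dF^{(1)}(s \mid w) - \int_s E(Y^{(0)} \mid S^{(0)} = s, W = w)\,dF^{(0)}(s \mid w) \\
&= \int_s E(Y^{(0)} \mid S^{(0)} = s, W = w)\,d\{F^{(1)}(s \mid w) - F^{(0)}(s \mid w)\} \\
&= \int_s \{F^{(0)}(s \mid w) - F^{(1)}(s \mid w)\}\,\frac{\partial E(Y^{(0)} \mid S^{(0)} = s, W = w)}{\partial s}\,ds \geq 0,
\end{aligned}
$$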

where $F^{(g)}(s \mid w) = P(S^{(g)} \leq s \mid W = w)$, $g = 0, 1$. That is, while treatment effect heterogeneity is allowed, the directions of the conditional average treatment effect among subgroups of patients with $W = w$ need to be consistent. One important implication is that under the null $H_0: \Delta = E\{\Delta(W)\} = 0$, i.e., no average treatment effect, the conditional average treatment effect $\Delta(w) = 0$ for all $w$ as well. Furthermore, from the derivation, it is clear that $\Delta(w) = 0$ if and only if both

  1. $F^{(1)}(s \mid w) = F^{(0)}(s \mid w)$, i.e., $P(S^{(1)} > s \mid W = w) = P(S^{(0)} > s \mid W = w)$, and

  2. $E(Y^{(1)} \mid S^{(1)} = s, W = w) = E(Y^{(0)} \mid S^{(0)} = s, W = w)$.

Specifically, Δ(w)=0 implies that there is no treatment effect on the distribution of the surrogate marker in the subgroup of patients with W=w. In summary, under Assumptions (C1)-(C3)

$$\Delta = 0 \;\Rightarrow\; \Delta(w) = 0 \;\Rightarrow\; S^{(1)} \mid W = w \;\sim\; S^{(0)} \mid W = w.$$

This relationship allows us to test the common null $H_0: \Delta = 0$ via testing a seemingly more restrictive null that $S^{(1)} \mid W = w \sim S^{(0)} \mid W = w$ for all $w$ in the support of $W$.

For (C2) and (C3), if the primary outcome or surrogate are such that lower values are “better”, one can simply define the outcome/surrogate as $-X$, where $X$ is the initial value.

Assumptions (C5) – (C6) are not required for the validity of the testing procedure proposed in the next section, in that the p-value under the null follows a uniform distribution even without them, but they allow us to estimate a lower bound of the average treatment effect, $\Delta$, and construct the corresponding test statistic.

Under the following additional assumptions:

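  • (C7) $Y^{(1)} \perp S^{(0)} \mid S^{(1)}, W$ and $Y^{(0)} \perp S^{(1)} \mid S^{(0)}, W$;

  • (C8) $Y^{(1p)} \perp S^{(0p)} \mid S^{(1p)}, W^p$ and $Y^{(0p)} \perp S^{(1p)} \mid S^{(0p)}, W^p$,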

the treatment effect on the surrogate marker defined in Section 2.3 and on the primary outcome can be interpreted within a causal framework: the proposed test statistic is an estimate of the portion of the treatment effect on the primary outcome attributable to the treatment effect on the surrogate marker. Otherwise, the proposed treatment effect on the surrogate marker can always serve as a lower bound for the average treatment effect on Y and can be used in practice without assuming them.

To summarize, Assumptions (C1) – (C4) are needed for the validity of the proposed testing procedure, Assumptions (C5) – (C6) allow us to interpret the test statistic based on the surrogate marker and baseline covariate only as a “conservative” estimator (or a lower bound) of the average treatment effect on the primary outcome, and a causal interpretation of the lower bound is possible under the additional Assumptions (C7) – (C8).

APPENDIX C

To estimate $\Delta$ using the primary outcome (gold standard), we use $\hat\Delta = n_1^{-1}\sum_{i=1}^{n_1} Y_{1i} - n_0^{-1}\sum_{i=1}^{n_0} Y_{0i}$ and conduct a t-test to test $H_0: \Delta = 0$.

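To estimate $\tilde\Delta_P$, we use a previously proposed nonparametric estimation approach,7 estimating $\mu_0^p(s)$ as

$$\hat\mu_0^p(s) = \frac{\sum_{i=1}^{n_0^p} K_{h_4}(S_{0i}^p - s)\,Y_{0i}^p}{\sum_{i=1}^{n_0^p} K_{h_4}(S_{0i}^p - s)},$$

and then computing $\hat\Delta_P$ as

$$\hat\Delta_P = n_1^{-1}\sum_{i=1}^{n_1}\hat\mu_0^p(S_{1i}) - n_0^{-1}\sum_{i=1}^{n_0}\hat\mu_0^p(S_{0i}).$$

Note that this estimate only uses $S$ data from the current study (no $Y$ data from the current study) and $(S, Y)$ data from the previous study in group $Z = 0$ only. To obtain an estimate for the standard error of $\hat\Delta_P$, $\sigma_P$, we simply take the empirical standard deviation of the transformed surrogate, i.e., let $\tilde Y_{gi} = \hat\mu_0^p(S_{gi})$, and then $\hat\sigma_P = \sqrt{\widehat{\mathrm{var}}(\tilde Y_{1i})/n_1 + \widehat{\mathrm{var}}(\tilde Y_{0i})/n_0}$, where $\widehat{\mathrm{var}}$ indicates the empirical variance. This alternative testing procedure would then use the test statistic $Z_P = \hat\Delta_P/\hat\sigma_P$ and reject the null hypothesis when $|Z_P| > \Phi^{-1}(1 - \alpha/2)$.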
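
The steps above can be sketched in Python as follows. This is a minimal illustration that assumes a Gaussian kernel, which is a choice made for this sketch only; the function names `mu0p_hat` and `surrogate_z_test` are hypothetical and are not part of the authors' software. Evaluation points far outside the range of the prior-study surrogate values are not handled and would produce unreliable estimates.

```python
import numpy as np
from statistics import NormalDist

def mu0p_hat(s_eval, s0_prior, y0_prior, h):
    """Kernel-weighted average of prior-study control outcomes at each value in s_eval."""
    s_eval = np.atleast_1d(np.asarray(s_eval, dtype=float))
    s0_prior = np.asarray(s0_prior, dtype=float)
    y0_prior = np.asarray(y0_prior, dtype=float)
    # Gaussian kernel (assumed); rows index evaluation points, columns prior observations
    u = (s0_prior[None, :] - s_eval[:, None]) / h
    weights = np.exp(-0.5 * u**2)  # the constant 1 / (h * sqrt(2 * pi)) cancels in the ratio
    return (weights @ y0_prior) / weights.sum(axis=1)

def surrogate_z_test(s1, s0, s0_prior, y0_prior, h, alpha=0.05):
    """Return (estimate, standard error, z statistic, reject) ignoring heterogeneity."""
    y1_tilde = mu0p_hat(s1, s0_prior, y0_prior, h)  # transformed surrogate, treated arm
    y0_tilde = mu0p_hat(s0, s0_prior, y0_prior, h)  # transformed surrogate, control arm
    delta_p = y1_tilde.mean() - y0_tilde.mean()
    se = np.sqrt(y1_tilde.var(ddof=1) / y1_tilde.size + y0_tilde.var(ddof=1) / y0_tilde.size)
    z = delta_p / se
    reject = abs(z) > NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal test
    return delta_p, se, z, reject
```

Only surrogate values from the current study and control-arm data from the prior study enter the calculation, mirroring the description above.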

Importantly, one may also consider simply using the surrogate markers measured in the current study, defining $\Delta_M = E(S^{(1)}) - E(S^{(0)})$ and conducting a t-test of $H_0^M: \Delta_M = 0$. The disadvantage of this approach is that there is no way to relate $\Delta_M$ and $\Delta$, i.e., the estimate of $\Delta_M$ does not give any helpful information about the magnitude of $\Delta$. In addition, this approach does not take advantage of information from the previous study, nor does it account for heterogeneity in the utility of the surrogate marker. For these reasons, we do not compare our approach to this test.

APPENDIX D

Our proposed estimator for Δ~H is

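$$\hat\Delta_H = \frac{1}{n}\left\{\sum_{i=1}^{n_0}\left[\hat m_1(W_{0i};\hat\mu_0^p) - \hat m_0(W_{0i};\hat\mu_0^p)\right] + \sum_{i=1}^{n_1}\left[\hat m_1(W_{1i};\hat\mu_0^p) - \hat m_0(W_{1i};\hat\mu_0^p)\right]\right\}.$$

Let $\tilde\mu_g = E\{\hat\mu_0^p(S^{(g)}, W) \mid \hat\mu_0^p\}$, $g = 0, 1$. It is obvious that $\tilde\Delta_H = \tilde\mu_1 - \tilde\mu_0$. Also, let $m_g(w;\hat\mu_0^p) = E\{\hat\mu_0^p(S^{(g)}, W) \mid W = w\}$.

In this section, we only consider the randomness in the current study, i.e., the probability measure is conditional on $\hat\mu_0^p(\cdot,\cdot)$. Now consider the centered term

$$\frac{1}{n}\sum_{g=0}^{1}\sum_{j=1}^{n_g}\hat m_1(W_{gj};\hat\mu_0^p) - \tilde\mu_1 = \frac{1}{n}\sum_{g=0}^{1}\sum_{j=1}^{n_g}\left[\frac{n_1^{-1}\sum_{i=1}^{n_1}K_h(W_{1i} - W_{gj})\,\tilde S_{1i}}{\hat f_1(W_{gj})}\right] - \tilde\mu_1,$$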

which is

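$$
\begin{aligned}
&\frac{1}{n n_1}\sum_{j=1}^{n_0}\sum_{i=1}^{n_1}\frac{K_h(W_{1i} - W_{0j})\,\tilde S_{1i}}{\hat f_1(W_{0j})} + \frac{1}{n}\sum_{i=1}^{n_1}\left[\frac{1}{n_1}\sum_{j=1}^{n_1}\frac{K_h(W_{1i} - W_{1j})}{\hat f_1(W_{1j})}\right]\tilde S_{1i} - \tilde\mu_1 \\
={}& \frac{1}{n n_1}\sum_{j=1}^{n_0}\sum_{i=1}^{n_1}\frac{K_h(W_{1i} - W_{0j})\,\tilde S_{1i}}{\hat f_1(W_{0j})} + \frac{1}{n}\sum_{i=1}^{n_1}\left[\frac{1}{n_1}\sum_{j=1}^{n_1}K_h(W_{1i} - W_{1j})\right]\frac{\tilde S_{1i}}{\hat f_1(W_{1i})} - \tilde\mu_1 + O_p(h^2) \\
={}& \frac{n_0}{n n_1}\sum_{i=1}^{n_1}\frac{\hat f_0(W_{1i})}{\hat f_1(W_{1i})}\tilde S_{1i} + \frac{1}{n}\sum_{i=1}^{n_1}\tilde S_{1i} - \tilde\mu_1 + O_p(h^2) \\
={}& \frac{1}{n_1}\sum_{i=1}^{n_1}(\tilde S_{1i} - \tilde\mu_1) + \frac{n_0}{n n_1}\sum_{i=1}^{n_1}\frac{\hat f_0(W_{1i}) - \hat f_1(W_{1i})}{\hat f_1(W_{1i})}\tilde S_{1i} + O_p(h^2) \\
={}& \frac{1}{n_1}\sum_{i=1}^{n_1}(\tilde S_{1i} - \tilde\mu_1) + \frac{n_0}{n n_1}\sum_{i=1}^{n_1}\left[\frac{1}{n_0}\sum_{j=1}^{n_0}K_h(W_{0j} - W_{1i}) - \frac{1}{n_1}\sum_{j=1}^{n_1}K_h(W_{1j} - W_{1i})\right]\frac{\tilde S_{1i}}{f_1(W_{1i})} + O_p(h^2) \\
={}& \frac{1}{n_1}\sum_{i=1}^{n_1}(\tilde S_{1i} - \tilde\mu_1) + \pi_0\left[\frac{1}{n_0}\sum_{i=1}^{n_0}\hat m_1(W_{0i};\hat\mu_0^p) - \frac{1}{n_1}\sum_{i=1}^{n_1}\hat m_1(W_{1i};\hat\mu_0^p)\right] + O_p(h^2) \\
={}& \frac{1}{n_1}\sum_{i=1}^{n_1}(\tilde S_{1i} - \tilde\mu_1) + \pi_0\left[\frac{1}{n_0}\sum_{i=1}^{n_0}m_1(W_{0i};\hat\mu_0^p) - \frac{1}{n_1}\sum_{i=1}^{n_1}m_1(W_{1i};\hat\mu_0^p)\right] \\
& + \pi_0\left[\frac{1}{n_0}\sum_{i=1}^{n_0}\left(\hat m_1(W_{0i};\hat\mu_0^p) - m_1(W_{0i};\hat\mu_0^p)\right) - \frac{1}{n_1}\sum_{i=1}^{n_1}\left(\hat m_1(W_{1i};\hat\mu_0^p) - m_1(W_{1i};\hat\mu_0^p)\right)\right] + O_p(h^2),
\end{aligned}
$$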

where πg=ng/n and f^1(w) is the nonparametric estimator for the density function of W based on observations in treatment group 1. Now, consider the expansion

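$$\hat m_1(w;\hat\mu_0^p) - m_1(w;\hat\mu_0^p) = \frac{1}{n_1}\sum_{i=1}^{n_1}K_h(W_{1i} - w)\{\tilde S_{1i} - m_1(W_{1i};\hat\mu_0^p)\} + O_p\left(h^2 + \frac{\log(n_1)}{n_1 h}\right)$$

uniformly in $w$. Therefore,

$$
\begin{aligned}
\frac{1}{n_0}\sum_{j=1}^{n_0}\{\hat m_1(W_{0j};\hat\mu_0^p) - m_1(W_{0j};\hat\mu_0^p)\}
&= \frac{1}{n_1 n_0}\sum_{j=1}^{n_0}\sum_{i=1}^{n_1}K_h(W_{1i} - W_{0j})\{\tilde S_{1i} - m_1(W_{1i};\hat\mu_0^p)\} + O_p\left(h^2 + \frac{\log(n_1)}{n_1 h}\right) \\
&= \frac{1}{n_1}\sum_{i=1}^{n_1}\hat f_0(W_{1i})\{\tilde S_{1i} - m_1(W_{1i};\hat\mu_0^p)\} + O_p\left(h^2 + \frac{\log(n_1)}{n_1 h}\right) \\
&= \frac{1}{n_1}\sum_{i=1}^{n_1} f_0(W_{1i})\{\tilde S_{1i} - m_1(W_{1i};\hat\mu_0^p)\} + O_p\left(h^2 + \frac{\log(n_1)}{n_1 h}\right) + o_p\left(\frac{1}{\sqrt{n_1}}\right).
\end{aligned}
$$

Similarly,

$$\frac{1}{n_1}\sum_{i=1}^{n_1}\left(\hat m_1(W_{1i};\hat\mu_0^p) - m_1(W_{1i};\hat\mu_0^p)\right) = \frac{1}{n_1}\sum_{i=1}^{n_1} f_0(W_{1i})\{\tilde S_{1i} - m_1(W_{1i};\hat\mu_0^p)\} + O_p\left(h^2 + \frac{\log(n_1)}{n_1 h}\right) + o_p\left(\frac{1}{\sqrt{n_0}}\right),$$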

and

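$$
\begin{aligned}
&\sqrt{n}\left[\frac{1}{n_0}\sum_{i=1}^{n_0}\left(\hat m_1(W_{0i};\hat\mu_0^p) - m_1(W_{0i};\hat\mu_0^p)\right) - \frac{1}{n_1}\sum_{i=1}^{n_1}\left(\hat m_1(W_{1i};\hat\mu_0^p) - m_1(W_{1i};\hat\mu_0^p)\right)\right] && (2) \\
&\quad = O_p\left(\sqrt{n_1}\,h^2 + \frac{\log(n_1)}{\sqrt{n_1}\,h}\right) + o_p(1). && (3)
\end{aligned}
$$

Therefore, when $h = O(n_1^{-\delta})$, $\delta \in (1/4, 1/2)$, the right hand side of (3) becomes $o_p(1)$, and thus

$$\frac{1}{\sqrt{n}}\sum_{g=0}^{1}\sum_{j=1}^{n_g}\left\{\hat m_1(W_{gj};\hat\mu_0^p) - \tilde\mu_1\right\} = \frac{\sqrt{n}}{n_1}\sum_{i=1}^{n_1}(\tilde S_{1i} - \tilde\mu_1) + \pi_0\left[\frac{\sqrt{n}}{n_0}\sum_{j=1}^{n_0}m_1(W_{0j};\hat\mu_0^p) - \frac{\sqrt{n}}{n_1}\sum_{j=1}^{n_1}m_1(W_{1j};\hat\mu_0^p)\right] + o_p(1).$$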

Finally, we have

n{Δ^HΔ˜H}=nn1i=1n1(S˜1iμ˜1)+π0[nn0i=1n0m1(W0i;μ^0p)nn1i=1n1m1(W1i;μ^0p)]nn0i=1n0(S˜0iμ˜0)+π1[nn1i=1n1m0(W1i;μ^0p)nn0i=1n0m0(W0i;μ^0p)]+op(1)=nn1i=1n1(S˜1iπ0m1(W1i;μ^0p)π1m0(W1i;μ^0p)π1(μ˜1μ˜0))nn0i=1n0(S˜0iπ0m1(W0i;μ^0p)π1m0(W0i;μ^0p)π0(μ˜1μ˜0))+op(1),

which converges weakly to a mean zero Gaussian distribution with a variance of

1π1E{S˜1iπ0m1(W1i;μ^0p)π1m0(W1i;μ^0p)π1Δ˜H}2+1π0E{S˜0iπ0m1(W0i;μ^0p)π1m0(W0i;μ^0p)π0Δ˜H}2.

Therefore, the variance of $\hat\Delta_H$ can be estimated as

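$$
\hat\sigma_H^2=\frac{1}{n_1^2}\sum_{i=1}^{n_1}\left(\tilde S_{1i}-\pi_0\hat m_1(W_{1i};\hat\mu_0^p)-\pi_1\hat m_0(W_{1i};\hat\mu_0^p)-\pi_1\hat\Delta_H\right)^2
+\frac{1}{n_0^2}\sum_{i=1}^{n_0}\left(\tilde S_{0i}-\pi_0\hat m_1(W_{0i};\hat\mu_0^p)-\pi_1\hat m_0(W_{0i};\hat\mu_0^p)+\pi_0\hat\Delta_H\right)^2.
$$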
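The plug-in variance estimate can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the function names `nw_smooth` and `delta_h_and_variance`, and the form $\hat\Delta_H=n^{-1}\sum_{g}\sum_{j}\{\hat m_1(W_{gj};\hat\mu_0^p)-\hat m_0(W_{gj};\hat\mu_0^p)\}$ (inferred from the first display of this derivation) are all assumptions. The inputs `s1` and `s0` stand for the transformed surrogates $\tilde S_{1i}$ and $\tilde S_{0i}$, and the control-group residual is centered with $+\pi_0\hat\Delta_H$ so that it has mean zero.

```python
import math


def nw_smooth(w_train, s_train, w_eval, h):
    """Nadaraya-Watson estimate of E(S | W = w) at each point of w_eval.

    A Gaussian kernel is assumed here purely for illustration.
    """
    fitted = []
    for w in w_eval:
        weights = [math.exp(-0.5 * ((wi - w) / h) ** 2) for wi in w_train]
        fitted.append(sum(k * s for k, s in zip(weights, s_train)) / sum(weights))
    return fitted


def delta_h_and_variance(w1, s1, w0, s0, h):
    """Return (Delta_H hat, sigma_H^2 hat) for transformed surrogates s1, s0."""
    n1, n0 = len(w1), len(w0)
    n = n1 + n0
    pi1, pi0 = n1 / n, n0 / n

    # m1_hat and m0_hat evaluated at the covariates of both groups.
    m1_at_w1 = nw_smooth(w1, s1, w1, h)
    m1_at_w0 = nw_smooth(w1, s1, w0, h)
    m0_at_w1 = nw_smooth(w0, s0, w1, h)
    m0_at_w0 = nw_smooth(w0, s0, w0, h)

    # Assumed estimator: average of m1_hat - m0_hat over the pooled covariates.
    delta = (sum(m1_at_w1) + sum(m1_at_w0) - sum(m0_at_w1) - sum(m0_at_w0)) / n

    # Centered residuals for the treated (r1) and control (r0) groups.
    r1 = [s - pi0 * a - pi1 * b - pi1 * delta
          for s, a, b in zip(s1, m1_at_w1, m0_at_w1)]
    r0 = [s - pi0 * a - pi1 * b + pi0 * delta
          for s, a, b in zip(s0, m1_at_w0, m0_at_w0)]

    variance = sum(r * r for r in r1) / n1 ** 2 + sum(r * r for r in r0) / n0 ** 2
    return delta, variance


# Toy data: the transformed surrogate is constant within each arm.
delta_hat, var_hat = delta_h_and_variance(
    w1=[0.1, 0.5, 0.9], s1=[2.0, 2.0, 2.0],
    w0=[0.2, 0.6], s0=[0.5, 0.5], h=0.3,
)
print(delta_hat, var_hat)
```

In this degenerate toy example both residual vectors vanish, so the estimated effect equals the difference of the two constants and the variance estimate is zero up to floating-point rounding, which is a quick way to confirm that the residuals are centered correctly.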

Next, we derive the asymptotic distribution of $\sqrt{n}(\hat\Delta_H^{AUG}-\tilde\Delta_H)$. It is clear that

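$$
\begin{aligned}
\sqrt{n}(\hat\Delta_H^{AUG}-\tilde\Delta_H)
&=\frac{\sqrt{n}}{n_1}\sum_{i=1}^{n_1}\left\{\tilde S_{1i}-\pi_0\hat m_1(W_{1i};\hat\mu_0^p)-\pi_1\hat m_0(W_{1i};\hat\mu_0^p)-\pi_1\tilde\Delta_H\right\}\\
&\quad-\frac{\sqrt{n}}{n_0}\sum_{i=1}^{n_0}\left\{\tilde S_{0i}-\pi_0\hat m_1(W_{0i};\hat\mu_0^p)-\pi_1\hat m_0(W_{0i};\hat\mu_0^p)+\pi_0\tilde\Delta_H\right\}\\
&=\frac{\sqrt{n}}{n_1}\sum_{i=1}^{n_1}\left\{\tilde S_{1i}-\pi_0m_1(W_{1i};\hat\mu_0^p)-\pi_1m_0(W_{1i};\hat\mu_0^p)-\pi_1\tilde\Delta_H\right\}\\
&\quad-\frac{\sqrt{n}}{n_0}\sum_{i=1}^{n_0}\left\{\tilde S_{0i}-\pi_0m_1(W_{0i};\hat\mu_0^p)-\pi_1m_0(W_{0i};\hat\mu_0^p)+\pi_0\tilde\Delta_H\right\}\\
&\quad+\sqrt{n}\left[\frac{\pi_0}{n_0}\sum_{i=1}^{n_0}\left(\hat m_1(W_{0i};\hat\mu_0^p)-m_1(W_{0i};\hat\mu_0^p)\right)-\frac{\pi_0}{n_1}\sum_{i=1}^{n_1}\left(\hat m_1(W_{1i};\hat\mu_0^p)-m_1(W_{1i};\hat\mu_0^p)\right)\right]\\
&\quad-\sqrt{n}\left[\frac{\pi_1}{n_1}\sum_{i=1}^{n_1}\left(\hat m_0(W_{1i};\hat\mu_0^p)-m_0(W_{1i};\hat\mu_0^p)\right)-\frac{\pi_1}{n_0}\sum_{i=1}^{n_0}\left(\hat m_0(W_{0i};\hat\mu_0^p)-m_0(W_{0i};\hat\mu_0^p)\right)\right]\\
&=\frac{\sqrt{n}}{n_1}\sum_{i=1}^{n_1}\left\{\tilde S_{1i}-\pi_0m_1(W_{1i};\hat\mu_0^p)-\pi_1m_0(W_{1i};\hat\mu_0^p)-\pi_1\tilde\Delta_H\right\}\\
&\quad-\frac{\sqrt{n}}{n_0}\sum_{i=1}^{n_0}\left\{\tilde S_{0i}-\pi_0m_1(W_{0i};\hat\mu_0^p)-\pi_1m_0(W_{0i};\hat\mu_0^p)+\pi_0\tilde\Delta_H\right\}+o_p(1)\\
&=\sqrt{n}(\hat\Delta_H-\tilde\Delta_H)+o_p(1).
\end{aligned}
$$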

Therefore, $\hat\Delta_H^{AUG}$ and $\hat\Delta_H$ are asymptotically equivalent. Furthermore, noting that

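$$
\tilde S_{1i}-\pi_0m_1(W_{1i};\hat\mu_0^p)-\pi_1m_0(W_{1i};\hat\mu_0^p)-\pi_1\tilde\Delta_H
=\left\{\tilde S_{1i}-m_1(W_{1i};\hat\mu_0^p)\right\}+\pi_1\left\{m_1(W_{1i};\hat\mu_0^p)-m_0(W_{1i};\hat\mu_0^p)-\tilde\Delta_H\right\}
$$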

and

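$$
E\left[\left\{\tilde S_{1i}-m_1(W_{1i};\hat\mu_0^p)\right\}\left\{m_1(W_{1i};\hat\mu_0^p)-m_0(W_{1i};\hat\mu_0^p)-\tilde\Delta_H\right\}\,\middle|\,W_{1i}\right]=0,
$$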

we have

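$$
E\left[\tilde S_{1i}-\pi_0m_1(W_{1i};\hat\mu_0^p)-\pi_1m_0(W_{1i};\hat\mu_0^p)-\pi_1\tilde\Delta_H\right]^2
=E\left[\tilde S_{1i}-m_1(W_{1i};\hat\mu_0^p)\right]^2+\pi_1^2E\left[m_1(W_{1i};\hat\mu_0^p)-m_0(W_{1i};\hat\mu_0^p)-\tilde\Delta_H\right]^2.
$$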

Similarly,

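$$
E\left[\tilde S_{0i}-\pi_0m_1(W_{0i};\hat\mu_0^p)-\pi_1m_0(W_{0i};\hat\mu_0^p)+\pi_0\tilde\Delta_H\right]^2
=E\left[\tilde S_{0i}-m_0(W_{0i};\hat\mu_0^p)\right]^2+\pi_0^2E\left[m_1(W_{0i};\hat\mu_0^p)-m_0(W_{0i};\hat\mu_0^p)-\tilde\Delta_H\right]^2.
$$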

Therefore, the variance of $\hat\Delta_H^{AUG}$ can also be consistently estimated by

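$$
\begin{aligned}
\hat\sigma_{AUG}^2
&=\frac{1}{n_1^2}\sum_{i=1}^{n_1}\left[\hat\mu_0^{(p)}(S_{1i},W_{1i})-\hat m_1(W_{1i};\hat\mu_0^p)\right]^2
+\frac{1}{n_0^2}\sum_{i=1}^{n_0}\left[\hat\mu_0^{(p)}(S_{0i},W_{0i})-\hat m_0(W_{0i};\hat\mu_0^p)\right]^2\\
&\quad+\frac{\pi_1^2}{n_1^2}\sum_{i=1}^{n_1}\left[\hat m_1(W_{1i};\hat\mu_0^p)-\hat m_0(W_{1i};\hat\mu_0^p)-\hat\Delta_H\right]^2
+\frac{\pi_0^2}{n_0^2}\sum_{i=1}^{n_0}\left[\hat m_1(W_{0i};\hat\mu_0^p)-\hat m_0(W_{0i};\hat\mu_0^p)-\hat\Delta_H\right]^2,
\end{aligned}
$$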

and $\hat\Delta_H^{AUG}/\hat\Delta_H=1+o_p(1)$.
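The four-term decomposition of the augmented variance estimate can be illustrated with a short Python sketch. It is an illustration only: the Gaussian kernel, the helper names `nw_smooth` and `sigma2_aug`, and the pooled-average form assumed for $\hat\Delta_H$ are not taken from the paper. The inputs `s1` and `s0` play the role of $\hat\mu_0^{(p)}(S_{gi},W_{gi})$.

```python
import math


def nw_smooth(w_train, s_train, w_eval, h):
    """Nadaraya-Watson smoother with an (assumed) Gaussian kernel."""
    fitted = []
    for w in w_eval:
        weights = [math.exp(-0.5 * ((wi - w) / h) ** 2) for wi in w_train]
        fitted.append(sum(k * s for k, s in zip(weights, s_train)) / sum(weights))
    return fitted


def sigma2_aug(w1, s1, w0, s0, h):
    """Four-term variance estimate: two within-arm terms plus two between-arm terms."""
    n1, n0 = len(w1), len(w0)
    n = n1 + n0
    pi1, pi0 = n1 / n, n0 / n

    m1_w1, m1_w0 = nw_smooth(w1, s1, w1, h), nw_smooth(w1, s1, w0, h)
    m0_w1, m0_w0 = nw_smooth(w0, s0, w1, h), nw_smooth(w0, s0, w0, h)

    # Assumed form of Delta_H hat: pooled average of m1_hat - m0_hat.
    delta = (sum(m1_w1) + sum(m1_w0) - sum(m0_w1) - sum(m0_w0)) / n

    # Residual variation of the transformed surrogate around its own regression.
    within1 = sum((s - a) ** 2 for s, a in zip(s1, m1_w1)) / n1 ** 2
    within0 = sum((s - b) ** 2 for s, b in zip(s0, m0_w0)) / n0 ** 2
    # Variation of the covariate-specific effect m1_hat - m0_hat around Delta_H hat.
    between1 = pi1 ** 2 * sum((a - b - delta) ** 2 for a, b in zip(m1_w1, m0_w1)) / n1 ** 2
    between0 = pi0 ** 2 * sum((a - b - delta) ** 2 for a, b in zip(m1_w0, m0_w0)) / n0 ** 2
    return within1 + within0 + between1 + between0


# With a very large bandwidth the smoother returns arm-specific means.
value = sigma2_aug(
    w1=[0.1, 0.4, 0.7, 1.0], s1=[1.0, 2.0, 3.0, 6.0],
    w0=[0.2, 0.5, 0.8], s0=[0.0, 2.0, 4.0], h=1e8,
)
print(value)
```

With a very large bandwidth the kernel weights are essentially equal, so the smoothers collapse to the arm-specific means, the two between-arm terms vanish, and the estimate reduces to the familiar two-sample form $\sum_i(\tilde S_{1i}-\bar S_1)^2/n_1^2+\sum_i(\tilde S_{0i}-\bar S_0)^2/n_0^2$. This makes explicit that the first two terms capture variation not explained by $W$, while the last two capture heterogeneity of the effect across $W$.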

APPENDIX E

Here, we provide an example in which the utility of the surrogate is heterogeneous and $W$ has the same distribution in the prior study and the current study, but $\Delta_P$ still fails to provide a lower bound for $\Delta$. In both the prior study and the current study, we assume that $\log(W)=\epsilon_W$, $S^{(g)}=W\times\exp(\delta_0g+\epsilon_S)$, and $Y^{(g)}=S^{(g)}W$, $g\in\{0,1\}$, where $\delta_0$ is a positive constant, and $\epsilon_W$ and $\epsilon_S$ are two independent standard normal random variables. It follows that $\mu_0^p(s,w)=sw$ and

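$$
\Delta=\Delta_H=E(S^{(1)}W)-E(S^{(0)}W)=E\left\{W\,E(S^{(1)}-S^{(0)}\mid W)\right\}
=E\left\{W\left(\exp(0.5+\delta_0)W-\exp(0.5)W\right)\right\}=\exp\left(\tfrac{5}{2}\right)\left(\exp(\delta_0)-1\right).
$$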

Next, we have

$$
\mu_0^p(s)=E(WS^{(0)}\mid S^{(0)}=s)=s\,E(W\mid S^{(0)}=s)=s\times\exp\left(\tfrac{1}{4}\right)s^{1/2}=\exp\left(\tfrac{1}{4}\right)s^{3/2},
$$

and

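$$
\Delta_P=E\left\{(S^{(1)})^{3/2}\exp\left(\tfrac{1}{4}\right)\right\}-E\left\{(S^{(0)})^{3/2}\exp\left(\tfrac{1}{4}\right)\right\}
=\exp\left(\tfrac{5}{2}\right)\left(\exp\left(\tfrac{3\delta_0}{2}\right)-1\right).
$$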

Consequently, in this setting, $\Delta_P>\Delta=\Delta_H$ even though $W$ has the same distribution in both studies.
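The closed-form values in this example can be checked numerically with a short Python sketch. It relies only on the model stated above: $W S^{(g)}=\exp(2\epsilon_W+\epsilon_S+\delta_0g)$ has log-variance 5, and $\exp(1/4)(S^{(g)})^{3/2}=\exp\{1/4+1.5(\epsilon_W+\epsilon_S+\delta_0g)\}$ has log-variance 4.5. The function names and the midpoint-rule integrator are illustrative choices, not part of the paper.

```python
import math


def normal_expectation(f, mean, var, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of E[f(X)] for X ~ N(mean, var)."""
    sd = math.sqrt(var)
    dz = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        z = -half_width + (k + 0.5) * dz
        total += f(mean + sd * z) * math.exp(-0.5 * z * z) * dz
    return total / math.sqrt(2.0 * math.pi)


def delta_true(delta0):
    """Delta = E(S1 * W) - E(S0 * W), computed by numerical integration."""
    # log(W * S^(g)) = 2*eps_W + eps_S + delta0*g ~ N(delta0*g, 4 + 1)
    return (normal_expectation(math.exp, delta0, 5.0)
            - normal_expectation(math.exp, 0.0, 5.0))


def delta_p(delta0):
    """Delta_P = E{mu(S1)} - E{mu(S0)} with mu(s) = exp(1/4) * s^(3/2)."""
    # log{exp(1/4) * (S^(g))^(3/2)} ~ N(1/4 + 1.5*delta0*g, 2.25 * 2)
    return (normal_expectation(math.exp, 0.25 + 1.5 * delta0, 4.5)
            - normal_expectation(math.exp, 0.25, 4.5))


for d0 in (0.1, 0.5, 1.0):
    closed_delta = math.exp(2.5) * (math.exp(d0) - 1.0)
    closed_delta_p = math.exp(2.5) * (math.exp(1.5 * d0) - 1.0)
    print(d0, delta_true(d0), closed_delta, delta_p(d0), closed_delta_p)
```

For each positive $\delta_0$ tried, the numerical values agree with the closed forms and $\Delta_P$ exceeds $\Delta$, because $\exp(3\delta_0/2)>\exp(\delta_0)$ whenever $\delta_0>0$.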

FIGURE 1. Discrete data example

References

  • 1. Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine 1989; 8(4): 431–440.
  • 2. Burzykowski T, Molenberghs G, Buyse M. The evaluation of surrogate endpoints. Springer; 2005.
  • 3. Wang Y, Taylor JM. A measure of the proportion of treatment effect explained by a surrogate marker. Biometrics 2002; 58(4): 803–812.
  • 4. Gilbert PB, Hudgens MG. Evaluating candidate principal surrogate endpoints. Biometrics 2008; 64(4): 1146–1154.
  • 5. Parast L, McDermott MM, Tian L. Robust estimation of the proportion of treatment effect explained by surrogate marker information. Statistics in Medicine 2016; 35(10): 1637–1653.
  • 6. Avorn J, Kesselheim AS. Up is down—pharmaceutical industry caution vs. federal acceleration of Covid-19 vaccine approval. New England Journal of Medicine 2020; 383(18): 1706–1708.
  • 7. Parast L, Cai T, Tian L. Using a surrogate marker for early testing of a treatment effect. Biometrics 2019; 75(4): 1253–1263.
  • 8. Chen X, Hartford A, Zhao J. A model-based approach for simulating adaptive clinical studies with surrogate endpoints used for interim decision-making. Contemporary Clinical Trials Communications 2020; 18: 100562.
  • 9. Price BL, Gilbert PB, van der Laan MJ. Estimation of the optimal surrogate based on a randomized trial. Biometrics 2018; 74(4): 1271–1281.
  • 10. Athey S, Chetty R, Imbens GW, Kang H. The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Tech. rep., National Bureau of Economic Research; 2019.
  • 11. Lin D, Fischl MA, Schoenfeld D. Evaluating the role of CD4-lymphocyte counts as surrogate endpoints in human immunodeficiency virus clinical trials. Statistics in Medicine 1993; 12(9): 835–842.
  • 12. Parast L, Cai T, Tian L. Testing for Heterogeneity in the Utility of a Surrogate Marker. Biometrics, In press 2021.
  • 13. Freedman LS, Graubard BI, Schatzkin A. Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine 1992; 11(2): 167–178.
  • 14. Parast L, Cai T, Tian L. Evaluating Surrogate Marker Information using Censored Data. Statistics in Medicine 2017; 36(11): 1767–1782.
  • 15. VanderWeele TJ. Surrogate measures and consistent surrogates. Biometrics 2013; 69(3): 561–565.
  • 16. Lin D, Fleming T, De Gruttola V, et al. Estimating the proportion of treatment effect explained by a surrogate marker. Statistics in Medicine 1997; 16(13): 1515–1527.
  • 17. Eastell R, Barton I, Hannon R, Chines A, Garnero P, Delmas P. Relationship of early changes in bone resorption to the reduction in fracture risk with risedronate. Journal of Bone and Mineral Research 2003; 18(6): 1051–1056.
  • 18. Scott D. Multivariate density estimation. Wiley, New York; 1992.
  • 19. Henry K, Erice A, Tierney C, et al. A randomized, controlled, double-blind study comparing the survival benefit of four different reverse transcriptase inhibitor therapies (three-drug, two-drug, and alternating drug) for the treatment of advanced AIDS. Journal of Acquired Immune Deficiency Syndromes and Human Retrovirology 1998; 19(4): 339–349.
  • 20. Hammer SM, Squires KE, Hughes MD, et al. A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less. New England Journal of Medicine 1997; 337(11): 725–733.
  • 21. ACTG. AIDS Clinical Trial Group: Proposals and Collaboration. https://actgnetwork.org/submit-a-proposal/; 2021.
  • 22. Calmy A, Ford N, Hirschel B, et al. HIV viral load monitoring in resource-limited regions: optional or necessary? Clinical Infectious Diseases 2007; 44(1): 128–134.
  • 23. Cai T, Tian L, Wong PH, Wei L. Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics 2011; 12(2): 270–282.
  • 24. Zhao L, Tian L, Cai T, Claggett B, Wei LJ. Effectively selecting a target population for a future comparative study. Journal of the American Statistical Association 2013; 108(502): 527–539.
  • 25. Fleming TR, Prentice RL, Pepe MS, Glidden D. Surrogate and auxiliary endpoints in clinical trials, with potential applications in cancer and AIDS research. Statistics in Medicine 1994; 13(9): 955–968.
  • 26. Pepe MS. Inference using surrogate outcome data and a validation sample. Biometrika 1992; 79(2): 355–365.
  • 27. Elliott MR, Conlon AS, Li Y, Kaciroti N, Taylor JM. Surrogacy marker paradox measures in meta-analytic settings. Biostatistics 2015; 16(2): 400–412. doi: 10.1093/biostatistics/kxu043
