Testing for Heterogeneity in the Utility of a Surrogate Marker

Layla Parast; Tianxi Cai; Lu Tian

doi:10.1111/biom.13600

Author manuscript; available in PMC: 2024 Jun 1.

Published in final edited form as: Biometrics. 2021 Dec 7;79(2):799–810. doi: 10.1111/biom.13600

PMCID: PMC9170832 NIHMSID: NIHMS1756809 PMID: 34874550

Abstract

In studies that require long-term and/or costly follow-up of participants to evaluate a treatment, there is often interest in identifying and using a surrogate marker to evaluate the treatment effect. While several statistical methods have been proposed to evaluate potential surrogate markers, available methods generally do not account for or address the potential for a surrogate to vary in utility or strength by patient characteristics. Previous work examining surrogate markers has indicated that there may be such heterogeneity, i.e., that a surrogate marker may be useful (with respect to capturing the treatment effect on the primary outcome) for some subgroups, but not for others. This heterogeneity is important to understand, particularly if the surrogate is to be used in a future trial to replace the primary outcome. In this paper, we propose an approach and estimation procedures to measure the surrogate strength as a function of a baseline covariate W and thus examine potential heterogeneity in the utility of the surrogate marker with respect to W. Within a potential outcome framework, we quantify the surrogate strength/utility using the proportion of treatment effect on the primary outcome that is explained by the treatment effect on the surrogate. We propose testing procedures to test for evidence of heterogeneity, examine finite sample performance of these methods via simulation, and illustrate the methods using AIDS clinical trial data.

Keywords: heterogeneity, kernel methods, nonparametric methods, potential outcomes, surrogate marker, treatment effect

1. Introduction

For many clinical outcomes, randomized clinical trials to evaluate the effectiveness of a treatment often require measuring a primary outcome that is expensive, invasive and/or requires long-term follow-up of participants. In such settings, there is substantial interest in identifying and using surrogate markers (measurements or outcomes measured at an earlier time or with less cost that are predictive of the primary clinical outcome of interest) to evaluate the treatment effect. Several statistical methods have been proposed to evaluate potential surrogate markers including parametric and nonparametric methods (Prentice, 1989; Freedman et al., 1992; Lin et al., 1997; Wang and Taylor, 2002; Parast et al., 2016), methods within a principal stratification framework (Gilbert and Hudgens, 2008; Joffe and Greene, 2009; Conlon et al., 2014; Huang and Gilbert, 2011), and methods for a meta-analytic setting (Daniels and Hughes, 1997; Renard et al., 2002; Burzykowski et al., 2001, 2005), i.e., where information from multiple trials is available.

However, currently available methods generally do not account for or address the potential for a surrogate to vary in strength or utility by certain patient characteristics. Previous work examining surrogate markers in clinical trials has indicated that there may be such heterogeneity in the utility of a surrogate marker (Lin et al., 1993; Cohen and Lindsell, 2012; Wang-Lopez et al., 2015; Spieker and Huang, 2017). That is, a surrogate marker may be useful (with respect to capturing the treatment effect on the primary outcome) for some subgroups, but not useful for others. This heterogeneity is important to understand, particularly if the surrogate is to be used in a future trial to potentially replace the primary outcome (Parast et al., 2019; Price et al., 2018). With respect to heterogeneity in the average treatment effect itself, there has certainly been an extensive amount of work done to provide approaches to assess and test for such heterogeneity (Crump et al., 2008; Willke et al., 2012; Wager and Athey, 2018). However, heterogeneity in the utility of a surrogate marker is more complex as it involves assessing heterogeneity in not only the average treatment effect on the primary outcome, but also potential heterogeneity in both the treatment effect on the surrogate and the relationship between the surrogate and the primary outcome. To our knowledge, there are no methods to assess and rigorously test for such potential heterogeneity with respect to a baseline covariate in the surrogate marker setting.

Our goal is to develop methods to examine and test for heterogeneity in the strength of a surrogate marker. As a measure of surrogate strength, we focus on the proportion of the treatment effect on the primary outcome that is explained by the treatment effect on the surrogate marker, denoted as R_S, within a potential outcome framework (Freedman et al., 1992; Wang and Taylor, 2002; Parast et al., 2016). While limitations of this metric have been discussed, we focus on it due to its widespread use in practice when examining a surrogate within a single study; methods to assess heterogeneity within this context would therefore likely be most useful in this area (Lin et al., 1997; VanderWeele, 2013; Inker et al., 2016; Agyemang et al., 2018; Sprenger et al., 2020). As an example of potential heterogeneity, consider the use of change in CD4 cell count as a surrogate marker for plasma HIV-1 RNA, which is of interest because RNA is relatively expensive to obtain (Calmy et al., 2007). In our application in this paper to AIDS clinical trial data, we show that the proportion of the treatment effect on RNA that is explained by CD4 varies significantly by baseline CD4 level, ranging from 50–70% for lower baseline CD4 counts and from 10–20% for higher baseline CD4 counts. If CD4 were to be used in the future to make inference about the treatment effect on RNA without regard to these differences by baseline CD4, such inference could lead to inaccurate conclusions about the treatment effect.

In this paper, we first propose an approach and estimation procedures to measure the surrogate strength as a function of a baseline covariate W and thus, examine potential heterogeneity in the utility of the surrogate marker with respect to W. We then propose testing procedures, both an omnibus test and a trend-based test, to test for evidence of heterogeneity. We examine the performance of these methods using a simulation study and illustrate the methods using data from an AIDS clinical trial. We focus on a continuous W but additionally propose and illustrate methods for settings where W is discrete.

2. Notation, Setting and Assumptions

2.1. Notation and Setting

Let Y denote the primary outcome, S denote the surrogate marker, Z denote the treatment indicator, where treatment is randomized and Z ∈ {0, 1} (i.e., treatment vs. control), and W denote a single continuous baseline covariate of interest. We use potential outcomes notation where each person has potential outcomes {Y⁽¹⁾, Y⁽⁰⁾, S⁽¹⁾, S⁽⁰⁾}, where Y^(g) is the outcome when Z = g and S^(g) is the surrogate when Z = g. Importantly, a potential outcomes framework is useful here in order to understand what assumptions will be required (see Section 2.2).

Throughout, we focus on assessing the utility of the surrogate marker using the proportion of treatment effect explained quantity. To define this quantity, we first define the overall treatment effect as:

Δ = E (Y^{(1)} - Y^{(0)}) = E (Y^{(1)}) - E (Y^{(0)}) .

Following Wang and Taylor (2002) and Parast et al. (2016), the “residual” treatment effect is defined as

Δ_{S} = \int_{- \infty}^{\infty} E (Y^{(1)} - Y^{(0)} ∣ S^{(1)} = S^{(0)} = s) d F_{S^{(0)}} (s) \qquad (1)

= \int_{- \infty}^{\infty} E (Y^{(1)} ∣ S^{(1)} = s) d F_{S^{(0)}} (s) - \int_{- \infty}^{\infty} E (Y^{(0)} ∣ S^{(0)} = s) d F_{S^{(0)}} (s) \qquad (2)

where $F_{S^{(0)}} (\cdot)$ is the marginal cumulative distribution function of S⁽⁰⁾ and we have assumed that Y⁽¹⁾ ⊥ S⁽⁰⁾|S⁽¹⁾ and Y⁽⁰⁾ ⊥ S⁽¹⁾|S⁽⁰⁾ for identifiability. This quantity, Δ_S, captures the leftover treatment effect on the primary outcome, after accounting for the treatment effect on the surrogate marker. Informally, it reflects the expected treatment effect if the surrogate marker distribution were forced to be equal in both groups, where we have selected the reference distribution for the surrogate marker to be the distribution in the control group. The proportion of the treatment effect on the primary outcome that is explained by the treatment effect on S is then defined as R_S = 1 − Δ_S/Δ.
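As a worked illustration of these definitions (an added example, not part of the original derivation), suppose hypothetically that the conditional means are linear, E(Y^(g) | S^(g) = s) = α_g + β_g s. The symbols α_g and β_g are assumptions introduced only for this sketch. Substituting into (2) gives:

```latex
\begin{aligned}
\Delta &= (\alpha_1 - \alpha_0) + \beta_1 E(S^{(1)}) - \beta_0 E(S^{(0)}), \\
\Delta_S &= (\alpha_1 - \alpha_0) + (\beta_1 - \beta_0) E(S^{(0)}), \\
\Delta - \Delta_S &= \beta_1 \{E(S^{(1)}) - E(S^{(0)})\}, \\
R_S &= 1 - \Delta_S / \Delta = \beta_1 \{E(S^{(1)}) - E(S^{(0)})\} / \Delta .
\end{aligned}
```

For instance, with α₁ = 3, β₁ = 6, α₀ = 2, β₀ = 5, E(S⁽¹⁾) = 6 and E(S⁽⁰⁾) = 5 (the values used in the no-heterogeneity simulation setting of Section 5), this yields Δ = 12, Δ_S = 6 and R_S = 1/2.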

2.2. Assumptions

Throughout, we require the following assumptions which parallel assumptions that are often required when evaluating surrogate markers in general (Wang and Taylor, 2002; Taylor et al., 2005; Parast et al., 2017):

(C1) μ₁(s, w) is monotone increasing in s, where μ_g(s, w) = E(Y^(g) | S^(g) = s, W = w)
(C2) P(S⁽¹⁾ > s | W = w) ≥ P(S⁽⁰⁾ > s | W = w) for all s and w
(C3) μ₁(s, w) ≥ μ₀(s, w) for all s and w
(C4) S is a continuous random variable with finite support over an interval [a, b] and S⁽⁰⁾ | W = w and S⁽¹⁾ | W = w have the same support.
(C5) Y⁽¹⁾ ⊥ S⁽⁰⁾ | S⁽¹⁾, W and Y⁽⁰⁾ ⊥ S⁽¹⁾ | S⁽⁰⁾, W

Assumption (C1) implies that the surrogate marker is positively associated with the primary outcome; (C2) implies that there is a non-negative treatment effect on the surrogate marker for the subgroup of patients with the same covariate value W; and (C3) implies that there is a non-negative effect of treatment on the primary outcome beyond that on the surrogate marker in the subgroup of patients with the same covariate value W. These assumptions guard against a surrogate paradox situation (i.e., when the treatment has a positive effect on the surrogate, the surrogate and primary outcome are positively associated, but the treatment in fact has a negative effect on the primary outcome) within any subgroup of patients with the same covariate value (VanderWeele, 2013). Without loss of generality, Assumptions (C2) and (C3) are stated under the assumption that higher values for the surrogate and the primary outcome are “better”; if, in fact, lower values were “better”, these assumptions should be adjusted to reflect non-positive treatment effects. Assumption (C4) is needed for our kernel-based estimation approach. In general, Assumptions (C1)-(C3) can be effectively examined empirically. It can be difficult to verify Assumption (C4) with moderate sample sizes. Assumption (C5) is not directly testable from observed data. However, it is only required for ensuring that the defined “residual” treatment effect as a function of W has the desired causal interpretation. We discuss these assumptions further in Web Appendix A.

3. Assessing Heterogeneity

3.1. Approach

Assume there is interest in examining heterogeneity with respect to the baseline covariate W. Our goal is to define and estimate R_S and thus, Δ_S and Δ, as a function of W. Let

Δ (w) = E (Y^{(1)} ∣ W = w) - E (Y^{(0)} ∣ W = w), and

Δ_{S} (w) = \int_{- \infty}^{\infty} E (Y^{(1)} - Y^{(0)} ∣ S^{(1)} = S^{(0)} = s, W = w) d F_{S^{(0)}} (s ∣ w) = \int_{- \infty}^{\infty} μ_{1} (s, w) d F_{S^{(0)}} (s ∣ w) - \int_{- \infty}^{\infty} μ_{0} (s, w) d F_{S^{(0)}} (s ∣ w)

where $F_{S^{(g)}} (\cdot ∣ w)$ is the cumulative distribution function of S^(g) given W = w and the second equality follows from Assumption (C5). Then we may define R_S(w) = 1 − Δ_S(w)/Δ(w) as a measure of surrogate strength for individuals with the same baseline covariate W = w. In the next section we propose a nonparametric estimation procedure to estimate each of these quantities so that they may be examined as a function of w.

It is interesting to consider what types of relationships between Y, S, and W would imply heterogeneity in the utility of the surrogate. When W is associated with Y but the association does not differ by treatment group, and W is not associated with S, one would not expect there to be heterogeneity. However, when these same associations hold, but the association between W and Y does differ by treatment group, one would expect heterogeneity; that is, even when W is not associated with S itself, treatment effect heterogeneity oftentimes results in heterogeneity in the utility of the surrogate. When the complexity of these associations increases, i.e., W is associated with S and this association may or may not differ by treatment group, one would also expect heterogeneity, but the level of heterogeneity (e.g., large differences vs. small differences) is difficult to determine, partly because our measure of interest is constructed as a ratio. We gain more insight into such settings in our simulation study in Section 5.

Remark 1.

When considering use of the surrogate in a future study, it is of interest to note that the average residual treatment effect, E{Δ_S(W)}, is not the same as Δ_S in (2) when there is heterogeneity in the utility of the surrogate. The former measures the treatment effect on the surrogate as E{μ₁(S⁽¹⁾, W)} − E{μ₁(S⁽⁰⁾, W)}, while the latter measures it as $E \{{\tilde{μ}}_{1} (S^{(1)})\} - E \{{\tilde{μ}}_{1} (S^{(0)})\}$, where ${\tilde{μ}}_{g} (s) = E (Y^{(g)} ∣ S^{(g)} = s)$, and when there is heterogeneity, $μ_{1} (S^{(g)}, W) \neq {\tilde{μ}}_{1} (S^{(g)})$. This will be important to consider if one is using the surrogate marker to estimate and test for a treatment effect on the primary outcome in a future study; we further discuss the impact of heterogeneity on testing in a future study in Section 7.

3.2. Nonparametric estimation

We propose a nonparametric estimation method for Δ(w), Δ_S(w), and R_S(w) involving two-dimensional smoothing over (S, W). The observed data consist of {Y_gi, S_gi, W_gi} for person i in treatment group g; let n_g denote the number of individuals in treatment group g. First, since the average treatment effect Δ can be estimated simply using $\hat{Δ} = n_{1}^{- 1} \sum_{i = 1}^{n_{1}} Y_{1 i} - n_{0}^{- 1} \sum_{i = 1}^{n_{0}} Y_{0 i}$, we propose to estimate Δ(w) as

\hat{Δ} (w) = {\hat{μ}}_{1} (w) - {\hat{μ}}_{0} (w)

where

{\hat{μ}}_{g} (w) = \frac{\sum_{i = 1}^{n_{g}} K_{h_{g}} (W_{g i} - w) Y_{g i}}{\sum_{i = 1}^{n_{g}} K_{h_{g}} (W_{g i} - w)}, g = 0, 1,

K(·) is a smooth symmetric density function with finite support, K_h(·) = K(·/h)/h, and h₁ and h₀ are bandwidths, which may be data dependent. To avoid the need for bias correction in subsequent statistical inference, we undersmooth and select all bandwidths throughout to be of order O(n^−ϵ), ϵ ∈ (1/5, 1/2), where n = n₁ + n₀. We also assume that π_j = lim_{n→∞} n_j/n ∈ (0, 1), j = 0, 1.
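The kernel-weighted means above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the Epanechnikov kernel is an assumed choice of a symmetric, finite-support density, and the function names (`epanechnikov`, `kernel_mean`, `delta_hat`) and the noise-free toy data are hypothetical.

```python
def epanechnikov(u):
    """Epanechnikov kernel: a symmetric density supported on [-1, 1] (assumed choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_mean(w, W, Y, h):
    """Kernel-weighted average of Y at W = w, using K_h(x) = K(x / h) / h."""
    weights = [epanechnikov((wi - w) / h) / h for wi in W]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within one bandwidth of w")
    return sum(k * y for k, y in zip(weights, Y)) / total


def delta_hat(w, W1, Y1, W0, Y0, h1, h0):
    """Estimate Delta(w) as the difference of the two kernel-weighted means."""
    return kernel_mean(w, W1, Y1, h1) - kernel_mean(w, W0, Y0, h0)


# Noise-free toy data on an evenly spaced grid (hypothetical, for checking only):
# E(Y1 | W = w) = 39 + 4w and E(Y0 | W = w) = 27 + w, so Delta(w) = 12 + 3w.
grid = [i / 100 for i in range(201)]
Y1 = [39 + 4 * w for w in grid]
Y0 = [27 + w for w in grid]
estimate_at_1 = delta_hat(1.0, grid, Y1, grid, Y0, h1=0.2, h0=0.2)
print(estimate_at_1)  # close to Delta(1) = 15
```

Because the toy grid is symmetric around w = 1 and the outcome is linear in w, the local average reproduces Δ(1) up to floating-point error; with real data the estimate would carry both smoothing bias and sampling noise.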

For the residual treatment effect, noting that without W, the quantity Δ_S can be estimated by

{\hat{Δ}}_{S} = n_{0}^{- 1} \sum_{i = 1}^{n_{0}} {\frac{\sum_{j = 1}^{n_{1}} K_{\tilde{h}} (S_{1 j} - S_{0 i}) Y_{1 j}}{\sum_{j = 1}^{n_{1}} K_{\tilde{h}} (S_{1 j} - S_{0 i})}} - n_{0}^{- 1} \sum_{i = 1}^{n_{0}} Y_{0 i}, \qquad (3)

with an appropriate smoothing bandwidth $\tilde{h}$ , we propose to estimate Δ_S(w) using two-dimensional smoothing as

{\hat{Δ}}_{S} (w) = {\hat{μ}}_{10} (w) - {\hat{μ}}_{0} (w),

where ${\hat{μ}}_{10} (w) = \int {\hat{μ}}_{1} (s, w) d {\hat{F}}_{S^{(0)}} (s ∣ w)$ ,

{\hat{F}}_{S^{(0)}} (s ∣ w) = \frac{\sum_{i = 1}^{n_{0}} K_{h_{2}} (W_{0 i} - w) I (S_{0 i} \leq s)}{\sum_{i = 1}^{n_{0}} K_{h_{2}} (W_{0 i} - w)},

and {\hat{μ}}_{1} (s, w) = \frac{\sum_{i = 1}^{n_{1}} K_{h_{3}} (S_{1 i} - s) K_{h_{4}} (W_{1 i} - w) Y_{1 i}}{\sum_{i = 1}^{n_{1}} K_{h_{3}} (S_{1 i} - s) K_{h_{4}} (W_{1 i} - w)}

are nonparametric smoothed estimators of the conditional cumulative distribution of S⁽⁰⁾ given W = w, and the conditional expectation of Y⁽¹⁾ given (S⁽¹⁾, W) = (s, w), respectively. Similarly, all bandwidths h₂, h₃, and h₄ are undersmoothed to eliminate the need for bias correction. Finally, we define a nonparametric estimate of R_S(w) as ${\hat{R}}_{S} (w) = 1 - {\hat{Δ}}_{S} (w) / \hat{Δ} (w)$ . In Web Appendix B, we propose parallel estimation procedures and an omnibus test for the case when W is discrete.
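The two-dimensional smoothing estimator can be sketched as follows. Since the estimated conditional distribution of S⁽⁰⁾ given W = w is a weighted empirical distribution, integrating the estimate of μ₁(s, w) against it reduces to a kernel-weighted average of that estimate evaluated at the control observations S_0i. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: the Epanechnikov kernel, the function names, the fixed bandwidths, the decision to skip control points with no nearby treated observations, and the reading of N(6, 4) as variance 4 are all assumptions.

```python
import random


def kern(x, h):
    """Scaled Epanechnikov kernel K_h(x) = K(x / h) / h (kernel choice is an assumption)."""
    u = x / h
    return 0.75 * (1.0 - u * u) / h if abs(u) <= 1.0 else 0.0


def kernel_mean(w, W, Y, h):
    """Kernel-weighted mean of Y at W = w."""
    k = [kern(wi - w, h) for wi in W]
    return sum(ki * yi for ki, yi in zip(k, Y)) / sum(k)


def mu1_hat(s, w, S1, W1, Y1, h3, h4):
    """Two-dimensional kernel estimate of E(Y1 | S1 = s, W1 = w); None if no data nearby."""
    num = den = 0.0
    for si, wi, yi in zip(S1, W1, Y1):
        k = kern(si - s, h3) * kern(wi - w, h4)
        num += k * yi
        den += k
    return num / den if den > 0.0 else None


def surrogate_strength(w, treated, control, h0, h1, h2, h3, h4):
    """Return (Delta_hat(w), Delta_S_hat(w), R_S_hat(w)); inputs are (S, W, Y) lists."""
    S1, W1, Y1 = treated
    S0, W0, Y0 = control
    mu0 = kernel_mean(w, W0, Y0, h0)
    delta = kernel_mean(w, W1, Y1, h1) - mu0
    # Integrating mu1_hat against the weighted empirical CDF of S0 given W = w
    # is a kernel-weighted average of mu1_hat(S0i, w) over control observations.
    num = den = 0.0
    for s0, w0 in zip(S0, W0):
        k = kern(w0 - w, h2)
        if k == 0.0:
            continue
        m = mu1_hat(s0, w, S1, W1, Y1, h3, h4)
        if m is None:  # sketch-only choice: skip points with no treated neighbours
            continue
        num += k * m
        den += k
    delta_s = num / den - mu0
    return delta, delta_s, 1.0 - delta_s / delta


def simulate(n, s_mean, s_sd, a, b, d, rng):
    """Draw (S, W, Y) with W ~ U(0, 2), S normal, Y = a + b*S + d*W + N(0, 9) noise."""
    W = [rng.uniform(0.0, 2.0) for _ in range(n)]
    S = [rng.gauss(s_mean, s_sd) for _ in range(n)]
    Y = [a + b * s + d * w + rng.gauss(0.0, 3.0) for s, w in zip(S, W)]
    return S, W, Y


rng = random.Random(1)
treated = simulate(1500, 6.0, 2.0, 3.0, 6.0, 4.0, rng)
control = simulate(1500, 5.0, 1.0, 2.0, 5.0, 1.0, rng)
delta, delta_s, r_s = surrogate_strength(1.0, treated, control, 0.3, 0.3, 0.3, 0.5, 0.3)
print(delta, delta_s, r_s)  # population values under this design: 15, 9, 0.4
```

The bandwidths here are fixed by hand purely for illustration; they do not follow the undersmoothing rule described in the text.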

Remark 2.

This estimator of the residual treatment effect depends on consistent estimators of μ₁(s, w), $F_{S^{(0)}} (s ∣ w)$, and μ_g(w), g = 0, 1. One could alternatively avoid the need to estimate μ₁(s, w) by instead estimating a quantity akin to the “propensity score” $f_{S^{(0)}} (s ∣ w) / f_{S^{(1)}} (s ∣ w)$, where $f_{S^{(g)}} (s ∣ w) = d F_{S^{(g)}} (s ∣ w) / d s$, g = 0, 1. Specifically, Δ_S(w) could alternatively be estimated by

\frac{\sum_{i = 1}^{n_{1}} K_{\bar{h}} (W_{1 i} - w) {\hat{r}}_{01} (S_{1 i} ∣ w) Y_{1 i}}{\sum_{i = 1}^{n_{1}} K_{\bar{h}} (W_{1 i} - w)} - {\hat{μ}}_{0} (w),

where ${\hat{r}}_{01} (s ∣ w)$ is a consistent estimator of $r_{01} (s ∣ w) = f_{S^{(0)}} (s ∣ w) / f_{S^{(1)}} (s ∣ w)$ and $\bar{h}$ is a smoothing bandwidth. The nonparametric estimation of r₀₁(s | w) still requires two-dimensional smoothing over both s and w.

Remark 3.

If the covariate is multi-dimensional (dimension greater than one; denoted W in this remark), the multi-dimensional smoothing used within the nonparametric estimation approach for ${\hat{Δ}}_{S} (w)$ may not work well unless the sample size is very large, owing to the curse of dimensionality. As an alternative, we propose a set of semiparametric models to allow for assessment of heterogeneity. Specifically, one could consider assuming the following: (1) a varying coefficient model $μ_{1} (s, w) = g_{Y} \{β_{1} {(s)}^{⊤} \bar{w}\}$, where $\bar{w} = {(1, w^{'})}^{'}$, g_Y (·) is a known, strictly increasing link function and β₁(s) is an unknown function of s. This model allows the effect of W on Y to vary over s. Let the estimators of β₁(s) and μ₁(s, w) be denoted by ${\hat{β}}_{1} (s)$ and $g_{Y} \{{\hat{β}}_{1} {(s)}^{⊤} \bar{w}\}$, respectively, and let ${\hat{β}}_{0} (s)$ denote the analogous estimator under the corresponding model for μ₀(s, w); (2) a general transformation model for the cumulative distribution function of S⁽⁰⁾ given W = w, $F_{S^{(0)}} (s ∣ w) = g_{S}^{- 1} [g_{S} \{F_{0} (s)\} + γ^{⊤} w]$, where g_S(·) is a given link function and F₀(·) is an unknown baseline distribution function. Let the estimator of $F_{S^{(0)}} (s ∣ w)$ be denoted by ${\hat{F}}_{S^{(0)}} (s ∣ w) = g_{S}^{- 1} [g_{S} \{{\hat{F}}_{0} (s)\} + {\hat{γ}}^{⊤} w]$, where $({\hat{F}}_{0} (\cdot), \hat{γ})$ is a consistent estimator of (F₀(·), γ). The residual treatment effect, Δ_S(w), can then be estimated by

\int g_{Y} \{{\hat{β}}_{1} {(s)}^{⊤} \bar{w}\} d {\hat{F}}_{S^{(0)}} (s ∣ w) - \int g_{Y} \{{\hat{β}}_{0} {(s)}^{⊤} \bar{w}\} d {\hat{F}}_{S^{(0)}} (s ∣ w),

and the surrogacy R_S(w) can be estimated accordingly. Note that even when W is a scalar, a flexible semiparametric model could alternatively be used to model E(Y⁽¹⁾ | S⁽¹⁾ = s, W = w), if desired. For example, one may consider the additive model μ₁(s, w) = g_Y {β_S(s) + γ_w(w)}, where g_Y (·) is a given link function. The advantage of such an alternative is that the associated inference would not involve two-dimensional smoothing, and thus could more feasibly be used in settings with smaller sample sizes.

3.3. Inference and Variance Estimation

In Web Appendix C, we show that under mild regularity conditions ${\hat{Δ}}_{S} (w)$ is a consistent estimator of Δ_S(w) and that as n → ∞,

\sqrt{n h} \begin{pmatrix} {\hat{Δ}}_{S} (w) - Δ_{S} (w) \\ \hat{Δ} (w) - Δ (w) \end{pmatrix} \to N (0, Σ_{Δ} (w)),

assuming that h_i/h ∈ [r_L, r_U], i = 0, 1, 2, 3, 4 for finite positive constants r_L and r_U. It then follows that ${\hat{R}}_{S} (w)$ is a consistent estimator of R_S(w) and, by the delta method, $\sqrt{n h} \{{\hat{R}}_{S} (w) - R_{S} (w)\}$ also converges weakly to a mean zero normal distribution with variance $σ_{R}^{2} (w)$. The variance-covariance matrix Σ_Δ(w) as well as closed form variance estimates are given in Web Appendix C. Using these estimates, one may construct 95% confidence intervals for Δ(w), Δ_S(w), and R_S(w) using a normal approximation. We examine the performance of the variance estimates and confidence interval construction in our simulation study in Section 5.
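The delta-method step for the ratio can be made concrete with a short sketch. It assumes that variance and covariance estimates for the two estimators at a given w are already available on the scale of the estimators themselves (that is, already divided by nh); the closed-form expressions for those inputs are in Web Appendix C and are not reproduced here. The function name and the numeric inputs are hypothetical.

```python
import math


def r_s_wald_interval(delta_s, delta, var_s, var_d, cov_sd, z=1.959963984540054):
    """Delta-method standard error and Wald interval for R = 1 - delta_s / delta.

    var_s, var_d and cov_sd are the (already scaled) variance of delta_s,
    variance of delta, and their covariance; all are inputs assumed to be
    supplied by some other routine.
    """
    r = 1.0 - delta_s / delta
    grad_s = -1.0 / delta            # dR / d(delta_s)
    grad_d = delta_s / delta ** 2    # dR / d(delta)
    var_r = (grad_s ** 2 * var_s
             + grad_d ** 2 * var_d
             + 2.0 * grad_s * grad_d * cov_sd)
    se = math.sqrt(var_r)
    return r, se, (r - z * se, r + z * se)


# Hypothetical inputs, for illustration only.
r, se, (lower, upper) = r_s_wald_interval(6.0, 12.0, var_s=0.16, var_d=0.25, cov_sd=0.10)
print(r, se, lower, upper)
```

A positive covariance between the two estimators reduces the variance of the ratio, which is why the joint limiting distribution, rather than the two marginal variances alone, is needed.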

It is also possible to construct a simultaneous confidence band for R_S(w) over a given interval [w_a, w_b] within the support of W. We describe this procedure and provide justification for the validity of this proposed confidence band in Web Appendix D. We illustrate this confidence band in our simulation study (Section 5) and AIDS application (Section 6).

4. Testing Procedures

4.1. Omnibus Test

While our aim in Section 3 was to develop methods to assess potential heterogeneity, our goal in this section is to formally test for the presence of heterogeneity in the utility of the surrogate marker. That is, we wish to test the null hypothesis against the alternative:

H_{0} : R_{S} (\cdot) \text{ is constant within } [w_{a}, w_{b}], \text{ i.e., } \exists τ \text{ s.t. } \forall w \in [w_{a}, w_{b}], R_{S} (w) = τ,

H_{A} : R_{S} (\cdot) \text{ is not constant within } [w_{a}, w_{b}], \text{ i.e., } \forall τ, \exists w \in [w_{a}, w_{b}] \text{ s.t. } R_{S} (w) \neq τ .

To achieve this, we also let (h₁, h₀) = (h₄, h₂) and consider a supremum type test statistic

T = sup_{w \in [w_{a}, w_{b}]} \frac{\sqrt{n h} | {\hat{Δ}}_{S} (w) - (1 - \hat{τ}) \hat{Δ} (w) |}{{\hat{σ}}_{D} (w)},

where ${\hat{σ}}_{D}^{2} (w)$ is a consistent estimator of the variance of $\sqrt{n h} \{{\hat{Δ}}_{S} (w) - (1 - \hat{τ}) \hat{Δ} (w)\}$, and

\hat{τ} = 1 - \frac{\int_{w_{a}}^{w_{b}} {\hat{Δ}}_{S} (w) d w}{\int_{w_{a}}^{w_{b}} \hat{Δ} (w) d w}

Under the null hypothesis that R_S(w) = τ :

{\hat{Δ}}_{S} (w) - \hat{Δ} (w) (1 - \hat{τ}) \qquad (4)

= \{{\hat{Δ}}_{S} (w) - Δ_{S} (w)\} - \{\hat{Δ} (w) - Δ (w)\} \{1 - R_{S} (w)\} + Δ (w) (\hat{τ} - τ) \qquad (5)

= \{{\hat{Δ}}_{S} (w) - Δ_{S} (w)\} - \{\hat{Δ} (w) - Δ (w)\} (1 - τ) \qquad (6)

- \frac{Δ (w)}{\int_{w_{a}}^{w_{b}} Δ (w) d w} \left( \int_{w_{a}}^{w_{b}} \{{\hat{Δ}}_{S} (w) - Δ_{S} (w)\} d w - (1 - τ) \int_{w_{a}}^{w_{b}} \{\hat{Δ} (w) - Δ (w)\} d w \right), \qquad (7)

whose distribution can be approximated by the conditional distribution of a stochastic process Z*(w), defined in Web Appendix E, given the observed data. Therefore, we may generate a large number of realizations of Z*(w) by repeatedly simulating independent N(0, 1) variables U_gi, and let

T^{*} = sup_{w \in [w_{a}, w_{b}]} | \sqrt{n h} \frac{Z^{*} (w)}{{\hat{σ}}_{D} (w)} |,

where ${\hat{σ}}_{D}^{2} (w)$ is the empirical variance of the simulated $\sqrt{n h} Z^{*} (w)$. After obtaining B realizations of T*, $\{T_{b}^{*}, b = 1, \dots, B\}$, the p-value for testing constant surrogacy over W can be approximated by $B^{- 1} \sum_{b = 1}^{B} I (T_{b}^{*} \geq t_{o b s})$, where t_obs is the observed value of the test statistic T. In Web Appendix E, we provide justification for the validity of this proposed testing procedure. The term (7) and its counterpart in Z*(w) are of order O_p(n^{−1/2}) and asymptotically negligible, but we include them to account for the variance of $\hat{τ}$ in finite samples. While an advantage of this test is that it is omnibus, it may lack power for detecting a specific alternative. In the following section, we propose an alternative testing approach.
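The bookkeeping steps of the omnibus test (the pooled estimate of τ, the supremum statistic over a grid, and the resampling p-value) can be sketched as below. This sketch does not construct the perturbed process Z*(w), which is defined in Web Appendix E; it assumes the resampled statistics are already available. The trapezoidal rule, the evaluation grid, the function names, and all numeric inputs are assumptions made for illustration.

```python
def trapezoid(xs, ys):
    """Trapezoidal approximation to the integral of y over the grid xs."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def tau_hat(grid, delta_s_vals, delta_vals):
    """Pooled estimate: 1 - (integral of Delta_S_hat) / (integral of Delta_hat)."""
    return 1.0 - trapezoid(grid, delta_s_vals) / trapezoid(grid, delta_vals)


def sup_statistic(scaled_diffs, sigma_d):
    """Maximum over the grid of |scaled difference| / sigma_D(w).

    scaled_diffs[k] stands for sqrt(n h) * {Delta_S_hat - (1 - tau_hat) * Delta_hat}
    evaluated at the k-th grid point.
    """
    return max(abs(d) / s for d, s in zip(scaled_diffs, sigma_d))


def resampling_p_value(t_obs, t_star):
    """Proportion of resampled statistics at least as large as the observed one."""
    return sum(1 for t in t_star if t >= t_obs) / len(t_star)


# Hypothetical inputs: curves Delta_S(w) = 6 + 3w and Delta(w) = 12 + 3w on [0, 2].
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
tau = tau_hat(grid, [6 + 3 * w for w in grid], [12 + 3 * w for w in grid])
t_obs = sup_statistic([1.0, -3.0, 2.0], [1.0, 1.5, 1.0])
p_value = resampling_p_value(t_obs, [0.5, 1.0, 2.5, 3.0])
print(tau, t_obs, p_value)
```

In practice the supremum would be taken over a fine grid on [w_a, w_b] and the number of resampled statistics would be much larger than the four used here.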

4.2. Trend test

In some settings, it may be of interest to consider an alternative hypothesis in which R_S(w) is monotone increasing or decreasing in w. In such a case, we propose to consider the following test statistic:

S = \sqrt{n} \int_{w_{a}}^{w_{b}} {\hat{R}}_{S} (w) (w - \frac{w_{b} + w_{a}}{2}) d w .

Under the null

S = \sqrt{n} \int_{w_{a}}^{w_{b}} \{{\hat{R}}_{S} (w) - R_{S} (w)\} (w - \frac{w_{b} + w_{a}}{2}) d w,

converges weakly to a mean zero Gaussian distribution, whose variance can be approximated by the conditional variance of

S^{*} = \sqrt{n} \int_{w_{a}}^{w_{b}} D^{*} (w) (w - \frac{w_{b} + w_{a}}{2}) d w,

given the observed data, where the process D*(w) is defined in Web Appendix D. In practice, we may generate a large number of realizations of S* and calculate their empirical variance, denoted by ${\hat{σ}}_{S}^{2}$. The p-value for this “trend” test can then be approximated by $P (| N (0, 1) | \geq | S | / {\hat{σ}}_{S})$.
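A minimal sketch of the trend statistic and its normal-approximation p-value follows. It assumes the estimated curve is available on a grid over [w_a, w_b] and that resampled realizations of S* have been generated elsewhere (the process D*(w) is not constructed here). The trapezoidal rule, the function names, and the toy inputs are assumptions; the two-sided p-value uses the absolute value of the observed statistic.

```python
import math
import statistics


def trapezoid(xs, ys):
    """Trapezoidal approximation to the integral of y over the grid xs."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def trend_statistic(grid, r_hat_vals, n):
    """sqrt(n) times the integral of R_S_hat(w) * (w - midpoint) over the grid."""
    mid = (grid[0] + grid[-1]) / 2.0
    integrand = [r * (w - mid) for w, r in zip(grid, r_hat_vals)]
    return math.sqrt(n) * trapezoid(grid, integrand)


def trend_p_value(s_obs, s_star):
    """Two-sided p-value P(|N(0, 1)| >= |s_obs| / sd), sd from resampled S* values."""
    sd = statistics.stdev(s_star)
    z = abs(s_obs) / sd
    return math.erfc(z / math.sqrt(2.0))


grid = [0.25 + 0.25 * k for k in range(7)]                      # 0.25, ..., 1.75
s_flat = trend_statistic(grid, [0.5] * len(grid), n=3500)       # constant curve
s_decreasing = trend_statistic(grid, [2 / (4 + w) for w in grid], n=3500)
p_example = trend_p_value(3.92, [-2.0, 0.0, 2.0])               # toy resampled values
print(s_flat, s_decreasing, p_example)
```

The weight (w − midpoint) integrates to zero over the interval, so a constant curve gives a statistic of zero, while a decreasing curve gives a negative value and an increasing curve a positive one.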

5. Simulation Study

The goals of this simulation study were to (a) examine the finite sample performance of the proposed estimators ${\hat{R}}_{S} (w)$ and ${\hat{Δ}}_{S} (w)$ in terms of bias, standard error estimation, and coverage of the confidence intervals, (b) evaluate the performance of the proposed confidence band from Section 3.3 in terms of coverage, and (c) examine the properties of the proposed tests for heterogeneity with respect to power and Type 1 error. To this end, we examined 6 simulation settings covering both a continuous and discrete W. For all settings, α = 0.05, and results are summarized over 500 replications; we examine all settings with (n₁, n₀) = (2000, 1500) and (500, 500).

In simulation setting 1, W₁ ~ U(0, 2), W₀ ~ U(0, 2), S₁ ~ N(6, 4), S₀ ~ N(5, 1), Y₁|S₁, W₁ = 3 + 6S₁ + 4W₁ + N(0, 9), and Y₀|S₀, W₀ = 2 + 5S₀ + W₀ + N(0, 9) such that S_g ⊥ W_g, and Y_g depends on both S_g and W_g where the association differs by treatment group. In this setting, Δ(w) = 12 + 3w, Δ_S(w) = 6 + 3w and R_S(w) = 2/(4 + w).

In simulation setting 2, W₁ ~ U(0, 2), W₀ ~ U(0, 2), S₁ ~ N(6, 4) + W₁, S₀ ~ N(5, 1) + W₀, Y₁|S₁, W₁ = 3 + 6S₁ + 4W₁ + N(0, 9), and Y₀|S₀, W₀ = 2 + 5S₀ + W₀ + N(0, 9) such that S_g depends on W_g with the same association in the two groups, and, like setting 1, Y_g depends on both S_g and W_g, where the association differs by treatment group. In this setting, Δ(w) = 12 + 4w, Δ_S(w) = 6 + 4w and R_S(w) = 3/(6 + 2w).

In simulation setting 3, W₁ ~ U(0, 2), W₀ ~ U(0, 2), S₁ ~ N(6, 4) + 2W₁, S₀ ~ N(5, 1) + W₀, Y₁|S₁, W₁ = 3 + 6S₁ + 4W₁ + N(0, 9), and Y₀|S₀, W₀ = 2 + 5S₀ + W₀ + N(0, 9) such that S_g depends on W_g, Y_g depends on both S_g and W_g, and these associations differ by treatment group. In this setting, Δ(w) = 12 + 10w, Δ_S(w) = 6 + 4w and R_S(w) = (3 + 3w)/(6 + 5w).

Finally, in simulation setting 4, data are generated such that there is no heterogeneity, W₁ ~ U(0, 2), W₀ ~ U(0, 2), S₁ ~ N(6, 4), S₀ ~ N(5, 1), Y₁|S₁, W₁ = 3 + 6S₁ + N(0, 9), and Y₀|S₀, W₀ = 2 + 5S₀ + N(0, 9). In this setting, Δ(w) = 12, Δ_S(w) = 6 and R_S(w) = 1/2.

With respect to our bandwidth selection, we let $h_{1} = h_{4} = 2 \times 1.06 \times min (σ_{W_{1}}, I Q R_{1} / 1.34) n_{1}^{- 2 / 5}$ and $h_{2} = h_{0} = 1.06 \times min (σ_{W_{0}}, I Q R_{0} / 1.34) n_{0}^{- 2 / 5}$, where $σ_{W_{j}}$ and IQR_j were the empirical standard deviation and inter-quartile range of W_j, respectively (Scott, 1992); we discuss bandwidth selection further in Section 7. The number of resampling iterations/realizations used to construct the 95% confidence band and conduct hypothesis testing was B = 500.
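The bandwidth rule stated above can be written as a small helper. This is a sketch under assumptions: the function name and the `multiplier` argument are hypothetical, the standard deviation is taken as the sample (n − 1) version, and the quartiles use the default method of Python's `statistics.quantiles`; the source does not specify either convention.

```python
import statistics


def rule_of_thumb_bandwidth(values, multiplier=1.0):
    """multiplier * 1.06 * min(sd, IQR / 1.34) * n^(-2/5) for the covariate values."""
    n = len(values)
    sd = statistics.stdev(values)                    # sample standard deviation
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return multiplier * 1.06 * min(sd, iqr / 1.34) * n ** (-2.0 / 5.0)


# Toy covariate values (hypothetical): the integers 1, ..., 100.
values = [float(i) for i in range(1, 101)]
h_control = rule_of_thumb_bandwidth(values)                  # form used for h2 = h0
h_treated = rule_of_thumb_bandwidth(values, multiplier=2.0)  # form used for h1 = h4
print(h_control, h_treated)
```

The exponent −2/5 lies inside the undersmoothing range (1/5, 1/2) required in Section 3.2, unlike the usual −1/5 rate of the standard rule of thumb.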

Settings 5 and 6 investigate settings with a discrete W. In simulation setting 5, W₁ and W₀ ∈ {0, 1, 2} with equal probability, S₁ ~ N(6, 9) + 2W₁, S₀ ~ N(3.4, 1) + W₀, and Y₁|S₁, W₁ = 3 + 6S₁ + 4W₁ + N(0, 9), and Y₀|S₀, W₀ = 2 + 5S₀ + 2W₀ + N(0, 9). In this setting, Δ(w) = 20 + 9w, Δ_S(w) = 4.4 + 3w and R_S(w) = (15.6 + 6w)/(20 + 9w). Finally, in setting 6, W₁ and W₀ ∈ {0, 1, 2} with equal probability, S₁ ~ N(6, 4), S₀ ~ N(5, 1), and Y₁|S₁, W₁ = 3 + 6S₁ + N(0, 9), and Y₀|S₀, W₀ = 2 + 5S₀ + N(0, 9). In this setting, Δ(w) = 12, Δ_S(w) = 6 and R_S(w) = 1/2 implying no heterogeneity.
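The closed-form truths quoted for the six settings follow from the linear structure of the data-generating models, and can be checked with a short sketch. The parameterisation below (S_g = m_g + c_g W + noise and Y_g = a_g + b_g S_g + d_g W + noise) and the names `truth` and `SETTINGS` are introduced here only for illustration.

```python
def truth(w, a1, b1, d1, m1, c1, a0, b0, d0, m0, c0):
    """Population Delta(w), Delta_S(w), R_S(w) for the linear simulation models."""
    es1 = m1 + c1 * w                      # E(S1 | W = w)
    es0 = m0 + c0 * w                      # E(S0 | W = w)
    ey0 = a0 + b0 * es0 + d0 * w           # E(Y0 | W = w)
    delta = a1 + b1 * es1 + d1 * w - ey0
    # Residual effect: treated-group mean function averaged over the control S distribution.
    delta_s = a1 + b1 * es0 + d1 * w - ey0
    return delta, delta_s, 1.0 - delta_s / delta


SETTINGS = {
    1: dict(a1=3, b1=6, d1=4, m1=6, c1=0, a0=2, b0=5, d0=1, m0=5, c0=0),
    2: dict(a1=3, b1=6, d1=4, m1=6, c1=1, a0=2, b0=5, d0=1, m0=5, c0=1),
    3: dict(a1=3, b1=6, d1=4, m1=6, c1=2, a0=2, b0=5, d0=1, m0=5, c0=1),
    4: dict(a1=3, b1=6, d1=0, m1=6, c1=0, a0=2, b0=5, d0=0, m0=5, c0=0),
    5: dict(a1=3, b1=6, d1=4, m1=6, c1=2, a0=2, b0=5, d0=2, m0=3.4, c0=1),
    6: dict(a1=3, b1=6, d1=0, m1=6, c1=0, a0=2, b0=5, d0=0, m0=5, c0=0),
}

for setting, params in SETTINGS.items():
    delta, delta_s, r_s = truth(1.0, **params)
    print(setting, round(delta, 3), round(delta_s, 3), round(r_s, 3))
```

Evaluating `truth` at the values of w used in Tables 1 and 3 reproduces the "Truth" rows reported there, for example Δ_S(0.4) = 7.2 and R_S(0.4) ≈ 0.455 in setting 1.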

Table 1 summarizes the performance of the proposed estimation method for Δ_S(w) and R_S(w) at w = 0.4, 0.8, 1.2 and 1.6, representing the 20%, 40%, 60%, and 80% quantiles of W_j, j = 0, 1, respectively, in settings 1, 2, 3 and 4 over 500 replications when n₁ = 2000 and n₀ = 1500. Specifically, Table 1 includes the empirical bias, the empirical standard error of the estimators, average standard error estimates, and empirical coverage level of the Wald-type confidence intervals. Figure 1 also illustrates the observed performance of the point estimate and standard error estimates from setting 1. Figure 2 illustrates the pointwise confidence intervals and confidence band for R_S(w) for a single iteration from setting 1. These results show good performance in terms of small bias, standard error estimates close to their empirical counterparts, and coverage levels close to the nominal level.

Table 1:

Simulation results for estimation of Δ_S(w) and R_S(w) in Settings 1, 2, 3, and 4 when n₁ = 2000 and n₀ = 1500; ESE = empirical standard error, ASE = average standard error; Coverage = coverage of the 95% confidence intervals

Setting 1
	Δ_S(w)				R_S(w)
w	0.400	0.800	1.200	1.600	0.400	0.800	1.200	1.600
Truth	7.200	8.400	9.600	10.800	0.455	0.417	0.385	0.357
Estimate	7.294	8.476	9.664	10.872	0.443	0.408	0.377	0.351
Bias	0.094	0.076	0.064	0.072	−0.011	−0.009	−0.007	−0.006
ESE	0.423	0.451	0.438	0.413	0.051	0.047	0.046	0.044
ASE	0.430	0.430	0.430	0.431	0.051	0.048	0.046	0.044
Coverage	0.952	0.934	0.938	0.954	0.960	0.966	0.954	0.946
Setting 2
	Δ_S(w)				R_S(w)
w	0.400	0.800	1.200	1.600	0.400	0.800	1.200	1.600
Truth	7.600	9.200	10.800	12.400	0.441	0.395	0.357	0.326
Estimate	7.695	9.277	10.866	12.474	0.430	0.386	0.350	0.320
Bias	0.095	0.077	0.066	0.074	−0.011	−0.009	−0.007	−0.006
ESE	0.422	0.451	0.437	0.413	0.050	0.045	0.044	0.041
ASE	0.430	0.430	0.431	0.431	0.050	0.047	0.044	0.041
Coverage	0.950	0.934	0.932	0.952	0.958	0.968	0.950	0.942
Setting 3
	Δ_S(w)				R_S(w)
w	0.400	0.800	1.200	1.600	0.400	0.800	1.200	1.600
Truth	7.600	9.200	10.800	12.400	0.525	0.540	0.550	0.557
Estimate	7.728	9.348	10.960	12.627	0.515	0.531	0.542	0.549
Bias	0.128	0.148	0.160	0.227	−0.010	−0.009	−0.008	−0.008
ESE	0.438	0.473	0.498	0.529	0.039	0.030	0.025	0.023
ASE	0.439	0.452	0.465	0.481	0.038	0.030	0.025	0.022
Coverage	0.944	0.918	0.916	0.896	0.948	0.942	0.948	0.928
Setting 4 (no heterogeneity)
	Δ_S(w)				R_S(w)
w	0.400	0.800	1.200	1.600	0.400	0.800	1.200	1.600
Truth	6.000	6.000	6.000	6.000	0.500	0.500	0.500	0.500
Estimate	6.093	6.075	6.064	6.073	0.488	0.490	0.491	0.491
Bias	0.093	0.075	0.064	0.073	−0.012	−0.010	−0.009	−0.009
ESE	0.423	0.452	0.436	0.413	0.053	0.051	0.053	0.052
ASE	0.430	0.430	0.430	0.431	0.053	0.052	0.052	0.052
Coverage	0.952	0.934	0.944	0.954	0.960	0.968	0.960	0.944

Figure 1: Estimation in Setting 1 summarized over 500 replications when n₁ = 2000 and n₀ = 1500: (a) estimate vs. truth for Δ_S(w), (b) average standard error (ASE) vs. empirical standard error (ESE) for Δ_S(w), (c) estimate vs. truth for R_S(w), and (d) ASE vs. ESE for R_S(w)

Figure 2: Pointwise confidence intervals and confidence band for R_S(w) for a single iteration from Setting 1 when n₁ = 2000 and n₀ = 1500

Table 2 presents the empirical power of the two proposed tests (the omnibus test and the trend test) for heterogeneity in settings 1, 2, and 3, and their empirical type 1 error in setting 4, where the null hypothesis is true, when n₁ = 2000 and n₀ = 1500. The type 1 error rate was close to the nominal 0.05 level in setting 4 (0.046 for the omnibus test and 0.058 for the trend test). The power of the trend test was higher than that of the omnibus test, as expected. Table 2 also presents the empirical coverage level of the 95% confidence band for R_S(w), w ∈ [0.25, 1.75]. These results show that the empirical coverage level of the confidence band was satisfactory in all four settings.

Table 2:

Power/type 1 error and confidence band coverage in Settings 1, 2, 3, and 4 when n₁ = 2000 and n₀ = 1500

	Omnibus test	Trend test	Confidence Band Coverage for R_S(w), w ∈ [0.25, 1.75]
Setting 1 (Power)	0.282	0.648	0.952
Setting 2 (Power)	0.392	0.806	0.950
Setting 3 (Power)	0.138	0.270	0.926
Setting 4 (Type 1 error)	0.046	0.058	0.944

Table 3 summarizes estimation performance in the discrete case, for settings 5 and 6, when n₁ = 2000 and n₀ = 1500. Similar to settings 1 through 4, the resulting bias is small, standard error estimates are close to their empirical counterparts, and coverage levels are close to the nominal level. The power of the proposed test in setting 5 was 0.784; the type 1 error rate of the proposed test in setting 6 was 0.042, again close to α = 0.05.

Table 3:

Simulation results for estimation of Δ_S(w) and R_S(w) in Settings 5 and 6 where W is discrete when n₁ = 2000 and n₀ = 1500; ESE = empirical standard error, ASE = average standard error; Coverage = coverage of the 95% confidence intervals

Setting 5
	Δ_S(w)			R_S(w)
w	0	1	2	0	1	2
Truth	4.400	7.400	10.400	0.780	0.745	0.726
Estimate	4.460	7.482	10.531	0.777	0.742	0.723
Bias	−0.060	−0.082	−0.131	0.003	0.003	0.004
ESE	0.249	0.294	0.347	0.014	0.012	0.011
ASE	0.254	0.289	0.352	0.014	0.011	0.010
Coverage	0.944	0.940	0.930	0.964	0.940	0.920
Setting 6 (no heterogeneity)
	Δ_S(w)			R_S(w)
w	0	1	2	0	1	2
Truth	6.000	6.000	6.000	0.500	0.500	0.500
Estimate	6.033	6.021	6.029	0.496	0.497	0.496
Bias	−0.033	−0.021	−0.029	0.004	0.003	0.004
ESE	0.211	0.216	0.207	0.025	0.025	0.027
ASE	0.211	0.212	0.211	0.026	0.026	0.026
Coverage	0.950	0.944	0.954	0.964	0.964	0.942


Simulation results for all settings with a smaller sample size n₁ = n₀ = 500 are provided in Web Appendix F. Overall, the results from this simulation study illustrate good performance of the proposed estimation and testing procedures in finite samples.

6. Application

We use our proposed procedures to examine potential heterogeneity in the utility of a surrogate marker in the AIDS Clinical Trials Group 320 Study (Hammer et al., 1997). This study was a randomized, double-blind, placebo-controlled trial that compared a three-drug regimen with a two-drug regimen in HIV-infected patients with a CD4 cell count of 200 or less per cubic millimeter and at least three months of prior zidovudine therapy. Results showed better performance for the three-drug regimen with respect to progression to AIDS, death, change in CD4, and change in plasma HIV-1 RNA.

Our primary outcome of interest is the change in plasma HIV-1 RNA from baseline to 24 weeks, and the surrogate marker of interest is the change in CD4 cell count from baseline to 24 weeks. CD4 is of interest as a surrogate marker here because RNA is relatively expensive to measure (Calmy et al., 2007). Our analytic sample included 418 individuals randomized to the three-drug regimen group and 412 individuals randomized to the two-drug regimen group. The average change in log RNA from baseline to 24 weeks was −2.15 and −0.55 (log₁₀ copies/ml) in the three-drug group and the two-drug group, respectively, resulting in a treatment effect of −1.60 with a p-value < 0.0001 using a two-sample t-test. The estimated residual treatment effect was $\hat{\Delta}_{S} = -0.93$, and the proportion of treatment effect explained was $\hat{R}_{S} = 41.5\%$.

We first examine the extent to which the proportion of treatment effect explained by change in CD4 count depends on the baseline CD4 count. Both the omnibus test and the trend test reject the null hypothesis with p < 0.001, providing evidence of significant heterogeneity in the utility of the change in CD4 count as a surrogate for change in RNA with respect to baseline CD4. Figure 3 shows the estimates of Δ(w), Δ_S(w) and R_S(w) as a function of baseline CD4, as well as pointwise confidence intervals for each and the confidence band for R_S(w). These estimates reflect a decrease in the proportion of treatment effect explained with increasing baseline CD4, with R_S(w) ranging from 0.5 to 0.7 for lower baseline CD4 counts and from 0.1 to 0.2 for higher baseline CD4 counts.

Figure 3: Estimates of Δ(w), Δ_S(w) and R_S(w) from the AIDS clinical trial with baseline CD4 as the baseline covariate of interest, change in log RNA from baseline to 24 weeks as the primary outcome, and change in CD4 from baseline to 24 weeks as the surrogate marker

Second, we examine race/ethnicity (White vs. non-White) as the discrete baseline covariate of interest, applying the methods described in Web Appendix B. The test for heterogeneity results in a p-value of 0.23, indicating no significant evidence of heterogeneity in the utility of the surrogate with respect to race/ethnicity. For illustrative purposes only (see Web Appendix A), we report the estimates within each subgroup. Among White participants, $\hat{\Delta}(\text{White}) = -1.83$ (95% CI: −2.05, −1.61), $\hat{\Delta}_{S}(\text{White}) = -1.14$ (95% CI: −1.43, −0.85) and $\hat{R}_{S}(\text{White}) = 0.38$ (95% CI: 0.25, 0.51). Among non-White participants, $\hat{\Delta}(\text{non-White}) = -1.31$ (95% CI: −1.53, −1.09), $\hat{\Delta}_{S}(\text{non-White}) = -0.66$ (95% CI: −0.92, −0.40) and $\hat{R}_{S}(\text{non-White}) = 0.50$ (95% CI: 0.33, 0.66), suggesting that the surrogate strength is weaker among White participants than among non-White participants.
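
The subgroup values above can be checked arithmetically. The sketch below assumes that the proportion explained is one minus the ratio of the residual treatment effect to the overall treatment effect, R_S = 1 − Δ_S/Δ, which is consistent with the reported numbers. The function name is illustrative only, and the code reproduces point estimates, not the confidence intervals.

```python
def proportion_explained(delta: float, delta_s: float) -> float:
    """Return R_S = 1 - delta_s / delta (assumed definition of proportion explained)."""
    if delta == 0:
        raise ValueError("The overall treatment effect must be nonzero.")
    return 1.0 - delta_s / delta


# Point estimates reported in the text for the two race/ethnicity subgroups.
estimates = {
    "White": {"delta": -1.83, "delta_s": -1.14},
    "non-White": {"delta": -1.31, "delta_s": -0.66},
}

r_s = {
    group: proportion_explained(est["delta"], est["delta_s"])
    for group, est in estimates.items()
}

for group, value in r_s.items():
    print(f"{group}: R_S = {value:.2f}")
```

Rounded to two decimals, these ratios match the reported estimates of 0.38 and 0.50.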

7. Discussion

In this paper, we proposed an approach and estimation procedures to examine potential heterogeneity in the utility of the surrogate marker with respect to a baseline covariate. In addition, we developed testing procedures, both an omnibus test and a trend-based test, to test for evidence of heterogeneity. Our simulation study demonstrated satisfactory finite sample performance of our proposed methods. Our exploratory analysis of an AIDS clinical trial illustrated the use of the procedures with both a discrete baseline covariate, race/ethnicity, and a continuous baseline covariate, CD4 cell count. An R package implementing the methods proposed here, named hetsurr, is available on GitHub (see Web Appendix G).

The presence of heterogeneity has important implications for use of the surrogate marker in a future study. First, the heterogeneity information provided by our proposed methods can be used to identify a region of interest, i.e., a subset of values of W for which the surrogate is especially strong. This identified region could be used to inform future trial design/recruitment or to inform further investigation into the treatment mechanisms (see Web Appendix A). Second, when there is evidence of heterogeneity, use of that surrogate marker in a new study population that has a different participant mix than the original study (used to evaluate the utility of the surrogate marker) may lead to inaccurate conclusions about the treatment effect if the heterogeneity is not taken into account. For example, Parast et al. (2019) proposed an approach to test for a treatment effect earlier using a surrogate marker; however, we expect that if there is heterogeneity, this testing procedure may not perform well. If the surrogate is going to be used to test for a treatment effect, certain transportability assumptions are generally made about the new study compared to the existing study (Price et al., 2018; Wang et al., 2020), and these assumptions will likely not hold in the presence of surrogacy heterogeneity. Future work on the implications of heterogeneity with respect to using the surrogate as a replacement of the primary outcome in future studies would be useful.
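
A rough numerical illustration of the second point follows, under strong simplifying assumptions that are not part of the proposed methodology: W takes two values, the subgroup-specific effects Δ(w) and Δ_S(w) carry over unchanged to the new study, and a pooled proportion explained is formed as one minus the ratio of mix-weighted averages. All numbers and names are hypothetical.

```python
from typing import Sequence


def pooled_proportion_explained(
    weights: Sequence[float],
    delta: Sequence[float],
    delta_s: Sequence[float],
) -> float:
    """Pooled R_S as 1 - (weighted mean of delta_s) / (weighted mean of delta).

    This weighted-average pooling is a simplification used for illustration only.
    """
    total = sum(weights)
    pooled_delta = sum(p * d for p, d in zip(weights, delta)) / total
    pooled_delta_s = sum(p * d for p, d in zip(weights, delta_s)) / total
    return 1.0 - pooled_delta_s / pooled_delta


# Hypothetical subgroup effects: R_S(w) is 0.7 in subgroup 0 and 0.2 in subgroup 1.
delta = [10.0, 10.0]
delta_s = [3.0, 8.0]

original_mix = [0.8, 0.2]  # hypothetical original study: mostly subgroup 0
new_mix = [0.2, 0.8]       # hypothetical new study: mostly subgroup 1

r_original = pooled_proportion_explained(original_mix, delta, delta_s)
r_new = pooled_proportion_explained(new_mix, delta, delta_s)

print(f"Pooled R_S, original mix: {r_original:.2f}")
print(f"Pooled R_S, new mix:      {r_new:.2f}")
```

In this toy setting the pooled proportion explained drops from 0.6 to 0.3 solely because the participant mix changes, even though the subgroup-specific surrogate strength is identical in both studies.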

In practice, we recommend that the process of examining heterogeneity in the utility of the surrogate marker follow guidance for subgroup analyses in clinical trials. That is, subgroups to be examined should ideally be pre-specified, and the interpretation of surrogate strength for certain subgroups should be considered in light of all subgroups that are tested, i.e., after appropriate multiple testing adjustment (Wang et al., 2007); see Web Appendix A for further discussion.
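
As one possible form of such an adjustment, the sketch below applies Holm's step-down procedure to a set of heterogeneity-test p-values, one per baseline covariate examined. Holm's method is used here only as a common choice; the paper does not prescribe a specific adjustment, and the p-values and covariate labels are hypothetical.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted p-values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values from heterogeneity tests on three pre-specified covariates.
covariates = ["covariate A", "covariate B", "covariate C"]
raw_p = [0.001, 0.23, 0.04]
adj_p = holm_adjust(raw_p)

for name, raw, adj in zip(covariates, raw_p, adj_p):
    print(f"{name}: raw p = {raw:.3f}, Holm-adjusted p = {adj:.3f}")
```

With these hypothetical inputs, the third covariate would be declared heterogeneous at the 0.05 level before adjustment but not after it, which is the kind of change in interpretation the recommendation is meant to guard against.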

Our proposed methods have some limitations. First, it is difficult or even impossible to verify some of the assumptions enumerated in (C1)-(C5). Potential violations of these assumptions should be carefully considered; discussions with clinical experts about the validity of these assumptions may shed light on whether they are reasonable in a particular clinical setting (see Web Appendix A). In addition, recent work suggesting numerical approaches to examine sensitivity to violations of such assumptions may be useful (Elliott et al., 2015). Second, we require the selection of several bandwidth parameters, and there are many options one could choose from (see Web Appendix G). Results may be sensitive to the bandwidth selected, especially with a moderate sample size. Third, in this paper, we focus on a surrogate that is either continuous or discrete, a primary outcome that is either continuous or binary, and a treatment effect measured by the mean difference. Our methods cannot easily accommodate a setting where the surrogate and/or primary outcome are time-to-event outcomes subject to censoring. Lastly, while the nonparametric two-dimensional smoothing approach is robust in the sense that no stringent model assumptions are required, the method does require a relatively large sample size. In settings with smaller sample sizes, a parametric or semiparametric version of the framework presented here, such as the approach described in Remark 3 of Section 3.2, could be considered, though such an approach would rely strongly on the parametric specifications.

Supplementary Material

NIHMS1756809-supplement-supinfo.pdf (1.6 MB, pdf)

Acknowledgements

Support for this research was provided by National Institutes of Health grant R01DK118354.

We are grateful to the AIDS Clinical Trials Group (ACTG) for providing the AIDS data.

Footnotes

Supporting Information

Web Appendices referenced in Sections 3, 4, 5, and 7 are available in the Supplementary Materials. In addition, a zip file containing code to replicate the simulation study and AIDS example is available with this paper at the Biometrics website on Wiley Online Library.

Data Availability Statement

The data from the ACTG 320 study used in this paper are publicly available upon request from the AIDS Clinical Trials Group: https://actgnetwork.org/submit-a-proposal/.

References

Agyemang E, Magaret AS, Selke S, Johnston C, Corey L, and Wald A (2018). Herpes simplex virus shedding rate: surrogate outcome for genital herpes recurrence frequency and lesion rates, and phase 2 clinical trials end point for evaluating efficacy of antivirals. The Journal of Infectious Diseases 218, 1691–1699.
Burzykowski T, Molenberghs G, and Buyse M (2005). The evaluation of surrogate endpoints. Springer.
Burzykowski T, Molenberghs G, Buyse M, Geys H, and Renard D (2001). Validation of surrogate end points in multiple randomized clinical trials with failure time end points. Journal of the Royal Statistical Society: Series C (Applied Statistics) 50, 405–422.
Calmy A, Ford N, Hirschel B, Reynolds SJ, Lynen L, Goemaere E, De La Vega FG, Perrin L, and Rodriguez W (2007). HIV viral load monitoring in resource-limited regions: optional or necessary? Clinical Infectious Diseases 44, 128–134.
Cohen RM and Lindsell CJ (2012). When the blood glucose and the HbA1c don’t match: turning uncertainty into opportunity. Diabetes Care 35, 2421–2423.
Conlon AS, Taylor JM, and Elliott MR (2014). Surrogacy assessment using principal stratification when surrogate and outcome measures are multivariate normal. Biostatistics 15, 266–283.
Crump RK, Hotz VJ, Imbens GW, and Mitnik OA (2008). Nonparametric tests for treatment effect heterogeneity. The Review of Economics and Statistics 90, 389–405.
Daniels MJ and Hughes MD (1997). Meta-analysis for the evaluation of potential surrogate markers. Statistics in Medicine 16, 1965–1982.
Elliott MR, Conlon AS, Li Y, Kaciroti N, and Taylor JM (2015). Surrogacy marker paradox measures in meta-analytic settings. Biostatistics 16, 400–412.
Freedman LS, Graubard BI, and Schatzkin A (1992). Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine 11, 167–178.
Gilbert PB and Hudgens MG (2008). Evaluating candidate principal surrogate endpoints. Biometrics 64, 1146–1154.
Hammer SM, Squires KE, Hughes MD, Grimes JM, Demeter LM, Currier JS, Eron JJ Jr, Feinberg JE, Balfour HH Jr, Deyton LR, et al. (1997). A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less. New England Journal of Medicine 337, 725–733.
Huang Y and Gilbert PB (2011). Comparing biomarkers as principal surrogate endpoints. Biometrics 67, 1442–1451.
Inker LA, Mondal H, Greene T, et al. (2016). Early change in urine protein as a surrogate end point in studies of IgA nephropathy: an individual-patient meta-analysis. American Journal of Kidney Diseases 68, 392–401.
Joffe MM and Greene T (2009). Related causal frameworks for surrogate outcomes. Biometrics 65, 530–538.
Lin D, Fischl MA, and Schoenfeld D (1993). Evaluating the role of CD4-lymphocyte counts as surrogate endpoints in human immunodeficiency virus clinical trials. Statistics in Medicine 12, 835–842.
Lin D, Fleming T, De Gruttola V, et al. (1997). Estimating the proportion of treatment effect explained by a surrogate marker. Statistics in Medicine 16, 1515–1527.
Parast L, Cai T, and Tian L (2017). Evaluating surrogate marker information using censored data. Statistics in Medicine 36, 1767–1782.
Parast L, Cai T, and Tian L (2019). Using a surrogate marker for early testing of a treatment effect. Biometrics 75, 1253–1263.
Parast L, McDermott MM, and Tian L (2016). Robust estimation of the proportion of treatment effect explained by surrogate marker information. Statistics in Medicine 35, 1637–1653.
Prentice RL (1989). Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine 8, 431–440.
Price BL, Gilbert PB, and van der Laan MJ (2018). Estimation of the optimal surrogate based on a randomized trial. Biometrics 74, 1271–1281.
Renard D, Geys H, Molenberghs G, Burzykowski T, and Buyse M (2002). Validation of surrogate endpoints in multiple randomized clinical trials with discrete outcomes. Biometrical Journal 44, 921–935.
Scott D (1992). Multivariate density estimation. Wiley, New York.
Spieker AJ and Huang Y (2017). A method to address between-subject heterogeneity for identification of principal surrogate markers in repeated low-dose challenge HIV vaccine studies. Statistics in Medicine 36, 4071–4080.
Sprenger T, Kappos L, Radue E-W, Gaetano L, Mueller-Lenke N, Wuerfel J, Poole EM, and Cavalier S (2020). Association of brain volume loss and long-term disability outcomes in patients with multiple sclerosis treated with teriflunomide. Multiple Sclerosis Journal 26, 1207–1216.
Taylor JM, Wang Y, and Thiébaut R (2005). Counterfactual links to the proportion of treatment effect explained by a surrogate marker. Biometrics 61, 1102–1111.
VanderWeele TJ (2013). Surrogate measures and consistent surrogates. Biometrics 69, 561–565.
Wager S and Athey S (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113, 1228–1242.
Wang R, Lagakos SW, Ware JH, Hunter DJ, and Drazen JM (2007). Statistics in medicine—reporting of subgroup analyses in clinical trials. New England Journal of Medicine 357, 2189–2194.
Wang X, Parast L, Tian L, and Cai T (2020). Model-free approach to quantifying the proportion of treatment effect explained by a surrogate marker. Biometrika 107, 107–122.
Wang Y and Taylor JM (2002). A measure of the proportion of treatment effect explained by a surrogate marker. Biometrics 58, 803–812.
Wang-Lopez Q, Chalabi N, Abrial C, Radosevic-Robin N, Durando X, Mouret-Reynier M-A, Benmammar K-E, Kullab S, et al. (2015). Can pathologic complete response (pCR) be used as a surrogate marker of survival after neoadjuvant therapy for breast cancer? Critical Reviews in Oncology/Hematology 95, 88–104.
Willke RJ, Zheng Z, Subedi P, Althin R, and Mullins CD (2012). From concepts, theory, and evidence of heterogeneity of treatment effects to methodological approaches: a primer. BMC Medical Research Methodology 12, 185.
