Biometrics. 2024 Feb 22;80(1):ujad035. doi: 10.1093/biomtc/ujad035

A rank-based approach to evaluate a surrogate marker in a small sample setting

Layla Parast 1, Tianxi Cai 2, Lu Tian 3
PMCID: PMC10883071  PMID: 38386359

ABSTRACT

In clinical studies of chronic diseases, the effectiveness of an intervention is often assessed using “high cost” outcomes that require long-term patient follow-up and/or are invasive to obtain. While much progress has been made in the development of statistical methods to identify surrogate markers, that is, measurements that could replace such costly outcomes, they are generally not applicable to studies with a small sample size. These methods either rely on nonparametric smoothing which requires a relatively large sample size or rely on strict model assumptions that are unlikely to hold in practice and empirically difficult to verify with a small sample size. In this paper, we develop a novel rank-based nonparametric approach to evaluate a surrogate marker in a small sample size setting. The method developed in this paper is motivated by a small study of children with nonalcoholic fatty liver disease (NAFLD), a diagnosis for a range of liver conditions in individuals without significant history of alcohol intake. Specifically, we examine whether change in alanine aminotransferase (ALT; measured in blood) is a surrogate marker for change in NAFLD activity score (obtained by biopsy) in a trial, which compared Vitamin E (Inline graphic) versus placebo (Inline graphic) among children with NAFLD.

Keywords: randomized clinical trial, rank test, small sample size, surrogate marker evaluation

1. INTRODUCTION

The health and economic burden from chronic diseases highlights the importance of identifying effective strategies for the prevention, treatment, and management of disease. Studies aimed at evaluating the effectiveness of such strategies often focus on outcomes that either require long-term follow-up of patients (for example, time to dementia onset) or are invasive (for example, derived from a biopsy) (Lindström et al., 2006; Li et al., 2008). The identification and use of surrogate markers in such studies have the potential to improve our ability to make decisions about the effect of a treatment or intervention. When the surrogate marker can be obtained earlier than the occurrence of the clinical event of interest, such as dementia onset, use of the surrogate to test for a treatment effect would allow researchers to draw conclusions regarding the treatment effect with less required follow-up time (Wittes et al., 1989; Parast et al., 2019; Wang et al., 2021). When the surrogate marker is a measurement that is less invasive or can be obtained with less cost or less burden, use of the surrogate has the potential to increase study compliance and participation, reduce patient burden, and/or decrease trial costs.

Over the past 4 decades, an incredible amount of progress has been made in the development of statistical methods to identify surrogate markers. In Prentice’s seminal paper on surrogate markers, he proposed a definition for a valid surrogate marker, which required that a test for a treatment effect on the surrogate marker also be a valid test for treatment effect on the primary outcome of interest (Prentice, 1989). This definition led to statistical methods that were developed to identify useful surrogate markers as those that “capture” a large proportion of the treatment effect on the primary outcome (Freedman et al., 1992; Wang and Taylor, 2002; Parast et al., 2016). While this proportion is commonly used in practice given its ease of interpretation, several other rigorous quantities have also been developed to assess the value of a surrogate marker, including the relative effect (RE) and adjusted association in the meta-analytic framework, and average causal necessity, average causal sufficiency, and the causal effect predictiveness surface in a principal stratification framework; see Elliott (2023) for a recent review.

Though many robust and rigorous methods are available to evaluate a surrogate marker, available methods are generally not able to handle small sample size studies. For example, while previously proposed nonparametric methods are ideal in that they require no restrictive model specification, the tradeoff is that a relatively large sample size is needed to apply these methods because they rely on, for example, kernel smoothing, which is known to perform poorly when the sample size is small due to slower convergence of estimates (Rosenblatt, 1969; Fan and Gijbels, 1992; Dabrowska, 1987). The alternative methods that do not use kernel smoothing require specification of various models that are unlikely to hold in practice and the adequacy of these models cannot be rigorously assessed when the sample size is small (Parast et al., 2016; Conlon et al., 2014). To our knowledge, there is currently no robust method to evaluate a surrogate marker, when the sample size is small. Furthermore, it is often the case that it is exactly these small studies where one is trying to identify a surrogate, for example, in small clinical trials of rare diseases, trials where a large sample is infeasible or unethical, and/or trials in a pediatric population with chronic disease (Dunoyer, 2011; Miyamoto and Kakkis, 2011).

Motivated by this open problem, we develop a novel nonparametric rank-based approach to evaluate a surrogate marker in a small sample setting. Specifically, we were motivated by a small study of children with nonalcoholic fatty liver disease (NAFLD), which is the most common cause of chronic liver disease in children in the US and ultimately results in advanced fibrosis, cirrhosis, and hepatocellular carcinoma (Nobili et al., 2019). This study, the Treatment of Nonalcoholic Fatty Liver Disease in Children (TONIC) trial, randomized children with NAFLD to Vitamin E, metformin, or placebo, and followed patients for 96 weeks. While the primary outcome in the study was sustained reduction in alanine aminotransferase (ALT; measured in blood), the study team states in Lavine et al. (2011) that in fact, “improvement in histology is a more desirable primary outcome measure than improvement in ALT” but they note that due to the lack of prior histology-based NAFLD trials in children, they could not calculate needed sample size for the trial using histology as the primary outcome. Therefore, they used ALT as a surrogate for histology, though they also measured histology, quantified via the NAFLD activity score, as a secondary outcome. Trial results showed no significant treatment effect of Vitamin E (Inline graphic) compared to placebo (Inline graphic) on sustained reduction in ALT, but did show a significant treatment effect on the change in NAFLD activity score (Lavine et al., 2011). This trial highlighted the need to rigorously question whether perhaps ALT is in fact not a sufficient surrogate marker for the NAFLD activity score to better inform future trial design.

The remainder of the article is organized as follows: Section 2 describes our proposed testing approach, including estimation and inference as well as implications for study design. Section 3 presents our simulation study, which investigates the finite sample properties of our methods and compares them with existing methods. Section 4 investigates ALT as a surrogate in the pediatric trial, and Section 5 offers concluding remarks.

2. METHODS

2.1. Notation and testing procedure

Let $Y$ denote the primary outcome, $S$ denote the surrogate marker, and $G$ denote the treatment indicator, where treatment is randomized and $G \in \{0, 1\}$ (eg, treatment versus control). We use potential outcomes notation, where each person has a set of potential outcomes $\{Y^{(1)}, Y^{(0)}, S^{(1)}, S^{(0)}\}$ where $Y^{(g)}$ and $S^{(g)}$ are the outcome and surrogate value when $G = g$, respectively. The observed data consist of $n_1$ independently identically distributed (i.i.d) copies of $\{Y_{1i}, S_{1i}\}$, $i = 1, \ldots, n_1$, from the $G = 1$ group, and $n_0$ i.i.d copies of $\{Y_{0i}, S_{0i}\}$, $i = 1, \ldots, n_0$, from the $G = 0$ group. Our general setting is illustrated in Figure 1, where there exists a prior study, Study A, where the surrogate marker and primary outcome have both been measured. Our aim is to use Study A to evaluate $S$ as a surrogate marker for $Y$, the primary outcome. The ultimate goal is to be able to then conduct a future Study B where we only measure the surrogate marker, and test for a treatment effect using the surrogate marker. In this paper, we focus on a setting where the sample size is small such that the nonparametric approach of Parast et al. (2016), which utilizes kernel smoothing, to evaluate the surrogacy of $S$ in Study A is not feasible, for example, $n_1$ and $n_0$ are $< 50$. To evaluate $S$ as a surrogate for $Y$ in Study A, we consider a test for a treatment effect based on $Y$ versus a test for a treatment effect based on $S$. Let $U_Y = P(Y^{(1)} > Y^{(0)}) + \frac{1}{2} P(Y^{(1)} = Y^{(0)})$ and $U_S = P(S^{(1)} > S^{(0)}) + \frac{1}{2} P(S^{(1)} = S^{(0)})$, and

FIGURE 1. General setting: Study A and Study B.

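$$\hat{U}_Y = \frac{1}{n_1 n_0} \sum_{i=1}^{n_1} \sum_{j=1}^{n_0} \left\{ I(Y_{1i} > Y_{0j}) + \frac{1}{2} I(Y_{1i} = Y_{0j}) \right\},$$

$$\hat{U}_S = \frac{1}{n_1 n_0} \sum_{i=1}^{n_1} \sum_{j=1}^{n_0} \left\{ I(S_{1i} > S_{0j}) + \frac{1}{2} I(S_{1i} = S_{0j}) \right\},$$

where $I(\cdot)$ denotes the indicator function.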
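The pairwise-comparison estimates of the treatment effect on the primary outcome and on the surrogate can be computed directly from all treated/control pairs. The following is a minimal sketch in Python with NumPy; the function name `mann_whitney_u` and the toy data are hypothetical and are not taken from the SurrogateRank package or from the trial data.

```python
import numpy as np

def mann_whitney_u(treated, control):
    """Estimate P(X1 > X0) + 0.5 * P(X1 = X0) over all treated/control pairs."""
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    diff = treated[:, None] - control[None, :]      # n1 x n0 pairwise differences
    # A win counts 1, a tie counts 1/2, a loss counts 0
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Hypothetical toy data (illustration only)
y1 = [3.1, 2.4, 4.0, 3.3]   # primary outcome, treated group
y0 = [2.0, 2.4, 1.7, 3.0]   # primary outcome, control group
u_y = mann_whitney_u(y1, y0)
print(u_y)  # 14.5 wins-plus-half-ties out of 16 pairs
```

A value near 0.5 indicates no separation between the groups, while values near 1 indicate that treated observations are almost always larger than control observations.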

Note that $\hat{U}_Y$ is the Mann-Whitney U-statistic examining the difference in $Y$ between the 2 treatment groups and similarly for $\hat{U}_S$, and $E(\hat{U}_Y) = U_Y$ and $E(\hat{U}_S) = U_S$ (Mann and Whitney, 1947). Consider the null hypothesis,

$$H_{0,Y}: U_Y = \frac{1}{2},$$

that is, there is no treatment effect on $Y$. To test $H_{0,Y}$ using the surrogate, we focus instead on testing

$$H_{0,S}: U_S = \frac{1}{2},$$

that is, there is no treatment effect on $S$. Note that $U_Y$ is equivalent to the area under the receiver operating characteristic curve considering $Y$ as a “predictor” of the treatment group assignment $G$ and similarly for $U_S$ (Hanley and McNeil, 1982). In the following, we discuss our approach throughout in the context of U-statistics. We show in Web Appendix A that, under conditions (C1)-(C3) listed next,

  (C1) $P(Y^{(g)} > y \mid S^{(g)} = s)$ is monotone increasing in $s$ for any $y$ for $g = 0, 1$

  (C2) $P(S^{(1)} > s) \geq P(S^{(0)} > s)$ for all $s$

  (C3) $P(Y^{(1)} > y \mid S^{(1)} = s) \geq P(Y^{(0)} > y \mid S^{(0)} = s)$ for all $y, s$,

the following holds:

2.1. (1)

Conditions (C1)-(C3) and result (1) are important as they ensure protection from a surrogate paradox situation (VanderWeele, 2013), where the treatment has a positive effect on the surrogate, the surrogate and the primary outcome are positively associated, but the treatment has a negative effect on the primary outcome. We discuss these conditions further in Web Appendix A. Note that we do not claim that $U_Y \geq U_S$ as it is possible that $U_S$ is substantially higher than $U_Y$.

As noted above, the ultimate goal in our setting is to potentially use $S$ as a replacement of $Y$ in the future Study B. That is, we want to make inference about the treatment effect on $Y$, quantified via $U_Y$, using only $S$. Motivated by this, we propose to measure the strength of the surrogate via the simple difference between $U_Y$ and $U_S$:

$$\delta = U_Y - U_S,$$

and investigate whether $\delta$ is bounded by a specified upper bound, $\epsilon$. To make a decision regarding utility of the surrogate marker, we focus on the following noninferiority test:

$$H_0: \delta \geq \epsilon \quad \text{versus} \quad H_1: \delta < \epsilon, \tag{2}$$

where $H_0$ reflects a poor surrogate (inferior) and $H_1$ reflects a valid surrogate (noninferior) (Patterson and Jones, 2017). That is, we hope that $U_S$ is “close enough” to $U_Y$ (or larger than $U_Y$) such that we can be reasonably confident that, in a future study, (1) if we reject $H_{0,S}: U_S = 1/2$ based on $\hat{U}_S$, this would reflect evidence of a treatment effect on $Y$, or (2) if we do not reject $H_{0,S}$ based on $\hat{U}_S$, this would imply that there is no or little treatment effect on $Y$. In Section 2.2, we discuss selection of $\epsilon$. Now, we focus on conducting the noninferiority test with a given $\epsilon$.

Using data from Study A, $U_Y$ and $U_S$ can be estimated by $\hat{U}_Y$ and $\hat{U}_S$, respectively. Based on the theory for U-statistics, the null variance of $\hat{U}_Y$ can be obtained as $(n_1 + n_0 + 1)/(12 n_1 n_0)$, and similarly for $\hat{U}_S$ (Mann and Whitney, 1947). To obtain a confidence interval for $\delta$ we use the variance derivations of DeLong et al. (1988), which can be used for any linear combination of U-statistics. Specifically, DeLong et al. (1988) show that for any contrast $L\mathbf{U}^{T}$, where $\mathbf{U} = (U_Y, U_S)$ and $L$ is a vector of coefficients,

$$\frac{L\hat{\mathbf{U}}^{T} - L\mathbf{U}^{T}}{\left\{ L \left( \frac{1}{n_1} S_{10} + \frac{1}{n_0} S_{01} \right) L^{T} \right\}^{1/2}} \tag{3}$$

has an asymptotically standard normal distribution, where $S_{10}$ and $S_{01}$ are provided in Web Appendix A. Thus, taking $L = (1, -1)$ and $\hat{\delta} = \hat{U}_Y - \hat{U}_S$, we obtain a $(1-\alpha)$100% one sided confidence interval for $\delta$ with upper bound $\hat{\delta} + z_{1-\alpha} \hat{\sigma}_{\delta}$, where $z_{1-\alpha}$ is the $(1-\alpha)$ quantile of the standard normal distribution and $\hat{\sigma}_{\delta} = \{ L ( S_{10}/n_1 + S_{01}/n_0 ) L^{T} \}^{1/2}$, the estimated SE of $\hat{\delta}$. We then have a formal testing procedure for testing (2) as follows:

  1. Calculate $\hat{\delta} = \hat{U}_Y - \hat{U}_S$ based on the observed data.

  2. Obtain a $(1-\alpha)$100% one-sided confidence interval as described above.

  3. Conclude that $S$ is a valid surrogate if the upper bound of the confidence interval is $< \epsilon$.

Note that we avoid referring to rejecting or failing to reject $H_0$; instead, we refer to the upper bound of the confidence interval and the implied conclusion regarding surrogacy. In addition, when the upper bound of the confidence interval is $\geq \epsilon$, it is not necessarily the case that one should state $S$ is not a valid surrogate marker; rather, we do not have sufficient evidence using this nonparametric approach that $S$ is a valid surrogate marker. In Section 3, we examine the finite sample performance of this proposed testing procedure, specifically its power and type 1 error, and compare to existing approaches. In Section 4, we use this approach to analyze the pediatric trial data.
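The three-step procedure can be sketched as follows. This is an illustrative Python implementation under stated assumptions, not the code of the SurrogateRank R package: the covariance of the two U-statistics is computed from the structural components of DeLong et al. (1988), the function names `pair_kernel` and `rank_surrogate_test` are hypothetical, and the simulated data, the sample size of 25 per group, and the threshold of 0.15 are arbitrary choices for illustration.

```python
import numpy as np
from statistics import NormalDist

def pair_kernel(x1, x0):
    """n1 x n0 matrix with entries 1 (x1_i > x0_j), 0.5 (tie), 0 (x1_i < x0_j)."""
    d = np.asarray(x1, dtype=float)[:, None] - np.asarray(x0, dtype=float)[None, :]
    return (d > 0) + 0.5 * (d == 0)

def rank_surrogate_test(y1, s1, y0, s0, epsilon, alpha=0.05):
    """Check whether the one-sided upper bound for delta = U_Y - U_S is below epsilon."""
    n1, n0 = len(y1), len(y0)
    kernels = [pair_kernel(y1, y0), pair_kernel(s1, s0)]
    u_hat = np.array([k.mean() for k in kernels])        # (U_Y hat, U_S hat)
    v10 = np.vstack([k.mean(axis=1) for k in kernels])   # treated-group components, 2 x n1
    v01 = np.vstack([k.mean(axis=0) for k in kernels])   # control-group components, 2 x n0
    cov = np.cov(v10) / n1 + np.cov(v01) / n0            # DeLong-type 2 x 2 covariance
    contrast = np.array([1.0, -1.0])                     # L = (1, -1) gives U_Y - U_S
    delta_hat = float(contrast @ u_hat)
    # Guard against tiny negative values caused by floating-point rounding
    se = float(np.sqrt(max(contrast @ cov @ contrast, 0.0)))
    upper = delta_hat + NormalDist().inv_cdf(1 - alpha) * se
    return {"u_y": float(u_hat[0]), "u_s": float(u_hat[1]), "delta_hat": delta_hat,
            "se": se, "upper": upper, "valid": bool(upper < epsilon)}

# Simulated example (hypothetical data): the outcome is the surrogate plus noise
rng = np.random.default_rng(0)
n = 25
s0 = rng.normal(0.0, 1.0, n); y0 = s0 + rng.normal(0.0, 0.3, n)
s1 = rng.normal(1.5, 1.0, n); y1 = s1 + rng.normal(0.0, 0.3, n)
result = rank_surrogate_test(y1, s1, y0, s0, epsilon=0.15)
print(result)
```

Because both U-statistics are computed on the same participants, the contrast variance accounts for their correlation; a surrogate that tracks the outcome closely yields a small SE for the difference even when each U-statistic alone is variable.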

Remark. Our approach is similar in spirit to the RE proposed by Buyse and Molenberghs (1998), where RE is the ratio of the effect of the treatment on $Y$ relative to the effect of treatment on $S$, estimated parametrically in the case of normally distributed $Y$ and $S$. When this ratio is equal to 1, $S$ is said to be a perfect surrogate. Intuitively, this is similar to our motivation for $\delta$ such that we expect $\delta$ to be 0 when $S$ is a perfect surrogate for $Y$. Both are motivated by the ultimate goal of surrogate marker use in future studies, which is to be able to make inference about the effect of treatment on $Y$ based on the treatment effect on $S$. However, both are also subject to limitations described in Molenberghs et al. (2002), namely that when these quantities are calculated in a single trial, strong assumptions are needed to conclude that any indication of surrogacy is relevant to future studies; we discuss this further in Section 5.

2.2. Selection of $\epsilon$

Choice of $\epsilon$ will likely depend on a number of factors including cost of primary outcome measurement, cost of surrogate marker measurement, cost of individual recruitment, potential follow-up time reduction if the surrogate is measured before the primary outcome, and information that might be available from previous studies. Certainly, one could consider a prespecified $\epsilon$ value such as 0.10; admittedly, there is no obvious single choice for $\epsilon$. Here, we propose an approach to select $\epsilon$ that may be useful if one is willing to select a data-driven and sample size-dependent $\epsilon$. We argue that it is reasonable to select $\epsilon$ according to the ability to detect the treatment effect on the primary outcome via the surrogate marker (observed or hypothesized). Specifically, we want an approach that ensures the power of a test based on $S$, since such a test is what would potentially be used to test for a treatment effect in a future study. Suppose we want $\epsilon$ to be such that we would have $100\beta$% power to detect a treatment effect on $S$. Thus, we are only willing to deem a surrogate as adequate if it provides this level of power. Motivated by this, we first determine the value $u_S^{*}$ such that

$$P\left( \frac{\hat{U}_S - 1/2}{\sigma_0} > z_{1-\alpha/2} \,\middle|\, U_S = u_S^{*} \right) = \beta,$$

where $\beta$ is the desired power for a test for a treatment effect on $S$, if $U_S = u_S^{*}$, and $\sigma_0^2$ denotes the null variance of $\hat{U}_S$ and $\hat{U}_Y$, that is, $\sigma_0^2 = (n_1 + n_0 + 1)/(12 n_1 n_0)$. The power takes this form because one would reject the null hypothesis $H_{0,S}: U_S = 1/2$ if $(\hat{U}_S - 1/2)/\sigma_0 > z_{1-\alpha/2}$. It can be shown that: $u_S^{*} = 1/2 + \sigma_0 \{ z_{1-\alpha/2} + \Phi^{-1}(\beta) \}$, where $\Phi$ is the standard normal distribution function. Therefore, $u_S^{*}$ is the minimum value of $U_S$ such that we would have $100\beta$% power to detect a treatment effect on $S$. Now, we define $\epsilon = u_Y - u_S^{*}$ where $u_Y$ is the hypothesized treatment effect on the primary outcome, which we may be able to approximate by its empirical counterpart, $\hat{U}_Y$, in preliminary data. That is, we are willing to take $S$ as a substitute for $Y$ if $U_S$ is at most $\epsilon$-distance away from $U_Y$. In summary, $\epsilon$ would depend on the hypothesized treatment effect on $Y$, and the desired power of the test based on $S$. The latter, in turn, depends on the sample size and significance level of the test.

We illustrate the selection of $\epsilon$ in Figure 2. The black line in this figure shows the recommended $\epsilon$ to use based on the proposed calculations at various sample sizes, assuming the desired power of the test based on $S$ is 0.70 and Inline graphic. For example, under these assumptions, if the total sample size is 40, then the recommended $\epsilon$ to use is 0.17. The gray line shows parallel calculations except assuming Inline graphic; here, if the total sample size is 40, then the recommended $\epsilon$ value is 0.07. In Web Appendix B, we provide additional figures to illustrate selection of $\epsilon$. In practice, such figures can help one understand what value of $\epsilon$ would be reasonable given their study specifics. While we focus on evaluating the surrogate in our current study, Study A, the fact that $\epsilon$ is defined as a function of the sample size means that one could calculate $\epsilon$ based on the intended sample size and hypothesized $U_Y$ in the future study.
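The sample size-dependent threshold can be sketched numerically. The Python function below is an illustration under assumptions that are not legible in the text: a two-sided normal critical value at level 0.05, the null SD of the scaled Mann-Whitney statistic, and hypothesized treatment effects on the primary outcome of 0.9 and 0.8. These choices were made because they give values close to the 0.17 and 0.07 quoted for a total sample size of 40; the function name `select_epsilon` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def select_epsilon(u_y, n1, n0, power=0.70, alpha=0.05):
    """Return (epsilon, u_s_star) for a hypothesized effect u_y on the primary outcome."""
    z = NormalDist().inv_cdf
    sigma0 = sqrt((n1 + n0 + 1) / (12 * n1 * n0))      # null SD of the scaled U-statistic
    # Smallest surrogate effect detectable with the desired power
    # (two-sided critical value is an assumption of this sketch)
    u_s_star = 0.5 + sigma0 * (z(1 - alpha / 2) + z(power))
    return u_y - u_s_star, u_s_star

eps_a, u_star = select_epsilon(0.9, 20, 20)   # assumed u_Y = 0.9, total sample size 40
eps_b, _ = select_epsilon(0.8, 20, 20)        # assumed u_Y = 0.8, total sample size 40
print(round(eps_a, 2), round(eps_b, 2))
```

Larger samples shrink the null SD, which lowers the minimum detectable surrogate effect and therefore allows a more generous threshold for the same hypothesized effect on the primary outcome.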

FIGURE 2. Illustration of $\epsilon$ selection as a function of the total sample size: the black line shows the recommended $\epsilon$ based on the proposed calculations at various sample sizes, assuming the desired power of the test based on $S$ is 0.70 and assuming Inline graphic; the gray line shows the recommended $\epsilon$ based on the proposed calculations at various sample sizes, assuming the desired power of the test based on $S$ is 0.70 and assuming Inline graphic ($\alpha$-level = 0.05).

2.3. Power to identify a surrogate marker

In this section, we develop an approach to examine the power to identify a valid surrogate marker given a specified design. Specifically, we consider the case where the planned sample size for Study A is fixed (eg, due to budget) and known to be small (eg, Inline graphic in each group). We aim to estimate the power of Study A in determining whether $S$ is a valid surrogate marker for $Y$ (ie, the upper bound of the confidence interval, $\hat{\delta} + z_{1-\alpha} \hat{\sigma}_{\delta}$, is less than $\epsilon$) under some fixed alternative where $U_Y = u_Y$ and $U_S = u_S$. At a fixed sample size Inline graphic, where we assume without loss of generality that Inline graphic, the power at a type 1 error level $\alpha$, defined as $P(\hat{\delta} + z_{1-\alpha} \sigma_{\delta} < \epsilon \mid U_Y = u_Y, U_S = u_S)$, can be expressed as

$$\Phi\left( \frac{\epsilon - (u_Y - u_S)}{\sigma_{\delta}} - z_{1-\alpha} \right). \tag{4}$$

However, this power depends on $\sigma_{\delta}$, the SD of $\hat{\delta}$, which depends on the unknown correlation between $\hat{U}_Y$ and $\hat{U}_S$. If preliminary information about the joint distribution of $(Y, S)$ is available, then one could estimate $\sigma_{\delta}$ accordingly. When no such information is available, we show in Web Appendix C that one may approximate $\sigma_{\delta}$ by Inline graphic when $\delta$ is not substantially far from zero, where $\rho$ is the assumed Spearman’s rank correlation coefficient between $Y$ and $S$. Based on this approximation, we may estimate (4) by

2.3. (5)

We can use (5) to examine the expected power at a fixed Inline graphic with various hypothesized values of Inline graphic and Inline graphic, with a specified value for either Inline graphic or Inline graphic. For example, Figure 3 shows the expected power as a function of Inline graphic with Inline graphic for various fixed alternatives Inline graphic with Inline graphic. In Section 3, we examine the empirical performance of this estimated power.
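A power calculation of this kind can be sketched in Python. The sketch treats the estimated difference as approximately normal and, as an assumption of this example rather than a formula taken from the text, approximates its variance by twice the null variance of a single U-statistic times one minus the assumed rank correlation. The function name `estimated_power` and all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def estimated_power(u_y, u_s, rho, n1, n0, epsilon, alpha=0.05):
    """Approximate probability that the one-sided upper bound for delta falls below epsilon."""
    nd = NormalDist()
    sigma0_sq = (n1 + n0 + 1) / (12 * n1 * n0)        # null variance of one U-statistic
    # Assumed approximation: Var(U_Y hat - U_S hat) ~ 2 * sigma0^2 * (1 - rho), rho < 1
    sigma_delta = sqrt(2 * sigma0_sq * (1 - rho))
    delta = u_y - u_s
    return nd.cdf((epsilon - delta) / sigma_delta - nd.inv_cdf(1 - alpha))

# Hypothetical design: 25 per group, true difference 0.10, threshold 0.17
for rho in (0.60, 0.90, 0.95):
    print(rho, round(estimated_power(0.9, 0.8, rho, 25, 25, 0.17), 3))
```

Under this approximation the power rises quickly with the assumed correlation, so it is worth examining a range of plausible correlation values rather than a single guess.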

FIGURE 3. Estimated power to identify a surrogate marker for various fixed alternatives Inline graphic, as a function of Inline graphic with Inline graphic and Inline graphic ($\alpha$-level = 0.05).

3. SIMULATION STUDY

3.1. Simulation goals and setup

The goals of this simulation study were to (1) examine the finite sample performance of our testing procedure with respect to power and Type 1 error across a variety of settings; (2) compare our approach to existing approaches (details provided below), which admittedly were not developed for the small sample size setting; and (3) compare empirical power versus estimated power calculated as proposed in Section 2.3. Details regarding data generation are included in Web Appendix D. Briefly, Setting 1 was generated such that $S$ was a useless surrogate and thus allowed us to examine Type 1 error. In this setting, $U_Y = 0.793$, $U_S = 0.500$, and $\delta = 0.293$; the true proportion of treatment effect on the primary outcome explained by the treatment effect on the surrogate (PTE), as defined in Parast et al. (2016), was 0. In Setting 2, we purposefully made $S$ an almost perfect surrogate to examine testing in such a setting. Here, $U_Y = 0.926$, $U_S = 0.922$, $\delta = 0.004$, and the true PTE was 0.97. In Setting 3, we purposefully generated $S$ such that it was neither useless nor perfect, but rather, in between. Here, $U_Y = 0.906$, $U_S = 0.797$, $\delta = 0.108$, and the true PTE was 0.57. In Settings 1-3, we simulated from nicely behaved normal distributions. In Setting 4, we generated more complex data to demonstrate the reasonable performance of our proposed approach and the poor performance of the existing model-based approaches. Here, $U_Y = 0.938$, $U_S = 0.831$, $\delta = 0.107$, and the true PTE was 0.68. All simulations were summarized over 1000 replications with Inline graphic and $\epsilon$ selection was based on a desired power of 0.80.

3.2. Simulation results

In each setting, we examined estimation of $U_Y$, $U_S$, and $\delta$, where $\delta$ was our primary focus, in terms of bias and SE estimation (empirical versus average of the estimated SE). In addition, we examined the empirical power (Type 1 error for Setting 1) as the proportion of simulations where the upper bound of the confidence interval for $\delta$ was less than $\epsilon$. Table 1 shows the simulation results for our proposed approach across all settings. These results show reasonably good performance of our estimation procedure for $\delta$ with minimal bias, and average SE (ASE) estimates close to the empirical SE (ESE). Estimation of $U_Y$ and $U_S$ shows minimal bias and similarly good SE estimation. The constructed confidence intervals show slight over-coverage for the true $U_Y$ and $U_S$ in all settings, and under-coverage for $\delta$ in Settings 3 and 4. This information regarding coverage for $\delta$ may be particularly useful if one is considering an alternative choice for $\epsilon$; we discuss this further in Section 5. Testing results show that the test is conservative, which was expected given this is a nonparametric test for a small sample size setting. In Setting 1, where the surrogate was useless, the upper bound of the confidence interval for $\delta$ was greater than $\epsilon$ for all simulation iterations, that is, a Type 1 error of 0. In Setting 2, where the surrogate was almost perfect, the power was 1, that is, the upper bound of the confidence interval for $\delta$ was less than $\epsilon$ for all simulation iterations. For Settings 3 and 4, where the surrogate was neither perfect nor useless, power was 0.525 and 0.688, respectively.

TABLE 1. Simulation results for all settings using the proposed testing procedure for estimation of $U_Y$, $U_S$, and $\delta$ in terms of bias, ESE, ASE, and coverage of the 95% confidence intervals, and the empirical Type 1 error/Power testing $H_0: \delta \geq \epsilon$, with Inline graphic.

| Setting | Quantity | $U_Y$ | $U_S$ | $\delta$ |
|---|---|---|---|---|
| Setting 1 (null) | Truth | 0.793 | 0.500 | 0.293 |
| | Estimate | 0.791 | 0.503 | 0.289 |
| | Bias | 0.002 | -0.003 | 0.004 |
| | ESE | 0.063 | 0.084 | 0.101 |
| | ASE | 0.064 | 0.083 | 0.105 |
| | Coverage | 0.978 | 0.940 | 0.953 |
| | Testing | $\epsilon$ = Inline graphic, Type 1 error = 0 | | |
| Setting 2 | Truth | 0.926 | 0.922 | 0.004 |
| | Estimate | 0.924 | 0.920 | 0.003 |
| | Bias | 0.002 | 0.002 | 0.000 |
| | ESE | 0.036 | 0.037 | 0.010 |
| | ASE | 0.036 | 0.037 | 0.011 |
| | Coverage | 0.989 | 0.988 | 0.954 |
| | Testing | $\epsilon$ = Inline graphic, Power = 1 | | |
| Setting 3 | Truth | 0.906 | 0.797 | 0.108 |
| | Estimate | 0.904 | 0.794 | 0.110 |
| | Bias | 0.002 | 0.003 | -0.002 |
| | ESE | 0.042 | 0.063 | 0.035 |
| | ASE | 0.041 | 0.063 | 0.037 |
| | Coverage | 0.985 | 0.972 | 0.931 |
| | Testing | $\epsilon$ = Inline graphic, Power = 0.525 | | |
| Setting 4 | Truth | 0.938 | 0.831 | 0.107 |
| | Estimate | 0.936 | 0.828 | 0.108 |
| | Bias | 0.002 | 0.003 | -0.001 |
| | ESE | 0.037 | 0.057 | 0.039 |
| | ASE | 0.036 | 0.058 | 0.040 |
| | Coverage | 0.992 | 0.977 | 0.886 |
| | Testing | $\epsilon$ = Inline graphic, Power = 0.688 | | |

In Table 2, we compare the estimated power obtained using (5) to the empirical power for Settings 2-4, where we calculate the estimated power 3 different ways: (1) using $\rho$ = Inline graphic, which is the true rank correlation between $Y$ and $S$, (2) assuming Inline graphic, and (3) assuming Inline graphic. The true correlations in Settings 2, 3, and 4 were 0.98, 0.91, and 0.82, respectively, meaning that when we calculate estimated power assuming Inline graphic, we are under-estimating the correlation, and when we assume Inline graphic, we are under-estimating the correlation for Setting 2, but over-estimating the correlation for Settings 3 and 4. Results show that the estimated power using the true rank correlation is reasonably close to the empirical power from the simulation study. As expected, the estimated power under-estimates the empirical power when the assumed correlation is lower than the true correlation, and over-estimates the empirical power when the assumed correlation is higher than the true correlation. Simulation results showing the performance of existing approaches in Settings 1-4 are provided in Web Appendix D. Overall, our simulation study demonstrates the good finite sample performance of our proposed testing procedure in a small sample setting. Fully reproducible simulation code is publicly available in the SurrogateRank Repository (2023).

TABLE 2. Empirical (Emp) power versus Estimated (Est) power in Settings 2-4 where estimated power is calculated in 3 different ways: (1) using $\rho$ = Inline graphic, which is the true rank correlation between $Y$ and $S$, (2) assuming Inline graphic, and (3) assuming Inline graphic.

| | Emp power | Est power, true $\rho$ | Est power, Inline graphic | Est power, Inline graphic |
|---|---|---|---|---|
| Setting 2 | 1 | 1 | 0.835 | 1 |
| Setting 3 | 0.525 | 0.592 | 0.231 | 0.823 |
| Setting 4 | 0.688 | 0.658 | 0.393 | 0.987 |

4. DATA APPLICATION

We use our proposed method to investigate whether change in ALT from baseline to 96 weeks is a surrogate for change in NAFLD activity score in the TONIC trial described above. The NAFLD score combines information on steatosis (buildup of fat in the liver), hepatocyte ballooning (form of liver cell degeneration associated with cell swelling and enlargement), and lobular inflammation (2 or more inflammatory cells within the lobule), such that higher numbers indicate worse liver function. ALT is an enzyme concentrated primarily in the liver, which increases when there is liver damage. Using our proposed method, Inline graphic and Inline graphic, and the 95% confidence interval was Inline graphic. Assuming a desired power of 0.70, Inline graphic resulting in Inline graphic. Since the upper bound of the confidence interval for $\delta$ is above $\epsilon$, we would conclude that there is not sufficient evidence to support change in ALT as a surrogate marker for change in NAFLD activity score.

For comparison, we also calculated the proportion of treatment effect explained using the Freedman approach (Freedman et al., 1992), the more flexible but still model-based approach of Wang and Taylor (2002), the nonparametric kernel-based approach of Parast et al. (2016), and the nonparametric optimal transformation approach of Wang et al. (2020); point estimates and 95% confidence intervals were Inline graphic, Inline graphic and Inline graphic, and Inline graphic, respectively. All comparison approaches provide confidence intervals with a negative lower bound, with an especially wide confidence interval for the nonparametric kernel-based approach. Additional analytic results are provided in Web Appendix D.

This illustration highlights the need for our proposed approach when the sample size is small. The available methods, first, are not meant to be used in a small sample setting and second, produce results that are difficult to interpret, with negative lower bounds and wide confidence intervals. In contrast, our approach offers unique insight into the use of ALT as a surrogate and specifically, provides evidence suggesting that ALT not be used as a surrogate in future studies in this population. While we do not provide a single number quantification of the strength of the surrogate like the proportion of treatment effect explained measure, this method does provide reasonable and actionable information about the strength of the surrogate that can be used to make decisions about future use of the surrogate.

5. DISCUSSION

Surrogate markers, used appropriately, can bring substantial savings in terms of time and cost in conducting randomized clinical trials. However, available methods to evaluate surrogate markers require either a large sample size or restrictive parametric assumptions, with no robust methods available for studies with a small sample size. This paper develops a statistical methodology to bridge this gap. We proposed a nonparametric rank-based approach to evaluate a surrogate marker in a small sample size setting, developed estimation and inference procedures, proposed an approach to examine power, and demonstrated good finite sample performance in a simulation study. Our method was motivated by a small clinical trial among children with NAFLD; we used our approach to determine whether change in ALT is a valid surrogate marker for change in NAFLD activity score. Our results led to the conclusion that there is not sufficient evidence to support change in ALT as a surrogate marker for change in NAFLD activity score, while available methods provided numerical results that were difficult to interpret. An R package implementing the methods proposed here, named SurrogateRank, is available in the SurrogateRank Repository (2023).

Our choice to focus on the Mann-Whitney rank-based U-statistic to quantify the treatment effect on the primary outcome and surrogate marker has 3 advantages. First, our proposed testing procedure is nonparametric and yields robust inference results in a small sample setting, where complex model checking is nearly impossible. Second, our approach is “scale-free” since the use of ranks puts Inline graphic and Inline graphic on the same scale even when Inline graphic and Inline graphic are not, which is usually the case in practice. This makes our approach invariant to transformations of Inline graphic or Inline graphic, which is a particularly important quality that is absent in available methods. For example, in the TONIC trial application, the nonparametric kernel-based approach of Parast et al. (2016) resulted in an estimate of the proportion of treatment effect explained equal to 0.278; if we instead use the outcome as Inline graphic, this estimate is now 0.482, nearly twice as large. Third, our approach focuses on comparing the entire distribution of the 2 groups with respect to the primary outcome and surrogate marker, while available nonparametric methods focus instead on the mean; this is particularly advantageous when the distribution of the data is such that the mean fails to be a good summary of the outcome. These advantages highlight the utility of our approach, not just for small samples but also for large samples. Here, we focus on the small sample size problem, but there is nothing about our method that precludes it from being useful in a large sample setting.
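The “scale-free” property can be checked numerically. The sketch below is illustrative only and is not the SurrogateRank implementation: the data are hypothetical, and the helper names `mann_whitney_u` and `mean_difference` are assumptions introduced here. It computes the usual Mann-Whitney estimate of the probability that a treated observation exceeds a control observation (counting ties as one half) and shows that a strictly increasing transformation such as the logarithm leaves it unchanged, while a mean-based summary changes.

```python
import math


def mann_whitney_u(treated, control):
    """Proportion of (treated, control) pairs with treated > control, ties counted as 0.5."""
    wins = 0.0
    for t in treated:
        for c in control:
            if t > c:
                wins += 1.0
            elif t == c:
                wins += 0.5
    return wins / (len(treated) * len(control))


def mean_difference(treated, control):
    """Difference in group means, a scale-dependent summary used for contrast."""
    return sum(treated) / len(treated) - sum(control) / len(control)


# Hypothetical positive-valued outcomes (not TONIC data).
treated = [4.1, 6.3, 2.8, 9.5, 5.0]
control = [3.9, 2.2, 4.7, 1.5, 3.0]

# Apply a strictly increasing transformation to both groups.
log_treated = [math.log(v) for v in treated]
log_control = [math.log(v) for v in control]

u_raw = mann_whitney_u(treated, control)
u_log = mann_whitney_u(log_treated, log_control)
d_raw = mean_difference(treated, control)
d_log = mean_difference(log_treated, log_control)

print(f"U-statistic: raw={u_raw:.3f}, log={u_log:.3f}")
print(f"Mean difference: raw={d_raw:.3f}, log={d_log:.3f}")
```

Because a strictly increasing transformation preserves the ordering of every treated/control pair, the rank-based quantity is identical on both scales, whereas the difference in means depends on the scale chosen.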

While these advantages of our proposed approach are not solely applicable to a small sample setting, methods to evaluate surrogate markers in a small sample setting are critically needed. In a world where big data problems are ubiquitous, it is often forgotten that small data problems have not all been solved. Of course, this is a very difficult problem. It would be unsurprising for a reader to be skeptical of any attempt to validate a surrogate marker in a small single trial. However, this does not change the fact that there is a huge demand for surrogates, and this demand is especially strong in settings with small sample sizes. Clinical studies of rare diseases and/or disease in a pediatric setting are traditionally small, and these are exactly the types of studies in which we are often desperate for a surrogate so that we can make conclusions about treatments sooner. Our pediatric NAFLD application is 1 example of such a study; another example is a currently recruiting clinical trial (ClinicalTrials.gov Identifier: NCT05067621), which is aiming to enroll 60 total participants aged 10-21 with the goal of providing mechanistic insights in support of semaglutide therapy (glucagon-like peptide-1 receptor agonist, which increases insulin secretion) for prediabetes, new-onset type 2 diabetes, and nonalcoholic fatty liver disease in youth. In this planned study, participants will be randomized to semaglutide or placebo; the primary outcomes are change in oral disposition index and change in proton density fat fraction, and there are multiple surrogate markers of interest, including change in oral glucose tolerance test-derived biomarkers, glucose peak, glucagon levels, and incretin effect (NIH, 2022; Le Garf et al., 2021). Skepticism is warranted here, but it is nonetheless important to pursue the development of statistical methods that provide tools that are truly needed.

As mentioned above, the ultimate goal in the identification of a surrogate marker is to then use the surrogate marker to test for a treatment effect in a future study. In general, when we evaluate a surrogate in one study and then claim to be able to use it in a future study, we are making assumptions about the “transportability” of surrogate information from one study to another. In this paper, we focused solely on evaluating a surrogate in a single study, Study A. However, implicit in our framework is an assumption about the transportability of Inline graphic to a future study, i.e., that Inline graphic would remain the same in the future study. Whether such a transportability assumption holds depends on a number of factors, such as whether the patient population is different in the future study, whether there is heterogeneity in the utility of the surrogate, and to what extent conditions (C1)-(C3) hold in both studies. Informally, one needs to ask whether it is reasonable to assume that Inline graphic and Inline graphic will behave in the same way, as captured by Inline graphic, in the future study. In a setting with multiple trials, one may consider a meta-analytic extension of our approach to investigate transportability; we discuss this further in Web Appendix E. Previous work has offered formal conditions for transportability of a surrogate marker to a future study via selection diagrams and do-calculus (Pearl and Bareinboim, 2011; Tikka and Karvanen, 2019; Bareinboim and Pearl, 2013). However, to our knowledge, there is no existing framework to empirically investigate the plausibility of transportability in a surrogate marker setting, especially within a single study. Though beyond the scope of this paper, transportability is an important area for future work.

Our approach has some limitations. First, our testing procedure is conservative and may have low-to-moderate power in certain settings. However, given that our testing results may lead to use of the surrogate marker to test for a treatment effect in a future study, we believe a conservative procedure may be desirable. Power gains may be achievable if one had multiple small studies available by considering a meta-analytic approach (see Web Appendix E). Second, while we propose an intuitive approach to select Inline graphic, this approach may be inappropriate or undesirable in certain settings. For example, if one does not want Inline graphic to depend on the sample size, does not agree with the concept of the bar for surrogacy depending on the sample size (note that higher sample sizes will result in a larger Inline graphic in the current proposal), or does not want Inline graphic to depend on the observed Inline graphic in the current study, an alternative approach will be needed. Lastly, the proposed approach is designed to accommodate outcomes and surrogates that are either continuous or on an ordinal scale with fine levels. Extending our approach to settings with outcomes or surrogates that are binary, counts, and/or censored time-to-event outcomes would not be trivial; at a minimum, one would need to consider identifying an appropriate validation criterion in these settings (Deltuvaite-Thomas et al., 2023). Further research on this topic is warranted.

Supplementary Material

ujad035_Supplemental_Files

Web Appendices referenced in Sections 2, 3, 4, and 5, and a zip file containing code to replicate all results are available with this paper at the Biometrics website on Oxford Academic.

Acknowledgement

The TONIC study was supported by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). The data were supplied by the NIDDK Central Repository. This manuscript was not prepared under the auspices of the TONIC study contributors and does not represent analyses or conclusions of the TONIC study contributors, the NIDDK Central Repository, or the NIH.

Contributor Information

Layla Parast, Department of Statistics and Data Science, University of Texas at Austin, Austin, TX 78712, United States.

Tianxi Cai, Department of Biostatistics, Harvard University, Boston, MA 02115, United States.

Lu Tian, Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, United States.

FUNDING

This work was supported by NIDDK grant R01DK118354.

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

The data from the Treatment of Nonalcoholic Fatty Liver Disease in Children (TONIC) study used in this paper are publicly available upon request from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Data Repository and completion of a data use agreement: https://repository.niddk.nih.gov/studies/tonic/.

References

  1. Bareinboim E., Pearl J. (2013). A general algorithm for deciding transportability of experimental results. Journal of Causal Inference, 1, 107–134.
  2. Buyse M., Molenberghs G. (1998). Criteria for the validation of surrogate endpoints in randomized experiments. Biometrics, 54, 1014–1029.
  3. Conlon A. S., Taylor J. M., Elliott M. R. (2014). Surrogacy assessment using principal stratification when surrogate and outcome measures are multivariate normal. Biostatistics, 15, 266–283.
  4. Dabrowska D. M. (1987). Non-parametric regression with censored survival time data. Scandinavian Journal of Statistics, 14, 181–197.
  5. DeLong E. R., DeLong D. M., Clarke-Pearson D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44, 837–845.
  6. Deltuvaite-Thomas V., Verbeeck J., Burzykowski T., Buyse M., Tournigand C., Molenberghs G., et al. (2023). Generalized pairwise comparisons for censored data: An overview. Biometrical Journal, 65, e2100354.
  7. Dunoyer M. (2011). Accelerating access to treatments for rare diseases. Nature Reviews Drug Discovery, 10, 475–476.
  8. Elliott M. R. (2023). Surrogate endpoints in clinical trials. Annual Review of Statistics and its Application, 10, 75–96.
  9. Fan J., Gijbels I. (1992). Variable bandwidth and local linear regression smoothers. The Annals of Statistics, 20, 2008–2036.
  10. Freedman L. S., Graubard B. I., Schatzkin A. (1992). Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine, 11, 167–178.
  11. Hanley J. A., McNeil B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.
  12. Lavine J. E., Schwimmer J. B., Van Natta M. L., Molleston J. P., Murray K. F., Rosenthal P., et al. (2011). Effect of vitamin E or metformin for treatment of nonalcoholic fatty liver disease in children and adolescents: the TONIC randomized controlled trial. JAMA, 305, 1659–1668.
  13. Le Garf S., Nègre V., Anty R., Gual P. (2021). Metabolic fatty liver disease in children: A growing public health problem. Biomedicines, 9, 1915.
  14. Li G., Zhang P., Wang J., Gregg E. W., Yang W., Gong Q., et al. (2008). The long-term effect of lifestyle interventions to prevent diabetes in the China Da Qing Diabetes Prevention Study: a 20-year follow-up study. The Lancet, 371, 1783–1789.
  15. Lindström J., Ilanne-Parikka P., Peltonen M., Aunola S., Eriksson J. G., Hemiö K., et al. (2006). Sustained reduction in the incidence of type 2 diabetes by lifestyle intervention: follow-up of the Finnish Diabetes Prevention Study. The Lancet, 368, 1673–1679.
  16. Mann H. B., Whitney D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18, 50–60.
  17. Miyamoto B. E., Kakkis E. D. (2011). The potential investment impact of improved access to accelerated approval on the development of treatments for low prevalence rare diseases. Orphanet Journal of Rare Diseases, 6, 1–13.
  18. Molenberghs G., Buyse M., Geys H., Renard D., Burzykowski T., Alonso A. (2002). Statistical challenges in the evaluation of surrogate endpoints in randomized trials. Controlled Clinical Trials, 23, 607–625.
  19. NIH (2022). US National Library of Medicine: ClinicalTrials.gov: Semaglutide effects in obese youth with prediabetes/new onset type 2 diabetes and non-alcoholic fatty liver disease. https://clinicaltrials.gov/ct2/show/NCT05067621 [Accessed 15 November 2023].
  20. Nobili V., Alisi A., Valenti L., Miele L., Feldstein A. E., Alkhouri N. (2019). NAFLD in children: new genes, new diagnostic modalities and new drugs. Nature Reviews Gastroenterology and Hepatology, 16, 517–530.
  21. Parast L., Cai T., Tian L. (2019). Using a surrogate marker for early testing of a treatment effect. Biometrics, 75, 1253–1263.
  22. Parast L., McDermott M. M., Tian L. (2016). Robust estimation of the proportion of treatment effect explained by surrogate marker information. Statistics in Medicine, 35, 1637–1653.
  23. Patterson S. D., Jones B. (2017). Bioequivalence and Statistics in Clinical Pharmacology, New York: CRC Press.
  24. Pearl J., Bareinboim E. (2011). Transportability across studies: A formal approach. Technical report.
  25. Prentice R. L. (1989). Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine, 8, 431–440.
  26. Rosenblatt M. (1969). Conditional probability density and regression estimators. Multivariate Analysis II, 25, 31.
  27. SurrogateRank Repository (2023). The SurrogateRank package and repository. https://github.com/laylaparast/SurrogateRank/ [Accessed 15 November 2023].
  28. Tikka S., Karvanen J. (2019). Surrogate outcomes and transportability. International Journal of Approximate Reasoning, 108, 21–37.
  29. VanderWeele T. J. (2013). Surrogate measures and consistent surrogates. Biometrics, 69, 561–565.
  30. Wang X., Cai T., Tian L., Bourgeois F., Parast L. (2021). Quantifying the feasibility of shortening clinical trial duration using surrogate markers. Statistics in Medicine, 40, 6321–6343.
  31. Wang X., Parast L., Tian L., Cai T. (2020). Model-free approach to quantifying the proportion of treatment effect explained by a surrogate marker. Biometrika, 107, 107–122.
  32. Wang Y., Taylor J. M. (2002). A measure of the proportion of treatment effect explained by a surrogate marker. Biometrics, 58, 803–812.
  33. Wittes J., Lakatos E., Probstfield J. (1989). Surrogate endpoints in clinical trials: cardiovascular diseases. Statistics in Medicine, 8, 415–425.


Articles from Biometrics are provided here courtesy of Oxford University Press
