Abstract
The use of simulation-based sensitivity analyses is fundamental for evaluating and comparing candidate designs of future clinical trials. In this context, sensitivity analyses are especially useful to assess the dependence of important design operating characteristics with respect to various unknown parameters. Typical examples of operating characteristics include the likelihood of detecting treatment effects and the average study duration, which depend on parameters that are unknown until after the onset of the clinical study, such as the distributions of the primary outcomes and patient profiles. Two crucial components of sensitivity analyses are (i) the choice of a set of plausible simulation scenarios and (ii) the list of operating characteristics of interest. We propose a new approach for choosing the set of scenarios to be included in a sensitivity analysis. We maximize a utility criterion that formalizes whether a specific set of sensitivity scenarios is adequate to summarize how the operating characteristics of the trial design vary across plausible values of the unknown parameters. Then, we use optimization techniques to select the best set of simulation scenarios (according to the criteria specified by the investigator) to exemplify the operating characteristics of the trial design. We illustrate our proposal in three trial designs.
Keywords: Clinical trial design, operating characteristics, sensitivity analysis, function approximation, simulated annealing
1. Introduction
Clinical trial designs are becoming increasingly complex to meet the multifaceted needs and goals of precision medicine. Examples of complex designs include adaptive seamless phase I/II designs for evaluating, early in the treatment development process, the dosing, safety, and activity of new drugs (Hobbs et al., 2019). Also, adaptive randomized trials with frequent interim looks at the data can evaluate one or more therapies simultaneously while attempting to minimize trial duration and resources (Thorlund et al., 2018; Berry et al., 2010). Additional examples of complex designs have been implemented in biomarker-stratified trials to evaluate the efficacy of a therapy and possible variations of treatment effects across patient subgroups (Mehta et al., 2019).
When planning a new trial, it is necessary to predict and evaluate several operating characteristics. Relevant operating characteristics can include the likelihood of selecting an effective dose with low toxicity in a phase I/II study, the probability of detecting treatment effects in a randomized study, the expected trial duration, costs, and other metrics to evaluate designs that often enroll patients from different subgroups. Multiple operating characteristics typically need to be examined jointly in order to evaluate the relevant trade-offs achieved by candidate designs, such as balancing the accuracy in estimating treatment effects and the expected study duration.
The obvious challenge for evaluating a candidate design is that the vector of operating characteristics of the study design is not known and it is difficult to estimate before the onset of the trial. Indeed, the operating characteristics are usually a function of a vector of unknown parameters that identify the distribution of all relevant variables to be captured during the trial. For example, unknown parameters can include the enrollment and drop-out rates, the magnitude of treatment effects, and the prevalence of predictive biomarkers in the trial population. Uncertainty on these parameters makes it non-trivial to evaluate whether a candidate design is appropriate for implementing the new study.
Sensitivity analyses are commonly used to account for uncertainty on unknown parameters and operating characteristics when evaluating a candidate design. They typically proceed in three steps. First, a set of plausible scenarios, i.e., specific values of the vector of unknown parameters, is selected. Next, the corresponding operating characteristics are computed using trial simulations or analytic results. Finally, based on the computed operating characteristics and their variations across the set of scenarios, the investigators evaluate whether the candidate design is appropriate to achieve the aims of the study. Throughout the manuscript, we use the terms sensitivity analysis or simulation report to indicate a set of scenarios and the associated operating characteristics which are computed to illustrate how the operating characteristics vary across plausible values of unknown parameters.
Producing a simulation report to effectively evaluate a study design has been recommended as one of the key supporting documents for interacting with the FDA (Mayer et al., 2019; Food et al., 2020). However, it can be difficult to select the set of unknown parameters, especially if the dimension of the vector of unknown parameters is moderate to high (say ≥ 5). For the investigators, it might be unclear if the selected scenarios are adequate to illustrate the variations of the operating characteristics across potential values of the unknown parameters. Similarly, for regulators, there may be skepticism as to whether the selected scenarios are chosen to highlight positive aspects of the trial design without pointing at its limitations and negative aspects (Razavi et al., 2021). Another subtle challenge is the choice of the number of scenarios. Indeed, a large number of scenarios (say 100) may simplify the task of representing how the operating characteristics vary across potential values of the unknown parameters, but a simulation report that contains too many scenarios makes it difficult to interpret and communicate the included results.
We propose a method to choose an optimal set of scenarios for a simulation report that will provide relevant operating characteristics. This decision is based on a utility criterion, which formalizes the ability of any set of scenarios to represent the map between the unknown parameters and the operating characteristics. In some cases, we will consider a restriction of the parameter space to focus only on regions of plausible values of the unknown parameters. The utility criterion assigns high (low) utility to a set of scenarios if the table of potential unknown parameters and operating characteristics is an accurate (inaccurate) summary of how the design's operating characteristics vary across the considered parameter space. We call the set of scenarios that maximizes the utility criterion the Representative and Optimal Sensitivity Analysis (ROSA) scenarios. To select the ROSA scenarios, we introduce a computational procedure that leverages (i) flexible regression methods like neural networks (NNs) (Goodfellow et al., 2016) and (ii) optimization algorithms like simulated annealing (Bélisle, 1992). Our approach is applicable to any trial design, regardless of the number of unknown parameters and the number of operating characteristics.
In summary, we propose ROSA as a computational tool that allows one to examine any clinical trial design by selecting a parsimonious set of simulation scenarios with the goal of representing the variations of the operating characteristics across plausible values of unknown parameters. To illustrate this approach, we conduct sensitivity analyses for three trial designs. The first is a two-arm randomized design that aims to test and estimate the effects of an experimental treatment compared to the standard of care (SOC). The second is a multi-stage randomized trial that leverages an auxiliary/surrogate outcome measured shortly after randomization for interim decisions and a primary outcome with a longer ascertainment time (Niewczas et al., 2019). The third is a biomarker-adaptive enrichment design similar to the design of the TAPPAS trial (Mehta et al., 2019), a randomized phase III trial comparing TRC105 and pazopanib versus pazopanib alone in patients with advanced angiosarcoma (Jenkins et al., 2011; Jones et al., 2017). In the first design, we consider a single unknown parameter and a single operating characteristic, whereas for the latter two designs we consider multiple unknown parameters and multiple operating characteristics.
2. Selecting sensitivity scenarios
2.1. Notation and problem set-up
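We introduce our procedure to select $K$ sensitivity scenarios $\theta_{1:K} = (\theta_1, \ldots, \theta_K) \in \Theta^K$, where $\Theta$ is the set of potential values of the unknown parameters $\theta$. We assume that $\Theta$ is a bounded subset of $\mathbb{R}^p$ and use the notation $\|\cdot\|$ to indicate the Euclidean norm on $\mathbb{R}^p$. We will restrict $\Theta$ to a subset $\Theta_0 \subset \Theta$ when there is sufficient prior information from completed studies or clinical experience. We identify ROSA scenarios as the scenarios that maximize a utility criterion
$$\theta^{\star}_{1:K} = \arg\max_{\theta_{1:K} \in \Theta^K} u(\theta_{1:K}), \tag{1}$$
where
$$u(\theta_{1:K}) = -\max_{\theta \in \Theta} \, \min_{k = 1, \ldots, K} d\big(\mathrm{OC}(\theta), \mathrm{OC}(\theta_k)\big). \tag{2}$$
We can symmetrically define the corresponding loss function $\ell(\theta_{1:K}) = -u(\theta_{1:K})$ by inverting the sign in equation (2). Here, $d(\cdot, \cdot)$ is a metric between the operating characteristics $\mathrm{OC}(\theta)$ and $\mathrm{OC}(\theta_k)$. We will consider metrics of the form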
where are non-negative weights that sum to one. The weights can be user-specified to calibrate the relative importance of different operating characteristics. Setting the weights to results in equal weighting for each operating characteristic.
We can now provide an explicit interpretation of the utility function in equation (2). Consider a set of scenarios $\theta_{1:K}$ – the order of the entries is not relevant – and an arbitrary scenario $\theta$ in $\Theta$. For $k = 1, \ldots, K$, the metric $d(\mathrm{OC}(\theta), \mathrm{OC}(\theta_k))$ is a summary of the differences between the operating characteristics at $\theta$ and the same operating characteristics when we consider the $k$-th scenario $\theta_k$. Therefore, $\min_{k} d(\mathrm{OC}(\theta), \mathrm{OC}(\theta_k))$ can be viewed as an approximation error between $\mathrm{OC}(\theta)$ and a similar vector of operating characteristics selected among our $K$ options $\mathrm{OC}(\theta_1), \ldots, \mathrm{OC}(\theta_K)$. Expression (2) identifies through the maximization operator the worst-case $\theta$ (with highest approximation error) that we can obtain by varying $\theta$ in $\Theta$. We maximize the utility function $u$ and use $\theta^{\star}_{1:K}$ to indicate the ROSA scenarios. Alternative utility criteria and loss functions are described later in the manuscript. The notation used throughout the paper is summarized in Table 1 below.
Table 1.
Notation
| ≜ | Unknown parameter space in | |
| ≜ | Restricted unknown parameter subspace by prior knowledge in | |
| ≜ | Restricted unknown parameter subspace by prior knowledge and fixing certain dimensions in | |
| ≜ | Diffuse and finite unknown parameter subspace in | |
| ≜ | -dimensional vector of unknown parameters | |
| ≜ | -dimensional training vector of unknown parameters | |
| ≜ | -dimensional validation vector of unknown parameters | |
| ≜ | A set of sensitivity scenarios | |
| ≜ | The ROSA set of sensitivity scenarios optimizing loss | |
| ≜ | The ROSA set of sensitivity scenarios optimizing marginal loss | |
| ≜ | -vector of operating characteristics for unknown parameters | |
| ≜ | Estimated vector of operating characteristics for unknown parameters | |
| ≜ | Average across simulations of the vector of operating characteristics for unknown parameters | |
| ≜ | Generic function to capture if a null hypothesis has been rejected, where is the th trial under the th scenario, | |
| ≜ | Loss Function | |
| ≜ | Utility criterion | |
| ≜ | Fixed non-negative weights for operating characteristics | |
| ≜ | Weights for stage 1 and 2 p-values | |
| ≜ | Pre-specified distance metric | |
| ≜ | Gaussian noise in iteration of simulated annealing | |
| ≜ | Acceptance probability in iteration of simulated annealing | |
| ≜ | Unknown parameter space in | |
| ≜ | Decreasing sequence of positive numbers (cooling schedule of simulated annealing) | |
| ≜ | Multiplicative reduction factor for simulated annealing in (0, 1) | |
| ≜ | Random variable distributed Uniform(0,1) for simulated annealing | |
| ≜ | Enrollment rate in (0,∞) | |
| ≜ | Planned number of patients on arm at the final analysis | |
| ≜ | Planned number of patients on arm at the interim analysis | |
| ≜ | Binary auxiliary outcome | |
| ≜ | Primary outcome | |
| ≜ | Response probability | |
| ≜ | Treatment effect on | |
| ≜ | Response probability | |
| ≜ | Correlation between and in |
2.2. An example with a geometric interpretation
To provide a geometric interpretation of the utility criterion , we illustrate how one set of scenarios can be preferable to a different set of scenarios (Figure 1). Specifically, suppose we aim to design a single-arm trial with an interim analysis that allows for early stopping for futility. The goal of the trial is to compare the response rate of an experimental drug with that of the SOC at the end of the study. However, because study patients only receive the experimental drug, the response rate under the SOC is estimated before the onset of the study, for example using data from a previous trial. At the interim analysis, the trial may stop for futility if the preliminary evidence of positive treatment effects is insufficient to continue the study. During the final analysis, the null hypothesis (the experimental therapy is not superior to the historical control) is tested against the alternative hypothesis (the experimental therapy is superior to the historical control). In this design, are the unknown parameters, and . Suppose that there are two operating characteristics of interest: (i) , the probability of a positive result ( is rejected) and (ii) , the expected sample size.
Fig. 1.

Geometric representation of an arbitrary scenario and two proposed sets of scenarios. (Left) Parameter space with arbitrary scenario (orange triangle) and two sets of proposed scenarios (blue points) and (red points). (Right) The set of operating characteristics coincides with the irregular shape. The operating characteristics of and two sets of scenarios are illustrated. The radius of the dotted circles (with blue points as centers) is the value of the loss associated with the blue points. ROSA scenarios minimize the loss , which in turn is equal to the radius of the dotted circles that cover the operating characteristic surface .
The left panel of Figure 1 is a representation of . We are interested in the two operating characteristics of the single-arm design. Two sets of scenarios are proposed. The first set of scenarios (blue points) is chosen by varying both unknown parameters at the same time, while the second set (red points) is chosen by varying only while fixing the value of . The two sets of scenarios, the corresponding operating characteristics, and associated loss are represented in the right panel of Figure 1. The first set of scenarios (blue points) is preferred over the second set (red points) because it is more representative of the variation of the operating characteristics over . Geometrically, the loss associated with the blue points is identical to the minimum radius of the circles with centers (see Figure 1) necessary to cover the operating characteristics surface .
2.3. Estimating the operating characteristics
We describe an algorithm to numerically approximate the operating characteristics for every . This is necessary to solve the optimization problem in equation (2). Indeed, in most cases the function cannot be computed in closed form.
We briefly outline our four-step procedure. In the first step, we choose a large number (say ) of training scenarios . In the second step, we use Monte Carlo simulations to obtain estimates of . In the third step, we train a flexible regression model – we use NNs in our implementation – based on the data points . The output of this step is a regression function that is easy to compute at any and that approximates . In the fourth step, we validate the regression model based on (say ) independent simulations . Steps 1–3 of this procedure are summarized in Algorithm 1. Step 4 is described in Algorithm 2.
In more detail, in step 1, to select the training scenarios , we randomly select scenarios in using Latin hypercube sampling (LHS) (McKay et al., 2000). LHS generates scenarios by first partitioning the unknown parameter dimensions into non-overlapping intervals and selecting one value from each interval at random. The values obtained for the first unknown parameter are randomly paired with the values obtained for the second , and so on, for all unknown parameters to form -tuples, which constitute the training scenarios .
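The sampling in step 1 can be sketched in a few lines. This is a minimal illustration of Latin hypercube sampling using only the Python standard library, not the implementation used in the paper; the function name `latin_hypercube`, the box-shaped parameter space, and the example bounds are assumptions made for the sketch.

```python
import random

def latin_hypercube(n_points, bounds, seed=None):
    """Draw n_points scenarios in the box described by bounds = [(low, high), ...]."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        width = (high - low) / n_points
        # One uniform draw inside each of the n_points equal-width intervals.
        values = [low + (i + rng.random()) * width for i in range(n_points)]
        # Shuffle so that intervals are paired at random across dimensions.
        rng.shuffle(values)
        columns.append(values)
    # Transpose: one tuple of parameter values per scenario.
    return list(zip(*columns))

# Hypothetical example: five scenarios for a treatment effect and a response rate.
scenarios = latin_hypercube(5, [(-5.0, 25.0), (0.0, 1.0)], seed=1)
```

Each parameter range is split into as many intervals as there are scenarios, so every interval of every parameter is represented exactly once.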
In step 2, we estimate the operating characteristics of the trial design. In the paper, we focus on operating characteristics that can be defined as expected values, which are often of great interest, e.g., bias, power, duration of the trial, etc. One approach to handling unbounded operating characteristics (e.g. median squared error) is to apply simple transformations in such a way that these operating characteristics can be expressed as expected values, i.e., so that we can write down
for some function , where the random vector represents the data generated during the trial – including the collection of treatment assignment indicators and realized patient outcomes – under scenario . In practice, to estimate , we proceed as follows. First, for each of the training scenarios , we simulate (say ) clinical trials following the trial design. We then use the scenario-specific simulated trials to compute the estimate
where is the th trial dataset simulated under the th training scenario . Throughout the manuscript will take values in a compact set. For example, can be the indicator that captures if a null hypothesis of interest has been rejected at the end of the study, or the duration of the simulated trial. One possibility for handling unbounded operating characteristics (e.g. median squared error) is to apply monotone transformations, from the real line to the unit interval, that rescale the operating characteristics. In this case, the selection of representative scenarios would be influenced by the specific monotone map used to express the operating characteristic.
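The Monte Carlo estimate in step 2 is an average of a trial-level summary over simulated trials. The sketch below illustrates this for a rejection indicator in a two-arm trial with normal outcomes. It assumes a one-sided z-test with known standard deviation, a control mean of zero, and a hypothetical sample size of 50 patients per arm; none of these choices, nor the function names, are taken from the paper.

```python
import random
from statistics import NormalDist

def simulate_trial(theta, n_per_arm, sd, rng):
    """Simulate outcomes for a control arm (mean 0) and a treated arm (mean theta)."""
    control = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
    treated = [rng.gauss(theta, sd) for _ in range(n_per_arm)]
    return control, treated

def rejects_null(data, sd, alpha=0.05):
    """Trial-level summary: 1.0 if the one-sided z-test rejects, else 0.0."""
    control, treated = data
    n = len(control)
    diff = sum(treated) / n - sum(control) / n
    z = diff / (sd * (2.0 / n) ** 0.5)
    return 1.0 if z > NormalDist().inv_cdf(1.0 - alpha) else 0.0

def estimate_oc(theta, n_trials, n_per_arm=50, sd=30.0, seed=0):
    """Average the rejection indicator over n_trials simulated trials."""
    rng = random.Random(seed)
    results = [rejects_null(simulate_trial(theta, n_per_arm, sd, rng), sd)
               for _ in range(n_trials)]
    return sum(results) / n_trials

# Estimated probability of rejection for a hypothetical treatment effect of 10.
power_hat = estimate_oc(theta=10.0, n_trials=1000)
```

Other operating characteristics that are expected values, such as the average sample size or trial duration, fit the same pattern by swapping the trial-level summary.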
In step 3, we have only two inputs, the scenarios and the estimates , to fit a function . For example, one could use NNs, splines (Bookstein, 1989), or Gaussian processes (Rasmussen, 2003). We use NN regression functions in our applications because these are easy to compute using widely available software and have been demonstrated to have good performances (Leshno et al., 1993; Hornik, 1991; Goodfellow et al., 2016).
In step 4 (Algorithm 2), we investigate the differences between and . Specifically, we first select at random validation scenarios independently of the previous computations (steps 1–3) and simulate trials (say ) for each . Based on the results of the simulated trials, for each , we then compute Monte Carlo estimates of the operating characteristics . For several important operating characteristics (e.g., average sample size, expected duration, power, type 1 error), the estimator is unbiased. Finally, we compare the estimates and the independent estimates . We use summary statistics and graphs to evaluate the differences . If the approximation is not adequate, we can use a different regression methodology, increase the number of trials, or increase the number of training scenarios in Algorithm 1.
2.4. Approximating the loss function
After computing (Algorithm 1) and validating its accuracy (Algorithm 2), we use it to approximate the loss function . To proceed, we choose a diffuse and finite subset of the parameter space . For example, can include 100,000 random points from a distribution with support . When contains a large number of random points that are distributed over , under minimal assumptions (e.g., compact and operating characteristics with bounded range),
To summarize, we can approximate the loss function over the entire parameter space by using a diffuse and finite subset .
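The finite-subset approximation can be sketched as a maximum over grid points of the distance to the closest selected scenario. The code below is an illustration under stated assumptions: the metric is taken to be a weighted absolute difference, `oc_hat` stands for any fitted regression function that returns a tuple of operating characteristics, and all names are hypothetical.

```python
def weighted_distance(oc_a, oc_b, weights):
    """Assumed metric: weighted sum of absolute differences between two OC vectors."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, oc_a, oc_b))

def approx_loss(scenarios, grid, oc_hat, weights):
    """Max over grid points of the distance to the closest selected scenario."""
    selected = [oc_hat(s) for s in scenarios]
    worst = 0.0
    for theta in grid:
        oc_theta = oc_hat(theta)
        nearest = min(weighted_distance(oc_theta, oc_k, weights) for oc_k in selected)
        worst = max(worst, nearest)
    return worst

# Toy check: one parameter in [0, 1] whose operating characteristic is the identity.
grid = [i / 1000 for i in range(1001)]
toy_oc = lambda theta: (theta,)
even_loss = approx_loss([1 / 6, 1 / 2, 5 / 6], grid, toy_oc, [1.0])
clustered_loss = approx_loss([0.1, 0.2, 0.3], grid, toy_oc, [1.0])
```

In this toy setting the evenly spread scenarios give a loss close to 1/6, while the clustered scenarios leave the upper end of the range poorly represented and give a much larger loss.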
2.5. Optimization by simulated annealing
We now aim to approximately minimize the loss function . To illustrate the need for approximate solutions, consider the setting of a single unknown parameter , a finite , and an easy-to-compute loss function . Even in this simple setting identifying can be challenging. For example, to select representative scenarios from 1000 points , the loss function would need to be calculated for different possible sets . In what follows, we describe the use of simulated annealing (Algorithm 3), a simple strategy to reduce the outlined computational burden, regardless of whether is finite or not (Kirkpatrick et al., 1983; Bélisle, 1992; Spall, 2005).
The simulated annealing algorithm proceeds as follows. First, initial scenarios are proposed, for example by sampling from a probability distribution with support . Then, iteratively for , the current scenarios are perturbed by adding to them zero-mean noise variables , thus obtaining new proposed scenarios (this step is represented by the "Perturb" operator in Algorithm 3). At each iteration, the proposed scenarios can either be accepted (i.e., ) or rejected (i.e., ). The acceptance or rejection of the proposed scenarios is stochastic, with probability (defined below), which is a function of and .
The acceptance probability is equal to 1 when . That is, if the proposed scenarios decrease the current loss value, then the proposed scenarios are accepted. If instead , then is
where , is a decreasing sequence of positive real numbers often called the "cooling schedule" of the algorithm. A common cooling schedule is , where is a constant and is a multiplicative contraction, but other forms are possible (Spall, 2005). In our applications, we use a piecewise-constant cooling schedule (Husmann et al., 2017).
After simulating the outlined Markov chain for a fixed number of iterations, the final set of scenarios approximately minimizes the loss function (Bélisle, 1992). In our ROSA implementation, we use multiple independent replicates of Algorithm 3, with different initial scenarios , to investigate convergence of the random trajectory . Intuitively, we evaluate whether the replicated trajectories, with different starting values, terminate with nearly identical final vectors and negligible differences in the values.
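The annealing loop described above can be sketched as follows. This is a simplified illustration, not Algorithm 3 itself. It assumes a single unknown parameter, Gaussian perturbations clipped to the parameter range, the standard Metropolis acceptance probability exp(-increase in loss / temperature), and a geometric cooling schedule rather than the piecewise-constant schedule used in the applications. It also returns the best set visited rather than the final state, which is a convenience of the sketch. All names and tuning values are hypothetical.

```python
import math
import random

def simulated_annealing(loss, init, bounds, n_iter=3000, t0=1.0, r=0.999,
                        step=1.0, seed=0):
    """Approximately minimize loss over sets of scalar scenarios within bounds."""
    rng = random.Random(seed)
    low, high = bounds
    current, current_loss = list(init), loss(init)
    best, best_loss = list(current), current_loss
    for t in range(n_iter):
        temperature = t0 * r ** t  # geometric cooling schedule
        # Perturb: add zero-mean Gaussian noise and clip to the parameter range.
        proposal = [min(high, max(low, x + rng.gauss(0.0, step))) for x in current]
        proposal_loss = loss(proposal)
        delta = proposal_loss - current_loss
        # Always accept improvements; accept worse proposals with some probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_loss = proposal, proposal_loss
            if current_loss < best_loss:
                best, best_loss = list(current), current_loss
    return best, best_loss

def toy_loss(scenarios):
    # Hypothetical stand-in for the approximated loss: max-min absolute
    # difference on a grid over [0, 1], with the identity as operating characteristic.
    grid = [i / 100 for i in range(101)]
    return max(min(abs(g - s) for s in scenarios) for g in grid)

start = [0.1, 0.2, 0.3]
best, best_loss = simulated_annealing(toy_loss, start, bounds=(0.0, 1.0), step=0.1)
```

Rerunning the function with different `seed` and `init` values mirrors the replicate-based convergence check: the runs should end with similar scenarios and similar loss values.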
3. Applications: Sensitivity analyses of trial designs
We illustrate the ROSA approach by performing sensitivity analyses for three designs of different complexity levels. In each example, we describe the design of the trial, the unknown parameters, and the operating characteristics of interest. By illustrating the ROSA methodology in three trial designs, we show its flexibility with potential applications to evaluate nearly any clinical trial design. Indeed, ROSA only requires the possibility of simulating the trials under potential unknown parameters and the definition of the operating characteristics of interest.
3.1. Application 1: Two-arm RCT
In the first example, we will only consider a single unknown parameter (i.e., ) and a single operating characteristic that can be computed analytically. In this case, the optimal set of scenarios can be computed exactly, without resorting to approximation methods. This simple and stylized setting is useful to highlight the similarity of the approximations and selected scenarios computed by ROSA with their exact counterparts.
3.1.1. Trial design
We consider the design of a two-arm randomized trial (1:1 randomization ratio) with a sample of patients. For each , we let or 1 if the -th study patient is assigned to the control or experimental arm. The outcomes of the study patients are , which we assume to be independent and normally distributed. If then has mean and standard deviation equal to 30. In the analysis of the study, a -statistic will be used to test the null hypothesis against the alternative at 5% significance level.
3.1.2. Aim of the sensitivity analysis
The goal of the sensitivity analysis is to summarize the variation of the probability of rejecting , a function of the unknown treatment effect . For example, if we knew that , then , but in general is an unknown value. Suppose we aim to identify scenarios that maximize the utility , i.e.,
| (3) |
where .
In this trial, we have a single unknown parameter , and the operating characteristic of interest is monotone, continuous, invertible, and ranges from 0 to 1. Therefore, it is straightforward to see that the optimal scenarios correspond to the operating characteristic values that evenly divide the interval (0, 1). To be precise, the optimal operating characteristic values are 1/6, 1/2, and 5/6; these are the three values of a regular grid on the interval (0, 1). Figure 2A illustrates the optimal set of scenarios when . Since can be calculated exactly, the optimal scenarios can be obtained by computing the inverse function at the values 1/6, 1/2, and 5/6. Specifically,
where is the quantile of the standard normal distribution. The corresponding optimal scenarios are illustrated as red asterisks in Figure 2B.
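The inversion can be illustrated numerically. The sketch below assumes a one-sided z-test with known standard deviation 30, 1:1 randomization, and a hypothetical total sample size of 100; the sample size and the function names are assumptions for illustration and are not taken from the paper.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power(theta, n_total, sd=30.0, alpha=0.05):
    """Probability of rejection for a one-sided z-test with 1:1 randomization."""
    se = 2.0 * sd / n_total ** 0.5  # standard error of the difference in means
    return 1.0 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - alpha) - theta / se)

def exact_scenarios(k, n_total, sd=30.0, alpha=0.05):
    """Treatment effects whose power equals (2j - 1) / (2k) for j = 1, ..., k."""
    se = 2.0 * sd / n_total ** 0.5
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    targets = [(2 * j - 1) / (2 * k) for j in range(1, k + 1)]
    # Invert the power function: theta = se * (z_alpha + z_q) for target power q.
    return [se * (z_alpha + STD_NORMAL.inv_cdf(q)) for q in targets]

thetas = exact_scenarios(3, n_total=100)
powers = [power(theta, n_total=100) for theta in thetas]
```

Evaluating `power` at the returned treatment effects recovers the target values 1/6, 1/2, and 5/6 up to numerical error.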
Fig. 2.

Sensitivity analysis of an RCT (operating characteristic: probability of rejecting ). Panel A: Exact solutions when . Panel B: Comparison of scenarios selected through exact calculation (red asterisks) and by 20 ROSA implementations with different initial proposals (blue points). Panel C: Graphical tool to choose the number of sensitivity scenarios.
3.1.3. Implementing and benchmarking ROSA
The exact computation of the optimal set of scenarios provides a solid benchmark for an initial evaluation of ROSA (Algorithms 1–3). We can compare the exact solution with the results from ROSA, which has the advantage of being applicable to other designs and operating characteristics that are not available in closed form.
We implement our ROSA approach to identify scenarios. We randomly select scenarios with independent samples from the Uniform(−5, 25) distribution. Note that and . For each , we simulate trials to compute the estimate , where either accepts or rejects the null hypothesis for trial and scenario . Then, we compute a continuous function using the independent estimates and a NN with 3 hidden layers (8, 64, and 64 neurons respectively) and ReLU activation functions. Finally, to select three sensitivity scenarios, we use a simulated annealing algorithm based on an initial parameterization and final parameterization (cf. Algorithm 3). We repeat these three steps (selection of scenarios, use of the NN, and optimization with simulated annealing) 20 times, each time initializing , , with independent random draws from the Uniform(−5, 25) distribution. The results of the exact approach (red asterisks) compared with ROSA (blue points) are shown in Figure 2B. The scenarios , , selected by simulated annealing (blue points) are close to the exact solution (red asterisks).
3.1.4. Choice of number of scenarios
In practice, the decision regarding the number of scenarios to report is left to the analyst. This choice can be supported by a graph like Figure 2C, which allows the investigator to determine the minimum number of scenarios needed to guarantee a loss no larger than a targeted threshold. For example, to guarantee a loss no larger than 0.050 in this example, we need to select at least 10 scenarios for the simulation report.
We ran ROSA with and compared the loss in the resulting set of scenarios with that of the exact solution. The difference in the loss of the exact and approximate optima was less than 1% across all values that we considered (Figure 2C). Table 2 indicates that the computation time of the simulated annealing algorithm scales well as increases and that, as expected, the loss decreases as increases. All analyses were run on a Windows laptop with an Intel(R) Core(TM) i7-7700HQ 2.80 GHz processor, 16GB RAM, and 6MB of cache memory.
Table 2.
ROSA computation time, ROSA loss , minimum (exact) loss , and relative difference in loss of ROSA scenarios compared to the exact solutions.
| Number of Scenarios | Time (seconds) | ROSA Loss | Min. Loss | Rel. Diff. |
|---|---|---|---|---|
| 5 | 8.8 | 0.101 | 0.100 | 1.0% |
| 6 | 8.8 | 0.084 | 0.083 | 0.7% |
| 7 | 9.1 | 0.072 | 0.071 | 0.8% |
| 8 | 9.2 | 0.062 | 0.0625 | 0.7% |
| 9 | 9.1 | 0.056 | 0.056 | 0.6% |
| 10 | 9.1 | 0.050 | 0.050 | 0.2% |
| 20 | 10.1 | 0.025 | 0.025 | 0.5% |
| 30 | 10.2 | 0.017 | 0.0167 | 0.8% |
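
The Min. Loss column of Table 2 matches a simple closed form. As a sketch, write $K$ for the number of scenarios, $\theta_k^{\star}$ for the exact optimal scenarios, $\mathrm{PP}$ for the probability of rejection, and $\ell_K^{\star}$ for the minimum loss; these symbols are chosen here for illustration. Assuming a single continuous and monotone operating characteristic with range (0, 1) and the absolute difference as the metric, the optimal values split (0, 1) into $K$ intervals of equal half-width:

```latex
\[
\mathrm{PP}(\theta_k^{\star}) = \frac{2k - 1}{2K}, \quad k = 1, \ldots, K,
\qquad
\ell_K^{\star} = \frac{1}{2K},
\qquad
\ell_K^{\star} \le \varepsilon \iff K \ge \frac{1}{2\varepsilon}.
\]
```

For example, $K = 8$ gives $1/16 = 0.0625$ and $K = 10$ gives $1/20 = 0.050$, in agreement with Table 2, and a target loss of $\varepsilon = 0.050$ requires at least 10 scenarios.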
3.2. Application 2: Interim decisions based on auxiliary outcomes
In the second example, we consider sensitivity analyses with multiple unknown parameters and two operating characteristics. We illustrate the use of our computational procedures, including the operating characteristics approximation procedure (Algorithm 1), the validation procedure (Algorithm 2), and the simulated annealing optimization procedure (Algorithm 3). We investigate whether it is appropriate to fix the value of some of the unknown parameters across all sensitivity scenarios. Identical values for a subset of the unknown parameters can simplify the interpretation of the sensitivity analysis but can also introduce severe limitations in faithfully representing how the operating characteristics vary across plausible values of the unknown parameters.
3.2.1. Trial design
We consider a two-arm, two-stage randomized trial with a binary primary outcome and a binary auxiliary outcome (Niewczas et al., 2019). The primary outcome is available months after randomization, while the auxiliary outcome is available after months. For example, in glioblastoma trials, 12-month progression-free survival (PFS) and 24-month overall survival (OS) have been used as auxiliary and primary outcomes, respectively (Han et al., 2014). The approach that we illustrate is applicable for any value of and .
We let be the planned number of patients for arms (i.e., control and experimental arms) and indicate with the response probability . Similarly, let be the planned number of patients assigned to arm a before the interim analysis, and indicate the response probability . The difference is the treatment effect on . The primary aim of the trial is to test versus , at level . The final analysis of the study involves only the primary outcome , and the trial will use a standard -test, , where is the estimate of and is a weighted average of and .
An interim analysis is conducted after the auxiliary outcomes become available for patients for arms and 1 (i.e., months after the enrollment of patients on arms and 1), with early-stopping for futility or continuation based on a summary of the auxiliary outcomes . In several clinical settings, the treatment effect on tends to be more pronounced than the treatment effect on . The interim analysis is based on the summary , where is the estimate of and is a weighted average of and . We replicate the design of Niewczas et al. (2019), which calculates at the interim analysis the conditional power (CP) using the auxiliary outcome to determine whether to stop the trial for futility or not. Specifically, the CP is calculated based on and the information fraction as
where is the quantile of the standard normal distribution and is the cumulative distribution function of the standard normal distribution. Here, we set the cut-off point to be 0.5 so that the trial continues when .
3.2.2. Aim of the sensitivity analysis
The complexity of the simulation report increases with (the number of scenarios), (the number of entries of the unknown parameters ), and (the number of operating characteristics ). Here the full set of unknown parameters include the enrollment rate , the response rates for in , the response rates for in , and the correlation between and in , .
Controlling the complexity of the simulation report is important to ensure high interpretability of the report, which will be discussed by several stakeholders. There are a few potential strategies to reduce the complexity of the simulation report. First, it is often possible to consider only a subset of the parameter space based on prior knowledge of plausible values of the unknown parameters. For example, previous clinical studies can indicate a plausible range for the enrollment rate , the response rates under the SOC, and other parameters that are expected to have minimal variations across trials. In addition, we can also consider fixing multiple entries of the vectors to some reference values. In this case the space from which we select scenarios is further reduced to . For example, if the operating characteristics have low sensitivity with respect to the correlation parameters or the enrollment rate of the study, then we can fix these unknown parameters to common values (i.e., estimates) across all scenarios.
ROSA allows us to evaluate whether it is appropriate to assign the same value to one or more unknown parameters (e.g., and ) across all scenarios. In other words, we evaluate a simulation report with all scenarios in a restricted subset . A simulation report with scenarios in can potentially be easier to interpret compared to a report in which all entries of vary across scenarios by reducing the number of dimensions of the unknown parameters and pointing to the most relevant unknown parameters when discussing the variations of the operating characteristics across . We can select scenarios from the restriction only if the capability of the simulation report of representing the operating characteristics variations across is preserved. Our case study investigates this aspect. The operating characteristics of interest in our case study are the probability of rejecting the null hypothesis of no treatment effect on at the end of the study and the average sample size.
3.2.3. Implementing and benchmarking ROSA
Using our ROSA procedure, we randomly select training scenarios using LHS and conduct Monte Carlo simulations for each of the training scenarios to obtain estimates of the operating characteristics across . Here is a product space with the enrollment rate , the response rates for in , the response rates for in , and the correlation between and in . For , we fix the enrollment rate and the response rates in the control groups.
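The training step (Latin hypercube sampling followed by Monte Carlo simulation at each sampled scenario) can be sketched as follows. This is a minimal illustration using only the Python standard library. The parameter names and ranges in `PLAUSIBLE_RANGES`, the toy two-arm trial in `simulate_power`, and all numerical settings are hypothetical placeholders, not the design or the values used in this case study.

```python
import random
from statistics import NormalDist


def latin_hypercube(ranges: dict, n_points: int, rng: random.Random) -> list:
    """Draw n_points scenarios so that each parameter's range is split into
    n_points equal strata and every stratum is sampled exactly once."""
    columns = {}
    for name, (low, high) in ranges.items():
        strata = list(range(n_points))
        rng.shuffle(strata)  # independent random pairing across parameters
        columns[name] = [low + (s + rng.random()) * (high - low) / n_points for s in strata]
    return [{name: columns[name][i] for name in ranges} for i in range(n_points)]


def simulate_power(scenario: dict, n_per_arm: int, n_trials: int, rng: random.Random) -> float:
    """Monte Carlo estimate of the rejection probability of a toy two-arm trial
    with a binary outcome (one-sided two-proportion z-test at level 0.025)."""
    z_crit = NormalDist().inv_cdf(0.975)
    rejections = 0
    for _ in range(n_trials):
        x0 = sum(rng.random() < scenario["p_control"] for _ in range(n_per_arm))
        x1 = sum(rng.random() < scenario["p_treatment"] for _ in range(n_per_arm))
        pooled = (x0 + x1) / (2 * n_per_arm)
        se = (2 * pooled * (1 - pooled) / n_per_arm) ** 0.5
        if se > 0 and (x1 - x0) / n_per_arm / se > z_crit:
            rejections += 1
    return rejections / n_trials


# Hypothetical plausible ranges for two unknown parameters.
PLAUSIBLE_RANGES = {"p_control": (0.2, 0.4), "p_treatment": (0.3, 0.6)}

rng = random.Random(2024)
training_scenarios = latin_hypercube(PLAUSIBLE_RANGES, n_points=20, rng=rng)
training_power = [simulate_power(s, n_per_arm=50, n_trials=200, rng=rng)
                  for s in training_scenarios]
```

The pairs in `training_scenarios` and `training_power` would then serve as inputs and targets for the interpolation step.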
We use a NN to obtain an interpolation of the operating characteristics. As described in Algorithm 4, to evaluate if the estimates of the operating characteristics are accurate, we compare them to independent Monte Carlo estimates of size on a set of uniformly-distributed validation points spanning the plausible parameter space . The coefficients of determination in this comparison are above 0.96. This suggests that the NN accurately estimates the operating characteristics.
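The validation check can be illustrated with a short sketch. We assume here that the coefficient of determination is computed as one minus the ratio of the residual sum of squares to the total sum of squares, treating the independent Monte Carlo estimates as the reference. The function name and the numbers in the example are hypothetical.

```python
def r_squared(monte_carlo: list, emulator: list) -> float:
    """Coefficient of determination of emulator predictions against
    independent Monte Carlo estimates at the same validation points."""
    if len(monte_carlo) != len(emulator) or len(monte_carlo) < 2:
        raise ValueError("need two equally long sequences with at least two points")
    mean_mc = sum(monte_carlo) / len(monte_carlo)
    ss_res = sum((y - f) ** 2 for y, f in zip(monte_carlo, emulator))
    ss_tot = sum((y - mean_mc) ** 2 for y in monte_carlo)
    if ss_tot == 0:
        raise ValueError("Monte Carlo estimates are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical validation values for one operating characteristic.
validation_mc = [0.10, 0.35, 0.62, 0.81, 0.93]   # independent Monte Carlo estimates
validation_nn = [0.12, 0.33, 0.60, 0.83, 0.92]   # interpolated values at the same points
accurate_enough = r_squared(validation_mc, validation_nn) > 0.96
```

Because the Monte Carlo reference values carry their own sampling error, a value slightly below one is expected even for a very accurate interpolation.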
We compare two simulation reports, and our goal is to provide stakeholders the simplified version if it accurately describes the operating characteristics. The first one includes scenarios from restricted by prior knowledge from completed studies and clinical experience and the second includes scenarios from further restricted by fixing the value of some entries of as described above. We use simulated annealing to identify two sets of scenarios in and , respectively. In both cases we minimize the same loss function defined over K-tuples of points. We also calculate the loss associated with these two optimal sets of scenarios from and . In Figure 3, we illustrate the difference in loss between these two optimal sets; as expected, the loss decreases as increases. We observe in Figure 3 that for any value of , the loss associated with the optimal set of scenarios restricted to is larger compared to the optimal scenarios in . However, the difference is modest, and the gain in interpretability of a sensitivity analysis report with fewer unknown parameters may be worth the slightly larger loss. For example, if an investigator requires the loss to be under a threshold of , then it is sufficient to consider scenarios, regardless of whether we consider scenarios selected from or
Fig. 3. Clinical trial design with an interim analysis and an auxiliary endpoint. A graphical representation to choose the number of sensitivity scenarios . We compare optimal sets of scenarios selected from and from the lower-dimensional restriction .
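The scenario-selection step can be sketched with a small simulated annealing routine. This is a simplified illustration under stated assumptions, not the implementation used for Figure 3. In particular, `representation_loss` is a hypothetical stand-in for the loss function of the paper (each plausible point is represented by its nearest selected scenario), the plausible space is replaced by a finite candidate grid, and the linear cooling schedule, the toy operating characteristic, and all tuning values are illustrative choices.

```python
import math
import random


def representation_loss(selected, points, oc):
    """Assumed loss: each plausible point is represented by the closest selected
    scenario, and the loss averages the absolute gap in the operating characteristic."""
    total = 0.0
    for i, p in enumerate(points):
        nearest = min(selected, key=lambda s: math.dist(p, points[s]))
        total += abs(oc[i] - oc[nearest])
    return total / len(points)


def anneal_scenarios(points, oc, k, n_iter=1000, t_start=0.05, seed=0):
    """Simulated annealing over K-subsets of candidate points (one swap per move)."""
    if not 0 < k < len(points):
        raise ValueError("k must be positive and smaller than the number of candidates")
    rng = random.Random(seed)
    current = rng.sample(range(len(points)), k)
    current_loss = representation_loss(current, points, oc)
    best, best_loss = list(current), current_loss
    for step in range(n_iter):
        temperature = t_start * (1.0 - step / n_iter) + 1e-9  # linear cooling
        outside = [j for j in range(len(points)) if j not in current]
        proposal = list(current)
        proposal[rng.randrange(k)] = rng.choice(outside)
        proposal_loss = representation_loss(proposal, points, oc)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        accept = proposal_loss <= current_loss or rng.random() < math.exp(
            (current_loss - proposal_loss) / temperature)
        if accept:
            current, current_loss = proposal, proposal_loss
            if current_loss < best_loss:
                best, best_loss = list(current), current_loss
    return sorted(best), best_loss


# Toy example: a 6 x 6 grid and an operating characteristic driven by the first parameter.
grid = [(a / 5, b / 5) for a in range(6) for b in range(6)]
toy_power = [1 / (1 + math.exp(-8 * (x - 0.5))) for x, _ in grid]
scenarios, loss = anneal_scenarios(grid, toy_power, k=4)
```

Repeating the call for several values of `k`, and for candidate grids restricted to a lower-dimensional subset, would produce the kind of comparison summarized in the figure.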
3.3. Application 3: Biomarker-driven adaptive enrichment
In the third example, we discuss sensitivity analyses dedicated to an adaptive trial with subpopulations defined by biomarkers, considering multiple unknown parameters and multiple operating characteristics of interest. As a motivating example, in several oncology trials, a major decision is whether to restrict patient enrollment to a targeted subgroup of patients (e.g., biomarker-positive subgroup) or to enroll a broader patient population. Enrolling only a biomarker-positive subgroup may deny a substantial number of patients access to an effective therapy, whereas enrolling a larger population may compromise the power to detect positive treatment effects. Several trial designs discussed in the literature attempt to address the outlined problem through interim looks at the data.
3.3.1. Trial design
We consider an adaptive two-stage enrichment trial design with one-to-one randomization (Jenkins et al., 2011; Jones et al., 2017; Mehta et al., 2019). The design is applicable in the setting where a biomarker-positive subgroup of patients is hypothesized to benefit more from the experimental treatment than the rest of the study population. The design includes a single interim analysis, and it uses progression-free survival (PFS) for interim decision-making, while overall survival (OS) is the endpoint for the final analysis, which occurs when a pre-specified number of events is reached. The interim analysis uses the estimated PFS hazard ratio to capture potential early signals of treatment effects. In the implementation of Jenkins et al. (2011), which we replicate, the HR is estimated for both the overall population and the biomarker-positive subgroup . An interim decision determines which group is enrolled and tested during the second stage of the trial:
A. Promising results in the biomarker-positive population.
If the HR estimate but , then the trial will continue enrolling only biomarker-positive patients and the final analysis will test . Here is the null hypothesis of no differences in OS between treatment and control groups in the biomarker-positive population. The null hypothesis is rejected if , where is a log-rank p-value computed using only OS data from patients randomized during the first (second) stage of the trial. The weights and the standard normal cumulative distribution function are used to summarize evidence of treatment effects from the two stages of the trial. We refer to Jenkins et al. (2011) for details on the choice of and other aspects of the final analysis.
B. Promising results in the overall population only.
If but , then the trial will continue enrolling all patients and the final analysis will only test , the null hypothesis of no differences in OS in the overall population. In this case the null hypothesis is tested using stage-specific OS log-rank p-values and combining evidence from the two stages of the trial.
C. Unpromising results.
If and , then the trial stops early for futility.
D. Promising early results for both populations.
Lastly, if the estimated HR in the biomarker-positive subgroup and the overall population , then the trial will continue enrolling all patients and testing efficacy both in the overall population and in the biomarker-positive subgroup.
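The four interim outcomes A to D can be written as a small decision function. This is a sketch under assumptions: we take an estimate to be "promising" when the estimated PFS hazard ratio falls below a cut-off, and the cut-offs are passed as arguments rather than fixed, since the function name, the argument names, and any values plugged in are hypothetical.

```python
def interim_decision(hr_positive: float, hr_overall: float,
                     cut_positive: float, cut_overall: float) -> str:
    """Map interim PFS hazard-ratio estimates to the outcomes A, B, C, or D.

    Assumption: an estimate is promising when it falls below its cut-off
    (hazard ratios under 1 favor the experimental arm).
    """
    promising_positive = hr_positive < cut_positive
    promising_overall = hr_overall < cut_overall
    if promising_positive and promising_overall:
        return "D"  # enroll all patients; test both populations
    if promising_positive:
        return "A"  # enroll biomarker-positive patients only; test that subgroup
    if promising_overall:
        return "B"  # enroll all patients; test the overall population only
    return "C"      # stop early for futility
```

In a simulation study, the relative frequency of each returned label across replicates estimates the probability of the corresponding second-stage enrollment decision.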
The potential conclusions at the final analysis are (i) to recommend the new treatment for biomarker-positive patients, (ii) to recommend the new treatment for both biomarker-positive and biomarker-negative patients, or (iii) not to recommend the experimental treatment for future patients.
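The way stage-wise evidence is combined at the final analysis can be illustrated with a weighted inverse normal combination of two one-sided p-values. This is a minimal sketch: the constraint that the squared weights sum to one, the level `alpha`, and the function names are assumptions, and any multiplicity adjustment needed when both populations are tested is not shown. We refer to Jenkins et al. (2011) for the actual procedure.

```python
from math import isclose, sqrt
from statistics import NormalDist


def inverse_normal_p_value(p_stage1: float, p_stage2: float, w1: float, w2: float) -> float:
    """Combine two stage-wise one-sided p-values with pre-specified weights."""
    if not isclose(w1 ** 2 + w2 ** 2, 1.0, abs_tol=1e-9):
        raise ValueError("weights are assumed to satisfy w1^2 + w2^2 = 1")
    if not (0.0 < p_stage1 < 1.0 and 0.0 < p_stage2 < 1.0):
        raise ValueError("p-values must lie strictly between 0 and 1")
    nd = NormalDist()
    z_combined = w1 * nd.inv_cdf(1.0 - p_stage1) + w2 * nd.inv_cdf(1.0 - p_stage2)
    return 1.0 - nd.cdf(z_combined)


def reject_null(p_stage1: float, p_stage2: float, w1: float, w2: float,
                alpha: float = 0.025) -> bool:
    """Reject when the combined p-value does not exceed the (assumed) level alpha."""
    return inverse_normal_p_value(p_stage1, p_stage2, w1, w2) <= alpha
```

For example, `reject_null(0.02, 0.02, sqrt(0.5), sqrt(0.5))` combines two moderately small stage-wise p-values with equal weights.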
3.3.2. Aims of the sensitivity analysis
We focus on the following three operating characteristics: (i) , the probability of enrolling only biomarker-positive patients in the second stage, (ii) , the probability of enrolling both biomarker-positive and biomarker-negative patients in the second stage, and (iii) , the probability of no evidence of positive treatment effects, which is equal to the probability of not rejecting the null hypotheses.
We choose plausible intervals for the unknown parameters based on prior literature. Specifically, the recruitment rate per week, the prevalence of the biomarker-positive subgroup , the PFS HR comparing the treatment and control groups in the biomarker-positive subgroup , the PFS HR comparing treatment and control in the biomarker-negative subgroup , the OS HR comparing treatment and control in the biomarker-positive subgroup , the OS HR comparing treatment and control groups in the biomarker-negative subgroup , the correlation between OS and PFS in the biomarker-positive subgroup , and the correlation between OS and PFS in the biomarker-negative subgroup . Marginal exponential distributions and latent frailty terms were used for simulating correlated OS and PFS times (Michael and Schucany, 2002). More flexible models such as the Weibull distribution can be considered.
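To give a sense of how correlated PFS and OS times with exponential margins can be generated, the sketch below uses a Gaussian copula. This is a swapped-in technique chosen for brevity, not the frailty-based mixture construction of Michael and Schucany (2002) used in our simulations. The function name, the rates, and the latent correlation `rho` are hypothetical. Note that `rho` is the correlation of the latent normal variables rather than of the event times, and that the sketch does not force PFS to precede OS.

```python
import math
import random
from statistics import NormalDist


def simulate_pfs_os(n: int, pfs_rate: float, os_rate: float, rho: float,
                    rng: random.Random) -> list:
    """Draw n (PFS, OS) pairs with exponential margins and Gaussian-copula dependence."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    nd = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Map to uniforms, clamped away from 1 so that the logarithm stays finite.
        u1 = min(nd.cdf(z1), 1.0 - 1e-12)
        u2 = min(nd.cdf(z2), 1.0 - 1e-12)
        pfs_time = -math.log1p(-u1) / pfs_rate   # inverse CDF of Exp(pfs_rate)
        os_time = -math.log1p(-u2) / os_rate     # inverse CDF of Exp(os_rate)
        pairs.append((pfs_time, os_time))
    return pairs
```

With exponential margins, a hazard ratio enters simply as a multiplier of the control-arm rate, so the treatment arm can be simulated by calling the function with `pfs_rate` and `os_rate` multiplied by the corresponding PFS and OS hazard ratios.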
3.3.3. Implementing and benchmarking ROSA
For the outlined two-stage trial with biomarker populations, our ROSA pipeline can be used to compute multiple simulation reports, varying both the list of operating characteristics and the definition of . For example, one can fix the OS HRs in the biomarker-positive and negative populations to focus on the design sensitivity to other parameters, such as the PFS HRs. Similarly, the set of unknown parameters can be restricted to values with positive effects only for the biomarker-positive population. Importantly, one set of training simulations can be re-utilized to compute multiple sensitivity tables where the definitions of and vary.
We examine the difference in the marginal losses
| (4) |
when the set of scenarios is chosen by optimizing different loss functions. For example, let be the set of scenarios that minimizes the marginal loss in (4). Similarly, let be the set of scenarios that minimizes the joint loss in (2). Then it is intuitive that . In other words, the marginal losses tend to be smaller when the set of scenarios is chosen to minimize compared to a set of scenarios that minimizes with the aim of representing multiple operating characteristics. If the discrepancy , is relatively small for all total operating characteristics, then this indicates that it is reasonable to select a single set of scenarios to illustrate how the operating characteristics vary jointly across . We describe the difference between the marginal losses , when scenarios in are chosen by optimizing in (4) – optimum: - or by optimizing as in (2) - optimum: . Recall that is computed with the goal of illustrating how multiple operating characteristics vary across while optimizes the representation of a single operating characteristic . The weights in (2) are . In Figure 4 panel 1, we plot in red and in blue. Similarly, in panel 2 we compare and , and in panel 3 we compare and . Our results indicate that for all three operating characteristics, ; as expected, the marginal losses increase when the set of scenarios is selected to illustrate jointly the variations of multiple operating characteristics across . However, this difference is small for all . Furthermore, for each , the relative difference is similar across the three operating characteristics (Figure 4). This result supports the use of identical weights and of a single sensitivity table, with the same set of scenarios to illustrate jointly all three operating characteristics.
Fig. 4. Marginal losses of different sets of scenarios (red) and (blue).
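The comparison between marginally optimal and jointly optimal scenario sets can be reproduced on a toy problem by exhaustive search. The sketch below is illustrative only: the nearest-scenario loss in `marginal_loss` is a hypothetical stand-in for the marginal loss in (4), the joint loss is taken to be a weighted sum of marginal losses with equal weights (an assumption), and the three toy operating characteristics are invented.

```python
from itertools import combinations


def marginal_loss(selected, points, oc):
    """Assumed marginal loss for one operating characteristic: average absolute gap
    between each point and its nearest selected scenario."""
    total = 0.0
    for i, p in enumerate(points):
        nearest = min(selected, key=lambda s: abs(p - points[s]))
        total += abs(oc[i] - oc[nearest])
    return total / len(points)


def joint_loss(selected, points, ocs, weights):
    """Assumed joint loss: weighted sum of the marginal losses."""
    return sum(w * marginal_loss(selected, points, oc) for w, oc in zip(weights, ocs))


def best_subset(loss, n_points, k):
    """Exhaustive search over all K-subsets (feasible only for small problems)."""
    return min(combinations(range(n_points), k), key=loss)


# Toy one-dimensional parameter grid and three invented operating characteristics.
points = [i / 10 for i in range(11)]
ocs = [
    [p ** 2 for p in points],
    [1 - p for p in points],
    [min(1.0, 2 * p) for p in points],
]
weights = [1 / 3, 1 / 3, 1 / 3]
k = 3

joint_opt = best_subset(lambda s: joint_loss(s, points, ocs, weights), len(points), k)
excess = []  # marginal loss at the joint optimum minus the marginal optimum, per characteristic
for oc in ocs:
    marginal_opt = best_subset(lambda s, oc=oc: marginal_loss(s, points, oc), len(points), k)
    excess.append(marginal_loss(joint_opt, points, oc) - marginal_loss(marginal_opt, points, oc))
```

Each entry of `excess` is non-negative by construction, and small values across all operating characteristics support reporting a single shared set of scenarios.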
4. Discussion
The evaluation of complex designs such as dose-finding studies (Iasonos et al., 2015), factorial trials (Green et al., 2002), and response-adaptive trials (Pallmann et al., 2018) focuses on multiple operating characteristics, such as the level of toxicities, the probability of selecting the correct treatment arm, or frequentist operating characteristics, including power and false positive probabilities. During the design stage of a complex clinical trial, simulation reports are typically produced to discuss sample size, interim analyses, and other major decisions with various stakeholders. The simulation report consists of one or a few tables dedicated to showcasing how major operating characteristics vary across potential values of unknown parameters in . In most cases, the analyst focuses on subsets of plausible parameters , for example, values concordant with previous studies, or subsets of potential values of particular interest, for example with positive and clinically relevant treatment effects.
Simulations are fundamental in the design of complex trials because operating characteristics can rarely be obtained analytically, and they are crucial in the assessment of study designs by regulators, pharmaceutical companies, and other stakeholders (Food and Drug Administration, 2020). However, a limited number of scenarios or poorly chosen scenarios could be inadequate to highlight variations of the operating characteristics across plausible unknown parameters and can result in sub-optimal decisions. We propose ROSA as a useful tool that can support investigators at this design stage when selecting which and how many scenarios to include in these simulation reports.
We focus on choosing an informative number of scenarios among the plausible unknown parameters to summarize the variations of key operating characteristics. Our approach minimizes an explicit loss function and uses established techniques for function approximation (NNs) and numerical optimization (simulated annealing). We showcase our approach in three trial designs. Importantly, our approach is general and can be applied to nearly any clinical trial design. It only requires simulations to mimic the clinical trial under hypothetical scenarios.
Although our approach is general, we focused on loss functions of a specific form (2). It is possible to consider different loss functions. For example, one could consider the loss function , where is a probability distribution on (e.g., a posterior distribution obtained from previous data). The distribution could be used to incorporate prior information about the unknown parameters in the selection of sensitivity scenarios. Moreover, the metric can be extended to capture both differences between operating characteristics at plausible values and other aspects, such as the difference between expected values of the outcomes at and .
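One possible way to incorporate such a distribution is to weight each plausible point by its density when averaging the representation error. The sketch below is a hedged illustration of this idea on a finite grid. The nearest-scenario error, the function name `weighted_loss`, and the bell-shaped density centered at 0.3 are all assumptions and do not reproduce the loss function described above.

```python
import math


def weighted_loss(selected, points, oc, density):
    """Density-weighted average of an assumed nearest-scenario error.

    `density` maps a parameter point to a non-negative weight (for example an
    unnormalized prior or posterior density); weights are normalized internally.
    """
    weights = [density(p) for p in points]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("density must be positive on at least one point")
    loss = 0.0
    for i, p in enumerate(points):
        nearest = min(selected, key=lambda s: math.dist(p, points[s]))
        loss += weights[i] * abs(oc[i] - oc[nearest])
    return loss / total_weight


# Toy one-dimensional example with a hypothetical density concentrated near 0.3.
points = [(i / 10,) for i in range(11)]
oc = [p[0] ** 2 for p in points]


def prior(p):
    return math.exp(-((p[0] - 0.3) ** 2) / (2 * 0.1 ** 2))


loss_uniform = weighted_loss([2, 7], points, oc, lambda p: 1.0)
loss_prior = weighted_loss([2, 7], points, oc, prior)
```

Under such a weighting, scenarios in regions with little probability mass contribute less to the loss, so the optimization tends to place scenarios where the unknown parameters are considered most likely.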
One major challenge in the presentation of simulation reports is the need for simplicity and interpretability of the results. To this end, we considered fixing one or more unknown parameters to identical values across the scenarios, which may be reasonable when there is a priori knowledge of certain unknown parameters. There are other ways to simplify a simulation report, such as removing operating characteristics that do not vary across plausible unknown parameters, or reporting only the range of the operating characteristics across instead of presenting the operating characteristics for each representative scenario.
Variations of the ROSA approach may also consider optimization algorithms other than simulated annealing and regression methods other than NNs for approximating the operating characteristics across . The methodology that we proposed here can be used to handle other relevant problems, such as missing data. Indeed, during the design of the trial, there is often uncertainty about whether the analyses will involve missing data and about the potential consequences of the missingness pattern. Probability models that include pre-treatment variables, outcomes, and missing data patterns are useful to explore the robustness of the design. In this case, ROSA can support the selection of scenarios with different missing data patterns.
Acknowledgements
The authors thank Cyrus Mehta and Christina Howe for helpful conversations and feedback that greatly enhanced the paper. LH was supported by the Clinical Orthopedic and Musculoskeletal Education and Training (COMET) Program, NIAMS grant T32 AR055885. LT was supported by the NIH grant R01LM013352.
Footnotes
Conflicts of interest
The authors report there are no competing interests to declare.
References
- Bélisle CJ (1992). Convergence theorems for a class of simulated annealing algorithms on R^d. Journal of Applied Probability, pages 885–895.
- Berry SM, Carlin BP, Lee JJ, and Muller P (2010). Bayesian adaptive methods for clinical trials. CRC Press.
- Bookstein FL (1989). Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on pattern analysis and machine intelligence 11, 567–585.
- Food and Drug Administration (2020). Interacting with the FDA on complex innovative trial designs for drugs and biological products. Updated December.
- Goodfellow I, Bengio Y, and Courville A (2016). Deep learning. MIT Press.
- Green S, Liu P-Y, and O'Sullivan J (2002). Factorial design considerations. Journal of Clinical Oncology 20, 3424–3430.
- Han K, Ren M, Wick W, Abrey L, Das A, Jin J, and Reardon DA (2014). Progression-free survival as a surrogate endpoint for overall survival in glioblastoma: a literature-based meta-analysis from 91 trials. Neuro-oncology 16, 696–706.
- Hobbs BP, Barata PC, Kanjanapan Y, Paller CJ, Perlmutter J, Pond GR, Prowell TM, Rubin EH, Seymour LK, Wages NA, et al. (2019). Seamless designs: current practice and considerations for early-phase drug development in oncology. JNCI: Journal of the National Cancer Institute 111, 118–128.
- Hornik K (1991). Approximation capabilities of multilayer feedforward networks. Neural networks 4, 251–257.
- Husmann K, Lange A, and Spiegel E (2017). The R package optimization: Flexible global optimization with simulated-annealing.
- Iasonos A, Gönen M, and Bosl GJ (2015). Scientific review of phase I protocols with novel dose-escalation designs: how much information is needed? Journal of Clinical Oncology 33, 2221.
- Jenkins M, Stone A, and Jennison C (2011). An adaptive seamless phase II/III design for oncology trials with subpopulation selection using correlated survival endpoints. Pharmaceutical statistics 10, 347–356.
- Jones RL, Attia S, Mehta CR, Liu L, Sankhala KK, Robinson SI, Ravi V, Penel N, Stacchiotti S, Tap WD, et al. (2017). TAPPAS: An adaptive enrichment phase 3 trial of TRC105 and pazopanib versus pazopanib alone in patients with advanced angiosarcoma (AAS). J. Clin. Oncol 35, TPS11081.
- Kirkpatrick S, Gelatt CD, and Vecchi MP (1983). Optimization by simulated annealing. Science 220, 671–680.
- Leshno M, Lin VY, Pinkus A, and Schocken S (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks 6, 861–867.
- Mayer C, Perevozskaya I, Leonov S, Dragalin V, Pritchett Y, Bedding A, Hartford A, Fardipour P, and Cicconetti G (2019). Simulation practices for adaptive trial designs in drug and device development. Statistics in Biopharmaceutical Research 11, 325–335.
- McKay MD, Beckman RJ, and Conover WJ (2000). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 42, 55–61.
- Mehta C, Liu L, and Theuer C (2019). An adaptive population enrichment phase III trial of TRC105 and pazopanib versus pazopanib alone in patients with advanced angiosarcoma (TAPPAS trial). Annals of Oncology 30, 103–108.
- Michael J and Schucany W (2002). The mixture approach for simulating new families of bivariate distributions with specified correlations. The American Statistician 56, 48–54.
- Niewczas J, Kunz CU, and König F (2019). Interim analysis incorporating short- and long-term binary endpoints. Biometrical Journal 61, 665–687.
- Pallmann P, Bedding AW, Choodari-Oskooei B, Dimairo M, Flight L, Hampson LV, Holmes J, Mander AP, Odondi L, Sydes MR, et al. (2018). Adaptive designs in clinical trials: why use them, and how to run and report them. BMC medicine 16, 1–15.
- Rasmussen CE (2003). Gaussian processes in machine learning. In Summer school on machine learning, pages 63–71. Springer.
- Razavi S, Jakeman A, Saltelli A, Prieur C, Iooss B, Borgonovo E, Plischke E, Piano SL, Iwanaga T, Becker W, et al. (2021). The future of sensitivity analysis: an essential discipline for systems modeling and policy support. Environmental Modelling & Software 137, 104954.
- Spall JC (2005). Introduction to stochastic search and optimization: estimation, simulation, and control. John Wiley & Sons.
- Thorlund K, Haggstrom J, Park JJ, and Mills EJ (2018). Key design considerations for adaptive clinical trials: a primer for clinicians. BMJ 360, k698.
