Biostatistics (Oxford, England). 2016 Sep 20;18(2):230–243. doi: 10.1093/biostatistics/kxw043

Statistical methods for down-selection of treatment regimens based on multiple endpoints, with application to HIV vaccine trials

Ying Huang 1,2, Peter B Gilbert 1,2, Rong Fu 1,2, Holly Janes 1,2
PMCID: PMC6092735  PMID: 27649715

SUMMARY

Biomarker endpoints measuring vaccine-induced immune responses are essential to HIV vaccine development because of their potential to predict the effect of a vaccine in preventing HIV infection. A vaccine’s immune response profile observed in phase I immunogenicity studies is a key factor in determining whether it is advanced for further study in phase II and III efficacy trials. The multiplicity of immune variables and scientific uncertainty in their relative importance, however, pose great challenges to the development of formal algorithms for selecting vaccines to study further. Motivated by the practical need to identify a set of promising vaccines from a pool of candidate regimens for inclusion in an upcoming HIV vaccine efficacy trial, we propose a new statistical framework for the selection of vaccine regimens based on their immune response profile. In particular, we propose superiority and non-redundancy criteria to be achieved in down-selection, and develop novel statistical algorithms that integrate hypothesis testing and ranking for selecting vaccine regimens satisfying these criteria. Performance of the proposed selection algorithms is evaluated through extensive numerical studies. We demonstrate the application of the proposed methods through the comparison of immune responses between several HIV vaccine regimens. The methods are applicable to general down-selection applications in clinical trials.

Keywords: Clustering, Down-selection, Hypothesis testing, Non-redundancy, Ranking, Superiority

1. INTRODUCTION

In HIV vaccine studies, immune response biomarkers play a key role in vaccine development because of their potential to predict the effect of vaccines in preventing HIV infection. A vaccine’s immune response profile observed in phase I immunogenicity studies largely determines whether it is advanced for further study in phase II and III efficacy trials. Following the success of the RV144 Thailand HIV vaccine trial (Rerks-Ngarm and others, 2009), which demonstrated for the first time positive vaccine efficacy, the HIV Vaccine Trials Network (HVTN) is planning a multi-regimen phase IIb efficacy trial in Southern Africa. In this trial, a set of qualifying prime-boost vaccine regimens will be simultaneously evaluated against a shared placebo group for comparison of vaccine efficacy and evaluation of immune correlates of protection (Gilbert and others, 2011). There will be constraints on the maximum number of regimens allowed in the efficacy trial due to budget limits. To prepare for this efficacy trial, 15 vaccine regimens, combining 5 unique prime-boost types and 3 Env dose × adjuvant types, are currently planned for 2 phase I immunogenicity studies, based on which up to 3 regimens will be chosen to advance to the efficacy trial.

Although this type of “pick-the-winner” design has been commonly studied by clinical trial researchers (Mandrekar and Sargent, 2006), the current application is associated with unique challenges due to the multiplicity of immune responses and the need to select multiple winners. Existing research typically focuses on either the selection of a single best regimen based on one or two endpoints or the comparison of a pair of regimens with respect to univariate or multivariate endpoints. For example, in oncology research, randomized phase II trials are often used to select a regimen with the best primary endpoint (e.g., response rate) to take forward to a larger phase III trial (Simon and others, 1985; Sargent and Goldberg, 2001; Steinberg and Venzon, 2002). In these trials it is important to have a high probability of selecting a regimen that is superior to others. Other authors have studied the comparison between an experimental regimen and a control regimen with respect to multiple endpoints. The objective is to demonstrate that the experimental regimen is better than the control regimen with respect to some endpoints and not worse with respect to any (Tang, 1994; Follmann, 1996; Bloch and others, 2001; Perlman and Wu, 2004; Tamhane and Logan, 2004; Röhmel and others, 2006; Bloch and others, 2007). These approaches emphasize maximizing the power of the comparison while stringently controlling the type-I error rate when the experimental regimen is not better.

Similar to these earlier studies, we are interested in identifying regimens with superior outcomes (immunogenicity). The problem we tackle, however, involves new challenges. In particular, there is still substantial uncertainty regarding the relationship between an HIV vaccine-induced immune response profile and a vaccine’s protective effect. One essential requirement for our approach is thus to advance regimens with diverse immune profiles to ensure that different vaccines with possibly different mechanisms of protection can be evaluated in the efficacy trial. Controlling the rate of advancing regimens with similar or redundant immune profiles is important in our application due to the high cost and operational challenge of HIV vaccine efficacy studies. Currently there are no existing guidelines for selecting the top few vaccine regimens given a multidimensional immune response profile based on this consideration and superiority of responses. New selection criteria and selection algorithms need to be developed to achieve these goals.

For our future application, we have adopted a plan that will first evaluate each regimen individually based on its phase I trial safety and immunogenicity data, to determine whether it passes some minimum bars to be eligible for down-selection; a formal down-selection process will then be performed among those eligible regimens, based on the comparison of their immune profiles. This plan has been summarized in a recently published clinical paper (Huang and others, 2016). The focus of this paper is to detail the development of the statistical framework and methods for the formal down-selection process that involves head-to-head comparisons between regimens. The research has application beyond HIV vaccine trials for selecting the best several interventions based on multivariate endpoints measured in phase I/II randomized clinical trials.

In Section 2, we propose formal superiority and non-redundancy criteria for selecting regimens, and statistical algorithms for down-selection that integrate hypothesis testing and ranking techniques. Different down-selection algorithms are evaluated and compared through extensive simulation studies in Section 3. In Section 4 we apply the proposed methods to immune response data from HIV vaccine trials. We then complete the paper with concluding remarks.

2. METHODS

Suppose a set of immune endpoints Inline graphic of dimension Inline graphic are measured at a single time point for every participant from Inline graphic eligible vaccine regimens for down-selection. The objective is to select between one and Inline graphic regimens out of the Inline graphic based on information on Inline graphic, where Inline graphic is a fixed positive integer >1 and Inline graphicInline graphic. These selected regimens will then advance to the efficacy trial. In our application, Inline graphic equals 3 as a decision of the HVTN leadership. Let Inline graphic be the regimen indicator, with Inline graphic indicating the sample size for the Inline graphicth regimen. Let Inline graphic be the participant indicator, which takes values Inline graphic for regimen Inline graphic. Let Inline graphic be the immune endpoint indicator. We use Inline graphic to indicate the value for the Inline graphicth immune endpoint measured for participant Inline graphic from regimen Inline graphic. Next, we first present the criteria for down-selection and then propose statistical algorithms to select vaccine regimens satisfying these criteria.

2.1. Criteria for down-selection

We define two criteria that are considered important to achieve in down-selection. The first criterion is “superiority.” That is, vaccine regimens selected should be superior to regimens not selected based on the comparison of individual immune endpoints. Without loss of generality, we assume that for each individual endpoint considered, a larger immune response is associated with a better protective effect of vaccine and thus is desirable. Furthermore, we base superiority/inferiority on comparisons of mean immune responses. Assuming the immune endpoint Inline graphic measured on a regimen Inline graphic is continuous with mean Inline graphic and variance Inline graphic, a regimen Inline graphic is considered superior, inferior, or equivalent to another regimen Inline graphic with respect to a particular endpoint Inline graphic if the mean of the endpoint in Inline graphic is larger (Inline graphic), smaller (Inline graphic), or the same (Inline graphic) compared to Inline graphic. The same can be said if the immune endpoint follows a Bernoulli distribution with value 1 or 0 indicating positive or negative response (i.e., Inline graphic).

We define regimen A to be superior to regimen B with respect to its immune profile if A is superior to B with respect to at least one endpoint and is not inferior to B with respect to any endpoint. Analogously, we define A to be inferior to B with respect to its immune profile if A is inferior to B with respect to at least one endpoint and is not superior to B with respect to any endpoint. Thus regimen A being superior to B with respect to its immune profile is the same as B being inferior to A. Two regimens A and B are said to have equivalent immune profiles if they are equivalent with respect to each of the endpoints considered. Given a set of down-selected vaccine regimens, the superiority criterion is fully satisfied if no selected regimen is inferior to any other regimen that enters down-selection.

The second criterion we consider is “non-redundancy.” Given limited resources for conducting the efficacy trial, it is desirable to select vaccine regimens with non-redundant immune profiles. That is, if multiple regimens are selected, we prefer that each selected regimen is superior to the others with respect to different immune endpoints, such that diverse mechanisms of vaccine protective effects can be investigated in the efficacy trial. We define a vaccine regimen A to be non-redundant to regimen B if A is not superior, not inferior, and not equivalent to B with respect to its immune profile. Equivalently, non-redundancy means A is superior to B with respect to at least one immune endpoint and B is superior to A with respect to at least one immune endpoint. Non-redundancy can be similarly defined for a set of selected regimens: we consider the “non-redundancy” criterion satisfied if there is no pair of regimens within the set that are redundant compared to one another. A toy example is presented in Figure 1 to illustrate the two criteria.

Fig. 1.

A toy example to demonstrate the superiority and non-redundancy criteria. Suppose five regimens with mean structure in (a) enter the down-selection. The pairwise relationship is shown in (b), where an arrow indicates that one regimen is superior to the other with respect to their immune profiles, pointing to the inferior regimen, and a second connector symbol indicates equivalence between two regimens with respect to their immune profiles. Panels (c)–(f) display some possible outcomes based on the down-selection. The non-redundancy criterion is satisfied if there is no arrow or equivalence connector between any two regimens in the selected set; the superiority criterion is satisfied if, for any regimen in the selected set, there is no arrow pointing to it in the original set (b). Thus neither criterion is satisfied in (c); superiority but not non-redundancy is satisfied in (d); non-redundancy but not superiority is satisfied in (e); both criteria are satisfied in (f).
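The pairwise definitions of Section 2.1 can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: it works on true (known) endpoint means rather than data, and the function names `compare_profiles` and `check_criteria`, as well as the regimen labels in the usage note, are hypothetical.

```python
def compare_profiles(mean_a, mean_b):
    """Classify regimen A relative to regimen B from their endpoint means.

    Larger means are assumed to be better on every endpoint.
    """
    a_better = any(a > b for a, b in zip(mean_a, mean_b))
    b_better = any(b > a for a, b in zip(mean_a, mean_b))
    if a_better and b_better:
        return "non-redundant"  # each regimen wins on at least one endpoint
    if a_better:
        return "superior"       # A wins somewhere and loses nowhere
    if b_better:
        return "inferior"       # B wins somewhere and loses nowhere
    return "equivalent"         # identical means on every endpoint


def check_criteria(means, selected):
    """Return (superiority_ok, non_redundancy_ok) for a selected set.

    means: dict mapping regimen label -> list of endpoint means.
    selected: list of labels chosen by some down-selection rule.
    """
    # Superiority: no selected regimen is inferior to any regimen that entered.
    superiority_ok = all(
        compare_profiles(means[s], means[r]) != "inferior"
        for s in selected for r in means if r != s
    )
    # Non-redundancy: every pair inside the selected set is non-redundant.
    non_redundancy_ok = all(
        compare_profiles(means[s], means[t]) == "non-redundant"
        for i, s in enumerate(selected) for t in selected[i + 1:]
    )
    return superiority_ok, non_redundancy_ok
```

With illustrative means such as `{"R1": [2, 1], "R2": [1, 2], "R3": [1, 1]}`, selecting `["R1", "R2"]` satisfies both criteria, whereas selecting `["R1", "R3"]` satisfies neither, because R3 is inferior to R1.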

2.2. Methods for down-selection

In order to select vaccine regimens satisfying the superiority and non-redundancy criteria, we develop statistical algorithms that integrate hypothesis testing and ranking techniques.

2.2.1. Hypothesis testing.

Consider the comparison between a pair of regimens A and B with respect to their immune profiles. We propose conducting formal hypothesis testing based on the comparison of individual endpoints between the two regimens.

First, for an endpoint Inline graphic, there are two directions in which the superiority of one regimen relative to the other can be tested: Inline graphic may be superior to Inline graphic or Inline graphic may be superior to Inline graphic. For superiority of Inline graphic, we test the null hypothesis Inline graphic against the alternative hypothesis Inline graphic, using, e.g., the one-sided Wald test or t-test for continuous immune responses or a one-sided Wald test for binary immune response positivity. Similarly, for superiority of Inline graphic, one can test the null hypothesis Inline graphic against the alternative hypothesis Inline graphic.

Next, the comparisons of individual endpoints between the two regimens can be integrated to test for the non-inferiority of one regimen relative to the other as below. Let Inline graphic, the intersection of Inline graphic, and Inline graphic, the union of Inline graphic. Then Inline graphic denotes the null hypothesis that regimen Inline graphic is inferior or equivalent to regimen Inline graphic with respect to its immune profile while Inline graphic denotes the alternative hypothesis that regimen Inline graphic is superior or non-redundant to Inline graphic. Similarly, Inline graphic and Inline graphic express the null hypothesis that Inline graphic is inferior or equivalent to Inline graphic and the alternative hypothesis that Inline graphic is superior or non-redundant to Inline graphic. Note that our definition of equivalence is different from what is used in “bio-equivalence studies” in clinical pharmacokinetics, where, e.g., no difference is declared between two regimens if the 90% confidence interval of the ratio of an exposure measure falls completely within the range 80–125%.

It follows that the non-redundancy criterion can be captured using the above hypotheses. In particular, if either Inline graphic or Inline graphic is true, i.e., Inline graphic holds, then one of the regimens is inferior or equivalent to the other, and thus the two regimens are redundant. On the other hand, if both Inline graphic and Inline graphic are true, i.e., Inline graphic holds, then the two regimens are non-redundant. A summary of these hypotheses can be found in Table 1.

Table 1. Comparison between two regimens A & B

Comparison with respect to an individual assay
Null hypothesis: A inferior or equivalent to B. Alternative hypothesis: A superior to B.
Null hypothesis: B inferior or equivalent to A. Alternative hypothesis: B superior to A.

Comparison with respect to immune profile
Null hypothesis: A inferior or equivalent to B. Alternative hypothesis: A superior or non-redundant to B.
Null hypothesis: B inferior or equivalent to A. Alternative hypothesis: B superior or non-redundant to A.
Null hypothesis: A & B redundant. Alternative hypothesis: A & B not redundant.

In our down-selection application to select HIV vaccine regimens for advancing to the efficacy trial, the satisfaction of the non-redundancy criterion is of paramount importance: Phase IIb or III efficacy trials are large and operationally challenging, and including regimens redundant to each other in the efficacy trial can be a huge waste of resources. We therefore propose methods that aim to best select regimens satisfying the superiority criterion while controlling the probability of violating the non-redundancy criterion, i.e., the probability of including redundant regimens in the down-selected set. As will be shown in the next section, this probability can be controlled by controlling the error of incorrectly declaring a pair of redundant regimens to be non-redundant, the “pairwise false non-redundancy error” (PW-FNRE), at a specific level that takes into consideration the multiple pairs of regimens compared during down-selection. Next, we describe the testing procedure that controls PW-FNRE for comparison between a particular pair of regimens.

Consider two regimens A and B that are redundant, i.e., Inline graphic holds. If we were to bound the PW-FNRE under a significance level Inline graphic, applying the intersection-union principle (Roy, 1953), we can control the error of falsely rejecting Inline graphic or Inline graphic each at level Inline graphic. Since Inline graphic is the intersection of Inline graphic, we adjust for multiple testing of individual endpoints using some step-down procedure, such as the Holm–Bonferroni procedure (Holm, 1979), as described in Web supplementary Appendix A available at Biostatistics online. We do the same multiple testing adjustment for the test of Inline graphic at level Inline graphic. Rejecting Inline graphic shows A is superior or non-redundant to B. Rejecting Inline graphic shows B is superior or non-redundant to A. Because it is not possible that A is superior to B and B is superior to A, it follows that rejecting both Inline graphic and Inline graphic implies A and B are non-redundant. We fail to demonstrate their non-redundancy otherwise.
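The pairwise testing procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: it uses a normal-approximation one-sided Wald test on continuous endpoints, a standard Holm step-down adjustment within each direction, and a caller-supplied per-direction significance level `level`. All function names are hypothetical.

```python
import math
from statistics import mean, variance


def one_sided_wald_p(x, y):
    """P-value for H0: mean(x) <= mean(y) against H1: mean(x) > mean(y)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    z = (mean(x) - mean(y)) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


def holm_reject(p_values, level):
    """Holm step-down procedure; returns one rejection flag per hypothesis."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    for step, i in enumerate(order):
        if p_values[i] <= level / (len(p_values) - step):
            reject[i] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are too
    return reject


def compare_pair(data_a, data_b, level):
    """Test both directions for regimens A and B.

    data_a, data_b: one list of participant values per endpoint.
    Returns (a_up, b_up, non_redundant), where a_up means A is declared
    superior or non-redundant to B, and similarly for b_up.
    """
    p_a = [one_sided_wald_p(xa, xb) for xa, xb in zip(data_a, data_b)]
    p_b = [one_sided_wald_p(xb, xa) for xa, xb in zip(data_a, data_b)]
    a_up = any(holm_reject(p_a, level))
    b_up = any(holm_reject(p_b, level))
    # Non-redundancy is declared only when both directions are rejected.
    return a_up, b_up, a_up and b_up
```

Because a direction is rejected as soon as any endpoint-level hypothesis is rejected, the decision in each direction depends only on whether the smallest p-value clears the first Holm threshold; the full step-down is kept here so that the individual endpoint decisions remain available.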

In the next section, we will propose algorithms for down-selection that aim to control the probability of including redundant regimens in the selected set, through adjustment of Inline graphic during the comparison of each regimen pair. These algorithms combine hypothesis testing with ranking.

2.2.2. Ranking.

Ranking is another statistical technique we incorporate into the down-selection. We generate a univariate summary score for each regimen based on its immune profile, which is then used to order candidate regimens. In this paper we focus on deriving a summary score based on means of individual endpoints. The most appropriate way for integrating means across endpoints depends on the scientific question of interest. For example, a natural choice is the weighted average of mean values across immune endpoints, where the weight given to each endpoint reflects knowledge and beliefs about the relative importance of that endpoint to a vaccine’s protective effect. We name this ranking method the “average score” (AS) method. Alternatively, if one favors regimens with good ranks in means for several endpoints rather than regimens with very large means for a particular endpoint, regimens can be first ranked for each individual endpoint based on the mean and then a weighted average of the ranks across endpoints is computed. We name this ranking method the “average rank” (AR) method.
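The two summary scores can be illustrated with a short sketch. It assumes equal weights by default, takes estimated endpoint means as input, and assigns tied means their average rank; the tie rule and the function names are assumptions made for illustration and are not taken from the paper.

```python
def average_score(means, weights=None):
    """AS: weighted average of endpoint means (larger is better)."""
    n_endpoints = len(next(iter(means.values())))
    w = weights or [1.0 / n_endpoints] * n_endpoints
    return {r: sum(wp * m for wp, m in zip(w, vals)) for r, vals in means.items()}


def average_rank(means, weights=None):
    """AR: weighted average of per-endpoint ranks (rank 1 = largest mean)."""
    labels = list(means)
    n_endpoints = len(next(iter(means.values())))
    w = weights or [1.0 / n_endpoints] * n_endpoints
    scores = {r: 0.0 for r in labels}
    for p in range(n_endpoints):
        column = [means[r][p] for r in labels]
        for r in labels:
            higher = sum(v > means[r][p] for v in column)
            tied = sum(v == means[r][p] for v in column)  # includes r itself
            scores[r] += w[p] * (higher + (tied + 1) / 2.0)  # midrank for ties
    return scores


def order_by_as(means, weights=None):
    """Regimens ordered best first under AS (largest score first)."""
    s = average_score(means, weights)
    return sorted(s, key=lambda r: -s[r])


def order_by_ar(means, weights=None):
    """Regimens ordered best first under AR (smallest average rank first)."""
    s = average_rank(means, weights)
    return sorted(s, key=lambda r: s[r])
```

For hypothetical means `{"R1": [10.0, 0.7, 0.0], "R2": [1.0, 1.0, 1.0], "R3": [0.5, 0.5, 0.5]}`, AS places R1 first because of its one very large mean, whereas AR places R2 first because it ranks well on most endpoints, which is the contrast described above.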

The uses of ranking are 2-fold in down-selection. First, ranking provides a solution to avoid the non-transitivity issue that can appear in pairwise comparisons between regimens. Suppose we perform hypothesis testing between all regimen pairs. Non-transitivity could happen in that regimen A is declared to be superior or non-redundant to B but not vice versa, B is declared superior or non-redundant to C but not vice versa, and C is declared superior or non-redundant to A but not vice versa. Providing regimens with an a priori order allows us to sequentially investigate regimens following the order, such that a regimen with better rank has an advantage during selection. Second, ranking can serve as a tie-breaker when we fail to demonstrate that either of a pair of regimens is superior or non-redundant to the other. That is, if neither null hypothesis is rejected when comparing regimens A and B, the regimen with worse ranking will be filtered out.

Next we propose an algorithm that combines ranking and hypothesis testing, the “ranking, filtering, and selection” (RFS) algorithm, which always selects between 1 and Inline graphic regimens.

Step 1. We rank all regimens according to a univariate summary score. When ties exist, which can happen when using AR, we use an additional summary score to break the ties or randomly assign the order between regimens with equal summary scores. We first select the top-ranked regimen, and then evaluate the regimen ranked next as described in Step 2.

Step 2. We define the comparison of a new regimen with every regimen in a non-empty down-selected set as one iteration. At each iteration, suppose there are Inline graphic regimens already selected, namely Inline graphic, for some positive integer Inline graphic. For each Inline graphic, we compare the regimen ranked next, namely regimen Inline graphic, with Inline graphic as in Step 2(A) and 2(B).

Step 2(A). For each individual endpoint Inline graphic, we conduct a one-sided Wald test of Inline graphic versus Inline graphic, and control the error of falsely rejecting any null hypothesis Inline graphic among Inline graphic at level Inline graphic through a Holm procedure. We do the same thing for test of Inline graphic against Inline graphic.

Step 2(B). We declare Inline graphic superior or non-redundant to Inline graphic if at least one Inline graphic among Inline graphic is rejected, and we declare Inline graphic superior or non-redundant to Inline graphic if at least one Inline graphic among Inline graphic is rejected. If both Inline graphic and Inline graphic are rejected, we declare Inline graphic and Inline graphic to be non-redundant; otherwise we fail to establish non-redundancy of the two regimens. As shown in Section 2.2.1, this allows us to control PW-FNRE, the error of falsely declaring a particular pair of regimens to be non-redundant at level Inline graphic.

Step 3. If relative to any Inline graphic, Inline graphic, we fail to declare that Inline graphic is superior or non-redundant, we do not select Inline graphic; otherwise, we select Inline graphic and filter out all Inline graphic that we fail to declare non-redundant to Inline graphic. Regimens that were filtered out will not enter the down-selection process again.

Step 4. We continue Steps 2–3 until Inline graphic regimens have been selected or all regimens have been evaluated once.
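Steps 1 to 4 can be sketched as a single loop. The sketch below is one reading of the steps above, not the authors' implementation: it takes a pre-ranked list of regimens and a caller-supplied `compare(new, old)` function returning two flags (whether the new regimen is declared superior or non-redundant to the old one, and vice versa). The helper `make_oracle_compare`, which decides from true means as if every test had perfect power, is purely hypothetical and is included only to make the example runnable.

```python
def rfs(ranked, compare, max_selected):
    """Ranking, filtering, and selection over regimens ordered best first."""
    selected = [ranked[0]]                      # Step 1: take the top-ranked regimen
    for new in ranked[1:]:
        if len(selected) >= max_selected:       # Step 4: stop once the cap is reached
            break
        # Step 2: one iteration compares the new regimen with every selected one.
        results = {old: compare(new, old) for old in selected}
        # Step 3: skip the new regimen if it fails against any selected regimen.
        if not all(new_up for new_up, _ in results.values()):
            continue
        # Otherwise keep only selected regimens declared non-redundant to it.
        selected = [old for old in selected if results[old][1]]
        selected.append(new)
    return selected


def make_oracle_compare(means):
    """Hypothetical stand-in for the tests: decides from true endpoint means."""
    def compare(new, old):
        new_up = any(a > b for a, b in zip(means[new], means[old]))
        old_up = any(b > a for a, b in zip(means[new], means[old]))
        return new_up, old_up
    return compare
```

With means `{"R1": [2, 1], "R2": [1, 2], "R3": [1, 1]}` and the order R1, R2, R3, the loop returns R1 and R2. If the ranking mistakenly places R3 first, R3 is filtered out as soon as R1 is evaluated, which illustrates how an earlier selection can be displaced.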

In the following we consider the control of the probability of including redundant regimens in the final down-selected set. We propose to do so by controlling the probability of falsely advancing both regimens in at least one regimen-pair comparison during the entire RFS process, the “family-wise false non-redundancy error” (FW-FNRE). Note that if the final selected set contains redundant regimens, then one must have falsely advanced both regimens in some pairwise comparisons in RFS, although the reverse is not necessarily true since our algorithm allows regimens selected earlier to be filtered out. Therefore, procedures that control the FW-FNRE will also control the probability of including redundant regimens in the final down-selected set. We propose to control FW-FNRE in RFS at a desired significance level Inline graphic through adjustment of Inline graphic, the significance level for PW-FNRE, in Step 2 of RFS. We describe two choices of Inline graphic below.

First, among the Inline graphic regimens to be selected, there are Inline graphic regimen pairs for comparison. Therefore, applying the Bonferroni procedure, one can control the FW-FNRE of an RFS procedure at level Inline graphic by choosing Inline graphic. We name this algorithm the RFS-I algorithm. RFS-I stringently controls the FW-FNRE and consequently the probability of advancing redundant regimens. Depending on the distribution of the data, the actual FW-FNRE of RFS-I can be smaller than Inline graphic for several reasons. First, regimens in some pairs may not be redundant to each other. Second, comparisons will not necessarily happen between all pairs of regimens in the RFS process. Third, in a particular iteration of the RFS process, a new regimen Inline graphic is selected without excluding any existing regimens in the selected set only if Inline graphic is shown to be non-redundant to all regimens already selected. Suppose Inline graphic is redundant to a regimen Inline graphic in the selected set; falsely advancing both Inline graphic and Inline graphic requires not only that we incorrectly declare Inline graphic and Inline graphic as non-redundant in their comparison but also that we declare Inline graphic to be non-redundant to any other regimens in the selected set. In other words, the probability of making a false non-redundancy error during one particular iteration is less than or equal to the probability of making a false non-redundancy error for comparison of Inline graphic with each of Inline graphic, Inline graphic.

We therefore also consider an alternative choice of Inline graphic that leads to a less conservative RFS algorithm. For Inline graphic, the maximum possible number of iterations as defined earlier equals Inline graphic across all possible paths the RFS algorithm might take. We thus propose to use an RFS with Inline graphic, based on the fact that in each iteration, the chance of adding a new redundant regimen is smaller or equal to the probability of making a false non-redundancy error when comparing the new regimen with any existing regimen. We name this algorithm RFS-II. A caveat of this algorithm is that different hypotheses are being tested in each possible path of RFS and there is a potential of inflated type-I error if the information used to determine which path to move forward is related to the comparisons involved in each path. We did not find this to be a concern, however, for RFS-II using either AS or AR to determine the sequence of regimens for comparison, based on extensive numerical studies. The probability of selecting redundant regimens is always well controlled, as will be shown later in the simulation studies.

3. SIMULATION STUDIES

We conduct simulation studies to evaluate the proposed RFS algorithms for down-selection. We consider several settings where 14 immune endpoints from 60 individuals within each regimen arm are simulated from a multivariate normal distribution with variance 1 and regimen-specific mean. The maximum possible number of regimens allowed to be selected (Inline graphic) is either 2 or 3.

We investigate the performance of RFS-I and RFS-II that control the FW-FNRE at Inline graphic (i.e., Inline graphic and Inline graphic respectively). For comparison, we also included two other RFS type algorithms with less adjustment for multiple testing. The first, RFS-III, has Inline graphic. It is expected to control PW-FNRE at 0.05 level, but not FW-FNRE. The second algorithm, RFS-IV, conducts one-sided tests between each pair of regimens for each individual endpoint at significance level 0.025, without any adjustment for multiple testing. For each algorithm, we consider ranking using either AS or AR with equal weights given to each endpoint.

For each simulation setting, regimens that should be selected are those that satisfy the superiority and non-redundancy criteria and that are ranked among the top Inline graphic if more than Inline graphic regimens satisfy both criteria. The rest of the regimens should not be selected. We evaluate the following performance criteria of each algorithm by averaging their estimates across 4000 Monte-Carlo simulations: (1) Pr(redundant regimens are selected), hereafter referred to as the probability of redundancy, (2) average total number of regimens selected, (3) percentage of regimens that should be selected among the selected set, hereafter referred to as the “positive predictive value” (PPV), (4) Pr(a regimen is selected | it should be selected), hereafter referred to as the “true positive rate” (TPR), and (5) Pr(a regimen is selected | it should NOT be selected), hereafter referred to as the “false positive rate” (FPR).
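The five performance criteria can be computed from a list of simulated selections as in the sketch below. It assumes each simulation yields a non-empty selected set, that at least one regimen should not be selected, and that the caller supplies a predicate `is_redundant(a, b)` encoding the true pairwise relationships. The function and key names are hypothetical.

```python
from itertools import combinations


def summarize_selections(selections, should_select, all_regimens, is_redundant):
    """Monte-Carlo averages of the five performance criteria."""
    target = set(should_select)
    others = set(all_regimens) - target
    totals = {"redundancy": 0.0, "n_selected": 0.0, "ppv": 0.0, "tpr": 0.0, "fpr": 0.0}
    for sel in selections:
        chosen = set(sel)
        # (1) does this selected set contain at least one redundant pair?
        totals["redundancy"] += any(
            is_redundant(a, b) for a, b in combinations(sorted(chosen), 2)
        )
        totals["n_selected"] += len(chosen)                    # (2)
        totals["ppv"] += len(chosen & target) / len(chosen)    # (3)
        totals["tpr"] += len(chosen & target) / len(target)    # (4)
        totals["fpr"] += len(chosen & others) / len(others)    # (5)
    return {name: total / len(selections) for name, total in totals.items()}
```

Each quantity is computed per simulation and then averaged, which mirrors the description of averaging estimates across Monte-Carlo replicates.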

Note that the above definition of TPR assumes that whether a regimen should be selected can be well defined. For more general settings where two or more identical regimens are superior to other regimens and only one should be selected, the definition of TPR can be generalized as Pr(an identical group is selected | it should be selected), where an identical group is considered selected if any regimen among the group is selected (details are presented in supplementary Web Appendix C available at Biostatistics online).

3.1. Simulation settings I–IV

Settings I–IV assume 14 independent endpoints are measured on Inline graphic vaccine regimens (See Web Supplementary Tables 1–3 available at Biostatistics online). In settings I–II, regimen 1 is the only regimen that should be selected based on either AS or AR, with Inline graphic in setting I and Inline graphic in setting II. The ideal outcome is to select regimens 1 and 2 in setting III and regimens 1, 2, and 3 in setting IV, with Inline graphic in both settings. A positive effect size Inline graphic is used to characterize the mean differences between regimens.
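A data-generating step of this kind can be sketched with the standard library alone. The sketch assumes independent normal endpoints with unit variance and 60 participants per regimen, as stated above; the actual regimen means are given only in the supplementary tables, so the mean vectors used in any call are hypothetical placeholders, as is the function name.

```python
import random


def simulate_regimen(endpoint_means, n_participants=60, seed=None):
    """Draw independent N(mean, 1) endpoint values for one regimen arm.

    Returns a list with one row per participant and one value per endpoint.
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    return [
        [rng.gauss(mu, 1.0) for mu in endpoint_means]
        for _ in range(n_participants)
    ]
```

For example, `simulate_regimen([0.5] * 14, seed=1)` returns 60 rows of 14 values; the value 0.5 stands in for a regimen-specific mean and is not taken from the paper.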

3.1.1. Results based on AS ranking.

Here, we present results of various algorithms based on AS ranking. We first assess the probability of redundancy, as presented in supplementary Web Tables 4–7 in Appendix B available at Biostatistics online. The probability of redundancy was well controlled in all settings by RFS-I and RFS-II, with RFS-I more conservative compared to RFS-II as expected. The actual probability of redundancy varies with setting and depends on the extent of redundancy between regimens entering down-selection (supplementary Web Figure 1 in Appendix B available at Biostatistics online). For settings where all pairs of regimens entering down-selection are redundant (settings I–II), the probability of redundancy for RFS-II is close to the nominal level for medium to large difference between regimens (i.e., Inline graphic). Both algorithms become more conservative when the proportion of non-redundant pairs increases (Inline graphic for setting III and Inline graphic for setting IV) and when differences between regimens are small. In contrast, the probability of redundancy for RFS-III and RFS-IV is higher than the desired level, for medium to large Inline graphic for all settings. The inflation of the error rate tends to be more severe when the difference between regimens increases. Without any adjustment for multiple testing, RFS-IV is particularly problematic with the probability of redundancy around 80% in some settings.

Performance based on other operational criteria as a function of Inline graphic for settings I–IV is presented in Figure 2 and supplementary Web Figures 2–4 of Appendix B available at Biostatistics online, where subfigure (a) shows the average number of regimens selected, subfigure (b) shows the PPV, subfigure (c) shows the TPR, and subfigure (d) shows the FPR. Overall, we observe that multiple testing adjustment is important for good selection performance: RFS-I and RFS-II have better performance compared to other algorithms in all settings in the sense that when the effect size Inline graphic increases, the two algorithms tend to select the correct number of regimens and are thus cost-effective; their PPVs approach 1 and the selected regimens tend to be the desired ones; they have TPRs that are close to 1 and small FPRs. In contrast, when the effect size Inline graphic increases, RFS-III and RFS-IV always have imperfect PPV and larger FPR compared to RFS-I and RFS-II; on average they select more regimens than necessary when the maximum number of regimens allowed to be selected (Inline graphic) is larger than the true number of desired regimens (settings I–III), with the problem particularly severe for RFS-IV. Regarding the comparison between the two proposed algorithms RFS-I and RFS-II, RFS-II can have an advantage over RFS-I when the differences between the regimens are not large.

Fig. 2.

Performance of various selection algorithms for setting I measured by the Monte-Carlo average. In this setting, regimens 2–8 are equivalent. Regimen 1 is superior to regimens 2–8. The ideal outcome is to select regimen 1 only. Endpoints are independent. Up to 2 regimens are allowed to be selected. AS is used for ranking regimens. Here Inline graphic is the effect size that characterizes the mean difference between regimens (regimen means are presented in Web Supplementary Table 1 available at Biostatistics online). “# Selected” is the average total number of regimens selected; “PPV” is the percentage of regimens that should be selected among the selected set; “TPR” is the probability that a regimen is selected given that it should be selected; “FPR” is the probability that a regimen is selected given that it should not be selected.
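The four operating characteristics defined in the caption can be computed from simulated selections roughly as follows. This is a minimal sketch, not the authors' code: the function name `selection_metrics`, the toy runs, and the choice to average PPV only over runs that select at least one regimen are all assumptions made for illustration.

```python
def selection_metrics(selected_sets, desired, n_regimens):
    """Monte-Carlo averages of '# Selected', PPV, TPR and FPR.

    selected_sets: one set of selected regimen labels per simulated dataset.
    desired:       regimens that should be selected (the truth).
    n_regimens:    number of regimens entering down-selection.
    """
    desired = set(desired)
    n_undesired = n_regimens - len(desired)
    runs = 0
    n_selected = true_pos = false_pos = 0
    ppv_values = []
    for sel in selected_sets:
        sel = set(sel)
        runs += 1
        tp = len(sel & desired)   # correctly selected regimens
        fp = len(sel - desired)   # incorrectly selected regimens
        n_selected += len(sel)
        true_pos += tp
        false_pos += fp
        if sel:                   # assumption: PPV is undefined for an empty selection
            ppv_values.append(tp / len(sel))
    return {
        "n_selected": n_selected / runs,
        "ppv": sum(ppv_values) / len(ppv_values) if ppv_values else float("nan"),
        "tpr": true_pos / (runs * len(desired)),
        "fpr": false_pos / (runs * n_undesired),
    }


# Toy illustration in the spirit of setting I: 8 regimens, only regimen 1 is desired.
toy_runs = [{1}, {1, 2}, {1}, {3}]
metrics = selection_metrics(toy_runs, desired={1}, n_regimens=8)
print(metrics)
```

In the toy runs, regimen 1 is selected in three of four datasets and an undesired regimen is selected in two of them.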

3.1.2. AS versus AR in ranking.

In our simulations, AR ranking leads to poorer performance as compared to AS ranking, in terms of a smaller probability of correctly selecting a regimen that should be selected (TPR) and a larger probability of incorrectly selecting a regimen that should not be selected (FPR). Supplementary Web Figure 5 in Appendix B available at Biostatistics online shows the comparison of performance between AS and AR using RFS-I and RFS-II for simulation setting IV as an example.

3.2. Additional simulations

We conducted additional simulations with varying numbers of regimens entering down-selection, Inline graphic. The patterns we observe when comparing different algorithms are similar to those for Inline graphic (see supplementary Web Appendix D available at Biostatistics online). In general, RFS-II performs best. RFS-III performs similarly to RFS-II when Inline graphic is small (Inline graphic).

We further explored the impact of correlation between immune endpoints. Performance of RFS-II with the same mean structure as in setting III, under varying correlations Inline graphic between all pairs of endpoints, is presented in supplementary Web Appendix E available at Biostatistics online. Larger correlation leads to reduced performance when Inline graphic is small, but its impact tends to vanish as Inline graphic gets large.

4. DATA ILLUSTRATION

We applied the proposed down-selection algorithms to immune assay data collected from five vaccine regimens studied in HIV vaccine trials. The first regimen is the partially efficacious vaccine used in the RV144 Thai trial, ALVAC-HIV prime plus gp120 AIDSVAX B/E boosts (Rerks-Ngarm and others, 2009). In RV144, 8197 participants were assigned to receive vaccine injections at weeks 0, 4, 12, and 24. Among vaccinees uninfected at week 26, immune responses were measured from 41 cases (vaccinees who acquired infection before month 42) and 205 controls (vaccinees free of infection over 42 months), who were selected in a 5:1 ratio to cases within strata defined by gender, number of vaccine injections received, and per-protocol status (Haynes and others, 2012). To illustrate our methods, we use immune assay data at week 26 for the 205 controls (RV144T), which approximately represent the cohort of vaccinees uninfected at the time of immune assay measurement, given the extremely low incidence of HIV infection and the similar distributions of gender, number of vaccine injections, and per-protocol status in this sample compared to the cohort of interest. The other four regimens are studied in an ongoing phase I trial (HVTN096): NYVAC prime plus NYVAC + AIDSVAX B/E boosts (T1), NYVAC + AIDSVAX B/E prime plus NYVAC + AIDSVAX B/E boosts (T2), DNA prime plus NYVAC + AIDSVAX B/E boosts (T3), and DNA + AIDSVAX B/E prime plus NYVAC + AIDSVAX B/E boosts (T4). For each arm in HVTN096, participants were assigned to receive vaccine injections at weeks 0, 4, 12, and 24, and immune assay data from week 26 are available for 19, 18, 17, and 19 vaccine recipients, respectively. Both studies included participants receiving placebo. While it is clear that active treatments need to be superior to placebo, we decided not to include placebo formally in the procedure because it is the direct comparison between active treatments that is key for the down-selection.
Data on the following eight immune endpoints were collected for the five vaccine regimens: IgG binding antibody responses to six different HIV antigens measured using the binding antibody multiplex assay (BAMA), one neutralizing antibody (NAb) response measured using the TZM-bl assay, and one CD4 T-cell response measured using intracellular cytokine staining (ICS). Endpoint values in each regimen were scaled by the standard deviation (SD) of the endpoint in RV144T because it has the largest sample size. Alternatively, scaling can be based on a pooled SD across regimens. The scaling is performed separately for each endpoint, so that each endpoint mean is expressed in SD units before we construct the weighted AS.
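The scaling step can be sketched as below for a single endpoint. This is an illustrative sketch only: the function name `scale_by_reference_sd` and the toy values are hypothetical, and the sample SD (n - 1 denominator) is assumed.

```python
import statistics


def scale_by_reference_sd(values_by_regimen, reference):
    """Divide every observation of one endpoint by the SD observed in the reference regimen."""
    ref_sd = statistics.stdev(values_by_regimen[reference])  # sample SD, assumed
    return {
        regimen: [v / ref_sd for v in values]
        for regimen, values in values_by_regimen.items()
    }


# Hypothetical endpoint values for a reference arm and one comparison arm.
raw = {"RV144T": [2.0, 4.0, 6.0], "096T1": [3.0, 5.0]}
scaled = scale_by_reference_sd(raw, reference="RV144T")
print(scaled)
```

After scaling, the reference regimen has SD 1 on every endpoint, which matches the RV144T column of Table 2.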

We report two sets of analyses here. First, we use available data from RV144 or AR and 096 as described above. Means and standard deviations of the endpoint measures by vaccine regimen are presented in Table 2. Values presented have been scaled by standard deviations in the RV144 vaccine arm to ensure comparability of mean values across endpoints. We consider equal weights for BAMA IgG, ICS, and NAb and subdivide the weight for BAMA IgG equally between the six different measures. The rank of the five regimens is 096T3, 096T1, 096T4, 096T2, and RV144T based on AS and 096T3, 096T1, 096T4, RV144T, and 096T2 based on AR (Table 2). Applying RFS-II, which controls the probability of redundancy at the 0.05 level, 096T3 is the only regimen selected based on either AS or AR ranking (see supplementary Web Figure 12 in Appendix H available at Biostatistics online). To understand the similarity in pattern between the immune response profiles of the five regimens, we also conducted an exploratory cluster analysis that groups regimens into clusters based on the means of their immune responses (as described in supplementary Web Appendix F available at Biostatistics online). Based on the weighted Manhattan distance between response means, the five vaccine regimens were grouped into two clusters, {RV144, 096T2} and {096T1, 096T3, 096T4}, as presented in Figure 3. Second, to inform the design of the upcoming phase I immunogenicity studies for down-selection, we conducted a power analysis with varying sample sizes (see supplementary Web Appendix G available at Biostatistics online). Based on these analyses, the HVTN leadership has decided to enroll 50 participants per arm, with an expected 43 participants per arm having complete endpoint measurements, assuming a 15% missing rate.
To mimic what will happen in the future application, for the second analysis we repeated the down-selection exercise using resampled data from the five regimens: 43 participants were sampled (with replacement) from each regimen among individuals with complete endpoint measures. Results based on the resampled data are presented in supplementary Web Appendix H available at Biostatistics online. Applying the RFS-II algorithm again selected only 096T3.
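The sample-size arithmetic and the resampling step can be illustrated with a short sketch. The helper names, the participant identifiers, and the fixed seed are hypothetical; rounding 42.5 up to 43 is an assumption that agrees with the number quoted in the text.

```python
import math
import random


def expected_complete(n_enrolled, missing_rate):
    """Expected number of participants with complete data, rounded up (assumed rounding)."""
    return math.ceil(n_enrolled * (1 - missing_rate))


def resample_regimen(complete_cases, n_draw, seed=None):
    """Draw n_draw participants with replacement from those with complete endpoint data."""
    rng = random.Random(seed)
    return rng.choices(complete_cases, k=n_draw)


n_per_arm = expected_complete(50, 0.15)           # 50 * 0.85 = 42.5, rounded up
complete_ids = [f"P{i:02d}" for i in range(17)]   # hypothetical IDs for one regimen
boot_sample = resample_regimen(complete_ids, n_per_arm, seed=2016)
print(n_per_arm, len(set(boot_sample)))
```

Because sampling is with replacement, the 43 draws necessarily repeat some participants when fewer than 43 complete cases are available in an arm.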

Table 2.

Endpoint-specific mean and SD for each of eight different immune responses and each of five different HIV vaccine regimens for the original data. Values have been scaled by SD of RV144T

Endpoint                                   RV144T        096T1          096T2          096T3          096T4
BAMA AE.A244 V1V2 Tags/293F (B1)           5.91 (1)      5.49 (1)       5.35 (0.98)    5.72 (1.52)    5.21 (1.4)
BAMA gp70_B.CaseA2 V1/V2/169K (B2)         4.22 (1)      4.47 (1.06)    3.86 (0.87)    4.44 (1.54)    4.03 (1.16)
BAMA gp70_B.CaseA_V1_V2 (B3)               3.68 (1)      3.78 (0.83)    3.82 (0.71)    3.89 (1.12)    3.85 (1.04)
BAMA A244 gp 120 gDneg/293F/mon (B4)       3.42 (1)      4.47 (1.12)    3.77 (1.09)    4.46 (1.39)    4.04 (1.25)
BAMA vaccine insert (B5)                   3.77 (1)      4.13 (1.03)    3.65 (0.59)    4.23 (1.23)    3.91 (1.21)
BAMA Con S gp140 CFI (B6)                  3.80 (1)      5.24 (0.98)    4.82 (0.76)    5.26 (1.42)    5.07 (1.17)
ICS                                        −3.43 (1)     −3.24 (1.29)   −3.8 (0.86)    −2.57 (0.86)   −2.86 (1.41)
NAb                                        6.23 (1)      7.31 (1.26)    6.83 (1.15)    7.16 (1.38)    7.39 (1.44)
Weighted AS*                               2.31          2.89           2.41           3.09           2.96
Weighted AR*                               4.28          2.39           4.39           1.83           2.11
Sample size n                              205           19             18             17             19

Entries are mean (SD).

*Weight = 1/6 for each BAMA endpoint, and weight = 1 for ICS and NAb.

Boldface in each row indicates the best ranked regimen according to an individual endpoint or the weighted score.

BAMA endpoints were natural log-transformed blank-subtracted mean fluorescent intensity (MFI) values (blank-subtracted MFI values < 1 were set to 1 before natural log transformation).

ICS Env response magnitudes below 0.025 were set to 0.025 and all magnitudes were log10-transformed.

NAb endpoint was calculated as the average of log10-transformed titers over 6 isolates, where titers < 20 were set to 10 before the transformation.
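As a check on how the weighted scores in Table 2 relate to the endpoint means, the following sketch recomputes the weighted AS and AR rows. It is not the authors' R code. It assumes that the ICS means are negative (the sign marker is lost in the extracted table, and negative values reproduce the reported scores), that the footnote weights are normalized to sum to 1, and that ranks run from 1 (largest mean) to 5 within each endpoint.

```python
REGIMENS = ["RV144T", "096T1", "096T2", "096T3", "096T4"]

# Endpoint means from Table 2 (already scaled by the RV144T SD).
MEANS = {
    "B1":  [5.91, 5.49, 5.35, 5.72, 5.21],
    "B2":  [4.22, 4.47, 3.86, 4.44, 4.03],
    "B3":  [3.68, 3.78, 3.82, 3.89, 3.85],
    "B4":  [3.42, 4.47, 3.77, 4.46, 4.04],
    "B5":  [3.77, 4.13, 3.65, 4.23, 3.91],
    "B6":  [3.80, 5.24, 4.82, 5.26, 5.07],
    "ICS": [-3.43, -3.24, -3.80, -2.57, -2.86],  # negative sign assumed
    "NAb": [6.23, 7.31, 6.83, 7.16, 7.39],
}

# Relative weights from the table footnote: 1/6 per BAMA endpoint, 1 for ICS and NAb.
WEIGHTS = {name: (1 / 6 if name.startswith("B") else 1.0) for name in MEANS}


def ranks_descending(values):
    """Rank 1 = largest value. Ties are not handled; none occur in Table 2."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def weighted_average(per_endpoint):
    """Weighted average across endpoints, one value per regimen."""
    total_weight = sum(WEIGHTS.values())
    return [
        sum(WEIGHTS[e] * per_endpoint[e][j] for e in per_endpoint) / total_weight
        for j in range(len(REGIMENS))
    ]


weighted_as = dict(zip(REGIMENS, weighted_average(MEANS)))
weighted_ar = dict(zip(
    REGIMENS,
    weighted_average({e: ranks_descending(v) for e, v in MEANS.items()}),
))

for regimen in REGIMENS:
    print(regimen, round(weighted_as[regimen], 2), round(weighted_ar[regimen], 2))
```

Under these assumptions the rounded values agree with the Weighted AS and Weighted AR rows of Table 2, and 096T3 has both the largest AS and the smallest AR.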

Fig. 3.

Heatmap of endpoint means of vaccine regimens. The value for each endpoint has been centered by the average of its mean across regimens. Hierarchical clustering analysis is performed based on the complete linkage method; the distance matrix is generated as weighted Manhattan distance of endpoint means. The five regimens were grouped into two clusters {RV144, 096T2} and {096T1, 096T3, 096T4} using the method described in Web Supplementary Appendix F available at Biostatistics online.
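The clustering described in the caption can be approximated with a small pure-Python sketch of complete-linkage agglomeration on the Table 2 means. Several details are assumptions rather than the procedure of Web Appendix F: the distance weights are taken to be the same as the AS weights (1/6 per BAMA endpoint, 1 for ICS and NAb), the ICS means are taken to be negative, and the number of clusters is fixed at two instead of being chosen by the appendix's method.

```python
from itertools import combinations

# Endpoint means from Table 2, ordered B1..B6, ICS, NAb (ICS sign assumed negative).
MEANS = {
    "RV144T": [5.91, 4.22, 3.68, 3.42, 3.77, 3.80, -3.43, 6.23],
    "096T1":  [5.49, 4.47, 3.78, 4.47, 4.13, 5.24, -3.24, 7.31],
    "096T2":  [5.35, 3.86, 3.82, 3.77, 3.65, 4.82, -3.80, 6.83],
    "096T3":  [5.72, 4.44, 3.89, 4.46, 4.23, 5.26, -2.57, 7.16],
    "096T4":  [5.21, 4.03, 3.85, 4.04, 3.91, 5.07, -2.86, 7.39],
}
WEIGHTS = [1 / 6] * 6 + [1.0, 1.0]  # assumed to match the AS weights


def weighted_manhattan(a, b):
    """Weighted sum of absolute differences between two mean profiles."""
    return sum(w * abs(x - y) for w, x, y in zip(WEIGHTS, a, b))


def complete_linkage(names, n_clusters):
    """Agglomerate until n_clusters remain, merging the pair of clusters
    whose farthest members are closest (complete linkage)."""
    clusters = [frozenset([name]) for name in names]

    def linkage(pair):
        c1, c2 = pair
        return max(weighted_manhattan(MEANS[a], MEANS[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


clusters = complete_linkage(list(MEANS), n_clusters=2)
print([sorted(c) for c in clusters])
```

Centering each endpoint, as done for the heatmap, does not change these distances, so it is omitted here. Under the stated assumptions the two clusters coincide with those reported in the caption.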

Note that our definitions of superiority, inferiority, and equivalence with respect to individual endpoints (Section 2.1) could be extended to account for “putative predicted clinical significance.” That is, we could define Inline graphic superior to Inline graphic with respect to endpoint Inline graphic if Inline graphic, Inline graphic inferior to Inline graphic if Inline graphic, and Inline graphic equivalent to Inline graphic if Inline graphic for a fixed non-negative constant Inline graphic dictated by the clinical application. Note that with larger Inline graphic, it is harder to declare superiority of one regimen with respect to another. We explored various Inline graphic from 0 to 0.5; regimen 096T3 remains the only regimen selected using either of the two datasets.
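One way to write out the margin-based definitions is sketched below. The notation is assumed for illustration, because the symbols are not recoverable from this extracted text: $\mu_{ik}$ denotes the mean of endpoint $k$ under regimen $i$, and $\delta \ge 0$ is the fixed margin. The original symbols in Section 2.1 may differ, and the exact placement of strict and non-strict inequalities is likewise an assumption.

```latex
\begin{align*}
\text{regimen } i \text{ superior to regimen } j \text{ on endpoint } k:\quad
  & \mu_{ik} - \mu_{jk} > \delta, \\
\text{regimen } i \text{ inferior to regimen } j \text{ on endpoint } k:\quad
  & \mu_{ik} - \mu_{jk} < -\delta, \\
\text{regimen } i \text{ equivalent to regimen } j \text{ on endpoint } k:\quad
  & \lvert \mu_{ik} - \mu_{jk} \rvert \le \delta .
\end{align*}
```

Setting $\delta = 0$ gives definitions based only on the sign of the mean difference, and increasing $\delta$ widens the equivalence band, which is why superiority becomes harder to declare.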

5. CONCLUDING REMARKS

In this paper, we proposed a novel statistical framework to down-select the best several qualifying treatment regimens based on multivariate endpoints, motivated by the need to select multiple HIV vaccine regimens to be studied in an efficacy trial based on immunogenicity data from phase I trials. In particular, we formulated two criteria to meet in the down-selection process: superiority and non-redundancy. While the former criterion has been used by many other researchers to select the best single intervention among a candidate set, the latter is a new concept intended to satisfy the need to select a set of best regimens with diverse immune response patterns. To select regimens that meet these criteria, we developed novel statistical algorithms that combine hypothesis testing and ranking. Our proposed algorithms control the chance of advancing regimens with redundant immune profiles through multiple testing adjustment. We found multiple testing adjustment important for good performance of selection algorithms, especially as the difference between regimens gets larger. In practice, results of the formal down-selection, given a fixed maximum number of regimens that may be selected, will be used as a guideline for decision making, but a final decision could also depend on other information, such as the immune response patterns observed in unsupervised clustering analysis. A regimen that is borderline in down-selection but is sufficiently different from other regimens might be worth including. R code for implementing these methods is publicly available at http://research.fhcrc.org/huang/en/software.html.

The current paper focuses on the comparison of continuous or binary immune responses based on their means using the Wald test or t-test. One could extend this method to base comparisons on whole distributions of endpoints, where now superiority, inferiority, and equivalence are defined in terms of Inline graphic (stochastically larger), Inline graphic (stochastically smaller), or Inline graphic (equal in distribution). Nonparametric tests such as the Wilcoxon Rank Sum test may be used to achieve robustness to skewness or outliers. Transformation of the immune response measures could be entertained before RFS to handle different data types. For example, Wittkowski and others (2004) proposed transforming ordinal data based on U-statistics, which naturally leads to a family of nonparametric tests for comparing regimens.
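To make the rank-based alternative concrete, the sketch below computes a Wilcoxon rank-sum (Mann-Whitney U) statistic for one endpoint and a one-sided p-value from the large-sample normal approximation. It is an illustration, not part of the proposed algorithms: the function name and toy data are hypothetical, and no tie or continuity correction is applied.

```python
import math


def rank_sum_test(x, y):
    """Return (U, p) where U counts pairs with x > y (ties count 0.5) and p is the
    one-sided normal-approximation p-value for 'x tends to be larger than y'."""
    n1, n2 = len(x), len(y)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean_u) / sd_u
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return u, p_one_sided


# Hypothetical endpoint values for two regimens.
u_stat, p_value = rank_sum_test([3, 4, 5], [1, 2, 6])
print(u_stat, p_value)
```

With samples as small as those in some arms here, an exact permutation distribution would usually be preferred to the normal approximation.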

There are several important considerations with respect to selection of immune endpoints for down-selection. First, immune endpoints should be chosen to cover major immune classes and inclusion of highly correlated endpoints should be avoided because the multiplicity correction in the down-selection algorithm tends to be more conservative in the presence of high correlations. Second, all endpoints should be based on validated or qualified immunological assays such that the degree of technical measurement error is low and most of the inter-vaccinee variability in the endpoint is potentially associated with differences in the level of vaccine efficacy to prevent HIV infection. Third, an endpoint can be an established composite score integrating several immune response measures, such as the T-cell polyfunctionality score (Larsen and others, 2012).

Potential application of the methods we developed in this paper is not limited to HIV vaccine research. They may be applied in other clinical settings for selecting the best several interventions from randomized trials based on multiple endpoints. For example, in cancer biomarker research, where there are multiple biomarker endpoints measured, our methods would apply to selecting sets of interventions that generate unique patterns of biomarkers that are likely to be associated with diverse biological mechanisms for efficacy.

ACKNOWLEDGMENTS

We thank Allan DeCamp, Yunda Huang, Barb Metch, and James Kublin for helpful discussions. The authors thank the participants, investigators, and sponsors of the HVTN096 trial and the RV144 trial. We thank the editor, AE, and referees for their insightful comments. Conflict of Interest: None declared.

SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

FUNDING

National Institutes of Health (grants R01 GM106177-01, 2R37AI05465-10, and UM1AI068635).

REFERENCES

  1. Bloch, D. A., Lai, T. L. and Tubert-Bitter, P. (2001). One-sided tests in clinical trials with multiple endpoints. Biometrics 57, 1039–1047.
  2. Bloch, D. A., Lai, T. L., Su, Z. and Tubert-Bitter, P. (2007). A combined superiority and non-inferiority approach to multiple endpoints in clinical trials. Statistics in Medicine 26, 1193–1207.
  3. Follmann, D. (1996). A simple multivariate test for one-sided alternatives. Journal of the American Statistical Association 91, 854–861.
  4. Gilbert, P. B., Grove, D., Gabriel, E., Huang, Y., Gray, G., Hammer, S. M., Buchbinder, S. P., Kublin, J., Corey, L. and Self, S. G. (2011). A sequential phase 2b trial design for evaluating vaccine efficacy and immune correlates for multiple HIV vaccine regimens. Statistical Communications in Infectious Diseases 3, 1–86.
  5. Haynes, B. F., Gilbert, P. B., McElrath, M. J., Zolla-Pazner, S., Tomaras, G. D., Alam, S. M., Evans, D. T., Montefiori, D. C., Karnasuta, C., Sutthent, R. and others. (2012). Immune-correlates analysis of an HIV-1 vaccine efficacy trial. New England Journal of Medicine 366, 1275–1286.
  6. Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65–70.
  7. Huang, Y., DiazGranados, C., Janes, H., Huang, Y., deCamp, A. C., Metch, B., Grant, S., Sanchez, B., Phogat, S., Koutsoukos, M. and others. (2016). Selection of HIV vaccine candidates for concurrent testing in an efficacy trial. Current Opinion in Virology 17, 57–65.
  8. Larsen, M., Sauce, D., Arnaud, L., Fastenackels, S., Appay, V. and Gorochov, G. (2012). Evaluating cellular polyfunctionality with a novel polyfunctionality index. PLoS One 7, e42403.
  9. Mandrekar, S. J. and Sargent, D. J. (2006). Pick the winner designs in phase II cancer clinical trials. Journal of Thoracic Oncology 1, 5–6.
  10. Perlman, M. D. and Wu, L. (2004). A note on one-sided tests with multiple endpoints. Biometrics 60, 276–280.
  11. Rerks-Ngarm, S., Pitisuttithum, P., Nitayaphan, S., Kaewkungwal, J., Chiu, J., Paris, R., Premsri, N., Namwat, C., de Souza, M., Adams, E. and others. (2009). Vaccination with ALVAC and AIDSVAX to prevent HIV-1 infection in Thailand. New England Journal of Medicine 361, 2209–2220.
  12. Röhmel, J., Gerlinger, C., Benda, N. and Läuter, J. (2006). On testing simultaneously non-inferiority in two multiple primary endpoints and superiority in at least one of them. Biometrical Journal 48, 916–933.
  13. Roy, S. N. (1953). On a heuristic method of test construction and its use in multivariate analysis. The Annals of Mathematical Statistics 24 (2), 220–238.
  14. Sargent, D. J. and Goldberg, R. M. (2001). A flexible design for multiple armed screening trials. Statistics in Medicine 20, 1051–1060.
  15. Simon, R., Wittes, R. E. and Ellenberg, S. S. (1985). Randomized phase II clinical trials. Cancer Treatment Reports 69, 1375–1381.
  16. Steinberg, S. M. and Venzon, D. J. (2002). Early selection in a randomized phase II clinical trial. Statistics in Medicine 21, 1711–1726.
  17. Tamhane, A. C. and Logan, B. R. (2004). A superiority-equivalence approach to one-sided tests on multiple endpoints in clinical trials. Biometrika 91, 715–727.
  18. Tang, D. (1994). Uniformly more powerful tests in a one-sided multivariate problem. Journal of the American Statistical Association 89, 1006–1011.
  19. Wittkowski, K. M., Lee, E., Nussbaum, R., Chamian, F. N. and Krueger, J. G. (2004). Combining several ordinal measures in clinical studies. Statistics in Medicine 23, 1579–1592.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
