Author manuscript; available in PMC: 2024 Sep 11.
Published in final edited form as: Clin Cancer Res. 2013 Jan 9;19(6):1326–1334. doi: 10.1158/1078-0432.CCR-12-1223

Borrowing Information Across Subgroups in Phase II Trials: Is It Useful?

Boris Freidlin 1, Edward L Korn 1
PMCID: PMC11388725  NIHMSID: NIHMS433148  PMID: 23303215

Abstract

Due to the heterogeneity of human tumors, cancer patient populations usually comprise multiple subgroups with different molecular and/or histological characteristics. In screening new anticancer agents, there might be scientific rationale to expect some degree of similarity in clinical activity across the subgroups. This poses a challenge to the design of phase II trials assessing clinical activity: conducting an independent evaluation in each subgroup requires considerable time and resources, whereas a pooled evaluation that completely ignores patient heterogeneity can miss treatments that are active only in some subgroups. It has been suggested that approaches that borrow information across subgroups can improve efficiency in this setting. In particular, the hierarchical Bayesian approach putatively uses the outcome data to decide whether borrowing of information is appropriate. We evaluated potential benefits of the hierarchical Bayesian approach (using models suggested previously) and a simpler pooling approach by simulations. In the phase II setting the hierarchical Bayesian approach is shown not to work well in the simulations considered, as there appears to be insufficient information in the outcome data to determine whether borrowing across subgroups is appropriate. When there is strong rationale for expecting a uniform level of activity across the subgroups, approaches utilizing simple pooling of information across subgroups may be useful.

Keywords: Hierarchical Bayesian model, futility analysis, pooling of information, subgroup analysis

Introduction

One of the major challenges in the development of anticancer agents is the heterogeneity of patient populations. In early clinical studies assessing activity of new treatments (phase II trials), patients can often be classified into non-overlapping subgroups for which it may be reasonable to assume that the activity (or lack thereof) is homogeneous within each subgroup, and for which there is some reason to believe that the activity level may be similar across subgroups. Three scenarios where such subgroups arise are: (I) patients expressing a molecular target of interest across multiple cancer histologies (1), (II) multiple histological subtypes of a given cancer, and (III) biomarker-defined molecular subtypes of a given cancer histology. An example of the first scenario is a trial (NCT01306045) that evaluates five agents with distinct biomarker targets: each agent is evaluated separately in patients expressing the corresponding biomarker in three different histological subgroups (non-small cell lung cancer [NSCLC], small cell lung cancer [SCLC], and thymic cancer). An example of the second scenario is a trial (2) that evaluated imatinib in ten histological subtypes of soft tissue sarcoma. An example of the third scenario is the BATTLE trial (3) that evaluated therapies in five biomarker-defined subgroups of NSCLC.

The two common approaches to analyzing activity across subgroups are to ignore them (and do one pooled analysis) or to perform a separate stand-alone analysis in each subgroup. Either approach can be problematic. A pooled analysis can miss agents that are active in only one or a few subgroups. On the other hand, conducting an independent evaluation in each subgroup is time- and resource-consuming and is often not feasible because of the large total sample size that would be required.

Rather than stand-alone subgroup analyses or a single pooled analysis, an attractive middle ground would share the outcome results from the different subgroups to improve the inference for each subgroup. This is sometimes referred to as "borrowing information" or "borrowing strength" across the subgroups (4–6). For example, in Table 1, by borrowing information from subgroups 1–4 one might find the 30% response rate in subgroup 5 more believable in Scenario 1 than in Scenario 2. Formal statistical borrowing of information is often done via Bayesian methods. For example, when some preliminary information on the overall response rate is available, a simple Bayesian model summarizes the preliminary data in a prior distribution for the true subgroup response rates. The inference for each subgroup is based on the posterior distribution for its response rate, which uses the preliminary information by shrinking the observed response rates towards the mean of the prior distribution, with more shrinking if the subgroup sample size is small. The amount of shrinking also depends on the spread of the prior distribution around its mean, with more shrinking if the spread is narrow.
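This shrinkage can be illustrated with a minimal sketch assuming a conjugate Beta prior for a subgroup's true response rate (the function name and the Beta(2, 8) prior centered at 20% are illustrative choices, not from the paper):

```python
def posterior_mean(responses, n, prior_a, prior_b):
    """Posterior mean of a response rate under a Beta(prior_a, prior_b) prior.

    The observed rate responses/n is shrunk toward the prior mean
    prior_a/(prior_a + prior_b); a smaller n or a tighter prior
    (larger prior_a + prior_b) gives more shrinkage.
    """
    return (prior_a + responses) / (prior_a + prior_b + n)

# Illustrative prior centered at 20% with effective prior sample size 10.
a, b = 2.0, 8.0
rate_small = posterior_mean(3, 10, a, b)    # observed 30% in 10 patients -> 0.25
rate_large = posterior_mean(30, 100, a, b)  # observed 30% in 100 patients -> ~0.29
# The smaller subgroup is pulled further toward the prior mean of 0.2.
```

Note that with this simple model the degree of shrinkage is fixed by the prior and the sample size; it does not depend on how similar the subgroups' observed rates are to one another.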

Table 1.

Two hypothetical five-subgroup scenarios in which the 30% response rate in subgroup 5 is more believable in scenario 1 than scenario 2

Subgroup    Scenario 1                      Scenario 2
            No. of       Responses (%)      No. of       Responses (%)
            Patients                        Patients
1           25           8 (32%)            25           1 (4%)
2           25           6 (24%)            25           0 (0%)
3           25           7 (28%)            25           2 (8%)
4           25           9 (36%)            25           1 (4%)
5           10           3 (30%)            10           3 (30%)

The simple Bayesian approach does not allow borrowing information across the subgroups. Furthermore, the amount of shrinking to the mean of the prior distribution does not depend on the observed outcomes, as it is fixed by the spread of the prior distribution and the subgroup sample size. To overcome these deficiencies, a hierarchical Bayesian approach has been suggested (7–9). This approach uses the observed response rates to help decide whether to borrow the information (by shrinking to roughly the mean of all the observed response rates) and how much to borrow (shrink less if the observed response rates are far apart) (10). In theory, the idea of adapting the degree of borrowing across the subgroups according to the observed data, "outcome-adaptive borrowing", sounds attractive. It has been suggested to "make for more informed decision making and smaller clinical trials" (10, page 35), and to be "an effective method for studying rare diseases and their subtypes" (2, page 3148). However, the degree to which this approach works warrants careful examination. We evaluate the potential benefits of hierarchical Bayesian modeling via computer simulations. We also evaluate a much simpler approach for borrowing information across subgroups in futility/inefficacy interim monitoring (11).

Outcome-Adaptive Borrowing: Hierarchical Bayesian Approaches

We assume that activity is measured by a binary outcome (e.g., response). Let pi be the observed response rate based on ni patients in the ith subgroup (i = 1,…,K). The true response rates in the subgroups, π1, π2,…,πK (or a specified transformation of them) are assumed to come from a specified prior distribution with unknown mean and variance. One could use the observed response rates to estimate this unknown mean and variance, and subsequently estimate the πi; this is known as an empirical Bayes approach (12, Chapter 3). Alternatively, a hierarchical Bayesian approach treats the unknown mean and variance as random quantities and specifies distributions for them (9). There are many model specifications for the hierarchical Bayesian approach. We used two models specifically developed for the phase II clinical trial setting. The first parameterization is from Thall et al. (9). For this model (Model 1), two different sets of hyperpriors were considered, allowing for moderate and strong borrowing, respectively. The second parameterization (Model 2) is from Berry et al. (13) (see the Appendix for details).
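The empirical Bayes idea mentioned above can be sketched with a simple method-of-moments calculation. This is an illustrative Python stand-in, not the authors' R/WinBUGS implementation; the function name and the particular moment estimator are assumptions made for exposition:

```python
def empirical_bayes_shrink(responses, ns):
    """Method-of-moments empirical Bayes sketch for subgroup response rates.

    Estimates the common mean and the between-subgroup variance from the
    observed rates, then shrinks each observed rate toward the common mean;
    the estimated between-subgroup variance controls how much is borrowed.
    """
    K = len(responses)
    rates = [r / n for r, n in zip(responses, ns)]
    m = sum(rates) / K                                   # estimated prior mean
    var_obs = sum((p - m) ** 2 for p in rates) / (K - 1)
    # Average binomial sampling variance evaluated at the common mean.
    var_bin = sum(m * (1 - m) / n for n in ns) / K
    # Between-subgroup variance; a non-positive estimate means the spread in
    # observed rates is explained by binomial noise alone -> complete pooling.
    tau2 = max(var_obs - var_bin, 0.0)
    shrunk = []
    for p, n in zip(rates, ns):
        w = tau2 / (tau2 + m * (1 - m) / n)  # weight on the subgroup's own data
        shrunk.append(w * p + (1 - w) * m)
    return shrunk

# Table 1, Scenario 1: the observed spread is within binomial noise,
# so every subgroup is pooled completely to the common mean of 0.30.
shrunk1 = empirical_bayes_shrink([8, 6, 7, 9, 3], [25, 25, 25, 25, 10])

# Table 1, Scenario 2: the spread exceeds binomial noise, so borrowing is
# partial and subgroup 5 is pulled well below its observed 30%.
shrunk2 = empirical_bayes_shrink([1, 0, 2, 1, 3], [25, 25, 25, 25, 10])
```

This sketch shows the qualitative behavior the hierarchical models aim for: more borrowing when the observed rates look similar, less when they are far apart.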

Combining Pooled and Subgroup-Specific Analyses

LeBlanc et al. (11) suggested a simple approach to borrowing information: a futility analysis based on the pooled data of all the subgroups is performed in addition to futility analyses performed for each subgroup. The futility analysis on the pooled data is performed after a specified number of patients (npooled) are enrolled in the study overall; a test is performed to see whether the pooled response rate (ppooled) is consistent with an alternative hypothesis, and, if not, the whole trial is stopped. The separate futility analyses are conducted in each subgroup after a specified number of patients (ns) are enrolled in that subgroup; a test is performed in each subgroup to see whether the subgroup response rate is consistent with an alternative hypothesis, and, if not, accrual to the individual subgroup is stopped. The alternative hypothesis for the pooled analysis would typically be taken to be less than the alternative hypotheses for the individual subgroups. For example, for testing an alternative hypothesis of a response rate of 30% (versus a null response rate of 10%) at significance level α0s (e.g., 0.02) in the individual subgroups, LeBlanc et al. suggest that an alternative hypothesis of 20% (versus a null response rate of 10%) at significance level α0 (e.g., 0.02) be used for the pooled futility analysis. That is, accrual to an individual subgroup i is stopped if TiA < α0s, and accrual to the entire study is stopped if TA < α0, where TiA = Pr(X ≤ pi·ns | ns, π = 0.3) and TA = Pr(X ≤ ppooled·npooled | npooled, π = 0.2) are binomial tail probabilities, with X denoting the number of responses.
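These stopping rules reduce to binomial lower-tail probabilities, which can be sketched as follows (Python used for illustration; the function names and numeric examples are ours, not from the paper):

```python
from math import comb

def binom_tail_le(k, n, p):
    """Pr(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

def stop_subgroup(responses, n_s, alt=0.3, alpha=0.02):
    """Stop subgroup accrual if the data are inconsistent with the 30% alternative."""
    return binom_tail_le(responses, n_s, alt) < alpha

def stop_trial(responses_pooled, n_pooled, alt=0.2, alpha=0.02):
    """Stop the whole trial if pooled data are inconsistent with the 20% alternative."""
    return binom_tail_le(responses_pooled, n_pooled, alt) < alpha

# With 15 patients in a subgroup, 0 responses are inconsistent with a 30% rate:
assert stop_subgroup(0, 15)      # Pr(X <= 0 | 15, 0.3) ~ 0.005 < 0.02
assert not stop_subgroup(2, 15)  # Pr(X <= 2 | 15, 0.3) ~ 0.13
```

For the pooled analysis at 40 patients, 2 or fewer responses would trigger stopping under the 20% alternative, while 3 would not.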

Simulations

Following Thall et al. and LeBlanc et al., a response rate of 30% was considered promising in a subgroup while a response rate of 10% was considered uninteresting. First, we considered a fixed sample size design where 25 patients are accrued to each subgroup (i.e., no interim monitoring): for each patient in each subgroup, response status was independently generated to be 1 with probability equal to the given response rate, and 0 otherwise. In practice, however, many phase II designs in oncology include interim futility monitoring rules that allow them to stop early for disappointing results (to protect patients and resources). Thus, we also simulated designs with interim futility analyses. These latter simulations allowed us to assess the benefits of borrowing information across treatment arms when (1) the accumulating data are repeatedly evaluated over time and (2) the accrual rates differ between the subgroups. In these simulations, patient outcomes were generated sequentially, with the subgroup status of each patient generated from a multinomial distribution with pre-specified subgroup frequencies. The simulation code was written in R (14), and the hierarchical Bayesian approach was implemented using the WinBUGS package (15).
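The fixed-sample generation step can be sketched as follows (a Python stand-in for the authors' R code; the function name, seed, and replication count are illustrative):

```python
import random

def simulate_fixed(rates, n_per_group=25, cutoff=5, reps=2000, seed=1):
    """Monte Carlo sketch of the fixed-sample design: each subgroup enrolls
    n_per_group patients and rejects its null hypothesis if the number of
    responses is >= cutoff. Returns the empirical rejection probability
    per subgroup."""
    random.seed(seed)
    reject_counts = [0] * len(rates)
    for _ in range(reps):
        for i, rate in enumerate(rates):
            # Each patient responds independently with probability `rate`.
            responses = sum(random.random() < rate for _ in range(n_per_group))
            if responses >= cutoff:
                reject_counts[i] += 1
    return [c / reps for c in reject_counts]
```

Running `simulate_fixed([0.1, 0.3, 0.3, 0.3, 0.3])` reproduces, up to Monte Carlo error, the subgroup-specific row of Case 1 in Table 2: roughly 0.10 for the 10% subgroup and roughly 0.91 for the 30% subgroups.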

Fixed sample size (no interim monitoring)

We investigated settings with 5 and 10 subgroups. For an individual evaluation in each subgroup (no borrowing), a design that enrolls 25 patients in each subgroup and rejects the null hypothesis if there are 5 or more responses allows one to distinguish 30% vs. 10% response rates with a false-positive rate of 0.1 and a power of 90%. (This design provides so-called "strong" control of marginal false-positive error rates. That is, the false-positive error rate in each subgroup is no greater than 0.1 under the hypothesis that its true response rate is 10%, regardless of the response rates of the other subgroups.) In a corresponding hierarchical Bayesian design (with 25 patients in each group), the null hypothesis for the ith subgroup is rejected if the posterior probability that πi>10% is greater than a certain cut-off value. To facilitate a fair comparison of the two analysis strategies, the cut-off value needs to be chosen to provide the same strong control of false-positive error rates (at the 0.1 level) as with the subgroup-specific evaluation design (16). To accomplish this for the Thall et al. Model 1 with moderate borrowing, a cut-off of 0.850 was used in both settings. For the Thall et al. Model 1 with strong borrowing, cut-offs of 0.940 and 0.965 were used for the 5- and 10-subgroup settings, respectively. For the Berry et al. Model 2, cut-offs of 0.955 and 0.960 were used for the 5- and 10-subgroup settings, respectively.
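The stated operating characteristics of the 25-patient, 5-response design can be checked exactly from binomial tail probabilities (an illustrative Python sketch; the function name is ours):

```python
from math import comb

def prob_reject(n, p, cutoff):
    """Exact Pr(X >= cutoff) for X ~ Binomial(n, p): the probability the
    design rejects the null when the true response rate is p."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(cutoff, n + 1))

fp = prob_reject(25, 0.10, 5)     # ~0.098: marginal false-positive rate
power = prob_reject(25, 0.30, 5)  # ~0.91: power against the 30% alternative
```

These exact values (0.098 and 0.910) match the subgroup-specific rows of Tables 2 and 3 up to simulation error.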

Tables 2 and 3 display the empirical subgroup probabilities of a positive conclusion under a range of scenarios with five and ten subgroups, respectively. In the subgroup-specific analyses, the probabilities of a positive conclusion in a given subgroup are approximately 0.9 (0.1) when that subgroup's response rate is 30% (10%), regardless of the response rates in other subgroups. The hierarchical Bayesian strategies are shown to have no better power than the subgroup-specific strategy: The hierarchical Bayesian design Model 1 with moderate borrowing takes a conservative approach to borrowing and performs almost identically to the subgroup-specific approach. The hierarchical Bayesian designs that allow more borrowing (Model 1 with strong borrowing and Model 2) sometimes have nontrivially lower power than the subgroup-specific approach in cases where the treatment is active only in a minority of subgroups (Cases 3–4 in Tables 2 and 3). Note that under the global null (10% response in all subgroups), Model 1 with strong borrowing and Model 2 have false-positive error rates that are well below the nominal 0.1 level (Case 5 in Tables 2 and 3). This is the price of borrowing across subgroups in a design with strong control of the error rates. While it might seem tempting to forgo the strong error control by making it easier to reject null hypotheses, this would result in a design with unacceptably high false-positive error rates in some scenarios. For example, for Model 2 in Table 3, lowering the bar for rejection so that the Case 5 results were 0.1 would increase the false-positive error rate to 0.40 in Case 1 (for the first subgroup). We revisit the motivations for requiring strong control of the error rates in the Discussion.

Table 2.

Empirical probabilities of rejecting the null hypothesis: five subgroups (no interim monitoring, 25 patients per subgroup, 10,000 replications)

Design True response rate in each subgroup
Case 1 .1 .3 .3 .3 .3
Subgroup-specific analyses .096 .909 .912 .910 .914
HB* Model 1 moderate borrowing .096 .909 .912 .910 .914
HB Model 1 strong borrowing .098 .895 .893 .892 .891
HB Model 2 (Berry et al) .096 .899 .898 .896 .899
Case 2 .1 .1 .3 .3 .3
Subgroup-specific analyses .096 .096 .912 .910 .914
HB Model 1 moderate borrowing .096 .096 .912 .910 .914
HB Model 1 strong borrowing .085 .096 .842 .842 .842
HB Model 2 (Berry et al) .091 .091 .853 .855 .858
Case 3 .1 .1 .1 .3 .3
Subgroup-specific analyses .096 .096 .097 .910 .914
HB Model 1 moderate borrowing .096 .096 .097 .910 .914
HB Model 1 strong borrowing .058 .061 .058 .807 .809
HB Model 2 (Berry et al) .067 .065 .062 .817 .820
Case 4 .1 .1 .1 .1 .3
Subgroup-specific analyses .096 .096 .097 .096 .914
HB Model 1 moderate borrowing .096 .096 .097 .096 .914
HB Model 1 strong borrowing .037 .040 .038 .038 .762
HB Model 2 (Berry et al) .041 .041 .036 .043 .791
Case 5 .1 .1 .1 .1 .1
Subgroup-specific analyses .096 .096 .097 .096 .099
HB Model 1 moderate borrowing .096 .096 .097 .096 .099
HB Model 1 strong borrowing .025 .030 .029 .030 .025
HB Model 2 (Berry et al) .032 .033 .030 .030 .033
Case 6 .3 .3 .3 .3 .3
Subgroup-specific analyses .912 .909 .912 .910 .914
HB Model 1 moderate borrowing .912 .909 .912 .910 .914
HB Model 1 strong borrowing .907 .910 .907 .911 .911
HB Model 2 (Berry et al) .911 .908 .912 .910 .913
* HB = hierarchical Bayesian

Table 3.

Empirical probabilities of rejecting the null hypothesis: ten subgroups (no interim monitoring, 25 patients per subgroup, 10,000 replications)

Design True response rate in each subgroup
Case 1 .1 .3 .3 .3 .3 .3 .3 .3 .3 .3
Subgroup-specific analyses .096 .905 .908 .907 .909 .908 .910 .908 .909 .912
HB* Model 1 moderate borrowing .096 .905 .908 .907 .909 .908 .910 .908 .909 .912
HB Model 1 strong borrowing .099 .892 .892 .894 .897 .896 .903 .895 .895 .898
HB Model 2 (Berry et al) .099 .905 .908 .907 .908 .912 .910 .908 .909 .911
Case 2 .1 .1 .3 .3 .3 .3 .3 .3 .3 .3
Subgroup-specific analyses .096 .098 .908 .907 .909 .908 .910 .908 .909 .912
HB Model 1 moderate borrowing .096 .098 .908 .907 .909 .908 .910 .908 .909 .912
HB Model 1 strong borrowing .085 .087 .857 .857 .861 .858 .867 .859 .857 .861
HB Model 2 (Berry et al) .097 .098 .904 .903 .906 .905 .906 .904 .905 .907
Case 3 .1 .1 .1 .1 .1 .1 .1 .1 .3 .3
Subgroup-specific analyses .096 .098 .095 .095 .100 .099 .108 .094 .909 .912
HB Model 1 moderate borrowing .096 .098 .095 .095 .100 .099 .108 .094 .909 .912
HB Model 1 strong borrowing .019 .022 .020 .021 .023 .019 .022 .022 .677 .681
HB Model 2 (Berry et al) .032 .033 .031 .031 .036 .032 .037 .031 .783 .790
Case 4 .1 .1 .1 .1 .1 .1 .1 .1 .1 .3
Subgroup-specific analyses .096 .098 .095 .095 .100 .099 .108 .094 .098 .914
HB Model 1 moderate borrowing .096 .098 .095 .095 .100 .099 .108 .094 .098 .914
HB Model 1 strong borrowing .012 .014 .013 .013 .016 .012 .015 .015 .014 .656
HB Model 2 (Berry et al) .029 .029 .028 .028 .033 .030 .034 .027 .031 .747
Case 5 .1 .1 .1 .1 .1 .1 .1 .1 .1 .1
Subgroup-specific analyses .096 .098 .095 .095 .100 .099 .108 .094 .098 .094
HB Model 1 moderate borrowing .096 .098 .095 .095 .100 .099 .108 .094 .098 .094
HB Model 1 strong borrowing .009 .010 .010 .010 .012 .007 .011 .010 .010 .010
HB Model 2 (Berry et al) .022 .023 .022 .023 .026 .024 .026 .021 .024 .023
Case 6 .3 .3 .3 .3 .3 .3 .3 .3 .3 .3
Subgroup-specific analyses .913 .905 .908 .907 .909 .908 .910 .908 .909 .912
HB Model 1 moderate borrowing .913 .905 .908 .907 .909 .908 .910 .908 .909 .912
HB Model 1 strong borrowing .908 .910 .910 .910 .912 .910 .917 .910 .912 .911
HB Model 2 (Berry et al) .915 .906 .909 .908 .910 .908 .911 .909 .909 .913
* HB = hierarchical Bayesian

With Interim Inefficacy/Futility Monitoring

We assumed a setting with 5 subgroups. For the subgroup-specific analyses approach, we used two-stage designs for the inefficacy/futility monitoring: after 15 patients are accrued to a subgroup, accrual is stopped to that subgroup if there is one or fewer responses. Otherwise, an additional 10 patients are accrued to that subgroup, with the null hypothesis rejected if there are five or more responses in total. For each subgroup, the false-positive rate and power are 0.1 and 90%, respectively.
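The exact operating characteristics of this two-stage rule can be computed by summing over the possible first-stage outcomes (a Python sketch under the stated design parameters; the function names are ours):

```python
from math import comb

def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def two_stage_reject_prob(p, n1=15, stop_le=1, n2=10, reject_ge=5):
    """Exact rejection probability of the two-stage design: stop after n1
    patients if responses <= stop_le; otherwise enroll n2 more patients and
    reject the null if total responses >= reject_ge."""
    total = 0.0
    for x1 in range(stop_le + 1, n1 + 1):  # first-stage outcomes that continue
        need = max(reject_ge - x1, 0)      # second-stage responses still needed
        p_reject_stage2 = sum(binom_pmf(x2, n2, p) for x2 in range(need, n2 + 1))
        total += binom_pmf(x1, n1, p) * p_reject_stage2
    return total

fp = two_stage_reject_prob(0.1)     # ~0.093: false-positive rate
power = two_stage_reject_prob(0.3)  # ~0.898: power, slightly below 0.91
```

The small loss of power relative to the fixed 25-patient design is the cost of the interim futility look, consistent with the subgroup-specific rows of Tables 4 and 5.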

For the approach of LeBlanc et al, in addition to the first-stage individual subgroup inefficacy/futility monitoring described above (with 15 patients), we pool the data from all the subgroups and test whether the data are consistent with an overall response rate of 20%. If the hypothesis can be rejected at the 0.02 level, accrual to the whole trial is stopped and the treatment is considered inactive for all subgroups. The pooled analyses were performed when 40 and 80 patients had been accrued on the study overall.

For the hierarchical Bayesian approach, we followed the parameterization and the stopping rule proposed in Thall et al., stopping accrual to subgroup i if the posterior probability that πi>30% was <0.005. That probability was calculated for each subgroup when 40 and 80 patients had been accrued to the study overall. Subgroups that are not stopped continue to a sample size of 25. When accrual to the whole trial is over, the posterior probabilities that πi>10% are calculated for the remaining subgroups, with the null hypothesis rejected for those subgroups for which this probability is greater than cut-off values of 0.85 and 0.94 (for the moderate and strong borrowing models, respectively), chosen to control the false-positive error at 0.1.

Two accrual settings were considered: equal accrual rates (Table 4) and unequal accrual rates, with relative accrual rates of (30%, 20%, 20%, 20%, 10%) in groups 1–5, respectively (Table 5). The ability of a design to stop accrual early to the subgroups with disappointing results is measured by the average sample size. The hierarchical Bayesian and subgroup-specific evaluation approaches perform similarly in this respect and allow approximately a 20% reduction in accrual in inactive subgroups due to futility stopping. The LeBlanc approach allows for an additional reduction in sample size when most of the subgroups are inactive (Cases 1 and 2), albeit at the price of reduced power (Case 2). The loss of power can be substantial if the treatment is only active in the subgroup with the low accrual rate (Case 2 of Table 5).

Table 4.

Average sample size and empirical probability of rejecting the null hypothesis, design with interim futility monitoring: Equal subgroup accrual rates (10,000 replications)

Design True response rate in each subgroup
Case 1 .1 .1 .1 .1 .1
  Subgroup-specific analyses Average sample size 19.4 19.5 19.5 19.5 19.5
Rejection probability .093 .096 .094 .098 .098
      HB* Model 1 Moderate borrowing Average sample size 19.6 19.6 19.6 19.6 19.4
Rejection probability .091 .094 .094 .097 .097
  Simple borrowing (LeBlanc et al) Average sample size 16.2 16.3 16.3 16.3 16.3
Rejection probability .064 .067 .069 .068 .071
Case 2 .1 .1 .1 .1 .3
  Subgroup-specific analyses Average sample size 19.5 19.5 19.5 19.5 24.6
Rejection probability .092 .089 .096 .087 .900
      HB Model 1 Moderate borrowing Average sample size 20.0 20.0 20.0 19.9 24.4
Rejection probability .092 .090 .097 .088 .892
  Simple borrowing (LeBlanc et al) Average sample size 18.6 18.6 18.7 18.7 22.8
Rejection probability .086 .084 .091 .084 .776
Case 3 .1 .1 .1 .3 .3
  Subgroup-specific analyses Average sample size 19.4 19.4 19.4 24.6 24.6
Rejection probability .093 .092 .089 .897 .900
      HB Model 1 Moderate borrowing Average sample size 20.4 20.4 20.4 24.5 24.6
Rejection probability .094 .093 .090 .895 .894
  Simple borrowing (LeBlanc et al) Average sample size 19.3 19.2 19.3 24.0 24.1
Rejection probability .092 .091 .089 .875 .874
Case 4 .1 .1 .3 .3 .3
  Subgroup-specific analyses Average sample size 19.5 19.5 24.6 24.7 24.6
Rejection probability .097 .095 .900 .901 .892
      HB Model 1 Moderate borrowing Average sample size 21.0 21.0 24.6 24.6 24.6
Rejection probability .100 .098 .901 .902 .891
  Simple borrowing (LeBlanc et al) Average sample size 19.5 19.4 24.5 24.6 24.6
Rejection probability .097 .095 .898 .899 .889
* HB = hierarchical Bayesian

Table 5.

Average sample size and empirical probability of rejecting the null hypothesis, design with interim futility monitoring: Unequal subgroup accrual rates (30%, 20%, 20%, 20%,10%) (10,000 replications)

Design True response rate in each subgroup
Case 1 .1 .1 .1 .1 .1
  Subgroup-specific analyses Average sample size 19.5 19.5 19.5 19.4 19.6
Rejection probability .095 .090 .093 .097 .098
HB* Model 1 moderate borrowing Average sample size 20.7 19.8 19.6 19.7 21.2
Rejection probability .094 .091 .093 .100 .096
HB Model 1 strong borrowing Average sample size 20.0 19.2 19.2 19.0 20.4
Rejection probability .026 .025 .024 .027 .022
  Simple borrowing (LeBlanc et al) Average sample size 18.3 16.6 16.6 16.5 13.2
Rejection probability .077 .070 .069 .072 .055
Case 2 .1 .1 .1 .1 .3
  Subgroup-specific analyses Average sample size 19.5 19.4 19.5 19.5 24.6
Rejection probability .093 .085 .097 .094 .900
HB Model 1 moderate borrowing Average sample size 20.8 20.2 20.2 20.1 24.7
Rejection probability .096 .087 .097 .095 .898
HB Model 1 strong borrowing Average sample size 20.8 20.3 20.3 20.0 24.2
Rejection probability .033 .037 .033 .037 .742
  Simple borrowing (LeBlanc et al) Average sample size 18.9 18.0 18.1 18.1 19.6
Rejection probability .088 .077 .087 .086 .645
Case 3 .1 .1 .1 .3 .3
  Subgroup-specific analyses Average sample size 19.4 19.5 19.5 24.6 24.6
Rejection probability .093 .097 .097 .893 .897
HB Model 1 moderate borrowing Average sample size 21.0 20.5 20.5 24.5 24.7
Rejection probability .096 .098 .099 .891 .897
HB Model 1 strong borrowing Average sample size 21.9 21.8 21.8 24.5 24.7
Rejection probability .051 .057 .062 .800 .796
  Simple borrowing (LeBlanc et al) Average sample size 19.3 19.2 19.1 24.0 23.6
Rejection probability .093 .096 .096 .857 .848
Case 4 .1 .1 .3 .3 .3
  Subgroup-specific analyses Average sample size 19.6 19.5 24.7 24.6 24.6
Rejection probability .095 .099 .898 .899 .901
HB Model 1 moderate borrowing Average sample size 21.3 21.0 24.7 24.6 24.7
Rejection probability .099 .102 .898 .898 .903
HB Model 1 strong borrowing Average sample size 23.1 23.2 24.8 24.8 24.8
Rejection probability .085 .085 .842 .845 .840
  Simple borrowing (LeBlanc et al) Average sample size 19.6 19.4 24.5 24.5 24.5
Rejection probability .096 .099 .891 .892 .892
* HB = hierarchical Bayesian

Discussion

Theoretically, the use of outcome-adaptive borrowing across subgroups is attractive: one could increase the power to find effective treatments in subgroups by borrowing the information from other subgroups only when the data suggest it is reasonable to do so. Unfortunately, our experience with the models proposed for phase II clinical trials by experts in the hierarchical methodology suggests that this approach does not work for identifying responsive subgroups in the phase II setting with 10 or fewer subgroups. The essential problem is that there is not enough information in the data to determine whether borrowing is appropriate. Therefore, the results obtained are sensitive to the details of the hierarchical model specified for the data. With the same observed response rates but with different model parameterizations, it is possible to make the borrowing across subgroups either extremely difficult or extremely easy. For example, using the Thall et al. model, increasing the mean of the hyperprior for the precision parameter increases the amount of borrowing: Table 6 considers borrowing under the two scenarios described in Table 1. For Scenario 1, the posterior probability of the response rate in group 5 being greater than 30% is approximately the same regardless of the amount of borrowing. On the other hand, for Scenario 2 the posterior probability of the response rate in group 5 being greater than 30% decreases considerably with stronger borrowing. Thus, borrowing makes the observed response rate in subgroup 5 more believable in Scenario 1 than in Scenario 2. No parameterization, however, is able to use the data to accurately determine whether borrowing is appropriate or not.

Table 6.

Examples of different degrees of borrowing for the data in Table 1

Model                                                Pr(response rate in subgroup 5 > 30% | data)
                                                     Scenario 1        Scenario 2
Subgroup-specific analysis
  (no borrowing; beta-binomial model)                .437              .437
Hierarchical model (Thall et al):
  No borrowing
    (precision hyperprior mean .01, Gamma(2,200))    .459              .446
  Moderate borrowing
    (precision hyperprior mean .1, Gamma(2,20))      .453              .382
  Strong borrowing
    (precision hyperprior mean 1, Gamma(2,2))        .464              .160

It should be noted that we conducted our evaluations under strong control of the false-positive error rates to avoid unacceptably high false-positive error rates for some subgroups. Our rationale for use of the strong error control framework is as follows: the purpose of conducting phase II trials is to minimize the number of (large) negative phase III trials by screening out clinical settings where agents do not work. Therefore, relinquishing the strong control of the false-positive error rates would defeat the very purpose of conducting phase II evaluations. Although response rates measured on five to ten subgroups are not enough data to determine whether borrowing is appropriate, in the setting of high-dimensional data where treatment effects on thousands of genes are measured, hierarchical models involving borrowing of information across genes can be quite successful (17).

If outcome-adaptive borrowing does not work in the phase II trial setting, what about simple pooling methods for inefficacy/futility monitoring that are not outcome-adaptive (11)? These methods do not increase the false-positive (type 1) errors but potentially reduce power, so their application needs to be considered carefully. One would want to be in a situation in which there is reasonably strong biological rationale as to why the subgroups should have similar treatment effects. We now give our recommendations for the three scenarios mentioned in the beginning of this commentary. In the first scenario, patients with multiple cancer histologies expressing a particular molecular target, there often could be rationale for pooling across histological subgroups. (Caution is still warranted here, as illustrated by a recent experience with a BRAF inhibitor that is highly active in BRAF mutant melanoma while showing disappointing activity in BRAF mutant colorectal cancer (18).) In the second scenario, subgroups defined by multiple histological subtypes of a given cancer, the reasonableness of pooling would depend on the cancer and the subtypes. (Consensus on whether pooling is appropriate in a given setting may still be difficult to reach; e.g., in soft tissue sarcoma, Chugh et al. (2) used a design with borrowing while Maki et al. (19) believed pooling was not appropriate and conducted an independent investigation in each subgroup.) For the third scenario, biomarker-defined molecular subtypes of a given cancer histology, we generally cannot see any clear rationale for borrowing of information. (This opinion is apparently shared by the BATTLE investigators (3).) It should be noted that in this paper we are not considering situations where patient subgroups are defined by levels of biomarker expression that are expected to be related to response; in that case, sequential procedures that borrow information according to the ordered nature of the subgroups could be appropriate.

To summarize our results, in the phase II setting the outcome-adaptive approach does not seem to have enough information to determine whether borrowing is appropriate, and is therefore of limited use. Designs that borrow (pool) information across subgroups can reduce power to detect treatments that work only in some subgroups. In specialized settings where there is reasonable biologic rationale for expecting similar treatment effects in the subgroups, simple pooling of information across subgroups may be appropriate.

Appendix

Thall et al. (9) proposed the following parameterization of the hierarchical Bayesian model. For the response rates πi, a logistic model was assumed: θi = log{πi/(1 − πi)} for i = 1,…, K. The θi are assumed to be independent and identically distributed normal variables with mean μ and variance 1/τ (note that the parameter τ represents precision in this parameterization). The hyperprior distribution for μ was assumed to be normal with mean equal to logit(0.2) = −1.386 and variance 10. The mean of the hyperprior for μ was set to the logit of 0.20 to represent the prior belief that the average response rate is half-way between the interesting response rate of 0.3 and the uninteresting response rate of 0.1. For the hyperprior distribution for τ, Thall et al. used a gamma distribution with parameters 2 and 20. This distribution has a mean of 0.1, corresponding to a relatively low precision in the prior distribution of θi, and thus results in a relatively modest amount of borrowing. To allow more borrowing, we also considered a hyperprior (for τ) with a larger mean of 1 (gamma distribution with parameters 2 and 2), corresponding to a higher precision in the prior distribution of θi.
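The numerical pieces of this parameterization can be written down directly (a Python sketch; the names are illustrative). Note that logit(0.2) = log(0.2/0.8) ≈ −1.386, and that a Gamma(shape, rate) distribution has mean shape/rate:

```python
from math import log

def logit(p):
    """Log-odds transform used for the theta_i in the Thall et al. model."""
    return log(p / (1 - p))

# Hyperprior mean for mu: logit(0.2) = log(0.25) ~ -1.386.
mu_mean = logit(0.2)

# Gamma(shape, rate) hyperpriors for the precision tau; mean = shape / rate.
# A larger mean precision corresponds to less spread among the theta_i,
# i.e. more borrowing across subgroups.
gamma_means = {
    "no borrowing": 2 / 200,        # Gamma(2, 200), mean 0.01 (used in Table 6)
    "moderate borrowing": 2 / 20,   # Gamma(2, 20), mean 0.1
    "strong borrowing": 2 / 2,      # Gamma(2, 2), mean 1
}
```

This makes explicit how the three hyperprior choices in Table 6 differ only in the mean precision, which directly scales the degree of borrowing.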

Berry et al. (13) assumed that the πi are randomly sampled from a beta distribution with parameters a and b, with independent hyperprior distributions for a and b. In our simulations, we used a uniform distribution on [0, 4] for a and a uniform distribution on [0, 16] for b. As in the Thall et al. model, the hyperprior parameters were selected to reflect the prior belief that the average response rate is 0.20.

References

1. Seymour L, Ivy SP, Sargent D, Spriggs D, Baker L, Rubinstein L, et al. The design of phase II clinical trials testing cancer therapeutics: consensus recommendations from the Clinical Trial Design Task Force of the National Cancer Institute Investigational Drug Steering Committee. Clin Cancer Res. 2010;16:1764–1769. doi: 10.1158/1078-0432.CCR-09-3287.
2. Chugh R, Wathen JK, Maki RG, Benjamin RS, Patel SR, Meyers PA, et al. Phase II multicenter trial of imatinib in 10 histologic subtypes of sarcoma using a Bayesian hierarchical statistical model. J Clin Oncol. 2009;27:3148–3153. doi: 10.1200/JCO.2008.20.5054.
3. Zhou X, Liu S, Kim ES, Herbst RS, Lee JJ. Bayesian adaptive design for targeted therapy development in lung cancer: a step toward personalized medicine. Clin Trials. 2008;5:181–193. doi: 10.1177/1740774508091815.
4. Berry DA. Introduction to Bayesian methods III: use and interpretation of Bayesian tools in design and analysis. Clin Trials. 2005;2:295–300. doi: 10.1191/1740774505cn100oa.
5. Hobbs BP, Carlin BP. Practical Bayesian design and analysis for drug and device clinical trials. J Biopharm Stat. 2008;18:54–80. doi: 10.1080/10543400701668266.
6. Biswas S, Liu DD, Lee JJ, Berry DA. Bayesian clinical trials at the University of Texas M. D. Anderson Cancer Center. Clin Trials. 2009;6:205–216. doi: 10.1177/1740774509104992.
7. Lindley DV, Smith AFM. Bayes estimates for the linear model. Journal of the Royal Statistical Society, Series B. 1972;34:1–41.
8. Kass RE, Steffey D. Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). Journal of the American Statistical Association. 1989;84:717–726.
9. Thall PF, Wathen JK, Bekele BN, Champlin RE, Baker LH, Benjamin RS. Hierarchical Bayesian approaches to phase II trials in diseases with multiple subtypes. Stat Med. 2003;22:763–780. doi: 10.1002/sim.1399.
10. Berry DA. A guide to drug discovery: Bayesian clinical trials. Nature Reviews Drug Discovery. 2006;5:27–36. doi: 10.1038/nrd1927.
11. LeBlanc M, Rankin C, Crowley J. Multiple histology phase II trials. Clinical Cancer Research. 2009;15:4256–4262. doi: 10.1158/1078-0432.CCR-08-2069.
12. Carlin BP, Louis TA. Bayes and Empirical Bayes Methods for Data Analysis. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC; 2000.
13. Berry SM, Carlin BP, Lee JJ, Muller P. Bayesian Adaptive Methods for Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC; 2010.
14. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2011.
15. Spiegelhalter D, Thomas A, Best N. Bayesian Inference Using Gibbs Sampling Manual (BUGS 0.5). Cambridge, United Kingdom: MRC Biostatistics Unit; 1995.
16. United States Food and Drug Administration. Draft guidance for industry on adaptive design clinical trials for drugs and biologics. 2010. Available from: http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm201790.pdf.
17. Ibrahim JG, Chen M-H, Gray RJ. Bayesian models for gene expression with DNA microarray data. Journal of the American Statistical Association. 2002;97:88–99.
18. Kopetz S, Desai J, Chan E, Hecht JR, O'Dwyer PJ, Lee RJ, et al. PLX4032 in metastatic colorectal cancer patients with BRAF tumors. 2010 ASCO Annual Meeting. J Clin Oncol. 2010;28:15s (abstr 3534).
19. Maki RG, D'Adamo DR, Keohan ML, Saulle M, Schuetze SM, Undevia SD, et al. Phase II study of sorafenib in patients with metastatic or recurrent sarcomas. Journal of Clinical Oncology. 2009;27:3133–3140. doi: 10.1200/JCO.2008.20.4495.
