Author manuscript; available in PMC: 2023 Mar 1.
Published in final edited form as: Contemp Clin Trials. 2022 Jan 24;114:106688. doi: 10.1016/j.cct.2022.106688

A Likely Responder Approach for the Analysis of Randomized Controlled Trials

Eugene Laska a,b, Carole Siegel a,b, Ziqiang Lin a
PMCID: PMC8934276  NIHMSID: NIHMS1778572  PMID: 35085831

Abstract

Objective

To further the precision medicine goal of tailoring medical treatment to individual patient characteristics by providing a method of analysis of the effect of a test treatment, T, compared to a reference treatment, R, in participants in a RCT who are likely responders to T.

Methods

Likely responders to T are individuals whose expected response, predicted from baseline characteristics, meets or exceeds a prespecified minimum. A prognostic score, the expected response predicted as a function of baseline covariates, is obtained at trial completion. It is a balancing score that can be used to match likely responders randomized to T with those randomized to R; the result is comparable treatment groups that have a common covariate distribution. Treatments are compared based on observed outcomes in this enriched sample. The approach is illustrated in a RCT comparing two treatments for opioid use disorder.

Results

A standard statistical analysis of the opioid use disorder RCT found no treatment difference in the total sample. However, a subset of likely responders to T was identified, and in this group T was statistically superior to R.

Conclusion

The causal treatment effect of T relative to R among likely responders may be more important than the effect in the whole target population. The prognostic score function provides quantitative information to support patient-specific treatment decisions regarding T, furthering the goal of precision medicine.

Keywords: causal inference, enriched sample, precision medicine, prognostic balancing score, likely responders

1. Introduction

We propose a method of analysis of a RCT¹, the LR² method, that furthers the goals of precision medicine. PM³ “… refers to the tailoring of medical treatment to the individual characteristics of each patient” [1] and seeks to incorporate “individual variability in genes, environment, and lifestyle for each person” [2]. PM requires that there is a set of individual characteristics or features, x, that are predictive of intervention outcomes with a reasonable degree of accuracy. A function that yields an estimate of the expected value of the outcome, Y, in response to treatment T for a patient with features x is called a prognostic score function, denoted by E[Y|T, x]. Treatment T is “tailored” for a patient with characteristics x if the expected level of response to T is sufficiently high that it meets or exceeds a clinically desired outcome, i.e., E[Y|T, x] ≥ minCond⁴. A patient meeting this condition is called a LR to T. Also, just as for any treatment requiring regulatory approval, the expected level of response to T must be superior to the expected response to placebo, usual care or some other appropriate control, R, in the target population. That is, it must be shown in a RCT that E[Y|T, x] > E[Y|R, x] for LRs.

The classical analysis of a RCT tests whether the average treatment effect is zero in the whole sample, i.e., whether E[Y|T] = E[Y|R]. This is clearly insufficient for informing individualized clinical decisions [3]. In the LR method of analysis, a model for the prognostic score function is fit at the end of the trial using data from those who received T. It is used to identify individuals who meet the requirement that they are LRs to T. Some LRs were randomized to T and some to R. The inference step is to compare treatments in the set of LRs to T. The LR analysis is performed under the potential outcome framework [3–9].

Many tools in the fields of statistics, predictive analytics and machine learning can be used to obtain an estimate of the prognostic score function, and the method used does not affect the theory we propose. If a well calibrated prognostic score function cannot be found, then the fundamental PM assumption that baseline characteristics predict response may not be true. In that case, a LR analysis is not possible and the traditional methods of analysis are appropriate.

Particularly when the dimension of x is large, the chance of misspecification of the prognostic score function is considerable. Misspecification would lead to incorrect identification of LRs to T and cannot be ignored. To address this possibility, we introduce the likely responder dry run diagnostic, LR-DRD⁵, a method for evaluating confounding bias in treatment effect estimation.

The details of the LR and the LR-DRD methods and an example RCT [10] are presented below. In Section 2.1, we describe the general setting and the underlying assumptions of the causal analysis. The first step is to identify the measure that defines a causal treatment difference. It is the population average treatment effect, PATE, described in Section 2.2. The null hypothesis to be tested is that the PATE = 0 among the LRs to T. The definition of membership in the target population of LRs must be given by a function of pre-treatment values in order for the PATE to be properly evaluated. This is given in Section 2.3. The prognostic balancing score is introduced in Section 2.4 and its critical role in estimating the average treatment effect among the LRs is explained. In a RCT, randomization guarantees valid causal inference. But the LRs are a nonrandom subset, so the traditional testing methods do not apply. Sections 2.5 and 2.6 provide the technical detail of valid methods for statistically testing the null hypothesis PATE = 0 among the LRs. The new LR Dry Run Diagnostic method for assessing confounding bias in estimating the prognostic balancing score is described in Section 2.7. In Section 3, an analysis of a RCT for treating opioid use disorder is given as an illustration of the LR and the LR-DRD methods. We conclude with some remarks in the Discussion, Section 4.

2. Methodology

2.1. The general setting and assumptions

We consider an RCT of T compared to a standard comparator R. Associated with each individual u in a population, Ω, is a pre-randomization K-dimensional vector of features denoted by x ∈ ℝ^K, where x is a realization of the random variable X. The elements of x may be continuous or categorical, and they may be correlated. They may include, e.g., clinical information, demographic and environmental characteristics, and biomarkers from, e.g., genomics, metabolomics, proteomics, and imaging. These features are believed to be predictive of Y.

The N subjects in the study are randomly assigned to T or to R, with, for simplicity, equal numbers in each arm. Let W = +1 represent assignment to T and W = −1 assignment to R, with P[W = 1] = P[W = −1] = 1/2. We sometimes abbreviate P[W = 1] as P[T] and P[W = −1] as P[R]. The observed data (y_u, w_u, x_u) are realizations of N independent and identically distributed multivariate random variables (Y_u, W_u, X_u), where y_u is the post-treatment outcome for subject u who is assigned to treatment w_u and has feature vector x_u, with distribution function F(Y|W, x). Let w = (w_1, w_2, …, w_N) denote the vector of treatment assignments observed in the sample. Probability distributions on the triplet are induced by sampling from Ω and the randomized treatment assignment process.

We assume that each feature vector x determines a unique E[Y|T, x], which induces unique distributions over the covariate space. In particular, all individuals with the same feature vector x have the same expected outcome, E[Y|T, x]. We make the usual assumptions in causal analysis, some of which follow because the data are from a RCT:

  1. Consistency: the observed outcome is the potential outcome under the assigned treatment, Y = ½ [(1 + W) Y|T + (1 − W) Y|R].

  2. Ignorability: there are no unmeasured confounders, i.e., W is independent of Y|T and Y|R conditional on X.

  3. Positivity: for every individual, the probability of receiving either treatment is greater than zero.

  4. SUTVA: the stable unit treatment value assumption. Observations on one individual are unaffected by the treatment assignment of every other individual.

2.2. Individual and population causal treatment effects

Following the Rubin/Neyman approach to causal inference, the individual treatment effect of T on u is defined as the difference between u’s potential outcomes [3–9]. These are the two hypothetical but fixed outcomes: the one that would have resulted had u received T and the one that would have resulted had u received R. All else being equal, the difference between these so-called counterfactual outcomes must be causally attributed to the treatments and is considered to be the causal effect of T with respect to R. The causal expected or predicted individual treatment effect, PITE [11–13], for subject u with feature vector x_u is

$$\mathrm{PITE}(x_u) = E[Y_u \mid T, x_u] - E[Y_u \mid R, x_u].$$

For any measurable set Ψ ⊆ Ω, the causal population average treatment effect, PATE in Ψ is

$$\mathrm{PATE}(\Psi) = E[\mathrm{PITE}(x_u) \mid u \in \Psi] = E[E(Y \mid T, x_u) - E(Y \mid R, x_u) \mid u \in \Psi]. \tag{1}$$

The sets Ψ that we consider are the total sample Ω, and the subgroup of LRs, Φ. If Y is a Bernoulli random variable indicating successful response, then E[Y|T, x] = P[Y=1|T, x], the probability of response for an individual with features, x. Then

$$\mathrm{PATE}(\Psi) = E[P(Y_u = 1 \mid T, x_u) - P(Y_u = 1 \mid R, x_u) \mid u \in \Psi]. \tag{2}$$

The principal null hypothesis to be tested is PATE(Φ) = 0. The test of PATE(Ω) = 0 is optional.
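As a minimal sketch of the definitions of PITE and PATE, the snippet below averages individual effects over the whole sample Ω and over the likely responders Φ. The expected responses and the threshold are hypothetical values chosen only for illustration; in practice neither E[Y|T, x] nor E[Y|R, x] is known and both must be estimated.

```python
from statistics import mean

def pite(e_t, e_r):
    """Predicted individual treatment effect: E[Y|T, x] - E[Y|R, x]."""
    return e_t - e_r

def pate(e_t, e_r, in_subset):
    """Average PITE over the individuals flagged in `in_subset` (the set Psi)."""
    effects = [pite(t, r) for t, r, keep in zip(e_t, e_r, in_subset) if keep]
    if not effects:
        raise ValueError("the subset is empty")
    return mean(effects)

# Hypothetical expected responses for four individuals (illustrative only).
e_t = [0.8, 0.6, 0.3, 0.2]   # E[Y|T, x_u]
e_r = [0.5, 0.5, 0.4, 0.3]   # E[Y|R, x_u]
min_cond = 0.5               # assumed prespecified threshold

omega = [True] * len(e_t)              # the whole sample
phi = [t >= min_cond for t in e_t]     # likely responders to T

pate_omega = pate(e_t, e_r, omega)
pate_phi = pate(e_t, e_r, phi)
print(f"PATE(Omega) = {pate_omega:.2f}, PATE(Phi) = {pate_phi:.2f}")
```

In this toy example the effect averaged over everyone is small while the effect among likely responders is larger, which is the situation the LR analysis is designed to detect.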

2.3. An estimand for the Likely Responder analysis

There has been much attention paid to estimands in the recent statistical literature and regulatory agencies’ guidance documents [14–18]. The target population must be specified in terms of pre-treatment values in order for the PATE to be evaluated in a “proper set”. Here, the target population is LRs to T. Membership in the set is determined by the covariate vector x and the potential outcomes [8,9], all of which are fixed pre-treatment values unaltered by treatment assignment. Below are the specifics of a LR estimand for Φ that comply with these conditions.

  1. The target population, Φ, is the set of individuals whose feature vectors make them LRs to T, determined by the condition E[Y|T, x] ≥ minCond, where the minCond is set pre-trial. A sample estimate, Ê[Y|T, x], is used to identify them.

  2. The principal endpoint is the outcome response measure, Y, in Φ.

  3. The population summary variable is a specified estimate of the PATE(Φ), as discussed in Section 2.2.

The population summary variable for LRs is a principal stratum direct effect as described by Frangakis and Rubin [19], where the stratum comprises individuals who are LRs to T regardless of the treatment to which they were assigned.
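The identification of the target population can be sketched as follows: a prognostic score is fit on subjects assigned to T only, every participant is then scored, and those meeting the minCond are flagged as LRs regardless of arm. The cell-mean model, the record layout (`x`, `w`, `y`) and the toy data are assumptions made for illustration; any well calibrated model could take the place of `fit_prognostic_score`.

```python
from collections import defaultdict

def fit_prognostic_score(records):
    """Estimate E[Y|T, x] from T-arm records only, as the mean outcome per feature cell."""
    sums, counts = defaultdict(float), defaultdict(int)
    for rec in records:
        if rec["w"] == 1:                      # subjects assigned to T only
            sums[rec["x"]] += rec["y"]
            counts[rec["x"]] += 1
    if not counts:
        raise ValueError("no subjects assigned to T")
    overall = sum(sums.values()) / sum(counts.values())
    cell_mean = {x: sums[x] / counts[x] for x in counts}
    # Cells never observed on T fall back to the overall T-arm mean.
    return lambda x: cell_mean.get(x, overall)

def likely_responders(records, score, min_cond):
    """Subjects from either arm whose estimated score meets the prespecified minCond."""
    return [rec for rec in records if score(rec["x"]) >= min_cond]

# Hypothetical toy trial: one categorical feature, w = +1 for T and -1 for R.
trial = [
    {"x": "a", "w": 1, "y": 1}, {"x": "a", "w": 1, "y": 1}, {"x": "a", "w": 1, "y": 0},
    {"x": "b", "w": 1, "y": 0}, {"x": "b", "w": 1, "y": 0}, {"x": "b", "w": 1, "y": 1},
    {"x": "a", "w": -1, "y": 0}, {"x": "b", "w": -1, "y": 1},
]

score = fit_prognostic_score(trial)
lrs = likely_responders(trial, score, min_cond=0.5)
print(len(lrs), "likely responders; arms present:", sorted({rec["w"] for rec in lrs}))
```

Because membership depends only on x through the fitted score, subjects randomized to R are classified by the same rule as those randomized to T.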

2.4. Prognostic balancing score and average treatment effect

Based on the definition of causal effects, the difference between treatments can only be validly estimated if the treatment contrasts are between individuals with the same, or at least nearly the same, feature vector x. Randomization no longer guarantees valid causal inference because the analysis is on a nonrandom subset, the LRs. Therefore, individuals need to be matched on their feature profiles, a task that can be quite challenging. This can be accomplished using the prognostic balancing score E[Y|T, x]. A prognostic balancing score is any function of X, say b(X), that has the property that the conditional distribution of Y given treatment with T and b(X) is independent of X. That is, X ⊥ Y | T, b(X), where U ⊥ V indicates that the random variables U and V are independent. Its estimate, Ê[Y|T, x], is obtained using only subjects randomly assigned to T. Hansen [20] showed that to estimate causal effects, it suffices to form strata of subjects by matching on a prognostic balancing score.

For any measurable set M ⊆ Ω, b(X) is sufficient for Y for u ∈ M. Hansen [20] proved that conditional on b(X), the distributions of X in the treatment and reference groups are equal, i.e., they are balanced. Just as for propensity scores, he concluded that “under strongly ignorable treatment assignment, units with the same value of the balancing score but different treatments can act as controls for each other, in the sense that the expected difference in their responses equals the average treatment effect.” In particular, the conclusion applies in a RCT when M is the set of likely responders, Φ. The properties of prognostic score balance parallel the Rosenbaum and Rubin [21] account of propensity score balance in terms of sufficiency and conditional independence, with a few issues that are not in question in an RCT. In particular, there is no hidden bias in treatment allocation nor in effect modification, and the assignment probability is positive for both treatments.

2.5. Forming prognostic balancing score strata

Although developed on the data obtained from individuals assigned to T, the estimated prognostic score, Ê[Y_u|T, x_u], can be evaluated for every participant u in the RCT, including those assigned to R. Similar to propensity score methods, the resulting scores can be used for matching or subclassification. Strata comprising individuals who have similar pre-randomization E[Y|T, x] are homogeneous with respect to balance. For inference in the LR population with respect to PATE(Φ), strata are formed by rank ordering the estimated prognostic scores of all individuals whose estimated E[Y|T, x] ≥ minCond and dividing them into quantiles. The number of quantiles to form depends on the size of the sample. Some guidance may be provided by the literature on propensity scores, E[W|x], about which much has been written [21, 22]. Stratifying propensity scores into five quantiles removes more than 90% of the bias due to the distribution of the covariates [22]. Whether similar results obtain for matching on prognostic scores remains to be determined. For inference over the whole sample with respect to PATE(Ω), strata are formed by rank ordering the prognostic scores of all trial participants and dividing them into quantiles.

Within each stratum formed for Φ and for Ω, the difference between the observed mean outcomes Y on T and on R is an approximately unbiased estimator of the within-stratum causal effect. For Bernoulli measures, it is the difference in the sample proportions of responders. In the potential outcome framework, averages of the stratum differences yield the approximately unbiased estimators $\widehat{\mathrm{PATE}}(\Phi)$ and $\widehat{\mathrm{PATE}}(\Omega)$. Confidence intervals can be obtained using a bootstrap or other method.
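The stratified estimator described above can be sketched in a few lines. Subjects are rank ordered on their estimated prognostic scores, cut into equal-count strata, and the within-stratum differences in mean outcome are averaged with equal weights. The equal-count cut, the equal weighting and the toy data are assumptions made for illustration.

```python
from statistics import mean

def stratified_pate(scores, w, y, n_strata):
    """Average of within-stratum (mean on T minus mean on R) differences.

    scores: estimated prognostic scores; w: +1 for T, -1 for R; y: observed outcomes.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])   # rank order on the score
    n = len(order)
    diffs = []
    for s in range(n_strata):
        stratum = order[s * n // n_strata:(s + 1) * n // n_strata]
        y_t = [y[i] for i in stratum if w[i] == 1]
        y_r = [y[i] for i in stratum if w[i] == -1]
        if not y_t or not y_r:
            raise ValueError("each stratum needs subjects from both arms")
        diffs.append(mean(y_t) - mean(y_r))
    return mean(diffs), diffs

# Hypothetical data for eight subjects already restricted to the set of interest.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
w = [1, -1, 1, -1, 1, -1, 1, -1]
y = [0, 0, 1, 1, 1, 0, 1, 0]

estimate, stratum_diffs = stratified_pate(scores, w, y, n_strata=2)
print("within-stratum differences:", stratum_diffs, "estimate:", estimate)
```

A bootstrap confidence interval could be layered on top by resampling subjects and repeating the whole procedure, including the fit of the prognostic score.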

2.6. Testing if the causal estimand differs from zero

Various approaches can be used to test hypotheses about treatment effects in the whole sample and in the group of LRs. If parametric assumptions are plausible, likelihood methods can be employed. A randomization test [23,24] can be used in which the entire statistical analysis is repeated after each choice of treatment assignment. The fact that the LR subset was statistically determined must be taken into account.
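A minimal sketch of such a randomization test follows. For each re-randomized assignment vector the entire analysis is repeated: the prognostic score is refit on the new T arm, LRs are re-identified, and the effect is re-estimated, so the data-driven selection of the LR subset is reflected in the reference distribution. The cell-mean score, the single-stratum contrast, and the simulated data are simplifying assumptions, not the analysis used in the paper.

```python
import random
from collections import defaultdict
from statistics import mean

def lr_effect(x, w, y, min_cond):
    """Whole LR analysis for one assignment vector: fit, select LRs, estimate."""
    sums, counts = defaultdict(float), defaultdict(int)
    for xi, wi, yi in zip(x, w, y):
        if wi == 1:                                # prognostic model uses the T arm only
            sums[xi] += yi
            counts[xi] += 1
    overall = sum(sums.values()) / sum(counts.values())
    score = {xi: sums[xi] / counts[xi] for xi in counts}
    lr = [i for i, xi in enumerate(x) if score.get(xi, overall) >= min_cond]
    y_t = [y[i] for i in lr if w[i] == 1]
    y_r = [y[i] for i in lr if w[i] == -1]
    if not y_t or not y_r:
        return 0.0                                 # no contrast available
    return mean(y_t) - mean(y_r)

def randomization_p_value(x, w, y, min_cond, n_perm=2000, seed=0):
    """Two-sided p-value from re-randomizing w and repeating the full analysis."""
    rng = random.Random(seed)
    observed = abs(lr_effect(x, w, y, min_cond))
    w_perm = list(w)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(w_perm)                        # keeps the arm sizes fixed
        if abs(lr_effect(x, w_perm, y, min_cond)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)

# Simulated trial with no treatment effect (illustrative only).
rng = random.Random(1)
x = [rng.choice("abcd") for _ in range(80)]
w = [1] * 40 + [-1] * 40
rng.shuffle(w)
y = [1 if rng.random() < 0.5 else 0 for _ in range(80)]

p_value = randomization_p_value(x, w, y, min_cond=0.5)
print(f"randomization p-value: {p_value:.3f}")
```

Adding one to the numerator and denominator counts the observed assignment among the re-randomizations, which keeps the p-value strictly positive.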

If the null treatment hypothesis is to be tested in Ω as well as in Φ in an application, there are several options for the order of testing to preserve the familywise error rate. Possible strategies include testing using a Bonferroni type correction or using a closed testing step-down procedure. As outlined in a recent FDA Guidance Document [25], depending on their number, the LR subgroup may provide more power than the full sample, which would dictate the order of testing. However, if the fraction of the sample that meets the minCond is small, then the power in Φ may be severely limited.
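The two multiplicity strategies mentioned above can be illustrated with a short sketch. Holm's step-down procedure is used here as one familiar closed testing shortcut; the choice of Holm and the p-values are assumptions made for illustration only.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / (number of tests)."""
    m = len(p_values)
    return {name: p <= alpha / m for name, p in p_values.items()}

def holm(p_values, alpha=0.05):
    """Holm step-down: test p-values in increasing order, stop at the first failure."""
    m = len(p_values)
    decisions = {name: False for name in p_values}
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    for rank, (name, p) in enumerate(ordered):
        if p > alpha / (m - rank):
            break
        decisions[name] = True
    return decisions

# Hypothetical p-values for the two null hypotheses (illustrative only).
p_values = {"PATE(Phi) = 0": 0.004, "PATE(Omega) = 0": 0.030}

bonferroni_result = bonferroni(p_values)
holm_result = holm(p_values)
print("Bonferroni:", bonferroni_result)
print("Holm:      ", holm_result)
```

With these inputs Bonferroni rejects only the first hypothesis, while Holm rejects both, since after the first rejection the remaining hypothesis is tested at the full alpha.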

2.7. Evaluating bias: the Likely Responder dry run diagnostic

Just as for propensity scores, the properties of the estimated prognostic scores should be examined to appraise the degree to which balancing goals [26] are met. Perhaps more worrisome is the possibility of confounding bias, which is particularly problematic if there are many influential covariates, some of which may be unknown, and the outcome variable, as here, is used to identify subtypes. The consequence may be biased estimation of effect size, leading to inflated type I error in testing treatment differences. Hansen [27] considered the problem of model misspecification, particularly in epidemiological contexts where prognostic score models are also called Disease Risk Scores (DRS) [28, 29]. DRS models are usually fit on the control group, extrapolated to the risk group and used to adjust for confounding. Hansen [27] proposed an approach that he called the dry run method, designed to evaluate whether the prognostic score model controls confounding. Wyss et al [30] evaluated the method through simulations and an empirical example. We have adapted the idea to the LR analysis in the setting of a RCT, where treatment assignment probabilities, the analogue of exposure probabilities, are known.

In applying the LR-DRD tool, data from individuals assigned to R are set aside and the remaining data from T are conceptualized as comprising the full sample. The idea is to randomly split this sample into two sets, G_T and G_R, which, paralleling Hansen’s nomenclature, are called, respectively, the “pseudo-treatment group” and the “pseudo-reference group” of a pseudo RCT. In the pseudo RCT, the null hypothesis of no treatment difference given x is true in every matched set for every choice of the minCond. Except by random chance, a LR analysis should not find a proper subset of G_T and G_R in which the treatments differ. Hansen’s [20] demonstration that for any measurable set M ⊆ Ω, E[Y_u|T, x_u] is sufficient for Y_u for u ∈ M, and that conditional on E[Y|T, x] the distributions of covariates in the treatment and reference groups are balanced, applies to the pseudo RCT just as it does in the full sample. The specific planned LR statistical methods to be used to analyze the full trial data are applied to the pseudo-treatment group following the prespecified steps: the method of estimation of the prognostic scores, the prespecified value of the minCond, the matching rule and the pooled test statistic. The results include an estimate of the effect size, the value of the pooled test statistic and the corresponding p-value. The random splitting procedure is repeated at least 1,000 times. The average of the estimated effect size should be close to zero and the distribution of the randomization p-values, which are probability integral transforms of the test statistics under the null, should be uniform over the unit interval. The transform used to obtain the p-value is the cumulative distribution of its assumed density under the null. If the p-values are not uniform, one or more of the underlying assumptions are incompatible with the data. For example, suppose the test statistic is assumed to be distributed as a Student’s t under the null, but in truth the mean is not zero. That is, the alternative is true, so the true distribution of the test statistic is non-central t. Obtaining p-values assuming a t distribution from statistics that follow a non-central t is not a probability integral transform and therefore will not, in general, result in a unit uniform distribution. There are two possible explanations for a finding that the distribution is not uniform: the null is false or the model is misspecified. Since data from only one arm are being considered, the null is not false, so it is likely that the model is problematic.

It should be noted that failure to reject uniformity of the p-values does not guarantee that the model is correctly specified. However, if the sample size is reasonable, the confidence interval for the effect size is relatively narrow and covers zero, and a unit uniform distribution is not rejected by an appropriate goodness of fit test, then estimates of treatment effects are probably unconfounded in the G_T and G_R subsamples.
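The dry run procedure can be sketched as follows, using only data from the T arm. Each replicate splits the arm at random into a pseudo-treatment and a pseudo-reference group, repeats a simplified LR analysis, and records the pseudo-effect and its p-value; uniformity of the p-values is then summarized by a Pearson chi-square statistic over equal-width bins. The cell-mean score, the normal-approximation test for two proportions, the number of bins and the simulated data are all assumptions made for illustration. Because this toy score is fit and evaluated on the same pseudo-treatment outcomes, it may itself produce a positive pseudo-effect, which is the kind of problem the diagnostic is meant to flag.

```python
import math
import random
from statistics import mean

def two_sided_p(y_a, y_b):
    """Normal-approximation p-value for a difference in two proportions."""
    n_a, n_b = len(y_a), len(y_b)
    pooled = (sum(y_a) + sum(y_b)) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (mean(y_a) - mean(y_b)) / se
    return math.erfc(abs(z) / math.sqrt(2))

def dry_run(x, y, min_cond, n_splits=1000, seed=0):
    """Repeat a simplified LR analysis on random splits of the T arm."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    half = len(idx) // 2
    effects, p_values = [], []
    for _ in range(n_splits):
        rng.shuffle(idx)
        g_t, g_r = idx[:half], idx[half:]          # pseudo-treatment, pseudo-reference
        cells = {}
        for i in g_t:                              # score is fit on the pseudo-T group only
            cells.setdefault(x[i], []).append(y[i])
        score = {k: mean(v) for k, v in cells.items()}
        overall = mean([y[i] for i in g_t])
        y_t = [y[i] for i in g_t if score[x[i]] >= min_cond]
        y_r = [y[i] for i in g_r if score.get(x[i], overall) >= min_cond]
        if not y_t or not y_r:
            continue                               # no pseudo-LRs in one group
        effects.append(mean(y_t) - mean(y_r))
        p_values.append(two_sided_p(y_t, y_r))
    return effects, p_values

def uniformity_chi_square(p_values, n_bins=10):
    """Pearson chi-square statistic and df for uniformity of p-values on [0, 1]."""
    if not p_values:
        raise ValueError("no p-values to test")
    observed = [0] * n_bins
    for p in p_values:
        observed[min(int(p * n_bins), n_bins - 1)] += 1
    expected = len(p_values) / n_bins
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, n_bins - 1

# Simulated T-arm data with outcomes unrelated to x (illustrative only).
rng = random.Random(2)
x = [rng.choice("abcd") for _ in range(100)]
y = [1 if rng.random() < 0.5 else 0 for _ in range(100)]

effects, p_values = dry_run(x, y, min_cond=0.5, n_splits=500)
stat, df = uniformity_chi_square(p_values)
print(f"mean pseudo-effect = {mean(effects):.3f}, chi-square = {stat:.1f} on {df} df")
```

A mean pseudo-effect far from zero, or a chi-square statistic that is large relative to its degrees of freedom, would suggest that the planned analysis is biased before the R arm is ever examined.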

If the distribution of p-values is not uniform and/or the estimated bias is considerable, an argument can be made that it is legitimate to search for a better model for E[Y|T, x] in the data set of subjects assigned to T, provided that the LR-DRD was performed with the rest of the data remaining hidden to the analyst.

3. Illustrative example: a LR analysis of a RCT for treating opioid use disorder

In Appendix 1 in the Supplementary material we present an example of an RCT, the X:BOT study [10], which compared T = XR-NTX to R = BUP-NX with n = 474. The investigators designated “opioid relapse-free” over the entire course of 24 weeks of outpatient treatment as the primary binary outcome measure. When analyzed by classical methods, no between-treatment differences were found. We used a random forest [31] with 45 predictors to identify LRs, with the minCond set to 0.5. A chi-square goodness of fit test (p = 0.437) did not reject the hypothesis that the model fits. The AUC of the ROC was 0.722 and the Brier score [32] was 0.215. Figures 1A and 1B show the within-quantile mean response rates in the whole sample and the LR subset. The details of the causal analysis are summarized in Table 1. The analysis found no difference in the whole sample but identified a subgroup of LRs in whom outcomes on XR-NTX were causally superior to those on the control agent, BUP-NX. However, the LR-DRD analysis raises some concerns. Histograms of the p-values from 5,000 random splits of the XR-NTX sample are shown in Figure 2 for both Ω and Φ, together with Q-Q plots of their empirical distributions against a unit uniform distribution. For Ω, the Pearson chi-square statistic for testing whether the empirical distribution of p-values is unit uniform is 75.25 with 60 df (p = 0.089). The chi-square statistic for Φ is 279.83 with 60 df (p < 0.001). The hypothesis that the empirical distribution of p-values is unit uniform for LRs is therefore not well supported. However, visual examination of the Q-Q plot in Figure 2 suggests that the distribution is nearly uniform, except for small values of p. This small discrepancy suggests that the results cannot be disregarded, nor can they be fully endorsed until validated with data from new samples.
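The reported uniformity test results can be sanity-checked from the chi-square statistics and degrees of freedom alone. The sketch below uses the Wilson-Hilferty cube-root normal approximation to the chi-square upper tail, chosen here only because it needs no external library; it is an approximation, not the computation used by the authors. The value it gives for Ω should land near the reported p = 0.089.

```python
import math

def chi2_upper_tail(x, df):
    """Approximate P[chi-square(df) > x] by the Wilson-Hilferty normal approximation."""
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * df))) / math.sqrt(2.0 / (9.0 * df))
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Statistics and degrees of freedom as reported in the text.
p_omega = chi2_upper_tail(75.25, 60)    # whole sample
p_phi = chi2_upper_tail(279.83, 60)     # likely responders

print(f"Omega: p is approximately {p_omega:.3f}")
print(f"Phi:   p < 0.001 is {p_phi < 0.001}")
```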

Figure 1. Mean response rates in each quantile in the full sample and in the LR sample.

The observed response rates of XR-NTX and BUP-NX are plotted at the within-quantile mean of the model-estimated P[Y = 1|XR-NTX, x] for each of the quantiles (5 for the full sample on the top and 2 for the LR group on the bottom), based on the sample covariates. The red dots represent the observed response rate of XR-NTX, and the black dots represent the observed response rate of BUP-NX. A red dot on the 45-degree line corresponds to a perfect within-quantile fit on average for XR-NTX. The values in the boxes give the within-quantile ranges of the estimated probabilities.

Table 1. Full sample and LR potential outcome analysis of ATEs

| Quantile | w | Sample size | Ê[Y∣w, x] | Range of Ê[Y∣w, x]¹ | Observed response rate on the assigned treatment | Difference in proportion (SD) | Randomization p-value² |
|---|---|---|---|---|---|---|---|
| Ω: the whole sample | | | | | | | |
| 1 | XR-NTX | 44 | 0.19 | 0.08, 0.26 | 0.32 | −0.07 (0.10) | 0.528 |
| | BUP-NX | 51 | 0.18 | 0.05, 0.26 | 0.39 | | |
| 2 | XR-NTX | 44 | 0.34 | 0.26, 0.39 | 0.21 | −0.12 (0.09) | 0.193 |
| | BUP-NX | 51 | 0.33 | 0.27, 0.40 | 0.33 | | |
| 3 | XR-NTX | 41 | 0.46 | 0.41, 0.53 | 0.56 | −0.04 (0.10) | 0.747 |
| | BUP-NX | 52 | 0.47 | 0.40, 0.54 | 0.60 | | |
| 4 | XR-NTX | 40 | 0.60 | 0.54, 0.69 | 0.65 | 0.19 (0.10) | 0.098 |
| | BUP-NX | 56 | 0.61 | 0.54, 0.69 | 0.46 | | |
| 5 | XR-NTX | 35 | 0.79 | 0.70, 0.93 | 0.74 | 0.31 (0.10) | 0.009 |
| | BUP-NX | 60 | 0.80 | 0.70, 0.96 | 0.43 | | |
| Pooled 1–5 | XR-NTX | 204 | 0.46 | 0.08, 0.93 | 0.48 | 0.05 (0.04) | 0.316³ |
| | BUP-NX | 270 | 0.49 | 0.05, 0.96 | 0.44 | | |
| Φ: the likely responders | | | | | | | |
| 1 | XR-NTX | 41 | 0.58 | 0.52, 0.67 | 0.63 | 0.17 (0.10) | 0.110 |
| | BUP-NX | 61 | 0.59 | 0.50, 0.67 | 0.46 | | |
| 2 | XR-NTX | 38 | 0.78 | 0.68, 0.93 | 0.76 | 0.32 (0.09) | 0.005 |
| | BUP-NX | 64 | 0.79 | 0.67, 0.96 | 0.44 | | |
| Pooled 1–2 | XR-NTX | 79 | 0.68 | 0.52, 0.93 | 0.70 | 0.25 (0.07) | <0.001³ |
| | BUP-NX | 125 | 0.69 | 0.52, 0.96 | 0.45 | | |

The results of the total sample and LR analysis of the within-quantile response rates for individuals who have had successful induction.

¹ The within-quantile range of the estimated expected response to treatment with T, based on a random forest prognostic score model, scored for all study participants.

² The within-quantile p-values are from a binomial test randomized over 5000 replications.

³ The p-values in rows labeled “Pooled” are the result of the formal causal inference tests. They are obtained from the pooled quantile binomial score statistics and are determined from the 5000 permutations. The unadjusted p-values shown in the quantile rows are not used in testing. They are intended to give an indication of the within-quantile effect size.

Bolded p-values indicate substantial evidence of a treatment effect that differs from zero.
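As a small worked check of how the pooled rows of Table 1 relate to the quantile rows, the snippet below averages the reported within-quantile differences in proportions with equal weights. Equal weighting is an assumption about how the pooling was done; the inputs are the rounded values printed in the table, so agreement with the pooled entries (0.05 and 0.25) can only be expected up to rounding.

```python
from statistics import mean

# Within-quantile differences in proportions (XR-NTX minus BUP-NX) from Table 1.
diff_omega = [-0.07, -0.12, -0.04, 0.19, 0.31]   # whole sample, quantiles 1 to 5
diff_phi = [0.17, 0.32]                          # likely responders, quantiles 1 to 2

pooled_omega = mean(diff_omega)
pooled_phi = mean(diff_phi)
print(round(pooled_omega, 3), round(pooled_phi, 3))
```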

Figure 2. The X:BOT Study LR-Dry Run Diagnostic.

The top two panels are histograms of the t-test p-values from permuting assignments into pseudo groups in the sample randomly assigned to T.

The bottom two panels are Q-Q plots of the p-values against the uniform distribution.

4. Discussion

The value of the LR approach may also be considered in the context of experimental design. According to the FDA, an enriched sample is comprised of individuals more likely to respond to T than the remainder of the population being studied [25]. The FDA concluded that the chance of a type 2 error, not finding a treatment difference when there is one, may be lessened in enriched samples. LRs in a RCT are an enriched sample and the X:BOT opioid use disorder RCT described in Section 3 is an example. Early contributions include Anscombe [33], Colton [34] and Cornfield, Halperin and Greenhouse [35]. Simon and Simon and their colleagues [37–41] described a general class of enrichment designs that sequentially restrict entry to the trial in an adaptive manner. Wang, O’Neill and Hung [42,43], Jenkins, Stone and Jennison [44] and Magnusson and Turnbull [45] described adaptive enrichment designs.

There is much research on identification of optimal treatment regimes. These are functions that map baseline features to a best treatment choice among candidate therapies. Some consider a single point in time [46,47] and some a sequence of decisions, called dynamic treatment regimes [48–50]. Approaches to identifying individual causal effects have been explored by many authors [51–63]. However, while the best treatment may provide a small incremental benefit over the competitor, the level of effect may not be greater than the minCond. Our approach provides the individual’s expected effect of T and the size of the treatment difference among LRs as input, so that a broad, clinically informed treatment choice can be made.

A clinician making use of the results of the RCT may choose different values of the minCond for different patients based on particulars such as the patient’s clinical characteristics and preferences, potential side effects, likely compliance, as well as the cost and properties of other candidate treatments not in the trial. The minCond demand may be low for a severely ill patient who has failed to respond to other treatments and high for a patient doing well on current therapy. To facilitate its broad use, a graphical device that allows automated entry of x could display the patient’s prognostic score E[Y|T, x] and, for a continuum of minCond values, the corresponding causal effect size and p-value. Conceptually, the prescriber decides on a threshold appropriate to the situation and asks whether the data support causality at that level. Armed with this information, a clinician may or may not choose to prescribe T, but the decision will be quantitatively informed. The clinician will not be dissuaded from using the medication merely because the average treatment effect in the entire population did not differ from zero.

Our LR approach to the analysis of a RCT is made possible for two reasons. The first is the concept of potential outcomes, which enables causal analyses of estimands in “proper subsets.” The second is prognostic balancing scores, which make it possible to fulfill the covariate distribution balance requirement through matched samples. Arbogast and Ray [29] and Wyss et al [64] reviewed Disease Risk/prognostic scores and their application in pharmacoepidemiology, and Nguyen and Debray [65] showed how to use prognostic scores for causal inference with general, as opposed to two-treatment, regimes.

Convincing proof of efficacy based on a LR approach requires strong evidence that the estimator of E[Y|T, x] is reasonably well calibrated, that there are some regions in the patient feature space that have a high likelihood of treatment success, preferably supported by biological considerations, and that in at least some of those regions, observed positive effects are causally related to treatment. We cannot emphasize enough the value of replication.

Acknowledgments

The authors would like to thank Drs. Charles Marmar, Morris Meisner, Joseph Wanderling and Paul Leber for the many stimulating discussions concerning the conduct, analysis and interpretation of RCTs. We are grateful to Drs. John Rotrosen and Edward Nunes for sharing their considerable insights into the X:BOT study in which they played important roles.

Funding

This work was supported by the National Institute on Alcohol Abuse and Alcoholism [grant number PO1AA027057-01 PI: Charles Marmar].

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Declaration of competing interest

The authors declare that there is no conflict of interest.

Supplementary materials

The reader is referred to the on-line Supplementary Materials for more detail on the analysis of the X:BOT study [10]. The study and its data are described by Lee et al [10], and full documentation, including the raw data, may be downloaded from the website https://datashare.nida.nih.gov/data. Also provided are the annotated R programs used for the statistical analysis of the X:BOT trial.

¹ RCT: randomized controlled trial

² LR: likely responder

³ PM: precision medicine

⁴ minCond: minimum condition

⁵ LR-DRD: Likely Responder Dry Run Diagnostic

Data availability statement

Full documentation, including the raw data for the opioid use disorder study, that support the findings of this study may be downloaded from the website https://datashare.nida.nih.gov/data.

References

  • 1. National Research Council. Toward precision medicine: building a knowledge network for biomedical research and a new taxonomy of disease. Washington: National Academies Press, 2011.
  • 2. Abrams J, Conley B, Mooney M, et al. National Cancer Institute’s precision medicine initiatives for the new national clinical trials network. Am Soc Clin Oncol Educ Book 2014; 71–76.
  • 3. Ruberg SJ, Chen L and Wang Y. The mean does not mean as much anymore: finding subgroups for tailored therapeutics. Clin Trials 2010; 7: 574–583.
  • 4. Neyman JS. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. (translated and edited by DM Dabrowska and TP Speed, Statistical Science (1990), 5, 465–480). Ann Agric Sci 1923; 10: 1–51.
  • 5. Fisher RA. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd, 1925.
  • 6. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Edu Psychol 1974; 66: 688–701.
  • 7. Imbens GW. Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat 2004; 86: 4–29.
  • 8. Imbens GW and Rubin DB. Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge: Cambridge University Press, 2015.
  • 9. Rubin DB. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association 2005; 100(469): 322–331.
  • 10. Lee JD, Nunes EV Jr, Novo P, et al. Comparative effectiveness of extended-release naltrexone versus buprenorphine-naloxone for opioid relapse prevention (X:BOT): a multicentre, open-label, randomised controlled trial. The Lancet 2018; 391(10118): 309–318.
  • 11. Gadbury GL, Iyer HK and Albert JM. Individual treatment effects in randomized trials with binary outcomes. J Stat Plan Inference 2004; 121: 163–174.
  • 12. Lamont A, Lyons MD, Jaki T, et al. Identification of predicted individual treatment effects in randomized clinical trials. Stat Methods Med Res 2018; 27: 142–157.
  • 13. Ballarini NM, Rosenkranz GK, Jaki T, et al. Subgroup identification in clinical trials via the predicted individual treatment effect. PLoS One 2018; 13: e0205971.
  • 14. European Medicines Agency. ICH E9 (R1) addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials. 2017.
  • 15. Permutt T. A taxonomy of estimands for regulatory clinical trials with discontinuations. Statistics in medicine 2016; 35(17): 2865–2875.
  • 16. Permutt T. Defining treatment effects: A regulatory perspective. Clinical Trials 2019: 1740774519830358.
  • 17. National Research Council. The prevention and treatment of missing data in clinical trials. National Academies Press. 2010.
  • 18. Keene ON, Wright D, Phillips A, Wright M. Why ITT analysis is not always the answer for estimating treatment effects in clinical trials. Contemporary Clinical Trials 2021; 108: 106494.
  • 19. Frangakis CE, Rubin DB. Principal stratification in causal inference. Biometrics 2002; 58(1): 21–29.
  • 20. Hansen BB. The prognostic analogue of the propensity score. Biometrika 2008; 95(2): 481–488.
  • 21. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70(1): 41–55.
  • 22. Stuart EA. Matching methods for causal inference: A review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics 2010; 25(1): 1.
  • 23. Fisher RA. “The Coefficient of Racial Likeness” and the Future of Craniometry. The Journal of the Royal Anthropological Institute of Great Britain and Ireland 1936; 66: 57–63.
  • 24. Lehmann EL, Romano JP. Testing statistical hypotheses. Springer Science & Business Media. 2006.
  • 25. The Food and Drug Administration. Enrichment Strategies for Clinical Trials to Support Approval of Human Drugs and Biological Products. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/enrichment-strategies-clinical-trials-support-approval-human-drugs-and-biological-products/; 2019. [Online; accessed 01-March-2019].
  • 26. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Statistics in medicine 2009; 28(25): 3083–3107.
  • 27. Hansen BB. Bias reduction in observational studies via prognosis scores. Technical Report 441, University of Michigan, Statistics Department; 2006.
  • 28. Miettinen OS. Stratification by a multivariate confounder score. American journal of epidemiology 1976; 104(6): 609–620.
  • 29. Arbogast PG, Ray WA. Use of disease risk scores in pharmacoepidemiologic studies. Statistical methods in medical research 2009; 18(1): 67–80.
  • 30. Wyss R, Hansen BB, Ellis AR, et al. The “dry-run” analysis: a method for evaluating risk scores for confounding control. American journal of epidemiology 2017; 185(9): 842–852.
  • 31. Breiman L. Random forests. Machine learning 2001; 45(1): 5–32.
  • 32. Brier GW. Verification of forecasts expressed in terms of probability. Monthly weather review 1950; 78(1): 1–3.
  • 33. Anscombe FJ. Sequential medical trials. Journal of the American Statistical Association 1963; 58: 365–384.
  • 34. Colton T. A model for selecting one of two medical treatments. Journal of the American Statistical Association 1963; 58: 388–401.
  • 35. Cornfield J, Halperin M, Greenhouse SW. An adaptive procedure for sequential clinical trials. Journal of the American Statistical Association 1969; 64(327): 759–770.
  • 36. Simon RM. Personalized Cancer Genomics. Annual Review of Statistics and Its Application 2018; 5: 169–182.
  • 37. Simon R, Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 2004; 10(20): 6759–6763.
  • 38. Simon N, Simon R. Adaptive enrichment designs for clinical trials. Biostatistics 2013; 14(4): 613–625.
  • 39. Freidlin B, Simon R. Adaptive signature design: an adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients. Clinical cancer research 2005; 11(21): 7872–7878.
  • 40. Freidlin B, Jiang W, Simon R. The cross-validated adaptive signature design. Clinical Cancer Research 2010; 16(2): 691–698.
  • 41. Karuri SW, Simon R. A two-stage Bayesian design for co-development of new drugs and companion diagnostics. Statistics in medicine 2012; 31(10): 901–914.
  • 42. Wang SJ, O’Neill RT, Hung HJ. Approaches to evaluation of treatment effect in randomized clinical trials with genomic subset. Pharmaceutical Statistics: The Journal of Applied Statistics in the Pharmaceutical Industry 2007; 6(3): 227–244.
  • 43. Wang SJ, James Hung H, O’Neill RT. Adaptive patient enrichment designs in therapeutic trials. Biometrical Journal: Journal of Mathematical Methods in Biosciences 2009; 51(2): 358–374.
  • 44. Jenkins M, Stone A, Jennison C. An adaptive seamless phase II/III design for oncology trials with subpopulation selection using correlated survival endpoints. Pharmaceutical statistics 2011; 10(4): 347–356.
  • 45. Magnusson BP, Turnbull BW. Group sequential enrichment design incorporating subgroup selection. Statistics in medicine 2013; 32(16): 2695–2714.
  • 46. Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Annals of statistics 2011; 39: 1180.
  • 47. Zhang B, Tsiatis AA, Davidian M, Zhang M, and Laber E. Estimating optimal treatment regimens from a classification perspective. Stat 2012; 1(1): 103–114.
  • 48. Murphy S. Optimal dynamic treatment regimes (with discussion). Journal of the Royal Statistical Society, Series B 2003; 65: 331–336.
  • 49. Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical Methods in Medical Research 2009; 19: 317–343.
  • 50. Laber E, Lizotte D, Qian M, Pelham W, Murphy S. Dynamic treatment regimes: Technical challenges and applications. Electronic Journal of Statistics 2014; 8: 1225–1272.
  • 51. Basu A. Estimating person-centered treatment (PeT) effects using instrumental variables: an application to evaluating prostate cancer treatments. J Appl Econ 2014; 29: 671–691.
  • 52. Foster JC, Taylor JM and Ruberg SJ. Subgroup identification from randomized clinical trial data. Stat Med 2011; 30: 2867–2880.
  • 53. Doove LL, Dusseldorp E, Deun K, et al. A comparison of five recursive partitioning methods to find person subgroups involved in meaningful treatment–subgroup interactions. Adv Data Anal Classif 2013; 1–23.
  • 54. Freidlin B, McShane LM, Polley MY, et al. Randomized phase II trial designs with biomarkers. J Clin Oncol 2012; 30: 3304–3309.
  • 55. Imai K and Strauss A. Estimation of heterogeneous treatment effects from randomized experiments, with application to the optimal planning of the Get-Out-the-Vote campaign. Polit Anal 2011; 19: 1–19.
  • 56. Zhang Z, Wang C, Nie L, et al. Assessing the heterogeneity of treatment effects via potential outcomes of individual patients. J R Stat Soc C 2013; 62: 687–70
  • 57. Shen C, Jeong J, Li X, et al. Treatment benefit and treatment harm rate to characterize heterogeneity in treatment effect. Biometrics 2013; 69: 724–731.
  • 58. Poulson RS, Gadbury GL and Allison DB. Treatment heterogeneity and individual qualitative interaction. Am Stat 2012; 66: 16–24.
  • 59. Cai T, Tian L, Wong PH, et al. Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics 2011; 12: 270–282.
  • 60. Ruberg SJ, Chen L and Wang Y. The mean does not mean as much anymore: finding subgroups for tailored therapeutics. Clin Trials 2010; 7: 574–583.
  • 61. Gadbury GL, Iyer HK and Albert JM. Individual treatment effects in randomized trials with binary outcomes. J Stat Plan Inference 2004; 121: 163.
  • 62. Lipkovich I and Dmitrienko A. Strategies for identifying predictive biomarkers and subgroups with enhanced treatment effect in clinical trials using SIDES. J Biopharm Stat 2014; 24: 130–153.
  • 63. Zhang Z, Qu Y, Zhang B, et al. Use of auxiliary covariates in estimating a biomarker-adjusted treatment effect model with clinical trial data. Stat Meth Med Res 2016; 25: 2103–2119.
  • 64. Wyss R, Glynn RJ, Gagne JJ. A review of disease risk scores and their application in pharmacoepidemiology. Current Epidemiology Reports 2016; 3(4): 277–284.
  • 65. Nguyen TL, Debray TP. The use of prognostic scores for causal inference with general treatment regimes. Statistics in medicine 2019; 38(11): 2013–2029.
