Abstract
Objective:
A trial comparing extended-release naltrexone and sublingual buprenorphine-naloxone demonstrated higher relapse rates in individuals randomized to extended-release naltrexone. The effectiveness of treatment might vary based on patient characteristics. We hypothesized that causal machine learning would identify individualized treatment effects for each medication.
Methods:
We performed a secondary analysis of a multicenter randomized trial that compared the effectiveness of extended-release naltrexone versus buprenorphine-naloxone for preventing relapse of opioid misuse. Three machine learning models were derived using all trial participants, with 50% randomly selected for training (n=285) and the remaining 50% for validation. Discrimination of the predicted individualized treatment effect was measured by the Qini value and c-for-benefit, with the absence of relapse denoting treatment success. Patients were grouped into quartiles by predicted individualized treatment effect to examine differences in characteristics and the observed treatment effects.
Results:
The best-performing model had a Qini value of 4.45 (95% CI: 1.02–7.83) and a c-for-benefit of 0.63 (95% CI: 0.53–0.68). The quartile most likely to benefit from buprenorphine-naloxone had a 35% absolute benefit from this treatment; at study entry, these participants had a high median opioid withdrawal score (p<0.001), used cocaine on more days over the prior 30 days than other quartiles (p<0.001), and had the highest proportions with alcohol and cocaine use disorder (p≤0.02). Quartile 4 individuals were predicted to be most likely to benefit from extended-release naltrexone, with the greatest proportion having heroin drug preference (p=0.02) and all experiencing homelessness (p<0.001).
Conclusions:
Causal machine learning identified differing individualized treatment effects between medications based on characteristics associated with preventing relapse.
Keywords: Opioid use disorder, opioid treatment, heterogeneity of treatment effect, buprenorphine, naltrexone
INTRODUCTION:
The United States continues to face an unabated opioid overdose crisis, with rising rates of fatal opioid overdoses highlighting the urgent need for effective and safe treatment strategies. Three medications for opioid use disorder (MOUD) reduce overdose events in patients with an opioid use disorder (OUD): methadone, buprenorphine, and extended-release injectable naltrexone.1, 2 However, numerous barriers, including inflexible dosing regulations and the need for frequent clinical visits, impede individuals with an OUD from accessing MOUD.3 MOUD remains underutilized,4 and more guidance is needed to help providers and patients choose the medication most likely to improve outcomes for the individual patient.5
The Extended-Release Naltrexone Versus Buprenorphine-Naloxone for Opioid Relapse Prevention (X:BOT) trial was the largest clinical trial to examine extended-release naltrexone (XR-NTX) versus buprenorphine-naloxone (BUP-NX) for preventing relapse.6 XR-NTX was associated with a higher rate of relapse compared to BUP-NX (65% vs 57%), largely due to induction failure; however, the per protocol analysis in participants who completed withdrawal management showed both treatments were safe and effective. To better inform individual patient care, clinical trial analyses could move beyond average treatment effects to derive and validate estimates of treatment effects for individual patients to delineate which patients would benefit from XR-NTX or BUP-NX.7
Identifying the most suitable treatment for individuals is becoming more precise with advances in machine learning. Individualized treatment effect estimates from machine learning models can predict how different treatments might work for a single patient based on their unique characteristics. Traditional methods examine interactions (i.e., effect modifiers) between treatment and individual variables one at a time. Traditional methods applied to X:BOT data identified homelessness as the leading effect modifier favoring XR-NTX over BUP-NX8; however, the methods employed were unable to examine multiple interactions simultaneously. With machine learning, it is possible to examine non-linear relationships among all potential effect modifiers, handle larger numbers of modifiers, and limit overfitting through regularization and internal validation.
We conducted a secondary analysis of the X:BOT trial that employed machine learning methods to examine the individualized treatment effect (ITE) of MOUD used to prevent relapse. We hypothesized that machine learning techniques tailored for personalized medicine would identify individual patients who benefit from XR-NTX or BUP-NX treatment. The prediction models were based on a comprehensive analysis of individual-level characteristics available before randomization (baseline). The approach aims to optimize treatment decisions by tailoring them to each patient’s unique profile to maximize therapeutic effectiveness and improve clinical outcomes.
METHODS:
Study Design
The X:BOT trial was an open-label, randomized study that compared the effectiveness of XR-NTX versus BUP-NX in preventing relapse over six months in 570 English-speaking adults with OUD. Exclusion criteria included serious medical or psychiatric conditions. The participants were enrolled at eight community-based inpatient treatment sites. Sites were selected from the nodes of the National Institute on Drug Abuse (NIDA) Clinical Trials Network, with efforts to ensure geographic and ethnic diversity. Participating sites were located in New York, Ohio, Washington, Florida, Massachusetts, California, New Mexico, and Maryland. BUP-NX administration began once withdrawal symptoms appeared. XR-NTX was injected intramuscularly after a medically managed withdrawal and a naloxone challenge were completed and a negative opioid urine sample was provided. The primary measure was the successful induction of outpatient treatment with opioid-relapse-free survival for 24 weeks. Either induction failure or relapse after induction was considered a treatment failure. Further study details have been previously reported.9 For the causal machine learning analysis, the outcome of relapse during the study period was dichotomized, and completing induction with the absence of relapse denoted treatment success.
Predictor variables for causal machine learning
Our analysis applied causal machine learning models to predict ITE using a reduced variable list and a full variable list. In the reduced variable dataset, which was the primary analysis, we followed the variable selection by Nunes et al., starting with 51 variables with potential prognostic significance based on literature or clinical experience.8 The variables were sourced from the PhenXToolkit10 and included demographics, education, severity and characteristics of opioid use, criminal justice involvement, housing, family and social history, MOUD preference, past withdrawal experience, timeline follow-back for substance use and treatment11, medical comorbidities, psychiatric history, and several validated survey tools on depression12, quality of life13, and the Subjective Opioid Withdrawal Scale.14
In the secondary analysis, a full variable list was used with additional baseline variables that were collected but not examined in the Nunes et al. study. They were sourced from the X:BOT clinical trial’s public data repository maintained by the National Institute on Drug Abuse Clinical Trials Network. The additional variables included baseline laboratory tests, a risk assessment battery, vital signs, and supplementary surveys on substance use and mental health, which resulted in 610 potential variables.
Variables from either the reduced or full list were excluded if they met any of the following criteria: (1) greater than 35% missing data; (2) a pairwise correlation greater than 0.7 with another variable; or (3) low frequency (near-constant values). If the pairwise Pearson’s correlation coefficient between two variables was >0.7, the variable with the larger mean absolute correlation was eliminated from the pair.15 Variables that were constant across more than 95% of observations were removed.17 The final full variable list was filtered to 392 variables, and the reduced variable list was filtered to 42 variables (Supplemental 1). The full list of variables removed from the analysis, including the variables with high missingness, is shown in Supplemental 2.
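The three exclusion rules can be sketched as follows. This is a minimal illustration in standard-library Python, not the authors' R code: the function names, the use of `None` for missing values, and the tie-breaking when two variables have equal mean absolute correlation are all assumptions made for the example.

```python
import math

def missing_fraction(col):
    """Share of observations that are missing (None)."""
    return sum(v is None for v in col) / len(col)

def dominant_fraction(col):
    """Share of non-missing observations taken by the most common value."""
    values = [v for v in col if v is not None]
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return max(counts.values()) / len(values)

def abs_pearson(x, y):
    """Absolute Pearson correlation over rows where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)

def filter_variables(data, max_missing=0.35, max_corr=0.7, max_constant=0.95):
    """Return the names of variables that survive the three exclusion rules."""
    # Rules 1 and 3: drop high-missingness and near-constant variables.
    keep = [name for name, col in data.items()
            if missing_fraction(col) <= max_missing
            and dominant_fraction(col) <= max_constant]
    # Rule 2: within each highly correlated pair, drop the variable with the
    # larger mean absolute correlation against the other kept variables.
    corr = {(a, b): abs_pearson(data[a], data[b])
            for a in keep for b in keep if a != b}
    mean_abs = {a: sum(corr[(a, b)] for b in keep if b != a) / max(len(keep) - 1, 1)
                for a in keep}
    dropped = set()
    for (a, b), r in sorted(corr.items(), key=lambda kv: -kv[1]):
        if r > max_corr and a not in dropped and b not in dropped:
            dropped.add(a if mean_abs[a] >= mean_abs[b] else b)
    return [name for name in keep if name not in dropped]
```

Here `data` is assumed to map each variable name to a list of numeric values. Pairs are visited from the strongest correlation downward, so a variable already removed cannot trigger a second removal.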
Causal Machine Learning Models to Estimate Individualized Treatment Effect (ITE)
To estimate the ITEs, machine learning approaches were used that predicted the potential outcomes for each participant under both treatment conditions. The difference in predicted outcomes between the two treatments for each participant was computed to represent the estimated ITE (i.e., conditional average treatment effect (CATE)). The dataset was randomly split into 50% (n=285) training data and 50% (n=285) testing data with the same prevalence of outcomes across both datasets. Five-fold cross-validation in the training data was used to select the causal machine learning models with the highest discrimination from a list of six models (Supplemental 3). The following models had the highest performance and were selected for evaluation on the testing data: (1) modified covariate transformation with Elastic Net (Tian-EN)16; (2) X-learner with Bayesian Additive Regression Trees (X-BART)17; and (3) Uplift Random Forest (Uplift-RF).18 Tian-EN estimates the ITE by transforming the predictor variables, multiplying each predictor by the treatment assignment vector18. After the covariate transformation, a regularized logistic regression is fitted. Uplift-RF is a series of decision trees, each built on a bootstrapped sample of data, that predicts the difference in potential outcomes under the two treatment options to estimate the ITE. X-BART applies the X-learner meta-algorithm with Bayesian Additive Regression Trees as the base model. Additional details for each model are in Supplemental 3.
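The modified covariate idea behind Tian-EN can be illustrated with a short sketch. It assumes treatment is coded +1 for one arm and -1 for the other, that each covariate row is multiplied by half of that code, and that a coefficient vector (called `gamma` here, a hypothetical name) has already been obtained from a penalized logistic regression on the transformed covariates. The elastic net fit itself is omitted, and which medication is coded as "treated" is an assumption of the example.

```python
import math

def modified_covariates(X, treated):
    """Multiply each covariate row by t/2, with t = +1 (treated) or -1 (comparator)."""
    return [[value * (0.5 if t else -0.5) for value in row]
            for row, t in zip(X, treated)]

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def predicted_ite(X, gamma):
    """Difference in predicted success probability: treated minus comparator.

    Under the working model P(success) = sigmoid(gamma . w), a patient with
    covariates x has w = +x/2 if treated and w = -x/2 otherwise.
    """
    ites = []
    for row in X:
        u = sum(g * value for g, value in zip(gamma, row)) / 2.0
        ites.append(sigmoid(u) - sigmoid(-u))
    return ites
```

Because only treatment-by-covariate products enter the regression, coefficients shrunk to zero by the penalty correspond to variables that the model does not use as effect modifiers.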
For Tian-EN and Uplift-RF, five-fold cross-validation was used for hyperparameter tuning in a grid search approach. Hyperparameters used in the X-BART model were fixed a priori by the package authors as the best-performing parameter set from 10,000 combinations.19 Additional details on the grid search of hyperparameters are provided in Supplemental 4. Each model was trained and tested using five random seeds, and the mean ITE prediction across all seeds in the test set was used as an individual’s predicted ITE.
Analysis Plan
Baseline characteristics were compared using nonparametric tests for categorical and integer variables. The primary outcome was defined as the completion of induction and absence of relapse in the first six months. The Qini value, derived from the Qini curve20, was the primary discrimination metric used to demonstrate the benefit of XR-NTX versus BUP-NX. The Qini value represents the area between the curve generated by ranking the population by the model’s predicted individualized treatment effect and the line denoting random assignment. A greater value signifies better discrimination, analogous to the area under the receiver operating characteristic curve, which measures discrimination between those predicted to have the event and those predicted not to have it.
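One common way to compute a Qini curve and the area between it and the random-assignment line is sketched below. The scaling of the control arm and the normalization of the area vary between implementations, so this sketch is not expected to reproduce the values reported in this study; the function names and the convention that a higher predicted ITE favors the arm coded `treated = 1` are assumptions.

```python
def qini_curve(pred_ite, treated, success):
    """Cumulative incremental successes when targeting by descending predicted ITE."""
    order = sorted(range(len(pred_ite)), key=lambda i: -pred_ite[i])
    n_t = n_c = y_t = y_c = 0
    curve = [0.0]
    for i in order:
        if treated[i]:
            n_t += 1
            y_t += success[i]
        else:
            n_c += 1
            y_c += success[i]
        # Rescale comparator successes to the size of the treated group so far.
        scaled_control = y_c * n_t / n_c if n_c else 0.0
        curve.append(y_t - scaled_control)
    return curve

def qini_value(pred_ite, treated, success):
    """Area between the Qini curve and the straight random-targeting line."""
    curve = qini_curve(pred_ite, treated, success)
    n = len(curve) - 1
    # Trapezoidal area with the x-axis expressed as a fraction of the population.
    area_model = sum((curve[k] + curve[k + 1]) / 2.0 for k in range(n)) / n
    area_random = curve[-1] / 2.0
    return area_model - area_random
```

A ranking that places participants with the largest true benefit first yields a positive value, while a reversed ranking yields a negative one.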
The c-for-benefit statistic was defined as the proportion of all possible pairs of matched patient pairs with unequal observed benefit for which the matched pair with the higher average predicted ITE also had the greater observed benefit. The c-for-benefit thus measured how well the model discriminated between pairs of patients based on the potential outcomes.21 Bootstrapped 95% confidence intervals were calculated for the adjusted Qini value22 and the c-for-benefit statistic.
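A simplified c-for-benefit calculation is sketched below. It assumes one-to-one matching of participants across arms by rank of predicted ITE, with unmatched participants from the larger arm dropped; the study's actual matching procedure may differ, and the function name is hypothetical.

```python
from itertools import combinations

def c_for_benefit(pred_ite, treated, success):
    """Concordance between predicted and observed benefit across matched pairs."""
    idx = range(len(pred_ite))
    arm_t = sorted((i for i in idx if treated[i]), key=lambda i: pred_ite[i])
    arm_c = sorted((i for i in idx if not treated[i]), key=lambda i: pred_ite[i])
    # Match the k-th ranked treated participant with the k-th ranked comparator.
    pairs = []
    for a, b in zip(arm_t, arm_c):
        predicted = (pred_ite[a] + pred_ite[b]) / 2.0
        observed = success[a] - success[b]  # one of -1, 0, +1
        pairs.append((predicted, observed))
    concordant = 0.0
    usable = 0
    for (p1, o1), (p2, o2) in combinations(pairs, 2):
        if o1 == o2:
            continue  # pairs with equal observed benefit carry no information
        usable += 1
        if p1 == p2:
            concordant += 0.5
        elif (p1 > p2) == (o1 > o2):
            concordant += 1.0
    return concordant / usable if usable else float("nan")
```

A value of 0.5 corresponds to no better than chance ordering of matched pairs, and 1.0 to perfect ordering.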
To better compare participants predicted to benefit from XR-NTX with those predicted to benefit from BUP-NX, the test set was divided into quartiles by participants’ predicted ITE. This allowed a simplified summarization of the baseline characteristics within each quartile and facilitated comparison across the quartiles. To evaluate whether the predicted treatment effect for individual participants modified the effect of trial group assignment on the primary outcome in the test set, we used a logistic regression model to test for interaction between the ITE and the treatment group assignment with the primary outcome as the dependent variable. This complemented insights from the Qini value and the c-for-benefit statistic, which quantify whether the model predicts treatment benefit better than random chance. The observed treatment effect was also examined across quartiles of mean predicted ITE. Finally, we examined induction completion after randomization across the quartiles of predicted ITE, with comparison testing of proportions.
To examine the relationship between the observed average treatment effect and the mean predicted ITE, binned values were visualized in calibration plots. The Expected Calibration Error for Treatment Heterogeneity (ECETH) assesses the calibration error for ITE models and is an adaptation of the Expected Calibration Error for general machine learning models. ECETH was defined as the expected squared difference between the predicted and observed treatment effects in four bins.23 Further details on its calculation are available in Supplemental 3.
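The binned comparison underlying the calibration plots can be sketched as below. This is a plain plug-in version (a size-weighted mean squared gap between predicted and observed effects over four quantile bins) and is an assumption of this example; it omits any bias correction that a formal ECETH estimator may apply, which is one reason a formal estimate can have a confidence interval extending below zero while this quantity cannot be negative.

```python
def mean(values):
    return sum(values) / len(values)

def binned_effects(pred_ite, treated, success, n_bins=4):
    """Per-bin (size, mean predicted ITE, observed treatment effect)."""
    order = sorted(range(len(pred_ite)), key=lambda i: pred_ite[i])
    bins = [[] for _ in range(n_bins)]
    for rank, i in enumerate(order):
        bins[rank * n_bins // len(order)].append(i)
    rows = []
    for members in bins:
        outcomes_t = [success[i] for i in members if treated[i]]
        outcomes_c = [success[i] for i in members if not treated[i]]
        if not outcomes_t or not outcomes_c:
            continue  # observed effect is undefined without both arms
        predicted = mean([pred_ite[i] for i in members])
        observed = mean(outcomes_t) - mean(outcomes_c)
        rows.append((len(members), predicted, observed))
    return rows

def calibration_error(pred_ite, treated, success, n_bins=4):
    """Size-weighted mean squared gap between predicted and observed effects."""
    rows = binned_effects(pred_ite, treated, success, n_bins)
    total = sum(size for size, _, _ in rows)
    return sum(size * (p - o) ** 2 for size, p, o in rows) / total
```

Plotting the second element of each row against the third gives the calibration plot; points on the diagonal indicate agreement between predicted and observed effects.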
Assessing variable importance is a critical step in understanding and interpreting the outputs of machine learning models. For Tian-EN, the magnitude of the regression coefficients represented variable importance. In Uplift-RF, variable importance was assessed as the sum of empirical improvement in uplift as a result of splitting on a variable across all nonterminal nodes. For X-BART, variable importance was determined for each of the two CATE estimators separately and then averaged. Importance scores were rescaled within each model so that the most important variable had a value of 100.
All analyses were performed in R version 3.6.3 (R Foundation for Statistical Computing). The study was deemed exempt by the University of Wisconsin Institutional Review Board, and the data are publicly available on datashare.nida.nih.gov. We followed the guidance of the Predictive Approaches to Treatment effect Heterogeneity statement on predictive modeling of heterogeneity of treatment effect in clinical trials.24
RESULTS:
Participants were 29.4% female and had a median age of 31 (IQR 26–39). Approximately 26.5% of the participants were unemployed and 25.1% were experiencing homelessness. Nearly 75% of participants were using heroin and IV drugs, and 89.3% of those asked about previous methadone or buprenorphine treatment self-reported that it had been successful. Other substance use disorders were common, including cocaine (30.7%), cannabis (28.6%), and alcohol (27.9%). A co-occurring psychiatric disorder was present in 64% of participants. A history of sexual abuse was self-reported in 26.8%, and 40.6% self-reported a history of physical abuse (Table 1).
TABLE 1.
Participant characteristics in train and test cohorts
| Baseline characteristic | Overall (n = 570) | Training set (n = 285) | Testing set (n = 285) | P-value e |
|---|---|---|---|---|
| Male gender, n (%)* | 401 (70.4) | 197 (69.1) | 204 (71.6) | 0.58 |
| Age at randomization, median [IQR]* | 31.00 [26.00, 39.00] | 31.00 [26.00, 38.00] | 32.00 [27.00, 39.00] | 0.21 |
| Hispanic ethnicity, n (%)* | 99 (17.4) | 43 (15.1) | 56 (19.6) | 0.18 |
| Race | ||||
| Black or African American, n (%)* | 57 (10.0) | 28 (9.8) | 29 (10.2) | P>0.99 |
| White, n (%)* | 421 (73.9) | 213 (74.7) | 208 (73.0) | 0.7 |
| Other race, n (%)* | 41 (7.2) | 19 (6.7) | 22 (7.7) | 0.75 |
| Multiracial, n (%)* | 32 (5.6) | 17 (6.0) | 15 (5.3) | 0.86 |
| Education | ||||
| Years education completed, median [IQR]* | 12.00 [12.00, 13.00] | 12.00 [12.00, 14.00] | 12.00 [12.00, 13.00] | 0.12 |
| Opioid use | ||||
| Primary drug – heroin, n (%)* | 463 (81.5) | 224 (79.2) | 239 (83.9) | 0.18 |
| Any heroin use, n (%)* | 497 (87.2) | 248 (87.0) | 249 (87.4) | P>0.99 |
| Heroin use route – IV injection, n (%)* | 381 (74.3) | 183 (71.5) | 198 (77.0) | 0.18 |
| Primary drug cost per day, median [IQR]* | 80.00 [50.00, 106.25] | 80.00 [45.00, 120.00] | 80.00 [50.00, 100.00] | 0.98 |
| Opioid Withdrawal Score, median [IQR]* | 12.00 [5.00, 24.00] | 12.00 [5.00, 24.00] | 12.00 [5.00, 23.00] | 0.95 |
| Other substance use | ||||
| Cocaine use 30 days, n (%)* | 224 (39.3) | 111 (38.9) | 113 (39.6) | 0.93 |
| Methamphetamine use 30 days, n (%) | 94 (16.5) | 44 (15.4) | 50 (17.5) | 0.57 |
| Cannabis use 30 days, n (%)* | 283 (49.6) | 139 (48.8) | 144 (50.5) | 0.74 |
| Smoke cigarettes, n (%)* | 489 (85.8) | 238 (83.5) | 251 (88.1) | 0.15 |
| DSM-5 Substance Use Disorder b | ||||
| DSM-5 alcohol use disorder, n (%)* | 159 (27.9) | 87 (30.5) | 72 (25.3) | 0.19 |
| DSM-5 amphetamine use disorder, n (%)* | 106 (18.6) | 55 (19.3) | 51 (17.9) | 0.75 |
| DSM-5 benzodiazepine use disorder, n (%)* | 153 (26.8) | 68 (23.9) | 85 (29.8) | 0.13 |
| DSM-5 cannabis use disorder, n (%)* | 163 (28.6) | 83 (29.1) | 80 (28.1) | 0.85 |
| DSM-5 cocaine use disorder, n (%)* | 175 (30.7) | 84 (29.5) | 91 (31.9) | 0.59 |
| History of Substance Use | ||||
| Age of opioid use onset, median [IQR]* | 20.00 [17.00, 25.00] | 20.00 [16.00, 25.00] | 20.00 [17.00, 25.00] | 0.59 |
| Age of nicotine use onset, median [IQR]* | 15.00 [13.00, 18.00] | 15.00 [13.00, 18.00] | 15.00 [13.00, 17.00] | 0.45 |
| # previous drug misuse withdrawal management programs, median [IQR] | 1.00 [0.00, 3.00] | 2.00 [0.00, 3.00] | 1.00 [0.00, 3.00] | 0.17 |
| Previous methadone/buprenorphine treatment success (self-reported), n (%)c* | 201 (89.3) | 96 (91.4) | 105 (87.5) | 0.46 |
| Other psychiatric symptoms or disorders (clinician-rated) | ||||
| Hamilton depression score, median [IQR]* | 7.00 [4.00, 14.00] | 7.00 [4.00, 14.00] | 7.00 [4.00, 13.00] | 0.86 |
| Has psychiatric diagnosis, n (%)* | 365 (64.0) | 181 (63.5) | 184 (64.6) | 0.86 |
| Moderate or extreme anxiety/depression, n (%)* | 391 (68.6) | 204 (71.6) | 187 (65.6) | 0.15 |
| Pain | ||||
| Chronic pain at least 6 months, n (%)* | 73 (12.8) | 36 (12.7) | 37 (13.0) | P>0.99 |
| Moderate or extreme pain, n (%)* | 335 (58.8) | 162 (56.8) | 173 (60.7) | 0.395 |
| History of abuse | ||||
| Ever physically abused, n (%) | 231 (40.6) | 111 (38.9) | 120 (42.3) | 0.47 |
| Ever sexually abused, n (%) | 152 (26.8) | 74 (26.1) | 78 (27.5) | 0.8 |
| Living situation | ||||
| Experiencing homelessness, n (%)* | 143 (25.1) | 65 (22.8) | 78 (27.4) | 0.25 |
| Lives with a person with AUD, n (%)* | 65 (11.4) | 29 (10.2) | 36 (12.6) | 0.429 |
| Lives with a person using drugs, n (%)* | 116 (20.4) | 60 (21.1) | 56 (19.6) | 0.755 |
| Legal Status | ||||
| On parole or probation, n (%)* | 92 (16.1) | 42 (14.7) | 50 (17.5) | 0.43 |
| Median number of arrests that resulted in conviction [IQR] | 2.00 [1.00, 4.00] | 2.00 [1.00, 4.00] | 2.00 [1.00, 4.00] | 0.59 |
| Timing of randomization | ||||
| Late randomization, n (%)* | 353 (61.9) | 179 (62.8) | 174 (61.1) | 0.73 |
| Randomized to buprenorphine-naloxone, n (%) | 278 (48.7) | 131 (46.0) | 156 (54.7) | 0.04 |
| No relapse within 24 weeks, n (%) | 222 (38.9) | 108 (37.9) | 114 (40.0) | 0.67 |
* Denotes variable was included as a predictor in the ITE model.
a Includes the following employment statuses: student, retired/disabled, in controlled environment.
b Includes mild, moderate and severe DSM-5 substance use disorders.
c Participants were asked whether they thought past treatments had been successful (yes/no).
d High severity addiction refers to intravenous use at ≥ 6 bags/day.
e Kruskal-Wallis test was used for continuous variables. Chi-squared test was used for binary variables.
AUD=alcohol use disorder
In the primary analysis with the reduced variable list, the Tian-EN model provided the greatest discrimination, with the highest Qini value of 4.45 (95% CI: 1.02–7.83) and a c-for-benefit of 0.63 (95% CI: 0.53–0.68). Uplift-RF and X-BART had Qini values of 4.26 (95% CI: 1.13–7.46) and 4.00 (95% CI: 0.48–7.30), respectively. Their c-for-benefit values were 0.61 (95% CI: 0.51–0.67) and 0.59 (95% CI: 0.50–0.67), respectively. The Qini plot in Figure 1 shows each model’s benefit over random treatment assignment. In the secondary analysis with the full variable list, Tian-EN also had the highest Qini value (2.16; 95% CI: −1.42–5.18; p-value: 0.20) and c-for-benefit (0.58; 95% CI: 0.48–0.62) (Supplemental 5).
FIGURE 1.

Qini Curve
The left figure depicts the discrimination of the Tian-EN, Uplift-RF and X-BART models in the validation cohort. The difference between the method lines (extended-release naltrexone vs buprenorphine-naloxone selected for patients based on predicted ITE from the model) and the random treatment assignment line (extended-release naltrexone vs buprenorphine-naloxone assigned randomly) demonstrates the Qini value, defined as the difference between the areas under the curves plotted by model-based targeting and random targeting. Consistent with high discrimination, the Qini curve for all three models first increases (showing that the patients for whom the model predicted the largest treatment effect from buprenorphine-naloxone experienced the largest benefit from buprenorphine-naloxone), then plateaus (as the population begins to include patients with similar outcomes with either extended-release naltrexone or buprenorphine-naloxone), and finally decreases (showing that the patients for whom the model predicted the largest treatment effect from extended-release naltrexone experienced the largest benefit from extended-release naltrexone). The Qini value is highest for Tian-EN.
The ITE predictions from the reduced variable list models showed a significant interaction with treatment, suggesting that the treatment effect varied with patient characteristics (Tian-EN, p=0.020; Uplift-RF, p=0.015; X-BART, p=0.020). The top predictors in the Tian-EN model were homelessness, the presence of cocaine use disorder, and the opioid withdrawal score. The presence of cocaine use disorder and homelessness were also top predictors in the X-BART and Uplift-RF models (Figure 2). Homelessness persisted as the most important variable in the full variable list models (Supplemental 5).
FIGURE 2.

Variable Importance for the machine learning models
This figure displays the ten variables that Tian-EN, Uplift-RF and X-BART identified as most important in predicting ITE. Variable importance was calculated using model-specific metrics (see Methods). For Tian-EN, the magnitude of the regression coefficients represented variable importance. In Uplift-RF, variable importance was assessed as the sum of empirical improvement in uplift as a result of splitting on a variable across all nonterminal nodes. For X-BART, variable importance was determined for each of the two CATE estimators separately and then averaged. The x-axis scale of importance was normalized so that the most important variable identified by each method had a value of 100. Homelessness was identified as the most important variable across all three models. Cocaine use (either DSM-5 cocaine use disorder or days of cocaine use in the past 30 days) was identified as an important variable by all three models. Only three variables are shown for Tian-EN because all other variables in the model had their coefficients shrunk to zero via penalization in the regression model.
Across all the models, the mean ITE prediction underestimated the observed average treatment effect, as shown in the calibration plots (Supplemental 6). Uplift-RF had the lowest calibration error, with an ECETH of 0.015 (95% CI: −0.006 – 0.041). In the full variable model, Uplift-RF also had the lowest calibration error (Supplemental 5) but visually worse calibration compared to the models using the reduced variable list.
Figure 3 shows an overall increasing trend in the observed ATE across quartiles of predicted ITE for all machine learning models. In examining the mean ITE predictions across quartiles, from those predicted to benefit most from BUP-NX treatment (Quartile 1) to those predicted to benefit most from XR-NTX treatment (Quartile 4), the observed average treatment effect from BUP-NX was also highest in Quartile 1 and lowest in Quartile 4 (Table 2). Patients more likely to benefit from BUP-NX had greater scores on the Subjective Opioid Withdrawal Scale, spent more days using cocaine in the prior 30 days, and more often had co-occurring substance use disorders (alcohol, benzodiazepine, and cocaine). Individuals predicted to be most likely to benefit from XR-NTX (Quartile 4) had the largest proportion of intravenous heroin use, and the large majority were experiencing homelessness (Table 2).
FIGURE 3.

Observed treatment effect in the test set by predicted ITE quartile from the machine learning models
Patients in the test cohort are grouped into quartiles by their predicted individualized treatment effect for each model, ranging from the quartile predicted to benefit most from buprenorphine-naloxone (Q1) to the quartile predicted to benefit most from extended-release naltrexone (Q4). The observed average treatment effect on the y-axis is the difference in the incidence of the primary outcome (no relapse within 24 weeks) between the buprenorphine-naloxone treatment group and the extended-release naltrexone treatment group. The dots show the observed average treatment effect in each quartile, with values above 0 favoring buprenorphine-naloxone and values below 0 favoring extended-release naltrexone. Bars indicate the 95% confidence intervals. For each model, the interaction between the predicted treatment effect quartile and the effect of trial group assignment on the primary outcome was significant (Tian-EN: p = 0.020; Uplift-RF: p = 0.010; X-BART: p = 0.020).
TABLE 2.
Participant characteristics per quartile of mean Individualized Treatment Effect
| Baseline characteristic | Overall | Quartile 1 (benefit from BUP-NX) | Quartile 2 | Quartile 3 | Quartile 4 (benefit from XR-NTX) | P-value* |
|---|---|---|---|---|---|---|
| n | 285 | 71 | 71 | 71 | 72 | |
| Male gender, n (%) | 204 (71.6) | 47 (66.2) | 48 (67.6) | 51 (71.8) | 58 (80.6) | 0.22 |
| Age at randomization, median [IQR] | 32.00 [27.00, 39.00] | 32.00 [26.50, 39.50] | 31.00 [26.00, 36.50] | 32.00 [26.50, 40.50] | 34.00 [28.00, 40.25] | 0.29 |
| Hispanic ethnicity, n (%) | 56 (19.6) | 14 (19.7) | 18 (25.4) | 12 (16.9) | 12 (16.7) | 0.53 |
| Race | ||||||
| Black or African American, n (%) | 29 (10.2) | 6 (8.5) | 5 (7.0) | 10 (14.1) | 8 (11.1) | 0.52 |
| White, n (%) | 208 (73.0) | 56 (78.9) | 53 (74.6) | 48 (67.6) | 51 (70.8) | 0.46 |
| Other race, n (%) | 22 (7.7) | 3 (4.2) | 8 (11.3) | 7 (9.9) | 4 (5.6) | 0.33 |
| Multiracial, n (%) | 15 (5.3) | 3 (4.2) | 4 (5.6) | 4 (5.6) | 4 (5.6) | 0.98 |
| Education | ||||||
| Years education completed, median [IQR] | 12.00 [12.00, 13.00] | 12.00 [12.00, 13.00] | 12.00 [12.00, 14.00] | 12.00 [11.00, 13.00] | 12.00 [12.00, 13.00] | 0.35 |
| Employment | ||||||
| Looking for work, n (%) | 179 (63.5) | 49 (69.0) | 36 (50.7) | 51 (73.9) | 43 (60.6) | 0.02 |
| Opioid use | ||||||
| Primary drug – heroin, n (%) | 239 (83.9) | 57 (80.3) | 57 (80.3) | 56 (78.9) | 69 (95.8) | 0.02 |
| Any heroin use, n (%) | 249 (87.4) | 61 (85.9) | 64 (90.1) | 61 (85.9) | 63 (87.5) | 0.86 |
| Heroin use route - IV injection, n (%) | 198 (77.0) | 48 (73.8) | 52 (81.2) | 49 (77.8) | 49 (75.4) | 0.77 |
| Primary drug cost per day, median [IQR] | 80.00 [50.00, 100.00] | 70.00 [50.00, 105.00] | 90.00 [60.00, 122.50] | 80.00 [50.00, 112.50] | 72.50 [50.00, 100.00] | 0.20 |
| Opioid Withdrawal Score, median [IQR] | 12.00 [5.00, 23.00] | 14.00 [5.00, 26.00] | 22.00 [15.50, 30.00] | 5.00 [3.00, 8.00] | 9.50 [4.00, 19.00] | P<.001 |
| Other substance use at study entry | ||||||
| Cocaine use 30 days, median [IQR]** | 0.00 [0.00, 4.00] | 5.00 [2.00, 17.00] | 0.00 [0.00, 0.00] | 0.00 [0.00, 0.00] | 0.00 [0.00, 1.25] | P<.001 |
| Cannabis use 30 days, median [IQR]** | 1.00 [0.00, 7.00] | 1.00 [0.00, 6.50] | 1.00 [0.00, 5.50] | 0.00 [0.00, 11.00] | 0.50 [0.00, 4.25] | 0.87 |
| Smoke cigarettes, n (%) | 251 (88.1) | 66 (93.0) | 58 (81.7) | 63 (88.7) | 64 (88.9) | 0.22 |
| DSM-5 Substance Use Disorder | ||||||
| DSM-5 alcohol use disorder, n (%) | 72 (25.3) | 27 (38.0) | 11 (15.5) | 16 (22.5) | 18 (25.0) | 0.02 |
| DSM-5 amphetamine use disorder, n (%) | 51 (17.9) | 13 (18.3) | 14 (19.7) | 9 (12.7) | 15 (20.8) | 0.59 |
| DSM-5 benzodiazepine use disorder, n (%) | 85 (29.8) | 30 (42.3) | 14 (19.7) | 16 (22.5) | 25 (34.7) | 0.01 |
| DSM-5 cannabis use disorder, n (%) | 80 (28.1) | 25 (35.2) | 18 (25.4) | 17 (23.9) | 20 (27.8) | 0.45 |
| DSM-5 cocaine use disorder, n (%) | 91 (31.9) | 67 (94.4) | 0 (0.0) | 6 (8.5) | 18 (25.0) | P<.001 |
| History of Substance Use | ||||||
| Age of opioid use onset, median [IQR] | 20.00 [17.00, 25.00] | 21.00 [17.00, 28.00] | 21.00 [18.00, 26.00] | 18.00 [16.00, 24.00] | 21.00 [17.00, 25.00] | 0.18 |
| Age of nicotine use onset, median [IQR] | 15.00 [13.00, 17.00] | 15.00 [13.00, 17.00] | 15.00 [13.00, 17.00] | 15.00 [13.00, 18.00] | 15.00 [12.50, 18.00] | 0.96 |
| Prior methadone/buprenorphine treatment success (self-reported), n (%) | 105 (87.5) | 27 (81.8) | 22 (84.6) | 30 (90.9) | 26 (92.9) | 0.52 |
| Other psychiatric symptoms or disorders (clinician-rated) | ||||||
| Hamilton depression score, median [IQR] | 7.00 [4.00, 13.00] | 7.00 [4.00, 14.00] | 9.00 [4.00, 14.50] | 6.00 [3.00, 11.00] | 8.50 [4.00, 13.25] | 0.14 |
| Has psychiatric diagnosis, n (%) | 184 (64.6) | 48 (67.6) | 45 (63.4) | 48 (67.6) | 43 (59.7) | 0.72 |
| Moderate or extreme anxiety/depression, n (%) | 187 (65.6) | 47 (66.2) | 58 (81.7) | 37 (52.1) | 45 (62.5) | 0.003 |
| Pain | ||||||
| Chronic pain for at least 6 months, n (%) | 37 (13.0) | 14 (19.7) | 7 (9.9) | 6 (8.5) | 10 (13.9) | 0.19 |
| Moderate or extreme pain, n (%) | 173 (60.7) | 43 (60.6) | 50 (70.4) | 31 (43.7) | 49 (68.1) | 0.004 |
| Living situation | ||||||
| Experiencing homelessness, n (%) | 78 (27.4) | 0 (0.0) | 0 (0.0) | 6 (8.5) | 72 (100.0) | P<.001 |
| Lives with a person with AUD, n (%) | 36 (12.6) | 12 (16.9) | 10 (14.1) | 5 (7.0) | 9 (12.5) | 0.34 |
| Lives with a person using drugs, n (%) | 56 (19.6) | 17 (23.9) | 13 (18.3) | 15 (21.1) | 11 (15.3) | 0.60 |
| Legal Status | ||||||
| On parole or probation, n (%) | 50 (17.5) | 14 (19.7) | 13 (18.3) | 14 (19.7) | 9 (12.5) | 0.62 |
| Medication preference | ||||||
| Strong medication preference, n (%) | 77 (27.0) | 20 (28.2) | 19 (26.8) | 23 (32.4) | 15 (20.8) | 0.48 |
| No medication preference, n (%) | 96 (33.7) | 25 (35.2) | 20 (28.2) | 25 (35.2) | 26 (36.1) | 0.73 |
| Not prefer buprenorphine-naloxone, n (%)* | 75 (26.3) | 19 (26.8) | 20 (28.2) | 24 (33.8) | 12 (16.7) | 0.13 |
| Prefer buprenorphine-naloxone, n (%)* | 92 (32.3) | 20 (28.2) | 22 (31.0) | 24 (33.8) | 26 (36.1) | 0.76 |
| Not prefer extended-release naltrexone, n (%)* | 67 (23.5) | 17 (23.9) | 14 (19.7) | 18 (25.4) | 18 (25.0) | 0.85 |
| Prefer extended-release naltrexone, n (%)* | 85 (29.8) | 22 (31.0) | 25 (35.2) | 23 (32.4) | 15 (20.8) | 0.26 |
| Timing of randomization | ||||||
| Late randomization, n (%) | 174 (61.1) | 44 (62.0) | 50 (70.4) | 45 (63.4) | 35 (48.6) | 0.06 |
| Severity of addiction | ||||||
| High severity addiction, n (%) | 117 (41.1) | 22 (31.0) | 31 (43.7) | 27 (38.0) | 37 (51.4) | 0.09 |
| Randomized to buprenorphine-naloxone, n (%) | 156 (54.7) | 35 (49.3) | 45 (63.4) | 41 (57.7) | 35 (48.6) | 0.23 |
| Completed induction after randomization, n (%) | 237 (83.1) | 56 (78.9) | 61 (85.9) | 59 (83.1) | 61 (84.7) | 0.70 |
| No relapse within 24 weeks, n (%) | 114 (40.0) | 30 (42.3) | 26 (36.6) | 34 (47.9) | 24 (33.3) | 0.30 |
| Observed average treatment effect, % (95% CI) | | 35 (11, 60) | 21 (−3.4, 46) | 2.1 (−24, 28) | −9 (−37, 15) | |
AUD=alcohol use disorder
Induction failure occurred in 16.1% of participants, who did not go on to receive the assigned medication. Participants in the top two quartiles for predicted benefit from BUP-NX had induction failure rates between 11.1% and 14.1%. Participants in the top two quartiles for predicted benefit from XR-NTX had induction failure rates between 15.3% and 16.9%. No significant differences in induction failure were found across the quartiles of predicted ITEs (p=0.70). In the full variable model, suicidal ideation, suicide attempts, and hospitalizations occurred in greater proportions and counts among patients more likely to benefit from BUP-NX (p<0.05 after correction for multiple comparisons).
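The comparison of induction completion across quartiles can be checked directly from the counts in the table. The sketch below assumes a Pearson chi-square test on the 2 × 4 table of completed versus not completed; the test actually used is not stated in this section, and the function names are illustrative only.

```python
import math

# Participants per quartile and number completing induction, copied from the
# "Completed induction after randomization" row of the table.
totals = [71, 71, 71, 72]
completed = [56, 61, 59, 61]


def chi_square_2xk(successes, totals):
    """Pearson chi-square statistic for a 2 x k table of successes vs. failures."""
    n = sum(totals)
    p_success = sum(successes) / n
    stat = 0.0
    for s, t in zip(successes, totals):
        cells = ((s, t * p_success), (t - s, t * (1 - p_success)))
        for observed, expected in cells:
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form valid only for df = 3 (here: 4 quartiles - 1).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


stat = chi_square_2xk(completed, totals)
p_value = chi_square_sf_df3(stat)
print(f"chi-square = {stat:.2f}, p = {p_value:.2f}")
```

Under this assumption the statistic is small relative to its 3 degrees of freedom, and the resulting p-value agrees with the value of 0.70 reported in the table.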
DISCUSSION:
In a secondary analysis of one of the largest clinical trials of treatment for patients with OUD, we found that machine learning methods can identify patient characteristics associated with greater benefit from either XR-NTX or BUP-NX in preventing relapse. BUP-NX was more beneficial for patients with more severe withdrawal symptoms and co-occurring substance use disorders, whereas XR-NTX was more beneficial for those experiencing homelessness and those who primarily used intravenous heroin. More importantly, the lack of difference in induction failures between those predicted to benefit from BUP-NX and those predicted to benefit from XR-NTX suggests that the model is learning about the benefits of the medications rather than the likelihood of failing the withdrawal management phase. The ITE analysis highlighted differences in benefit between the two medications based on individual patient characteristics, offering valuable insights for healthcare providers to personalize therapeutic choices in patients with OUD.
Simulation studies have exposed numerous shortcomings in regression-based modeling of treatment-covariate interactions. Specifically, these interactions necessitate penalization to mitigate the risk of overfitting, and the resulting models often yield poorly calibrated predictions of benefit on an absolute scale.21 Our study extends the work by Nunes et al. by identifying significant effect modifiers not found in their study. Prior studies like Nunes et al. investigated multiple individual variables and their interactions, which constitutes multiple testing and requires correction. In our machine learning approach to ITE, all variables were examined together, and collinearities and non-significant interactions were handled by penalization parameters built into the models. We employed a counterfactual framework, applied hyperparameters with penalization, and evaluated the models in an independent hold-out test dataset to guard against overfitting. We reported only the results of the test set, and we judged the success of a model by whether treatment guided by it could enhance personalized care, which was measured using the Qini value (and not p-values). Our tables examining multiple variables across quartiles with significance testing served only as a descriptive visualization after model training and testing had been completed.
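To make the Qini-based evaluation concrete, the following is a minimal sketch of one common formulation: patients are ranked by predicted benefit, cumulative incremental successes are accumulated, and the area between that curve and the line expected under random ranking is taken. The function names and toy data are hypothetical, and the scaling may differ from the one behind the Qini values reported in this study.

```python
def qini_curve(scores, treated, success):
    """Cumulative incremental successes with patients ranked by predicted benefit."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_t = n_c = y_t = y_c = 0
    curve = [0.0]
    for i in order:
        if treated[i]:
            n_t += 1
            y_t += success[i]
        else:
            n_c += 1
            y_c += success[i]
        # Scale comparator successes to the size of the treated group seen so far.
        scaled_control = y_c * n_t / n_c if n_c else 0.0
        curve.append(y_t - scaled_control)
    return curve


def qini_coefficient(scores, treated, success):
    """Area between the Qini curve and the straight line of a random ranking."""
    curve = qini_curve(scores, treated, success)
    n = len(curve) - 1
    random_line = [curve[-1] * k / n for k in range(n + 1)]
    gaps = [c - r for c, r in zip(curve, random_line)]
    # Trapezoidal rule with unit-width steps.
    return sum((gaps[k] + gaps[k + 1]) / 2 for k in range(n))


# Toy data (hypothetical): the first four patients benefit from treatment,
# the last four do better on the comparator.
scores = [0.9, 0.8, 0.7, 0.6, -0.6, -0.7, -0.8, -0.9]
treated = [1, 0, 1, 0, 1, 0, 1, 0]
success = [1, 0, 1, 0, 0, 1, 0, 1]

good = qini_coefficient(scores, treated, success)
bad = qini_coefficient([-s for s in scores], treated, success)
print(f"informative ranking: {good:.2f}, reversed ranking: {bad:.2f}")
```

A ranking that places patients who benefit first gives a positive area, and reversing the ranking gives a negative one, which is why the value can be read as a measure of how well a model orders patients by benefit.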
The machine learning models revealed characteristics of patients who could benefit more from XR-NTX treatment, which were not apparent from the X:BOT trial’s broader findings. These results demonstrate the power of machine learning in distilling complex, individual patient data and identifying discrete patient characteristics associated with differential treatment responses, a task that conventional statistical methods are limited in performing. All models could differentiate between the treatment effects, and the results were similar across the models. Among the tested models, the Tian-EN model demonstrated the best discrimination as measured by the Qini value. Although the Tian-EN model showed the correct directional trend across quartiles of predicted mean ITE when compared with the average treatment effect, it consistently underestimated the average treatment effect from the trial. This suggests that the predictions had the correct direction of absolute treatment effect but some miscalibration in its magnitude.
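The quartile comparison behind this calibration statement can be sketched as follows: patients are ranked by predicted ITE, split into four groups, and the observed treatment effect in each group is the difference in success proportions between arms. The function name and toy data are hypothetical, and the sketch assumes every quartile contains patients from both arms.

```python
def observed_effect_by_quartile(predicted_ite, treated, success):
    """Return (quartile, mean predicted ITE, observed effect) for each quartile.

    Quartile 1 holds the patients with the highest predicted benefit.
    Assumes each quartile contains at least one patient from each arm.
    """
    order = sorted(range(len(predicted_ite)),
                   key=lambda i: predicted_ite[i], reverse=True)
    n = len(order)
    results = []
    for q in range(4):
        members = order[q * n // 4:(q + 1) * n // 4]
        arm_t = [success[i] for i in members if treated[i]]
        arm_c = [success[i] for i in members if not treated[i]]
        # Observed effect: difference in proportion without relapse between arms.
        observed = sum(arm_t) / len(arm_t) - sum(arm_c) / len(arm_c)
        mean_pred = sum(predicted_ite[i] for i in members) / len(members)
        results.append((q + 1, mean_pred, observed))
    return results


# Toy data (hypothetical): 16 patients, alternating arms, success = no relapse.
predicted_ite = [0.30, 0.28, 0.26, 0.24, 0.15, 0.14, 0.13, 0.12,
                 0.03, 0.02, 0.01, 0.00, -0.05, -0.06, -0.07, -0.08]
treated = [1, 0, 1, 0] * 4
success = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

results = observed_effect_by_quartile(predicted_ite, treated, success)
for quartile, mean_pred, observed in results:
    print(f"Q{quartile}: predicted {mean_pred:+.3f}, observed {observed:+.3f}")
```

In this toy example the observed effects fall from quartile 1 to quartile 4 in the same order as the predictions, while the predicted magnitudes are smaller than the observed ones, which is the pattern described above as correct direction with miscalibrated magnitude.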
High dropout and relapse rates observed in previous studies2, 25, 26 prompt the question of whether personalizing care could enhance treatment efficacy in OUD. In a large observational study examining medication practices for initiating MOUD from a multistate claims database, co-occurring substance use disorder was associated with lower buprenorphine initiation and higher naltrexone initiation.4 Our results showed that patients with co-occurring substance use disorder were less likely to relapse if they initiated BUP-NX rather than XR-NTX. Moreover, individuals experiencing homelessness, which is often linked to involuntary displacement, initiate fewer OUD treatments27, reflecting the broader challenges of adhering to consistent treatment regimens. Like Nunes et al., our study showed XR-NTX to be the optimal medication among individuals experiencing homelessness. While factors such as medication accessibility and payer reimbursement policies influence prescribing practices and lead to the infrequent prescription of XR-NTX, the implementation of ITE models can empower providers to consider naltrexone as a viable treatment option in scenarios where it may not have been considered previously. We identified other substance use disorders, looking for work, heroin drug preference, anxiety, and pain as additional characteristics that differed between MOUDs. Some of these characteristics in non-homeless adults also appeared in a secondary analysis of X:BOT that applied mediation analysis.29 Opioid withdrawal score and cocaine use disorder were among other moderators of treatment relapse that were not identified previously.
We conducted a comprehensive investigation of both linear and non-linear relationships, leveraging three approaches: logistic regression, a Bayesian framework, and decision trees (i.e., Tian-EN, X-BART, and Uplift-RF, respectively). The Tian-EN model retained only three variables in its list of top predictors because it used a regularization technique that encourages a simpler model with fewer parameters (i.e., Least Absolute Shrinkage and Selection Operator). The tree-based models provided a longer list of top predictors that included co-occurring substance use disorders, recent use of other drugs, and psychiatric diagnoses. These characteristics carried strong face validity and were previously found to be associated with health outcomes in patients with OUD.4, 28 Subsequently, we stratified patients into quartiles based on their mean ITE estimates and compared these with the trial’s average treatment effect. All the models showed a clear trend in the observed treatment effect as the ITE predictions moved from favoring BUP-NX to favoring XR-NTX. Although some of the differences in predictor variables across the quartiles did not show a trend, they did confirm the findings of the variable importance plot from the model’s test set performance.
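For reference, the sparsity of a penalized regression model can be read from the form of its objective. The expression below is the elastic-net objective in its common parameterization, with notation assumed here for illustration: $\ell(\beta)$ is the log-likelihood, $\lambda \ge 0$ the overall penalty strength, and $\alpha \in [0, 1]$ the mixing weight. It is a general statement of the technique, not a description of the exact settings used in this study.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  -\frac{1}{n}\,\ell(\beta)
  \;+\; \lambda \left[ \alpha \,\lVert \beta \rVert_1
  \;+\; \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right]
```

Setting $\alpha = 1$ recovers the Least Absolute Shrinkage and Selection Operator. The $\lVert \beta \rVert_1$ term can shrink coefficients exactly to zero, which is why such a model may retain only a handful of predictors.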
Machine learning models utilized for ITE exhibit an advantage in accommodating more predictors than traditional methods8, and have recently found success in other areas of medicine.30–32 Although the model with the full variable list had worse discrimination and calibration than the model with the reduced variable list, suggesting that adding unimportant variables may worsen performance, it surfaced additional treatment effect modifiers not identified with the reduced list, such as suicidality and prior hospitalizations. These are modifiable risk factors, but larger studies are needed to confirm these findings. Nevertheless, this study underscores the potential of machine learning for hypothesis-generating studies.
Although X:BOT is among the largest clinical trials of medications for OUD, it underrepresents non-white individuals and female participants.33 Because this was a secondary analysis of an existing clinical trial, the trained ITE models require additional testing beyond the internal validation performed on the test dataset, and external validation studies are needed to affirm their real-world applicability. Lastly, machine learning models require large amounts of data. Simulation data indicate that random forest models require hundreds of outcomes per variable for stability34, which may have contributed to some of the calibration errors we observed. This also illustrates the value of expert determination in variable selection when sample sizes are limited, as our reduced variable model was more stable than the full variable model.
CONCLUSION
This study demonstrates an application of machine learning to assess treatment effect heterogeneity in OUD treatment trials. Differences in patient baseline characteristics can predict treatment success between XR-NTX and BUP-NX for preventing relapse in patients with OUD. Regression and tree-based machine learning methods for ITE are a promising direction for developing tailored treatments for patients with addiction.
Sources of funding:
Dr. Afshar is supported by an R01 from NIH/NIDA (R01DA051464). Dr. Churpek is supported by an R35 award from NIH/NIGMS (R35GM145330). Dr. Sinha is supported by an R35 award from NIH/NIGMS (R35GM142992). Dr. Rotrosen is supported by a U10 award from NIH/NIDA (U10DA013035).
Footnotes
Conflicts of Interest: None
REFERENCES
- 1. Larochelle MR, Bernson D, Land T, et al. Medication for Opioid Use Disorder After Nonfatal Opioid Overdose and Association With Mortality: A Cohort Study. Ann Intern Med. 2018;169:137–145.
- 2. Lee JD, Friedmann PD, Kinlock TW, et al. Extended-Release Naltrexone to Prevent Opioid Relapse in Criminal Justice Offenders. N Engl J Med. 2016;374:1232–42.
- 3. Madras BK, Ahmad NJ, Wen J and Sharfstein JS. Improving Access to Evidence-Based Medical Treatment for Opioid Use Disorder: Strategies to Address Key Barriers within the Treatment System. NAM Perspect. 2020;2020.
- 4. Xu KY, Mintz CM, Presnall N, Bierut LJ and Grucza RA. Comparative Effectiveness Associated With Buprenorphine and Naltrexone in Opioid Use Disorder and Cooccurring Polysubstance Use. JAMA Netw Open. 2022;5:e2211363.
- 5. Krawczyk N, Fawole A, Yang J and Tofighi B. Early innovations in opioid use disorder treatment and harm reduction during the COVID-19 pandemic: a scoping review. Addict Sci Clin Pract. 2021;16:68.
- 6. Lee JD, Nunes EV Jr., Novo P, et al. Comparative effectiveness of extended-release naltrexone versus buprenorphine-naloxone for opioid relapse prevention (X:BOT): a multicentre, open-label, randomized controlled trial. Lancet. 2018;391:309–318.
- 7. Hoogland J, IntHout J, Belias M, et al. A tutorial on individualized treatment effect prediction from randomized trials with a binary endpoint. Stat Med. 2021;40:5961–5981.
- 8. Nunes EV Jr., Scodes JM, Pavlicova M, et al. Sublingual Buprenorphine-Naloxone Compared With Injection Naltrexone for Opioid Use Disorder: Potential Utility of Patient Characteristics in Guiding Choice of Treatment. Am J Psychiatry. 2021;178:660–671.
- 9. Lee JD, Nunes EV, Mpa PN, et al. NIDA Clinical Trials Network CTN-0051, Extended-Release Naltrexone vs. Buprenorphine for Opioid Treatment (X:BOT): Study design and rationale. Contemp Clin Trials. 2016;50:253–64.
- 10. Hamilton CM, Strader LC, Pratt JG, et al. The PhenX Toolkit: get the most from your measures. Am J Epidemiol. 2011;174:253–60.
- 11. Hjorthoj CR, Hjorthoj AR and Nordentoft M. Validity of Timeline Follow-Back for self-reported use of cannabis and other illicit substances--systematic review and meta-analysis. Addict Behav. 2012;37:225–33.
- 12. Carrozzino D, Patierno C, Fava GA and Guidi J. The Hamilton Rating Scales for Depression: A Critical Review of Clinimetric Properties of Different Versions. Psychother Psychosom. 2020;89:133–150.
- 13. Rabin R and de Charro F. EQ-5D: a measure of health status from the EuroQol Group. Ann Med. 2001;33:337–43.
- 14. Handelsman L, Cochrane KJ, Aronson MJ, Ness R, Rubinstein KJ and Kanof PD. Two new rating scales for opiate withdrawal. Am J Drug Alcohol Abuse. 1987;13:293–308.
- 15. Kuhn M. caret: Classification and regression training. 2014;2023.
- 16. Tian L, Alizadeh AA, Gentles AJ and Tibshirani R. A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates. J Am Stat Assoc. 2014;109:1517–1532.
- 17. Kunzel SR, Sekhon JS, Bickel PJ and Yu B. Metalearners for estimating heterogeneous treatment effects using machine learning. Proc Natl Acad Sci U S A. 2019;116:4156–4165.
- 18. Leo Guelman NG, Perez-Marin Ana M. Random forests for uplift modeling: An insurance customer retention case. Modeling and Simulation in Engineering, Economics, and Management. Berlin, Heidelberg: Springer; 2012(115):123–133.
- 19. Vincent Dorie JH, Uri Shalit, Marc Scott, Dan Cervone. Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition. Statistical Science. 2019;32:43–68.
- 20. Devriendt FMD, Verbeke W. A literature survey and experimental evaluation of the State-of-the-Art in uplift modeling: A stepping stone toward the development of prescriptive analytics. Big Data. 2018;6:13–41.
- 21. van Klaveren D, Steyerberg EW, Serruys PW and Kent DM. The proposed ‘concordance-statistic for benefit’ provided a useful metric when modeling heterogeneous treatment effects. J Clin Epidemiol. 2018;94:59–68.
- 22. Steve Yadlowsky SF, Nigam Shah, Emma Brunskill, Stefan Wager. Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects. arXiv. 2021:1–53.
- 23. Yizhe Xu SY. Calibration error for heterogeneous treatment effects. Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS). 2022;151.
- 24. Kent DM, Paulus JK, van Klaveren D, D’Agostino R, et al. The Predictive Approaches to Treatment effect Heterogeneity (PATH) Statement. Ann Intern Med. 2020;172:35–45.
- 25. Hser YI, Saxon AJ, Huang D, et al. Treatment retention among patients randomized to buprenorphine/naloxone compared to methadone in a multi-site trial. Addiction. 2014;109:79–87.
- 26. Hser YI, Evans E, Huang D, et al. Long-term outcomes after randomization to buprenorphine/naloxone versus methadone in a multi-site trial. Addiction. 2016;111:695–705.
- 27. Barocas JA, Nall SK, Axelrath S, et al. Population-Level Health Effects of Involuntary Displacement of People Experiencing Unsheltered Homelessness Who Inject Drugs in US Cities. JAMA. 2023.
- 28. Fridell M, Bäckström M, Hesse M, Krantz P, Perrin S and Nyhlén A. Prediction of psychiatric comorbidity on premature death in a cohort of patients with substance use disorders: a 42-year follow-up. BMC Psychiatry. 2019;19.
- 29. Rudolph KE, Diaz I, Hejazi NS, et al. Explaining differential effects of medication for opioid use disorder using a novel approach incorporating mediating variables. Addiction. 2021;116:2094–2103.
- 30. Goligher EC, Lawler PR, Jensen TP, et al. Heterogeneous Treatment Effects of Therapeutic-Dose Heparin in Patients Hospitalized for COVID-19. JAMA. 2023;329:1066–1077.
- 31. Seitz KP, Spicer AB, Casey JD, et al. Individualized Treatment Effects of Bougie versus Stylet for Tracheal Intubation in Critical Illness. Am J Respir Crit Care Med. 2023;207:1602–1611.
- 32. Edward JA, Josey K, Bahn G, et al. Heterogeneous treatment effects of intensive glycemic control on major adverse cardiovascular events in the ACCORD and VADT trials: a machine-learning analysis. Cardiovasc Diabetol. 2022;21:58.
- 33. Rudolph KE, Russell M, Luo SX, Rotrosen J and Nunes EV. Under-representation of key demographic groups in opioid use disorder trials. Drug Alcohol Depend Rep. 2022;4.
- 34. Sanchez-Pinto LN, Venable LR, Fahrenbach J and Churpek MM. Comparison of variable selection methods for clinical predictive modeling. Int J Med Inform. 2018;116:10–17.
