Abstract
Rationale
A recent randomized trial found that using a bougie did not increase the incidence of successful intubation on first attempt in critically ill adults. The average effect of treatment in a trial population, however, may differ from effects for individuals.
Objective
We hypothesized that application of a machine learning model to data from a clinical trial could estimate the effect of treatment (bougie vs. stylet) for individual patients based on their baseline characteristics (“individualized treatment effects”).
Methods
This was a secondary analysis of the BOUGIE (Bougie or Stylet in Patients Undergoing Intubation Emergently) trial. A causal forest algorithm was used to model differences in outcome probabilities by randomized group assignment (bougie vs. stylet) for each patient in the first half of the trial (training cohort). This model was used to predict individualized treatment effects for each patient in the second half (validation cohort).
Measurements and Main Results
Of 1,102 patients in the BOUGIE trial, 558 (50.6%) were in the training cohort, and 544 (49.4%) were in the validation cohort. In the validation cohort, individualized treatment effects predicted by the model significantly modified the effect of trial group assignment on the primary outcome (P value for interaction = 0.02; adjusted qini coefficient, 2.46). The most important model variables were difficult airway characteristics, body mass index, and Acute Physiology and Chronic Health Evaluation II score.
Conclusions
In this hypothesis-generating secondary analysis of a randomized trial with no average treatment effect and no treatment effect in any prespecified subgroups, a causal forest machine learning algorithm identified patients who appeared to benefit from the use of a bougie over a stylet and from the use of a stylet over a bougie using complex interactions between baseline patient and operator characteristics.
Keywords: intubation, critical illness, machine learning, prediction models
At a Glance Commentary
Scientific Knowledge on the Subject
A recent randomized trial found that use of a bougie did not increase the incidence of successful intubation on the first attempt in critically ill adults, nor was an effect detected in any subgroup. The effect for individuals, however, may differ from the average effect across a study population, and machine learning approaches have been developed to predict such individualized treatment effects.
What This Study Adds to the Field
In this secondary analysis of a randomized trial with no average treatment effect, a causal forest machine learning algorithm was able to identify patients for whom interactions between baseline characteristics resulted in apparent benefit from either the use of a bougie or the use of a stylet. These machine learning methods show potential for deriving evidence-based estimates of individualized treatment effects from clinical trials.
Tracheal intubation is a common procedure in the emergency department and ICU. Nearly half of patients undergoing emergency tracheal intubation experience life-threatening hypoxemia or hypotension, and approximately 3% experience cardiac arrest (1). The risk of complications increases when intubation does not occur on the first attempt (2–4). During tracheal intubation, two devices are commonly used to advance the endotracheal tube past the vocal cords and into the trachea: a stylet (a malleable metal instrument placed inside the endotracheal tube) or a tracheal tube introducer (a thin, flexible plastic rod passed into the trachea first), typically referred to as a bougie. The recent BOUGIE (Bougie or Stylet in Patients Undergoing Intubation Emergently) multicenter randomized trial compared the effect of using a bougie versus a stylet on the incidence of successful intubation on the first attempt in patients for whom the operator believed either technique was acceptable (5). Among the 1,102 patients in the trial, the use of a bougie did not significantly increase the likelihood of successful intubation on the first attempt, overall or in any prespecified patient subgroup.
Randomized trials like BOUGIE typically report an average treatment effect, which represents the difference between the treatments on average across all patients in the trial population. However, the effect of a treatment on outcomes may vary for different patients based on their individual characteristics. To best inform clinical care for individual patients, the analysis of clinical trials might move beyond presenting only average treatment effects to deriving and validating estimates of treatment effects for individual patients. Individualized treatment effect is the predicted difference in outcomes between two treatments for an individual patient, based on the set of his or her individual characteristics. This differs from traditional analysis of heterogeneity of treatment effect, in which a first-order interaction between group assignment and a proposed effect modifier is assessed one at a time (6, 7). Although in parallel-group randomized trials patient outcomes are observed after the assignment to only one treatment, machine learning approaches have been developed that can predict individualized treatment effect using baseline data. This approach has the potential to guide more personalized therapy, even from studies that show no overall difference between groups in average treatment effect.
We conducted a secondary analysis of the dataset for the BOUGIE trial, in which we used machine learning methods to derive and validate estimates for the treatment effect of bougie versus stylet on successful intubation during the first attempt for individual patients. Our aim was to apply an innovative machine learning technique with rigorous, prespecified parameters to demonstrate the potential opportunities and limitations of these methods to inform the interpretation of clinical trial results. We hypothesized that the machine learning model could effectively identify patients who would benefit from receipt of a bougie or receipt of a stylet using their baseline characteristics. Some of the results of this study were presented in abstract form at the American Thoracic Society international conference in 2022 (8).
Methods
BOUGIE was a pragmatic, multicenter, randomized trial at 15 sites in the United States that randomized 1,102 critically ill adults to the use of a bougie versus a stylet for emergency tracheal intubation. Patients for whom operators believed either a bougie or a stylet was required or contraindicated were excluded. The primary outcome was successful intubation on the first attempt. This secondary analysis of deidentified data from the BOUGIE trial (5) was approved by the University of Wisconsin Institutional Review Board (2019-1258) and followed the guidance of the Predictive Approaches to Treatment effect Heterogeneity statement on predictive modeling of heterogeneity of treatment effect in clinical trials (7).
All 1,102 patients in the BOUGIE trial were included in this analysis. The study population was divided into two groups using the midpoint of enrollment in the trial: 1) patients enrolled in the first half of the trial were included in a training cohort for development of the predictive model; and 2) patients enrolled in the second half of the trial were included in a validation cohort for validation of the predictive model. The primary outcome from the original trial, successful intubation on the first attempt, was used as the outcome for the current analyses. We selected model predictors that were present at the time of trial enrollment, including baseline patient demographics, vital signs, severity of illness, and difficult airway characteristics, which included any of the following: vomiting; witnessed aspiration; upper gastrointestinal bleeding; epistaxis or oral bleeding; upper airway mass, infection, or trauma; head and neck radiation; limited neck mobility; limited mouth opening; history of obstructive sleep apnea; or other. Characteristics of the intubating clinician were also used in the model, such as their prior number of intubations using a bougie (see Supplemental Methods in the online supplement for a complete list of prespecified predictor variables used in the model). The training and validation cohorts were compared using the chi-square test for categorical variables and the Wilcoxon rank-sum test for continuous variables.
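The temporal split described above can be sketched as follows. This is an illustrative sketch only, not the trial's code: the record layout, the `enrolled` field, the toy dates, and the choice to assign everyone enrolled on the cutoff date to the training cohort are assumptions made for the example.

```python
from datetime import date

def temporal_split(patients):
    """Split patients at the enrollment date of the median patient.

    Everyone enrolled on or before that date goes to the training cohort,
    so ties on the cutoff date can make the two cohorts unequal in size.
    """
    ordered = sorted(patients, key=lambda p: p["enrolled"])
    # Enrollment date of the median patient (lower median for even counts).
    cutoff = ordered[(len(ordered) - 1) // 2]["enrolled"]
    training = [p for p in ordered if p["enrolled"] <= cutoff]
    validation = [p for p in ordered if p["enrolled"] > cutoff]
    return cutoff, training, validation

# Toy records with made-up enrollment dates (hypothetical data).
patients = [
    {"id": 1, "enrolled": date(2019, 5, 1)},
    {"id": 2, "enrolled": date(2019, 9, 12)},
    {"id": 3, "enrolled": date(2019, 9, 12)},
    {"id": 4, "enrolled": date(2019, 9, 12)},
    {"id": 5, "enrolled": date(2020, 11, 20)},
    {"id": 6, "enrolled": date(2021, 2, 1)},
]
cutoff, training, validation = temporal_split(patients)
print(cutoff, len(training), len(validation))
```

Splitting by date rather than at random keeps the validation cohort temporally distinct from the data used to build the model.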
In the training cohort, a causal forest algorithm was used to predict individualized differences in outcome probabilities by randomized group assignment (bougie group vs. stylet group) for each patient based on the baseline characteristics using all prespecified model predictors (9). The causal forest model comprises predictions obtained from a collection of individual decision trees. Each decision tree is created using a randomly chosen subset of the training observations and baseline variables. Variables and their cut-points in the tree are chosen to maximize heterogeneity in treatment effect across the splits. An average treatment effect is then calculated in each leaf using the remaining training observations not included in the tree’s construction. This use of a different set of observations in tree creation and effect estimation reduces bias and overfitting. Final predictions are determined by aggregating the results across all trees for a single individual. To improve stability, five causal forest models were constructed in the training cohort using different random seed initializations in sequential runs, with 2,000 trees in each run (10). Platt scaling and centering of the mean prediction were performed in the training data to improve model calibration, with these same scaling values applied to the predictions in the validation cohort. The intervention and outcome data elements were not missing for any patients in the dataset, and missing covariate values were handled natively by the causal forest algorithm, which analyzed them as missing without imputation.
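The idea of using one set of observations to choose a split and a different set to estimate the effect in each leaf can be illustrated with a deliberately small toy. The sketch below is not the grf algorithm used in the analysis: it builds a one-variable, one-split "stump" on synthetic data, and the function names, the `min_leaf` threshold, and the data are hypothetical. Averaging over five seeds loosely mirrors the five sequential runs described above.

```python
import random

def effect(rows):
    """Outcome proportion in treated minus control (None if an arm is empty)."""
    treated = [r["y"] for r in rows if r["w"] == 1]
    control = [r["y"] for r in rows if r["w"] == 0]
    if not treated or not control:
        return None
    return sum(treated) / len(treated) - sum(control) / len(control)

def best_cut(rows, min_leaf=20):
    """Cut-point on x that maximizes the gap in treatment effect between sides."""
    best_gap, best = -1.0, None
    for cut in sorted({r["x"] for r in rows})[:-1]:
        left = [r for r in rows if r["x"] <= cut]
        right = [r for r in rows if r["x"] > cut]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        e_left, e_right = effect(left), effect(right)
        if e_left is None or e_right is None:
            continue
        if abs(e_left - e_right) > best_gap:
            best_gap, best = abs(e_left - e_right), cut
    return best

def honest_stump(rows, seed):
    """Choose the cut on one random half; estimate leaf effects on the other half."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    split_rows, estimate_rows = shuffled[:half], shuffled[half:]
    cut = best_cut(split_rows)
    if cut is None:
        return None
    left = effect([r for r in estimate_rows if r["x"] <= cut])
    right = effect([r for r in estimate_rows if r["x"] > cut])
    return cut, left, right

def predict(stumps, x):
    """Average the leaf effect for value x across stumps, skipping unusable ones."""
    values = []
    for stump in stumps:
        if stump is None:
            continue
        cut, left, right = stump
        leaf = left if x <= cut else right
        if leaf is not None:
            values.append(leaf)
    return sum(values) / len(values) if values else None

# Synthetic data (hypothetical): x stands in for a baseline covariate,
# w = 1 for one randomized arm and w = 0 for the other, y = 1 for success.
rows = []
for x in range(20):
    for w, successes in ((1, 6 if x < 10 else 9), (0, 8)):
        for rep in range(10):
            rows.append({"x": x, "w": w, "y": 1 if rep < successes else 0})

stumps = [honest_stump(rows, seed) for seed in range(5)]
prediction = predict(stumps, 15)
print(prediction)
```

Because the observations that pick the cut-point never contribute to the leaf estimates, a cut chosen to exaggerate a chance difference does not carry that exaggeration into the reported effect.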
The models developed in the training cohort were then applied to the validation cohort to predict individualized treatment effect values for each patient, averaging the predictions across each of the five model iterations. To assess whether the model could accurately discriminate patients in the validation cohort more likely to benefit from use of a bougie versus a stylet, the adjusted qini value was calculated and the corresponding qini curve plotted (11, 12). The qini curve depicts, on the x-axis, the cumulative proportion of the population ordered by predicted increasing benefit for stylet from the model predictions, and the y-axis represents the difference in the frequency of the outcome by treatment group (average treatment effect) among that proportion of the treated population scaled for the total population. The qini value is the area between the curve derived from arranging the population by the individualized treatment effect from the model and the line representing a random order, with a larger value indicating better discrimination. The adjusted qini value is the qini value scaled by Kendall’s rank correlation between the predicted and observed individualized treatment effects.
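A qini curve of the kind described above can be sketched in a few lines. The formulation below (cumulative successes in one arm minus cumulative successes in the other arm rescaled to the same arm size, divided by the total number of patients) is one common definition and is an assumption here; it is not necessarily the exact calculation performed by the tools4uplift package, and the Kendall adjustment is not computed. All data are made up.

```python
def qini_curve(predicted, treated, outcome):
    """Cumulative uplift with patients ranked from highest to lowest predicted effect.

    At each rank k, the gain is (successes among treated in the top k) minus
    (successes among controls in the top k) rescaled to the treated count,
    divided by the total number of patients.
    """
    order = sorted(range(len(predicted)), key=lambda i: -predicted[i])
    n_t = n_c = y_t = y_c = 0
    xs, ys = [0.0], [0.0]
    for rank, i in enumerate(order, start=1):
        if treated[i]:
            n_t += 1
            y_t += outcome[i]
        else:
            n_c += 1
            y_c += outcome[i]
        # With no controls seen yet there is nothing to subtract
        # (a convention adopted for this sketch).
        gain = y_t - y_c * n_t / n_c if n_c else float(y_t)
        xs.append(rank / len(order))
        ys.append(gain / len(order))
    return xs, ys

def qini_area(xs, ys):
    """Area between the curve and the straight line for a random ordering."""
    under_curve = sum(
        (xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2 for k in range(1, len(xs))
    )
    under_random = ys[-1] / 2  # triangle under the line from (0, 0) to (1, ys[-1])
    return under_curve - under_random

# Toy example (hypothetical): positive predictions favor the treated arm.
predicted = [0.30, 0.25, 0.20, 0.10, -0.10, -0.20, -0.25, -0.30]
treated = [1, 0, 1, 0, 1, 0, 1, 0]
outcome = [1, 0, 1, 1, 0, 1, 0, 1]

xs, ys = qini_curve(predicted, treated, outcome)
area = qini_area(xs, ys)
print(ys, area)
```

In this toy example the curve rises while the ranking reaches patients who do better in the treated arm and falls once it reaches patients who do better in the control arm, which gives a positive area relative to a random ordering.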
To facilitate comparison of patients predicted to benefit from use of a bougie to those predicted to benefit from use of a stylet, the validation cohort was divided into quartiles by the patients’ predicted individualized treatment effect. To evaluate whether the predicted treatment effect for individual patients modified the effect of trial group assignment on the primary outcome in the validation cohort, we used a logistic regression model to test for interaction between the individualized treatment effect value and the trial group assignment (bougie group vs. stylet group), with the primary outcome as the dependent variable.
For continuous variables, partial dependence plots were used to visually explore the average marginal effects of each variable on the predicted outcomes in the model. To illustrate the relative contributions of different baseline variables to the predicted treatment effects for individual patients, we selected two example patients: one who was predicted to benefit from use of a bougie and one who was predicted to benefit from use of a stylet. We sequentially evaluated the importance of each baseline variable in the model for each of the two example patients by comparing the predicted treatment effect for that individual with the predicted treatment effect when substituting the median value for a variable with all others held constant. The difference in individual treatment effect estimates corresponds to the relative contribution of that variable to the treatment effect for that individual patient. All analyses were performed using R version 3.6.3 (R Foundation for Statistical Computing) using the grf package for causal forest modeling and tools4uplift package for adjusted qini calculations. Additional explanation of the methods used in this analysis is available in the online supplement, and the annotated code used to develop the models described in this work is available at https://git.doit.wisc.edu/smph-public/dom/uw-icu-data-science-lab-public/causalforestite.
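The median-substitution step used for the two example patients can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the fitted model's prediction function, and the variable names, coefficients, and cohort values are invented for the example.

```python
from statistics import median

def variable_contributions(predict, patient, cohort):
    """Contribution of each variable to one patient's predicted treatment effect.

    For each variable, the patient's value is replaced with the cohort median
    while all other variables are held constant; the contribution is the
    original prediction minus the prediction after substitution.
    """
    baseline = predict(patient)
    contributions = {}
    for name in patient:
        substituted = dict(patient)
        substituted[name] = median(row[name] for row in cohort)
        contributions[name] = baseline - predict(substituted)
    return contributions

def toy_predict(p):
    """Hypothetical stand-in for a fitted model's individualized treatment effect."""
    return 0.004 * (p["prior_bougie_intubations"] - 10) - 0.01 * abs(p["bmi"] - 26)

# Made-up cohort and patient.
cohort = [
    {"prior_bougie_intubations": 5, "bmi": 22},
    {"prior_bougie_intubations": 10, "bmi": 26},
    {"prior_bougie_intubations": 20, "bmi": 31},
]
patient = {"prior_bougie_intubations": 30, "bmi": 40}

contributions = variable_contributions(toy_predict, patient, cohort)
print(contributions)
```

A positive contribution means the patient's actual value moves the prediction toward one treatment relative to a patient with the median value, and a negative contribution moves it toward the other.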
Results
Baseline Characteristics
The BOUGIE trial enrolled patients from April 29, 2019 to February 14, 2021, and the enrollment date of the median patient was Day 245. Data from the 558 patients (50.6%) enrolled up to this date were included in the training cohort. The remaining 544 (49.4%) were included in the validation cohort (Table 1; see Table E1 in the online supplement). Of all patients in the analysis, 59% were men, the median age was 58 years (interquartile range [IQR], 43–68 yr), and the median body mass index (BMI) was 26.4 (IQR, 22.7–31.3). As anticipated with two temporally distinct cohorts within a trial, the groups differed in some baseline characteristics (Table 1).
Table 1.
Baseline Characteristics of Patients in the Training Cohort and the Validation Cohort
Characteristic | Training Cohort (n = 558) | Validation Cohort (n = 544) | P Value* |
---|---|---|---|
Preintubation patient characteristics | |||
Age, yr | 58 (43–67) | 58 (44–69) | 0.32 |
Male | 314 (56.3) | 336 (61.8) | 0.06 |
Black | 160 (28.7) | 104 (19.1) | <0.001 |
BMI | 26.3 (22.5–31.0) | 26.8 (23.0–31.3) | 0.41 |
SpO2 | 100 (98–100) | 100 (97–100) | 0.66 |
FiO2 | 0.65 (0.30–0.80) | 0.40 (0.21–0.70) | <0.001 |
SBP | 129 (110–149) | 133 (114–154) | 0.014 |
Receiving vasopressors | 74 (13.3) | 64 (11.8) | 0.45 |
Glasgow Coma Scale score | 11 (7–14) | 10 (7–14) | 0.73 |
APACHE II score | 17 (12–23) | 17 (12–22) | 0.79 |
Primary diagnosis of trauma | 120 (21.5) | 76 (14.0) | 0.001 |
Active medical conditions | |||
Gastrointestinal tract hemorrhage | 50 (9.0) | 49 (9.0) | 0.98 |
Acute encephalopathy | 403 (72.2) | 362 (66.5) | 0.04 |
Hypoxemic respiratory failure | 244 (43.7) | 176 (32.4) | <0.001 |
Hypercarbic respiratory failure | 77 (13.8) | 67 (12.3) | 0.47 |
Characteristics of the intubation procedure | |||
Difficult airway characteristics | |||
Obesity† | 68 (12.2) | 54 (9.9) | 0.23 |
Other difficult airway characteristic‡ | 91 (16.3) | 78 (14.3) | 0.36 |
Preoxygenation method | |||
Standard nasal cannula | 236 (42.3) | 150 (27.6) | <0.001 |
High-flow nasal cannula | 78 (14.0) | 52 (9.6) | 0.02 |
Nonrebreather mask | 310 (55.6) | 267 (49.1) | 0.03 |
Bag mask, no ventilation | 31 (5.6) | 54 (9.9) | 0.01 |
Bag mask, ventilation | 61 (10.9) | 70 (12.9) | 0.32 |
Etomidate given for induction | 350 (62.7) | 408 (75.0) | <0.001 |
Neuromuscular blockade with rocuronium | 291 (52.2) | 339 (62.3) | 0.001 |
Straight laryngoscope blade or missing | 129 (23.1) | 54 (9.9) | <0.001 |
Direct laryngoscopy on initial attempt | 129 (23.1) | 145 (26.7) | 0.18 |
Operator had critical care specialty training§ | 209 (37.5) | 176 (32.4) | 0.08 |
Prior intubations using a bougie | 10 (3–20) | 10 (5–20) | 0.002 |
Randomized to bougie | 283 (50.7) | 273 (50.2) | 0.86 |
Primary outcome: successful intubation on the first attempt | 465 (83.3) | 435 (80.0) | 0.15 |
Definition of abbreviations: APACHE = Acute Physiology and Chronic Health Evaluation; BMI = body mass index; SBP = systolic blood pressure; SpO2 = saturation of peripheral oxygen.
Data are presented as median (interquartile range) or n (%). Missing values: BMI, 65 (5.9%); APACHE II score, 4 (0.4%); SpO2 at induction, 41 (3.7%); SBP at induction, 41 (3.7%); highest FiO2 in prior hour, 25 (2.3%); Glasgow Coma Scale score, 11 (1.0%); bougie experience, 2 (0.2%).
*P values for differences between cohorts are from the chi-square test for categorical variables and the Wilcoxon rank-sum test for continuous variables.
†As listed in the electronic health record at baseline.
‡Other difficult airway characteristics include: vomiting; witnessed aspiration; upper gastrointestinal bleeding; epistaxis or oral bleeding; upper airway mass, infection, or trauma; head and neck radiation; limited neck mobility; limited mouth opening; history of obstructive sleep apnea; or other.
§Reference: emergency medicine, anesthesia, or other.
The operators performing intubation were primarily emergency medicine physicians (n = 693, 62.9%) or critical care physicians (n = 385, 34.9%) and had performed a median number of 60 prior intubations, of which a median of 10 (IQR, 4–20) had been performed using a bougie.
Individualized Treatment Effect Model
The causal forest model demonstrated that baseline covariates related to both patient and operator characteristics modified the effect of the use of a bougie versus a stylet on the outcome of successful intubation on the first attempt (Figure 1). In the model, the most important variables in determining the treatment effect of bougie versus stylet for individual patients were: the presence of difficult airway characteristics assessed before intubation, BMI, the severity of illness as assessed by the Acute Physiology and Chronic Health Evaluation (APACHE) II score, systolic blood pressure, and the operator’s prior number of intubations performed using a bougie.
Figure 1.
Variable importance plot. This figure displays the 10 most important causal forest model variables, as determined by the number of times a candidate partitioning variable was chosen to be in the first splits of a tree in the causal forest model. x-axis scale of importance was normalized to 100% for the most important variable. Difficult airway characteristics included any of the following: vomiting; witnessed aspiration; upper GI bleeding; epistaxis or oral bleeding; upper airway mass, infection, or trauma; head and neck radiation; limited neck mobility; limited mouth opening; history of obstructive sleep apnea; or other. APACHE = Acute Physiology and Chronic Health Evaluation II Score; BMI = body mass index; SBP = systolic blood pressure; SpO2 = saturation of peripheral oxygen.
Global model performance in the validation cohort demonstrated that the predicted individualized treatment effect for each patient significantly modified the effect of randomized treatment group assignment (bougie vs. stylet) on the outcome of successful intubation on the first attempt (P value for interaction = 0.02; Tables 2 and E2 and Figure 2). The qini plot of the model (Figure 3) demonstrated an initial increase in observed uplift (consistent with a beneficial effect of a bougie among patients predicted to benefit from use of a bougie) and a final steep decrease in observed uplift (consistent with a beneficial effect of a stylet among patients predicted to benefit most from use of a stylet), with an adjusted qini coefficient of 2.46, consistent with the model’s ability to discriminate treatment effects better than random chance. In addition to assessment of model discrimination via the adjusted qini coefficient, the adjusted coefficient of determination (adjusted R2) of the individual treatment effect percentile on the corresponding segmented uplift was 0.82, consistent with good calibration.
Table 2.
Randomization, Outcomes, Treatment Difference, and Baseline Characteristics by Quartiles of Individualized Treatment Effect Estimates in the Validation Cohort
Characteristic | Validation Cohort (n = 544) | Quartile 1 (n = 136) | Quartile 2 (n = 136) | Quartile 3 (n = 136) | Quartile 4 (n = 136) | P Value* |
---|---|---|---|---|---|---|
Preintubation patient characteristics | ||||||
Age, yr | 58 (44 to 69) | 60.50 (48 to 70) | 59 (46.75 to 69) | 57.50 (45 to 71) | 53.50 (41 to 64.50) | 0.047 |
Male | 336 (61.8) | 92 (67.6) | 87 (64.0) | 82 (60.3) | 75 (55.1) | 0.18 |
Black | 104 (19.1) | 34 (25.0) | 20 (14.7) | 21 (15.4) | 29 (21.3) | 0.10 |
BMI | 26.8 (23.0 to 31.3) | 25.6 (23.8 to 30.3) | 26.9 (23.5 to 30.0) | 25.8 (21.2 to 30.9) | 28.1 (22.3 to 37.8) | 0.053 |
SpO2 | 100 (97 to 100) | 100 (95.5 to 100) | 100 (97 to 100) | 99 (97 to 100) | 100 (97 to 100) | 0.52 |
FiO2 | 0.40 (0.21 to 0.70) | 0.66 (0.36 to 1.00) | 0.40 (0.21 to 0.66) | 0.33 (0.21 to 0.66) | 0.39 (0.21 to 0.66) | <0.001 |
SBP | 133 (114 to 154) | 139.50 (123 to 161) | 137.50 (118 to 157) | 133 (113 to 152) | 123 (104 to 141) | <0.001 |
Receiving vasopressors | 64 (11.8) | 20 (14.7) | 13 (9.6) | 15 (11.0) | 16 (11.8) | 0.61 |
Glasgow Coma Scale score | 10 (7 to 14) | 9 (4 to 13) | 10 (6 to 14) | 11 (7 to 14) | 11 (7 to 15) | 0.006 |
APACHE II score | 17 (12 to 22) | 19.50 (15 to 25.25) | 16 (12 to 20.50) | 15 (11 to 19) | 18 (12.50 to 20.50) | <0.001 |
Primary diagnosis of trauma | 76 (14.0) | 24 (17.6) | 11 (8.1) | 16 (11.8) | 25 (18.4) | 0.042 |
Active medical conditions | ||||||
Gastrointestinal tract hemorrhage | 49 (9.0) | 4 (2.9) | 13 (9.6) | 16 (11.8) | 16 (11.8) | 0.034 |
Acute encephalopathy | 362 (66.5) | 94 (69.1) | 91 (66.9) | 93 (68.4) | 84 (61.8) | 0.57 |
Hypoxemic respiratory failure | 176 (32.4) | 59 (43.4) | 34 (25.0) | 42 (30.9) | 41 (30.1) | 0.01 |
Hypercarbic respiratory failure | 67 (12.3) | 14 (10.3) | 14 (10.3) | 22 (16.2) | 17 (12.5) | 0.41 |
Characteristics of the intubation procedure | ||||||
Difficult airway characteristic | ||||||
Obesity† | 54 (9.9) | 6 (4.4) | 7 (5.1) | 9 (6.6) | 32 (23.5) | <0.001 |
Other difficult airway characteristic‡ | 78 (14.3) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 78 (57.4) | <0.001 |
Preoxygenation method | ||||||
Standard nasal cannula | 150 (27.6) | 41 (30.1) | 40 (29.4) | 30 (22.1) | 39 (28.7) | 0.42 |
High-flow nasal cannula | 52 (9.6) | 13 (9.6) | 12 (8.8) | 18 (13.2) | 9 (6.6) | 0.31 |
Nonrebreather mask | 267 (49.1) | 52 (38.2) | 67 (49.3) | 75 (55.1) | 73 (53.7) | 0.023 |
Bag mask, no ventilation | 54 (9.9) | 12 (8.8) | 20 (14.7) | 12 (8.8) | 10 (7.4) | 0.18 |
Bag mask, ventilation | 70 (12.9) | 21 (15.4) | 11 (8.1) | 16 (11.8) | 22 (16.2) | 0.17 |
Etomidate given for induction | 408 (75.0) | 101 (74.3) | 101 (74.3) | 103 (75.7) | 103 (75.7) | 0.98 |
Neuromuscular blockade with rocuronium | 339 (62.3) | 89 (65.4) | 87 (64.0) | 80 (58.8) | 83 (61.0) | 0.68 |
Straight laryngoscope blade or missing | 54 (9.9) | 22 (16.2) | 11 (8.1) | 4 (2.9) | 17 (12.5) | 0.002 |
Direct laryngoscopy on initial attempt | 145 (26.7) | 37 (27.2) | 38 (27.9) | 43 (31.6) | 27 (19.9) | 0.17 |
Operator had critical care specialty training§ | 176 (32.4) | 46 (33.8) | 45 (33.1) | 46 (33.8) | 39 (28.7) | 0.77 |
Prior intubations using a bougie | 10 (5 to 20) | 20 (12 to 36) | 10 (5 to 20) | 8 (3 to 15) | 5 (2 to 15) | <0.001 |
Randomized to bougie | 273 (50.2) | 62 (45.6) | 75 (55.1) | 63 (46.3) | 73 (53.7) | 0.27 |
Primary outcome: successful intubation on the first attempt, overall | 435 (80.0) | 119 (87.5) | 112 (82.4) | 109 (80.1) | 95 (69.9) | 0.003 |
In bougie group | 209 (76.6) | 56 (90.3) | 61 (81.3) | 48 (76.2) | 44 (60.3) | |
In stylet group | 226 (83.4) | 63 (85.1) | 51 (83.6) | 61 (83.6) | 51 (81.0) | |
Average treatment effect | ||||||
Difference in incidence of the primary outcome between bougie group and stylet group, % (95% CI)ǁ | −6.8 (−13.9 to 0.2) | 5.2 (−7.2 to 17.6) | −2.3 (−16.6 to 12.0) | −7.4 (−22.4 to 7.6) | −20.7 (−37.0 to −4.4) | 0.022¶ |
Definition of abbreviations: APACHE II = Acute Physiology and Chronic Health Evaluation II; BMI = body mass index; CI = confidence interval; SBP = systolic blood pressure; SpO2 = saturation of peripheral oxygen.
Data are presented as median (interquartile range) or n (%) unless otherwise noted. Missing values: BMI, 25 (4.6%); APACHE II score, 3 (0.6%); SpO2 at induction, 11 (2.0%); SBP at induction, 10 (1.8%); highest FiO2 in prior hour, 8 (1.5%); Glasgow Coma Scale score, 3 (0.6%).
*Testing for a difference across quartiles using Kruskal-Wallis and chi-square tests.
†As listed in the electronic health record at baseline.
‡Other difficult airway characteristics include: vomiting; witnessed aspiration; upper gastrointestinal bleeding; epistaxis or oral bleeding; upper airway mass, infection, or trauma; head and neck radiation; limited neck mobility; limited mouth opening; history of obstructive sleep apnea; or other.
§Reference: emergency medicine, anesthesia, or other.
ǁThe average treatment effect and 95% CI for each quartile are summary measures of the individualized treatment effects for all patients in the quartile. CIs for the individual treatment effect estimates for individual patients in the quartile could exclude 0.0 (no difference), even when the CIs for all patients in the quartile included 0.0 (no difference).
¶Likelihood ratio test P value for interaction term between predicted individualized treatment effect and treatment.
Figure 2.
Observed treatment effect in the validation cohort by predicted treatment effect quartile from a causal forest model. Patients in the validation cohort are grouped into quartiles by their individual predicted treatment effect from the causal forest model, ranging from the quartile predicted to most benefit from use of a bougie (Q1) to the quartile predicted to most benefit from use of a stylet (Q4). The observed average treatment effect, overall and in each quartile, is the difference in the incidence of the primary outcome (successful intubation on the first attempt) between the bougie group and the stylet group. Bars indicate 95% confidence intervals. The interaction between predicted treatment effect quartile and the effect of trial group assignment on the primary outcome was significant (P = 0.02).
Figure 3.
Qini plot. This figure depicts the discrimination of the causal forest model in the validation cohort. The difference between the solid line (bougie vs. stylet selected for patients based on predicted individualized treatment effect from the model) versus the dotted line (bougie vs. stylet selected randomly) demonstrates the uplift gain, defined as the difference between the areas under the curve plotted by the model-based targeting and random targeting. Consistent with the high discrimination of the model, the qini curve first increases (showing that the patients for whom the model predicted the largest treatment effect with use of a bougie experienced the largest benefit from use of a bougie), then plateaus (as the population begins to include patients with similar outcomes with either bougie or stylet), and finally decreases (showing that the patients for whom the model predicted the largest treatment effect with use of a stylet experienced the largest benefit from use of a stylet). Adj = adjusted.
In the quartile of patients most likely to benefit from use of a bougie rather than a stylet, successful intubation on the first attempt occurred in 90.3% of patients randomized to the bougie group and 85.1% of patients randomized to the stylet group, for an absolute difference of 5.2 percentage points (95% confidence interval [CI], −7.2 to 17.6). Patients in this quartile had higher APACHE scores, higher FiO2, and lower Glasgow Coma Scale scores, and were intubated by clinicians with more prior experience using a bougie, compared with patients in the other quartiles. In the quartile of patients most likely to benefit from use of a stylet rather than a bougie, successful intubation on the first attempt occurred in 60.3% of patients randomized to the bougie group and 81.0% of patients randomized to the stylet group, for an absolute difference of −20.7 percentage points (95% CI, −37.0 to −4.4). Patients in this quartile were younger, had higher BMI, were more likely to have difficult airway characteristics on prospective assessment, and were intubated by operators with less prior experience using a bougie.
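The absolute differences quoted above can be recomputed from the counts in Table 2. The text does not state how the confidence intervals were calculated; the sketch below assumes a normal-approximation interval for a difference in proportions with a continuity correction, which is an assumption that happens to reproduce the reported values for these counts. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def risk_difference(success_a, n_a, success_b, n_b, level=0.95):
    """Difference in proportions (a minus b) with a continuity-corrected normal CI."""
    p_a, p_b = success_a / n_a, success_b / n_b
    diff = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    correction = 0.5 * (1 / n_a + 1 / n_b)  # continuity correction (assumed method)
    half_width = z * se + correction
    return diff, diff - half_width, diff + half_width

# Counts from Table 2: bougie group vs. stylet group.
q1 = risk_difference(56, 62, 63, 74)  # quartile 1
q4 = risk_difference(44, 73, 51, 63)  # quartile 4

print([round(100 * v, 1) for v in q1])  # [5.2, -7.2, 17.6]
print([round(100 * v, 1) for v in q4])  # [-20.7, -37.0, -4.4]
```

The interval for quartile 1 spans zero, whereas the interval for quartile 4 lies entirely below zero, in line with the values reported in Table 2.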
Figure 4 depicts the partial dependence plots for the four most important continuous variables in the model, demonstrating the average marginal effect of each variable at different values. Patients undergoing intubation by an operator with more prior experience intubating with a bougie had higher predicted benefit from use of a bougie, as did patients with higher systolic blood pressures and higher APACHE scores. In contrast, patients with very low or very high values for BMI had greater predicted benefit from use of a stylet.
Figure 4.
Partial dependence plot. This figure depicts the change in predicted benefit over ranges for the most important continuous predictors. The x-axis shows the change in the variable of interest, and the y-axis shows the direction of benefit. Bootstrapped 95% confidence intervals are shown in gray. APACHE = Acute Physiology and Chronic Health Evaluation II score; BMI = body mass index; ITE = individualized treatment effect; SBP = systolic blood pressure.
Figure 5 displays the output from the individualized treatment effect model for two patients in the validation cohort, one for whom the model predicts a treatment benefit from use of a bougie and one for whom the model predicts treatment benefit from use of a stylet. This figure highlights how different variables can impact the predicted benefit of the interventions at the patient level based on their individual characteristics.
Figure 5.
Individual patient examples. This figure depicts the influence of individual variables on the predicted treatment effect for bougie versus stylet for two individual patients in the validation cohort. For each variable on the y-axis, the value for that patient is presented compared with the median value for the cohort in parentheses. The x-axis shows the model’s predicted treatment effect with bougie versus stylet for the patient, with 0.0 representing no difference in successful intubation on the first attempt between use of a bougie and use of a stylet. Blue arrows signify variables that make benefit from use of a bougie more likely, and red arrows signify variables that make benefit from use of a stylet more likely. (A) Information for a patient whose predicted individualized treatment effect (ITE) of 0.124 was consistent with a 12.4% absolute increase in the incidence of successful intubation on the first attempt in the bougie group compared with the stylet group. (B) Information for a patient whose ITE of −0.084 was consistent with an 8.4% absolute decrease in the incidence of successful intubation on the first attempt in the bougie group compared with the stylet group. APACHE II = Acute Physiology and Chronic Health Evaluation II.
Discussion
This secondary analysis of the BOUGIE trial found that causal forests identified treatment effects for individual patients that ranged from benefit from use of a bougie to benefit from use of a stylet, despite no average treatment effect in the overall trial and no treatment effect in any univariate subgroup analysis prespecified in the trial. These findings were validated in a temporally distinct cohort in the trial, driven by clinically relevant effect modifiers, and identified differences in treatment response of a similar magnitude to those the original trial aimed to detect. Furthermore, these results exemplify how machine learning approaches to estimating individualized treatment effect in clinical trials may provide evidence to personalize care, even in trials with no difference between groups in the average treatment effect. Our findings have potential implications for methods of estimating individual treatment effect in future clinical trials and assessing the clinical effects of bougie versus stylet during tracheal intubation.
Traditional approaches to examining heterogeneity of treatment effect in randomized clinical trials involve subgroup analyses that examine each potential effect modifier one at a time or use risk scores to analyze whether patients’ baseline risk of experiencing an outcome modifies the effect of treatment (13). Our machine learning approach using a causal forest for individualized treatment effect overcomes important limitations of these traditional approaches. First, the causal forest method can, by its design, incorporate interactions between multiple covariates and nonlinear relationships, which traditional subgroups and risk-based heterogeneity of treatment effect analyses cannot. Our results show why this is important: many of the same prespecified variables that did not demonstrate effect modification when analyzed one at a time in the original subgroup analyses (e.g., difficult airway characteristics, operator experience) were found to have higher-order interactions in the causal forest method that resulted in clinically important effect modification. Second, the ability to perform a single statistical test for interaction between the predicted individualized treatment effect and outcome in a separate test cohort likely reduces the risk of spurious findings from multiple testing associated with traditional subgroup analyses, which are often performed using the entire trial without a separate validation cohort to confirm the results. Third, the ability to develop the individualized treatment effect models in a training cohort and validate them in a distinct validation cohort reduces the risk of overfitting and increases the likelihood of the findings reproducing in future patient populations. 
In our approach, differences between the temporally distinct training and validation cohorts disadvantage the model’s ability to provide accurate predictions in the validation cohort, providing a more rigorous test of this methodology than random sampling of cohorts would. Fourth, presenting information on patients stratified by their individual treatment effect estimates can provide a comprehensive picture of how patients who benefit from one treatment versus another differ, in a way that is not common in traditional one-at-a-time subgroup analyses. Fifth, analyses of clinical trials using causal forests or other methods may provide an approach to deriving and validating estimates of individual treatment effect that are both evidence based (derived from randomized trial data) and personalized (treatment effect estimates based on the characteristics of individual patients). After prospective validation, such models could be developed into clinical decision support tools and incorporated into the electronic health record to guide evidence-based personalized treatment decisions at the point of care (Figure E1).
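The stratified presentation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code used for this study: the sixteen records are invented, the field names (`predicted_ite`, `group`, `success`) and both helper functions are hypothetical, and the predicted individualized treatment effects are taken as given rather than produced by a causal forest. The sketch sorts validation-cohort patients by predicted effect, cuts them into four equal-size bins, and reports the observed difference in first-attempt success (bougie minus stylet) within each bin.

```python
from statistics import mean


def risk_difference(rows):
    """Observed first-attempt success rate with bougie minus the rate with stylet."""
    bougie = [r["success"] for r in rows if r["group"] == "bougie"]
    stylet = [r["success"] for r in rows if r["group"] == "stylet"]
    return mean(bougie) - mean(stylet)


def by_ite_quartile(rows):
    """Sort patients by predicted ITE, cut into four equal-size bins, and
    return the observed risk difference in each bin (lowest ITE first)."""
    ordered = sorted(rows, key=lambda r: r["predicted_ite"])
    n = len(ordered)
    bounds = [round(i * n / 4) for i in range(5)]
    return [risk_difference(ordered[bounds[i]:bounds[i + 1]]) for i in range(4)]


# Hypothetical validation-cohort records: (predicted ITE, randomized group,
# success on first attempt). These values are invented for illustration only.
raw = [
    (-0.08, "bougie", 0), (-0.07, "bougie", 1), (-0.06, "stylet", 1), (-0.05, "stylet", 1),
    (-0.02, "bougie", 1), (-0.01, "bougie", 0), (0.00, "stylet", 1), (0.01, "stylet", 0),
    (0.03, "bougie", 1), (0.04, "bougie", 1), (0.05, "stylet", 1), (0.06, "stylet", 0),
    (0.09, "bougie", 1), (0.10, "bougie", 1), (0.11, "stylet", 0), (0.12, "stylet", 0),
]
validation = [dict(predicted_ite=i, group=g, success=s) for i, g, s in raw]

quartile_rd = by_ite_quartile(validation)
```

If the predictions carry real information, the observed risk difference should rise from the lowest to the highest quartile of predicted effect, as it does in this toy data set. A formal assessment would still rest on the single test for interaction in the validation cohort rather than on visual inspection of the quartiles.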
Although the results of our secondary analysis are hypothesis generating and should not be directly used to inform clinical practice, the variables found to identify patients likely to benefit from intubation with a bougie versus a stylet have clinical relevance. The presence of difficult airway characteristics was found to be the most important variable in determining the treatment effect of bougie versus stylet for individual patients. The prespecified one-at-a-time subgroup analyses of the original BOUGIE trial used a composite of difficult airway characteristics that included obesity and airway features only visible on laryngoscopy, such as a glottic view obscured by body fluids. This subgroup of patients in the original BOUGIE trial showed no benefit to the use of a bougie, with an adjusted odds ratio of 0.95 (95% CI, 0.56–1.62; P value for interaction = 0.85). Our current causal forest analysis, however, specified covariates a priori and found that BMI and the prospectively assessed difficult airway characteristics other than obesity were each independently important in the predictive model. These characteristics may each have a relationship with the use of a bougie versus a stylet and successful intubation on the first attempt that was not discernible when they were combined in the prespecified subgroup analysis of the trial, where the subgroups were specified using a single variable. Moreover, these characteristics may be most predictive when considered in the context of other covariates, and these relationships were likely captured by our multivariable model.
Similarly, the prespecified one-at-a-time subgroup analysis of BOUGIE did not find that the operator’s prior number of intubations with a bougie modified the effect of trial group assignment on the primary outcome (P value for interaction = 0.5). The causal forest model used here, however, identified that the operator’s prior experience with a bougie did contribute important information about the relationship between bougie versus stylet and successful intubation on the first attempt when allowed to interact with other variables, such as the presence of difficult airway characteristics, systolic blood pressure, and BMI. Furthermore, we observed that patients in the quartile most likely to benefit from use of a stylet more often had difficult airway characteristics and higher BMI, which are primarily anatomical considerations. Patients in the quartile most likely to benefit from use of a bougie, in contrast, appeared to have higher severity of illness and higher FiO2 requirement, which are primarily physiological considerations. Before any clinical application of these findings is possible, estimates generated from this model need to be studied in clinical trials to determine whether prospective use of such estimates improves clinical outcomes in a new, external cohort.
This study has several limitations. First, the causal forest models derived and validated here are complex models based on ensemble machine learning decisions that resist simplification. However, we provide a method for explaining the model predictions for individual patients and present our results by quartile of risk to increase the interpretability of our findings. Second, although our analysis included a temporally distinct validation cohort, external validation in a separate randomized clinical trial of bougie versus stylet would provide a higher level of confidence in the model’s individual treatment effect estimates. Trials examining the effect on outcomes of using predicted individualized treatment effects to guide treatment decisions would be required before such models could be used to inform care for patients. Third, our model was limited by data availability, and unmeasured baseline factors could potentially predict differences in outcomes between these two interventions better than the data available in this study. The variables used in this model, however, did allow for prediction of differences in outcomes, and better predictors, when identified, would improve model performance in future studies. Finally, the BOUGIE trial was conducted in academic medical centers in the United States, and the results may not generalize to settings or operators substantively different from those in the study cohort.
Conclusions
In this secondary analysis of a multicenter randomized trial with no significant difference in average treatment effect, a causal forest machine learning algorithm identified patients for whom interactions between patient characteristics and operator characteristics appeared to result in benefit from use of a bougie rather than a stylet and from use of a stylet rather than a bougie. Future research should advance approaches for applying machine learning methods to trials to derive, validate, evaluate, and implement evidence-based estimates of individual treatment effect.
Footnotes
Supported by NIH grants T32HL087738 (K.P.S.), K23HL153584 (J.D.C.), T32HL007605 (K.G.B.), and UL1 RR024975 (T.W.R.); and NHLBI grants K23HL143053 (M.W.S.) and R01HL157262 (M.M.C.).
Author Contributions: Study concept and design: K.P.S., A.B.S., J.D.C., K.G.B., E.T.Q., M.W.S., and M.M.C. Acquisition of data: J.D.C., B.E.D., W.H.S., A.A.G., S.A.T., S.G., L.M.S., D.B.P., D.J.V., J.R.W., A.M.J., K.C.D., C.G.H., M.R.W., M.E.P., T.W.R., and M.W.S. Analysis and interpretation of data: K.P.S., A.B.S., J.D.C., K.G.B., E.T.Q., E.J.G.L., P.S., M.W.S., and M.M.C. Drafting of the manuscript: K.P.S., A.B.S., M.W.S., and M.M.C. Critical revision of the manuscript for important intellectual content: K.P.S., A.B.S., J.D.C., K.G.B., E.T.Q., E.J.G.L., B.E.D., W.H.S., A.A.G., S.A.T., S.G., L.M.S., D.B.P., D.J.V., J.R.W., A.M.J., K.C.D., C.G.H., M.R.W., M.E.P., T.W.R., P.S., M.W.S., and M.M.C. Statistical analysis: K.P.S., A.B.S., E.J.G.L., M.W.S., and M.M.C. Study supervision: J.D.C., T.W.R., M.W.S., and M.M.C. K.P.S., A.B.S., M.W.S., and M.M.C. had full access to all the data and take responsibility for the integrity of the data and the accuracy of the data analysis. A.B.S., E.J.G.L., and M.M.C. conducted and are responsible for the data analysis.
This article has an online supplement, which is accessible from this issue’s table of contents at www.atsjournals.org.
Originally Published in Press as DOI: 10.1164/rccm.202209-1799OC on March 6, 2023
Author disclosures are available with the text of this article at www.atsjournals.org.
References
- 1. Russotto V, Myatra SN, Laffey JG, Tassistro E, Antolini L, Bauer P, et al.; INTUBE Study Investigators. Intubation practices and adverse peri-intubation events in critically ill patients from 29 countries. JAMA. 2021;325:1164–1172. doi: 10.1001/jama.2021.1727.
- 2. Mort TC. Emergency tracheal intubation: complications associated with repeated laryngoscopic attempts. Anesth Analg. 2004;99:607–613. doi: 10.1213/01.ANE.0000122825.04923.15.
- 3. Sakles JC, Chiu S, Mosier J, Walker C, Stolz U. The importance of first pass success when performing orotracheal intubation in the emergency department. Acad Emerg Med. 2013;20:71–78. doi: 10.1111/acem.12055.
- 4. Hasegawa K, Shigemitsu K, Hagiwara Y, Chiba T, Watase H, Brown CA III, et al.; Japanese Emergency Medicine Research Alliance Investigators. Association between repeated intubation attempts and adverse events in emergency departments: an analysis of a multicenter prospective observational study. Ann Emerg Med. 2012;60:749–754.e2. doi: 10.1016/j.annemergmed.2012.04.005.
- 5. Driver BE, Semler MW, Self WH, Ginde AA, Trent SA, Gandotra S, et al.; BOUGIE Investigators and the Pragmatic Critical Care Research Group. Effect of use of a bougie vs endotracheal tube with stylet on successful intubation on the first attempt among critically ill patients undergoing tracheal intubation: a randomized clinical trial. JAMA. 2021;326:2488–2497. doi: 10.1001/jama.2021.22002.
- 6. Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in medicine: reporting of subgroup analyses in clinical trials. N Engl J Med. 2007;357:2189–2194. doi: 10.1056/NEJMsr077003.
- 7. Kent DM, Paulus JK, van Klaveren D, D’Agostino R, Goodman S, Hayward R, et al. The Predictive Approaches to Treatment effect Heterogeneity (PATH) statement. Ann Intern Med. 2020;172:35–45. doi: 10.7326/M18-3667.
- 8. Spicer A, Semler MW, Casey JD, Driver B, Prekker ME, Wang L, et al. Individualized treatment effects of bougie vs stylet for tracheal intubation [abstract]. Am J Respir Crit Care Med. 2022;205:A5784. doi: 10.1164/rccm.202209-1799OC.
- 9. Athey S, Wager S. Estimating treatment effects with causal forests: an application. 2019. https://arxiv.org/abs/1902.07409
- 10. Sinha P, Spicer A, Delucchi KL, McAuley DF, Calfee CS, Churpek MM. Comparison of machine learning clustering algorithms for detecting heterogeneity of treatment effect in acute respiratory distress syndrome: a secondary analysis of three randomised controlled trials. EBioMedicine. 2021;74:103697. doi: 10.1016/j.ebiom.2021.103697.
- 11. Radcliffe N. Using control groups to target on predicted lift: building and assessing uplift model. Direct Market J Direct Market Assoc Anal Council. 2007;1:14–21.
- 12. Devriendt F, Moldovan D, Verbeke W. A literature survey and experimental evaluation of the state-of-the-art in uplift modeling: a stepping stone toward the development of prescriptive analytics. Big Data. 2018;6:13–41. doi: 10.1089/big.2017.0104.
- 13. Iwashyna TJ, Burke JF, Sussman JB, Prescott HC, Hayward RA, Angus DC. Implications of heterogeneity of treatment effect for reporting and analysis of randomized trials in critical care. Am J Respir Crit Care Med. 2015;192:1045–1051. doi: 10.1164/rccm.201411-2125CP.