Abstract
Background/Objective:
Assessing prognosis for acetaminophen-induced acute liver failure (APAP-ALF) patients during the first week of hospitalization often presents significant challenges. Current models such as the King's College Criteria (KCC) and the Acute Liver Failure Study Group (ALFSG) Prognostic Index were developed to predict outcome using data from only a single time point, typically hospital admission. Models using longitudinal data are not currently available for APAP-ALF patients. We aim to develop and compare the performance of prediction models for outcomes during the first week of hospitalization for APAP-ALF patients.
Methods:
Models are developed using ALFSG registry data to predict longitudinal outcomes for 1042 APAP-ALF patients enrolled 01/1998–02/2016. The primary outcome is defined as daily low versus high coma grade. Accuracy in prediction of outcome (AC), sensitivity (SN), specificity (SP) and area under the receiver operating curve (AUC) are compared between the following models: classification and regression tree, random forest, frequentist generalized linear mixed model (GLMM), Bayesian GLMM, BiMM tree, and BiMM forest, using both original and imputed datasets.
Results:
BiMM tree achieves test-set AC of 63%, SP of 72% and SN of 53% for the original dataset, whereas BiMM forest achieves test-set AC of 69%, SP of 63% and SN of 74% for the imputed dataset. BiMM tree has the highest AUC for the original test dataset (0.697), whereas BiMM forest and standard random forest have the highest AUC for the imputed test dataset (0.749). The three most important predictors of daily outcome for the BiMM tree are pressor use, bilirubin and creatinine. The BiMM forest model identifies lactate, ammonia and ALT as the three most important predictors of outcome.
Conclusions:
BiMM tree offers a prognostic tool for APAP-ALF patients with good accuracy and a simple interpretation of predictors that is consistent with clinical observations. BiMM tree and forest models are developed using the first week of in-patient data and are appropriate for predicting outcome over time. While the BiMM forest has slightly higher predictive AC, the BiMM tree model is simpler to use at the bedside.
Keywords: acute liver failure, acetaminophen hepatotoxicity, fulminant hepatic failure, decision tree, random forest
1. Introduction
Acetaminophen (APAP) is the most common cause of acute liver failure (ALF) in Europe and North America [1, 2]. Injury and recovery follow a hyper-acute pattern, in which maximum hepatocyte destruction is complete by 72 hours following a one-time ingestion, with potential recovery equally swift. Despite reasonable post-transplant outcomes, liver transplantation (LT) for acetaminophen-induced acute liver failure (APAP-ALF) often presents significant challenges in management due to the rapidity and severity of illness, the potential for recovery without LT and the presence of complex psychosocial issues in most patients [3, 4]. Data from the NIH-funded Acute Liver Failure Study Group (ALFSG) show that approximately 25% of APAP patients are listed for LT and less than 10% receive LT [5]. Current data suggest that APAP recovery for many patients is determined by 3–4 days following onset of illness [6]. With advances in intensive care unit (ICU) management such as continuous renal replacement therapy (RRT) and neuroprotective strategies, many patients who would otherwise have succumbed may remain alive for longer periods well beyond the initial insult [7, 8].
Several prognosis models are available for predicting survival in ALF, but few are developed using daily measures of outcome with post-admission data. Common predictive models include King’s College Criteria (KCC), Acute Liver Failure Study Group Prognostic Index (ALFSG-PI), decision trees by Speiser and colleagues, and ALF early dynamic (ALFED) model. While the KCC [9] has been validated on admission, prediction of outcome at later time points appears less accurate [10] when hepatic dysfunction would be characterized primarily by immunosuppression rather than multi-organ failure [8]. Numerous studies have shown relatively poor sensitivity of the KCC APAP criteria, ranging between 25% and 76%, meaning that many patients who did not meet criteria had poor outcomes during the incident hospitalization [2, 11–13]. Conversely, low specificity implies that some patients may have a good outcome despite meeting KCC and potentially could undergo unnecessary LT [12, 14]. Aside from KCC, the ALFSG-PI [15] has been evaluated for predicting 21-day transplant-free survival at admission and post-admission time points. However, a limitation of the ALFSG-PI is that only admission data are used to develop the model and it only uses information from a specific day rather than longitudinal data. Speiser et al. [16] provide post-admission prognosis models using decision tree methodology, but these are developed using summary statistics from post-admission data rather than including data from each day. The use of summary statistics across multiple days of data for patients may have resulted in a loss of information, so models may not achieve optimal accuracy. Kumar and colleagues provide a prediction model called ALFED which incorporates dynamic early changes in laboratory and clinical variables, but it was developed using data from non-APAP etiologies of ALF and may not predict well for APAP-ALF patients [17].
The primary aim of this study is to develop and compare performance of prediction models for outcomes during the first week of hospitalization for APAP-ALF patients. We use novel binary mixed model (BiMM) tree and BiMM forest methodologies to determine prognosis for use at admission (early) and post-admission (days 2–7) in APAP-ALF patients. BiMM tree [18] provides a decision tree framework for developing prediction models for longitudinal outcomes using binary splits on variables which can be read like a flow chart. Decision trees are popular in diverse medical fields [19, 20], and BiMM tree models offer an intuitive method for predicting longitudinal measures of outcome, using processes familiar to clinicians (e.g. "high" versus "low" values of a predictor). Though decision tree methods such as BiMM tree provide a simple, intuitive method for obtaining predictions, accuracy can often be improved using an ensemble, or collection, of decision trees (e.g. random forest [21]). Therefore, we also employ BiMM forest [22], a random forest method for developing prediction models for longitudinal binary outcomes. For comparison, we also develop standard models for analyzing longitudinal binary outcomes: generalized linear mixed models (GLMMs), implemented in both frequentist and Bayesian paradigms. Though standard decision tree and random forest methodology do not account for longitudinal data, we develop these methods for comparison as well. We hypothesize that BiMM tree and forest models will have similar or modestly higher predictive accuracy, sensitivity, and specificity compared to traditional GLMM, decision tree and random forest methodology. We expect BiMM tree and forest methods to perform better than GLMMs because these frameworks make fewer assumptions about the form of the data (e.g. they accommodate nonlinear relationships and interactions among predictor variables), and are therefore more robust to challenging datasets such as the ALFSG registry.
2. Materials and Methods
2.1. Study Design
Data from 1042 APAP-ALF patients enrolled within the ALFSG database from January 1998 to February 2016 (25 sites overall, 14 currently active; see acknowledgements) are used in this retrospective cohort study. The authors’ Institutional Review Board (IRB)/Health research ethics boards of all enrolling US ALFSG sites have approved all research and all clinical investigation has been conducted according to the principles expressed in the Declaration of Helsinki. Consent/assent is obtained from all patients/their next of kin for collection of data in the US ALFSG registry. Patient records are anonymized and de-identified prior to use in this analysis.
2.2. Participants
ALFSG registry eligibility criteria include: a) hepatic encephalopathy of any degree; b) evidence of moderately severe coagulopathy (international normalized ratio (INR) greater than or equal to 1.5); c) presumed acute illness onset of less than 26 weeks; and d) no cirrhosis [23]. For this study, only patients within the ALFSG registry with primary diagnoses of APAP determined by the site investigator are eligible.
2.3. Operational Definitions
Hepatic encephalopathy (HE) grade is defined using the West Haven Criteria (summarized); grade 1: any alteration in mentation; grade 2: somnolent or obtunded but easily rousable, or presence of asterixis; grade 3: rousable with difficulty; and grade 4: unresponsive to deep pain [24]. In this study we define 'low coma grade' as grade 1 or 2 and 'high coma grade' as grade 3 or 4. For evaluating the predictive performance of the models, specificity is the proportion of correctly predicted poor outcomes and sensitivity is the proportion of correctly predicted good outcomes.
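These operational definitions can be made concrete with a short sketch. The study's analyses used SAS and R; the Python below is only an illustration of the dichotomization and of the paper's (reversed relative to convention) sensitivity/specificity definitions, and the function names are ours:

```python
def coma_outcome(he_grade: int) -> str:
    """Dichotomize West Haven HE grade as defined in the study:
    grades 1-2 are 'low' coma grade (good outcome),
    grades 3-4 are 'high' coma grade (poor outcome)."""
    if he_grade not in (1, 2, 3, 4):
        raise ValueError("HE grade must be 1, 2, 3, or 4")
    return "low" if he_grade <= 2 else "high"

def sensitivity_specificity(truth, predicted):
    """Under the study's convention: sensitivity = proportion of correctly
    predicted good outcomes (low coma grade); specificity = proportion of
    correctly predicted poor outcomes (high coma grade)."""
    good = [(t, p) for t, p in zip(truth, predicted) if t == "low"]
    poor = [(t, p) for t, p in zip(truth, predicted) if t == "high"]
    sensitivity = sum(t == p for t, p in good) / len(good)
    specificity = sum(t == p for t, p in poor) / len(poor)
    return sensitivity, specificity
```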
2.4. Variables
The primary outcome of interest is binary: low coma grade versus high coma grade, which is collected daily for the first seven days following study admission until patients die, receive a LT, or are discharged/transferred from the hospital. We define ‘good outcome’ as low coma grade, and ‘poor outcome’ as high coma grade. We consider several variables collected at one-time point, as well as daily variables, for developing prediction models. Variables collected only on admission include gender, ethnicity and age. Daily variables considered for prediction modeling include AST, ALT, phosphate, lactate, platelets, bilirubin, ammonia, creatinine, INR, pressor use, and RRT.
2.5. Methods for Developing Prediction Models for Longitudinal Data
A commonly used methodology for developing prediction models in the setting of longitudinal data is the GLMM. For binary outcomes, GLMMs have the form

$$\text{logit}\{\Pr(Y_{it} = 1)\} = X_{it}\beta + Z_{it}b_{it},$$

where $Y_{it}$ is the binary outcome for cluster $i = 1,\dots,M$ at longitudinal measurement $t = 1,\dots,T_i$, logit() is the logistic link function, $X_{it}$ is a matrix of fixed covariates for cluster $i$ at longitudinal measurement $t$, $\beta$ is a vector of fitted coefficients for the fixed covariates, $Z_{it}$ is the clustered covariate for cluster $i$ at longitudinal measurement $t$, and $b_{it}$ is the fitted random effect for cluster $i$ at longitudinal measurement $t$. In words, the predictors are partitioned into fixed and random effects which are linearly related to the binary outcome through the logistic link function. In this case, the fixed predictors would consist of all predictors to be used to model the outcome (e.g. demographics, clinical characteristics, and laboratory values), and the random effect adjusts for the dependency of longitudinal outcomes for each patient. GLMMs can be implemented in both frequentist and Bayesian settings.
Because GLMMs require assumptions which are not always valid (e.g. linear association between the predictors and the outcome through the link function, and interactions must be specified by the user), we developed a method called BiMM tree [18] which can handle nonlinear predictors of outcome and naturally models interactions based on the data. Details of the BiMM tree algorithm are described elsewhere [18]. To summarize, BiMM tree is an algorithm in which a decision tree (namely, classification and regression tree (CART)) is developed and indicator variables for similar groupings of observations (i.e. the terminal nodes) are then used in a Bayesian GLMM to adjust for longitudinal outcomes. This portion of the algorithm takes the form

$$\text{logit}\{\Pr(Y_{it} = 1)\} = \text{CART}(X_{it})\beta + Z_{it}b_{it},$$

where $\text{CART}(X_{it})$ is represented within the GLMM as a matrix of indicator variables reflecting membership of each longitudinal observation $t = 1,\dots,T_i$ for cluster $i = 1,\dots,M$ in terminal nodes within the CART model.
Because the accuracy and stability of decision tree methods may often be improved with an ensemble framework, we then extended the BiMM tree method into the random forest framework. Details of the BiMM forest algorithm are described elsewhere [22]. Similar to BiMM tree, BiMM forest uses an algorithm in which a random forest is developed, then the predicted probability of each observation from the random forest is used within a Bayesian GLMM to adjust for longitudinal outcomes. This portion of the algorithm takes the form

$$\text{logit}\{\Pr(Y_{it} = 1)\} = \beta_0 + \beta_1\,\text{RF}(X_{it}) + Z_{it}b_{it},$$

where $\text{RF}(X_{it})$ is represented within the GLMM as the predicted probability from the random forest for each longitudinal observation $t = 1,\dots,T_i$ for cluster $i = 1,\dots,M$, $\beta_0$ is the intercept coefficient, and $\beta_1$ is the coefficient for the vector of random forest probabilities, $\text{RF}(X_{it})$.
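The two-step structure of a BiMM forest iteration can be sketched on simulated data. This is a simplified illustration, not the published algorithm: it substitutes an ordinary logistic regression with per-patient indicator variables for the Bayesian GLMM random intercept, and it omits the iteration and posterior updating described in [22]:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy longitudinal data: 50 patients with 7 daily observations each.
n_pat, n_days = 50, 7
patient = np.repeat(np.arange(n_pat), n_days)
X = rng.normal(size=(n_pat * n_days, 3))      # daily predictors
b = rng.normal(size=n_pat)                    # true per-patient effect
logits = X[:, 0] - X[:, 1] + b[patient]
y = (rng.uniform(size=len(logits)) < 1 / (1 + np.exp(-logits))).astype(int)

# Step 1: fit a random forest to the pooled observations (ignores clustering).
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
p_rf = rf.predict_proba(X)[:, 1]

# Step 2: regress the outcome on the forest probability plus patient
# indicators; the indicator coefficients stand in for the random intercepts
# (the real method fits a Bayesian GLMM here instead).
Z = np.zeros((len(y), n_pat))
Z[np.arange(len(y)), patient] = 1.0
design = np.column_stack([p_rf, Z])
glmm_like = LogisticRegression(max_iter=1000).fit(design, y)

p_adj = glmm_like.predict_proba(design)[:, 1]
print(f"in-sample accuracy: {((p_adj > 0.5) == y).mean():.3f}")
```

The design choice mirrors the equation above: the forest's predicted probability enters as a single fixed-effect column, and the cluster term absorbs within-patient correlation.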
2.6. Statistical Methods
All models are constructed using a training dataset (525 patients and 2253 observations) and are assessed using a test dataset (517 patients and 2208 observations). Training and test data are randomly split such that daily measurements for each patient appear only in one of the datasets. Analyses are completed using SAS Version 9.3 (SAS Institute, Cary, NC) and R software [25]. Patient characteristics are presented as mean (standard deviation (SD)) or N (percent) and compared using t-tests and binomial tests using the R package tableone [26]. P-values adjusted for longitudinal measures within the daily dataset are computed using standard GLMM methodology.
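The patient-level split described above, in which all daily measurements for a patient land in exactly one dataset, corresponds to a grouped random split. A minimal Python sketch with scikit-learn on hypothetical toy data (the study itself performed this in SAS/R):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical daily records: each row is one patient-day, tagged by patient ID.
patient_ids = np.repeat(np.arange(100), 4)    # 100 patients, 4 days each
X = np.random.default_rng(1).normal(size=(len(patient_ids), 5))

# Split so that every patient's days fall entirely in train or entirely in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, test_idx = next(gss.split(X, groups=patient_ids))

# No patient appears in both datasets.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```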
We develop the following models: classification and regression tree (CART), random forest, support vector machine (SVM), k nearest neighbor (KNN), artificial neural networks (ANN), frequentist GLMM, Bayesian GLMM, BiMM tree, and BiMM forest. We note that the first five methods do not adjust for longitudinal outcomes, whereas the other methods account for longitudinal outcomes. R packages employed to develop models include: rpart [27], randomForest [28], kernlab [29], class [30], neuralnet [31], lme4 [32] and blme [33]. Because some of the methods should be employed with a complete dataset (i.e. random forest, KNN, GLMMs and BiMM forest), we also develop models using an imputed dataset. KNN predictions may only be made for test datasets using the R package class, so there are no accuracy statistics available for the training dataset. For simplicity, we use the rfImpute function within the randomForest R package to impute missing predictor values [28]. Models are assessed in terms of overall accuracy, sensitivity and specificity for training and test datasets using binomial estimates and confidence intervals. Area under the receiver operating curve (AUC) is determined using the R package ROCR [34].
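The performance summaries used throughout the results, a proportion with a 95% binomial confidence interval and an AUC, can be reproduced in miniature. The sketch below is in Python rather than the R packages cited above; the Wald interval is a stand-in, since the paper does not state which binomial interval it used, and the counts and probabilities here are hypothetical:

```python
from math import sqrt
from sklearn.metrics import roc_auc_score

def binom_ci(k: int, n: int, z: float = 1.96):
    """Wald 95% CI for a proportion (one common choice of binomial interval)."""
    p = k / n
    half = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

# Hypothetical test-set tally: 1391 of 2208 daily outcomes predicted correctly.
acc, lo, hi = binom_ci(1391, 2208)
print(f"accuracy {acc:.3f} (95% CI {lo:.3f}, {hi:.3f})")

# AUC from predicted probabilities (toy values).
y_true = [0, 0, 1, 1]
y_prob = [0.2, 0.4, 0.35, 0.8]
print(f"AUC = {roc_auc_score(y_true, y_prob):.2f}")  # 3 of 4 pairs ranked correctly -> 0.75
```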
3. Results
3.1. Patient Characteristics
Demographic and clinical characteristics of patients are displayed in Table 1 by outcome status for admission and all daily data. Of the 1042 patients, mean age is significantly higher for patients with poor outcome compared to patients with good outcome on admission (39 versus 36 years). There are significantly more females in the poor outcome group than in the good outcome group (82% versus 70%). There are no significant differences in ethnicity between the outcome groups. On the day of study admission, patients with poor outcome have significantly lower ALT, lower AST, higher bilirubin, higher creatinine, higher phosphate, and higher ammonia compared to those with good outcome. The poor outcome group also has a higher percentage of patients treated with mechanical ventilation (MV), pressors, and RRT. In total, there are 4461 observations collected for the 1042 patients. On days 1–7 there are, respectively, 1042, 875, 704, 571, 488, 423, and 358 patients with data available. Patients have an average of approximately four days of data. The right panel of Table 1 displays clinical characteristics of patients, with p-values adjusted for repeated measurements. Aside from ALT, AST and phosphate, all predictors differ significantly by outcome group.
Table 1: Patient Characteristics: Mean (SD) or N (%).
Table 1 displays patient characteristics for admission and across all daily data, including demographics, laboratory measurements, and clinical management.
| Type of Variable | Variable | N (Admission; 1042 total) | Admission Poor Outcome | Admission Good Outcome | Admission P-value | T (Daily; 4461 total) | Daily Poor Outcome | Daily Good Outcome | Daily P-value |
|---|---|---|---|---|---|---|---|---|---|
| Collected at one time | Female | 1042 | 443 (81.7) | 348 (69.6) | <0.001 | | | | |
| | Non-Hispanic | 1040 | 504 (93.3) | 472 (94.4) | 0.558 | | | | |
| | Age | 1042 | 39.11 (12.68) | 36.19 (12.74) | <0.001 | | | | |
| Collected at days 1–7 | ALT | 1029 | 4049.11 (3149.08) | 5173.12 (3921.35) | <0.001 | 4292 | 2359.43 (2538.94) | 2687.80 (3078.94) | 0.087 |
| | AST | 1029 | 5013.51 (5015.01) | 5703.38 (5634.56) | 0.038 | 4319 | 2230.88 (3621.64) | 2126.72 (3869.71) | 0.427 |
| | Bilirubin | 1026 | 5.64 (4.94) | 5.02 (4.88) | 0.043 | 4295 | 8.27 (6.48) | 6.38 (6.00) | <0.001 |
| | Creatinine | 1036 | 3.02 (7.86) | 2.12 (1.91) | 0.013 | 4360 | 2.70 (4.24) | 2.31 (2.29) | <0.001 |
| | Phosphate | 921 | 3.15 (2.13) | 2.72 (1.71) | 0.001 | 2499 | 3.32 (4.43) | 3.40 (7.41) | 0.587 |
| | Lactate | 191 | 1.28 (3.07) | 2.00 (3.70) | 0.149 | 816 | 2.31 (3.89) | 4.43 (4.48) | <0.001 |
| | Platelets | 1029 | 136.66 (89.19) | 208.34 (96.37) | 0.235 | 4316 | 104.05 (69.76) | 137.15 (68.46) | <0.001 |
| | Ammonia | 393 | 149.43 (117.36) | 122.19 (134.28) | 0.033 | 1088 | 115.95 (90.95) | 93.40 (99.85) | 0.002 |
| | INR | 1019 | 3.67 (2.77) | 3.76 (2.59) | 0.621 | 4236 | 3.15 (15.19) | 2.57 (2.90) | 0.012 |
| | MV | 1042 | 471 (86.9) | 86 (17.2) | <0.001 | 4456 | 2057 (88.4) | 450 (21.1) | <0.001 |
| | Pressors | 1042 | 188 (34.7) | 46 (9.2) | <0.001 | 4456 | 734 (31.5) | 147 (6.9) | <0.001 |
| | RRT | 1042 | 134 (24.7) | 44 (8.8) | <0.001 | 4456 | 663 (28.5) | 243 (11.4) | <0.001 |
Patients are randomly assigned to either the training dataset or the test dataset for model development, regardless of the number of daily measurements of data. Table 2 displays demographic and clinical characteristics for each dataset. There are no significant differences in predictor variables between the training and test datasets, aside from AST, which is slightly higher in the training dataset, and RRT use, which is slightly higher in the test dataset.
Table 2: Comparing Training and Test Datasets.
Table 2 displays patient characteristics for the training and test datasets. There are few significant differences between the datasets, aside from AST and RRT.
| Type of Variable | Variable | T (4461 daily observations from 1042 patients) | Training Data | Test Data | P-value |
|---|---|---|---|---|---|
| Collected at one time | Female | 1042 | 399 (76.0) | 392 (75.8) | 1.000 |
| | Non-Hispanic | 1040 | 490 (93.5) | 486 (94.2) | 0.746 |
| | Age | 1042 | 38.17 (12.78) | 37.24 (12.79) | 0.242 |
| Collected at days 1–7 | ALT | 4292 | 2589.47 (2846.02) | 2443.30 (2782.92) | 0.089 |
| | AST | 4319 | 2298.40 (3989.40) | 2061.22 (3470.36) | 0.037 |
| | Bilirubin | 4295 | 7.20 (6.02) | 7.52 (6.61) | 0.101 |
| | Creatinine | 4360 | 2.54 (3.76) | 2.49 (3.13) | 0.653 |
| | Phosphate | 2499 | 3.51 (7.79) | 3.21 (3.48) | 0.221 |
| | Lactate | 816 | 3.53 (4.36) | 3.44 (4.36) | 0.755 |
| | Platelets | 4316 | 113.17 (69.70) | 126.30 (67.82) | 0.365 |
| | Ammonia | 1088 | 107.52 (90.83) | 104.23 (99.92) | 0.572 |
| | INR | 4236 | 3.03 (15.54) | 2.72 (2.99) | 0.367 |
| | MV | 4456 | 1281 (56.9) | 1226 (55.6) | 0.415 |
| | Pressors | 4456 | 446 (19.8) | 435 (19.7) | 0.985 |
| | RRT | 4456 | 414 (18.4) | 492 (22.3) | 0.001 |
| Outcome | Poor Outcome | 4461 | 1086 (48.2) | 1043 (47.2) | 0.538 |
3.2. Original Dataset Models
We develop CART, frequentist GLMM, Bayesian GLMM, and BiMM tree models using the original training dataset. Random forest and BiMM forest require all missing data to be imputed prior to modeling, so these models cannot be developed using the original (unimputed) dataset, which contains missing predictor values. Diagrams for the CART and BiMM tree are displayed in Figure 1. If the logic statement is true, one follows the left branch; if the logic statement is false, one follows the right branch. For binary predictors (e.g. pressors), 1 indicates that the patient is on the treatment and 0 indicates that the patient is not. The CART uses seven variables and eight nodes to obtain predictions of outcome, whereas the BiMM tree uses three variables and three nodes. The models are identical up until the fourth node, at which the BiMM tree has a terminal node while the CART continues to use AST and additional variables.
Figure 1: Original Dataset Tree Diagrams.
Figure 1 displays the standard decision tree (CART) and BiMM tree diagrams for the original, unimputed dataset. Terminal nodes with 0 represent high coma grade and with 1 represent low coma grade. The fractions at the bottom of the terminal nodes represent the number of observations falling within the outcome group listed above, out of the total number of observations falling within the terminal node.
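To illustrate how such a tree is read at the bedside, a flow chart like the BiMM tree reduces to a short chain of if/else checks on its three variables (pressor use, bilirubin, creatinine). The sketch below mirrors the reading logic only; the thresholds and branch arrangement are hypothetical placeholders, not the fitted split values shown in Figure 1:

```python
def bimm_tree_read(pressors: int, bilirubin: float, creatinine: float) -> str:
    """Illustrative bedside reading of a three-variable decision tree.
    Thresholds are placeholders, NOT the published split values."""
    if pressors == 1:                 # logic statement true -> left branch
        return "high coma grade"
    if bilirubin > 5.0:               # placeholder threshold
        return "high coma grade" if creatinine > 2.0 else "low coma grade"
    return "low coma grade"
```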
Accuracy, sensitivity and specificity for the original training and test dataset models are presented in Table 3. The BiMM tree model has the highest training dataset accuracy, sensitivity and specificity compared to the other models. The frequentist and Bayesian GLMM models make identical predictions for the training dataset. For the test dataset, most models have similar prediction accuracy of approximately 63% (aside from ANN with 62% and SVM with 65%), though the breakdown of sensitivity and specificity differs across models. The GLMMs and BiMM tree have slightly higher specificity compared to CART, and balance sensitivity and specificity more evenly than the other models. The SVM has the highest specificity but the lowest sensitivity. A drawback of the GLMM, SVM and ANN models is that only observations with non-missing values for all variables can be used in modeling, so predictions cannot be obtained for a substantial portion of observations. The CART and BiMM tree methods can handle missing predictor data, so predictions are obtained for all observations within the original, unimputed training and test datasets.
Table 3: Accuracy Statistics for Models with Original Dataset.
Table 3 displays the accuracy, sensitivity and specificity for test and training original datasets, along with their 95% Binomial Confidence Intervals.
| Method | Training T | Training Accuracy (95% CI) | Training Sensitivity (95% CI) | Training Specificity (95% CI) | Test T | Test Accuracy (95% CI) | Test Sensitivity (95% CI) | Test Specificity (95% CI) |
|---|---|---|---|---|---|---|---|---|
| CART | 2253 | 0.702 (0.683,0.721) | 0.683 (0.655,0.710) | 0.723 (0.695,0.749) | 2208 | 0.639 (0.619,0.660) | 0.613 (0.584,0.641) | 0.669 (0.640,0.698) |
| Frequentist GLMM | 127 | 0.417 (0.330,0.508) | 0.312 (0.211,0.427) | 0.580 (0.432,0.718) | 138 | 0.630 (0.544,0.711) | 0.551 (0.426,0.671) | 0.710 (0.588,0.813) |
| Bayesian GLMM | 127 | 0.417 (0.330,0.508) | 0.312 (0.211,0.427) | 0.580 (0.432,0.718) | 138 | 0.638 (0.552,0.718) | 0.536 (0.412,0.657) | 0.739 (0.619,0.837) |
| BiMM Tree | 2253 | 0.907 (0.894,0.918) | 1.000 (0.997,1.000) | 0.820 (0.797,0.842) | 2208 | 0.630 (0.610,0.651) | 0.530 (0.499,0.561) | 0.720 (0.693,0.746) |
| ANN | 127 | 0.677 (0.588,0.757) | 0.880 (0.757,0.955) | 0.545 (0.428,0.659) | 138 | 0.616 (0.529,0.697) | 0.594 (0.469,0.711) | 0.638 (0.513,0.750) |
| SVM | 127 | 0.803 (0.723,0.868) | 0.940 (0.835,0.987) | 0.714 (0.600,0.812) | 138 | 0.645 (0.559,0.724) | 0.449 (0.329,0.574) | 0.841 (0.733,0.918) |
3.3. Imputed Dataset Models
In order to compare all models, we use an imputed dataset to predict daily outcomes of ALF patients. Figure 2 displays the CART and BiMM tree models, along with the variable importance plot from the random forest. The CART and BiMM tree models look fairly similar, though the CART includes four additional nodes compared to the BiMM tree. Again, the CART model uses more predictors compared to the BiMM tree, which uses only three predictors. Within the random forest variable importance plot, the most important predictors appear at the top and the least important predictors appear at the bottom. The random forest identifies lactate as the most important predictor of daily outcome, followed by ammonia and ALT. The least important predictors of outcome are sex and ethnicity, consistent with clinical literature. Partial dependence plots are examined to assess the relationship between important predictors and outcome. Lactate greater than 6 mmol/L and ALT greater than 5000 IU/L are associated with higher odds of poor outcome (Figure 3).
Figure 2: Imputed Dataset Tree Diagrams.
Figure 2 displays the standard decision tree (CART), BiMM tree and random forest variable importance for the imputed dataset. Terminal nodes in the trees with 0 represent high coma grade and with 1 represent low coma grade. The fractions at the bottom of the terminal nodes represent the number of observations falling within the outcome group listed above, out of the total number of observations falling within the terminal node. The most important variables are at the top of the variable importance plot, ranging down to the bottom (least important) variables.
Figure 3: Partial Dependence Plots for Lactate and ALT.
Figure 3 displays partial dependence plots for continuous variables in the random forest model. Increasing slopes within the plots represent increasing log odds of poor outcome (low coma grade) as values of the variable increase.
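A partial dependence curve of the kind shown in Figure 3 averages a model's predicted probability over the data while one feature is clamped to each value on a grid. A minimal Python sketch on simulated data (not the study's variables or its R implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Partial dependence for feature 0: average the predicted probability over
# the dataset with feature 0 clamped to each grid value in turn.
grid = np.linspace(-2, 2, 9)
pd_vals = []
for v in grid:
    Xv = X.copy()
    Xv[:, 0] = v
    pd_vals.append(rf.predict_proba(Xv)[:, 1].mean())

# An increasing curve mirrors how the paper reads rising slopes as
# increasing odds of the outcome with increasing values of the variable.
print(pd_vals[0], pd_vals[-1])
```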
Accuracy, sensitivity and specificity for the imputed dataset models are presented in Table 4. Similar to the original dataset results, the BiMM tree model has the highest training dataset accuracy compared to other models, along with the highest sensitivity. The frequentist and Bayesian GLMM models make very similar predictions for the training dataset. Models which adjust for longitudinal outcomes (i.e. GLMMs and BiMM methods) have higher performance statistics for the training dataset than models which do not (i.e. CART, random forest, ANN and SVM). For the test dataset, the standard random forest, GLMMs, and BiMM forest have similar prediction accuracy of approximately 69%. The BiMM tree, SVM, ANN and KNN have slightly lower test set accuracy compared to the other models. All models have higher sensitivity than specificity for the test dataset.
Table 4: Accuracy Statistics for Models with Imputed Dataset.
Table 4 displays the accuracy, sensitivity and specificity for test and training imputed datasets, along with their 95% Binomial Confidence Intervals.
| Method | Training T | Training Accuracy (95% CI) | Training Sensitivity (95% CI) | Training Specificity (95% CI) | Test T | Test Accuracy (95% CI) | Test Sensitivity (95% CI) | Test Specificity (95% CI) |
|---|---|---|---|---|---|---|---|---|
| CART | 2253 | 0.730 (0.711,0.748) | 0.787 (0.763,0.811) | 0.668 (0.639,0.696) | 2208 | 0.653 (0.633,0.673) | 0.762 (0.737,0.786) | 0.531 (0.500,0.562) |
| RF | 2253 | 0.757 (0.739,0.775) | 0.799 (0.775,0.822) | 0.712 (0.684,0.739) | 2208 | 0.688 (0.668,0.707) | 0.743 (0.717,0.768) | 0.626 (0.596,0.656) |
| Frequentist GLMM | 2253 | 0.869 (0.854,0.882) | 0.886 (0.866,0.904) | 0.850 (0.827,0.871) | 2208 | 0.686 (0.666,0.705) | 0.724 (0.697,0.749) | 0.643 (0.613,0.672) |
| Bayesian GLMM | 2253 | 0.868 (0.853,0.881) | 0.888 (0.868,0.905) | 0.846 (0.823,0.867) | 2208 | 0.686 (0.666,0.705) | 0.724 (0.697,0.749) | 0.644 (0.614,0.673) |
| BiMM Tree | 2253 | 0.920 (0.908,0.931) | 1.000 (0.997,1.000) | 0.845 (0.823,0.865) | 2208 | 0.653 (0.632,0.672) | 0.669 (0.640,0.698) | 0.638 (0.609,0.655) |
| BiMM forest | 2253 | 0.872 (0.857,0.855) | 0.868 (0.848,0.887) | 0.876 (0.854,0.895) | 2208 | 0.688 (0.668,0.707) | 0.743 (0.717,0.768) | 0.626 (0.596,0.656) |
| ANN | 2253 | 0.705 (0.689,0.724) | 0.845 (0.823,0.865) | 0.554 (0.524,0.584) | 2208 | 0.662 (0.642,0.682) | 0.819 (0.800,0.841) | 0.487 (0.456,0.519) |
| KNN | | | | | 2208 | 0.597 (0.577,0.618) | 0.625 (0.596,0.653) | 0.567 (0.536,0.597) |
| SVM | 2253 | 0.773 (0.755,0.790) | 0.832 (0.809,0.853) | 0.709 (0.681,0.736) | 2208 | 0.673 (0.653,0.693) | 0.747 (0.721,0.772) | 0.591 (0.560,0.621) |
3.4. Area Under the Receiver Operating Curve
In addition to comparing accuracy, sensitivity and specificity, we compare models for the original and imputed training and test datasets using ROC plots (Figure 4). For the original training dataset, the SVM has the best AUC (0.927), followed by BiMM tree (0.907), CART (0.735), ANN (0.681) and the GLMM methods (0.417). Thus, the SVM and BiMM tree have the best model fit for the original training dataset. For the imputed training dataset, the BiMM forest has the highest AUC (0.952), followed closely by frequentist GLMM (0.941), Bayesian GLMM (0.940) and BiMM tree (0.921). SVM, RF and CART have lower AUCs (0.855, 0.829 and 0.770, respectively) than the methods which adjust for longitudinal outcomes. ANN has the lowest AUC (0.707). Overall, the BiMM forest has the highest AUC, indicating the best model fit for the imputed training dataset.
Figure 4: Receiver Operating Curve (ROC) Plots for Original Dataset Models and Imputed Dataset Models.
Figure 4 displays the ROC curves for the original and imputed training and test datasets. ROC curves extending further from the diagonal line indicate better model fit.
For the original test dataset, the SVM has the best AUC (0.725), followed by BiMM tree (0.697), CART (0.682), ANN (0.629), and the GLMM methods (0.603). For the imputed test dataset, the BiMM forest and RF have the highest AUCs (0.749), followed by SVM (0.730), Bayesian GLMM (0.708), frequentist GLMM (0.707), BiMM tree (0.707), CART (0.698), ANN (0.660), and KNN (0.530).
4. Discussion
4.1. Key Results
In this study, we develop several prediction models using different statistical and machine learning methods, which we compare in terms of predictive performance and biological plausibility. We provide prediction models developed specifically for APAP-ALF patients which can be used at hospital admission and during in-patient hospitalization using daily outcomes. Models are developed using a training dataset and evaluated using a test dataset for both the original unimputed data and imputed data with missing values filled in. The prediction (test dataset) accuracies of the models with the original dataset are similar, around 63%. The BiMM tree has significantly higher training dataset accuracy than the standard CART, which does not account for clustered outcomes. The CART model is also more complex than the BiMM tree because it includes more predictor variables. Moreover, the CART splits may not be consistent with observations in clinical practice. For example, the fourth node splits on AST < 6058 such that lower AST is associated with high coma grade, which may be counterintuitive because high AST is typically associated with poor survival. Additionally, AST is not typically a laboratory variable considered predictive of outcome in current prediction models [9, 15]. On the other hand, the BiMM tree is clinically relevant, in that poor daily outcomes are associated with pressor use, high bilirubin, and high creatinine. A benefit of the BiMM tree method compared to the frequentist and Bayesian GLMMs is that all data, regardless of missing values, can be evaluated, whereas only observations with non-missing values of all predictors can be included in the GLMMs.
To compare the prediction models using all observations, we additionally develop models using an imputed dataset. The standard CART and random forest have the lowest training dataset accuracy, which is expected because these models do not account for clustering in the outcomes. Consistent with the original dataset model, the CART with the imputed dataset is more complex and does not make sense clinically, because its first node identifies high lactate and high ALT as predictors of good outcome. The BiMM tree has a similar structure to the CART, which is not consistent with clinical observations of outcome. These models, which are contrary to the clinical presentation of patients, highlight the danger of imputing missing values, particularly when a large percentage of data is missing (e.g. lactate in this dataset, which is missing 82% of values). Although it has a large amount of missing data, we consider lactate in prediction modeling because it has been identified as an important predictor of outcome in ALF [13]. While the CART and BiMM tree models do not make clinical sense, the BiMM forest is able to identify that high lactate and high ALT are associated with poor outcomes in APAP-ALF patients. For the imputed dataset, the BiMM forest offers test set accuracy of 69%, training set accuracy of 87% and training set AUC of 0.952. The GLMM models have similar training and test dataset accuracy to the BiMM forest, with slightly lower AUCs. It is not surprising that the BiMM tree and BiMM forest use different predictors because the methods are quite different: a single tree may be adequate for predicting outcomes, whereas a forest may identify different but equally good predictors of outcome because many decision trees are developed and their results aggregated.
Overall, the model that offers good predictive ability, is consistent with clinical practice, offers clear interpretation of predictors, and is easy to use for obtaining predictions is the BiMM tree with the original dataset. While its prediction accuracy is slightly lower than that of the competing models, we believe it is the best model because it is simple to use in practice at the bedside for predicting daily outcomes and it is consistent with the clinical presentation of APAP-ALF patients. Compared to the GLMM models, the BiMM tree is easier to use because it requires only three variables within a user-friendly flow chart and needs no calculation or application. Additionally, interpretation is simpler for the BiMM tree than for the GLMMs because there is no need to understand odds ratios or regression parameter estimates. We develop ANN, KNN and SVM models to compare results, but these are not ideal models for this application because they do not offer interpretation of predictors. We advise against using models developed with the imputed dataset because there is a substantial amount of missing data for some predictor variables, and the resulting models may not be consistent with clinical practice. A benefit of the BiMM tree method is that it can handle missing data without the need for imputation. The BiMM forest is another viable option for daily predictions with clinically meaningful associations between predictors and outcome; however, an online application would need to be developed so that predictions could be obtained for new patients.
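A decision-tree flow chart reduces to a short chain of if/else rules at the bedside. The sketch below illustrates that structure using the three most important BiMM tree predictors identified in this study (pressor use, bilirubin, creatinine); the tree shape and the cut-points are hypothetical placeholders for illustration only, not the fitted model.

```python
# Bedside-style flow chart expressed as if/else rules. The predictor names
# match the three most important BiMM tree variables reported in this study,
# but the structure and thresholds below are HYPOTHETICAL placeholders --
# consult the fitted BiMM tree for the actual decision rules.
def predicted_daily_outcome(pressor_use: bool,
                            bilirubin: float,
                            creatinine: float) -> str:
    """Return a hypothetical daily coma-grade prediction ('high' or 'low')."""
    if pressor_use:                 # vasopressor support -> worse outcome
        return "high"
    if bilirubin > 10.0:            # hypothetical cut-point (mg/dL)
        if creatinine > 2.0:        # hypothetical cut-point (mg/dL)
            return "high"
        return "low"
    return "low"
```

No score arithmetic or software is needed; a clinician follows at most three branches to a prediction, which is the practical advantage over score-based models.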
4.2. Comparison with Previous Studies
In this study, a decision support tool for predicting the likelihood of daily outcome during the first week of hospitalization is developed, which is novel since most prognostic models are constructed using hospital admission data and are not meant for use over time. Direct comparison of performance characteristics between the models presented in this paper and current clinical prediction models is not possible because different outcome variables are used. In the current study, we use daily measures of high versus low coma grade, whereas most prediction models in the clinical literature use survival at a fixed time point. However, some clinical variables are shared between the models presented in this study and current clinical models. The BiMM tree developed with original data uses predictors similar to those of other prognosis models in the clinical literature: KCC includes creatinine [9], the model for end stage liver disease (MELD) includes creatinine and bilirubin [35], ALFSG-PI includes pressor use and bilirubin [15], and ALFED includes bilirubin [17]. CART models for the prediction of 21-day survival produced using aggregated post-admission data in a previous study use MELD, ventilator use, and lactate [16]; thus, decision tree models considering longitudinal data in the present study are quite different from those which do not use daily data.
Aside from using different outcome variables, the BiMM tree prediction model differs from many others in the literature because predictions are easily obtained using a flow chart, whereas other models (e.g. MELD and ALFSG-PI) require the use of an application or calculation of scores. Though KCC also provides a simple set of scoring rules, it has demonstrated poor specificity, particularly at later time points during ALF progression [2, 11–13]. Similarly, the ALFED model provides simple scoring rules, but it was developed using non-APAP-ALF patients and a fixed outcome (mortality). A future study could investigate developing a daily prediction model for coma grade of non-APAP-ALF patients. We use a daily measurement of outcome to develop prediction models, rather than an outcome for a single time point, because disease progression can change on a daily basis in the ALF setting. It is of clinical interest to obtain predictions of outcome which are updated over time, rather than a single prediction several weeks in advance, so that clinicians can develop management plans for ALF patients (e.g. whether to list a patient for liver transplant).
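As an example of score-based models that require calculation, the original MELD formula [35] combines three log-transformed laboratory values. A commonly used form is sketched below (laboratory values floored at 1.0 so the logarithms are non-negative); clinical implementations add further rules, such as capping creatinine for dialysis patients, so treat this as illustrative rather than definitive.

```python
# Illustrative sketch of the original MELD score (Kamath et al. 2001, ref [35]).
# Laboratory values are floored at 1.0 to avoid negative logarithms; clinical
# implementations include additional adjustments not reproduced here.
import math

def meld_score(bilirubin_mg_dl: float, inr: float, creatinine_mg_dl: float) -> float:
    bili = max(bilirubin_mg_dl, 1.0)
    inr_v = max(inr, 1.0)
    creat = max(creatinine_mg_dl, 1.0)
    return (3.78 * math.log(bili)
            + 11.2 * math.log(inr_v)
            + 9.57 * math.log(creat)
            + 6.43)
```

Even this short formula requires a calculator at the bedside, which is the contrast the flow-chart approach is meant to avoid.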
4.3. Comparing Methodologies for Longitudinal Prediction Modeling
The primary focus of this study is to compare the performance statistics of the novel BiMM tree and BiMM forest to traditional methods (GLMMs and standard tree/forest models). BiMM tree [18] and BiMM forest [22] are machine learning algorithms which may be applied to develop accurate prediction models for complex datasets (e.g. containing many predictors, predictors with missing values, and predictors with extreme values) with clustered and longitudinal endpoints. Statistical models should account for data of this structure because values of a variable collected for a patient at many time points are correlated, creating groups called clusters. In addition to having clustered and longitudinal outcomes, some datasets (e.g. the ALFSG registry data) contain complexities which make developing prediction models challenging using traditional methodology. For example, GLMMs may be suboptimal if datasets contain nonlinear predictors of outcome or complex interactions among predictors which are not specified correctly. BiMM tree and BiMM forest provide data-driven methods for developing prediction models for longitudinal data which do not require the user to specify nonlinear associations or interaction terms. Compared to standard CART and random forest, BiMM methods are more appropriate for longitudinal data since they incorporate clustering effects. Based on data simulations [22], BiMM forest may provide higher accuracy than BiMM tree; however, BiMM tree is simpler to use in practice than BiMM forest, which requires an application to obtain predictions.
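The core idea behind tree-based mixed-effects methods is an alternating scheme: fit a tree to outcomes adjusted for per-patient effects, then re-estimate shrunken patient intercepts from the residuals, and repeat until the estimates stabilize. The sketch below is a deliberately simplified, linearized toy version of that loop (numpy/scikit-learn, simulated data); it is not the BiMM algorithm of [18, 22], which embeds the tree within a generalized linear mixed model for binary outcomes.

```python
# Simplified, linearized sketch of the alternating scheme behind tree-based
# mixed-effects models: (1) fit a tree to cluster-adjusted outcomes,
# (2) re-estimate shrunken per-patient intercepts from the residuals,
# (3) repeat. NOT the BiMM tree/forest algorithm itself -- BiMM methods use a
# GLMM for binary outcomes; this toy uses a continuous outcome for clarity.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n_patients, n_days = 50, 7
cluster = np.repeat(np.arange(n_patients), n_days)   # patient id for each row
X = rng.normal(size=(n_patients * n_days, 3))        # daily predictors (toy)
true_b = rng.normal(0.0, 1.0, n_patients)            # simulated patient effects
y = (X[:, 0] > 0).astype(float) + true_b[cluster] + rng.normal(0, 0.3, len(cluster))

b = np.zeros(n_patients)                             # random-intercept estimates
shrink = 5.0                                         # ridge-style shrinkage constant
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
for _ in range(10):
    tree.fit(X, y - b[cluster])                      # tree on adjusted outcome
    resid = y - tree.predict(X)
    for i in range(n_patients):                      # shrunken per-patient means
        r = resid[cluster == i]
        b[i] = r.sum() / (len(r) + shrink)
```

The shrinkage denominator plays the role of the random-effect variance ratio in a true mixed model: patients with few observed days are pulled toward zero rather than taken at face value.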
4.4. Limitations
Though the BiMM tree offers an alternative to current prognosis criteria, there are some limitations of this study which should be considered. First, the data used to develop and assess the new models are from the North American ALFSG registry, so the models may not be appropriate for populations elsewhere, where transplant decisions may vary. Given the orphan status of ALF, it is difficult to find robust external datasets with many patients and serially collected clinical features. However, models are created using internal validation (a test dataset) to address the issue of generalizability; therefore, it is hypothesized that the BiMM tree model should perform well in other populations of APAP-ALF patients. The BiMM forest offers among the highest prediction accuracies of the models; however, a limitation is that an online application is required for obtaining predictions in practice, and interpretability is not as straightforward as for the decision tree models.
An important consideration in this study is that the models handle missing data in different ways, making it challenging to compare all models on equal footing. A benefit of the BiMM tree is that it can handle large amounts of missing data, whereas the GLMMs and BiMM forest require complete data, which necessitated imputation of missing values. Because there is a large amount of missing data in some of the predictors (e.g. lactate and ammonia), models produced with the imputed data may not be appropriate. This is evident in the resulting models, which are not consistent with clinical observations in practice. This is the main reason we recommend use of the BiMM tree prediction model produced with the original unimputed dataset, even though it has slightly lower prediction accuracy than some of the other models. A future study could investigate the use of BiMM forest for imputation of longitudinal data with missing values.
Performance of the BiMM tree is modest for the test dataset, highlighting the difficulty of predicting daily coma grade over time in APAP-ALF patients. Some challenges of the ALFSG registry dataset in developing longitudinal prediction models include missing data and a heterogeneous population which is treated using varying protocols across study sites. A future study could investigate whether other ensemble classifiers or deep learning techniques would increase the accuracy, sensitivity, specificity and AUC of a prediction model. Given these limitations, it would be beneficial to use external datasets to validate the BiMM tree model developed in this study. Additionally, incorporation of biomarkers of hepatic regeneration may improve upon models for prognosticating ALF (e.g. fatty acid binding proteins [36, 37]).
5. Conclusions
Several models are produced for determining daily outcomes of APAP-ALF patients which can be used during the course of hospitalization. The BiMM tree provides a prediction model developed for daily outcome measurements, offering a simple, accurate, and clinically consistent method for assessing high versus low coma grade. Data from the ALFSG registry suggest that the BiMM tree prediction model offers good prediction accuracy (63%) and overall performance (training data AUC 0.907), but additional datasets should be used to externally validate these findings.
Highlights.
- Longitudinal prognosis models are not available for acute liver failure patients
- We develop prediction models using novel BiMM tree and BiMM forest methodology
- BiMM forest has slightly higher accuracy, but BiMM tree is simpler to use at the bedside
Acknowledgements:
The data collection for this study was funded by the NIH/NIDDK (U01 DK58369). This work was partially supported by the NIH/NCATS Grant (KL2 TR001421), NIH/NCATS Grant (TL1 TR001451). The funding sources had no involvement in the analysis or writing of this report.
Competing interests: Authors have no competing interests to declare.
References
- 1.Fagan E and Wannan G, Reducing paracetamol overdoses. BMJ, 1996. 313(7070): p. 1417–8.
- 2.Larson AM, et al., Acetaminophen-induced acute liver failure: results of a United States multicenter, prospective study. Hepatology, 2005. 42(6): p. 1364–72.
- 3.Karvellas CJ, et al., Medical and psychiatric outcomes for patients transplanted for acetaminophen-induced acute liver failure: a case-control study. Liver Int, 2010. 30(6): p. 826–33.
- 4.Bernal W, et al., Acute liver failure. Lancet, 2010. 376(9736): p. 190–201.
- 5.Reddy KR, et al., Liver transplantation for Acute Liver Failure: Results from the NIH Acute Liver Failure Study Group. Hepatology, 2012. 56(4(Suppl)): p. 246A.
- 6.Simpson KJ, et al., The utilization of liver transplantation in the management of acute liver failure: comparison between acetaminophen and non-acetaminophen etiologies. Liver Transpl, 2009. 15(6): p. 600–9.
- 7.Stravitz RT, et al., Intensive care of patients with acute liver failure: recommendations of the U.S. Acute Liver Failure Study Group. Crit Care Med, 2007. 35(11): p. 2498–508.
- 8.Antoniades CG, et al., The importance of immune dysfunction in determining outcome in acute liver failure. Journal of Hepatology, 2008. 49(5): p. 845–61.
- 9.O'Grady JG, et al., Early indicators of prognosis in fulminant hepatic failure. Gastroenterology, 1989. 97(2): p. 439–45.
- 10.Pauwels A, et al., Emergency liver transplantation for acute liver failure. Evaluation of London and Clichy criteria. J Hepatol, 1993. 17(1): p. 124–7.
- 11.Schmidt LE and Dalhoff K, Serum phosphate is an early predictor of outcome in severe acetaminophen-induced hepatotoxicity. Hepatology, 2002. 36(3): p. 659–65.
- 12.Schmidt LE and Larsen FS, MELD score as a predictor of liver failure and death in patients with acetaminophen-induced liver injury. Hepatology, 2007. 45(3): p. 789–96.
- 13.Bernal W, et al., Blood lactate as an early predictor of outcome in paracetamol-induced acute liver failure: a cohort study. Lancet, 2002. 359(9306): p. 558–63.
- 14.Shakil AO, et al., Acute liver failure: clinical features, outcome analysis, and applicability of prognostic criteria. Liver Transpl, 2000. 6(2): p. 163–9.
- 15.Koch DG, et al., Development of a Model to Predict Transplant-free Survival of Patients with Acute Liver Failure. Clinical Gastroenterology and Hepatology, 2016.
- 16.Speiser JL, Lee WM, and Karvellas CJ, Predicting outcome on admission and post-admission for acetaminophen-induced acute liver failure using classification and regression tree models. PLoS One, 2015. 10(4): p. e0122929.
- 17.Kumar R, et al., Prospective derivation and validation of early dynamic model for predicting outcome in patients with acute liver failure. Gut, 2012. 61(7): p. 1068–1075.
- 18.Speiser JL, et al., BiMM tree: a decision tree method for modeling clustered and longitudinal binary outcomes. Communications in Statistics - Simulation and Computation, 2018: p. 1–20.
- 19.Garzotto M, et al., Improved detection of prostate cancer using classification and regression tree analysis. J Clin Oncol, 2005. 23(19): p. 4322–9.
- 20.Aguiar FS, et al., Classification and regression tree (CART) model to predict pulmonary tuberculosis in hospitalized patients. BMC Pulm Med, 2012. 12: p. 40.
- 21.Breiman L, Random forests. Machine Learning, 2001. 45(1): p. 5–32.
- 22.Speiser JL, et al., BiMM forest: A random forest method for modeling clustered and longitudinal binary outcomes. Chemometrics and Intelligent Laboratory Systems, 2019.
- 23.O'Grady JG, Schalm SW, and Williams R, Acute liver failure: redefining the syndromes. Lancet, 1993. 342(8866): p. 273–5.
- 24.Atterbury CE, Maddrey WC, and Conn HO, Neomycin-sorbitol and lactulose in the treatment of acute portal-systemic encephalopathy. A controlled, double-blind clinical trial. The American Journal of Digestive Diseases, 1978. 23(5): p. 398–406.
- 25.R Development Core Team, R: A language and environment for statistical computing. 2008, Vienna, Austria.
- 26.Yoshida K and Bohn J, tableone: Create "Table 1" to Describe Baseline Characteristics. R package version 0.7.3, 2015.
- 27.Therneau TM and Atkinson EJ, An introduction to recursive partitioning using the rpart routines. Mayo Foundation, 1997.
- 28.Liaw A and Wiener M, Classification and Regression by randomForest. R News, 2002. 2: p. 18–22.
- 29.Karatzoglou A, et al., kernlab - an S4 package for kernel methods in R. Journal of Statistical Software, 2004. 11(9): p. 1–20.
- 30.Ripley BD and Venables W, class: Functions for Classification. R package, 2019. Available from: https://cran.r-project.org/web/packages/class/class.pdf.
- 31.Günther F and Fritsch S, neuralnet: Training of neural networks. The R Journal, 2010. 2(1): p. 30–38.
- 32.Bates D, et al., Package 'lme4'. R package, 2015.
- 33.Dorie V, blme: Bayesian Linear Mixed-Effects Models. 2013, R package.
- 34.Sing T, et al., ROCR: visualizing classifier performance in R. Bioinformatics, 2005. 21(20).
- 35.Kamath PS, et al., A model to predict survival in patients with end-stage liver disease. Hepatology, 2001. 33(2): p. 464–70.
- 36.Karvellas CJ, et al., The association between FABP7 serum levels with survival and neurological complications in acetaminophen-induced acute liver failure: a nested case–control study. Annals of Intensive Care, 2017. 7(1): p. 99.
- 37.Karvellas CJ, et al., Elevated FABP1 serum levels are associated with poorer survival in acetaminophen-induced acute liver failure. Hepatology, 2017. 65(3): p. 938–949.




