PLoS One. 2020 Jun 25;15(6):e0235117. doi: 10.1371/journal.pone.0235117

Using structured pathology data to predict hospital-wide mortality at admission

Mieke Deschepper 1,*, Willem Waegeman 2, Dirk Vogelaers 3,4, Kristof Eeckloo 1,5
Editor: Juan F Orueta
PMCID: PMC7316243  PMID: 32584872

Abstract

Early prediction of in-hospital mortality can improve patient outcomes. Current prediction models for in-hospital mortality focus mainly on specific pathologies. Structured pathology data is readily available hospital-wide and is primarily used for administrative purposes such as financing. We aim to build a predictive model at admission using International Classification of Diseases (ICD) codes as predictors, and to investigate the effect of the self-evident DNR (“Do Not Resuscitate”) diagnosis codes and palliative care codes. We compare models using ICD-10-CM codes with models using Risk of Mortality (RoM) and the Charlson Comorbidity Index (CCI) as predictors, all built with the Random Forests modeling approach. We use the Present on Admission flag to distinguish which diagnoses are present on admission. The study is performed in a single center (Ghent University Hospital) with the inclusion of 36 368 patients, all discharged in 2017. Our model at admission using ICD-10-CM codes (AUCROC = 0.9477) outperforms the models using RoM (AUCROC = 0.8797) and CCI (AUCROC = 0.7435). We confirmed that DNR and palliative care codes have a strong impact on the model: excluding patients with these codes results in a 7% decrease for the ICD model at admission (AUCROC = 0.8791). We therefore conclude that a model with sufficient predictive performance can be derived from structured pathology data and, if available in real time, can serve as a prerequisite for a practical clinical decision support system for physicians.

1. Introduction

1.1 Reuse of readily available hospital-wide data

Large amounts of data are registered in well-defined formats in hospitals. These datasets contain administrative data, such as age, billing data and medical specialty, as well as structured pathology data in the form of International Classification of Diseases (ICD) codes. Such datasets contain much information that could be useful for secondary purposes, yet this information currently goes unused for predicting in-hospital mortality. Added value could thus be generated from existing hospital databases without caregivers having to spend much additional effort or time on noncare activities.

1.2 Current approach predicting mortality

Early identification of patients with a high risk of mortality is crucial for health care providers to act adequately and in a timely manner. The prediction of mortality is a well-researched topic in intensive care [1–3] and cardiac diseases [4], yet little research has been based on hospital-wide datasets. Currently, the Charlson Comorbidity Index (CCI) and the Risk of Mortality (RoM) score are widely used to predict mortality.

The CCI score is obtained from 17 weighted comorbidities and was initially developed to assess the risk of one-year mortality [5]. This method dates from 1987 and is intended to provide a fast and easy risk assessment. The clinical conditions were initially retrieved manually from hospital charts, but are now available as ICD-10-CM codes, allowing automated extraction and calculation of the score in larger samples, as has been done by Quan (2005) and Sundararajan (2004) [6, 7]. Although the initial purpose of this measure is to assess the risk of one-year mortality, it is still used to predict in-hospital mortality, so we include it in our list of comparison measures.
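To make the construction of the score concrete, the sketch below computes a CCI as a weighted sum of binary comorbidity flags; it uses a subset of the original Charlson weights, and the patient vector is purely illustrative.

    # Minimal sketch: CCI as a weighted sum of binary comorbidity flags.
    # Weights follow Charlson et al. (1987); this is a subset of the 17
    # conditions, and the patient vector is hypothetical.
    cci_weights <- c(mi = 1, chf = 1, dementia = 1, copd = 1,
                     diabetes_with_complications = 2, any_malignancy = 2,
                     metastatic_solid_tumor = 6, aids = 6)
    patient <- c(mi = 1, chf = 0, dementia = 0, copd = 1,
                 diabetes_with_complications = 0, any_malignancy = 1,
                 metastatic_solid_tumor = 0, aids = 0)
    cci_score <- sum(cci_weights * patient)  # here: 1 + 1 + 2 = 4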

RoM represents the likelihood of dying calculated from all comorbidities [8] based on all ICD codes: in a first step, a weight is given to every secondary diagnosis; in a second step, the standard risk of mortality level of each secondary diagnosis is modified based on patient age, principal diagnosis, pathology group, and procedures. This is aggregated into subclasses, numbered 1 to 4, representing categories rather than scores. RoM should not be confused with the Severity of Illness (SoI) score, which is calculated from the same data; SoI is defined as the extent of organ system derangement or physiologic decompensation [8]. RoM is used for risk adjustment of the in-hospital mortality indicators of the Agency for Healthcare Research and Quality (AHRQ). This categorization of risk of mortality has previously been demonstrated to correlate strongly with observed mortality in a medical ICU setting [9]. The algorithm to calculate RoM is neither free nor open source, so a license is needed. RoM was shown to have a better predictive value than CCI for in-hospital mortality, but this study only encompassed older patients in surgical settings [10]. Furthermore, the separate diagnoses underlying the CCI score have been demonstrated to predict in-hospital mortality better than the score itself in hip fracture patients [11].

Many hospitals make structured pathology information available in the form of ICD-10-CM codes. In many countries, including Belgium, this classification is mandatory and is used to calculate hospital reimbursement. The ICD-10-CM diagnosis code is a seven-character code, which can be read at three levels: a chapter (one character), a category (three characters) and a full code (all seven characters; see Fig 1). These codes form the basis of the aggregated pathology groups (APR-DRG) and the RoM and SoI values.

Fig 1. ICD-10-CM diagnosis code hierarchy, with example S52, fracture of the forearm. The ICD-10-CM code consists of a chapter (S), a category (S52) and a full code (S52.521A).
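As a minimal sketch of this hierarchy, the helper below derives the three levels from a full code; the function name is ours, and the one-character chapter follows the simplification used in Fig 1.

    # Derive the three hierarchy levels (later used as predictors)
    # from a full ICD-10-CM code, using the example from Fig 1.
    icd_levels <- function(code) {
      code <- gsub(".", "", code, fixed = TRUE)  # drop the dot: "S52521A"
      list(chapter  = substr(code, 1, 1),        # "S"
           category = substr(code, 1, 3),        # "S52"
           full     = code)                      # "S52521A"
    }
    icd_levels("S52.521A")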

In Belgian hospitals, these codes are obtained via an extensive manual process. Trained ICD coding experts search the Electronic Health Record (EHR) (and other databases or manual records) for the pathology described for a patient, with the discharge letter being one of the main sources. They then translate the patient’s diagnoses and procedures into the appropriate ICD codes.

1.3 Proposed approach predicting in-hospital mortality

RoM and CCI are both aggregated measures with ICD-10-CM codes as their basis. The first difference in our approach is to use the individual ICD-10-CM diagnosis codes as predictors instead of these aggregated measures. In order to develop an early-warning system for in-hospital mortality that would be useful in practice, it is important to concentrate on variables known at admission. However, most studies considering ICD diagnoses (or aggregations such as RoM) as predictors for in-hospital mortality make use of all the codes registered upon completion of the hospitalization episode, including those generated by complications during hospitalization. To address this problem, the “Present on Admission” (PoA) flag can be used to indicate that a diagnosis was present at the time the admission order was written. This flag is available in some ICD-9-CM databases, but is mandatory from ICD-10-CM onwards. Adding this flag results in better models for in-hospital mortality prediction [12]. Furthermore, nearly all studies that build a prediction model on ICD data have been restricted in scope to specific pathologies, with the exception of two hospital-wide studies [13, 14]; these, however, include in their analysis all diagnoses and procedures known at the moment of discharge. Likewise, in order to be practically useful in a real-world setting, RoM should also be calculated at admission, rather than on the complete ICD-10-CM coding set after completion of the hospital episode. We are not aware of any studies that have looked at the prediction of in-hospital mortality using ICD codes or RoM categories at admission.

Patients with a “Do Not Resuscitate” (DNR) or a palliative care code at admission are already at high (and intrinsically predictable) risk of mortality. We did not find any articles in the literature that excluded these patients from a hospital-wide model predicting in-hospital mortality. We hypothesize that these diagnoses are of limited relevance in the development of an in-hospital mortality warning system. As such, we aimed to build a model excluding these patients.

A final difference from earlier research is the hospital-wide scope of the model and the application of machine learning (ML) techniques. In recent studies ML techniques have been adopted instead of the more commonly used logistic regression (LR), as ML techniques perform better at prediction than LR [15, 16].

1.4 Objective

The objective of this study is to assess whether in-hospital mortality can be predicted accurately through individual ICD-10-CM codes available at admission, and to compare and evaluate this approach with existing scoring systems based on CCI and RoM. Our analysis quantifies the performance of aggregated CCI and RoM scores versus individual ICD-10-CM codes on a large hospital-wide group of pathologies, excluding DNR and palliative care codes. This may lead to a data-driven, machine learning approach based on nonlinear models, resulting in a clinical decision support system.

2. Materials and methods

2.1 Study population and study variables

The study cohort includes patients discharged between 1 January 2017 and 31 December 2017 at a single center, the 1061-bed Ghent University Hospital, Belgium. Hospitalized patients were excluded from the analysis if they lacked detailed coding due to incomplete records, and also in the case of particular patient groups subject to specific hospital budget rules. The latter mainly refers to patients who stayed for more than half of their hospitalization period in a psychiatric department, for whom no ICD coding is available.

The dataset at admission contains only diagnoses positively flagged as PoA and diagnoses that are always present on admission, such as ‘Z880 Allergy status to penicillin’, ‘Z794 Long term (current) use of insulin’ or ‘I252 Old myocardial infarction’ [17]. Since we hypothesize that DNR and palliative care codes at admission have a strong but, for the purpose of an in-hospital mortality warning system, unhelpful impact, and since predictions for such patients would not be contributive, we also fit models omitting patients with these codes. In order to confirm or invalidate the performance of the predictors, models are also fitted at discharge (with all diagnoses and measures calculated at discharge). One of the differences between RoM and ICD is the principal diagnosis: we also create datasets without the principal diagnosis in order to gain insight into its weight in the models.
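A minimal sketch of this cohort construction follows; the table and column names (dx, admission_id, icd_code, poa_flag) are hypothetical, while Z66 and Z515 (Z51.5 without the dot) are the ICD-10-CM codes for DNR status and an encounter for palliative care.

    # Keep only diagnoses flagged Present on Admission, plus codes that are
    # by definition always present on admission.
    exempt <- c("Z880", "Z794", "I252")   # illustrative subset of such codes
    dx_adm <- dx[dx$poa_flag == "Y" | dx$icd_code %in% exempt, ]

    # Second dataset: drop admissions with a DNR or palliative care code
    # at admission.
    dnr_pal <- c("Z66", "Z515")
    excluded_ids <- unique(dx_adm$admission_id[dx_adm$icd_code %in% dnr_pal])
    dx_adm_excl <- dx_adm[!(dx_adm$admission_id %in% excluded_ids), ]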

The RoM and CCI measures are recalculated on all datasets. The calculation of CCI is straightforward; for RoM we use the 3M algorithm under the license of Ghent University Hospital.

The dependent (outcome) variable is in-hospital mortality. RoM, CCI or the ICD-10-CM diagnoses are taken as predictors. RoM is added as an ordinal variable and CCI as a continuous score; in models with ICD-10-CM diagnoses, the three hierarchy levels (chapter, category, code) are all added as dummy variables. For the example given in Fig 1, this translates into a flag in the column for chapter S, a flag for category S52 and a flag for code S52.521A.
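The following sketch shows one way to build this dummy encoding with the mltools package mentioned in the Methods; the toy data and the aggregation to one row per admission are our own illustration.

    library(data.table)
    library(mltools)  # one_hot() turns factor columns into 0/1 dummy columns

    # One row per admission-diagnosis pair, with the three hierarchy levels
    # as factors (hypothetical toy data).
    dx_levels <- data.table(admission_id = c(1, 1, 2),
                            chapter  = factor(c("S", "Z", "I")),
                            category = factor(c("S52", "Z79", "I25")),
                            full     = factor(c("S52521A", "Z794", "I252")))
    dummies <- one_hot(dx_levels, cols = c("chapter", "category", "full"))

    # Aggregate to one row per admission: a flag is 1 if any diagnosis sets it.
    X <- dummies[, lapply(.SD, max), by = admission_id]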

We only include the diagnosis codes in our ICD models, not the procedure codes. For a planned admission with a single surgical procedure, we can assume that this procedure was the reason for hospitalization, but no such certainty is possible when multiple procedures are performed; as our strategy is to minimize the number of false positives, we omit procedure codes altogether.

2.2 Statistical analysis

We use the Random Forests approach to build the predictive model. This nonlinear method is known to outperform logistic regression, which is more commonly used in medical applications [18]. In comparison with other data-driven approaches, Random Forests tends to perform as one of the best techniques overall for classification problems [3, 4]. Furthermore, on a similar dataset with unplanned readmissions as the outcome variable, Random Forests was found to outperform penalized logistic regression and the gradient boosting machine learning approach [19]. The existing literature was scrutinized to determine which methods have already been proven to deliver solid solutions. We do not use deep learning methods, which are very popular and have been used in recent research [20, 21] dealing with similar outcomes. Such techniques are very suitable for complex features, such as pixels in images, but they are less suited to standard tabular datasets such as the one in this study.

For all models, the data is first split into training (60%), validation (20%), and test (20%) sets in order to prevent overfitting; otherwise, models that simply “remember” the training data (rather than generalizing from it) would be rewarded. The two main parameters of the model are the number of trees (fixed here at one thousand) and the maximum depth, which is tuned. The model can be interpreted using a variable importance plot. We use the implementation described in [22].
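A minimal training sketch with the h2o package follows; the frame and column names are hypothetical, and the max_depth value stands in for whatever the validation set selects.

    library(h2o)
    h2o.init()

    hf <- as.h2o(model_data)   # dummy predictors plus a `mortality` outcome
    hf$mortality <- as.factor(hf$mortality)

    # 60/20/20 split into training, validation and test sets.
    splits <- h2o.splitFrame(hf, ratios = c(0.6, 0.2), seed = 42)
    train <- splits[[1]]; valid <- splits[[2]]; test <- splits[[3]]

    predictors <- setdiff(colnames(hf), "mortality")
    rf <- h2o.randomForest(x = predictors, y = "mortality",
                           training_frame = train, validation_frame = valid,
                           ntrees = 1000,   # fixed, as in the text
                           max_depth = 20,  # tuned on the validation set
                           seed = 42)
    h2o.varimp_plot(rf)  # variable importance plot, as in Fig 3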

We compare four sets of predictors: CCI, RoM, ICD and ICD_noPDX. The last of these, ICD_noPDX, is the set of ICD diagnoses without the principal diagnosis. This is defined as a separate category in order to assess whether the principal diagnosis impacts the models, as RoM is calculated on the basis of all comorbidities without this principal diagnosis. For each set of predictors, we calculate and compare three models: 1) at admission, 2) at admission excluding patients with DNR or palliative care diagnosis code, and 3) at discharge.

We assess the performance of each model with the Area Under the Receiver Operating Characteristic curve (AUCROC) on the test dataset. The AUCROC is typically preferred over other measures in situations where the data is imbalanced, as in our study and other health care datasets [23]. We also calculate the Area Under the Precision-Recall Curve (AUCPR), which mainly focuses on correctly predicting the patients to which the model assigns a high probability of mortality [24]. The ROC curve shows the False Positive Rate on the x-axis and the True Positive Rate on the y-axis, while the PR curve has the True Positive Rate (or Recall) on the x-axis and the Precision (or positive predictive value) on the y-axis.
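Continuing the sketch above, both AUCs can be read from a single performance object evaluated on the held-out test set (h2o.aucpr() is named h2o.pr_auc() in older h2o releases):

    perf <- h2o.performance(rf, newdata = test)
    h2o.auc(perf)    # AUCROC on the test set
    h2o.aucpr(perf)  # AUCPR on the test set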

All analyses are performed using the R Statistical Software, version 3.4.1 with the h2o and mltools packages.

The study was approved by the ethics committee at Ghent University Hospital (Belgian registration no. B670201836838).

3. Results

3.1 Description of the study variables

A total of 36 368 patients were hospitalized and discharged during the study period. After excluding admissions as per protocol, the final study cohort included 34 671 patients, of whom 919 (3%) did not survive. 41% of the included patients belonged to a surgical pathology group. The excluded patients were all admissions in the psychiatry department except for three with incomplete records. 1063 patients had a DNR or palliative care code at admission; after excluding these, 33 608 patients remained in the cohort that was modeled.

Table 1 provides an overview of the characteristics of the survivors and non-survivors, adding age and sex for demographic description (these were not used in the models). For continuous variables we show the median with the first and third quartiles. CCI scores and RoM categories are shown at admission and at discharge. The CCI scores do not differ between admission and discharge, but for non-survivors the distribution of RoM categories at discharge differs from that at admission.

Table 1. Population overview: Characteristics of survivors and non-survivors.

| Population overview | Survivors (N = 33 752; 97%) | Non-survivors (N = 919; 3%) | p-value* |
| Age | 52 [30–67] | 70 [58–80] | <0.001 |
| Sex (% male) | 17 320 (51%) | 543 (59%) | <0.001 |
| % Diagnoses Present on Admission (PoA) | 93% | 80% | <0.001 |
| DNR at admission | 469 (1.5%) | 231 (25%) | <0.001 |
| Palliative care flag at admission | 234 (0.7%) | 286 (31%) | <0.001 |
| CCI at admission | 0 [0–2] | 3 [1–6] | <0.001 |
| CCI at discharge | 0 [0–2] | 3 [1–6] | <0.001 |
| RoM (1–2–3–4) at admission | 71% – 22% – 6% – 1% | 10% – 34% – 40% – 16% | <0.001 |
| RoM (1–2–3–4) at discharge | 70% – 22% – 7% – 1% | 6% – 22% – 40% – 33% | <0.001 |

Data are reported as n (%) or median [1st–3rd quartile], unless otherwise indicated.

* p-values based on Pearson chi-square for categorical variables and the Wilcoxon rank-sum test for continuous variables.

Legend: DNR = Do Not Resuscitate; PoA = Present on Admission flag; CCI = Charlson Comorbidity Index; RoM = Risk of Mortality
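As a minimal sketch of how the p-values in Table 1 are obtained, assuming a per-admission data frame `cohort` with a two-level `died` column (all variable names hypothetical):

    wilcox.test(age ~ died, data = cohort)               # continuous: Wilcoxon rank-sum
    chisq.test(table(cohort$sex, cohort$died))           # categorical: Pearson chi-square
    chisq.test(table(cohort$rom_admission, cohort$died))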

3.2 Models

The resulting AUCs derived from our models are summarized in Table 2, using the four predictor sets at admission (both including and excluding patients with a DNR or palliative care diagnosis code) and at discharge. We also add the number of predictors for each set (row) and the number of records included per model type (column).

Table 2. Model AUC results.

| Predictor set | # predictors (admission) | Admission, all (n = 34 671): AUCROC | AUCPR | Admission, excluding patients with DNR or palliative care diagnosis code (n = 33 608): AUCROC | AUCPR | # predictors (discharge) | Discharge, all (n = 34 671): AUCROC | AUCPR |
| CCI | 1 | 0.7435 | 0.0615 | 0.7015 | 0.0270 | 1 | 0.7471 | 0.0654 |
| RoM | 4 | 0.8797 | 0.1393 | 0.8601 | 0.1086 | 4 | 0.9272 | 0.1979 |
| ICD | 4743 | 0.9477 | 0.4035 | 0.8791 | 0.2476 | 4961 | 0.9774 | 0.5542 |
| ICD_noPDX | 3761 | 0.9340 | 0.3837 | 0.8623 | 0.1911 | 4050 | 0.9671 | 0.5425 |

AUC results for the Random Forests models; each row is a set of predictors. At admission, models were built both on all admissions and excluding admissions with a DNR or palliative care code. At discharge, all diagnoses known for the whole episode were used in the models. As the dataset is imbalanced, the AUCPR is shown as well as the AUCROC. The differences between the predictor sets are larger under AUCPR.

Legend: CCI = Charlson Comorbidity Index; RoM = Risk of Mortality; ICD = International Classification of Diseases; DNR = Do Not Resuscitate; ICD_noPDX = all ICD codes without the principal diagnosis code; AUCROC = Area Under the ROC curve; AUCPR = Area under Precision-Recall Curve

The models using ICD-10-CM codes as predictors outperform the others. The models using CCI as predictors have low AUCROC. The resulting ROC curves are shown in Fig 2A, while the resulting Precision-Recall plots are shown in Fig 2B.

Fig 2. A (Upper panel). ROC curves showing the results using three different sets of predictors in the Random Forests model. The figure on the left shows the ROC curves with all diagnoses known at admission, while the figure on the right shows all diagnoses known at discharge. The ROC curves in the middle are for the models using only the diagnoses known at admission, excluding all admissions with DNR or palliative care codes at admission. B (Lower panel). Precision-Recall plots showing the results using three different sets of predictors in the Random Forests model. The figure on the left shows the PR curves with all diagnoses known at admission, and the one on the right all diagnoses known at discharge. The PR curves in the middle are for the models using only the diagnoses known at admission, excluding all admissions with DNR or palliative care codes at admission. Both panels show low performance using CCI as a predictor for in-hospital mortality, while the models using ICD as predictors perform best overall. Legend: CCI = Charlson Comorbidity Index; RoM = Risk of Mortality; ICD = International Classification of Diseases.

The difference between the models using ICD-10-CM as predictors and those using RoM is smallest when excluding patients with DNR and palliative care codes. When we exclude the principal diagnosis from the ICD predictor set, this set still delivers a better prediction than the aggregated RoM. As hypothesized, the models excluding patients with DNR and palliative care codes at admission have lower AUCROC and AUCPR. The effect of these diagnosis codes is shown in the variable importance plot (Fig 3), where the most important variables for each model are shown next to each other. Due to the imbalanced dataset, the differences between the predictor sets are even more pronounced under AUCPR, in favor of using ICD as the predictor set.

Fig 3. Variable importance plot for the models using ICD-10-CM diagnosis codes as predictors at admission and discharge, either using all diagnosis codes, without the Do Not Resuscitate or palliative care codes at admission, or without these codes and without the principal diagnosis. Whereas ‘Encounter for palliative care’, ‘Do not resuscitate’ and ‘Encounter for other aftercare and medical care’ are the three most important variables in the models with all diagnoses, ‘Cardiac arrest’ is the most important variable in the model with the excluded patients.

4. Discussion

This study establishes that modeling based on all individual ICD-10-CM codes is a better predictor of in-hospital mortality at admission than hitherto used combined scores such as RoM and CCI. The PoA flag available in ICD-10-CM is necessary in order to retain in the model only those diagnoses recognized at admission. RoM does not allow the automatic exclusion of diagnoses not present upon admission, in contrast to ICD coding; this favors the latter in the development of a clinically relevant decision support system. Table 1 already indicates that for non-survivors the diagnoses not present at admission have an impact on the RoM classification. Evidently, the codes for DNR and palliative care, as key components of care paths aimed at humanizing the dying process and avoiding futile care, are intrinsically and very strongly associated with subsequent in-hospital mortality. Hence they clearly need to be excluded from prediction models of in-hospital mortality at admission, as these models should be aimed at providing a relevant support tool for clinical decision making.

We confirm that individual diagnoses perform better than the aggregated measure of the CCI score [11]. Our results show that CCI cannot be considered a robust predictor of in-hospital mortality and should not be further used for this purpose. Moreover, prediction of in-hospital mortality was not the initial purpose of CCI, which was rather developed to assess mortality after one year.

RoM turns out to be suboptimal compared to the full ICD diagnosis set. We have to be aware that RoM is calculated for each individual patient within a given pathology group (APR-DRG); as such, this measure should be considered at the individual level or by pathology group. For the prediction models this requirement is fulfilled, as the patients are handled individually. One advantage of RoM compared to the full set of ICD codes is that it has very easy and intuitive properties. Also, the restriction to only four predictors (instead of almost 5000) saves computation time in building the model. However, an extra calculation step is needed to retrieve the RoM at admission, which is not common practice. This also requires that the necessary 3M license is available in the software package (e.g., the EHR), as the calculation is needed in real time and may come with an extra license cost. One of the key differences between the aggregated RoM score and the ICD predictors is the handling of the principal diagnosis: where RoM applies it for risk adjustment, our ICD models use the principal diagnosis as a full predictor. Yet even the models we constructed excluding the principal diagnosis as a predictor all perform better than the models based on RoM. Another potential explanation for the difference could be the fixed weighting in the RoM calculation, whereas in our ICD models the weighting of the predictors depends on the patient mix used for training.

We found only a few prior hospital-wide studies in the literature predicting in-hospital mortality at admission. One study [20], using deep learning and extracting the data from EHRs into a specific format, achieved an AUCROC of 0.90. The potential of ICD-10-CM diagnoses as predictors in hospital-wide studies has not been sufficiently researched, and as such we can only compare the AUCROC calculated from our model to that based on RoM at the moment of discharge on all patients (RoM AUCROC = 0.93). In this perspective, one study included only non-cardiac surgery patients in its model [10], obtaining an AUCROC of 0.97. In another study including only non-surgical patients [14], an AUCROC of 0.86 was observed. In a preliminary study [13] on the same data set, without the inclusion of laboratory data and without using a penalization factor in the logistic regression, an AUCROC of only 0.81 was achieved. As these studies differ in techniques, patient mix, and predictors, the models are not fully comparable, hampering conclusions. Our models containing all ICD-10-CM diagnosis codes at discharge already have an AUCROC of 0.98, and could be optimized using variables such as age. We believe that our models have good performance with still a large potential for improvement.

In our study, the codes are manually retrieved by ICD coding experts using the information from the EHR. This implies a certain degree of human error and bias within the codes. These codes, however, are also the basis for the calculation of CCI and RoM, and as such the same bias holds for all predictors. We are also aware that the codes for DNR will be biased: there will be a general tendency towards undercoding rather than overcoding, as these codes do not influence the severity of illness of an individual patient and as such have no impact on the financial reimbursement. However, our results show that patients with these codes already have a large effect on the model. As such, we believe that the potentially missing codes will only have a minor effect on our models. The same holds for the PoA flag. This flag will be biased, as human intervention is needed to unflag a diagnosis code. However, the results comparing the measures will remain the same, as the same codes are flagged for all measures. The results of the models may be slightly biased by repeated measures (e.g., if a patient was admitted more than once during the study period), as all admissions are treated equally. To overcome this bias, we would have to remove all admissions prior to the readmissions. In our dataset, only 2.5% of the admissions are readmitted within 30 days.

To optimize the model, we should include administrative variables with proven importance for adjusting mortality risk [25]. We did not include the ICD-10-PCS procedure codes, as we could not distinguish which procedure was known and planned upon admission. RoM implicitly uses procedures, as it is risk-adjusted for the procedures and depends on a pathology group (which may be non-surgical or surgical). Nevertheless, our models with ICD-10-CM diagnosis codes still outperform RoM as a predictor, and could thus only be improved by including procedures. The predictive performance of the risk adjustment models could be further improved, among other factors, through the inclusion of laboratory data, as shown in many studies [12, 14, 18, 26]. It has been shown that a limited set of routine laboratory results upon admission can contribute to risk stratification and independently predict mortality in patients hospitalized with acute heart failure [27]; the inclusion of laboratory data at admission from our EHR thus seems necessary [28]. In many electronic health records these laboratory data can be found as Logical Observation Identifiers Names and Codes (LOINC; http://www.regenstrief.org/resources/loinc/).

The dataset for this study was extracted from a single center, and performance may differ in other institutions with other patient mixes. However, we believe that the conclusions should be independent of the actual patient mix. We should also be aware that in some cases no representative sample is available in our historical data: not only will there be new patients with different case mixes, but ICD-10-CM also has yearly updates with the introduction of new codes. The model thus needs to be refreshed frequently in order to maintain optimal predictions.

In conclusion, a predictive model with trustworthy operating characteristics can be derived from compulsory administrative data. A predictive model containing ICD-10-CM codes outperforms the conventional tools of combined scores. Data available at admission are required to develop a clinically-relevant warning system. An automated system would allow real-time alerts, without increasing workload and additional costs, while improving patient outcomes.

Data Availability

Data have been uploaded to GitHub and are accessible using the following link: https://github.com/descheppermieke/Using-structured-pathology-data-to-predict-hospital-wide-mortality-at-admission.

Funding Statement

For this work Willem Waegeman received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.

References

  • 1. Kim S, Kim W, Park RW. A comparison of intensive care unit mortality prediction models through the use of data mining techniques. Healthcare Informatics Research. 2011;17(4):232–43. doi: 10.4258/hir.2011.17.4.232
  • 2. Clermont G, Angus DC, DiRusso SM, Griffin M, Linde-Zwirble WT. Predicting hospital mortality for patients in the intensive care unit: a comparison of artificial neural networks with logistic regression models. Crit Care Med. 2001;29(2):291–6. doi: 10.1097/00003246-200102000-00012
  • 3. Awad A, Bader-El-Den M, McNicholas J, Briggs J. Early hospital mortality prediction of intensive care unit patients using an ensemble learning approach. International Journal of Medical Informatics. 2017;108:185–95. doi: 10.1016/j.ijmedinf.2017.10.002
  • 4. Allyn J, Allou N, Augustin P, Philip I, Martinet O, Belghiti M, et al. A comparison of a machine learning model with EuroSCORE II in predicting mortality after elective cardiac surgery: a decision curve analysis. PLoS One. 2017;12(1). doi: 10.1371/journal.pone.0169772
  • 5. Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. Journal of Chronic Diseases. 1987;40(5):373–83. doi: 10.1016/0021-9681(87)90171-8
  • 6. Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi JC, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Medical Care. 2005;43(11):1130–9. doi: 10.1097/01.mlr.0000182534.19832.83
  • 7. Sundararajan V, Henderson T, Perry C, Muggivan A, Quan H, Ghali WA. New ICD-10 version of the Charlson comorbidity index predicted in-hospital mortality. Journal of Clinical Epidemiology. 2004;57(12):1288–94. doi: 10.1016/j.jclinepi.2004.03.012
  • 8. 3M. All Patient Refined Diagnosis Related Groups (APR-DRGs), 2003. Available from: https://www.hcup-us.ahrq.gov/db/nation/nis/APR-DRGsV20MethodologyOverviewandBibliography.pdf
  • 9. Baram D, Daroowalla F, Garcia R, Zhang G, Chen JJ, Healy E, et al. Use of the All Patient Refined-Diagnosis Related Group (APR-DRG) Risk of Mortality score as a severity adjustor in the medical ICU. Clin Med Circ Respirat Pulm Med. 2008;2:19–25. doi: 10.4137/ccrpm.s544
  • 10. McCormick PJ, Lin HM, Deiner SG, Levin MA. Validation of the All Patient Refined Diagnosis Related Group (APR-DRG) Risk of Mortality and Severity of Illness modifiers as a measure of perioperative risk. Journal of Medical Systems. 2018;42(5). doi: 10.1007/s10916-018-0936-3
  • 11. Toson B, Harvey LA, Close JCT. The ICD-10 Charlson Comorbidity Index predicted mortality but not resource utilization following hip fracture. Journal of Clinical Epidemiology. 2015;68(1):44–51. doi: 10.1016/j.jclinepi.2014.09.017
  • 12. Pine M, Jordan HS, Elixhauser A, Fry DE, Hoaglin DC, Jones B, et al. Enhancement of claims data to improve risk adjustment of hospital mortality. JAMA. 2007;297(1):71–6. doi: 10.1001/jama.297.1.71
  • 13. Sakhnini A, Saliba W, Schwartz N, Bisharat N. The derivation and validation of a simple model for predicting in-hospital mortality of acutely admitted patients to internal medicine wards. Medicine. 2017;96(25). doi: 10.1097/md.0000000000007284
  • 14. Schwartz N, Sakhnini A, Bisharat N. Predictive modeling of inpatient mortality in departments of internal medicine. Intern Emerg Med. 2018;13(2):205–11. doi: 10.1007/s11739-017-1784-8
  • 15. Shmueli G. To explain or to predict? Statistical Science. 2010;25(3):289–310. doi: 10.1214/10-sts330
  • 16. Couronné R, Probst P, Boulesteix AL. Random forest versus logistic regression: a large-scale benchmark experiment. BMC Bioinformatics. 2018;19(1):270. doi: 10.1186/s12859-018-2264-5
  • 17. CMS. ICD-10-CM Official Guidelines for Coding and Reporting 2017. Available from: https://www.cms.gov/Medicare/Coding/ICD10/Downloads/2017-ICD-10-CM-Guidelines.pdf
  • 18. Sahni N, Simon G, Arora R. Development and validation of machine learning models for prediction of 1-year mortality utilizing electronic medical record data available at the end of hospitalization in multicondition patients: a proof-of-concept study. J Gen Intern Med. 2018;33(6):921–8. doi: 10.1007/s11606-018-4316-y
  • 19. Deschepper M, Eeckloo K, Vogelaers D, Waegeman W. A hospital wide predictive model for unplanned readmission using hierarchical ICD data. Comput Methods Programs Biomed. 2019. doi: 10.1016/j.cmpb.2019.02.007
  • 20. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. npj Digital Medicine. 2018;1. doi: 10.1038/s41746-017-0008-y
  • 21. Min X, Yu B, Wang F. Predictive modeling of the hospital readmission risk from patients' claims data using machine learning: a case study on COPD. Scientific Reports. 2019;9. doi: 10.1038/s41598-019-39071-y
  • 22. H2O.ai. Distributed Random Forest (DRF), 2017. Available from: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html
  • 23. Provost F, Fawcett T. Robust classification for imprecise environments. Machine Learning. 2001;42(3):203–31. doi: 10.1023/a:1007601015854
  • 24. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3). doi: 10.1371/journal.pone.0118432
  • 25. van den Bosch WF, Kelder JC, Wagner C. Predicting hospital mortality among frequently readmitted patients: HSMR biased by readmission. BMC Health Serv Res. 2011;11:11. doi: 10.1186/1472-6963-11-11
  • 26. Pine M, Jordan HS, Elixhauser A, Fry DE, Hoaglin DC, Jones B, et al. Modifying ICD-9-CM coding of secondary diagnoses to improve risk-adjustment of inpatient mortality rates. Medical Decision Making. 2009;29(1):69–81. doi: 10.1177/0272989X08323297
  • 27. Novack V, Pencina M, Zahger D, Fuchs L, Nevzorov R, Jotkowitz A, et al. Routine laboratory results and thirty day and one-year mortality risk following hospitalization with acute decompensated heart failure. PLoS One. 2010;5(8). doi: 10.1371/journal.pone.0012184
  • 28. Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of in-hospital mortality in emergency department patients with sepsis: a local big data-driven, machine learning approach. Academic Emergency Medicine. 2016;23(3):269–78. doi: 10.1111/acem.12876

Decision Letter 0

Juan F Orueta

27 May 2020

PONE-D-20-08968

Using structured pathology data to predict hospital-wide mortality at admission

PLOS ONE

Dear Dr. Deschepper,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

This is an interesting manuscript. Overall, the methods of the study are appropriate, the results are clearly presented and the discussion is well developed. However, there are some points that should be addressed.

  • First, the authors must respond to the questions raised by the reviewer.

  • I have other minor comments. Figure 3 presents the importance of variables for three statistical models. It would be interesting to add a fourth column showing the importance of diagnoses for the ICD_noPDX model (all the diagnoses except the principal one) excluding admissions with a DNR or palliative care code. Also, in Table 1 the row “Present on admission (%)” needs to be clarified. I suppose that these are percentages of diagnoses, but it is somewhat confusing because the columns refer to patients instead of diagnoses.

  • In addition, the PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (https://journals.plos.org/plosone/s/data-availability). If there are ethical or legal restrictions on sharing a sensitive data set, authors should provide further information within their Data Availability Statement

Please submit your revised manuscript by Jul 11 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Juan F. Orueta, MD, PhD

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements:

1.    Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please ensure that your method is described in sufficient detail to meet our criteria on reproducibility http://journals.plos.org/plosone/s/submission-guidelines#loc-methods-software-databases-and-tools.

3. In ethics statement in the manuscript and in the online submission form, please provide additional information about the patient records used in your retrospective study. Specifically, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have data from their medical records used in research, please include this information.

4. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

5. Your ethics statement must appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section. Please also ensure that your ethics statement is included in your manuscript, as the ethics section of your online submission will not be published alongside your manuscript.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This is an interesting and well performed study on the prediction of in-hospital mortality using ICD-10-CM codes. The authors used a retrospective data from patients admitted to a single center in Belgium during one year. The study uses Random Forests approach to build the prediction model. Based on the above data set, a comprehensive model for the prediction of in-hospital mortality was devised. The authors show that their model performs better than RoM and CCI. While the study has merit there are few issues that need to be addressed.

The dataset for hospitalized patients may include repeated-measures data (e.g., if a patient was admitted more than one time during the study period). Some patients are readmitted (roughly 12-15% of patients within one month and perhaps ~25% within 3 months). The authors should address this issue (how this was adjusted for). Only one observation should be obtained per patient, and if not, that should be addressed or discussed.

Minor comments;

1. Please add p values to Table 1.

Reviewer #2: There are many studies on mortality prediction in the hospital, and this study is very interesting. The number of cases is 30,000 or more, which is a large-scale data, which is very useful in data analysis. In addition, it is novel that the analysis excludes DNR cases, and we believe that this study can provide scientifically important information.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Jun 25;15(6):e0235117. doi: 10.1371/journal.pone.0235117.r002

Author response to Decision Letter 0


3 Jun 2020

Comments from the editor and reviewers

Editor

I have other minor comments. Figure 3 presents the importance of variables for three statistical models. It would be interesting to add a fourth column showing the importance of diagnoses for the ICD_noPDX model (all the diagnoses except the principal one) excluding admissions with DNR or palliative care code.

Response: We updated Figure 3 and added the extra column with the model at admission without the principal diagnosis and without the admissions with DNR or palliative care codes at admission.

Figure 3 Variable importance plot for the models using ICD-10-CM diagnosis codes as predictors at admission and discharge, either using all diagnosis codes, without the Do Not Resuscitate or palliative care codes at admission, or without these codes and without the principal diagnosis. Whereas ‘Encounter for palliative care’, ‘Do not resuscitate’ and ‘Encounter for other aftercare and medical care’ are the three most important variables in the models with all diagnoses, ‘Cardiac arrest’ is the most important variable in the model with the excluded patients.

Also, in table1 the row “Present on admission (%)” needs to be clarified. I suppose that are percentages of diagnoses, but it is somehow confuse because the columns are referred to patients instead of diagnoses.

Response: The interpretation of the row is indeed the percentage of diagnoses. We acknowledge that the description is not clear and changed this to “% Diagnoses Present on admission (PoA)”.

In addition, the PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (https://journals.plos.org/plosone/s/data-availability). If there are ethical or legal restrictions on sharing a sensitive data set, authors should provide further information within their Data Availability Statement

Response: We have reviewed the ethical and legal restrictions regarding the data. As the data contain no unique identifiers, we can anonymize them and make them publicly available.

We made a github repository with all datasets and a description in the README file. The link of the repository: https://github.com/descheppermieke/Using-structured-pathology-data-to-predict-hospital-wide-mortality-at-admission

Reviewer #1

This is an interesting and well performed study on the prediction of in-hospital mortality using ICD-10-CM codes. The authors used a retrospective data from patients admitted to a single center in Belgium during one year. The study uses Random Forests approach to build the prediction model. Based on the above data set, a comprehensive model for the prediction of in-hospital mortality was devised. The authors show that their model performs better than RoM and CCI.

Response: We thank the reviewer for the kind words and for the time spent to improve our study.

While the study has merit there are few issues that need to be addressed.

The dataset for hospitalized patients may include repeated-measures data (e.g., if a patient was admitted more than one time during the study period). Some patients are readmitted (roughly 12-15% of patients within one month and perhaps ~25% within 3 months). The authors should address this issue (how this was adjusted for). Only one observation should be obtained per patient, and if not, that should be addressed or discussed.

Response: We thank the reviewer for this comment. We acknowledge that there is a bias due to some readmissions and will add this to the Discussion. We do believe that the bias is limited due to a low readmission rate (< 3% for discharge year 2016; see Deschepper M, Eeckloo K, Vogelaers D, Waegeman W. A hospital wide predictive model for unplanned readmission using hierarchical ICD data. Comput Methods Programs Biomed. 2019. doi: 10.1016/j.cmpb.2019.02.007). We also focus on the comparison of the measures and on showing that ICD-10 codes can be used as a predictor of mortality at admission.

Nevertheless, we do agree that some bias will appear in our models and, as such we add an extra paragraph in the Discussion section:

The results of the models may be slightly biased due to repeated measures (e.g. if a patient was admitted more than one time during the study period), as all admissions are treated equally. To overcome this bias, we should remove all admissions previous to the readmissions. In our dataset only 2.5% of the admissions are readmitted within 30 days.

Minor comments;

1. Please add p values to Table 1.

Response: Upon request we added the p-values to Table 1. All variables are, due to the large dataset, significant.

Table 1 Population overview: characteristics of survivors and non-survivors

| Population overview | Survivors (N = 33 752; 97%) | Non-survivors (N = 919; 3%) | p-value* |
| Age | 52 [30–67] | 70 [58–80] | <0.001 |
| Sex (% male) | 17 320 (51%) | 543 (59%) | <0.001 |
| % Diagnoses Present on Admission (PoA) | 93% | 80% | <0.001 |
| DNR at admission | 469 (1.5%) | 231 (25%) | <0.001 |
| Palliative care flag at admission | 234 (0.7%) | 286 (31%) | <0.001 |
| CCI at admission | 0 [0–2] | 3 [1–6] | <0.001 |
| CCI at discharge | 0 [0–2] | 3 [1–6] | <0.001 |
| RoM (1–2–3–4) at admission | 71% – 22% – 6% – 1% | 10% – 34% – 40% – 16% | <0.001 |
| RoM (1–2–3–4) at discharge | 70% – 22% – 7% – 1% | 6% – 22% – 40% – 33% | <0.001 |

Data are reported as n (%) or median [1st–3rd quartile], unless otherwise indicated.

* p-values based on Pearson chi-square for categorical variables and the Wilcoxon rank-sum test for continuous variables.

Legend: DNR = Do Not Resuscitate; PoA = Present on Admission flag; CCI = Charlson Comorbidity Index; RoM = Risk of Mortality

Reviewer #2

There are many studies on mortality prediction in the hospital, and this study is very interesting. The number of cases is 30,000 or more, which is a large-scale data, which is very useful in data analysis. In addition, it is novel that the analysis excludes DNR cases, and we believe that this study can provide scientifically important information.

Response: We thank the reviewer for the kind words and for the time spent to improve our study.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Juan F Orueta

10 Jun 2020

Using structured pathology data to predict hospital-wide mortality at admission

PONE-D-20-08968R1

Dear Dr. Deschepper,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Juan F. Orueta, MD, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Juan F Orueta

11 Jun 2020

PONE-D-20-08968R1

Using structured pathology data to predict hospital-wide mortality at admission

Dear Dr. Deschepper:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Juan F. Orueta

Academic Editor

PLOS ONE

