Abstract
Objective:
Develop and test the performance of electronic versions of the Children’s Hospital of Pittsburgh pediatric risk of mortality (CHP e-PRISM-IV) and pediatric logistic organ dysfunction (CHP e-PELOD-2) scores.
Design:
Retrospective, single-center cohort derived from structured electronic health record data.
Setting:
Large, quaternary pediatric intensive care unit (PICU) at a freestanding, university-affiliated children’s hospital.
Patients:
All encounters with a PICU admission between January 1, 2009 and December 31, 2017, identified using electronic definitions of an inpatient encounter.
Measurements:
The main outcome was the predictive validity of each score for hospital mortality, assessed as model discrimination and calibration. Discrimination was examined with the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). Calibration was assessed with the Hosmer-Lemeshow goodness-of-fit test and calculation of a standardized mortality ratio (SMR). Models were recalibrated with new regression coefficients in a training subset of 75% of encounters selected randomly from all years of the cohort, and the recalibrated models were tested in the remaining 25% of the cohort. Content validity was assessed by examining correlation between the electronic scores and either prospectively collected data (CHP e-PRISM-IV) or an alternative informatics approach (CHP e-PELOD-2).
Results:
The cohort included 21,335 encounters. Correlation coefficients indicated strong agreement between the different methods of score calculation. Uncalibrated AUROCs for inpatient mortality were 0.96 (95% confidence interval 0.95-0.97) for the CHP e-PELOD-2 and 0.87 (0.85-0.89) for the CHP e-PRISM-IV. The uncalibrated CHP e-PRISM-IV SMR was 0.63 (0.59-0.66), demonstrating strong agreement with a previous, prospective evaluation at the study center. The uncalibrated CHP e-PELOD-2 SMR was 0.20 (0.18-0.21). All models required recalibration (all Hosmer-Lemeshow goodness-of-fit P <0.001) and subsequently demonstrated acceptable goodness of fit when examined in a test subset (n=5,334) of the cohort.
Conclusions:
Electronically derived intensive care acuity scores demonstrate very good to excellent discrimination and can be calibrated to institutional outcomes. This approach can facilitate both performance improvement and research initiatives and may offer a scalable strategy for comparing inter-institutional PICU outcomes.
Keywords: acuity score, predictive analytics, quality improvement, clinical informatics, electronic health record
Introduction
In 2001, the United States National Academy of Medicine (then known as the Institute of Medicine) published Crossing the Quality Chasm, which outlined a set of overarching aims to provide direction for a “sweeping redesign” of the US healthcare system.(1) Six aims were proposed in the report, namely that care should be safe, effective, patient-centered, timely, efficient, and equitable, and information technology was envisioned to have a central role in helping health systems meet these criteria. In 2007, the Academy of Medicine working committee on Learning Health Systems expanded this vision to describe an ideal system as being able to rapidly and continuously refine care by seamlessly capturing and analyzing institutional information streams, particularly data derived from electronic health records (EHR).(2) As data-rich, high-risk healthcare arenas, pediatric intensive care units (PICUs) are poised to adopt, and benefit from, the Learning Health Systems model.
The field of pediatric critical care has long relied on data-driven benchmarking to compare outcomes and promote performance improvement initiatives. Robust, composite metrics of patient acuity were first developed more than a quarter-century ago to aid prognostication and the incorporation of illness severity into rigorous cohort analyses.(3) Multi-center quality improvement collaboratives, such as Virtual PICU in North America and the Pediatric Intensive Care Audit Network (PICANet) in Europe, were subsequently established to compile pertinent administrative and clinical data for purposes of reporting risk-adjusted outcomes to inform cycles of continuous improvement.(4, 5) Participation in such collaboratives commonly requires relatively laborious manual extraction of data, such as vital signs, laboratory values, and patient characteristics, to calculate widely used, validated risk of mortality metrics, such as recent versions of the Pediatric Risk of Mortality (PRISM-IV), Pediatric Index of Mortality 3 (PIM3), and Pediatric Logistic Organ Dysfunction (PELOD-2) scores.(6–8)
Many institutions have accumulated rich repositories of highly granular data following the adoption of EHRs in recent years. Automating the generation of validated metrics of pediatric patient acuity from EHR data could cost-effectively speed improvement cycles, promoting rigor in institutional administrative assessments of clinical processes. Accordingly, we aimed to use structured EHR data to construct two previously validated, composite metrics of pediatric illness severity, the PRISM-IV and the PELOD-2, to support our field's ongoing work toward the six aims outlined by the National Academy of Medicine.(6, 8)
Methods
Study design
This is a retrospective cohort study that includes all discharged patients with a PICU admission at our quaternary center between January 1, 2009 and December 31, 2017. The study center is a level 1 pediatric trauma center serving a region of approximately 5 million people, encompassing Western Pennsylvania and bordering states. Approximately 2,500 patients are admitted to the 36-bed PICU per year. For the purposes of this study, only the first PICU admission during a hospitalization was included. The Institutional Review Board of the University of Pittsburgh approved this study (PRO17030743).
Cohort ascertainment
Data were obtained by interrogating a cloned EHR database using the business intelligence platform SAP BusinessObjects (SAP, Paris, France), a graphical user interface that authors structured query language (SQL) code. The cohort was identified by first querying all admissions to a PICU bed space during the study period. Data were extracted as XML files and uploaded to R version 3.4.1 (www.r-project.org) for cleaning and curation. To mitigate error introduced by bed assignments incorrectly attributed to non-inpatient encounters, only encounters with associated pulse oximetry (SpO2) and mean arterial pressure (MAP) values were included in the final cohort, as MAP recordings tend to be limited to the ICU, operating rooms, and emergency departments at our institution, and SpO2 is conventionally the first documented vital sign in the ICU. As documented verification of patient age is performed at registration for an inpatient stay at our institution, and age is also necessary for score calculation, only encounters with a verified age were included in the final cohort. Data distributions were examined with histograms, variables were initially categorized as normal, abnormal, or absent to examine missingness, and the accuracy of query code was verified by manually reviewing a minimum of 5 charts per data element.
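As a minimal illustration of these inclusion rules, the R sketch below filters a hypothetical extract of candidate encounters; all object and column names are illustrative stand-ins, not the study code:

```r
library(dplyr)

# Hypothetical extract: one row per candidate PICU bed assignment, with
# flags derived from the raw EHR query (all names are illustrative).
encounters <- tibble::tibble(
  hospitalization_id = c(1, 1, 2, 3),
  picu_admit_time    = as.POSIXct("2017-01-01 08:00:00") + c(0, 72, 5, 10) * 3600,
  has_spo2           = c(TRUE, TRUE, TRUE, FALSE),
  has_map            = c(TRUE, TRUE, TRUE, TRUE),
  age_verified       = c(TRUE, TRUE, FALSE, TRUE)
)

picu_cohort <- encounters %>%
  filter(has_spo2, has_map, age_verified) %>%      # require SpO2, MAP, and a verified age
  arrange(hospitalization_id, picu_admit_time) %>%
  group_by(hospitalization_id) %>%
  slice(1) %>%                                     # first PICU admission per hospitalization
  ungroup()
```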
Data Mapping and Cleaning
The Children’s Hospital of Pittsburgh (CHP) e-PELOD-2 was constructed per the design of the PELOD-2, incorporating the worst value of 10 measures of organ dysfunction during a hospitalization.(8) Structured data surrogates were identified in the study site EHR for all measures of the 5 organ systems assessed by the PELOD-2. For the neurologic system, the lowest Glasgow Coma Scale score was used irrespective of sedation administration, as determining clinically significant sedative dosing was not possible given the retrospective nature of the data. Pupillary exams were dichotomized as reactive versus unreactive, without incorporating the use of mydriatic medications or pupillary size, as both mitigating factors occur infrequently. MAP values were included only if they were greater than the associated diastolic blood pressure. Creatinine was converted from mg/dL to μmol/L by multiplying by 88.42 μmol/L per mg/dL. The paO2/FiO2 values were calculated by matching the respective paO2 and FiO2 closest in time to one another and including only pairs recorded within 1 hour of each other. The original design of the PELOD-2 included paCO2 as a variable, noting that the value could be obtained from an arterial, capillary, or venous source; we elected to adjust venous pCO2 to the arterial and capillary scale by subtracting 6 mmHg.(9) Use of mechanical ventilation was considered positive if a non-portable ventilator was used during hospitalization, as patients requiring chronic mechanical ventilation with superimposed acute respiratory disease are conventionally transitioned from their portable home ventilator to an inpatient ventilator at the study center. Additionally, most noninvasive ventilation in our PICU is provided using portable ventilators, and we aimed to exclude these patients per the design of the PELOD-2. All values documented as “<” or “>” a range cutoff were converted to the limit of that range; for example, a white blood cell count of “<0.1 x 10⁹/L” was converted to “0.1 x 10⁹/L”. Absent values were considered normal, per the design of both the PRISM-IV and the PELOD-2.(6, 8)
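The conversions described above are straightforward to express in code. The following R helpers are a sketch under the stated assumptions; function and variable names are hypothetical, not the study implementation:

```r
# Creatinine: mg/dL -> micromol/L (multiply by 88.42)
cr_to_umol <- function(cr_mg_dl) cr_mg_dl * 88.42

# Venous pCO2 adjusted to the arterial/capillary scale by subtracting 6 mmHg
adjust_pco2 <- function(pco2_mmhg, source) {
  ifelse(source == "venous", pco2_mmhg - 6, pco2_mmhg)
}

# Censored results ("<0.1", ">500") converted to their range limit
censored_to_limit <- function(x) as.numeric(sub("^[<>]\\s*", "", x))
censored_to_limit(c("<0.1", "12.3", ">500"))  # 0.1 12.3 500.0

# Pair each paO2 with the FiO2 documented closest in time,
# keeping only pairs recorded within 1 hour of each other
pair_pf_ratio <- function(pao2_time, pao2, fio2_time, fio2) {
  idx <- vapply(pao2_time, function(t) which.min(abs(fio2_time - t)), integer(1))
  gap <- abs(as.numeric(difftime(fio2_time[idx], pao2_time, units = "hours")))
  ifelse(gap <= 1, pao2 / fio2[idx], NA_real_)
}
```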
Data cleaning steps for the physiologic and laboratory parameters of the CHP e-PRISM-IV were similar to those for the e-PELOD-2; however, data selection occurred within defined time windows surrounding admission to the PICU, as detailed below. As the design of the PRISM does not require normalizing blood gas values, neither pH nor pCO2 values were adjusted to an arterial scale. Admission source was identified as the patient location immediately preceding PICU admission. Admission diagnoses, recorded as International Classification of Diseases (ICD) versions 9 and 10 codes, were used to identify patients with cancer or a low-risk system of primary dysfunction. All ICD codes are entered directly into the EHR by the primary attending physician throughout a patient’s hospitalization and are registered as “working diagnoses.” Following discharge, health information management coders ascribe the “admission diagnosis” to the code reflective of the condition initially identified as warranting hospitalization, which is most commonly the primary working diagnosis entered by the attending on the day of admission. Similarly, an admission diagnosis of cardiac arrest served as a surrogate for the PRISM criterion of cardiopulmonary resuscitation in the 24 hours prior to admission.
Statistical Analyses
Content Validity of CHP e-PELOD-2 and e-PRISM-IV.
Content validity of the CHP e-PELOD-2 was assessed by comparison to an alternative informatics strategy for e-PELOD-2 calculation, which uses an Informatica toolkit (Informatica Co., Redwood City, CA, USA) to extract, transform, and load data from a dedicated data mart developed to track morbidity and mortality outcomes relevant to the PICU at our institution. The data mart contains discharged patients with a PICU encounter from January 1, 2015 to the present. No Informatica approach currently exists for e-PRISM-IV calculation; content validity of the CHP e-PRISM-IV was therefore evaluated through direct comparison of PRISM neurologic (GCS and pupillary exam) and non-neurologic (all other physiologic and laboratory parameters) subscores against subscores calculated from data previously collected prospectively, with manual chart review, for an existing convenience sample of 105 encounters admitted to the PICU in 2011 and 2012. Subscores were calculated using the point system detailed by Pollack et al.(10) Spearman’s correlation coefficients were used to compare the CHP e-PRISM-IV neurologic and non-neurologic subscores with the prospective PRISM subscores, and the CHP e-PELOD-2 with the Informatica e-PELOD-2.
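A minimal sketch of this comparison in base R, using toy score vectors in place of the study data:

```r
# Toy vectors standing in for the electronic and reference subscores
e_prism_neuro     <- c(0, 2, 5, 7, 10, 4, 3, 1)
prospective_neuro <- c(0, 3, 5, 6, 11, 4, 2, 1)

# Spearman's rank correlation between the two calculation methods
cor.test(e_prism_neuro, prospective_neuro, method = "spearman")
```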
Summary cohort data are displayed as median (interquartile range) for continuous data and n (%) for proportion data. Predictive validity was evaluated by examining model discrimination and calibration for hospital mortality. Score discrimination was assessed by calculating the area under the receiver operating characteristic curve (AUROC) and 95% confidence intervals, in keeping with the assessments of discrimination performed during each score’s original construction.(3, 6, 8) Additionally, the area under the precision-recall curve (AUPRC) was assessed for each score. The AUPRC provides a measurement of precision (or positive predictive value) as it relates to recall (or sensitivity) and is particularly informative in cases of class imbalance, in which the outcome of interest occurs rarely, as is the case for mortality in many modern PICUs.(11, 12) Calibration was assessed with the Hosmer-Lemeshow goodness-of-fit test, dividing predicted values into decile bins, as well as by evaluating the observed-to-predicted mortality ratio, otherwise known as the standardized mortality ratio (SMR), for each score. Predicted probabilities of death were non-normally distributed, and bootstrapping with 10,000 replicates was used to generate 95% confidence intervals for the SMR. Acceptable goodness of fit was defined as a Hosmer-Lemeshow P >0.05. Recalibration was performed by dividing the cohort into train and test datasets in a 3:1 ratio, using the sample function in R to randomly select 75% of the cohort as a training set and assigning the remaining 25% as a test set. New regression coefficients were then calculated for the weighted variables of the CHP e-PELOD-2. For the CHP e-PRISM-IV, new regression coefficients were calculated for each component of the PRISM-IV equation, including the weighted neurologic and non-neurologic subscores, as well as unweighted non-physiologic parameters such as age category.
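These assessments map onto standard R tooling. The sketch below illustrates the AUROC, Hosmer-Lemeshow, and bootstrapped-SMR calculations on simulated data; it assumes the pROC, ResourceSelection, and boot packages and is not the study code:

```r
library(pROC)               # AUROC and confidence intervals
library(ResourceSelection)  # Hosmer-Lemeshow goodness-of-fit test
library(boot)               # bootstrap confidence intervals

# Simulated stand-ins for observed deaths and predicted probabilities
set.seed(1)
p_hat <- rbeta(5000, 0.5, 25)   # right-skewed, like real predicted risks
died  <- rbinom(5000, 1, p_hat)

# Discrimination: AUROC with 95% CI
ci.auc(roc(died, p_hat))

# Calibration: Hosmer-Lemeshow test over decile bins
hoslem.test(died, p_hat, g = 10)

# SMR (observed / predicted deaths) with a bootstrapped 95% CI (10,000 replicates)
dat <- data.frame(died, p_hat)
smr <- function(d, i) sum(d$died[i]) / sum(d$p_hat[i])
boot.ci(boot(dat, smr, R = 10000), type = "perc")
```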
Model performance was assessed by calculating mortality probabilities using the regression coefficients published by each score’s designers, and recalibrated model performance was assessed using the test datasets. A sensitivity analysis examined CHP e-PRISM-IV performance when the window of included variables was extended from the 4-6 hour window surrounding PICU admission (CHP e-PRISM-IV 2+4, per the design of the most recent version of the PRISM) to the first 12 hours (CHP e-PRISM-IV 12) and 24 hours (CHP e-PRISM-IV 24) of a PICU encounter. All analyses were conducted in R version 3.4.1 (www.r-project.org) with RStudio version 1.0.143 (www.rstudio.com). An R Shiny application was built that allows readers to conduct their own secondary analyses, examining base model receiver operating characteristic and calibration plots for the e-PELOD-2 and e-PRISM-IV 2+4 in cohorts defined by specific patient characteristics and alternative outcomes.
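A sketch of the train/test split and logistic recalibration described above, again on simulated data with illustrative variable names:

```r
# Simulated cohort: a weighted score and an outcome loosely tied to it
set.seed(42)
n      <- 21335
score  <- rpois(n, 4)
died   <- rbinom(n, 1, plogis(-6 + 0.6 * score))
cohort <- data.frame(score, died)

# 3:1 split with base R's sample(), as in the text
train_idx <- sample(seq_len(n), size = round(0.75 * n))
train <- cohort[train_idx, ]
test  <- cohort[-train_idx, ]

# New regression coefficients for the weighted score (logistic recalibration)
fit <- glm(died ~ score, family = binomial, data = train)

# Recalibrated predicted probabilities in the held-out test set
test$p_recal <- predict(fit, newdata = test, type = "response")
```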
Results
A total of 21,335 encounters were identified between January 1, 2009 and December 31, 2017 (Figure 1). The distributions of normal, abnormal, and absent laboratory values are displayed in Figure 2; no substantial differences in these distributions were observed across study years. Cohort characteristics are displayed in Table 1. Correlation was evident between the CHP e-PRISM-IV 2+4 subscores and the PRISM subscores calculated from prospectively collected data for 105 encounters, with a correlation coefficient of 0.77 (P<0.001) for the non-neurologic score and 0.74 (P<0.001) for the neurologic score. The data mart included 7,472 e-PELOD-2 scores for comparison with the CHP e-PELOD-2 and demonstrated a correlation coefficient of 0.98 (P<0.001).
Figure 1. CONSORT diagram demonstrating cohort ascertainment.
Figure 2. Normal (blue), abnormal (red), and absent (white) proportions of laboratory results for (a) the CHP e-PELOD-2 and (b) the CHP e-PRISM-IV. Laboratory results for the e-PELOD-2 are collected across an entire hospitalization, whereas results for the e-PRISM-IV are collected from the 2 hours preceding and 4 hours following PICU admission.
Table 1.
Cohort characteristics
| Characteristic | Entire Cohort | Survivors | Non-survivors |
|---|---|---|---|
| Displayed as median [IQR] or n (%) | N = 21,335 | n = 20,928 (98.1) | n = 407 (1.9) |
| Age (years)*,** | 5.4 [1.6, 13.1] | 5.4 [1.6, 13.0] | 7.2 [1.5, 15.7] |
| Female | 9,259 (43.4) | 9,073 (43.4) | 186 (45.7) |
| Race | | | |
| White | 16,275 (76.3) | 15,970 (76.3) | 305 (74.9) |
| Black | 3,719 (17.4) | 3,669 (17.5) | 50 (12.3) |
| Other | 1,341 (6.3) | 1,289 (6.2) | 52 (12.8) |
| Admission Source* | | | |
| Operating room | 4,758 (22.3) | 4,723 (22.6) | 35 (8.6) |
| Emergency department | 7,523 (35.2) | 7,417 (35.4) | 106 (26.0) |
| Referring hospital | 5,756 (27.0) | 5,583 (26.7) | 173 (42.5) |
| Non-ICU inpatient unit | 3,298 (15.5) | 3,205 (15.3) | 93 (22.9) |
| Mechanical Ventilation** | 6,935 (32.5) | 6,561 (31.4) | 374 (91.9) |
| Admission Diagnosis of Cardiac Arrest* | 92 (0.4) | 91 (0.4) | 1 (0.2) |
| Low-Risk Admission Diagnosis* | 1,935 (9.1) | 2,075 (9.9) | 16 (3.9) |
| Cancer Diagnosis* | 381 (1.8) | 362 (1.7) | 19 (4.7) |
| Any Neurologic Dysfunction** | 9,658 (45.3) | 9,261 (44.3) | 397 (97.5) |
| Any Cardiac Dysfunction** | 18,512 (86.8) | 18,118 (86.6) | 394 (96.8) |
| Any Renal Dysfunction** | 4,037 (18.9) | 3,769 (18.0) | 268 (65.8) |
| Any Respiratory Dysfunction** | 7,336 (34.4) | 6,951 (33.2) | 385 (94.6) |
| Any Hematologic Dysfunction** | 4,778 (22.4) | 4,443 (21.2) | 335 (82.3) |
| CHP e-PRISM-IV Predicted Probability of Mortality | 1.1% [0.6%, 2.2%] | 1.0% [0.6%, 2.2%] | 18.2% [4.0%, 58.5%] |
| CHP e-PELOD-2 Predicted Probability of Mortality | 0.9% [0.3%, 5.5%] | 0.9% [0.3%, 5.5%] | 94.2% [79.9%, 99.1%] |

*PRISM-IV criteria
**PELOD-2 criteria
Discrimination and calibration results of the uncalibrated base models and calibrated models are in Table 2. The SMR was 0.63 (95% CI 0.59-0.66) using the CHP e-PRISM-IV uncalibrated base model probability and 0.20 (95% CI 0.18-0.21) using the uncalibrated base model probability of the CHP e-PELOD-2. The Hosmer-Lemeshow goodness of fit test demonstrated P <0.001 for all base models, indicating the need for recalibration. New regression coefficients for all model variables derived with the training data demonstrated acceptable calibration when applied to the test data. Supplemental Figure 1 displays calibration plots for the uncalibrated base model and the calibrated model applied to the test data, as well as the AUPRC curves for the calibrated models. The calibrated CHP e-PRISM-IV models tend to overpredict mortality at higher score values. A secondary analysis tool displays receiver operating characteristic curves and calibration plots of the base model according to user-defined cohort characteristics and outcomes, and is available at https://chp-acuity-scores.shinyapps.io/chpacuityanalysis/ (additional data definitions are in the Supplemental Methods).
Table 2.
Performance of the uncalibrated models in the entire cohort (N=21,335) and performance of the recalibrated models in the test subset cohort (n=5,334).
| Model Characteristic | e-PRISM-IV 2+4 | e-PRISM-IV 12 | e-PRISM-IV 24 | e-PELOD-2 |
|---|---|---|---|---|
| **Uncalibrated Base Model (N = 21,335)** | | | | |
| AUROC (95% CI) | 0.87 (0.85-0.89) | 0.90 (0.88-0.92) | 0.91 (0.89-0.93) | 0.96 (0.95-0.97) |
| Hosmer-Lemeshow P value | <0.001 | <0.001 | <0.001 | <0.001 |
| AUPRC | 0.41 | 0.44 | 0.45 | 0.54 |
| SMR (95% CI) | 0.63 (0.59-0.66) | 0.52 (0.49-0.55) | 0.43 (0.40-0.45) | 0.20 (0.18-0.21) |
| **Calibrated Test Model (n = 5,334)** | | | | |
| AUROC (95% CI) | 0.88 (0.83-0.92) | 0.89 (0.85-0.94) | 0.90 (0.86-0.94) | 0.97 (0.96-0.98) |
| Hosmer-Lemeshow P value | 0.35 | 0.47 | 0.08 | 0.90 |
| AUPRC | 0.35 | 0.41 | 0.46 | 0.60 |
| SMR (95% CI) | 1.00 (0.88-1.08) | 1.02 (0.90-1.10) | 1.02 (0.93-1.11) | 0.93 (0.92-0.96) |

Abbreviations: AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; CI, confidence interval; SMR, standardized mortality ratio
Discussion
We demonstrate the construction of well-performing, widely accepted PICU acuity metrics using only structured EHR data. In doing so, we provide a generalizable approach to score construction that leverages the automaticity of the EHR. The predictive validity of both the CHP e-PRISM-IV and the CHP e-PELOD-2, as measured by discrimination and calibration, supports the utility of these tools for adding rigor to institutional data analyses. Rapid transformation of institutional data into actionable information can facilitate performance improvement initiatives and has the potential to expedite progress in improving overall outcomes in pediatric intensive care. Cyclic reevaluation of clinical processes and outcomes, most often in the structure of a Shewhart cycle, is a common strategy for working toward the six aims of safe, effective, patient-centered, timely, efficient, and equitable care outlined by the Academy of Medicine.(13) Of central importance to iterative examination of PICU data is appropriate consideration of, and adjustment for, illness severity.
Our models performed comparably to their related gold standards. The CHP e-PRISM-IV AUROC of 0.87 (95% CI 0.85-0.89) agrees with the reported AUROCs for the PRISM-IV development and validation cohorts of 0.88 (estimated 95% CI 0.85-0.91, based on the reported standard error of the mean of 0.013) and 0.90 (estimated 95% CI 0.86-0.94, based on the reported standard error of the mean of 0.018).(6) Similarly, the CHP e-PELOD-2 AUROC of 0.96 (95% CI 0.95-0.97) closely agrees with the reported AUROC for the PELOD-2 of 0.94 (95% CI 0.93-0.96).(8) Comparison of the e-PRISM-IV neurologic and non-neurologic subscores to prospectively collected subscores demonstrated good correlation, though it was lower than the correlation obtained by comparing two separate electronic approaches to calculating the CHP e-PELOD-2. In part, this likely reflects the narrower time window for variable selection for the CHP e-PRISM-IV (4 hours for vital sign variables and 6 hours for laboratory variables) as compared to the CHP e-PELOD-2 (an entire hospitalization). PICU admission time was determined electronically based on unit transfer times and the time of first recorded pulse oximetry, which can differ from times apparent in manual chart review or observed at the bedside.
The uncalibrated CHP e-PRISM-IV overpredicted hospital mortality, with an SMR of 0.63 (95% CI 0.59-0.66), demonstrating excellent agreement with previous, prospective assessments of standardized mortality at our institution for approximately 1,400 patients using the PRISM III.(14) This finding is also compatible with the recognized need to recalibrate composite acuity metrics when applied to single-center data or specified populations.(15) Calibration plots of recalibrated CHP e-PRISM-IV scores at our institution indicate increasing overprediction with rising illness severity. These findings reflect the challenges of adequately modeling the complexity, commonly directly related to acuity, of heterogeneous intensive care patients. As a predictive index, PRISM contains a relatively narrow time window of information, proving useful for comparative benchmarking but inadequately capturing individual characteristics, such as genetic risk factors, that might account for disease trajectories diverging from initial expectations. Recognition of the importance of inter-individual variability in treatment response is the basis for recent calls for precision medicine in the ICU.(16) Both the PRISM-IV and the PELOD-2 also lack incorporation of highly invasive therapies such as extracorporeal life support, and instead regress risk across a broader range of therapies. Customized severity of illness scores, or scores refined and recalibrated according to a population of interest, have been demonstrated to outperform parent scores in large, multi-center datasets.(17) Our online secondary analysis tool provides a view of the shifting predictive performance of the CHP e-PRISM-IV and e-PELOD-2 across different patient population characteristics.
Overprediction is also reflected in the relatively modest AUPRCs for each score. AUROCs have traditionally been used to display the discrimination of composite clinical acuity metrics but do not adequately depict performance on imbalanced datasets.(6–8, 18) Receiver operating characteristic curves are constructed with the true positive rate (TPR; sensitivity) as the y-axis and the false positive rate (FPR; 1-specificity) as the x-axis, whereas precision-recall curves are constructed with the TPR as the x-axis and the positive predictive value (PPV) as the y-axis. As mortality in many modern PICUs is approximately 2%, an acuity score may depict a relatively low FPR, resulting in a robust AUROC, while simultaneously demonstrating a more modest PPV, generating a less robust AUPRC. Greater AUPRC values represent better model performance, with perfect performance being a value of 1.0, though there are no widely accepted values to distinguish good versus poor performance. The calibration and AUPRCs of the CHP e-PRISM-IV and CHP e-PELOD-2 do not preclude their inclusion in bedside decision-making, but they emphasize the importance of incorporating the many variables not included in these metrics into clinical assessments of acuity and ultimate care decisions. Two adult intensive care acuity metrics, the Acute Physiology and Chronic Health Evaluation (APACHE) IV score and the Sequential Organ Failure Assessment (SOFA) score, have demonstrated good ongoing discrimination for 24-hour mortality when measured continuously on an hourly basis.(19) When applied to more than half a million adult intensive care encounters, admission APACHE IV and SOFA scores demonstrated AUROCs of 0.85 and 0.81, respectively, for discriminating ICU mortality, and AUPRCs of 0.30 and 0.22, respectively, indicating slightly worse discrimination by AUPRC than the metrics developed in the present study. Composite indicators of patient acuity have proven clinically useful for tasks ranging from patient assessment and communication between providers to staff assignments and unit-level resource allocation.(20–22) In academic environments, novice learners may also derive value from the partial synthesis of patient condition provided by acuity scores such as the CHP e-PRISM-IV and e-PELOD-2, though it would be important to pair the scores with explicit statements of their limitations so that their utility is not misinterpreted.
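A small worked example makes the imbalance point concrete. Assuming, hypothetically, 2% mortality and a score threshold operating at 90% sensitivity and 90% specificity:

```r
# At 2% prevalence, a low false positive rate still swamps the true positives,
# so precision (PPV) is modest even though the ROC operating point looks strong.
prev <- 0.02; sens <- 0.90; spec <- 0.90
tp  <- sens * prev                # expected true-positive fraction
fp  <- (1 - spec) * (1 - prev)    # expected false-positive fraction
ppv <- tp / (tp + fp)
ppv  # ~0.155: an FPR of 10%, yet fewer than 1 in 6 positive calls are deaths
```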
If implemented by other institutions, these scores could facilitate outcomes benchmarking among peer PICUs. Harmonizing electronic data from different institutions is challenging, as differing data structures present barriers to applying standard analysis algorithms without first performing labor-intensive cleanup. Sophisticated examples of overcoming such challenges have recently been reported by health systems in collaboration with computer scientists from Google; however, many institutions do not possess the data science capabilities to readily adopt such an approach.(23) In contrast, the approach outlined in the present work can be readily adapted by other centers with established EHRs and joint teams of clinicians and systems analysts with shared knowledge of health record data structures. While some score variables were mapped to center-specific data elements, timestamped vital signs and laboratory results compose substantial portions of both the CHP e-PRISM-IV and the CHP e-PELOD-2 and are ubiquitous in modern EHRs. We opted to use ICD codes to capture presenting characteristics such as patient origin because these codes are readily available in our institution’s data warehouse and are entered electronically during patient care by attending physicians. Many centers continuously collect electronic data on patient location of origin, in-house transfers, and relevant diagnosis codes. Alternative approaches to substituting for manual chart review could include linking billing databases with the EHR or employing natural language processing (NLP) to extract structured features from admission or progress notes. NLP may outperform ICD codes and has already demonstrated acceptable sensitivity and specificity in identifying deep venous thromboses in pediatric radiology reports, as well as in identifying children with pneumonia in a federated EHR database.(24, 25)
The present study demonstrates the development and performance of two common, composite metrics constructed entirely from structured electronic data. Strengths of the present work include the large, highly granular cohort available for analyses and the excellent performance of the constructed scores. An online R Shiny application for secondary analysis allows readers to further explore the strengths and limitations of each score model across a variety of patient characteristics and outcomes.
Limitations of the present study largely relate to the inclusion of only structured data. Diagnostic codes have demonstrated poor agreement with prospectively collected data in other studies.(26, 27) That physicians were responsible for entering diagnostic codes in the present dataset may have introduced error, as few receive formal training in the use of ICD terminology. The use of admission ICD codes reflects the time of hospital admission rather than PICU admission, though high-acuity diseases present on admission, such as cardiac arrest, are expected to begin hospitalization in the PICU. Additionally, limiting model development to structured data ignores the substantial amount of unstructured data available in the EHR. Our own institution’s cloned EHR, or data warehouse, is composed of approximately 200 tables captured from the nearly 6,000 harbored by the Cerner Millennium database. Innovative computational methods, storage technologies, and algorithms for analyses are necessary to fully leverage such immense troves of information.
In conclusion, we present a readily adaptable approach to constructing validated measures of illness severity for PICU patients using EHR data. Work is ongoing at our center to incorporate these scores into real-time dashboards that will facilitate outcomes tracking, clinical process analyses, and rapid generation of illness severity scores for research purposes. The adaptability of this approach provides a scalable strategy for rapid comparisons of inter-institutional outcomes, compatible with the Academy of Medicine’s vision of a large-scale Learning Health System.
Supplementary Material
Supplemental Figure 1. Calibration plots of uncalibrated models constructed with the entire cohort (a, d, g, and j), calibration plots of calibrated models based on a test subset (n=5,334) of the cohort (b, e, h, and k), and precision-recall curves based on the test subset (c, f, i, and l). Yellow shaded regions indicate 95% confidence intervals. The color legend to the right of each precision-recall curve corresponds to the predicted probability of mortality, ranging from 0 (red) to 1.0 (purple). Abbreviations: AUPRC, area under the precision-recall curve.
Acknowledgements:
We would like to thank the Children’s Hospital of Pittsburgh Foundation for all the support they provide to our institution.
Funding: This work was supported by NIH grant NICHD T32 HD40686 (CMH, PMK), the UPMC Children’s Hospital of Pittsburgh Scientific Fund, and the UPMC Children’s Hospital of Pittsburgh Foundation’s Trust Young Investigator Grant (CMH).
Copyright form disclosure: Dr. Horvat received support for article research from the National Institutes of Health (NIH). Dr. Fink’s institution received funding from PCORI and NIH. Dr. Kochanek received funding from Society of Critical Care Medicine (stipend as Editor-in-Chief of Pediatric Critical Care Medicine). The remaining authors have disclosed that they do not have any potential conflicts of interest.
Footnotes
e-Supplement 1. (https://chp-acuity-scores.shinyapps.io/chpacuityanalysis/)
References
- 1. Institute of Medicine (US) Committee on Quality of Health Care in America: Crossing the Quality Chasm: A New Health System for the 21st Century. Washington (DC): National Academies Press (US); 2001. Available from: http://www.ncbi.nlm.nih.gov/books/NBK222274/
- 2. The Learning Healthcare System: Workshop Summary. Washington, DC: National Academies Press; 2007.
- 3. Pollack MM, Ruttimann UE, Getson PR: Pediatric risk of mortality (PRISM) score. Crit Care Med 1988; 16:1110–1116
- 4. VPICU - The Laura P. and Leland K. Whittier Virtual PICU [Internet]. [cited 2018 Sep 7]. Available from: http://vpicu.net/
- 5. PICANet – Paediatric Intensive Care Audit Network for the UK and Ireland [Internet]. [cited 2018 Sep 7]. Available from: https://www.picanet.org.uk/
- 6. Pollack MM, Holubkov R, Funai T, et al.: The Pediatric Risk of Mortality Score: Update 2015. Pediatr Crit Care Med 2016; 17:2–9
- 7. Straney L, Clements A, Parslow RC, et al.: Paediatric index of mortality 3: an updated model for predicting mortality in pediatric intensive care. Pediatr Crit Care Med 2013; 14:673–681
- 8. Leteurtre S, Duhamel A, Salleron J, et al.: PELOD-2: an update of the PEdiatric logistic organ dysfunction score. Crit Care Med 2013; 41:1761–1773
- 9. Rang LCF, Murray HE, Wells GA, et al.: Can peripheral venous blood gases replace arterial blood gases in emergency department patients? CJEM 2002; 4:7–15
- 10. Pollack MM, Patel KM, Ruttimann UE: PRISM III: An updated Pediatric Risk of Mortality score. Crit Care Med 1996; 24:743
- 11. Davis J, Goadrich M: The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning (ICML '06). Pittsburgh, PA: ACM Press; 2006. p. 233–240. Available from: http://portal.acm.org/citation.cfm?doid=1143844.1143874
- 12. Saito T, Rehmsmeier M: The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 2015; 10:e0118432
- 13. Best M, Neuhauser D: Walter A Shewhart, 1924, and the Hawthorne factory. Qual Saf Health Care 2006; 15:142–143
- 14. Pollack MM, Holubkov R, Funai T, et al.: Simultaneous Prediction of New Morbidity, Mortality, and Survival Without New Morbidity From Pediatric Intensive Care: A New Paradigm for Outcomes Assessment. Crit Care Med 2015; 43:1699–1709
- 15. Marcin JP, Pollack MM: Review of the acuity scoring systems for the pediatric intensive care unit and their use in quality improvement. J Intensive Care Med 2007; 22:131–140
- 16. Maslove DM, Lamontagne F, Marshall JC, et al.: A path to precision in the ICU. Crit Care 2017; 21. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5376689/
- 17. Lee J, Maslove DM: Customization of a Severity of Illness Score Using Local Electronic Medical Record Data. J Intensive Care Med 2017; 32:38–47
- 18. Rothman MJ, Tepas JJ, Nowalk AJ, et al.: Development and validation of a continuously age-adjusted measure of patient condition for hospitalized children using the electronic medical record. J Biomed Inform 2017; 66:180–193
- 19. Badawi O, Liu X, Hassan E, et al.: Evaluation of ICU Risk Models Adapted for Use as Continuous Markers of Severity of Illness Throughout the ICU Stay. Crit Care Med 2018; 46:361–367
- 20. Lambert V, Matthews A, MacDonell R, et al.: Paediatric early warning systems for detecting and responding to clinical deterioration in children: a systematic review. BMJ Open 2017; 7. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5353324/
- 21. Gardner-Thorpe J, Love N, Wrightson J, et al.: The Value of Modified Early Warning Score (MEWS) in Surgical In-Patients: A Prospective Observational Study. Ann R Coll Surg Engl 2006; 88:571–575
- 22. Choi J, Choi JE, Fucile JM: Power up your staffing model with patient acuity. Nurs Manag (Harrow) 2011; 42:40–43
- 23. Rajkomar A, Oren E, Chen K, et al.: Scalable and accurate deep learning with electronic health records. Npj Digit Med 2018; 1:18
- 24. Gálvez JA, Pappas JM, Ahumada L, et al.: The use of natural language processing on pediatric diagnostic radiology reports in the electronic health record to identify deep venous thrombosis in children. J Thromb Thrombolysis 2017; 44:281–290
- 25. Meystre S, Gouripeddi R, Tieder J, et al.: Enhancing Comparative Effectiveness Research With Automated Pediatric Pneumonia Detection in a Multi-Institutional Clinical Repository: A PHIS+ Pilot Study. J Med Internet Res 2017; 19:e162
- 26. Woodworth GF, Baird CJ, Garces-Ambrossi G, et al.: Inaccuracy of the administrative database: comparative analysis of two databases for the diagnosis and treatment of intracranial aneurysms. Neurosurgery 2009; 65:251–256; discussion 256–257
- 27. Khwaja HA, Syed H, Cranston DW: Coding errors: a comparative analysis of hospital and prospectively collected departmental data. BJU Int 2002; 89:178–180