AMIA Annual Symposium Proceedings. 2015 Nov 5;2015:1890–1898.

Learning a Severity Score for Sepsis: A Novel Approach based on Clinical Comparisons

Kirill Dyagilev 1, Suchi Saria 1,2
PMCID: PMC4765650  PMID: 26958288

Abstract

Sepsis is one of the leading causes of death in the United States. Early administration of treatment has been shown to decrease sepsis-related mortality and morbidity. Existing scoring systems such as the Acute Physiology and Chronic Health Evaluation (APACHE II) and the Sequential Organ Failure Assessment (SOFA) achieve poor sensitivity in distinguishing between the different stages of sepsis. Recently, we proposed the Disease Severity Score Learning (DSSL) framework, which automatically derives a severity score from data based on clinical comparisons – pairs of disease states ordered by their severity. In this paper, we test the feasibility of using DSSL to develop a sepsis severity score. We show that the learned score significantly outperforms APACHE-II and SOFA in distinguishing between the different stages of sepsis. Additionally, the learned score is sensitive to changes in severity leading up to septic shock and following treatment administration.

Introduction

Sepsis — a whole-body inflammatory response to infection — is one of the leading causes of death in the inpatient setting and is associated with significantly higher costs of care1. The risk of sepsis-related adverse outcomes can be reduced by early treatment2. Thus, changes in a patient's health status are regularly assessed by caregivers (clinicians and nurses) to plan timely interventions. This requires the caregiver to interpret a diverse array of markers (e.g., heart rate, respiratory rate, blood counts, and serum measurements) that reflect the underlying physiologic and metabolic state. Such clinical assessments are time-consuming, require extensive experience, and, more importantly, are prone to missing signs of decline that may be only subtly visible in the observed markers.

In this paper, we address the problem of quantifying (scoring) the latent sepsis severity of a patient at a given time. That is, we derive a mapping from the high-dimensional observed marker data to a numeric score that tracks changes in sepsis severity over time — as health worsens, the score increases, and as the individual's health improves, the score declines. More generally, a means for accurate estimation and tracking of an individual's health status can enable clinicians to detect critical decline, such as decompensations and acute adverse events, in a timely manner. Additional potential benefits include assessing whether an individual is responding to therapy, stratifying patients for resource management, and risk adjustment for clinical research3.

Background

Qualitatively, the concept of a disease severity score has been described as the total effect of disease on the body; the irreversible effect is referred to as damage, while the reversible component is referred to as activity4. The precise interpretation of the concepts of damage and activity is typically based on the application at hand. Desirable properties of a severity scale include: 1) face and content validity, i.e., the variables included are important and clinically credible, and 2) construct validity, i.e., the scoring system parallels an independently ascertained severity measurement4.

Historically, severity scores have been defined and derived in a number of different ways5. One approach is to have clinical experts fully specify the score. Namely, using the existing clinical literature, a panel of experts identifies the factors that are most indicative of the severity of the target disease. These factors are weighted by their relative contribution to the severity and summed together to yield the total resulting score. For example, the Acute Physiology And Chronic Health Evaluation score6 (APACHE II), which assesses the overall health state in an inpatient setting, uses factors that are most predictive of mortality. For instance, a heart rate between 110 and 139 beats per minute adds 2 points to the final score, while a heart rate higher than 180 beats per minute adds 4 points. Similarly, a mean arterial blood pressure between 70 and 109 mm Hg adds no points, while a value between 50 and 69 mm Hg adds 2 points. A number of additional widely used scoring systems have been designed in this way, including the Multiple Organ Dysfunction Score7 (MODS), the Sequential Organ Failure Assessment8 (SOFA), and Medsger's scoring system4.

A second approach commonly taken is to characterize severity as a predictive score associated with the risk of experiencing a target downstream event. That is, high severity states are more likely to be associated with adverse events and higher mortality rates. To train such a score, supervised learning is used with the presence or absence of the downstream adverse event (e.g., septic shock or mortality) as the labels. For example, the pneumonia severity index (PSI) uses this approach to combine 19 factors, including age, vitals, and laboratory test results, to calculate severity as the probability of morbidity and mortality among patients with community-acquired pneumonia9. The relative weight of each factor in the resulting score was derived by training a logistic regression function that predicts the risk of death in the following 30-day window. Others have similarly used downstream adverse events such as development of C. diff infection10, septic shock11, morbidity12, and mortality13 as surrogates for training severity scores. Yet other approaches use probabilistic state estimation techniques14–16.

However, the above-mentioned approaches for the derivation of disease severity scores have their limitations. The expert-based approach captures known clinical expertise well, but it does not extend to populations where current clinical knowledge is incomplete. The approach of regressing against the risk of a downstream adverse event yields scores that suffer from bias due to treatment-related censoring17,18. Specifically, scores learned from data assuming one standard of care, when used under a different standard of care, may lead to erroneous conclusions about the individual's disease severity state. Suppose we have data from a unit where children with a temperature of 102°F are frequently prescribed treatment for the flu. This intervention subsequently cures the patients, and they do not experience any associated adverse events (e.g., sepsis and septic shock) or death. However, it is also known that a child with a temperature of 102°F, if left untreated, is likely to die. Thus, a temperature of 102°F is clinically considered to be a high severity state, which in this unit is frequently treated in a timely manner. Since the learning algorithm uses the presence of a downstream adverse event as a surrogate marker for severity, the rare occurrence of adverse events following 102°F causes the learning algorithm to score a temperature of 102°F as benign. This is problematic for two reasons. First, it is not consistent with how severity is interpreted clinically. Moreover, if such a score were used to guide interventions, it might erroneously cause the clinician to undertreat children with a temperature of 102°F, thus likely worsening outcomes. See Dyagilev and Saria18 for a numerical example.

In recent work18, we proposed an alternative framework, Disease Severity Score Learning (DSSL), for learning severity scores. The DSSL framework leverages the key observation that, while asking experts to quantify disease severity at a given time is challenging, acquiring clinical comparisons – clinical assessments that order the disease severity at two different times – is often easy. From these clinical comparisons, DSSL learns a function that maps the patient's observed feature vectors to a scalar severity score. With some abuse of terminology, we refer to this mapping function as the DSS function. We showed empirically18 that the generalization performance of DSSL is less sensitive to variations in treatment administration patterns than the supervised learning approach described above.

In this paper, we test the feasibility of using DSSL to develop a sepsis severity score. We use the MIMIC-II dataset19, containing electronic health record data from patients admitted to the Beth Israel Deaconess intensive care units between 2001 and 2008. The clinical comparisons required for training the score are generated automatically using the Surviving Sepsis Campaign (SSC) guidelines20. We show that the learned score significantly outperforms APACHE-II and SOFA in distinguishing between the different stages of sepsis. Additionally, the learned score is sensitive to changes in severity leading up to septic shock and following treatment administration. We also show that only a small number of clinical comparisons is needed to obtain a high-quality score. Because DSSL does not rely on a large number of clinical comparisons, it becomes feasible to apply it in domains where no guidelines are available for generating clinical comparisons in an automated manner (as we do in this paper) and comparisons must instead be elicited from experts.

Methods

In this section, we describe the DSSL framework and the L-DSS method for learning linear DSS functions.

General DSSL Framework

We consider longitudinal data routinely collected in a hospital setting. These include covariates obtained at the time of admission such as age, gender, and clinical history; time-varying measurements such as heart rate and respiratory rate; and text notes summarizing the patient's evolving health status. These data are processed and transformed into tuples $\langle x_i^p, t_i^p \rangle$, where $x_i^p \in \mathbb{R}^d$ is a d-dimensional feature vector associated with patient $p \in P$ at time $t_i^p$, for $i \in \{1, \dots, T_p\}$, and $T_p$ is the total number of tuples for patient p. A feature vector $x_i^p$ contains raw measurements (e.g., last measured heart rate or last measured white blood cell count) and features derived from one or more measurements (e.g., the mean and variance of the measured heart rate over the last six hours, or the total urine output in the last six hours per kilogram of weight). Let D denote the set of tuples across all patients in the study. The problem of learning a DSS function is defined by the sets O and S of pairs of tuples from the set D of all tuples, and by the set G of permissible DSS functions. The set O contains pairs of tuples $(\langle x_i^p, t_i^p \rangle, \langle x_j^q, t_j^q \rangle)$ that are ordered by severity based on clinical assessments, i.e., $x_i^p$ corresponds to a more severe state than $x_j^q$. We refer to each of these paired tuples as a clinical comparison and to the set O as the set of all available clinical comparisons. These clinical comparisons can be obtained by presenting clinicians with data $x_i^p$ for patient $p \in P$ at time $t_i^p$ and data $x_j^q$ for patient $q \in P$ at time $t_j^q$. For each such pair of feature vectors, the clinical expert identifies which of these corresponds to a more severe health state; the expert can choose not to provide a comparison for a pair where the severity ordering is ambiguous. These pairs can also be generated in an automated fashion by leveraging existing clinical guidelines. In the Experimental Methods section, we describe how we use an existing guideline in the task of learning a sepsis severity score. The set S contains pairs of tuples $(\langle x_i^p, t_i^p \rangle, \langle x_{i+1}^p, t_{i+1}^p \rangle)$ that correspond to feature vectors taken from the same patient p at consecutive time steps $t_i^p$ and $t_{i+1}^p$. These pairs are used to impose smoothness constraints on the learned severity scores. We thus refer to the pairs in S as the smoothness pairs.
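To make the notation concrete, the following minimal Python sketch shows one way the tuples and the sets O and S might be represented; the class and variable names are illustrative, not part of the original method.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Snapshot:
    """One tuple <x_i^p, t_i^p>: the d-dimensional feature vector x of
    patient `patient_id` recorded at time `t` (hours since admission)."""
    patient_id: str
    t: float
    x: np.ndarray  # shape (d,)


# O: clinical comparisons, each stored as (more_severe, less_severe).
O: List[Tuple[Snapshot, Snapshot]] = []

# S: smoothness pairs, consecutive snapshots of the same patient.
S: List[Tuple[Snapshot, Snapshot]] = []
```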

Finally, the set G contains a parameterized family of candidate DSS functions g that map feature vectors x to a scalar severity score. In this paper we focus on linear DSS functions alone; however, the DSSL framework extends to non-linear DSS functions as well.

In the DSSL framework, our goal is to learn the parameters of the function $g \in G$ that quantifies the severity of the disease state represented by a feature vector x. This is done using an empirical risk minimization approach, i.e., an objective function is constructed that maps functions $g \in G$ to their empirical risk. This objective function contains two key terms. The first term penalizes g for pairs of tuples $(\langle x_i^p, t_i^p \rangle, \langle x_j^q, t_j^q \rangle) \in O$ for which the severity ordering induced by g on the vectors $x_i^p$ and $x_j^q$ is inconsistent with the ground truth clinical assessment. The second term imposes a penalty on high rates of temporal change of the DSS value, thus encouraging selection of a temporally smooth DSS function. The smoothness requirement allows us to learn severity scores that mimic the natural inertia exhibited by biological systems. In what follows, we present the formal definition of this objective function for learning a linear DSS.

Learning a Linear DSS

Below, we describe the L-DSS algorithm for learning linear DSS functions. This algorithm builds on the widely-used soft max-margin training21 technique, which seeks to maximize the distance between pairs that are at different severity levels. We briefly review the key concepts of soft max-margin ranking before describing how they are adapted to the task of learning a linear DSS function.

Soft Max-Margin Ranking

Consider the toy example shown in Figure 1. In this example, we assume that all feature vectors are taken from the same patient and that the times at which these feature vectors are taken are irrelevant. We simplify the notation accordingly. Let D contain the three feature vectors $\{x_1, x_2, x_3\}$, where $x_i \in \mathbb{R}^2$, and let O contain the pairs $(x_2, x_1)$ and $(x_3, x_2)$, i.e., feature vectors $x_2$ and $x_3$ have higher disease severity than $x_1$ and $x_2$, respectively.

Figure 1: Projections of $x_1$, $x_2$, and $x_3$ onto vectors $\bar{w}_1$ and $\bar{w}_2$ representing two candidate ranking functions; the vectors drawn in red, green, and blue identify the projections of $x_1$, $x_2$, and $x_3$, respectively. The ranking is induced by the differences in the projections; for example, $\bar{w}_2$ induces the ordering $g_{w_2}(x_1) > g_{w_2}(x_2)$ because $g_{w_2}(x_1) - g_{w_2}(x_2) > 0$.

Max-margin ranking seeks to find a vector w such that the margin between pairs at different severity levels is maximized. In our example, we show parameter vectors $w_1$, $w_2$, and $w_3$ for three candidate ranking functions in Figure 1. For each feature vector x, the assigned (severity) score for a given ranking function parameter $w_i$ is computed as the projection, $g_{w_i}(x)$, of x onto $w_i$. The induced ranking between two vectors $x_1$ and $x_2$ is computed based on the margin, which is defined as the difference in their projections. In the example shown, the rankings induced by both $g_{w_1}$ and $g_{w_3}$ correctly order all pairs in O, i.e., $g_{w_1}(x_3) > g_{w_1}(x_2) > g_{w_1}(x_1)$ and $g_{w_3}(x_3) > g_{w_3}(x_2) > g_{w_3}(x_1)$, while the ranking induced by $w_2$ does not. Furthermore, $w_3$ also induces an ordering with a larger margin between the pairs in O. Margin maximization leads to an ordering that is more robust with respect to noise in x.

More formally, for each pair of feature vectors $(x_i, x_j) \in O$, we define the margin of their separation by the function $g_w(\cdot)$ as $\mu_{i,j}^w = g_w(x_i) - g_w(x_j)$. The maximum-margin approach suggests that we can improve the generalization and robustness of the learned separator by selecting w that maximizes the number of tuples that are ordered correctly (i.e., $\mu_{i,j}^w > 0$) while simultaneously maximizing the minimal normalized margin $\mu_{i,j}^w / \|w\|$. Using the standard soft max-margin framework, the SVMRank algorithm21 approximates the above-mentioned problem as the following convex optimization program:

$$\min_{w,\,\varsigma_{i,j}^{ord}} \left[ \frac{1}{2}\|w\|^2 + \frac{\lambda_O}{|O|} \sum_{(x_i, x_j) \in O} \varsigma_{i,j}^{ord} \right], \quad \text{subject to the ordering constraints:}$$

$$\forall (x_i, x_j) \in O: \quad g_w(x_i) - g_w(x_j) = w^T(x_i - x_j) \ge 1 - \varsigma_{i,j}^{ord} \quad \text{and} \quad \varsigma_{i,j}^{ord} \ge 0 \qquad (1)$$

Joachims et al.21 solved this optimization program in its dual formulation. Chapelle et al.22,23 proposed to transform the problem in (1) into a twice-differentiable unconstrained convex problem and to solve it in its primal form. To this end, we observe that for every value of w, the optimal values of $\varsigma_{i,j}^{ord}$ are given by $\varsigma_{i,j}^{ord} = \max\{0,\, 1 - w^T(x_i - x_j)\}$. Substituting into (1), we obtain the following unconstrained convex optimization problem:

$$\min_w \left[ \frac{1}{2}\|w\|^2 + \frac{\lambda_O}{|O|} \sum_{(x_i, x_j) \in O} \max\{0,\, 1 - w^T(x_i - x_j)\} \right].$$

The terms of the form max{0, a}, also called the hinge loss, are not differentiable at a = 0. We approximate these terms with the Huber loss $L_h$, for 0 < h < 1, given by

$$L_h(a) = \begin{cases} 0, & \text{if } a < -h \\ (a+h)^2 / (4h), & \text{if } |a| \le h \\ a, & \text{if } a > h \end{cases}$$

It can be seen that $L_h(a)$ is identical to the hinge loss for a < −h and a > h. This approximation yields the following unconstrained, convex, twice-differentiable optimization problem:

$$\min_w \left[ \frac{1}{2}\|w\|^2 + \frac{\lambda_O}{|O|} \sum_{(x_i, x_j) \in O} L_h\!\left(1 - w^T(x_i - x_j)\right) \right]. \qquad (2)$$
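As a concrete illustration, the sketch below evaluates the objective in Eq. (2) in Python; the pair matrices `X_hi` and `X_lo` (stacked feature vectors of the more- and less-severe member of each pair in O) and the default h = 0.5 are hypothetical placeholders, not values from the paper.

```python
import numpy as np


def huber_hinge(a, h=0.5):
    """Huber-smoothed hinge loss L_h: 0 for a < -h, (a+h)^2/(4h) on |a| <= h,
    and a for a > h."""
    a = np.asarray(a, dtype=float)
    out = np.where(a > h, a, 0.0)
    return np.where(np.abs(a) <= h, (a + h) ** 2 / (4 * h), out)


def ranking_objective(w, X_hi, X_lo, lam_O, h=0.5):
    """Eq. (2): 0.5*||w||^2 + (lam_O/|O|) * sum_k L_h(1 - w^T(x_i - x_j)).

    X_hi[k] / X_lo[k] are the more- / less-severe feature vectors of the
    k-th comparison pair in O.
    """
    margins = (X_hi - X_lo) @ w  # w^T(x_i - x_j) for each pair
    return 0.5 * w @ w + lam_O / len(X_hi) * huber_hinge(1.0 - margins, h).sum()
```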

The L-DSS Objective and Optimization Algorithm

We now describe the L-DSS algorithm for learning linear DSS functions. We return to our original setting, where we are given sets O and S that contain feature vectors belonging to multiple patients at varying times.

The L-DSS objective function is obtained by augmenting Eq. (2) with the following term:

$$\sum_{(\langle x_i^p, t_i^p \rangle, \langle x_{i+1}^p, t_{i+1}^p \rangle) \in S} \left[ \frac{w^T(x_{i+1}^p - x_i^p)}{t_{i+1}^p - t_i^p} \right]^2.$$

This term penalizes DSS functions that exhibit large changes in the severity score over short durations, hence encouraging selection of temporally smooth DSS functions. Substituting this term yields the following L-DSS objective function:

$$\text{Objective}_{\text{L-DSS}}: \quad \min_w \; \frac{1}{2}\|w\|^2 + \frac{\lambda_O}{|O|} \sum_{(\langle x_i^p, t_i^p \rangle, \langle x_j^q, t_j^q \rangle) \in O} L_h\!\left(1 - w^T(x_i^p - x_j^q)\right) + \frac{\lambda_S}{|S|} \sum_{(\langle x_i^p, t_i^p \rangle, \langle x_{i+1}^p, t_{i+1}^p \rangle) \in S} \left[ \frac{w^T(x_{i+1}^p - x_i^p)}{t_{i+1}^p - t_i^p} \right]^2 \qquad (3)$$

This optimization program is solved using the Newton-Raphson algorithm.
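A minimal sketch of the full pipeline follows: the objective of Eq. (3), its gradient, and a second-order minimization. The synthetic data, the value h = 0.5, and the use of SciPy's Newton-CG solver (a stand-in for the paper's Newton-Raphson procedure) are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize


def huber_hinge(a, h=0.5):
    a = np.asarray(a, dtype=float)
    out = np.where(a > h, a, 0.0)
    return np.where(np.abs(a) <= h, (a + h) ** 2 / (4 * h), out)


def huber_hinge_grad(a, h=0.5):
    """dL_h/da: 0 below -h, (a + h)/(2h) on [-h, h], 1 above h."""
    g = np.where(a > h, 1.0, 0.0)
    return np.where(np.abs(a) <= h, (a + h) / (2 * h), g)


def ldss_objective(w, D_ord, D_smooth, dt, lam_O, lam_S, h=0.5):
    """Eq. (3). D_ord[k] = x_hi - x_lo for the k-th pair in O;
    D_smooth[m] = x_{i+1} - x_i and dt[m] = t_{i+1} - t_i for pairs in S."""
    a = 1.0 - D_ord @ w
    rates = (D_smooth @ w) / dt
    return (0.5 * w @ w
            + lam_O / len(D_ord) * huber_hinge(a, h).sum()
            + lam_S / len(D_smooth) * (rates ** 2).sum())


def ldss_grad(w, D_ord, D_smooth, dt, lam_O, lam_S, h=0.5):
    a = 1.0 - D_ord @ w
    g_ord = -(huber_hinge_grad(a, h) @ D_ord)
    g_smooth = 2.0 * ((D_smooth @ w) / dt ** 2) @ D_smooth
    return w + lam_O / len(D_ord) * g_ord + lam_S / len(D_smooth) * g_smooth


# Synthetic placeholder data: 42 features as in the paper, random differences.
rng = np.random.default_rng(0)
d = 42
D_ord = rng.normal(size=(200, d))     # x_hi - x_lo for 200 comparison pairs
D_smooth = rng.normal(size=(300, d))  # x_{i+1} - x_i for 300 smoothness pairs
dt = rng.uniform(0.5, 2.0, size=300)  # time gaps in hours

res = minimize(ldss_objective, x0=np.zeros(d), jac=ldss_grad,
               args=(D_ord, D_smooth, dt, 1.0, 1.0), method="Newton-CG")
w_hat = res.x  # learned linear DSS weights
```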

Experimental Methods

In this section, we derive a sepsis severity score by applying the L-DSS method to the large-scale MIMIC-II dataset19. With some abuse of terminology, we refer to the resulting sepsis severity score as the L-DSS-Sepsis score. We assess the quality of the obtained L-DSS-Sepsis score by testing its sensitivity to severity changes of different granularity. In our first experiment, we test whether the L-DSS-Sepsis score is sensitive to significant changes in severity. Specifically, we test whether the L-DSS-Sepsis score can accurately distinguish the different stages of sepsis and order them correctly by severity. We show that on this task the L-DSS-Sepsis score outperforms two clinical scores that are widely used in the ICU setting. In two additional experiments, we analyze the sensitivity of the L-DSS-Sepsis score to finer-grained changes in sepsis severity. In particular, in the second experiment, we show that the L-DSS-Sepsis score is sensitive to changes in severity state leading up to septic shock. This suggests that the L-DSS-Sepsis score can potentially be used for early detection of this adverse event. In the third experiment, we show that the L-DSS-Sepsis score is sensitive to post-therapy changes in severity and hence could potentially be used to assess a patient's treatment response.

The rest of this section proceeds as follows. We begin with a brief overview of MIMIC-II, the dataset used for learning the sepsis severity score. We then describe the clinical guideline for coarse grading of sepsis severity and show how this coarse grading can be used for automatic generation of clinical comparison pairs. Next, we describe how to choose the values of the free parameters of the L-DSS algorithm. We conclude with a detailed description of the three experiments that are used to assess the quality of the learned L-DSS-Sepsis score.

Dataset

We use MIMIC-II19, a publicly available dataset containing electronic health record data from patients admitted to the ICUs at the Beth Israel Deaconess Medical Center from 2001 to 2008. We only include adults (> 15 years old) in our study (N = 16,234). We compute 42 different features derived from vital sign measurements, clinical history variables, and laboratory test results.

Sepsis Severity Grading and Automatic Generation of the Clinical Comparison Pairs

One of the inputs required by the L-DSS algorithm is the set O of clinical comparisons. For sepsis, these pairs can be created automatically by leveraging the coarse severity grading of sepsis proposed in the Surviving Sepsis Campaign (SSC)20. The SSC classifies sepsis development into the following four stages with decreasing levels of severity: septic shock, severe sepsis, SIRS, and "none" (i.e., none of the above). For each of these stages, the guideline defines criteria as a combination of 1) thresholds for individual measurements that capture deviations of the measurement from its normal range, and 2) the presence of specific diagnosis codes or diagnoses noted in the clinical notes. However, not all measurements needed for implementing the SSC criteria may be available at a given time. We thus relax the definition of the SSC guidelines so that it uses measurements made within a short time window prior to the given time. For instance, for fast-changing signals (e.g., blood pressure) we use a two-hour window, and for slower-changing signals (e.g., creatinine) we use an eight-hour window. If no measurements were taken in the designated time window, we consider the sepsis stage to be unknown. We implement these criteria and automatically identify tuples in the dataset where the SSC definitions are met; the windowed lookup is sketched below.
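A minimal sketch of the windowed measurement lookup, assuming hypothetical signal names and per-signal lookback windows (two hours for vitals, eight for slow-changing labs, per the text):

```python
from datetime import timedelta

# Hypothetical per-signal lookback windows (hours): fast-changing vitals get
# short windows; slower-changing labs get longer ones.
LOOKBACK_HOURS = {"systolic_bp": 2, "heart_rate": 2, "creatinine": 8, "wbc": 8}


def last_value_within_window(measurements, signal, now):
    """Most recent value of `signal` recorded within its lookback window
    before `now`, or None if no such measurement exists (stage unknown).

    `measurements` maps signal name -> time-sorted list of (timestamp, value).
    """
    window = timedelta(hours=LOOKBACK_HOURS[signal])
    for ts, value in reversed(measurements.get(signal, [])):
        if now - window <= ts <= now:
            return value
    return None
```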

We leverage the SSC grading of sepsis severity to automatically generate clinical comparisons in the following manner. First, we randomly assign the 16,234 patients in the MIMIC dataset to training (60%) and testing (40%) sets. Within the training set, we assign two thirds of the patients to the development set and the remaining third to the validation set. For each of the development, validation, and testing sets of patients, we generate a separate set O of clinical comparison pairs. This is done by randomly sampling an equal number of pairs of feature vectors $(x_i^p, x_j^q)$ for each of the six combinations of different sepsis stages according to the SSC, i.e., shock-severe, shock-SIRS, shock-none, severe-SIRS, severe-none, and SIRS-none. These pairs include feature vectors taken from the same patient (i.e., p = q) or from different patients (i.e., p ≠ q). Overall, we extracted 30,000 clinical comparison pairs for the validation set and 60,000 clinical comparison pairs for the testing set. For the development set, we extracted several sets O of different sizes, ranging from 120 pairs to 60,000 pairs. We construct the set S of smoothness pairs to contain 80,000 randomly sampled pairs $(x_i^p, x_{i+1}^p)$ of consecutive feature vectors.
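A sketch of the balanced pair sampling over the six ordered stage combinations; `tuples_by_stage`, the function name, and the seed are illustrative placeholders.

```python
import itertools
import random

STAGES = ["shock", "severe", "SIRS", "none"]  # most to least severe


def sample_comparison_pairs(tuples_by_stage, n_per_combo, seed=0):
    """Draw an equal number of (more-severe, less-severe) feature-vector
    pairs for each of the six ordered stage combinations
    (shock-severe, shock-SIRS, ..., SIRS-none).

    `tuples_by_stage` maps stage name -> list of feature vectors graded
    at that stage by the relaxed SSC criteria.
    """
    rng = random.Random(seed)
    pairs = []
    for hi, lo in itertools.combinations(STAGES, 2):  # hi outranks lo
        for _ in range(n_per_combo):
            pairs.append((rng.choice(tuples_by_stage[hi]),
                          rng.choice(tuples_by_stage[lo])))
    return pairs

# e.g., 200 pairs per combination -> 1,200 clinical comparisons in total.
```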

When generating clinical comparisons by automating clinical rules, a natural question that one might ask is whether the learned scores simply recover these clinical rules and thus yield no generalization beyond the SSC grading. This is not the case for two reasons. First, the guideline criteria rely on information captured in notes and diagnosis codes that are not available to the L-DSS method. Second, the temporal smoothness term constrains the learned scores to be smooth and to generalize beyond grading for the coarse severity stages. This issue is covered in greater detail in Section 3.1 of Dyagilev and Saria18.

Selection of Free Parameters

In addition to the sets O and S, the required input of the L-DSS algorithm includes the values of the free parameters $\lambda_O$ and $\lambda_S$. To specify these values, we follow the two-step procedure proposed in Dyagilev and Saria18. First, with $\lambda_S$ set to 0, we select the value of $\lambda_O$ that maximizes the accuracy of the ordering of clinical comparison pairs on the validation set of patients. That is, we count the fraction of clinical comparison pairs in the set O that are ordered by L-DSS in concordance with the ordering prescribed by the SSC. We refer to this quantity as the severity ordering accuracy, or SOA. Then, holding the chosen $\lambda_O$ fixed, we choose the value of $\lambda_S$ that maximizes the SOA of the resulting score. When there are several such values, we pick the largest $\lambda_S$ among these maximizers. Namely, when there are multiple scores that achieve the maximal SOA, we give preference to the smoothest one.
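The SOA metric and the two-step grid search can be sketched as follows; `fit`, the parameter grid, and the validation pair matrices are hypothetical placeholders.

```python
import numpy as np


def severity_ordering_accuracy(w, X_hi_val, X_lo_val):
    """Fraction of validation comparison pairs for which the induced ordering
    w^T x_hi > w^T x_lo agrees with the SSC-prescribed ordering."""
    return float(np.mean((X_hi_val - X_lo_val) @ w > 0))


def select_parameters(grid, fit, X_hi_val, X_lo_val):
    """Two-step procedure: choose lam_O with lam_S = 0, then, holding lam_O
    fixed, take the largest lam_S among the SOA maximizers (i.e., prefer the
    smoothest score). `fit(lam_O, lam_S)` trains L-DSS and returns w."""
    soa = lambda w: severity_ordering_accuracy(w, X_hi_val, X_lo_val)
    lam_O = max(grid, key=lambda l: soa(fit(l, 0.0)))
    scores = {l: soa(fit(lam_O, l)) for l in grid}
    best = max(scores.values())
    lam_S = max(l for l, s in scores.items() if s == best)
    return lam_O, lam_S
```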

Missing Data Imputation

Following Hug24, we used a modified Last Observation Carried Forward (LOCF) method to impute missing data in the feature vectors. Specifically, within a given time window w, the last observed value of the signal was used. The window length w for each signal was set according to the typical sampling interval of that signal. Specifically, the value of w was determined by combining clinical knowledge regarding the signal's typical rate of change with manual inspection of the distribution of durations between consecutive measurements of the signal. If no measurements were taken within the specified window, the missing value was replaced with the population mean.
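A sketch of this windowed LOCF rule using pandas; the time grid, window lengths, and population means are placeholders for the per-signal choices described above.

```python
import pandas as pd


def locf_impute(series, window_hours, population_mean):
    """Carry the last observation forward for at most `window_hours`;
    beyond that (or if nothing was ever observed), fall back to the
    population mean.

    `series`: measurements on the feature-extraction time grid, indexed by
    timestamp, with NaN wherever no measurement was taken.
    """
    limit = pd.Timedelta(hours=window_hours)
    # Timestamp of the most recent observation at or before each grid point.
    last_obs = series.dropna().index.to_series().reindex(series.index).ffill()
    filled = series.ffill()
    stale = (series.index.to_series() - last_obs) > limit
    filled[stale | last_obs.isna()] = population_mean
    return filled
```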

Experiment 1: Distinguishing between the Severity Stages of Sepsis

In our first experiment, we evaluate whether L-DSS-Sepsis can distinguish and correctly order the different stages of sepsis severity. We consider L-DSS-Sepsis scores learned from different numbers of training clinical comparison pairs. For each size of the training set, we report the SOA of the obtained L-DSS-Sepsis score on the testing set. As a baseline for our assessment, we use the severity ordering accuracy of existing clinical scores that are widely used in the ICU3. There exist several general-purpose severity scores that have been validated to assess illness severity and risk of mortality among septic patients5,25–27. In this paper we focus on two of them. The first is the Acute Physiology and Chronic Health Evaluation6 (APACHE II) score, a widely used scoring system for assessing general (not necessarily sepsis-related) disease severity in hospitalized individuals. The second is based on the Sequential Organ Failure Assessment (SOFA) score8, which was originally designed to assess per-organ sepsis-related damage severity. Specifically, we use the total SOFA score, which is the sum of the SOFA scores of all organ systems.

Experiment 2: Is L-DSS-Sepsis sensitive to changes in severity leading up to adverse events?

In the second experiment, we assess whether L-DSS-Sepsis is sensitive enough to capture changes in severity that occur over the time period leading up to an adverse event. To this end, we examine the behavior of L-DSS-Sepsis in the 18-hour period prior to septic shock. We consider all patients with septic shock in our data with at least 18 hours of data prior to septic shock onset (N = 684). For these patients, we define three time intervals of interest: 1) 0–6 hours prior to the onset of septic shock; 2) 6–12 hours prior to the onset of septic shock; 3) 12–18 hours prior to the onset of septic shock. We denote the average values of the learned scores in these intervals by $\bar{s}_{0-6}$, $\bar{s}_{6-12}$, and $\bar{s}_{12-18}$, respectively. We calculate the values of $\Delta_1 = \bar{s}_{0-6} - \bar{s}_{6-12}$ and $\Delta_2 = (\bar{s}_{0-6} - \bar{s}_{6-12}) - (\bar{s}_{6-12} - \bar{s}_{12-18})$ for each patient. Using the standard one-tailed t-test, we assess the p-value $p_{\text{trend-up}}$ for whether the obtained $\Delta_1$ could be observed by chance under the null hypothesis that the $\Delta_1$ are drawn from a zero-mean distribution. Similarly, we assess the p-value $p_{\text{rate acceleration}}$ for whether the obtained $\Delta_2$ could be observed by chance under the null hypothesis that the $\Delta_2$ are drawn from a zero-mean distribution. A sketch of this computation follows.
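A sketch of the per-patient interval averages and the one-sided test; the sampled trajectories and the synthetic delta values are placeholders (SciPy's one-sided `ttest_1samp` requires SciPy 1.6 or later).

```python
import numpy as np
from scipy import stats


def interval_deltas(score, hours_to_shock):
    """Per-patient Delta_1 and Delta_2 from the mean score over the
    0-6h, 6-12h, and 12-18h intervals before septic shock onset.

    `score[k]` is the L-DSS-Sepsis value observed `hours_to_shock[k]`
    hours before onset."""
    s = np.asarray(score, dtype=float)
    t = np.asarray(hours_to_shock, dtype=float)
    m = [s[(t >= lo) & (t < hi)].mean() for lo, hi in ((0, 6), (6, 12), (12, 18))]
    d1 = m[0] - m[1]
    d2 = (m[0] - m[1]) - (m[1] - m[2])
    return d1, d2


# One-tailed t-test of the per-patient deltas against a zero-mean null.
all_d1 = np.random.default_rng(0).normal(0.2, 1.0, 684)  # synthetic placeholder
t_stat, p_trend_up = stats.ttest_1samp(all_d1, popmean=0.0, alternative="greater")
```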

Experiment 3: Is L-DSS-Sepsis sensitive to post-therapy changes in severity?

We now evaluate whether L-DSS-Sepsis is sensitive to changes in severity state due to the administration of a fluid bolus, a treatment used for septic shock20. Towards this, we use the self-controlled case series method. We compare the trends exhibited by L-DSS-Sepsis values over the five-hour intervals prior to and following the administration of the fluid bolus. We refer to the trends over these intervals as $\Delta_{\text{prior}}$ and $\Delta_{\text{post}}$. The value of $\Delta_{\text{prior}}$ is computed as the difference between the value of L-DSS-Sepsis at the time of treatment administration and the mean value of L-DSS-Sepsis over the five-hour interval prior to treatment administration. Similarly, the value of $\Delta_{\text{post}}$ is calculated as the difference between the mean value of L-DSS-Sepsis over the five-hour interval after treatment administration and the value of L-DSS-Sepsis at the moment of treatment administration. If the patient is responsive to fluid therapy, then $\Delta_{\text{treat}} = \Delta_{\text{post}} - \Delta_{\text{prior}} < 0$; that is, if L-DSS-Sepsis was trending up prior to treatment administration, we expect this trend to be attenuated or even reversed by the treatment.

We identify cases of fluid administration events related to sepsis using the following criteria: 1) the patient is experiencing SIRS, severe sepsis, or septic shock at the time of treatment administration, and 2) the patient is hypotensive (has systolic blood pressure below 100 mm Hg), a commonly used criterion for prescribing fluids in sepsis. To avoid confounding due to multiple administrations of fluids, we restrict our attention to treatment administrations that were not preceded or followed by another fluid bolus administration within a five-hour window. This yielded a total of 81 fluid bolus administration events. Employing the one-tailed t-test, we assess the p-value $p_{\text{treatment response}}$ for whether the observed values of $\Delta_{\text{treat}} = \Delta_{\text{post}} - \Delta_{\text{prior}}$ could be observed by chance under the null hypothesis that the $\Delta_{\text{treat}}$ are drawn from a zero-mean distribution. A sketch of this test follows.
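A companion sketch for the treatment-response test, with the same placeholder conventions as above.

```python
import numpy as np
from scipy import stats


def treatment_delta(score, hours_from_bolus):
    """Delta_treat = Delta_post - Delta_prior around a bolus at t = 0.

    Delta_prior: score at t = 0 minus its mean over the prior 5 hours.
    Delta_post: mean score over the following 5 hours minus the score at t = 0.
    """
    s = np.asarray(score, dtype=float)
    t = np.asarray(hours_from_bolus, dtype=float)
    s0 = s[np.argmin(np.abs(t))]  # value at administration time
    d_prior = s0 - s[(t >= -5) & (t < 0)].mean()
    d_post = s[(t > 0) & (t <= 5)].mean() - s0
    return d_post - d_prior


# One-tailed test that Delta_treat is negative across the bolus events.
deltas = np.random.default_rng(0).normal(-0.3, 1.0, 81)  # synthetic placeholder
t_stat, p_treatment = stats.ttest_1samp(deltas, popmean=0.0, alternative="less")
```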

Results and Discussion

In our first experiment, we assess the ability of the L-DSS-Sepsis score to distinguish and correctly order the different stages of sepsis severity. Specifically, we train the L-DSS-Sepsis score on clinical comparison sets O of different sizes and measure the corresponding values of the severity ordering accuracy (SOA). We present the results of this experiment in Figure 2(a). As expected, the SOA is an increasing function of the number of training clinical comparisons. We note that the L-DSS algorithm yields highly accurate scores even with a relatively small number of training examples. In particular, the maximal SOA equals 0.855 and is obtained by L-DSS-Sepsis scores trained on 30,000 clinical comparisons or more. An SOA of 0.847, which is 99% of the maximal SOA, is obtained with 1,200 clinical comparisons or more. The routinely used clinical scores, APACHE-II and Total SOFA, yield SOAs of 0.654 and 0.601, respectively. They are thus outperformed by all L-DSS-Sepsis scores trained on 120 clinical comparisons or more. This performance is significant from a clinical standpoint, as the L-DSS-Sepsis score orders severity states more accurately than the two widely used clinical scores. Moreover, it suggests that the L-DSS algorithm yields high-quality severity scores even when trained on a small number of clinical comparisons. Hence, it can be applied to diseases where clinical comparisons must be created manually by experts and their number should therefore be kept small.

Figure 2: (a) Severity ordering accuracy (SOA) of the L-DSS-Sepsis scores on the testing set as a function of the number of training clinical comparisons. The vertical line marks 1,200 training clinical comparisons. (b) Distribution of the L-DSS-Sepsis values for the different stages of sepsis. (c) Temporal trajectory of the severity score over the time period leading up to septic shock. The vertical line marks the onset of septic shock. (d) Temporal trajectory of the severity score before and after administration of a fluid bolus. The solid vertical line marks the administration of the fluid bolus.

Hereafter, we use the L-DSS-Sepsis score trained on 1,200 clinical comparisons. In Figure 2(b), we plot the probability density of the L-DSS values at the different sepsis severity stages. We observe that, on average, the value of the L-DSS score during septic shock is higher than in the other, less severe stages of sepsis. Since the L-DSS score is temporally smooth, this suggests that the L-DSS trajectory should trend up over the time period leading up to septic shock, i.e., it should be sensitive to changes in severity leading up to adverse events. In our second experiment, we verify this hypothesis. Overall, over 67% (95% confidence interval: 63%–70%) of the 684 observed values of $\Delta_1$ and 57% (95% confidence interval: 53%–60%) of the 684 observed values of $\Delta_2$ were positive. The obtained p-values $p_{\text{trend-up}} < 10^{-21}$ and $p_{\text{rate acceleration}} < 2.25 \times 10^{-2}$ rule out the null hypothesis in favor of the stated hypothesis; that is, the learned scores trend up significantly prior to septic shock, and this trend accelerates over time. As an illustration, in Figure 2(c), we show the L-DSS trajectory of an example patient leading up to the onset of shock. It can be seen that the severity score of this patient exhibits a clear upward trend prior to the septic shock. Early identification of this trend could potentially alert the caregiver to the need for a medical intervention and prevent the impending adverse event.

In our last experiment, we addressed the question of whether the L-DSS score is sensitive to post-therapy changes in severity. Overall, the change of trend $\Delta_{\text{treat}}$ is negative in at least 77% (95% confidence interval: 68%–86%) of the 81 recorded values of $\Delta_{\text{treat}}$. The obtained p-value $p_{\text{treatment response}} < 5 \times 10^{-8}$ rules out the null hypothesis in favor of the stated hypothesis; that is, L-DSS shows a significant response to therapy. As an illustration, in Figure 2(d), we show the L-DSS trajectory of an example patient 10 hours before and after the administration of a fluid bolus. It can be seen that the severity score of this patient trends up prior to treatment and trends down post-treatment.

Conclusion

In this paper, we evaluated the feasibility of automatically learning a score (L-DSS-Sepsis) that tracks the severity of sepsis over time. We validated the learned sepsis severity score using electronic health record data obtained from patients admitted to four different ICUs at an academic medical center over a period of eight years. Compared to the existing illness severity scores APACHE-II and SOFA, the L-DSS-Sepsis score was significantly more accurate in distinguishing between the different stages of sepsis. The L-DSS-Sepsis score also showed face validity: its trajectories were temporally smooth and tended to trend upwards in individuals who progressed to septic shock. Furthermore, the L-DSS-Sepsis score also behaved as expected after fluid administration: as the patient's health improved, the value of the L-DSS-Sepsis score decreased. These experiments suggest that an automated severity tracking score such as L-DSS-Sepsis may enable early interventions and help monitor a patient's responsiveness to therapy.

Our study has several limitations. While the L-DSS-Sepsis score showed desirable behavior in tracking changes in illness severity, additional studies are needed to validate whether it identifies at-risk patients before the caregiver has identified them as such. Similarly, whether the L-DSS-Sepsis score can be used to develop a tool for measuring fluid responsiveness remains to be studied. Finally, a key advantage of the DSSL framework is that it is less prone to practice pattern variations. Data from more than one hospital treating a similar population of patients are needed to further test this hypothesis.

Overall, the discussed results are promising. The DSSL framework is general purpose and can be applied to other disease domains and populations. Furthermore, we are encouraged by the fact that the DSSL framework yields high-quality sepsis severity scores even when given as few as 120 clinical comparisons. This suggests that it can be successfully applied to diseases for which coarse severity grading guidelines are not available and clinical comparisons must instead be obtained by asking experts to annotate them.

References

  • [1]. Kumar G, Kumar N, Taneja A, Kaleekal T, Tarima S, McGinley E, et al. Nationwide trends of severe sepsis in the 21st century (2000–2007). Chest. 2011;140(5):1223–1231. doi: 10.1378/chest.11-0352.
  • [2]. Sebat F, Musthafa AA, Johnson D, Kramer AA, Shoffner D, Eliason M, et al. Effect of a rapid response system for patients in shock on time to treatment and mortality during 5 years. Critical Care Medicine. 2007;35(11):2568–2575. doi: 10.1097/01.CCM.0000287593.54658.89.
  • [3]. Keegan MT, Gajic O, Afessa B. Severity of illness scoring systems in the intensive care unit. Critical Care Medicine. 2011;39(1):163–169. doi: 10.1097/CCM.0b013e3181f96f81.
  • [4]. Medsger T, Bombardieri S, Czirjak L, Scorza R, Rossa A, Bencivelli W. Assessment of disease severity and prognosis. Clinical and Experimental Rheumatology. 2003;21(3 Suppl 29):S42–S46.
  • [5]. Ghanem-Zoubi NO, Vardi M, Laor A, Weber G, Bitterman H. Assessment of disease-severity scoring systems for patients with sepsis in general internal medicine departments. Critical Care. 2011;15(2):R95. doi: 10.1186/cc10102.
  • [6]. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Critical Care Medicine. 1985;13(10):818–829.
  • [7]. Marshall JC, Cook DJ, Christou NV, Bernard GR, Sprung CL, Sibbald WJ. Multiple organ dysfunction score: a reliable descriptor of a complex clinical outcome. Critical Care Medicine. 1995;23(10):1638–1652. doi: 10.1097/00003246-199510000-00007.
  • [8]. Vincent JL, Moreno R, Takala J, Willatts S, De Mendonça A, Bruining H, et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intensive Care Medicine. 1996;22(7):707–710. doi: 10.1007/BF01709751.
  • [9]. Fine MJ, Auble TE, Yealy DM, Hanusa BH, Weissfeld LA, Singer DE, et al. A prediction rule to identify low-risk patients with community-acquired pneumonia. New England Journal of Medicine. 1997;336(4):243–250. doi: 10.1056/NEJM199701233360402.
  • [10]. Wiens J, Horvitz E, Guttag JV. Patient risk stratification for hospital-associated C. diff as a time-series classification task. In: Advances in Neural Information Processing Systems; 2012. pp. 467–475.
  • [11]. Ho JC, Lee CH, Ghosh J. Imputation-enhanced prediction of septic shock in ICU patients. In: Proceedings of the ACM SIGKDD Workshop on Health Informatics; 2012. pp. 21–27.
  • [12]. Saria S, Rajani AK, Gould J, Koller D, Penn AA. Integration of early physiological responses predicts later illness severity in preterm infants. Science Translational Medicine. 2010;2(48). doi: 10.1126/scitranslmed.3001304.
  • [13]. Pirracchio R, Petersen ML, Carone M, Rigon MR, Chevret S, van der Laan MJ. Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. The Lancet Respiratory Medicine. 2015;3(1):42–52. doi: 10.1016/S2213-2600(14)70239-5.
  • [14]. Jackson CH, Sharples LD, Thompson SG, Duffy SW, Couto E. Multistate Markov models for disease progression with classification error. Journal of the Royal Statistical Society: Series D (The Statistician). 2003;52(2):193–209.
  • [15]. Saria S, Koller D, Penn A. Learning individual and population level traits from clinical temporal data. In: Proceedings of Neural Information Processing Systems (NIPS), Predictive Models in Personalized Medicine Workshop; 2010. pp. 1–9.
  • [16]. Wang X, Sontag D, Wang F. Unsupervised learning of disease progression models. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2014. pp. 85–94.
  • [17]. Paxton C, Niculescu-Mizil A, Saria S. Developing predictive models using electronic medical records: challenges and pitfalls. In: AMIA Annual Symposium Proceedings. Vol. 2013. American Medical Informatics Association; 2013. pp. 1109–1115.
  • [18]. Dyagilev K, Saria S. Learning (predictive) risk scores in the presence of censoring due to interventions. Machine Learning, Special Issue on Machine Learning for Health and Medicine. 2015. To appear.
  • [19]. Saeed M, Villarroel M, Reisner AT, Clifford G, Lehman LW, Moody G, et al. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database. Critical Care Medicine. 2011;39(5):952–960. doi: 10.1097/CCM.0b013e31820a92c6.
  • [20]. Dellinger RP, Levy MM, Rhodes A, Annane D, Gerlach H, Opal SM, et al. Surviving Sepsis Campaign: international guidelines for management of severe sepsis and septic shock, 2012. Intensive Care Medicine. 2013;39(2):165–228. doi: 10.1007/s00134-012-2769-8.
  • [21]. Joachims T. Optimizing search engines using clickthrough data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2002. pp. 133–142.
  • [22]. Chapelle O, Keerthi SS. Efficient algorithms for ranking with SVMs. Information Retrieval. 2010;13(3):201–215.
  • [23]. Chapelle O. Training a support vector machine in the primal. Neural Computation. 2007;19(5):1155–1178. doi: 10.1162/neco.2007.19.5.1155.
  • [24]. Hug C. Detecting hazardous intensive care patient episodes using real-time mortality models. PhD thesis, Massachusetts Institute of Technology; 2009.
  • [25]. Barriere SL, Lowry SF. An overview of mortality risk prediction in sepsis. Critical Care Medicine. 1995;23(2):376–393. doi: 10.1097/00003246-199502000-00026.
  • [26]. Jones AE, Trzeciak S, Kline JA. The Sequential Organ Failure Assessment score for predicting outcome in patients with severe sepsis and evidence of hypoperfusion at the time of emergency department presentation. Critical Care Medicine. 2009;37(5):1649–1654. doi: 10.1097/CCM.0b013e31819def97.
  • [27]. Vorwerk C, Loryman B, Coats T, Stephenson J, Gray L, Reddy G, et al. Prediction of mortality in adult emergency department patients with sepsis. Emergency Medicine Journal. 2009;26(4):254–258. doi: 10.1136/emj.2007.053298.
