AMIA Annual Symposium Proceedings. 2021 Jan 25;2020:773–782.

Is Deep Reinforcement Learning Ready for Practical Applications in Healthcare? A Sensitivity Analysis of Duel-DDQN for Hemodynamic Management in Sepsis Patients

MingYu Lu 1, Zachary Shahn 2, Daby Sow 2, Finale Doshi-Velez 3, Li-wei H Lehman 1
PMCID: PMC8075511  PMID: 33936452

Abstract

The potential of Reinforcement Learning (RL) has been demonstrated through successful applications to games such as Go and Atari. However, while it is straightforward to evaluate the performance of an RL algorithm in a game setting by simply using it to play the game, evaluation is a major challenge in clinical settings where it could be unsafe to follow RL policies in practice. Thus, understanding sensitivity of RL policies to the host of decisions made during implementation is an important step toward building the type of trust in RL required for eventual clinical uptake. In this work, we perform a sensitivity analysis on a state-of-the-art RL algorithm (Dueling Double Deep Q-Networks) applied to hemodynamic stabilization treatment strategies for septic patients in the ICU. We consider sensitivity of learned policies to input features, embedding model architecture, time discretization, reward function, and random seeds. We find that varying these settings can significantly impact learned policies, which suggests a need for caution when interpreting RL agent output.

Introduction

Artificial intelligence is changing the landscape of healthcare and biomedical research. Reinforcement Learning (RL) and Deep RL (DRL) in particular provide ways to directly help clinicians make better decisions via explicit treatment recommendations. Recent applications of DRL to clinical decision support include estimating strategies for sepsis management1–5, mechanical ventilation control6, and HIV therapy selection7. However, the quality of these DRL-proposed strategies is hard to determine: the treatment strategies are typically learned from retrospective data without access to the unobservable counterfactual reflecting what would have happened had clinicians followed the DRL strategy. Unlike DRL strategies for Atari8 or other games9, which can be evaluated by simply using them to play the game, testing a DRL healthcare strategy via a randomized trial can be prohibitively expensive and unethical.

Thus, it is critical that we find ways to assess DRL-derived strategies prior to experimentation. One particular axis of assessment is robustness. DRL algorithms involve many choices, and if the output treatment policy is highly sensitive to some choice, that may imply either (a) getting that choice right is truly important or (b) we should be cautious about assigning credence to the output policy because a seemingly small and possibly unimportant change can have a large impact on results. In contrast, if the output policy is robust to analysis decisions then any errors are likely due to traditional sources of bias in observational studies (such as unobserved confounding), which can be more readily considered by subject matter experts assessing the credibility and actionability of DRL results.

In this paper, we explore the sensitivity of a particular DRL algorithm (Dueling Double Deep Q-Networks, or Duel-DDQN10,11, a state-of-the-art DRL method that has led to many successes in scaling RL to complex sequential decision-making problems9) to data preparation and modeling decisions in the context of hemodynamic management in septic patients. Septic patients require repeated fluid and/or vasopressor administration to maintain blood pressure, but appropriate fluid and vasopressor treatment strategies remain controversial12,13. Past DRL applications by Komorowski et al1,2, Raghu et al3,4, and Peng et al5 make different implementation decisions while seeking to identify optimal fluid and vasopressor administration strategies in this setting (see additional discussion in Gottesman et al14). However, these works do not perform systematic sensitivity analyses around their choices.

Starting with a baseline model similar to the works above, we perform sensitivity analyses along multiple axes, including: (1) inclusion of treatment history in the definition of a patient’s state; (2) time bin durations; (3) definition of the reward; (4) embedding network architecture; and (5) simply setting different random initialization seeds. In all cases, we find that the Duel-DDQN is sensitive to algorithmic choices. In some cases, we have clear guidance: for example, making sensible decisions about a patient now requires knowing about their prior treatments. In other cases, we find high sensitivity with no clear physiological explanation; this suggests an area for caution and concern.

The paper is organized as follows. We first quickly review background and related work in DRL and sepsis and introduce some notation and terminology used throughout the paper. We then describe the components of a ‘baseline’ DDQN implementation in detail. For select components that we examine in a sensitivity analysis, we discuss why different specifications could be reasonable and how we go about evaluating sensitivity to alternative specifications. Then we present results of our sensitivity analysis and conclude with a discussion highlighting several limitations and pitfalls to avoid when applying DRL in clinical settings.

Background and Related Work

Markov Decision Process (MDP) & Q-Learning

A Markov decision process (MDP) is used to model the patient environment and trajectories. An MDP consists of the following components15:

  • A set of states $S$, plus a distribution of starting states $p(s_0)$.

  • A set of actions $A$.

  • Transition dynamics $T(s_{t+1} \mid s_t, a_t)$ that map a state-action pair at time $t$ onto a distribution of states at time $t+1$.

  • An immediate/instantaneous reward function $r_t = R(s_t, a_t, s_{t+1})$.

  • A discount factor $\gamma \in [0, 1]$, where lower values place more emphasis on immediate rewards.

Every roll-out of a policy accumulates rewards from the environment, resulting in the return $R = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$. The goal of RL is to find an optimal policy $\pi^*$ that achieves the maximum expected return from all states, $\pi^* = \arg\max_{\pi} \mathbb{E}[R \mid \pi]$. Q-learning is one reinforcement learning algorithm for finding $\pi^*$. The basic idea is to evaluate the policy with temporal difference (TD) learning over policy iteration, minimizing the TD error15,16:

$$Q^{\pi}(s,a) \leftarrow Q^{\pi}(s,a) + \alpha\left(r + \gamma\, Q^{\pi}(s',a') - Q^{\pi}(s,a)\right).$$

More formally, Q-learning aims to approximate the optimal action-value function given the observed state $s$ and the action $a$ at time $t$15,16. The future reward $r_t$ is discounted at every time step $t$ by a constant factor $\gamma$:

$$Q^*(s,a) = \mathbb{E}\left[r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, a_t = a, \pi\right].$$
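For concreteness, a minimal Python sketch of this tabular TD update is given below; the toy state and action counts are illustrative and not part of the paper's implementation, which uses the deep networks described next.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one temporal-difference update to the action-value table Q."""
    td_target = r + gamma * np.max(Q[s_next])  # bootstrapped estimate of the return
    td_error = td_target - Q[s, a]             # temporal-difference error
    Q[s, a] += alpha * td_error
    return Q

# Toy usage with 10 states and 4 actions (illustrative only).
Q = np.zeros((10, 4))
Q = q_learning_update(Q, s=0, a=2, r=1.0, s_next=3)
```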

Dueling Double Deep Q-Learning (Dueling DDQN) with Prioritized Experience Replay (PER)

To approximate the optimal action-value function, we can use a deep Q-network $Q(s, a; \theta)$ with parameters $\theta$. To estimate this network, we optimize the following sequence of loss functions at iteration $i$:

$$L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(y_i^{DQN} - Q(s,a;\theta_i)\right)^2\right], \qquad y_i^{DQN} = r + \gamma \max_{a'} Q(s',a';\theta^-),$$

where $\theta^-$ denotes the parameters of the target network, updating parameters by gradient descent such that

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(y_i^{DQN} - Q(s,a;\theta_i)\right)\nabla_{\theta_i} Q(s,a;\theta_i)\right].$$

Dueling Double Deep Q-learning11 is a state-of-the-art deep Q-learning algorithm that combines a 'dueling' architecture, which decouples the value and advantage streams in the deep Q-network11, with double Q-learning, which uses a separate target network to determine the value of the next state and thereby reduces overestimation of action values10. Prioritized experience replay10,17, i.e., sampling mini-batches of experience that have high expected impact on learning, further improves efficiency.
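To make the 'double' part concrete, a small sketch of the Double DQN target computation is shown below; the 25-dimensional action vectors and the reward value are hypothetical, and this is not the authors' code.

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, reward, gamma=0.99, done=False):
    """Double DQN target: the online network selects the next action and the
    target network evaluates it, which reduces overestimation of Q values."""
    a_star = int(np.argmax(q_online_next))          # action chosen by the online net
    bootstrap = 0.0 if done else gamma * q_target_next[a_star]
    return reward + bootstrap

# Example with a 25-dimensional action space (5 fluid bins x 5 vasopressor bins).
q_online_next = np.random.randn(25)
q_target_next = np.random.randn(25)
y = double_dqn_target(q_online_next, q_target_next, reward=0.2)
```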

Learning Sepsis Management with DRL

Sepsis is a life-threatening organ dysfunction caused by a dysregulated host response to infection12. How to maintain septic patients' hemodynamic stability via administration of intravenous (IV) fluids and vasopressors is a key research and clinical challenge12,13. A number of DRL studies have addressed this question in the past few years. Raghu et al3,4 applied a Dueling Q Network with a sparse autoencoder state representation. Peng et al5 further presented a mixture-of-experts framework, combining a kernel method and a Dueling DDQN with PER to personalize sepsis treatment. It is worth noting that these studies used different reward settings. Raghu et al3,4 used hospital and 90-day mortality as a sparse reward issued at the end of patients' trajectories, with a subsequent analysis using short-term rewards such as SOFA score in combination with lactate levels3, while Peng et al5 used changes in the probability of mortality.

Data Description & Cohort

Data for our cohort were obtained from the Medical Information Mart for Intensive Care (MIMIC-III v1.4)18 database. The data set contained all MIMIC-III patients meeting Sepsis-3 criteria19 from years 2008-2012. It comprised 7,956 patients with 649,661 clinical event entries. In this analysis, we extracted static features (e.g., demographics), past treatment history, and hourly summaries (mean, maximum, and minimum within each hour) of vital signs and laboratory values within each patient's first 72 hours of ICU stay. For intravenous fluids, we extracted the IV fluids commonly used for resuscitation (colloids and crystalloids). The detailed features are shown in Table 1. All measured values were standardized, and we carried forward covariate values from the most recent measurement. The data set was split using 80% for training and validation and 20% for testing.

Table 1:

Details of attributes including vital signs, laboratory values, and treatment history

Demographic: Age, Weight, Height, Ethnicity
Vital Signs\Laboratory Values: GCS, Heart rate, Temperature, Respiratory Rate, Diastolic Blood Pressure, Systolic Blood Pressure, Mean Arterial Blood Pressure, Potassium, Sodium, Chloride, Magnesium, Calcium, Anion gap, Hemoglobin, Hematocrit, WBC, Platelets, Bands, PTT, PT, INR, Arterial pH, SpO2, FiO2, PaO2, Total CO2, pCO2, Arterial Base Excess, Bicarbonate, Arterial Lactate, SOFA score, Glucose, Creatinine, BUN, Total Bilirubin, Indirect Bilirubin, Direct Bilirubin, AST, ALT, Total Protein, Troponin, CRP, Elixhauser Score, Albumin
Treatment: Vasopressor & IV Fluid

Methods: Baseline Implementation

We first describe our baseline implementation of a Dueling DDQN with PER to learn an optimal resuscitation strategy. This implementation combines elements from several published RL applications to sepsis treatment4,5, with slight modifications. In describing the baseline, we also illustrate the many components involved in specification of a Dueling DDQN analysis. In our sensitivity analysis, we will systematically vary these components from our baseline model.

Time Discretization

We divided patient data into one-hour bins. To avoid inappropriately adjusting for covariate values that were measured after treatment actions were taken20, we performed a time-rebinning procedure. If a treatment action occurred within an existing time bin, covariate measurements made after the treatment action in that bin were moved to the following bin and the time of the treatment action became the new endpoint of the time bin. Figure 1 illustrates this process with the blue bar covering 1.75 hr to 2.75 hr defining the time period for which covariate measurements contribute to the bin ending at 2.75 hours. (Time rebinning is rare in RL literature but necessary to avoid adjusting for post-treatment variables.)

Figure 1: Time discretization.
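A simplified pandas sketch of this re-binning step is shown below, assuming a per-patient event table with hypothetical columns 'time_hr' and 'kind'; it handles only the first treatment action per bin and is not the extraction code used in the paper.

```python
import pandas as pd

def rebin(events: pd.DataFrame, bin_hours: float = 1.0) -> pd.DataFrame:
    """Assign events to time bins, then push covariate measurements recorded
    after a treatment action within the same bin into the following bin, so
    that state summaries never contain post-treatment information."""
    events = events.sort_values("time_hr").copy()
    events["bin"] = (events["time_hr"] // bin_hours).astype(int)

    for b, grp in events.groupby("bin"):
        actions = grp.loc[grp["kind"] == "action", "time_hr"]
        if actions.empty:
            continue
        cutoff = actions.iloc[0]  # the action time becomes the bin's new endpoint
        late_obs = grp[(grp["kind"] == "obs") & (grp["time_hr"] > cutoff)].index
        events.loc[late_obs, "bin"] = b + 1  # move post-treatment observations forward
    return events

# Example: an observation at 2.4 hr recorded after an action at 2.2 hr is moved
# out of the bin containing the action and into the following bin.
demo = pd.DataFrame({"time_hr": [2.1, 2.2, 2.4], "kind": ["obs", "action", "obs"]})
print(rebin(demo))
```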

Compressing Patient Histories

We follow Peng et al.5 in encoding patient states recurrently using an LSTM autoencoder representing the cumulative history for each patient. An LSTM can summarize sequential data into a fixed-length vector through an encoder and then reconstruct it into its original sequential form through a decoder21. The summarized information can be used to represent time series features23. LSTM-RNN models mitigate vanishing and exploding gradients and are commonly used to capture long-term sequence structure24.

Action definition and Treatment Discretization

Following Raghu et al.4 and Peng et al.5, we focus on intravenous fluids and vasopressors as the actions of the MDP. We computed the hourly rate of each treatment as the action, summing rates when treatment events of the same type overlapped. The hourly rate of each treatment is divided into five bins defined by quartiles of dosages under current physician practice. Accordingly, a 5 by 5 action space is defined for the medical interventions4. An action of (0, 0) means no treatment is given, whereas (4, 4) represents top-quartile dosages of both fluids and vasopressors.
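A sketch of this dose discretization is shown below, assuming (as in Raghu et al4) that bin 0 corresponds to no treatment and bins 1-4 to quartiles of the nonzero hourly rates observed under current physician practice; the dose values are illustrative.

```python
import numpy as np
import pandas as pd

def discretize_doses(hourly_rate: pd.Series) -> pd.Series:
    """Map hourly dose rates to 5 bins: 0 = no treatment, 1-4 = quartiles of
    the nonzero rates observed in the data."""
    nonzero = hourly_rate[hourly_rate > 0]
    edges = nonzero.quantile([0.25, 0.5, 0.75]).values  # quartile boundaries
    bins = np.digitize(hourly_rate, edges) + 1          # 1..4 for nonzero rates
    bins[np.asarray(hourly_rate == 0)] = 0
    return pd.Series(bins, index=hourly_rate.index)

# A joint action indexes the 5 x 5 grid, e.g. action_id = 5 * fluid_bin + vaso_bin.
fluid_bin = discretize_doses(pd.Series([0.0, 50.0, 120.0, 400.0]))
vaso_bin = discretize_doses(pd.Series([0.0, 0.05, 0.10, 0.30]))
action_id = 5 * fluid_bin + vaso_bin
```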

Reward formulation

We follow Peng et al.5 in defining the reward at time t as the change in negative log-odds of 30-day mortality between t and t + 1 according to a predictive model for 30-day mortality. The probability of mortality was estimated with a 2-layer neural network (50 and 30 hidden units, L1 regularization) given the recurrent embedding of the compressed history at the corresponding time. Let f(o) be the probability of mortality given observations through the current time point o and f(o′) be the probability of mortality given observations through the next time step. Then we define the reward

$$r(o, a, o') = -\log\frac{f(o')}{1 - f(o')} + \log\frac{f(o)}{1 - f(o)}. \qquad (1)$$
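A direct translation of reward (1) into code, where the mortality probabilities f(o) and f(o′) are assumed to come from the embedding-based mortality model described above:

```python
import numpy as np

def short_term_reward(p_now: float, p_next: float) -> float:
    """Change in negative log-odds of 30-day mortality between consecutive
    time steps; a drop in predicted mortality yields a positive reward."""
    def logit(p: float) -> float:
        return np.log(p / (1.0 - p))
    return logit(p_now) - logit(p_next)

# Example: predicted mortality falls from 0.30 to 0.25, giving a positive reward.
r = short_term_reward(p_now=0.30, p_next=0.25)
```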

Dueling DDQN Architecture

Following Raghu et al4, our final Dueling Double Deep Q-Network (Dueling DDQN)10,11 with PER has two hidden layers of size 128 with batch normalization after each, Leaky-ReLU activation functions, a split into equally sized advantage and value streams, and a projection onto the action space that combines these two streams. The Duel-DDQN architecture divides the value function into the value of the patient's underlying physiological condition, called the Value stream, and the value of the treatment given, called the Advantage stream.
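A rough PyTorch sketch of such a dueling head is given below; the single-layer value and advantage streams, the mean-subtracted combination of the two streams (following Wang et al11), and the state dimension are simplifying assumptions rather than the exact baseline network.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling Q-network head loosely following the baseline description:
    two hidden layers of 128 units with batch normalization and Leaky-ReLU,
    followed by separate value and advantage streams."""

    def __init__(self, state_dim: int, n_actions: int = 25, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.BatchNorm1d(hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.LeakyReLU(),
        )
        self.value = nn.Linear(hidden, 1)              # V(s): patient's condition
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a): treatment choice

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        v = self.value(h)
        a = self.advantage(h)
        # Combine streams with the mean advantage subtracted so V and A are identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

# Example forward pass on a batch of 32 embedded patient states of dimension 64.
q_values = DuelingQNetwork(state_dim=64)(torch.randn(32, 64))
```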

Methods: Sensitivity analysis

In this section, we describe the ways in which we altered the baseline Dueling DDQN implementation described in the previous section in our sensitivity analysis. For each component that we varied, we specify the alternatives we considered, explain why the choice could be important, and also explain why each alternative might be considered reasonable. We emphasize that the aim of this sensitivity analysis is not to determine which choices are best, as there is no ground truth available to make such a determination. Rather, it is to understand the robustness of the learned treatment policy with respect to a priori reasonable-seeming alternatives. In particular, we explore the effects on learned policies of: including treatment history in the state definition; varying the time bin size; varying the reward specification; specifying different recurrent embedding models; and setting different random seeds.

Including Past Treatment History

One decision is whether to include the history of past treatments in the representation of patient state. Several prior works3,5,6 did not do so, but given the Markov assumption on which DQNs rest, this amounts to assuming that past treatments cannot impact future outcomes through pathways that do not run through measured covariates included in the state. This assumption will usually be false. For example, vasopressors have potentially serious long term cardiovascular side effects13, but the added risk after administering more vasopressors would not be captured in short term changes in measured patient covariates. Thus, in our sensitivity analysis we compare learned policies from DQN implementations with state summaries that include and exclude treatment history. In this analysis, we consider the cumulative dosage of all previous treatment (IV fluid and vasopressor) until t – 1 as a proxy for treatment history at t.
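One simple way to construct this proxy from a per-bin dosage table is sketched below; the column names are hypothetical and the snippet is only meant to illustrate 'cumulative dose up to t − 1'.

```python
import pandas as pd

def add_treatment_history(df: pd.DataFrame) -> pd.DataFrame:
    """Append the cumulative dosage of each treatment up to, but not including,
    the current time bin, as a proxy for treatment history at time t.
    Assumes df is sorted by time within each icustay_id."""
    for drug in ["iv_fluid", "vasopressor"]:
        cum_through_current = df.groupby("icustay_id")[drug].cumsum()
        df[f"{drug}_hist"] = cum_through_current - df[drug]
    return df

# Example: three consecutive bins for a single ICU stay.
demo = pd.DataFrame({"icustay_id": [1, 1, 1],
                     "iv_fluid": [100.0, 0.0, 250.0],
                     "vasopressor": [0.0, 0.05, 0.05]})
print(add_treatment_history(demo))
```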

Duration of Time Bins

When applying a discrete time RL algorithm to data with actions taken and measurements recorded in continuous time, an implementation decision that inevitably arises is how to bin time into discrete chunks in which to define patient states St. With infinite data, shorter time bins would be superior for two reasons. First, the state at each time step reflects the patient’s condition closer to when the treatment action at that time step was decided. This improves confounding adjustment, since states are more reflective of information that actually influenced treatment decisions. Second, more time steps allow for more flexible and dynamic learned strategies that are more responsive to changes in patient state. However, with finite data, the capacity to learn more flexible strategies with shorter time bins can be detrimental, leading to instability of the estimated optimal policy.

In the case of sepsis, past work has used 4 hour time bins4. Treatment decisions in this clinical context are made on a finer time scale than 4 hours, which is why we defined 1 hour time bins in our baseline model. But stability is also a major concern in this dataset, so either choice is defensible. Hence, we compared the learned policies of Dueling DDQNs fit to 1 hour and 4 hour time binned data sets in our sensitivity analysis.

Horizon of Rewards

A key decision is specifying the reward function. Ideally, the reward function would summarize the entire long term utility of the stay, as this is what we really seek to optimize. However, for reasons of practicality, researchers often choose short term rewards measured at each time step. When rewards are short term, there is more 'signal' in that it is easier to estimate associations between rewards and actions. If an RL algorithm is truly robust, the learned policies would be broadly similar whether we choose to optimize our true reward of interest or a shorter term proxy. To investigate the impact of using long term versus short term rewards, we also compared reward functions that were weighted mixtures of long and medium term information about outcomes, for a range of weights.

We define a utility-based reward U as follows. Let M be the worst possible SOFA score. Let Y be the observed SOFA score at the end of the stay. Let S = 1 if the patient survived more than 1 year after admission, and 0 otherwise. Let H be hours survived after admission. Let C be a constant that controls the relative weight assigned to SOFA score at the end of stay versus survival. We define r′(C) as

$$U = \begin{cases} \log\!\left(1 + \dfrac{M - Y}{C}\right) & \text{if } H \ge 24 \times 365, \\[6pt] \log\!\left(\dfrac{H}{24 \times 365} + 1\right) & \text{otherwise.} \end{cases} \qquad (2)$$

For large values of C, survival time is all that matters. For low values of C, patient state at the end of the stay matters a lot for patients who survive more than 1 year. For all values of C, rewards are medium to long term (since they are based on patient state at the end of the stay or later, as opposed to the following time bin), but differing C levels reflect different subjective prioritizations of patient health outcomes. We compare learned DQN policies using the short term reward (1) with the long term rewards (2) for varying values of C.
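A literal implementation of utility (2), based on the reconstruction above, with M, Y, H, and C as defined in the text; the example values (including M = 24 as the worst SOFA score) are illustrative only.

```python
import numpy as np

HOURS_PER_YEAR = 24 * 365

def long_term_utility(M: float, Y: float, H: float, C: float) -> float:
    """Terminal utility U: patients surviving beyond one year are scored by
    end-of-stay SOFA (down-weighted by C); others by the fraction of a year
    they survived."""
    if H >= HOURS_PER_YEAR:
        return float(np.log(1.0 + (M - Y) / C))
    return float(np.log(H / HOURS_PER_YEAR + 1.0))

# Examples: a survivor with end-of-stay SOFA 6, and a patient who died at day 30.
u_survivor = long_term_utility(M=24, Y=6, H=2 * HOURS_PER_YEAR, C=10)
u_nonsurvivor = long_term_utility(M=24, Y=6, H=30 * 24, C=10)
```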

Choice of Embedding Model

Another question is how to summarize patient history. In DQNs, it is important for the information contained in the state St at each time t to satisfy several conditions. First, St should satisfy the “sequential exchangeability” assumption22, which would be satisfied if St contains all relevant information about variables influencing treatment decisions at time t and associated with future rewards, i.e. St should contain sufficient information to adjust for confounding. If sequential exchangeability fails, then estimates of the impact of actions on future rewards will be biased, and therefore the estimate of the optimal treatment strategy will be biased as well.

To learn an optimal treatment strategy, it is also important that St contain relevant information about variables that are effect modifiers. An effect modifier is a variable with the property that the conditional average effect of an action on future rewards varies with the variable’s value. Good treatment rules assign treatment based on the value of effect modifiers. (Effect modifiers may or may not be confounders, which are necessary to include in the model to avoid bias but may not be good inputs to treatment decision rules.)

Finally, DQNs make a very strong Markov assumption on states8,16. St must be defined to be sufficiently rich that this Markov assumption can approximately hold. Thus, to allow for realistic long term temporal dependencies, states at each time should be rich summaries of patient history. Without a priori knowledge of exactly which aspects of patient history to retain (to adjust for confounding, model effect modification, and satisfy the Markov assumption), a reasonable strategy is to define patient states as embeddings generated by an RNN23,25.

However, RNN embeddings are not optimized to retain the types of information specifically required to be contained in DQN states. Different choices of black box embedding method may generate states that satisfy the DQN requirements to varying degrees and produce different learned policies, with no principled way to choose between them.

In our sensitivity analysis, we compare two common RNN embedding models: long short-term memory (LSTM)21 and gated recurrent units (GRU)26. The two architectures have been shown to perform comparably across a range of tasks24. Each consists of two hidden layers of 128 hidden units and is trained with mini-batches of 128 using the Adam optimizer for 200 epochs or until convergence. We have no a priori reason to believe that either option would produce more suitable embeddings than the other, and the point of comparing them is to determine whether the decision is important.
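A sketch of the encoder half of such a recurrent embedding model is shown below, with the recurrent cell swappable between LSTM and GRU as in our comparison; the decoder used for autoencoder training is omitted and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """Recurrent encoder producing a fixed-length embedding of a patient's
    observation history; the cell type (LSTM vs. GRU) is interchangeable."""

    def __init__(self, input_dim: int, hidden: int = 128, cell: str = "lstm"):
        super().__init__()
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(input_dim, hidden, num_layers=2, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, time, features); the last hidden output of the top
        # layer serves as the state embedding passed to the Dueling DDQN.
        output, _ = self.rnn(history)
        return output[:, -1, :]

# Example: embed 16 patient histories of 72 hourly bins with 40 features each.
embedding = HistoryEncoder(input_dim=40, cell="gru")(torch.randn(16, 72, 40))
```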

Random restarts

Finally, solving a DDQN is a non-convex optimization, and thus random restarts are frequently used to find a good local optimum. As Henderson et al27 report, a wide range of results can be obtained from the same deep RL algorithm depending on the random initialization of the network's weights. To observe the impact of random weight initialization in our dataset, we fit our baseline model repeatedly using different seeds. While one would generally simply take the best of the random restarts as the solution, high variation across random restarts might mean that reproducing a result will be more challenging because the problem has many diverse local optima.
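Schematically, the restart procedure just re-seeds the random number generators and repeats training, as in the sketch below; train_fn is a placeholder for the full Dueling DDQN training loop and is assumed to return a learned policy and its estimated value.

```python
import numpy as np
import torch

def train_with_restarts(train_fn, n_restarts: int = 5, seeds=None):
    """Re-run training under different random seeds and collect the results."""
    seeds = list(seeds) if seeds is not None else list(range(n_restarts))
    results = []
    for seed in seeds:
        np.random.seed(seed)
        torch.manual_seed(seed)  # controls the network's weight initialization
        results.append(train_fn(seed))
    return results
```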

Methods: Evaluation Metrics & Experimental Settings

Metrics Previous works have used off-policy estimators, such as Weighted Doubly Robust (WDR28), to estimate the quality of a proposed policy. However, these estimators can have high variance as well as bias. Instead, we compare policies based on the distribution of actions they recommend. If these distributions are very different from each other or from clinicians (whom we know act reasonably), then that may be a cause for skepticism.

Specifically, for each time point in the patient history in the test set, we compute the Dueling DDQN's recommended action. The action distribution for a policy is simply the frequency with which each of the 25 combinations of vasopressor and fluid doses is recommended across all person-times in the test set. We use heat maps like Figure 2 to display action distributions, where the y axis represents the IV fluid dosage (quartile), the x axis represents the vasopressor dosage (quartile), and the color density represents the frequency with which the treatment action is recommended across all person-times in the test set. These action distributions are aggregates in that we sum over all time points and patients. Comparison of aggregate action distributions between different experimental settings can provide insights into the ways in which policies differ.

Figure 2: Action distribution.
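A minimal sketch of how such an aggregate action distribution can be computed from per-person-time recommendations; the bin indices below are hypothetical.

```python
import numpy as np

def action_distribution(fluid_bins, vaso_bins) -> np.ndarray:
    """Aggregate recommended (fluid, vasopressor) bin pairs over all
    person-times into a 5 x 5 relative-frequency matrix for a heat map."""
    counts = np.zeros((5, 5))
    for f, v in zip(fluid_bins, vaso_bins):
        counts[int(f), int(v)] += 1
    return counts / counts.sum()

# Example: recommendations at four person-times.
freq = action_distribution([0, 0, 2, 4], [0, 1, 1, 4])
```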

Parameters and Optimization

We train the Dueling DDQN for 100,000 steps (except in the reward horizon experiment, where we perform early stopping at 15,000 steps to prevent over-fitting) with a batch size of 30. We conducted 5 restarts for every experimental setting (except the random restart experiment, where we look at variation across individual restarts). Following Peng et al5, of the policies resulting from the 5 restarts we choose the one with the highest value as estimated by a weighted doubly robust off-policy evaluation method28. For models trained with long term rewards, r′(C), where the WDR estimator is infeasible, we selected a policy from the 5 restarts based on the Q-value.

Results

We altered aspects of the baseline Dueling DDQN implementation described in the previous section and compared the resulting learned policies according to their action distributions. In the following, we abbreviate a Dueling DDQN trained with a given embedding and time bin as DQN-embedding-bin. For example, the Dueling DDQN trained with an LSTM and 1-hour binned data is called DQN-LSTM-1hr.

Treatment History: Excluding treatment history leads to aggressive treatment policies

Here, we compare treatment strategies output by Dueling DDQNs that do and do not include treatment history in patient state representations (Figure 3). The DQN-LSTM-1hr trained without treatment history recommends nonzero vasopressor doses at all time points. It frequently recommends high doses of each treatment compared to physicians and to an agent trained with treatment history included in the state definition. Excluding past treatment information increases the average recommended dosages of vasopressors and fluids by 1.6-1.8 times and 1.7-3.1 times, respectively, in the test set. We hypothesize an explanation for this behavior in the Discussion section.

Figure 3: The action distribution of Dueling DDQNs trained with and without past treatment history information. Note that the agent trained without treatment history aggressively prescribes high (3rd and 4th quartile) dosages of vasopressors and IV fluid; low dosage treatments are rarely recommended.

Time bin durations: Longer time bins result in more aggressive policies.

Figure 4 illustrates that while different ways of segmenting time do not affect the clinician action distributions (by definition), they have a large effect on the Dueling DDQN action distributions. With 4 hour bins, the DQN-LSTM increased the frequency of nonzero vasopressor doses by 40% and decreased the overall usage of IV fluid alone, a less aggressive action, by 34% compared to the baseline setting.

Figure 4: Comparison of action distributions between DQN-LSTM with 4-hour and 1-hour time bins. (Note that the same color density does not represent the same count in 4 hour bins and 1 hour bins.) The 4 hour bins lead to much more frequent recommendations of high vasopressor doses by the DQN-LSTM, while the physician policy remains conservative.

Rewards: Long-term objectives lead to more aggressive and less stable policies

In Figure 5, we see that longer term objectives resulted in more aggressive policies, specifically in more frequent high fluid doses than our short term baseline reward for all values of C (for C = 100, the effect is difficult to see in the heat map, but the agent administered the maximum fluid dosage 40% of the time). Policies also vary considerably across levels of emphasis on medium versus long term outcomes determined by the value of C. We noticed higher variation across random restarts in the long term reward settings than in the short term baseline settings (see Figure 8). This could indicate that optimization is more challenging and unstable for long term rewards.

Figure 5: Comparison of Duel-DDQN policies trained with the short term reward and with long/intermediate term rewards across varying values of C.

Figure 8: (a) Q value distributions across different seeds. (b) Comparison between long term and short term (baseline) objectives. Note that the long term reward implementation shows higher variance than the baseline. The x axis shows the coefficient of variation (relative standard deviation), cv = σ/μ, where σ is the standard deviation and μ is the mean of the action frequency across random restarts for each of the 25 discretized actions (as in Figure 2).

Embedding model: High sensitivity to architecture

Results comparing LSTM and GRU embeddings can be found in Figure 6. We can observe that both our baseline (LSTM) implementation and the GRU implementation recommended nonzero doses of vasopressors significantly more frequently than physicians. However, the GRU implementation was more aggressive, recommending nonzero fluid doses significantly more often than both physicians and the DQN-LSTM.

Figure 6: GRU implementation: compared to the baseline (LSTM), the most frequent treatment recommended by DQN-GRU-1hr was a medium dosage of vasopressors without IV fluid. Exclusion of treatment history: DQN-GRU administers maximum dosages of both vasopressor and IV fluid most of the time. Time bin duration: 4-hour bins increased the overall usage of IV fluid by 20% and more than doubled the use of the maximum IV fluid dosage.

The choice of embedding architecture also interacts with other analysis settings, whose effects differ depending on embedding architecture. We illustrate interactions with treatment history and time segmentation.

Exclusion of Treatment History Excluding prior treatment history has an even more extreme effect with the GRU embedding architecture, with maximum dosages of both treatments being recommended most of the time.

Different time segmentation 4 hour time bins led to more frequent high vasopressor doses in the baseline LSTM implementation, but more frequent high fluid doses in the GRU implementation.

There is no way to apply clinical, physiological, or statistical knowledge to reason about which embedding architecture is more appropriate as they are conceptually quite similar. Thus, the variation stemming from the choice of embedding is a source of concern.

Random Restarts: DRL policies have many local optima

Our final sensitivity analysis looked at variation across restarts of the algorithm, which assesses sensitivity to where the algorithm was initialized. While the substantial differences in action distribution between the Dueling DDQN and physician policies remained constant across seeds in our baseline model, there was still much variation in the resulting action distributions, especially for vasopressors (Figure 7a). In Figure 7b, we see that despite these differences, the estimated values of these policies are similar; this demonstrates that the variation is not because the optimization sometimes landed in a poor optimum, but because there are many optima with similar estimated quality that lead to qualitatively different policies. This is another cause for concern, as it suggests that the agent has no way of telling if any of these very different policies are better than the others.

Figure 7: (a) Treatment distributions across random restarts in the baseline setting. While the variance for IV fluid is small, the distribution of vasopressor doses varies across seeds. (b) Comparison of WDR estimator values across random seeds in each setting (* denotes 4 hour time bins). When treatment history is excluded, agents are highly sensitive to the seed.

We also found that policies trained with medium and long-term objectives were much more sensitive to the random seed. Figure 8b depicts the distribution over the action space of the variance, across 5 random restarts, of the frequency with which each action was recommended. Despite the similar estimated Q values across random seeds (Figure 8a), the variances were much greater for the implementations using long term rewards. The presence of many local optima increases variance and makes it more challenging to differentiate policies in implementations with long term rewards.

Subgroup Analysis: Grouping by SOFA score finds DQN agents are under-aggressive in high risk patients and over-aggressive in low risk patients

We further perform an analysis in subgroups defined by severity of sepsis as indicated by the Sequential Organ Failure Assessment (SOFA) score29. The SOFA score is a commonly used tool to stratify and compare patients in clinical practice, with higher scores indicating worse condition; a score greater than 15 is associated with mortality of up to 80%29. In Figure 9, the Dueling DDQN agents are significantly more aggressive than physicians in treating lower risk patients. As was also observed in Raghu et al3, in high risk patients the reverse is true: physicians commonly give maximum doses of both vasopressors and fluids, while the DQN agents rarely do. This suggests that the Dueling DDQN models may not be correctly accounting for patient severity or adjusting for confounding by indication.

Figure 9: Subgroup analysis: for patients with SOFA < 5, both the baseline and GRU implementations are 5-7 times more likely than physicians to give patients vasopressors.

Conclusion and Discussion

State-of-the-art deep reinforcement learning approaches are largely untested in rich and dynamic healthcare settings. We presented a sensitivity analysis exploring how a Dueling DDQN agent would react to alternative specifications, including: 1) approaches to adjusting for treatment history; 2) discretized time bin durations; 3) recurrent neural network state representations; 4) reward specifications; and 5) random seeds. We have shown that choices between equally a priori justifiable implementation settings can have large clinically significant impacts on learned DQN policies. Given this lack of robustness, results from individual implementations should be received skeptically.

The one area where our results do seem to point toward some clear guidance concerns the inclusion of treatment history in the state. Exclusion of treatment history from the state is only warranted under the implausible Markov assumption that past treatments only influence future outcomes through measured intermediate variables. If this assumption fails and there are cumulative dangers from too much treatment (e.g. pulmonary edema from fluid overload and cardiovascular side effects from vasopressors in our application), then past treatment will affect both the response (treatment is less likely to be beneficial given excessive past treatment) and the current treatment decision (treatment is less likely to be administered given extensive past treatment, and outcomes are likely to be worse given extensive past treatment). Thus, omitting treatment history would make excessive treatment appear more beneficial than it actually is. Indeed, the behavior we observed in our Dueling DDQN agents was consistent with the behavior that would be predicted by theory. Agents trained without treatment history included in their states recommended vasopressors or fluids at every timestep, an obviously harmful strategy. Yet all three prior DQN implementations in sepsis omitted treatment history from state definitions.

While we cannot provide definitive statistical or physiological explanations for most of the DQN outputs observed in our sensitivity analysis, here we discuss possible sources of DQN instabilities. One theme appeared to be unreasonable policies stemming from extrapolation beyond treatment decisions observed in the data under current practice. For example, in our SOFA score subgroup analysis we saw the DQN agents recommending clearly harmful actions rarely seen in the data, i.e. failing to give high doses to the highest risk patients and frequently giving high doses to low risk patients. Also, the fact that different initializations found solutions with similar estimated Q-values but qualitatively different action distributions (also observed in Masood et al30) suggests that the problem is not sufficiently constrained. We need better ways to incorporate knowledge of what features are important and what actions are reasonable to constrain learned policies31.

There is a long road from the current state of DRL healthcare applications to clinically actionable insights or treatment recommendations. Currently, AI researchers apply DRL algorithms to clinical problems and claim that the policies they learn would greatly improve health outcomes compared to current practice3,5. In this work, we demonstrate that had these researchers made slightly different (but a priori reasonable) decisions, they would have obtained very different policies that also appeared superior to current practice. Beginning to map this sensitivity is a small but important step along the road to clinically actionable DRL policies. We hope our observations will lead to future work on characterizing and (more importantly) obtaining the type of robustness required to justify empirical testing of a DRL policy via a clinical trial.

We close with some speculative suggestions for that future work. First, a wide range of analysis settings can be compared in extensive and physiologically faithful simulation experiments where the ground truth value of resulting learned policies would be available. This could shed light on certain operating characteristics and best practices. For example, alternative approaches to preventing models from extrapolating too far from current practice as suggested above could be evaluated in this framework.

Further, when policies are sensitive to algorithmic choices, one could search for areas of broad agreement across policies recommended under a range of settings. These areas of agreement would be policy fragments, i.e. recommendations only applying to specific contexts. These strategy fragments could then be rigorously assessed by subject matter experts for plausibility and their effects could be estimated by epidemiologists using more stable techniques for treatment effect estimation. Seeking approaches to make DRL robust and human-verifiable will help us properly leverage information in health records to improve care.

Footnotes

ICU admissions between 2008-2012 were recorded using the MetaVision system with higher resolution treatment information.

Figures & Table

Table 2:

Summary of the variations across different experimental settings

Setting | Timing | Encoder | Treatment history | Reward
Baseline | 1 hr | LSTM (full history) | Yes | Short term: (1) immediate change in prognosis
Alternatives | 1 hr, 4 hr | LSTM\GRU (full history) | Yes, No | Short term: (1) immediate change in prognosis; Long term: (2) combinations of SOFA at end of stay and survival time
Raghu et al4 | 4 hr | Sparse autoencoder (current obs. only) | No | Long term: in-hospital mortality
Peng et al5 | 4 hr | LSTM (full history) | No | Short term*: (1) immediate change in prognosis

* In Peng et al's reward, prognosis was estimated conditional on current observations only; in the baseline implementation's reward, prognosis was estimated conditional on the full patient history.

References

  • 1.Komorowski M., Gordon A, Celi L.A., Faisal A. “A Markov decision process to suggest optimal treatment of severe infections in intensive care”. in Neural Information Processing Systems Workshop on Machine Learning for Health.2016. [Google Scholar]
  • 2.Komorowski M., Celi L.A., Badawi O., et al. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nat Med. 2018;24:1716–1720. doi: 10.1038/s41591-018-0213-5. [DOI] [PubMed] [Google Scholar]
  • 3.Raghu Aniruddh, Komorowski M., Celi L.A., Szolovits Peter, Ghassemi Marzyeh. Deep Reinforcement Learning for Sepsis Treatment. Machine Learning For Health at the conference on Neural Information Processing Systems; 2017. arXiv:1711.09602. [Google Scholar]
  • 4.Raghu Aniruddh, Komorowski M., Ahmed Imran, Celi Leo, Szolovits Peter, Ghassemi Marzyeh. Continuous State-Space Models for Optimal Sepsis Treatment - a Deep Reinforcement Learning Approach. 2017. arXiv:1705.08422.
  • 5.Peng X, Ding Y, Wihl D, et al. Improving Sepsis Treatment Strategies by Combining Deep and Kernel-Based Reinforcement Learning. AMIA Annu Symp Proc. 2018;2018:887–896. [PMC free article] [PubMed] [Google Scholar]
  • 6.Niranjani Prasad, Li-Fang Cheng, Corey Chivers, Michael Draugelis, Barbara E Engelhardt. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. 2017. arXiv:1704.06300.
  • 7.Parbhoo S, Bogojeska J, Zazzi M, Roth V, Doshi-Velez F. Combining Kernel and Model Based Learning for HIV Therapy Selection. AMIA Jt Summits Transl Sci Proc. 2017;2017:239–248. [PMC free article] [PubMed] [Google Scholar]
  • 8.Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. DeepMind Technologies. Playing Atari with Deep Reinforcement Learning; NIPS Deep Learning Workshop; 2013. arXiv:1312.5602. [Google Scholar]
  • 9.Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature. 2015;518:529–533. doi: 10.1038/nature14236. [DOI] [PubMed] [Google Scholar]
  • 10.Hasselt Hado V. Double Q-learning. Advances in Neural Information Processing Systems. 2010;23:2613–2621. [Google Scholar]
  • 11.Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas. Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581.
  • 12.Andrew Rhodes, Evans Laura E., Waleed Alhazzani, et al. Surviving Sepsis Campaign: International Guidelines for Management of Sepsis and Septic Shock: 2016. Intensive Care Med. 2017;43:304–377. doi: 10.1007/s00134-017-4683-6. [DOI] [PubMed] [Google Scholar]
  • 13.Overgaard Christopher B., Džavík Vladimír. Inotropes and Vasopressors. Circulation. 2008;118:1047–1056. doi: 10.1161/CIRCULATIONAHA.107.728840. [DOI] [PubMed] [Google Scholar]
  • 14.Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, et al. Evaluating Reinforcement Learning Algorithms in Observational Health Settings. 2018. arXiv:1805.12298.
  • 15.Arulkumaran K., Deisenroth M. P., Brundage M., Bharath A. A. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Processing Magazine. Nov 2017;vol. 34(no. 6):26–38. [Google Scholar]
  • 16.Chris Watkins, Peter Dayan. Technical Note: Q-Learning. Machine Learning. 1992;8:279–292. [Google Scholar]
  • 17.Tom Schaul, John Quan, Ioannis Antonoglou, David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
  • 18.Johnson Alistair E.W., Pollard Tom J., Lu Shen, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035. doi: 10.1038/sdata.2016.35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Singer M., Deutschman C. S., Seymour C. W., et al. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA. 2016;315(8):801–810. doi: 10.1001/jama.2016.0287. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Guanglei Hong, Stephen W. Raudenbush. Causal Inference for Time-Varying Instructional Treatments. Journal of Educational and Behavioral Statistics. 2008;33(3):333–362. doi: 10.3102/1076998607307355. [Google Scholar]
  • 21.Sepp Hochreiter, Jürgen Schmidhuber. Long short-term memory. Neural Computation. 1997;9(8):1735–1780. doi: 10.1162/neco.1997.9.8.1735. [DOI] [PubMed] [Google Scholar]
  • 22.James M Robins. Robust estimation in sequentially ignorable missing data and causal inference models. Biometrics. December 2005;61:962–972. doi: 10.1111/j.1541-0420.2005.00377.x. [DOI] [PubMed] [Google Scholar]
  • 23.Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, et al. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci Rep. 2018;8:6085. doi: 10.1038/s41598-018-24271-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS Deep Learning and Representation Learning Workshop; 2014. arXiv:1412.3555. [Google Scholar]
  • 25.Bram Bakker. Reinforcement Learning with Long Short-Term Memory. Advances in Neural Information Processing Systems; 2002. Dept. of Psychology, Leiden University / IDSIA. [Google Scholar]
  • 26.Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, Yoshua Bengio. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation; 2014. doi:10.3115/v1/W14-4012. [Google Scholar]
  • 27.Henderson Peter, Islam Riashat, Bachman Philip, Pineau Joelle, Precup Doina, Meger David. Deep Reinforcement Learning that Matters. AAAI Conference on Artificial Intelligence; 2018. arXiv:1709.06560. [Google Scholar]
  • 28.Jiang Nan, Li Lihong. Doubly Robust Off-policy Evaluation for Reinforcement Learning. arXiv:1511.03722.
  • 29.Flavio Lopes Ferreira, Daliana Peres Bota, Annette Bross, et al. Serial Evaluation of the SOFA Score to Predict Outcome in Critically Ill Patients. JAMA. 2001. [DOI] [PubMed]
  • 30.Masood Muhammad, Doshi-Velez Finale. Diversity-Inducing Policy Gradient: Using Maximum Mean Discrepancy to Find a Set of Diverse Policies. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI); 2019. pp. 5923–5929. doi:10.24963/ijcai.2019/821.
  • 31.Fujimoto Scott, Meger David, Precup Doina. Off-Policy Deep Reinforcement Learning without Exploration. 2018. arXiv:1812.02900.
