Where do doctors disagree? Characterizing Decision Points for Safe Reinforcement Learning in Choosing Vasopressor Treatment

Esther Brown; Shivam Raval; Alex Rojas; Jiayu Yao; Sonali Parbhoo; Leo A Celi; Siddharth Swaroop; Weiwei Pan; Finale Doshi-Velez

2025 May 22;2024:222–231.


PMCID: PMC12099420 PMID: 40417508

Abstract

In clinical settings, domain experts sometimes disagree on optimal treatment actions. These “decision points” must be comprehensively characterized, as they offer opportunities for Artificial Intelligence (AI) to provide statistically informed recommendations. To address this, we introduce a pipeline to investigate “decision regions”, clusters of decision points, by training classifiers for prediction and applying clustering techniques to the classifier’s embedding space. Our methodology includes: a robustness analysis confirming the topological stability of decision regions across diverse design parameters; an empirical study using the MIMIC-III database, focusing on the binary decision to administer vasopressors to hypotensive patients in the ICU; and an expert-validated summary of the decision regions’ statistical attributes with novel clinical interpretations. We demonstrate that the topology of these decision regions remains stable across various design choices, reinforcing the reliability of our findings and generalizability of our approach. We encourage future work to extend this approach to other medical datasets.

Introduction

Reinforcement learning (RL) has been seen as promising for supporting clinical decision-making ¹. However, RL algorithms can only support clinical decision-making in situations where the possible alternatives have already been explored in the data: if a certain treatment has never been tried, then an algorithm has no statistical basis to recommend it. This observation suggests that one way to approach RL in clinical settings is to first identify where clinicians disagree about treatment decisions. These points of disagreement, coinciding with instances in the data where clinician decisions vary for the same type of patient, are situations in which an algorithm may provide potentially useful recommendations. Understanding these points of disagreement can also help clinicians gain a better understanding of their decisions. For example, are their actions more or less aggressive than their colleagues’?

Previous work identified and clustered these points of clinician disagreement (termed decision regions)². The authors demonstrate that these decision regions can aid in optimizing hypotension treatment in the ICU by highlighting key points of clinician disagreement, which can serve as a basis for planning. That is, if two clinicians disagree within a region, this region can be characterized as a high-variance area or a decision region. Consequently, any planning algorithm should prioritize planning within this region, as opposed to treating every region uniformly. Based on decision regions, the authors identified policies that appeared promising; however, their work did not evaluate the robustness of each discovered decision region against various design choices inherent in the identification and clustering process. Without these robustness checks, the specific policy proposed in Zhang et al.² could be evaluated, but one could not make more general claims about the types of clinician disagreement for hypotension treatment.

In this work, we propose a pipeline to discover robust decision regions, as well as characterize each decision region in terms of clinical features that contribute to the disagreement. While Zhang et al. ² focused on multiple treatments for hypotension, we focus on the binary choice of when to apply vasopressors, as these are the most frequently occurring in ICU settings. Specifically:

We describe key design choices in the identification and clustering of points of disagreement, called decision points, in a given dataset (such as the number of nearest neighbors and perplexity).
We describe a pipeline for finding decision regions that are robust across various design choices.
We apply our pipeline to MIMIC-III³ and find consistent regions where doctors disagree on treating hypotensive patients in the ICU. The consistency of the structure of these regions across different choices of data representation and transformation allows us to make clinical claims about the types of disagreement. Finally, we validated the clinical insights from our analysis with a clinical expert.

Our results demonstrate that robust clustering enables us to pinpoint specific clinical scenarios where disagreements commonly occur, providing a valuable foundation for optimizing treatment protocols in those challenging cases.

Related Work

Many works apply machine learning, specifically RL, to design treatment for diseases including diabetes ⁴, HIV^5,6 and sepsis⁷. We focus on hypotension management for ICU patients with sepsis, which has been studied in recent RL work e.g.,^7,8,1. Rather than developing RL methodology or new treatment, in this work, we focus on the task of robust identification of regions where clinicians disagree, in order to facilitate the design of RL algorithms.

State Abstraction and Discretization.

Existing works propose various approaches for clustering continuous spaces into discrete ones ^9,10. Some approaches involve grouping states whose raw features are close in terms of Euclidean distance ^11,12. Other approaches use temporal consistency: if two states frequently transition to similar next states, they are grouped together ^11,13. A goal of using temporal consistency is to ensure that the state dynamics are preserved even after abstraction ¹⁴. Another RL-based clustering approach considers the value function as a criterion^9,11. The idea in this approach is to partition the state space in such a manner that the error in approximating the value function remains minimal ¹⁵. In this way, states with similar values or value dynamics are clustered together. These works still try to cluster every state (e.g., ¹⁶). In contrast to prior works, we focus on a smaller number of discrete states by focusing only on key parts of the state space where clinical experts exhibit high degrees of disagreement.

Decision Regions.

Closest to our work, Zhang et al. ² identified individual decision points and then clustered decision points into decision regions using hierarchical clustering for state space abstraction. However, they only explored one clustering algorithm using only the raw feature space. In contrast, we explore several different representations and algorithm hyper-parameters, which allows us to find a set of robust clusters.

Background

We assume a given dataset $D$ = {(x_p,t, a_p,t)} with measurement-action tuples taken from P patients over T time steps, where x_p,t represents the vital signs and a_p,t are the treatments taken by clinicians for the p-th patient at time t.

Kernels.

A data point, x_p,t, is a decision point if there is a lack of clinician consensus in similar situations (similar data points). This requires us to quantify the similarity between (the representations of) data points. We use a kernel function to quantify this similarity, k : $Z$ × $Z$ → [0, 1], where k(z_p,t, z_p′,t′) signifies the similarity between the representation of states x_p,t and x_p′,t′. We use normalized kernels, such that a value of 0 denotes no similarity, and a value of 1 represents complete similarity. In particular, we will choose to use an RBF kernel. That is, for any pair of representations (z_p,t, z_p′,t′), we measure the similarity of these two representations as,

$$k(z_{p,t}, z_{p',t'}) = \exp\left(-\left\lVert W^{\top} z_{p,t} - W^{\top} z_{p',t'} \right\rVert_{2}^{2}\right),$$

where W is a hyperparameter of the kernel.
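The kernel above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the kernel hyperparameter is treated as a matrix `W` that linearly projects each representation, and the function name `rbf_kernel` and the toy vectors are hypothetical, not taken from the paper's code.

```python
import numpy as np

def rbf_kernel(z_a, z_b, W):
    """Return exp(-||W^T z_a - W^T z_b||_2^2), a similarity in (0, 1]."""
    diff = W.T @ (z_a - z_b)          # projected difference of the two representations
    return float(np.exp(-np.dot(diff, diff)))

# Hypothetical 3-dimensional representations; W is assumed to be the identity.
W = np.eye(3)
z1 = np.array([0.0, 0.0, 0.0])
z2 = np.array([0.0, 0.0, 0.0])
z3 = np.array([1.0, 0.0, 0.0])

print(rbf_kernel(z1, z2, W))  # identical representations give 1.0
print(rbf_kernel(z1, z3, W))  # unit squared distance gives exp(-1), about 0.368
```

Scaling `W` up makes the kernel fall off faster with distance, so the same threshold on similarity then selects a tighter set of neighbors.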

In the methods section, we give more details on how to use our quantification of similarity to find clinical disagreements about treatment decisions: this requires finding (i) patients that are similar, and then (ii) looking for significant action variation within the set of similar patients.

Dimensionality Reduction and Visualization.

We analyze the topology of clinical disagreement by mapping our decision points onto a two-dimensional plane, using two dimensionality reduction methods. The first method we use is t-Distributed Stochastic Neighbor Embedding (t-SNE), a non-linear dimensionality reduction algorithm that preserves the local structure of high-dimensional data ¹⁷. t-SNE ensures that similar data points remain close to each other while dissimilar points are placed further apart. The second method is Uniform Manifold Approximation and Projection (UMAP), which builds a graph to represent the high-dimensional structure of the dataset ¹⁸. UMAP calculates a joint probability distribution that quantifies the similarities between data points or subsets of data. UMAP assigns higher probabilities to point pairs that are more similar, effectively preserving both local and broader dataset topology with greater computational efficiency than t-SNE. An important difference between t-SNE and UMAP is that t-SNE prioritizes preserving the local structure of high-dimensional data, while UMAP preserves both the local and global structures of high-dimensional data, offering a more complete view of the data’s manifold ¹⁹.

Methods

Our goal is to identify and characterize situations where doctors disagree on their recommended treatments for the same patient state. We call these patient states “decision points”. Our approach is to first identify a set of highly “similar” patient states in a dataset consisting of patient trajectories and physician treatments. We then look for states at which doctors recommend very different treatments – these are our “decision points”. Lastly, we describe the clinical characteristics of our decision points by analyzing low-dimensional projections of representations of the data.

Methods Overview.

The complete process for identifying a set of decision points (the places of clinical disagreement) is illustrated in Figure 1. Our decision points pipeline takes in ICU time series data to identify points of disagreement in treatment actions among clinicians. In this process, we consider two representations of the data: the raw patient data and an LSTM embedding of the data. We evaluate the extent to which these representations capture salient aspects of the data by training a classification model to predict clinician action based on the data representation. We then identify points of significant action variation, called decision points, based on the representations.

After identifying the set of decision points for each representation, we study the global structures (i.e., topology) of these sets via cluster analysis in low-dimensional projections of decision points. Finally, we extract clinical insights from our topological analysis by describing the set of clinical features that characterizes decision points. Importantly, we validate the insights from our pipeline with clinical experts.

Step 1: Learning Representations.

We explore how different representations affect the robustness of our identification and analysis of decision points. Denote the representations of the patient features x_p,t by z_p,t. We can use the raw patient features directly, setting z_p,t = x_p,t. Although using raw features is convenient, they do not account for temporal information in patient data, which may be important. Thus, in this work, we also use a Long Short-Term Memory (LSTM) network, which captures temporal dependencies, to extract representations.

The LSTM network takes a sequence of patient data as input and compresses it into a vector z_p,t, allowing the model to retain information from the past while also adapting to new observations. We first train the LSTM to predict clinician actions, a_p,t, based on the patient state observed over a continuous chunk of time, (x_p,t₀, x_p,t₀+1, … , x_p,t−1, x_p,t), where t₀ and t are the start and end time, respectively. Training this way encourages the LSTM model to learn representations of the data that are predictive of clinician actions, as clinicians also take into account the history of the p-th patient when making their decision.

After the training phase, we remove the softmax layer from the LSTM and treat the activation of hidden nodes as the representation, z_p,t. For additional details on the LSTM model, please refer to the appendix section.¹

Step 2: Evaluating Representations Using a Classifier.

How do we know that the representations we learned in Step 1 capture properties of the data that are salient for clinical decisions in hypotension management? To evaluate our learned representations, we use the representations to predict physician actions for patients – we build a kernel classifier on the dataset of representations and clinician actions. Better representations should lead to better classification performance.
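The text does not specify the form of the kernel classifier, so the following is only one plausible sketch: a kernel-weighted vote in which each training point contributes its RBF similarity to the score of its own action. The names `kernel_matrix` and `kernel_vote_predict`, the matrix `W`, and the toy data are all hypothetical.

```python
import numpy as np

def kernel_matrix(Z_query, Z_train, W):
    """RBF similarities between every query row and every training row."""
    Pq, Pt = Z_query @ W, Z_train @ W                       # rows are (W^T z)^T
    sq_dist = ((Pq[:, None, :] - Pt[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist)

def kernel_vote_predict(Z_query, Z_train, a_train, W, n_actions=2):
    """Predict the action with the largest total similarity mass (an assumed rule)."""
    K = kernel_matrix(Z_query, Z_train, W)                  # (n_query, n_train)
    onehot = np.eye(n_actions)[a_train]                     # (n_train, n_actions)
    return (K @ onehot).argmax(axis=1)

# Toy data: two well-separated groups of representations with different actions.
Z_train = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
a_train = np.array([0, 0, 1, 1])
Z_test = np.array([[0.05, 0.1], [2.9, 3.0]])
a_test = np.array([0, 1])

pred = kernel_vote_predict(Z_test, Z_train, a_train, np.eye(2))
print("accuracy:", float((pred == a_test).mean()))
```

Comparing held-out accuracy (or a metric better suited to imbalanced classes) between the raw and LSTM representations is then one way to carry out the evaluation described in this step.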

Step 3: Identifying Decision Points.

We identify points where doctors disagree in their recommended treatment. We do this in the representation space, where the dataset of learned representations, $D_{z}$ , is a collection of (z_p,t, a_p,t) pairs, for all patients p and times t. For each point (z_p,t, a_p,t) ∈ $D_{z}$ , we first find a set of points {z_p′,t′} that are “similar” to z_p,t, and we call this set a neighborhood of z_p,t. Within this neighborhood, we check for disagreement in the clinical decisions (actions) {a_p′,t′} on these points. If disagreement is high, we say that z_p,t is a decision point.

Our notion of similarity is defined via a kernel function. We say that a point z_p′,t′ is in the neighborhood of z_p,t if the kernel function evaluated on the pair is at least some fixed threshold, a real number r: k(z_p,t, z_p′,t′) ≥ r. For each possible action a, we count the number of points in the neighborhood that were assigned the action a. We say that there is sufficient disagreement in recommended action if at least N points were assigned to each action.

The set of decision points resulting from our process captures situations where doctors disagree in terms of their recommended treatment for the same patient state.

We note that the choice of the kernel function k, the similarity threshold r and the disagreement threshold N are design choices. Different choices of these values may lead to different sets of points being classified as decision points. In the next step, we discuss how to extract global structures in the set of decision points and how to check that these global structures are robust across different design choices.
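The identification rule in Step 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an RBF kernel with a projection matrix `W`, counts each point as a member of its own neighborhood, and builds a dense pairwise matrix that would not scale to the full cohort without batching. The function name and toy data are hypothetical.

```python
import numpy as np

def find_decision_points(Z, actions, W, r=0.5, N=2, n_actions=2):
    """Flag points whose kernel neighborhood has at least N points per action."""
    P = Z @ W
    sq_dist = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)
    in_neighborhood = np.exp(-sq_dist) >= r                 # k(z_i, z_j) >= r
    onehot = np.eye(n_actions, dtype=int)[actions]          # (n_points, n_actions)
    counts = in_neighborhood.astype(int) @ onehot           # per-point action counts
    return (counts >= N).all(axis=1)

# Toy data: a tight group with mixed actions and a tight group with one action.
Z = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
actions = np.array([0, 1, 0, 1, 0, 0, 0])

is_decision_point = find_decision_points(Z, actions, np.eye(2), r=0.5, N=2)
print(is_decision_point)
```

In this toy example only the first four points are flagged, because their shared neighborhood contains two points for each action, while the second group is unanimous. Raising `r` shrinks neighborhoods and raising `N` demands more evidence of disagreement, which is why both are treated as design choices to be stress-tested.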

Step 4: Identifying Global Structures of Decision Points Through Dimensionality Reduction and Topological Analysis.

We extract global structures of decision points through low-dimensional projections (see the “Dimensionality Reduction and Visualization” subsection in the Background for details on t-SNE and UMAP). Specifically, we perform cluster analysis on the decision points in the projection space.

We note that the projection methods we consider are inherently stochastic and may result in projections with different clusterings depending on initialization. The resulting clustering may also be sensitive to the hyperparameters of projection algorithms (we also need to consider the design choices in Step 3 that affect the identification of decision points). Thus, we need to check that the structures we find in a low-dimensional projection are invariant to hyperparameter choices and inherent randomness of the algorithms.

To check that our clustering is robust, i.e. captures real structures in the data rather than artifacts of hyperparameter choices, we compute the percentage of data whose cluster memberships are invariant to changes in design choices.

For any two clusterings, $C_{1}$ , $C_{2}$ , resulting from two sets of design choices, we match each cluster A in $C_{1}$ to a cluster B in $C_{2}$ that is most similar to A, where we use the Jaccard index to measure the similarity between A and B. We then compute the percentage of points in the dataset whose cluster memberships are unchanged under the matching; we define this as the consistency score.

The Jaccard index, also known as the Jaccard similarity coefficient, measures the similarity of sample sets in terms of the proportion of their shared points. It is a value between 0 and 1, where a value of 1 indicates that the two sets are identical, and a value of 0 indicates that they are disjoint (have no points in common). Given two clusters A and B, the Jaccard index is defined as the size of their intersection divided by the size of their union: $J (A, B) = \frac{| A \cap B |}{| A \cup B |}$ .

Given clusterings $C_{1}$ and $C_{2}$ of the dataset, each with K clusters, and given a matching between the clusters in $C_{1}$ and those in $C_{2}$ , the Consistency Score is given by:

$$\mathrm{consistency}(C_{1}, C_{2}) = \frac{1}{\text{size of dataset}} \sum_{k=1}^{K} \left| C_{k} \cap \mathrm{match}(C_{k}) \right|,$$

where C_k are the clusters in $C_{1}$ , and match(C_k) is the matching cluster in $C_{2}$ for cluster C_k in $C_{1}$ .
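The matching and the consistency score can be sketched in plain Python. The sketch assumes each clustering is given as a list of cluster labels, one per data point, and that each cluster in the first clustering is matched independently to its most similar cluster in the second (so the matching need not be one-to-one). The function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def consistency(labels_1, labels_2):
    """Fraction of points kept together with their Jaccard-matched cluster."""
    if len(labels_1) != len(labels_2):
        raise ValueError("both clusterings must label the same points")
    clusters_1, clusters_2 = {}, {}
    for i, (l1, l2) in enumerate(zip(labels_1, labels_2)):
        clusters_1.setdefault(l1, set()).add(i)
        clusters_2.setdefault(l2, set()).add(i)
    unchanged = 0
    for members in clusters_1.values():
        best = max(clusters_2.values(), key=lambda other: jaccard(members, other))
        unchanged += len(members & best)
    return unchanged / len(labels_1)

def pairwise_consistency(clusterings):
    """Matrix of consistency scores for every ordered pair of clusterings."""
    return [[consistency(a, b) for b in clusterings] for a in clusterings]

# Toy example: label names differ and one point changes cluster.
c1 = [0, 0, 0, 1, 1, 1]
c2 = [1, 1, 1, 0, 0, 1]
print(consistency(c1, c2))  # 5 of 6 points keep their matched cluster
```

Because clusters are compared as sets of points, relabeling clusters does not change the score. The matrix returned by `pairwise_consistency` is the kind of table that can be rendered as a heatmap, and under this independent matching it is not guaranteed to be symmetric.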

Finally, for N sets of hyperparameters (here N counts hyperparameter settings and is unrelated to the disagreement threshold N of Step 3), we compute the consistency score for every possible pair of clusterings induced by different hyperparameter choices. We visualize the consistency scores as an N × N heatmap, and verify that the cluster assignments of data points are robust to changes in hyperparameter choices (see Figure 2 for an example).

Figure 2: This figure shows the **consistency analysis** of the clustering found in RNN representation R across UMAP and t-SNE projections. A clustering is consistent if the same data points are grouped together across different design choices. We compare clusterings resulting from every pair of different hyperparameter choices in UMAP and t-SNE. We measure the similarity of two clusterings using the *consistency score*, which computes the percentage of points that maintain the same cluster assignment under both clusterings. A consistency score of 1 (light color) indicates the highest degree of similarity between two clusterings, while 0 (dark color) indicates the lowest degree of similarity. Subfigures (a) and (b) are heatmaps of consistency scores across different UMAP and t-SNE projections of decision points. The different projections correspond to various choices of hyperparameters: four choices of **distance metrics** (Euclidean, Bray-Curtis, Manhattan, Canberra) and different values of **nearest neighbors (for UMAP)** and **perplexity (for t-SNE)**. Each grid point in the heatmap represents the consistency score averaged over 10 random initializations using each set of hyperparameters in the pairwise comparison. **Key Takeaway**: We observe that while there are high consistency blocks (where clusterings resulting from different sets of hyperparameter choices are highly similar) for both UMAP and t-SNE, UMAP projections generally exhibit greater consistency than t-SNE. The minimal contrast across both heatmaps further illustrates that the clustering results remain largely consistent across a broad range of hyperparameters. Based on these scores and cluster representations, a UMAP configuration with 20 nearest neighbors is selected for further analysis due to its optimal balance between cluster definition and consistency.

Step 5: Clinical Characterization of Decision Points and Domain Validation.

In the last step, we characterize the clusters we found in Step 4 in terms of clinical features. With input from clinical experts, we identify a set of clinical features that are relevant for treatment. For example, in our case, where we focus on hypotension management through administering vasopressors, mean arterial pressure and systolic blood pressure are important features that clinicians take into account when treating patients. We compute the distribution of these top clinical features within each cluster and compare across clusters. This allows us to see if different clusters capture clinically meaningful differences (see Figures 3 and 4 for examples). Finally, we validate our clinical characterization of the clusters with domain experts.
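A per-cluster comparison of clinical features can be sketched as below. The summary statistics chosen here (mean and standard deviation) are an assumption, since the text only says that feature distributions are compared across clusters. The function name is hypothetical and the numbers are invented for illustration, not drawn from MIMIC-III.

```python
import numpy as np

def summarize_by_cluster(features, labels, feature_names):
    """Return {cluster: {feature: (mean, std)}} for each cluster label."""
    summary = {}
    for c in np.unique(labels):
        rows = features[labels == c]                 # points assigned to cluster c
        summary[int(c)] = {
            name: (float(rows[:, j].mean()), float(rows[:, j].std()))
            for j, name in enumerate(feature_names)
        }
    return summary

# Invented values: columns are mean arterial pressure (mmHg) and lactate.
features = np.array([[60.0, 2.0], [62.0, 2.4], [70.0, 1.0], [74.0, 1.2]])
labels = np.array([0, 0, 1, 1])

for cluster, stats in summarize_by_cluster(features, labels, ["MAP", "lactate"]).items():
    print(cluster, stats)
```

A large gap between clusters in a feature's mean relative to its within-cluster spread marks that feature as a candidate axis of disagreement to review with a clinical expert.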

Figure 4: In this figure, we show the heatmap coloring of decision points under UMAP projections by 16 clinical features. Here we use the RNN representation $R$ . In each subfigure, cooler colors indicate lower values of the corresponding clinical feature and warmer colors indicate higher values. Note that in the UMAP projections, we observe a number of small clusters that group together in a large cluster and one isolated small cluster. For most clinical features, the distribution of values is identical across all clusters. **However, for key clinical features (such as mean blood pressure and oxygen saturation), the distribution of values differs across clusters.** For example, in column three, row three, we have the UMAP colored by mean blood pressure (MBP). We see that here clusters can be characterized by their MBP values: we have two clusters (blue) where the MBP is less than 65, and the remaining clusters (orange) where the MBP is greater than or equal to 65. **This visualization provides us with ways to characterize the clusterings we find in the decision points in clinically meaningful ways.**

Figure 3: This figure compares the distribution of key clinical features (with normalized values) for decision and non-decision points using the RNN representation $R$ . Decision points with Mean Arterial Pressure (MAP) greater than 65 (Cluster 1) are shown in red, and those with MAP less than or equal to 65 (Cluster 2) are in blue. Non-decision points with MAP greater than 65 are shown in salmon, and those with MAP less than or equal to 65 are in gray. **We observe significant differences in many clinical features between decision and non-decision points, but there is no singular feature that characterizes the two types of points**. For instance, decision points typically have higher lactate and lower FiO2 normalized values compared to non-decision points. Additionally, we observe that even when decision points have MAP within the normal range (greater than or equal to 65), other features such as lactate sometimes fall outside their normalized range. These differences can contribute to clinical disagreements regarding the administration of vasopressor treatment.

Cohort and Data Processing

In this work, we characterize disagreement in hypotensive treatments using the MIMIC-III dataset³. This dataset comprises medical records for the intensive care unit (ICU) patients admitted to the Beth Israel Deaconess Medical Center from 2001 to 2012. A patient is considered hypotensive when their mean arterial pressure (MAP) falls below 65 mmHg during their ICU stay. Observations are grouped into hourly bins, starting from the first observation and ending at the last observation. Each bin represents the average of the features for that hour.

The p-th patient trajectory X_p = [x_p,1, …, x_p,T]^T is a matrix where each column represents a vital sign or a feature derived from the vital signs, over a total of T hours. Each element x_p,t corresponds to an hourly bin. We engineered features based on the relevant vitals for hypotension, indicating whether a particular vital sign was measured in the preceding 8 hours. Additionally, we created features representing the amount of vasopressor given over the past 8 hours (full feature list in Table 3 in the appendix). The trajectory’s maximum length, T, is capped at discharge or 72 hours for data filtering purposes.

For each x_p,t, the treatment action a_p,t indicates if a vasopressor was administered. After filtering for patients aged between 18 and 85, our final cohort includes 10,184 patients with 1,247,456 entries. Lab measurements and vital signs were imputed using forward filling, followed by replacing remaining missing values with medians. Missing treatment actions were replaced with zeros, assuming that all actions were recorded.
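The two-stage imputation can be sketched as follows. The text does not say how the medians are computed, so this sketch assumes per-feature medians over all observed (non-missing) values in the cohort. Each trajectory is a NumPy array of shape (hours, features) with NaN marking missing entries; the function names and values are hypothetical.

```python
import numpy as np

def forward_fill(traj):
    """Carry the last observed value forward in time within one trajectory."""
    filled = traj.copy()
    for t in range(1, filled.shape[0]):
        missing = np.isnan(filled[t])
        filled[t, missing] = filled[t - 1, missing]
    return filled

def impute(trajectories):
    """Forward-fill each patient, then fill leftover gaps with cohort medians."""
    medians = np.nanmedian(np.vstack(trajectories), axis=0)   # observed values only
    return [np.where(np.isnan(f), medians, f)
            for f in (forward_fill(x) for x in trajectories)]

nan = np.nan
patient_a = np.array([[nan, 2.0], [60.0, nan], [nan, nan]])   # columns: MAP, lactate
patient_b = np.array([[70.0, nan], [nan, 4.0]])

imputed_a, imputed_b = impute([patient_a, patient_b])
print(imputed_a)
print(imputed_b)
```

Forward filling is done within each patient so that no value leaks across trajectories. Only entries with no earlier observation for that patient, such as the first MAP value of `patient_a`, fall back to the median.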

For classification, 75% of the data was allocated to training and validation, while the remaining 25% was reserved for testing. Each patient’s data was assigned strictly to either the training/validation set or the testing set.
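A patient-level split of this kind can be sketched with the standard library. The sketch assumes the 25% test share is taken over patients rather than over rows, which the text does not specify. The function name, seed, and patient identifiers are hypothetical.

```python
import random

def split_by_patient(patient_ids, test_fraction=0.25, seed=0):
    """Return (train_rows, test_rows) so that no patient appears in both sets."""
    unique_ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_ids)          # reproducible shuffle of patients
    n_test = round(test_fraction * len(unique_ids))
    test_ids = set(unique_ids[:n_test])
    train_rows = [i for i, p in enumerate(patient_ids) if p not in test_ids]
    test_rows = [i for i, p in enumerate(patient_ids) if p in test_ids]
    return train_rows, test_rows

# One entry per hourly row; eight hypothetical patients with two rows each.
patient_ids = [p for p in "ABCDEFGH" for _ in range(2)]
train_rows, test_rows = split_by_patient(patient_ids)
print(len(train_rows), len(test_rows))
```

Splitting on patient identifiers rather than on rows keeps all hours of one ICU stay on the same side of the split, so test performance is not inflated by near-duplicate neighboring hours.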

Results: Decision Point Identification and Analysis for MIMIC-III

We apply our method to identify decision points in the treatment of hypotensive patients in the ICU. Using the MIMIC-III dataset, we study hypotensive ICU patient histories with binary actions taken by clinicians who were choosing whether or not to administer vasopressors. Empirical results show that our method can identify clinically meaningful boundaries between data points where physicians agree on treatments and data points where they disagree. We provide insights on clinical differences between these two categories. We find that our analysis is robust across different hyper-parameters of clustering algorithms. Given the robustness of our method, we encourage its application to other domains to investigate decision boundaries and clinical disagreement for other tasks.

Experimental design.

We examine two decision point representations: (1) raw input $X$ and (2) RNN $R$ . In evaluating the usefulness of these representations for predicting physician action (a binary classification task), $R$ demonstrated better performance than $X$ . However, both representations lead to low-dimensional projections that showed consistent and clinically meaningful clustering. In fact, in both $X$ and $R$ we find essentially the same set of decision points, robust to multiple metrics, design choices, and parameters. Detailed comparisons are in the appendix. Furthermore, in Figure 2, we see that the clustering of the decision points in low-dimensional projections is also consistent across different choices of hyperparameters of the projection algorithms.

In the appendix, we provide an analysis for a multiclass setting, considering the inclusion of fluids as a treatment action. Despite the performance drop in predicting clinician action due to data imbalance, our method effectively detects consistent decision points in this setting across different choices of suitable distance metrics, the number of nearest neighbors, and perplexity (Figures 29 and 30 in the appendix).

The representations given by our pipeline are invariant and robust to different key design choices.

In Figure 2, we plot projections from both representations $R$ and $X$ using both UMAP and t-SNE, finding that they are consistent across different design choices such as nearest neighbors and perplexity. The projections from the pipeline are based on the binary decision points for “no action” and “vaso” decisions. We extend this evaluation in the appendix (Figure 35) to probe the consistency of decision points against additional design choices. Using $R$ and $X$ , the decision points’ partitions are robust across the different parameters and design choices. Specifically, the partitions delineate the projections by critical hypotension clinical metrics such as mean arterial blood pressure, systolic blood pressure, and oxygen saturation (Figure 21).

Our pipeline identifies Mean Arterial Pressure (MAP) as playing a role in clinical disagreement for vasopressor administration, while urine volume and lactate might also lead to disagreement.

In Figure 3, we observe that MAP plays a key role in clinician disagreements around vasopressor administration. For example, we observe a distinct partition in the UMAP projections of decision points using representations $R$ and $X$ for “no action” and “vaso” decisions, based on MAP values (Figures 25 and 23). This finding is consistent with clinical protocols and literature around hypotension, which consider a MAP below 65 mmHg as a threshold for hypotension interventions and potential vasopressor therapy²⁰.

However, when MAP values exceed 65 mmHg, indicating that patients could have been more ill, decisions were not as straightforward. Figure 3 shows that clinicians also consider variations in urine volume, FiO2, and lactate levels. This trend aligns with recent studies indicating that early and aggressive management of multiple physiological parameters may improve survival in critical care settings^21,20. Specifically, we observed some clusters of decision points with MAP values above 65 mmHg that were characterized by reduced urine volumes and increased FiO2 and lactate levels compared to non-decision points (MAP values above 65 mmHg). These patterns suggest a clinical awareness of other hypotension indicators, such as organ perfusion and respiratory efficiency, which may warrant vasopressor administration even in the absence of low MAP. The clinical reasoning behind this approach is supported by evidence indicating that a combination of physiological markers, rather than a single clinical feature like MAP, provides a more accurate reflection of a patient’s condition and the need for intervention²¹.

Beyond MAP, we find that other features also contribute to clinical decisions and disagreements around administering vasopressors (see the resulting decision points in Figure 3). For instance, when MAP is within the normal range, elevated lactate levels emerge as a clinical feature that can create disagreements around vasopressor therapy. Elevated lactate is often a marker of tissue hypoxia and can indicate a state of increased metabolic stress or organ dysfunction ^22,23. This suggests that clinicians consider a broader range of clinical indicators when assessing a patient’s hemodynamic state and the potential need for vasopressors to support tissue perfusion. In an ICU setting where there might be more pressure on clinicians during the decision-making process, it is possible that clinicians use additional indicators to assess a patient’s cardiovascular function before administering vasopressors. While our pipeline focuses on lab measurements, vital signs, and treatment actions from the MIMIC dataset, future work could explore how additional contexts, like comorbid conditions and cardiopulmonary status, affect what points are classified as decision points. With insufficient patient health context, we cannot tell whether a point flagged as a decision point reflects true disagreement or merely insufficient data. This is a limitation of our approach.

Discussion and Conclusion

In this work, we develop a robust pipeline to identify and characterize decision points (points of disagreement between clinicians) in a clinical setting. We do so because we are interested in exploring the structures of decision points in representation space (i.e., their geometry and topology). We implement two methods to create representations of decision points: (1) a kernel method on raw inputs and (2) a method where we feed RNN embeddings into a kernel.

We specifically focus on hypotensive patients in the ICU and the binary decision of administering vasopressors. Projections of decision points resulting from our two representations are shown in Figures 3 and 4. The structures of the embedding spaces are consistent across various design choices such as nearest neighbors, distance metrics, and perplexities. This highlights the robustness of the clinical insights from our analysis.

We then analyze the topology of these representations in order to get clinical insights. The structures highlight the clinical features that contribute to clinical disagreement and to decision points, such as mean arterial pressure, systolic pressure, and oxygen saturation. These insights align with clinicians’ experiences and observations.

Exploration for Biases.

Additionally, we checked whether decision points varied across gender, age, and insurance. Each subgroup contained equal proportions of decision points (i.e., clinical disagreements) with no significant variations across subgroups. For example, in Figure 9, we see that “no action” and “vaso” treatments appear uniformly distributed across all the subgroups. However, our method can be used to expose whether such differences exist for other datasets and in other settings.

Comparison to Prior Work.

Zhang et al. ² discovered nineteen clusters in a four-treatment setting, in contrast to the two clusters found in our two-treatment setting. We opted for the two-treatment setting because a significant class imbalance among the four treatment options (76.8% no action, 17.9% vasopressor, 3% IV fluid, 2.4% both treatments) resulted in high error rates in the decision point classifier, thus casting doubt on the validity of the identified decision points. We emphasize that this is not just an artifact of our modeling: Zhang et al. also experienced high classification error rates in the four-treatment setting. That said, if one does trust the decision points identified in the four-treatment setting, we do recover many more clusters, in a way that may be consistent with Zhang et al. ² (see Figure 33 in the appendix). How to identify decision points when class imbalance or other data factors preclude accurate classifiers remains an open question.
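To give a sense of how severe this imbalance is, the sketch below computes what a trivial classifier that always predicts the most common treatment would score on the reported class shares. It is an illustration rather than part of the authors' analysis, and the function name is hypothetical. The shares are normalized because the rounded percentages sum to slightly more than 100%.

```python
# Class shares as reported for the four-treatment setting.
class_share = {
    "no action": 0.768,
    "vasopressor": 0.179,
    "IV fluid": 0.03,
    "both treatments": 0.024,
}

def majority_baseline(shares):
    """Accuracy and balanced accuracy of always predicting the most common class."""
    total = sum(shares.values())
    label, share = max(shares.items(), key=lambda item: item[1])
    accuracy = share / total
    # Recall is 1 for the predicted class and 0 for every other class,
    # so the mean per-class recall is 1 / number_of_classes.
    balanced_accuracy = 1.0 / len(shares)
    return label, accuracy, balanced_accuracy

label, accuracy, balanced_accuracy = majority_baseline(class_share)
minority_share = 1.0 - accuracy  # fraction of points this baseline always gets wrong
```

A classifier can therefore reach roughly three-quarters accuracy while never predicting IV fluid or combined treatment. This is one reason per-class error rates matter when deciding whether the identified decision points can be trusted.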

In conclusion, our method robustly identifies (1) clinical features that distinguish decision points from non-decision points (i.e., where doctors disagree versus where they agree), (2) axes of feature variation among different types of disagreement, and (3) clinical insights into situations where doctors disagree. We demonstrated its application in the context of vasopressor administration in hypotensive ICU patients, and our approach can be used to uncover insights in many other clinical decision-making contexts. While these points of disagreement, where clinician decisions vary for the same type of patient, are situations where an algorithm can offer useful recommendations, it is important to note that not all decision points indicate areas for intervention; some may result from insufficient data or missing clinical context.

Footnotes

The appendix can be found in the paper linked in our GitHub repository at https://github.com/dtak/decision-points/tree/AMIA

References

1. Yu C, Liu J, Nemati S, Yin G. Reinforcement learning in healthcare: A survey. ACM Computing Surveys (CSUR) 2021;55(1):1–36.
2. Zhang K, Wang Y, Du J, Chu B, Celi LA, Kindle R, Doshi-Velez F. Identifying decision points for safe and interpretable reinforcement learning in hypotension treatment. arXiv preprint arXiv:2101.03309. 2021.
3. Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Scientific data. 2016;3(1):1–9. doi: 10.1038/sdata.2016.35.
4. Bothe MK, Dickens L, Reichel K, Tellmann A, Ellger B, Westphal M, Faisal AA. The use of reinforcement learning algorithms to meet the challenges of an artificial pancreas. Expert review of medical devices. 2013;10(5):661–673. doi: 10.1586/17434440.2013.827515.
5. Ernst Damien, et al. Proceedings of the 45th IEEE Conference on Decision and Control. IEEE; 2006. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach.
6. Parbhoo S, Bogojeska J, Zazzi M, Roth V, Doshi-Velez F. Combining kernel and model based learning for HIV therapy selection. AMIA Summits on Translational Science Proceedings. 2017. p. 239.
7. Komorowski M, Celi LA, Badawi O, Gordon AC, Faisal AA. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature medicine. 2018;24(11):1716–1720. doi: 10.1038/s41591-018-0213-5.
8. Chan B, Chen B, Sedghi A, Laird P, Maslove D, Mousavi P. Generalizable deep temporal models for predicting episodes of sudden hypotension in critically ill patients: a personalized approach. Scientific Reports. 2020;10(1) doi: 10.1038/s41598-020-67952-0.
9. Lee Yonggu, Park Chulwung, Kang Shinjin. Deep embedded clustering framework for mixed data. IEEE Access. 2022;11:33–40.
10. Perera Dilruk, Liu Siqi, Feng Mengling. 2023 International Joint Conference on Neural Networks (IJCNN) IEEE; 2023. Demystifying complex treatment recommendations: A hierarchical cooperative multi-agent RL approach; pp. 1–10.
11. Barto Andrew G, Mahadevan Sridhar. Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems. 2003;13(1-2):41–77.
12. Kochenderfer Mykel J. Adaptive abstraction for model-based reinforcement learning. 2006.
13. Jong Nicholas K, Hester Todd, Stone Peter. The utility of temporal abstraction in reinforcement learning. AAMAS (1) 2008. pp. 299–306.
14. Botvinick Matthew Michael. Hierarchical reinforcement learning and decision making. Current opinion in neurobiology. 2012;22(6):956–962. doi: 10.1016/j.conb.2012.05.008.
15. Taherian Nahid, Shiri Mohammad Ebrahim. Q*-based state abstraction and knowledge discovery in reinforcement learning. Intelligent Data Analysis. 2014;18(6):1153–1175.
16. Hutsebaut-Buysse Matthias, Mets Kevin, Latré Steven. Hierarchical reinforcement learning: A survey and open research challenges. Machine Learning and Knowledge Extraction. 2022;4(1):172–221.
17. Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of machine learning research. 2008;9:11.
18. Kobak Dmitry, Linderman George C. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature biotechnology. 2021;39(2):156–157. doi: 10.1038/s41587-020-00809-z.
19. Pal Krishan, Sharma Mayank. 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC) IEEE; 2020. Performance evaluation of non-linear techniques UMAP and t-SNE for data in higher dimensional topological space; pp. 1106–1110.
20. Tong Xin, Xue Xiaopeng, Duan Chuanzhi, Liu Aihua. Early administration of multiple vasopressors is associated with better survival in patients with sepsis: a propensity score-weighted study. European Journal of Medical Research. 2023;28(1):249. doi: 10.1186/s40001-023-01229-w.
21. Fuchita Mikita, et al. Prophylactic administration of vasopressors prior to emergency intubation in critically ill patients: A secondary analysis of two multicenter clinical trials. Critical care explorations. 2023;5(7):e0946. doi: 10.1097/CCE.0000000000000946.
22. Li Xiaolu, et al. Lactate metabolism in human health and disease. Signal transduction and targeted therapy. 2022;7(1):305. doi: 10.1038/s41392-022-01151-3.
23. Park Hyun Ik, et al. Clinical significance of lactate clearance in patients with cardiogenic shock: results from the RESCUE registry. Journal of Intensive Care. 2021;9(1):1–10. doi: 10.1186/s40560-021-00571-7.

