Abstract
Unpacking and comprehending how black-box machine learning algorithms (such as deep learning models) make decisions has been a persistent challenge for researchers and end-users. Explaining time-series predictive models is useful for high-stakes clinical applications to understand the behavior of prediction models, e.g., to determine how different variables and time points influence the clinical outcome. However, existing approaches to explain such models are frequently specific to particular architectures and to data whose features do not have a time-varying component. In this paper, we introduce WindowSHAP, a model-agnostic framework for explaining time-series classifiers using Shapley values. WindowSHAP is intended to mitigate the computational complexity of calculating Shapley values for long time-series data as well as to improve the quality of explanations. WindowSHAP is based on partitioning a sequence into time windows. Under this framework, we present three distinct algorithms, Stationary, Sliding, and Dynamic WindowSHAP, each evaluated against the baseline approaches KernelSHAP and TimeSHAP using perturbation and sequence analysis metrics. We applied our framework to clinical time-series data from both a specialized clinical domain (Traumatic Brain Injury, TBI) and a broad clinical domain (critical care medicine). The experimental results demonstrate that, based on the two quantitative metrics, our framework is superior at explaining clinical time-series classifiers, while also reducing the computational complexity. We show that for time-series data with 120 time steps (hours), merging 10 adjacent time points can reduce the CPU time of WindowSHAP by 80% compared to KernelSHAP. We also show that our Dynamic WindowSHAP algorithm focuses more on the most important time steps and provides more understandable explanations. As a result, WindowSHAP not only accelerates the calculation of Shapley values for time-series data, but also delivers more understandable explanations of higher quality.
Keywords: Explainable Artificial Intelligence, Shapley value, Time-series data, Model Interpretation
1. Introduction
Explaining and understanding the decision-making process of black-box machine learning algorithms is one of the major challenges for the research community in computing and information sciences. Despite the strong performance of these algorithmic predictions, the non-linear structure of such models makes it challenging to discern what information in the input data causes them to generate particular predictions that support clinical decisions [1]. Rationalizing model behavior can help uncover biases, promote fairness and transparency, and, most importantly, increase trust among end-users [2], [3]. Furthermore, modern privacy laws such as the European Union General Data Protection Regulation (GDPR) emphasize users’ right to explanations related to automated decision-making [4]. These trends and challenges, collectively, necessitate the development of tools for elucidating black-box clinical prediction models.
Model explainability is further complicated by an emerging shift in the underlying data, from relatively simple, abstracted clinical data to more complex, routinely collected, longitudinal clinical data. Over the course of a patient’s care process, electronic health records (EHRs) host vast amounts of time-series data through frequent charting of vital signs, laboratory tests, and prescriptions [5]. Such time-series data has the potential to support clinical decision-making and forecast a variety of patient outcomes such as clinical deterioration, functional improvement, discharge disposition, and effectiveness of interventions [6]–[9].
Existing explainability methods are often focused on extracting the importance or contribution of input features to the model prediction. For example, SHAP (SHapley Additive exPlanations) [10] is one such method that generates contribution scores (i.e., Shapley values) to explain individual predictions based on coalitional game theory, satisfying three desirable properties for explanations: consistency, local accuracy, and missingness. There exists an alternative, kernel-based approximation of Shapley values called KernelSHAP that reduces the complexity of calculating Shapley values by sampling from a smaller number of feature subsets [10].
Despite its prominence in informatics and clinical applications [11]–[16], SHAP is not entirely appropriate for time-series predictive models. First, it was not originally intended to be used with time-series data. Second, while KernelSHAP provides a model-agnostic approximation of Shapley values that sets a ceiling for the number of sampled feature subsets, it is still computationally expensive for high-dimensional data [17], [18]. Last but not least, sequential data points in clinical time-series data are often highly dependent on each other, which can lead to misleading KernelSHAP results [19]. When there are several highly dependent features (e.g., variable-time point pairs in time-series data), the joint contribution of these features is distributed among them, resulting in a large number of small Shapley values [18]. This makes it more difficult to visualize data or extract useful explanations from the contribution scores.
The primary objective of this research is to design and evaluate an explanation method based on Shapley values that is (1) applicable to time-series data, (2) computationally feasible for high-resolution time-series data, and (3) able to tackle dependencies between sequential data points. To address the shortcomings of KernelSHAP, we present WindowSHAP, a framework designed to explain time-series prediction models more effectively and accurately. WindowSHAP reduces the total number of features for which Shapley values must be determined by combining neighboring time steps into a time window. Instead of calculating Shapley values for every possible combination of time step and variable, we simply calculate Shapley values for each time window (see Figure 1 for a conceptual demonstration). We propose various types of time windows under the WindowSHAP framework, each with its own advantages. For evaluation purposes, we train three deep learning models on time-series data from both a specialized clinical domain (Traumatic Brain Injury) and a broad clinical domain (critical care medicine) to show the applicability of the algorithms. We compare our proposed framework with competitive baselines using different quantitative metrics to demonstrate the efficiency of our algorithms and the accuracy of their explanations.
Figure 1.
Conceptual demonstration of KernelSHAP vs. WindowSHAP for a classification model on an individual instance, predicting whether there is an anomaly in a synthetically generated sequence. The top picture shows the sequence and its Shapley values derived from KernelSHAP, while the bottom depicts the Shapley values from WindowSHAP. While KernelSHAP spreads the Shapley values across the entire sequence, our approach focuses on the part of the sequence that is most important, avoiding calculating Shapley values for every single time step.
In summary, the main contributions of this study are as follows:
Developing the WindowSHAP framework, a variation of Shapley additive explanations for time-series data.
Introducing and evaluating variations of WindowSHAP based on different windowing techniques in both categories of fixed- and variable-length time windows.
Validating our method on real-world clinical time-series data by employing a variety of quantitative metrics.
The rest of the paper is organized as follows. Section 2 describes the related research. Section 3 presents the WindowSHAP framework as well as the datasets and prediction models that are used in experiments. Section 4 details our results, Section 5 discusses the findings and implications of our work, and Section 6 concludes our work.
2. Related Work
Explanation approaches can be broadly classified into model-specific and model-agnostic techniques. Model-specific techniques generate explanations for model behavior using architectural properties of the model. The attention mechanism is the most commonly utilized model-specific strategy for explaining deep learning time-series classifiers [20]–[24], where an importance score is assigned to each time step using attention layers. Gradients can also be used to describe deep learning time-series classifiers by allocating a weight to each input based on the outcome’s derivative with respect to the corresponding feature [25]. One study [26] produced importance scores for deep learning models using “difference from reference” rather than gradients. The reference value represents a default or neutral input, and the algorithm back-propagates the “difference from reference” values to avoid artifacts such as gradient discontinuity.
Model-agnostic explanation methods can be used on any black-box model regardless of the model structure. They rely only on the inputs and outputs rather than the model’s internal architecture. In recent years, one of the most common model-agnostic approaches has been to attribute importance to features using Shapley values [10]. The Shapley value is a concept from coalitional game theory that fairly distributes the payout generated by a game among its players [27]. To calculate the exact Shapley value for a single feature, $2^{|N|-1}$ marginal-contribution terms must be evaluated, where $|N|$ is the total number of features. Hence, when the total number of input features increases, as it does in modern machine learning applications, the complexity of calculating Shapley values grows exponentially. Some approximations for calculating Shapley values have been developed, including KernelSHAP, which sums over a smaller sample of feature subsets rather than all subsets [10]. However, as the number of features increases, the number of sampled subsets needed to maintain adequate accuracy for the approximated Shapley values also grows exponentially.
TimeSHAP [28] is a temporal model extension of SHAP that includes a pruning mechanism that combines all initial time steps whose cumulative relevance is less than a certain threshold. It assumes that the earliest time steps in sequential data sets are the least important time points. However, this assumption can be criticized because it might not be true in all cases of time-series data. For example, the conceptual demonstration in Figure 1 shows a situation of anomaly detection where the most important time points are not necessarily at the end of a sequence.
Temporal Importance Model Explanation (TIME) [29] was developed to identify critical temporal steps and time intervals at the global explanation level. This approach, however, is confined to providing generalized, global significance of time steps and cannot be used for a single instance of data. Even though all of these methods are applicable to time-series classifiers, they either do not provide local explanations for a single instance of data, or they do not handle high dimensionality of time-series as well as high dependency of adjacent time steps.
3. Methods
3.1. Shapley values for tabular data
Shapley values assign an importance (contribution) score to each feature, indicating how much that feature influences the model output for a single instance. Based on [10], we provide a formulation of Shapley values. Assuming that $x$ is an input of a prediction model $f$, the Shapley value for feature $i$ for a given input $x$ is calculated by

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\left[f_x(S \cup \{i\}) - f_x(S)\right] \tag{1}$$

where $N$ is the set of all features, $S$ is a subset of feature indices, and $f_x(S)$ is the characteristic function, which shows the output of the prediction function when only the features in set $S$ are present from input $x$. The characteristic function is defined as follows:

$$f_x(S) = E\left[f(x) \mid x_S\right] \tag{2}$$

Here, $x_S$ is a sub-vector of $x$ representing the features in set $S$. Due to the local accuracy property of Shapley values, the sum of all feature importance scores is equal to the prediction model output, i.e., $f(x) = \phi_0 + \sum_{i \in N} \phi_i$, where $\phi_0 = f_x(\emptyset)$ is the output of the characteristic function when all the features are absent. See Table 1 for a description of the notation used in this work.
Table 1.
List of notations used in the paper
Notation | Description |
---|---|
$f$ | The prediction model |
$x$ | Input of the prediction model |
$S$ | A subset of all features |
$P$ | The set of all combinations of variables and time steps |
$T_j$ | The set of all time steps for variable $j$ |
$w_k^j$ | $k$th time window in the variable $j$'s sequence |
$n_j$ | The number of time windows that variable $j$ is partitioned into |
$W_j$ | The set of all windows for variable $j$ |
$W$ | The set of all time windows for all variables |
$\phi_j$ | The contribution score of variable $j$ |
$\phi_{j,t}$ | The contribution score of variable $j$ at time point $t$ |
$\tilde{\phi}_{j,k}$ | The contribution score assigned to the $k$th window in variable $j$ |
$l$ | Window length parameter in Stationary and Sliding WindowSHAP algorithms |
$d$ | Stride parameter in Sliding WindowSHAP algorithm |
$\delta$ | The Shapley value threshold in Dynamic WindowSHAP algorithm |
$n_{\max}$ | Maximum number of time windows in Dynamic WindowSHAP algorithm |
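To make the formulation above concrete, the following toy Python sketch computes exact Shapley values by brute force over all feature subsets, using a single-baseline substitution as a simplification of the conditional expectation in equation (2); the linear toy model and all names are illustrative only, and the approach is infeasible beyond a handful of features.

```python
from itertools import combinations
from math import factorial
import numpy as np

def exact_shapley(f, x, baseline):
    """Brute-force Shapley values per equation (1). Absent features are replaced
    by a reference (baseline) value, a simplification of equation (2)."""
    n = len(x)
    phi = np.zeros(n)

    def value(subset):
        z = baseline.copy()
        z[list(subset)] = x[list(subset)]   # features in S keep their observed values
        return f(z.reshape(1, -1))[0]       # f expects a batch of inputs

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += weight * (value(S + (i,)) - value(S))
    return phi

# Toy usage with a linear model: each Shapley value equals w_i * (x_i - baseline_i).
w = np.array([2.0, -1.0, 0.5])
f = lambda X: X @ w
print(exact_shapley(f, np.array([1.0, 3.0, 2.0]), np.zeros(3)))  # -> [ 2. -3.  1.]
```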
3.2. Shapley values for time-series data
The general Shapley value formulation provided in equation (1) is not directly applicable to time-series data. In order to calculate Shapley values for time-series data, each possible combination of variable and time step is considered an input feature, which results in one Shapley value for each of these combinations. Suppose that $x \in \mathbb{R}^{M \times T}$ is a time-series instance with $M$ variables and $T$ time steps. Defining $P = \{(j,t) : 1 \le j \le M,\ 1 \le t \le T\}$ as the set of all combinations of variables and time steps, we calculate the Shapley value of variable $j$ at time point $t$ as

$$\phi_{j,t} = \sum_{S \subseteq P \setminus \{(j,t)\}} \frac{|S|!\,(|P|-|S|-1)!}{|P|!}\left[f_x(S \cup \{(j,t)\}) - f_x(S)\right] \tag{3}$$

where $f_x(S)$ is the characteristic function that denotes the prediction output when only the variable-time pairs in set $S$ are present in input $x$. Extracting Shapley values for high-resolution time-series will be very time consuming since, for each pair $(j,t)$, $2^{|P|-1}$ terms should be calculated.
3.3. WindowSHAP
We introduce our efficient framework called WindowSHAP to estimate Shapley values for time-series data in this section. WindowSHAP is designed on the idea of constructing windows from either nearby or non-adjacent temporal steps. In this method, we compute Shapley values for each individual time window rather than for all possible combinations of variable-time points. Assume that we partition the sequence of variable $j$ into $n_j$ non-overlapping time windows. Note that a window need not necessarily have a contiguous set of time points. The resulting set of windows for variable $j$ is represented as $W_j = \{w_1^j, \ldots, w_{n_j}^j\}$, where $w_k^j$ shows the $k$th time window in the variable $j$'s sequence. Considering each window of each variable as a feature, the Shapley value for the $k$th time window of variable $j$ is calculated as

$$\tilde{\phi}_{j,k} = \sum_{S \subseteq W \setminus \{w_k^j\}} \frac{|S|!\,(|W|-|S|-1)!}{|W|!}\left[f_x(S \cup \{w_k^j\}) - f_x(S)\right] \tag{4}$$

where $W$ is the set of all time windows for all variables, i.e., $W = \bigcup_{j=1}^{M} W_j$. The Shapley value of any variable-time point combination can be estimated by distributing the importance of a time window equally among its time points, i.e.,

$$\phi_{j,t} = \frac{\tilde{\phi}_{j,k}}{|w_k^j|} \quad \text{for } t \in w_k^j \tag{5}$$

The Shapley values of all windows in $W$ add up to the prediction model output based on the local accuracy property. The local accuracy property is maintained after dispersing the Shapley values of time windows among their time points, i.e.,

$$\sum_{j=1}^{M} \sum_{t=1}^{T} \phi_{j,t} = \sum_{j=1}^{M} \sum_{k=1}^{n_j} \tilde{\phi}_{j,k} = f(x) - \phi_0 \tag{6}$$
Under the WindowSHAP framework, we describe three algorithms: (1) Stationary WindowSHAP, (2) Sliding WindowSHAP, and (3) Dynamic WindowSHAP. Stationary WindowSHAP and Sliding WindowSHAP are our fixed-length algorithms where all time windows are of the same length, while Dynamic WindowSHAP is a variable-length algorithm. We will describe each algorithm in the following sections.
3.3.1. Stationary WindowSHAP
In this approach, the time-axis is segmented into fixed-length windows. Even though all time windows have the same length, if the length of the sequence is not divisible by the window length, the last time window may be smaller than the others. Figure 2 shows a partitioning of the time-axis for the Stationary WindowSHAP algorithm.
Figure 2.
A visualization of time-axis partitioning in the Stationary WindowSHAP algorithm. The windows are non-overlapping, contiguous, and of the same length, except possibly the last window being smaller.
Algorithm 1.
Stationary WindowSHAP
Input: Input sequence $x$, prediction model $f$, window size $l$
Output: Shapley values $\phi_{j,t}$
1. $n \leftarrow \lceil T / l \rceil$ ❖ Calculate the total number of time windows
2. $w_k^j \leftarrow \{(k-1)l + 1, \ldots, \min(kl, T)\}$ for $k = 1, \ldots, n$ and every variable $j$ ❖ Build the time windows using adjacent time steps
3. $W \leftarrow \{w_k^j : 1 \le j \le M,\ 1 \le k \le n\}$ ❖ Build the set of all time windows
4. Compute $\tilde{\phi}_{j,k}$ for all time windows in $W$ using equation (4) and distribute them among time points using equation (5) ❖ Calculate the Shapley values for all time windows using WindowSHAP
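To illustrate how window-level Shapley values can be computed in practice, the following is a minimal Python sketch of Stationary WindowSHAP for a univariate sequence. It is not the released WindowSHAP implementation [34]; it assumes the shap package's KernelExplainer and a wrapper `model_fn` that maps a batch of sequences to one probability each. Windows are encoded as binary coalition features, and absent windows are filled from a background (reference) sequence.

```python
import numpy as np
import shap  # assumes the shap package is installed

def stationary_window_shap(model_fn, x, background, window_len, nsamples=500):
    """Sketch of Stationary WindowSHAP for a univariate sequence of length T."""
    x = np.asarray(x, dtype=float)
    background = np.asarray(background, dtype=float)
    T = len(x)
    n_windows = int(np.ceil(T / window_len))

    def value_fn(z):
        # z has shape (n_coalitions, n_windows); entry 1 marks a "present" window.
        seqs = np.repeat(background[None, :], z.shape[0], axis=0)
        for k in range(n_windows):
            start, end = k * window_len, min((k + 1) * window_len, T)
            present = z[:, k] == 1
            seqs[present, start:end] = x[start:end]
        return model_fn(seqs)  # expected shape: (n_coalitions,)

    # Background coalition = all windows absent; explained coalition = all present.
    explainer = shap.KernelExplainer(value_fn, np.zeros((1, n_windows)))
    window_phi = explainer.shap_values(np.ones(n_windows), nsamples=nsamples)

    # Spread each window's Shapley value uniformly over its time steps (equation (5)).
    phi = np.zeros(T)
    for k in range(n_windows):
        start, end = k * window_len, min((k + 1) * window_len, T)
        phi[start:end] = window_phi[k] / (end - start)
    return phi
```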
3.3.2. Sliding WindowSHAP
Since the stationary windowing approach may not explain time points near the boundary of neighboring windows well, we developed a sliding time window approach (see Algorithm 2) in which adjacent time windows overlap. Since, in the WindowSHAP framework, $W_j$ must contain non-overlapping time windows for variable $j$, we instead shift a single time window toward the end of the sequence over multiple iterations. In each iteration, the algorithm divides the temporal sequence into the parts inside and outside of the specified time window, resulting in two Shapley values per iteration. The algorithm's window length ($l$) and stride ($d$) parameters determine, respectively, the length of the time window and the amount by which it is shifted in each iteration.
Algorithm 2.
Sliding WindowSHAP
Input: Input sequence $x$, prediction model $f$, window size $l$, stride $d$
Output: Shapley values $\phi_{j,t}$
1. $n \leftarrow \lfloor (T - l)/d \rfloor + 1$ ❖ Calculate the total number of sliding time windows
2. Initialize the matrix of Shapley values for all possible time windows and features
3. for $i = 1, \ldots, n$ do ❖ Iterate over all time windows
4. $\quad w_{\mathrm{in}} \leftarrow \{(i-1)d + 1, \ldots, (i-1)d + l\}$ ❖ Set of time steps inside the time window
5. $\quad w_{\mathrm{out}} \leftarrow \{1, \ldots, T\} \setminus w_{\mathrm{in}}$ ❖ Set of time steps outside the time window
6. $\quad W \leftarrow \{w_{\mathrm{in}}^j, w_{\mathrm{out}}^j : 1 \le j \le M\}$ ❖ Build the set of all time windows
7. $\quad$ Calculate the Shapley values for all features inside and outside the time window using equation (4)
8. $\quad$ Update the matrix of Shapley values for the corresponding time window
9. end for
Figure 3 demonstrates how Sliding WindowSHAP works (a) in each iteration and (b) after all iterations are completed. After all iterations are completed, the Shapley value of each time point is computed by averaging the Shapley values of the time windows that contain that time point. For example, the Shapley value of a time step covered by two overlapping windows in Figure 3-b is the average of those two windows' Shapley values.
Figure 3.
Demonstration of Sliding WindowSHAP: (a) depicts a single iteration in which the entire sequence is divided into two time intervals, inside and outside of the time window. (b) shows the final windowing result after all iterations have been completed and a Shapley value has been produced for each time window.
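The sketch below shows one possible reading of the Sliding WindowSHAP aggregation step, under the assumption that each iteration assigns a value to every time step (inside or outside the window) and that the final score is the average over iterations. The helper `two_window_shap_fn` is a hypothetical placeholder for a routine that runs WindowSHAP with exactly two windows (the interval and the rest of the sequence) and returns their two Shapley values.

```python
import numpy as np

def sliding_window_shap(two_window_shap_fn, T, window_len, stride):
    """Sketch of the Sliding WindowSHAP aggregation over overlapping windows."""
    sums = np.zeros(T)
    counts = np.zeros(T)
    n_iters = (T - window_len) // stride + 1
    for i in range(n_iters):
        start, end = i * stride, i * stride + window_len
        phi_in, phi_out = two_window_shap_fn(start, end)

        inside = np.zeros(T, dtype=bool)
        inside[start:end] = True

        # Distribute each window's value uniformly over its time steps (equation (5)).
        sums[inside] += phi_in / window_len
        counts[inside] += 1
        if (~inside).sum() > 0:
            sums[~inside] += phi_out / (~inside).sum()
            counts[~inside] += 1
    return sums / counts
```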
3.3.3. Dynamic WindowSHAP
In this approach, we divide the entire series into variable-length time windows. To accomplish this, we first define what the optimal split is using the following two objectives:
Keeping the number of time windows as few as possible to avoid increasing the algorithm’s complexity
Avoiding lengthy windows with large contribution scores to minimize information loss
The Shapley values of all time windows are calculated in each iteration of this algorithm, and time windows with Shapley values greater than a threshold, $\delta$, are split into two subsequences. The method terminates when either it reaches the maximum number of time windows, $n_{\max}$, or the Shapley values of all time windows are less than or equal to $\delta$. In the initial iteration of the algorithm, each time-series is considered as a single time window. For simplicity, Algorithm 3 shows the pseudocode of this method for a univariate sequence $x$. However, it can be easily extended to include all features of multivariate time-series data at once.
Algorithm 3.
Dynamic WindowSHAP
Input: Input sequence $x$, prediction model $f$, Shapley value threshold $\delta$, maximum number of time windows $n_{\max}$
Output: Shapley values $\phi_t$
1. $C \leftarrow \{1, T\}$ ❖ Initialize the set which stores the splitting points of the sequence
2. while the number of time windows is less than $n_{\max}$ and some window's Shapley value exceeds $\delta$ do ❖ Stop the iterations when the stopping criteria are met
3. $\quad$ Build the time windows based on the split points in $C$
4. $\quad$ Build the set of all time windows $W$
5. $\quad$ Calculate the Shapley values for all time windows using WindowSHAP (equation (4))
6. $\quad$ Add the midpoint of an interval to $C$ if its Shapley value is larger than the threshold $\delta$
7. end while
Figure 4 demonstrates how the algorithm works over four iterations of an example where, in the second iteration, only the second time window has a Shapley value greater than $\delta$. Hence, in the next iteration, this window is split into two equal time windows. The algorithm terminates at iteration four because all Shapley values for the time windows are less than $\delta$.
Figure 4.
Demonstration of the Dynamic WindowSHAP algorithm for a sequence. The algorithm stops in the fourth iteration because all the Shapley values for the time windows are less than the threshold $\delta$.
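A sketch of the splitting loop for a univariate sequence is given below. The helper `window_shap_fn` is a hypothetical routine returning one Shapley value per window for a given list of split points (for example, built on the WindowSHAP computation of equation (4)); splitting a window at its midpoint mirrors the equal split shown in Figure 4, and the loop may slightly overshoot `max_windows` in the iteration where it stops.

```python
import numpy as np

def dynamic_window_shap(window_shap_fn, T, delta, max_windows):
    """Sketch of the Dynamic WindowSHAP splitting loop for a univariate sequence."""
    boundaries = [0, T]                      # start with the whole sequence as one window
    while True:
        phis = window_shap_fn(boundaries)    # one Shapley value per current window
        new_boundaries = list(boundaries)
        for k, phi in enumerate(phis):
            start, end = boundaries[k], boundaries[k + 1]
            if abs(phi) > delta and end - start > 1:
                new_boundaries.append((start + end) // 2)   # split the window in half
        new_boundaries = sorted(set(new_boundaries))
        done = (len(new_boundaries) == len(boundaries)       # no window exceeded delta
                or len(new_boundaries) - 1 >= max_windows)   # or reached the window budget
        boundaries = new_boundaries
        if done:
            break

    # Distribute each window's value uniformly over its time steps (equation (5)).
    phis = window_shap_fn(boundaries)
    per_step = np.zeros(T)
    for k in range(len(boundaries) - 1):
        start, end = boundaries[k], boundaries[k + 1]
        per_step[start:end] = phis[k] / (end - start)
    return per_step
```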
3.4. Evaluation metrics
Given the volume and variety of clinical time-series data, evaluating and confirming these explanations through direct inspection by domain experts is not practicable. To implement a fair and quantitative evaluation of the explanation results, we adopt the metrics discussed by Schlegel et al. [30]. They propose two metrics - the perturbation and sequence analysis metrics - for evaluating explanations of single time points and temporal patterns, respectively. These metrics are defined based on the assumption that if a relevant/important feature (at a certain time point) changes, the performance of an accurate prediction model must decrease.
Perturbation analysis metric:
For a univariate time-series $x = (x_1, \ldots, x_T)$ and the relevance vector $r = (r_1, \ldots, r_T)$, a time point $x_t$ is changed to a non-informative value (e.g., zero) if the corresponding $r_t$ is larger than the $p$th percentile of $r$. The new sequence is called $x^{\mathrm{pert}}$, for which the model quality metric $L$ (e.g., the loss function) is calculated. The difference between the quality metric for the original sequence and the perturbed sequence is called the perturbation analysis metric. We can calculate this as a percentage of change in the quality metric, i.e., $(L(x^{\mathrm{pert}}) - L(x)) / L(x)$. The perturbation analysis metric aims to evaluate the impact of perturbing individual time points on model performance. However, this approach does not directly consider the impact of temporal patterns or trends, such as slopes or minima, on model performance.
Sequence analysis metric:
Unlike the perturbation analysis metric, this metric focuses on the inter-dependency of time points in a sequence and how model performance is affected when entire segments of the time-series are replaced or modified. Sequence analysis metric helps assess the ability of the model and the explanation method to capture and explain the significance of temporal patterns in the data.
For a univariate time-series $x$ and the relevance vector $r$, a time point $x_t$ in the sequence is chosen if the corresponding $r_t$ is larger than the $p$th percentile of $r$. Then, the time interval starting at each chosen time point is replaced with the mean of the sequence, and the resulting time-series is called $x^{\mathrm{seq}}$. Similar to perturbation analysis, the difference between the quality metric for the original sequence and the new sequence is called the sequence analysis metric. We can calculate this as a percentage of change in the quality metric, i.e., $(L(x^{\mathrm{seq}}) - L(x)) / L(x)$.
In summary, the perturbation analysis metric evaluates the influence of individual time points on the model's performance, while the sequence analysis metric assesses the impact of temporal patterns and trends. Both metrics together provide a more comprehensive evaluation of explanation methods, ensuring that they account for both individual time points and temporal patterns in time-series data.
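A minimal sketch of how the two scores can be computed for a univariate sequence is shown below. The specific replacement choices (zero for single time points, the series mean for intervals of an assumed length) and all names are our assumptions, in the spirit of Schlegel et al. [30].

```python
import numpy as np

def perturbation_and_sequence_scores(loss_fn, x, relevance, p=90, interval_len=5):
    """Sketch of the perturbation and sequence analysis metrics.

    loss_fn(seq) returns the model quality metric (e.g., binary cross-entropy)
    for one sequence; `relevance` holds the per-time-step explanation scores."""
    x = np.asarray(x, dtype=float)
    threshold = np.percentile(relevance, p)
    base_loss = loss_fn(x)

    # Perturbation analysis: set the most relevant individual time points to zero.
    x_pert = x.copy()
    x_pert[relevance > threshold] = 0.0
    perturbation_score = (loss_fn(x_pert) - base_loss) / base_loss

    # Sequence analysis: replace the interval after each relevant point with the mean.
    x_seq = x.copy()
    for t in np.where(relevance > threshold)[0]:
        x_seq[t:t + interval_len] = x.mean()
    sequence_score = (loss_fn(x_seq) - base_loss) / base_loss

    return perturbation_score, sequence_score
```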
3.5. Data sources
To test the model-agnostic explanation methods (e.g., WindowSHAP), we used three distinct clinical time-series data sets to develop and train three different deep learning prediction models. Two sets of clinical time-series data were derived from the prospective, multicenter Transforming Research and Clinical Knowledge in Traumatic Brain Injury (TRACK-TBI) study [31], while the third dataset came from MIMIC III EHR data [32]. TRACK-TBI includes detailed clinical data on nearly 3,000 Traumatic Brain Injury (TBI) patients from 18 academic Level I trauma hospitals throughout the United States. MIMIC III is a de-identified EHR data for nearly 40,000 intensive care unit (ICU) patients at Beth Israel Deaconess Medical Center, Boston, MA. We utilized both specialized and generic clinical domain data to demonstrate the applicability of our method to a wide range of clinical areas.
We used two subsets of data from the TRACK-TBI study to develop two distinct prediction models: (1) time-series EHR data collected during the hospital stay and (2) high-resolution physiologic data. The EHR time-series data comprise clinical variables collected during the initial five days of hospital stay across patients admitted to the hospital with TBI. We included 900 out of 2996 participants, who had outcome data and recordings of blood pressure for at least 12 hours in the first 48 hours of ICU stay. We developed a prediction model to predict the long-term functional outcome of patients using the Glasgow Outcome Scale-Extended (GOSE), a categorical outcome measure ranging from 1 (death) to 8 (upper good recovery), dichotomized as good outcome (GOSE 5–8) vs. poor outcome (GOSE 1–4). Detailed information on this dataset and the prediction model can be found in [33].
A subset of participants in the TRACK-TBI study (n = 25) also had high-resolution recordings of physiologic data using a bedside data aggregation system (Moberg Solutions, Inc; Ambler, PA). The waveform data for these individuals includes vital signs such as heart rate and arterial blood pressure, as well as intracranial monitoring data. A prediction model was developed and trained to predict an adverse event. Here, an adverse event is defined when intracranial pressure (ICP) is larger than 22 mmHg for at least 15 minutes.
The third clinical prediction model is based on the MIMIC-III data set and uses the initial 48 hours of clinical data to predict patient mortality in the subsequent 48 hours. The MIMIC time-series data include eight vital signs and twenty lab measurements. Missing values in vital signs were imputed using the mean, whereas forward imputation was employed for missing laboratory measurements, i.e., the last lab value was carried forward until a new measurement was obtained. A summary of data and model characteristics is included in Table 2.
Table 2.
Datasets characteristics
Dataset characteristic | TRACK-TBI EHR dataset | TRACK-TBI physiologic dataset | MIMIC-III dataset |
---|---|---|---|
Size of data (#samples, #time steps, #variables) | (900, 120, 62) | (5,816, 360, 8) | (22,988, 48, 26) |
Duration of each time step | 1 hour | 10 seconds | 1 hour |
Types of features | Vitals, lab measurements, GCS score components | Vital signs and intracranial data | Vital signs and lab measurements |
Outcome | Dichotomized GOSE score after 6 months | Adverse event of high ICP values (binary outcome) | Mortality after 48 hours |
Unfavorable label (%) | 22% | 8.8% | 10% |
Method of handling missing values | Imputation during training using GRU-D units | Imputation using linear interpolation | Forward imputation |
3.6. Implementation details
We developed three Recurrent Neural Network (RNN)-based prediction models and trained them on distinct clinical time-series datasets, including TBI EHR data, TBI physiologic data, and MIMIC-III data. All algorithms and prediction models were implemented in a Python 3 environment and are available online [34]. The prediction models were developed using the Keras library. The detailed specifications of each prediction model are described in Table 3.
Table 3.
Prediction models specifications
RNN model characteristic | TBI EHR data | TBI physiologic data | MIMIC-III data |
---|---|---|---|
Number of RNN units | 100 (GRU-D units) | 200 (GRU units) | 70 (GRU units) |
Number of neurons in the (first, second) layer after RNN | (50, 0) | (70, 30) | (40, 10) |
Loss function optimization algorithm | Adam | Adam | Adam |
Learning rate | 0.0002 | 0.0002 | 0.0002 |
Regularization rate | 0.208 | 0.004 | 0.01 |
RNN dropout rate | 0.42 | 0.3 | 0.4 |
RNN recurrent dropout rate | 0.58 | 0.3 | 0.4 |
Hidden layer dropout rate | 0.29 | 0.3 | 0.4 |
Batch size | 32 | 32 | 64 |
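As an illustration of how the Table 3 specifications translate into a Keras model, the following is a hypothetical sketch of the MIMIC-III configuration; the exact layer arrangement (e.g., where dropout and L2 regularization are applied) is our assumption and is not taken from the released code [34].

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_mimic_model(n_time_steps=48, n_variables=26):
    """Sketch of the MIMIC-III column of Table 3: a 70-unit GRU layer, dense
    layers of 40 and 10 neurons, dropout 0.4, and an L2 rate of 0.01."""
    model = models.Sequential([
        tf.keras.Input(shape=(n_time_steps, n_variables)),
        layers.GRU(70, dropout=0.4, recurrent_dropout=0.4,
                   kernel_regularizer=regularizers.l2(0.01)),
        layers.Dense(40, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(10, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # probability of mortality in the next 48 hours
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0002),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model
```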
To evaluate the quality of explanations generated by our algorithms, we compared the results to those of KernelSHAP and TimeSHAP, two baselines. To extract the Shapley values for each variable-time step combination, we modified the implementation of TimeSHAP such that the Shapley values of pruned variables and time steps are distributed uniformly among them. We utilized a grid search to determine the optimal parameter values for all explanation methods (see Table 4). For each dataset, 50 random samples were selected from the test dataset and explanations were generated using several techniques. We computed perturbation analysis and sequence analysis scores for each combination of prediction model and explanation algorithm.
Table 4.
Parameter values for explanation methods. KernelSHAP does not have any parameters to fix. The only parameter of TimeSHAP is the tolerance, which is related to its pruning mechanism. The parameters of our algorithms are described in Section 3.3.
Explanation Algorithm | TBI EHR data | TBI physiologic data | MIMIC-III |
---|---|---|---|
TimeSHAP | Tolerance = 0.05 | Tolerance = 0.05 | Tolerance = 0.05 |
Stationary WindowSHAP | | | |
Sliding WindowSHAP | | | |
Dynamic WindowSHAP | | | |
$l$: window length, $d$: stride, $\delta$: Shapley value threshold, $n_{\max}$: maximum number of windows
4. Results
In this section, we present the outcomes of quantitative analysis, computational complexity analyses, and qualitative analysis. As part of the quantitative analysis, we demonstrate how evaluation measures of WindowSHAP explanations compare against baseline approaches. Next, the computational complexity analysis demonstrates how WindowSHAP affects the runtime and memory utilization of Shapley value extraction. Finally, a qualitative comparison between KernelSHAP and WindowSHAP is offered based on the explanations of an exemplar patient record from the MIMIC-III dataset.
Quantitative analysis:
As the output of all prediction models is binary, the binary cross-entropy loss function was utilized as the quality metric in calculating the perturbation and sequence analysis scores. It is worth noting that a rise in the loss function indicates that the prediction model performs worse, and as a result, downstream performance indicators such as accuracy fall. Therefore, the greater the percent change in the loss function, the higher the explanation quality. Figure 5 depicts the outcomes of the experimental tests for the three datasets. WindowSHAP outperformed or provided results similar to its competitors (i.e., KernelSHAP and TimeSHAP).
Figure 5.
Evaluation metrics for all explanation algorithms. Each row of figures shows the result for one of the prediction models. The x axis in all figures represents the percentile p used in the metric definitions. The y axis represents the change in the quality metric after perturbing the most crucial time points. Error bars show the mean ± standard error of the mean of the binary loss function.
Computational complexity analysis:
The order of complexity for the WindowSHAP variants is less than KernelSHAP (Table 5).
Table 5.
Order of complexity of designed algorithms and the original implementation of Shapley values
KernelSHAP | Stationary WindowSHAP | Sliding WindowSHAP | Dynamic WindowSHAP |
---|---|---|---|
We evaluated the memory usage and CPU time of each suggested algorithm for various hyperparameter values. Figure 6 depicts the results of the complexity study performed on the TBI EHR data prediction model. The total number of variables is 62, and the length of time-series is 120 (each time step represents an hour). This data set was chosen to evaluate the computational complexity since it has the highest number of variable-time point combinations. It is noteworthy that the original implementation of Shapley values, KernelSHAP, has the same complexity as the Stationary WindowSHAP algorithm when the window length is one. In terms of computational complexity, WindowSHAP has significantly lower computational cost for generating Shapley values (Figure 6).
Figure 6.
Visualization of RAM usage and CPU time of different algorithms under WindowSHAP framework. Columns (a), (b), and (c) represent Stationary, Sliding, and Dynamic WindowSHAP algorithms respectively.
Qualitative analysis:
We use local explanations (i.e., the most important features) for a single patient record (see Figure 7) from the MIMIC-III dataset to illustrate how WindowSHAP differs from the original implementation of Shapley values, KernelSHAP. Based on the importance of each time step and variable, it is evident that the explanations of the two techniques are different. Dynamic WindowSHAP focuses more on the final time steps, whereas KernelSHAP assigns Shapley values to all time steps. Figure 8 displays the two techniques' explanations for the heart rate variable only.
Figure 7.
Heatmaps depicting the importance of all time steps for the important features for a certain patient record from the MIMIC-III dataset. The top 15 variables depicted on the y axis are ranked according to their importance. The darker the color is, the higher the absolute value of the assigned Shapley value is.
Figure 8.
The explanations of the heart rate variable for a patient in MIMIC mortality prediction model. The left and right plots represent visual explanations of WindowSHAP and KernelSHAP, respectively.
5. Discussion
Perturbation and sequence analysis metrics are central to evaluating explainability in time-series classifiers. These metrics work by first identifying the most influential time steps based on the explanation results. Then, the values at these time steps are replaced with non-informative values, creating a perturbed input. The loss function of the predictive model is then recalculated based on this new input. If the identified time steps were indeed critical to the model’s predictions, then this perturbation should lead to a significant increase in the loss function. This increase in the loss function serves as a measure of the importance of the identified time steps. Therefore, explanation methods that result in a larger increase in the loss function after perturbation are considered more effective. They are likely to have more accurately identified the key time steps and intervals, providing a more precise and reliable explanation based on the predictive model’s behavior.
WindowSHAP variants demonstrate superior explanatory performance (as illustrated in Figure 5) compared to other methods by inducing a larger change in the loss function value upon perturbation of the most critical time steps and intervals. However, for the TBI EHR dataset, the performance of WindowSHAP variants and TimeSHAP is not significantly different, but both outperform KernelSHAP. Only for the MIMIC dataset, KernelSHAP provides the highest quality explanation based on the sequence analysis metric, but the difference is negligible. Nonetheless, employing WindowSHAP remains justifiable, as it delivers explanations considerably faster than KernelSHAP for lengthy time-series data.
TimeSHAP is developed under the premise that the initial time steps in time-series data are of lesser importance. Consequently, it aggregates the initial time steps and assigns them a single Shapley value. This could be the primary reason behind TimeSHAP’s inferior performance for the TBI physiologic and MIMIC-III datasets. WindowSHAP, on the other hand, does not make any assumptions about the initial time steps. Instead, for example, Dynamic WindowSHAP aggregates adjacent non-important time steps regardless of their position within the sequence. This makes WindowSHAP more robust to the distribution of important timesteps in the data.
The strong performance of WindowSHAP in terms of explainability can also be attributed to two other aspects. First, by aggregating nearby time steps as a time window, WindowSHAP lowers the dependence of the elements (i.e., time windows) for which the Shapley values are calculated, hence improving the performance of the explanation. Second, as illustrated in Figure 1, by aggregating neighboring time steps as time windows, the Shapley values of adjacent time steps might cancel each other out because their absolute values are nearly identical but in opposite directions. This results in an extremely low Shapley value in the associated time window, demonstrating its true insignificance.
For the TBI EHR data prediction model with 62 variables and 120 time steps, KernelSHAP requires approximately 90 GB of RAM and 104 seconds to calculate the Shapley values. By increasing the length of the time window (i.e., decreasing the number of time windows) in Stationary WindowSHAP, RAM and CPU time are reduced exponentially. For example, we show that merging 10 adjacent time points can reduce the CPU time by 80%. The complexity of Stationary WindowSHAP depends on the length of the time window, whereas the complexity of Sliding WindowSHAP is independent of the window length and depends only on the stride value. Both the $\delta$ and $n_{\max}$ parameters affect the complexity of Dynamic WindowSHAP.
The explanations of KernelSHAP and WindowSHAP are different from each other. The results for the explanation of a mortality prediction from the MIMIC-III dataset demonstrate that while KernelSHAP assigns large Shapley values to nearly all of the feature space and the complete time spectrum, Dynamic WindowSHAP assigns greater Shapley values to the final time steps, which is more realistic and logical. Further, in Dynamic WindowSHAP, the length of the time windows increases as the variables become less significant (lower on the y axis in Figure 7), hence avoiding the calculation of Shapley values for less significant time points. The calculated Shapley values for heart rate show that Dynamic WindowSHAP separates the sequence into two sections and only calculates two Shapley values, whereas KernelSHAP assigns individual positive and negative values to scattered time points. Based on Dynamic WindowSHAP, the early part of the sequence contributes more toward the negative class (survival), whereas the second half contributes more toward the positive class (death). Since the second half has a higher heart rate than the first, it is more rational to assign two opposite Shapley values to the two segments. This is an illustration of how assigning Shapley values to windows, as opposed to scattered time points, makes explanations more comprehensible for end-users. The reason the algorithm allocates only two windows to this sequence is that the overall contribution of each window is not large enough (less than $\delta$), so the algorithm does not split them further.
WindowSHAP has significant clinical implications, as it can assist users in better comprehending complex time-series data obtained from electronic health records (EHRs), physiological monitoring devices, and other sources. By identifying critical time points and temporal patterns using the WindowSHAP framework, clinicians can gain valuable insights into underlying clinical processes and relationships. For example, WindowSHAP can help uncover hidden temporal patterns in physiological data [35], [36], such as vital signs or lab results, which may be indicative of disease progression or response to treatment. Additionally, it can be used to identify crucial time intervals in EHR data, shedding light on the relationship between specific medical events and patient outcomes [22], [37]–[39]. This, in turn, can guide clinicians in making more informed decisions about treatment strategies or intervention timing. In addition, the enhanced interpretability provided by WindowSHAP can help bridge the gap between sophisticated machine learning models and clinical decision-making, fostering clinicians’ trust and confidence in machine learning based tools.
Limitations:
The need to tune the parameters of each algorithm is one of the limitations of this study. We utilized a basic grid search to determine the optimal explanation algorithm parameters. However, the optimal values may vary from one data set to another, or even across data instances. For example, since Dynamic WindowSHAP adheres to the local accuracy property, the prediction outcome equals the sum of all Shapley values. Consequently, the effect of the threshold value ($\delta$) on the quality of the explanation depends on the model outcome. In other words, even for the same prediction model, a single $\delta$ would not provide decent explanations for different data instances. One potential future development for the Dynamic WindowSHAP method is to dynamically calculate the threshold based on the model output. Another limitation is that the WindowSHAP algorithms are designed to behave identically across variables. For instance, in the Stationary WindowSHAP technique, the length of the time windows is the same for all variables in the time-series data, even though it may be necessary to use different window lengths for different variables based on their relevance or rate of change.
6. Conclusion
Clinical machine learning models have strong prediction accuracies, but they are opaque because of their non-linear hierarchical structure, making it difficult to determine what details in the input data are causing specific predictions. While considerable effort has gone into understanding deep learning models, time-series models have received comparatively little attention. Our WindowSHAP framework offers a promising way to understand the behavior of all forms of time-series classifiers. Three distinct algorithms were created within the WindowSHAP framework and compared against baselines. The results demonstrate that the explanations provided by our algorithms are of higher quality, i.e., perturbing the most important time points identified by our explanations decreases the performance of the prediction models more. For instance, for the TBI physiologic data, our algorithms' explanations identify the most significant time points that, if perturbed, would result in a rise in the loss function that is more than twice that of the baseline techniques. Our study also demonstrates that by utilizing the WindowSHAP framework, the computational complexity related to explainability can be dramatically reduced.
Problem:
Many machine learning models for clinical time-series data are not explainable
What Is Already Known:
Existing approaches to explain prediction models are frequently specific to non-temporal architectures and data. Although the Shapley value is a well-known approach for explaining black-box prediction models, it was not originally designed to explain clinical time-series prediction models and is computationally expensive for such data.
What This Paper Adds:
This study aims to design and evaluate an explanation method (WindowSHAP) based on Shapley values that is (1) applicable to time-series data, (2) computationally feasible for high-resolution time-series data, and (3) able to tackle dependencies between sequential data points.
Acknowledgments
This material is based upon work supported by the National Science Foundation under grants #1838730 and #1838745. Dr. Foreman was supported by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health (K23NS101123). The content is solely the responsibility of the authors. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or of the National Institutes of Health. The authors acknowledge the TRACK-TBI Study Investigators for providing access to data used in this work.
Footnotes
Credit Author Statement
Conceptualization: Amin Nayebi, Chandan K Reddy, Vignesh Subbian
Methodology: Amin Nayebi, Chandan K Reddy, Sindhu Tipirneni, Vignesh Subbian
Software: Amin Nayebi
Data Curation: Brandon Foreman, Amin Nayebi, Vignesh Subbian
Validation: Brandon Foreman, Amin Nayebi, Vignesh Subbian
Writing- Original Draft: Amin Nayebi
Writing- Review and Editing: All
Project Admin & Funding Acquisition: Vignesh Subbian and Chandan K Reddy
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
References
- [1].Samek W, Wiegand T, and Müller K-R, “Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models,” arXiv:1708.08296, Aug. 2017, Accessed: Mar. 02, 2022. [Online]. Available: http://arxiv.org/abs/1708.08296
- [2].Rudin C, “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead,” Nat Mach Intell, vol. 1, no. 5, p. 206, May 2019, doi: 10.1038/S42256-019-0048-X.
- [3].Walsh CG et al., “Stigma, biomarkers, and algorithmic bias: recommendations for precision behavioral health with artificial intelligence,” JAMIA Open, vol. 3, no. 1, pp. 9–15, Apr. 2020, doi: 10.1093/JAMIAOPEN/OOZ054.
- [4].Selbst A. and Powles J, “‘Meaningful Information’ and the Right to Explanation,” in Proceedings of the 1st Conference on Fairness, Accountability and Transparency, PMLR, Jan. 2018, pp. 48–48. Accessed: Sep. 20, 2022. [Online]. Available: https://proceedings.mlr.press/v81/selbst18a.html
- [5].Lipton ZC, Kale DC, and Wetzel R, “Modeling Missing Data in Clinical Time Series with RNNs,” vol. 56, 2016, [Online]. Available: http://arxiv.org/abs/1606.04130
- [6].Shamout FE, Zhu T, Sharma P, Watkinson PJ, and Clifton DA, “Deep Interpretable Early Warning System for the Detection of Clinical Deterioration,” IEEE J Biomed Health Inform, vol. 24, no. 2, pp. 437–446, Feb. 2020, doi: 10.1109/JBHI.2019.2937803.
- [7].Alberdi Aramendi A, Weakley A, Aztiria Goenaga A, Schmitter-Edgecombe M, and Cook DJ, “Automatic assessment of functional health decline in older adults based on smart home data,” J Biomed Inform, vol. 81, pp. 119–130, May 2018, doi: 10.1016/J.JBI.2018.03.009.
- [8].Teixeira Rodrigues A, Roque F, Piñeiro-Lamas M, Falcão A, Figueiras A, and Herdeiro MT, “Effectiveness of an intervention to improve antibiotic-prescribing behaviour in primary care: a controlled, interrupted time-series study,” Journal of Antimicrobial Chemotherapy, vol. 74, no. 9, pp. 2788–2796, Sep. 2019, doi: 10.1093/JAC/DKZ244.
- [9].Eshghali M, Kannan D, Salmanzadeh-Meydani N, and Esmaieeli Sikaroudi AM, “Machine learning based integrated scheduling and rescheduling for elective and emergency patients in the operating theatre,” Ann Oper Res, pp. 1–24, Jan. 2023, doi: 10.1007/S10479-023-05168-X.
- [10].Lundberg SM and Lee SI, “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems, Neural information processing systems foundation, May 2017, pp. 4766–4775. Accessed: Mar. 24, 2021. [Online]. Available: http://arxiv.org/abs/1705.07874
- [11].Gong K, Lee HK, Yu K, Xie X, and Li J, “A prediction and interpretation framework of acute kidney injury in critical care,” J Biomed Inform, vol. 113, p. 103653, Jan. 2021, doi: 10.1016/J.JBI.2020.103653.
- [12].Juraev F, El-Sappagh S, Abdukhamidov E, Ali F, and Abuhmed T, “Multilayer dynamic ensemble model for intensive care unit mortality prediction of neonate patients,” J Biomed Inform, vol. 135, p. 104216, Nov. 2022, doi: 10.1016/J.JBI.2022.104216.
- [13].Yan C. et al., “Predicting brain function status changes in critically ill patients via Machine learning,” Journal of the American Medical Informatics Association, vol. 28, no. 11, pp. 2412–2422, Oct. 2021, doi: 10.1093/JAMIA/OCAB166.
- [14].Roder J, Maguire L, Georgantas R, and Roder H, “Explaining multivariate molecular diagnostic tests via Shapley values,” BMC Med Inform Decis Mak, vol. 21, no. 1, pp. 1–18, Dec. 2021, doi: 10.1186/S12911-021-01569-9.
- [15].Peng J. et al., “An Explainable Artificial Intelligence Framework for the Deterioration Risk Prediction of Hepatitis Patients,” J Med Syst, vol. 45, no. 5, pp. 1–9, May 2021, doi: 10.1007/S10916-021-01736-5.
- [16].Nayebi A, Tipirneni S, Foreman B, Reddy CK, and Subbian V, “An Empirical Comparison of Explainable Artificial Intelligence Methods for Clinical Data: A Case Study on Traumatic Brain Injury,” Aug. 2022, doi: 10.48550/arxiv.2208.06717.
- [17].Utkin LV and Konstantinov AV, “Ensembles of Random SHAPs,” Mar. 2021, doi: 10.48550/arxiv.2103.03302.
- [18].Jullum M, Redelmeier A, and Aas K, “groupShapley: Efficient prediction explanation with Shapley values for feature groups,” Jun. 2021, Accessed: Oct. 06, 2021. [Online]. Available: http://arxiv.org/abs/2106.12228
- [19].Aas K, Jullum M, and Løland A, “Explaining individual predictions when features are dependent: More accurate approximations to Shapley values,” Artif Intell, vol. 298, p. 103502, Sep. 2021, doi: 10.1016/J.ARTINT.2021.103502.
- [20].Rojat T, Puget R, Filliat D, Del Ser J, Gelin R, and Díaz-Rodríguez N, “Explainable Artificial Intelligence (XAI) on TimeSeries Data: A Survey,” Apr. 2021, Accessed: Jul. 14, 2022. [Online]. Available: http://arxiv.org/abs/2104.00950
- [21].Abdelsalam Ismail A, Gunady M, Pessoa L, Corrada Bravo H, and Feizi S, “Input-Cell Attention Reduces Vanishing Saliency of Recurrent Neural Networks,” in Advances in Neural Information Processing Systems, 2019. Accessed: Aug. 07, 2022. [Online]. Available: https://github.com/ayaabdelsalam91/Input-Cell-Attention
- [22].Choi E, Bahadori MT, Kulas JA, Schuetz A, Stewart WF, and Sun J, “Retain: An interpretable predictive model for healthcare using reverse time attention mechanism,” arXiv preprint arXiv:1608.05745, 2016.
- [23].Song H, Rajan D, Thiagarajan JJ, and Spanias A, “Attend and Diagnose: Clinical Time Series Analysis Using Attention Models,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, pp. 4091–4098, Apr. 2018, doi: 10.1609/AAAI.V32I1.11635.
- [24].Guo T, Lin T, and Antulov-Fantulin N, “Exploring interpretable LSTM neural networks over multi-variable data,” 36th International Conference on Machine Learning, ICML 2019, vol. 2019-June, pp. 4424–4440, 2019.
- [25].Simonyan K, Vedaldi A, and Zisserman A, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” 2nd International Conference on Learning Representations, ICLR 2014 - Workshop Track Proceedings, Dec. 2013, doi: 10.48550/arxiv.1312.6034.
- [26].Shrikumar A, Greenside P, and Kundaje A, “Learning important features through propagating activation differences,” in International Conference on Machine Learning, PMLR, 2017, pp. 3145–3153.
- [27].Shapley LS, “A value for n-person games,” Contributions to the Theory of Games, pp. 307–317, 1953, Accessed: Jun. 26, 2022. [Online]. Available: https://cir.nii.ac.jp/crid/1572261550093228288
- [28].Bento J, Saleiro P, Cruz AF, Figueiredo MAT, and Bizarro P, “TimeSHAP: Explaining Recurrent Models through Sequence Perturbations,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, New York, NY, USA: ACM, Aug. 2021, pp. 2565–2573, doi: 10.1145/3447548.3467166.
- [29].Sood A and Craven M, “Feature Importance Explanations for Temporal Black-Box Models,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, pp. 8351–8360, Jun. 2022, doi: 10.1609/aaai.v36i8.20810.
- [30].Schlegel U, Arnout H, El-Assady M, Oelke D, and Keim DA, “Towards a rigorous evaluation of XAI methods on time series,” Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019, pp. 4197–4201, Oct. 2019, doi: 10.1109/ICCVW.2019.00516.
- [31].Yue JK et al., “Transforming Research and Clinical Knowledge in Traumatic Brain Injury Pilot: Multicenter Implementation of the Common Data Elements for Traumatic Brain Injury,” Journal of Neurotrauma, vol. 30, no. 22, pp. 1831–1844, Oct. 2013, doi: 10.1089/NEU.2013.2970.
- [32].Johnson AEW et al., “MIMIC-III, a freely accessible critical care database,” Sci Data, vol. 3, 2016, doi: 10.1038/sdata.2016.35.
- [33].Nayebi A, Tipirneni S, Foreman B, Ratcliff J, Reddy CK, and Subbian V, “Recurrent Neural Network based Time-Series Modeling for Long-term Prognosis Following Acute Traumatic Brain Injury,” Proceedings of the American Medical Informatics Association Annual Symposium, 2021. Accessed: Mar. 07, 2022. [Online]. Available: https://par.nsf.gov/servlets/purl/10300939
- [34].“WindowSHAP.” https://github.com/vsubbian/WindowSHAP (accessed Nov. 01, 2022).
- [35].Ghaderi H, Foreman B, Nayebi A, Tipirneni S, Reddy CK, and Subbian V, “Identifying TBI Physiological States by Clustering of Multivariate Clinical Time-Series,” Mar. 2023. Accessed: Jun. 20, 2023. [Online]. Available: https://arxiv.org/abs/2303.13024v2
- [36].Zhang D, Yao L, Chen K, Wang S, Chang X, and Liu Y, “Making Sense of Spatio-Temporal Preserving Representations for EEG-Based Human Intention Recognition,” IEEE Trans Cybern, vol. 50, no. 7, pp. 3033–3044, Jul. 2020, doi: 10.1109/TCYB.2019.2905157.
- [37].Gandin I, Scagnetto A, Romani S, and Barbati G, “Interpretability of time-series deep learning models: A study in cardiovascular patients admitted to Intensive care unit,” J Biomed Inform, vol. 121, p. 103876, Sep. 2021, doi: 10.1016/j.jbi.2021.103876.
- [38].Lauritsen SM et al., “Explainable artificial intelligence model to predict acute critical illness from electronic health records,” Nature Communications, vol. 11, no. 1, pp. 1–11, Jul. 2020, doi: 10.1038/s41467-020-17431-x.
- [39].Zhang X. et al., “INPREM: An Interpretable and Trustworthy Predictive Model for Healthcare,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 450–460.