PLOS One. 2025 Jul 8;20(7):e0327636. doi: 10.1371/journal.pone.0327636

Edges are all you need: Potential of medical time series analysis on complete blood count data with graph neural networks

Daniel Walke 1,2,*, Daniel Steinbach 3,4, Sebastian Gibb 3,5, Thorsten Kaiser 6, Gunter Saake 2, Paul C Ahrens 3, David Broneske 7, Robert Heyer 8,9
Editor: Qiang He 10
PMCID: PMC12237013  PMID: 40627631

Abstract

Purpose

Machine learning is a powerful tool to develop algorithms for clinical diagnosis. However, standard machine learning algorithms are not perfectly suited for clinical data since the data are interconnected and may contain time series. As shown for recommender systems and molecular property predictions, Graph Neural Networks (GNNs) may represent a powerful alternative to exploit the inherently graph-based properties of clinical data. The main goal of this study is to evaluate when GNNs represent a valuable alternative for analyzing large clinical data from the clinical routine on the example of Complete Blood Count Data.

Methods

In this study, we evaluated the performance and time consumption of several GNNs (e.g., Graph Attention Networks) on similarity graphs compared to simpler, state-of-the-art machine learning algorithms (e.g., XGBoost) on the classification of sepsis from blood count data as well as the importance and slope of each feature for the final classification. Additionally, we connected complete blood count samples of the same patient based on their measured time (patient-centric graphs) to incorporate time series information in the GNNs. As our main evaluation metric, we used the Area Under Receiver Operating Curve (AUROC) to have a threshold independent metric that can handle class imbalance.

Results and Conclusion

Standard GNNs on evaluated similarity-graphs achieved an Area Under Receiver Operating Curve (AUROC) of up to 0.8747 comparable to the performance of ensemble-based machine learning algorithms and a neural network. However, our integration of time series information using patient-centric graphs with GNNs achieved a superior AUROC of up to 0.9565. Finally, we discovered that feature slope and importance highly differ between trained algorithms (e.g., XGBoost and GNN) on the same data basis.

1. Introduction

Recently, artificial intelligence (AI) showed its great potential in several biological and medical applications, such as diagnosing heart diseases [1] and chronic kidney disease from input matrices with clinical data [2]. For such classification and prediction tasks, researchers proposed several modern state-of-the-art machine learning algorithms (e.g., XGBoost [3]) within the last decades. However, real-world data such as medical data are often connected (e.g., time-dependent measurements of the same patient) [4]. These connections can carry valuable information which helps in increasing the predictive power of machine learning algorithms. However, this information is mostly neglected by state-of-the-art machine learning algorithms [3,5] since they consider data points as independent. Graphs are data structures that can store such connections. A graph G is a non-empty finite set of elements called nodes V(G) and finite set E(G) of distinct unordered pairs of distinct elements of V(G) called edges [6]. Each node and each edge can have features (attributes) attached for a more detailed data characterization. Furthermore, graphs can either have one node type and edge type (homogeneous graph) or multiple node and edge types (heterogeneous graph). An example for a homogeneous graph is a medical graph containing patients with attached features (e.g., age and lab measurements) as nodes connected by edges based on their similarity. In a heterogeneous graph, we store additional features (e.g., lab features) as separate node types and connect patient nodes with their respective feature nodes.

Graphs can be analyzed using graph learning with several algorithms, such as Graph Neural Networks (GNNs) [7–12]. GNNs sample information (i.e., features) from neighboring nodes, transform this information (e.g., linear transformation with a subsequent activation function), and finally aggregate (e.g., averaging) the transformed information. Sampling, transformation, and aggregation are performed for each node and repeated for a predefined number of iterations (i.e., GNN layers). While all GNNs are based on these steps (i.e., sampling, transformation, and aggregation), they can differ in their architecture (e.g., different sampling and aggregation strategies or the use of attention mechanisms [13]) [14]. GNNs have the advantage that they can utilize attached features and parallelize computations on modern hardware (e.g., GPUs) in contrast to using embedding techniques like DeepWalk [15] or Node2Vec [16]. They already showed promising results for predicting pediatric sepsis based on several groups of laboratory tests (e.g., medical history and serological tests) using similarity graphs [17]. However, there are currently two main challenges for the application of GNNs on medical data:

  • A

    Although GNNs showed great potential in diverse applications [18–20] including medical applications [21–24], the performance (e.g., Area Under Receiver Operating Curve, AUROC) of GNNs in applications for clinical decision support solely based on complete blood count (i.e., hemoglobin, red blood cells, white blood cells, platelets and mean corpuscular volume) data in adults from the clinical routine is currently unclear. Furthermore, GNNs might facilitate and improve the analysis of time series information compared to current deep learning approaches like LSTMs, 1D-CNNs and Transformer-decoder models by natively (i.e., without additional padding and masking of input data) supporting time series of varying length.

  • B

    Although interpretability mechanisms exist to estimate the importance of individual features for machine learning algorithms and GNNs, to the best of our knowledge there is no framework revealing the influence of individual features’ directions to the predictions. However, such partial dependencies (e.g., increased sepsis risk for high white blood cell count) are even more important and valuable for clinical applications.

To overcome the unclear performance on complete blood count data from the clinical routine (challenge A), we evaluated the performance of GNNs (GraphSAGE [7], Graph Attention Networks [8], Graph Attention Networks version 2 [25], Graph Isomorphism Networks [10], Heterogeneous Graph Transformer [12], and Heterogeneous Attention Networks [11]) on medical data [26,27] against shallow and ensemble-based machine learning algorithms and a neural network. The selected dataset [26] contains complete blood count data (five blood parameters and additional age and sex) classified as “sepsis” or “control” (“not sepsis”). Sepsis is a life-threatening organ dysfunction caused by a dysregulated immune response to an infection [28]. The inflammatory response is driven through the release of cytokines from neutrophil granulocytes and macrophages. Blood parameters like white blood cells, red blood cells, platelets, hemoglobin and mean corpuscular volume might serve as easily available indicators for sepsis [26] (S1 Note). Sepsis is still one of the leading causes of death in critically ill patients worldwide [29,30] and is well-studied [26,31]. An early prediction of sepsis allows fast initiation of an appropriate treatment (e.g., with antibiotics) [32]. GNNs might serve as a useful tool for classifying sepsis based on two different assumptions regarding similarity and incorporation of time-series information:

  1. Similarity: Instances with similar feature values usually have the same classification labels. Therefore, similarity graphs (graphs connecting instances with similar feature values) might increase the classification performance by potentially connecting sepsis measurements with each other. Applying GNNs on similarity graphs could potentially increase the classification performance compared to other state-of-the-art machine learning algorithms that do not make use of such similarities between instances.

  2. Time-series information: Each patient can have multiple measurements at different times during the hospitalization. The time-series information of a single patient could help in identifying sepsis measurements during the stay. This time-series information could be represented as graphs based on measurement times (patient-centric graphs) and analyzed using GNNs.

Furthermore, we evaluated feature importance and partial dependence to increase the interpretability of the used models. To this end, we adapted the partial dependence [33] implementation from scikit-learn [34] so that it can also be applied to GNNs and PyTorch neural networks (challenge B).

2. Methods

In this section, we describe and explain the workflow of our study to evaluate the performance of graph learning algorithms (Fig 1). First, we pre-processed the complete blood count dataset from Steinbach et al. [26,27] (Fig 1A). Then, we constructed several graph structures from the dataset and applied GNNs on them (Fig 1B). Afterwards, we benchmarked the GNNs against several other machine learning algorithms and measured their required training time (Fig 1C). Additionally, we evaluated the importance of individual features for the classification of sepsis (Fig 1D). Finally, we evaluated the performance of Graph Attention Networks (GAT) on several patient-centric graph structures.

Fig 1. Workflow of our study to evaluate the performance of GNNs compared to benchmark algorithms.

First, the dataset [27] is pre-processed (A) according to the work of Steinbach et al. [26], resulting in a training set and two test sets (internal and external test set). Then, we constructed graph structures based on these datasets (B I.) and applied GNNs on them (B II.). GNNs sample information (i.e., features) from neighboring nodes, transform this information (e.g., linear transformation with a subsequent activation function), and finally aggregate (e.g., averaging) the transformed information [14]. Here, we visualized the architecture of one GraphSAGE [7] layer, which adds linearly transformed features of the target node to the aggregated neighborhood features (B II.). Afterwards, we compare the results of our GNNs against tree-based (Decision Tree, Random Forest, RUSBoost and XGBoost) (C I.) and non-tree-based (Logistic Regression and a neural network) (C II.) benchmark models. Finally, we evaluated partial dependence of each feature (age, sex, hemoglobin, red blood cells, mean corpuscular volume, white blood cells and platelets) in the dataset to increase interpretability and transparency of the trained models (D). To this end, we first calculated the average predictions of each feature for various grid values for each model. Then, we plotted the resulting average predictions of the trained models over all evaluated grid values for each feature. Finally, we calculated and normalized the variance of the features’ average predictions among all features for each model to evaluate the influence (i.e., importance) of each feature on the average prediction across all grid values. A high normalized variance (near one) indicates a high importance over different grid values and a low normalized variance (near zero) indicates a low importance.

2.1 Pre-processing and setup

The dataset from Steinbach et al. [26,27] contains patients admitted to non-intensive care units of German tertiary care centers in Leipzig (internal dataset) and Greifswald (external dataset) between January 2014 and December 2021 [26]. Each patient can have multiple complete blood counts (i.e., rows in the dataset). Each complete blood count measurement contains a patient id, age, biological sex (i.e., only male or female were reported), five blood parameters (hemoglobin, red blood cells, white blood cells, mean corpuscular volume, and platelets), a binary label (“sepsis” or “not sepsis”), and information on where and when the measurement was performed. The functions and the potential relevance of the individual blood parameters for sepsis are discussed in the supplement (S1 Note). We pre-processed the dataset and separated it into train, internal and external test sets according to Steinbach et al. [26] (Fig 1A). To visualize the distribution of each feature, we plotted violin plots for each continuous feature (S2 Fig). Afterward, we analyzed the data distribution of each set.

We used the following setup for all analyses:

  • Mainboard: Supermicro X12SPA-TF

  • CPU: Intel® Xeon® Scalable Processor “Ice Lake” Gold 6338, 2.0 GHz, 32 Cores

  • GPU: NVIDIA® RTX A6000 (48 GB GDDR6)

  • RAM: 8x32 GB DDR4–3200

  • ROM: 2TB Samsung SSD 980 PRO, M.2

2.2 Graph construction and analysis

After pre-processing, we constructed two similarity graphs from the complete blood count data and analyzed them using GNNs (Fig 1B). The first similarity graph is a homogeneous k-nearest neighbors (k-nn) graph (Fig 4B). It contains a node for each complete blood count measurement and connects them directly based on similarity (normalized Euclidean distance of the features). The second graph is a heterogeneous similarity graph (Fig 4C). It contains a patient sample node for each complete blood measurement and nodes with discretized values (lower and upper limit of the discretization) for each blood parameter as similarity comparison. Discretization was performed based on ten percentiles to have less sensitivity against outliers. Each complete blood count node contains standard normalized patient features (age, sex, hemoglobin, red blood cells, white blood cells, mean corpuscular volume and platelets) similar to the homogeneous graph. With this heterogeneous graph structure, patients are indirectly connected via similar blood parameters.
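The homogeneous k-nn graph described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python instead of PyTorch Geometric, and the function names (`standardize`, `knn_edges`), the toy measurements and the choice of `k` are assumptions made for illustration only.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Standard-normalize each column (zero mean, unit variance)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Constant columns would have a standard deviation of zero; fall back to 1.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


def knn_edges(rows, k):
    """Return directed edges (i, j) from each node i to its k nearest neighbors j."""
    scaled = standardize(rows)
    edges = []
    for i, a in enumerate(scaled):
        # Euclidean distance on the normalized features, excluding the node itself.
        distances = [(math.dist(a, b), j) for j, b in enumerate(scaled) if j != i]
        for _, j in sorted(distances)[:k]:
            edges.append((i, j))
    return edges


# Hypothetical toy measurements: (age, white blood cells, platelets).
measurements = [
    (70, 15.0, 90),
    (72, 14.0, 100),
    (30, 6.0, 250),
    (28, 5.5, 260),
]
edges = knn_edges(measurements, k=1)
```

In this toy example the two older patients with high white blood cell counts are linked to each other, and so are the two younger patients, which is the homophily assumption the similarity graph relies on. The pairwise distance loop is quadratic in the number of nodes, so a dataset of realistic size would need an indexed nearest-neighbor search instead.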

Fig 4. Complete blood counts of a single patient (A), design of used similarity graphs (B, C), patient-centric graphs (D-F) and visualization of attention-weights/influence in patient-centric graphs after training (G-I).

Each patient can have multiple blood count measurements ordered sequentially from 1 to 5 (A). Blood count measurement nodes labeled as “control” are highlighted in blue and measurements labeled as “sepsis” are highlighted in red. First, we constructed two similarity graphs, a homogeneous k-nn graph (B) and a heterogeneous graph (C). Note that the similarity graphs can comprise measurements from different patients depending on their feature values. In the homogeneous similarity graph (B) the black edges (represented as lines) are constructed based on the k-nearest neighbors of each blood count measurement node. The k-nearest neighbors are constructed using the Euclidean distance of standard normalized sex, age, hemoglobin, red blood cells, white blood cells, mean corpuscular volume, and platelets. The heterogeneous similarity graph (C) contains five additional node types, one node type for each blood parameter (i.e., hemoglobin, red blood cells, white blood cells, mean corpuscular volume, and platelets). Note that we have only visualized one additional node type for simplicity. Each blood parameter node type has m nodes, where m denotes the number of percentiles we want to consider (in this example m = 3). Furthermore, each blood parameter node contains three features (i.e., minimum and maximum absolute value of the respective percentile interval and the upper percentile value) (not visualized). In this example, we divided the blood parameter white blood cells into three non-overlapping percentile intervals (i.e., the lowest 33.33%, 33.33%–66.67%, and finally 66.67%–100% of the values). Then, each blood count node is connected to the respective interval for each blood parameter. Thereby, blood count nodes are indirectly connected based on similarity.
Since the similarity graphs do not consider multiple measurements of the same patient, we created patient-centric graphs (D-F). Based on the measurements of a single patient, we can construct a directed graph (D), reversed directed graph (E), and undirected graph (F). In the directed graph (D) measurements are connected with all previous measurements. The reversed directed graph (E) connects all following measurements to the present measurement. Finally, the undirected graph connects all measurements with each other independent from their order. Edges are colored based on the label of the source node. Afterwards, we trained graph attention networks (GAT) on these graphs either with or without positional encodings (to represent their order in the sequence). GAT employs an attention mechanism which finally leads to a weighted sum (attention) of a neighbors’ measurements. We have schematically visualized the attention weights (i.e., their weight/influence on the target node) by the edge thickness for each graph (G-I). After training, nodes targeting nodes with the same label (e.g., control node to control node) have relatively high influences (thick edge), while nodes targeting nodes with a different label (e.g., sepsis node to control node) have relatively small influences (thin edges).
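The percentile-based connection of blood count nodes to blood parameter nodes in the heterogeneous similarity graph (C) can be sketched as follows. This is a simplified illustration under stated assumptions: `percentile_interval_edges` is a hypothetical helper, the white blood cell values are invented, and ties are broken by rank rather than by the exact percentile limits used in the study.

```python
def percentile_interval_edges(values, m):
    """Connect each sample index to one of m equal-frequency interval nodes.

    Returns edges (sample_index, interval_index). Samples that share an
    interval node become indirectly connected through that node.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    edges = []
    for rank, sample in enumerate(order):
        # Map the rank of the sample to one of the m percentile intervals.
        interval = min(rank * m // len(values), m - 1)
        edges.append((sample, interval))
    return sorted(edges)


# Hypothetical white blood cell counts in Gpt/l for six measurements.
white_blood_cells = [4.2, 6.1, 7.5, 9.8, 12.3, 18.0]
edges = percentile_interval_edges(white_blood_cells, m=3)
```

With m = 3, the two lowest values attach to interval node 0, the two middle values to node 1, and the two highest values to node 2, which mirrors the three-interval example in panel C.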

The basic assumption behind these graph structures is that similar complete blood counts might have the same label (homophily) and that these similarities might therefore increase the classification performance. Afterward, we applied several GNNs (GraphSAGE [7], GAT [8], GATv2 [25], GIN [10], HGT [12], HAN [11]) with two layers (128 neurons) and a learning rate of 0.0003 using PyTorch Geometric [35]. Due to memory constraints while training on the heterogeneous graph, we reduced the size of the hidden dimension of GAT with two attention heads, GATv2 with two attention heads, HGT, and HAN to 64 dimensions. We trained the GNNs for 10,000 epochs with an early stopping after an increase of the validation loss for ten consecutive epochs similar to the work of Kipf and Welling [9]. We chose a high epoch number and relatively low learning rate to guarantee sufficient training (i.e., preventing under-fitting) while also preventing over-fitting by applying early stopping on a separate validation set. All GNNs were evaluated using AUROC, F1-Macro score and Matthews Correlation Coefficient (MCC).
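To make the sample, transform and aggregate steps of a GNN layer concrete, the following plain-Python sketch implements one GraphSAGE-style layer with mean aggregation. It is not the PyTorch Geometric code used in the study: the helper names, the ReLU activation, the identity weight matrices and the toy graph are all assumptions chosen so that the result is easy to check by hand.

```python
def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) with a feature vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def sage_layer(features, neighbors, w_self, w_neigh):
    """One layer: ReLU(W_self * x_i + W_neigh * mean of neighbor features)."""
    outputs = []
    for i, x in enumerate(features):
        nbrs = neighbors.get(i, [])
        if nbrs:
            # Aggregate: average each feature dimension over the neighbors.
            aggregated = [
                sum(features[j][d] for j in nbrs) / len(nbrs)
                for d in range(len(x))
            ]
        else:
            aggregated = [0.0] * len(x)
        # Transform both parts linearly and add them.
        combined = [
            s + n
            for s, n in zip(matvec(w_self, x), matvec(w_neigh, aggregated))
        ]
        outputs.append([max(0.0, value) for value in combined])
    return outputs


# Hypothetical toy graph with three nodes and two features per node.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = sage_layer(features, neighbors, identity, identity)
```

Stacking two such layers, as in the two-layer models described above, lets each node receive information from neighbors up to two hops away.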

2.3 Benchmarking

As benchmarks, we evaluated the performance (AUROC, F1-Macro score, and MCC) of tree-based and non-tree-based algorithms (Fig 1C) to get a comprehensive performance evaluation across several algorithms. As tree-based algorithms, we used a Decision Tree and three ensemble-based algorithms (i.e., Random Forest [5], RUSBoost [36], and XGBoost [3]). As non-tree-based algorithms, we used a Logistic Regression and a neural network. The neural network was implemented in PyTorch [37] and used standard normalized complete blood count features with two layers (128 neurons) and a learning rate of 0.0003 similar to the GNNs. The neural network was trained for 10,000 epochs with an early stopping after an increase of the validation loss for ten consecutive epochs. The high epoch number in combination with a low learning rate should prevent under-fitting. Early-stopping on a separate validation set was used to prevent over-fitting. Hyperparameters for the Logistic Regression and all tree-based algorithms were tuned using grid search with 10-fold cross validation using sklearn (version 1.2.2) [38]. To test the robustness of each benchmark against noise, we added 10 and 100 noisy features (random features between 0 and 1) to the dataset and evaluated the performance of each benchmark algorithm and GraphSAGE [7] as a representative of GNNs.
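The early-stopping rule used for the neural network and the GNNs can be sketched as below. The sketch assumes one literal reading of the rule: training stops once the validation loss has risen, relative to the previous epoch, for `patience` epochs in a row. The study does not spell out the exact comparison, and `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(validation_losses, patience):
    """Return the epoch index at which training stops, or None if it never stops."""
    consecutive_increases = 0
    for epoch in range(1, len(validation_losses)):
        if validation_losses[epoch] > validation_losses[epoch - 1]:
            consecutive_increases += 1
        else:
            # Any non-increasing epoch resets the counter.
            consecutive_increases = 0
        if consecutive_increases >= patience:
            return epoch
    return None


# Hypothetical validation losses per epoch.
losses = [0.70, 0.55, 0.50, 0.52, 0.49, 0.51, 0.53, 0.56]
stop = early_stopping_epoch(losses, patience=3)
```

A common alternative compares each epoch against the best loss seen so far instead of the previous epoch. Which variant was used is not stated here, so this sketch should not be read as the authors' exact criterion.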

2.4 Partial dependence and feature importance

After performance evaluation, we evaluated the partial dependence [33] of each feature (Fig 1D) for each model to evaluate the influence of different feature values on the average prediction (i.e., the ratio of sepsis predictions). To this end, we implemented a function similar to scikit-learn [38,39] but with compatibility for GNNs and the required data transformations (e.g., standard normalization). We used a grid resolution of 100 (i.e., 100 different values for each feature) between the 5% percentile and 95% percentile of each feature. If the number of unique values is below the grid resolution (e.g., sex contains only two discrete values in the input data), only these unique values were used for the evaluation of the average prediction. Afterwards, we plotted the average prediction over the respective grid values for each algorithm feature-wise using matplotlib as line charts. For the compatibility of GNNs, we implemented a “predict_proba” function in each PyTorch model (neural network and GNNs) which returns the prediction probabilities of each class similar to scikit-learn models. Finally, we calculated the variance of the average prediction over all features to evaluate which features had the highest influence (i.e., importance) among all features for each algorithm. The variance of each feature was normalized over the sum of all variances of the respective algorithm to obtain values between zero (i.e., indicating a low importance) and one (i.e., indicating a high importance). These normalized variance values were then plotted and hierarchically clustered (Euclidean distance) using seaborn’s clustermap. We evaluated the partial dependence for all benchmark algorithms and for GraphSAGE (homogeneous and heterogeneous) as a representative of GNNs. GraphSAGE was chosen as the final model since it achieved a reliable classification performance on the homogeneous and heterogeneous similarity graphs.
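The partial dependence and the variance-based importance described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' function: `partial_dependence`, `normalized_importance` and `toy_predict_proba` are hypothetical names, the toy model stands in for any trained model that exposes a probability for the sepsis class, and the grids are much coarser than the 100 values between the 5% and 95% percentiles used in the study.

```python
from statistics import mean, pvariance


def partial_dependence(predict_proba, rows, feature, grid):
    """Average predicted probability when `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        # Overwrite one feature for every row and keep all other features as measured.
        modified = [row[:feature] + [value] + row[feature + 1:] for row in rows]
        averages.append(mean(predict_proba(row) for row in modified))
    return averages


def normalized_importance(dependence_by_feature):
    """Variance of each partial dependence curve, divided by the sum of all variances."""
    variances = {
        name: pvariance(curve) for name, curve in dependence_by_feature.items()
    }
    total = sum(variances.values()) or 1.0
    return {name: variance / total for name, variance in variances.items()}


def toy_predict_proba(row):
    """Hypothetical stand-in model: risk rises with feature 0 and ignores feature 1."""
    return min(1.0, max(0.0, 0.01 * row[0]))


rows = [[20.0, 0.0], [50.0, 1.0], [80.0, 0.0]]
curves = {
    "age": partial_dependence(toy_predict_proba, rows, 0, [10.0, 50.0, 90.0]),
    "sex": partial_dependence(toy_predict_proba, rows, 1, [0.0, 1.0]),
}
importance = normalized_importance(curves)
```

Because the toy model ignores the second feature, its curve is flat and its normalized importance is zero, while the first feature receives all of the importance. A rising curve corresponds to the positive gradient discussed for features such as age.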

2.5 Patient-centric graphs

The constructed similarity graphs are measurement-centric, i.e., they do not consider multiple measurements of the same patient (Fig 4A-C) for the classification. To incorporate multiple measurements of the same patient, we constructed several patient-centric graphs (Fig 4D-F). In these graphs, a node represents standard normalized complete blood counts and edges represent connections from previous to following measurements (directed graph, Fig 4D), following to previous measurements (reversed directed graph, Fig 4E) or connecting all measurements of the same patient with each other independent of their order (undirected graph, Fig 4F). Afterward, we added positional encodings on each node to represent the position of the measurement in the sequence of measurements. We applied Graph Attention Networks [8] with two layers (128 neurons), a learning rate of 0.0003, a batch size of 50,000 and trained the GNN for 10,000 epochs with an early stopping after an increase of the validation loss for five consecutive epochs. To prevent under-fitting, we used a high number of epochs and a low learning rate during training. We used early-stopping on a separate validation set again to prevent over-fitting. Then, we compared the classification performance (AUROC) with and without positional encodings for all patient-centric graphs. Finally, we evaluated the attention weights (influence between different nodes) on each graph to increase the interpretability of the trained Graph Attention Networks. To this end, we returned all attention weights from each layer of the graphs and analyzed mean, standard deviation, and quantiles of nodes connecting to nodes with the same label (e.g., connection of a “control” node to a “control” node) and nodes connecting to nodes with a different label (e.g., connection of a “sepsis” node to a “control” node).
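The three patient-centric edge layouts can be generated from the time-ordered measurements of a single patient, as the following sketch shows. The function name `patient_centric_edges` and the `(source, target)` convention are assumptions for illustration. Node indices are taken to follow measurement time, so index 0 is the earliest measurement.

```python
def patient_centric_edges(num_measurements, mode):
    """Edges (source, target) among one patient's time-ordered measurements."""
    if mode not in ("directed", "reversed", "undirected"):
        raise ValueError(f"unknown mode: {mode}")
    edges = []
    for earlier in range(num_measurements):
        for later in range(earlier + 1, num_measurements):
            if mode in ("directed", "undirected"):
                # Previous measurement sends information to a following one.
                edges.append((earlier, later))
            if mode in ("reversed", "undirected"):
                # Following measurement sends information to a previous one.
                edges.append((later, earlier))
    return edges


directed = patient_centric_edges(3, "directed")
reversed_edges = patient_centric_edges(3, "reversed")
undirected = patient_centric_edges(3, "undirected")
```

In the directed layout a node only receives information from earlier measurements, which matches the setting where a prediction must be made without knowledge of future blood counts. No padding is needed because each patient simply contributes as many nodes as there are measurements.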
Furthermore, we evaluated the classification performance (AUROC) on other GNNs (GraphSAGE [7] and GCN [9]) to evaluate whether the improvement was solely based on the attention mechanism. We trained both GNNs similarly with a learning rate of 0.0003, a batch size of 50,000 and trained the GNN for 10,000 epochs with an early stopping after an increase in the validation loss for five consecutive epochs. Finally, we also compared the performance of our patient-centric GNNs against other deep learning architectures, Long Short-Term Memory (LSTM) [40], bidirectional LSTM (Bi-LSTM) [41], one-dimensional Convolutional Neural Network (1D-CNN) [42] and transformer decoder-only architecture (Transformer) [13]. Since the complete blood count data contains time-series information of different lengths, we padded the time series of each patient to the maximum length and masked paddings from the loss function. We trained each algorithm for 100 epochs with an early stopping after a loss increase over the last two epochs. Hyperparameter tuning (number of layers, hidden dimension, learning rate, weight decay, kernel size on the 1D-CNNs, number of heads on the Transformer) and early stopping was performed using a separate validation dataset.
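The padding and masking needed for the sequence models (LSTM, Bi-LSTM, 1D-CNN, Transformer) can be sketched as follows. The sketch assumes a binary cross-entropy loss and per-measurement labels. The helper names `pad_and_mask` and `masked_bce` and the toy values are hypothetical, and the study's actual PyTorch implementation may differ.

```python
import math


def pad_and_mask(sequences, pad_value=0):
    """Pad every sequence to the maximum length and mark the real entries with 1."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


def masked_bce(probabilities, labels, mask):
    """Binary cross-entropy averaged over non-padded positions only."""
    total, count = 0.0, 0
    for p_row, y_row, m_row in zip(probabilities, labels, mask):
        for p, y, m in zip(p_row, y_row, m_row):
            if m:
                total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
                count += 1
    return total / count


# Hypothetical labels for two patients with three and one measurements.
labels, mask = pad_and_mask([[0, 1, 1], [0]])
probabilities = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
loss = masked_bce(probabilities, labels, mask)
```

Only four of the six positions contribute to the loss here. The two padded positions of the shorter patient are ignored, which is the bookkeeping that the graph-based representation avoids.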

2.6 Ethics statement

The Ethics Committee at the Leipzig University Faculty of Medicine approved the initial study from Steinbach et al. [26] (reference number: 214/18-ek). The study was published in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) statement. This study is only re-evaluating the dataset from Steinbach et al. [26] by evaluating GNNs and incorporating time-series information.

3. Results

After data pre-processing (Fig 1A), we evaluated the performance of GNNs on similarity graphs compared to other machine learning algorithms (Fig 1B and C) on two datasets containing complete blood counts of sepsis and non-sepsis patients. Afterward, we evaluated the feature slope and importance of different algorithms to increase their transparency and interpretability (Fig 1D). Finally, we created patient-centric graphs and applied attention mechanisms on these graphs to achieve a superior performance and highlight the importance of an appropriate graph structure for the desired use case.

3.1 Performance of graph learning on similarity graphs for classifying complete blood counts

First, we wanted to evaluate the performance of different GNNs on medical data compared to other machine learning algorithms. Therefore, we applied several GNNs (GraphSAGE [7], GAT [8], GATv2 [25], GIN [10], HGT [12], and HAN [11]) and other machine learning algorithms on sepsis blood count data, evaluated their performance (i.e., AUROC, Matthews Correlation Coefficient, F1-Macro) and their required training time (Table 1). In the following, we will mainly focus on AUROC as the primary evaluation metric to have a threshold-independent evaluation metric that can also incorporate the high class imbalance. By assessing model performance across different thresholds, AUROC enables clinicians to fine-tune the sensitivity and specificity of sepsis detection according to their needs, minimizing both the risk of missing septic patients and the potential harm from overdiagnosis, such as unnecessary antibiotic treatments. Nearly all GNNs revealed a similar performance on the homogeneous similarity graph (AUROC: ≤ 0.8741) and heterogeneous similarity graph (AUROC: ≤ 0.8747) on both datasets. Furthermore, the neural network (AUROC: ≤ 0.8806) and the ensemble-based machine learning algorithms (Random Forest (AUROC: ≤ 0.8700), RUSBoost (AUROC: ≤ 0.8680) and XGBoost (AUROC: ≤ 0.8643)) had a similar performance. However, some GNNs (GAT, GATv2, HGT and HAN) on heterogeneous similarity graphs and shallow algorithms, like Logistic Regression (AUROC: ≤ 0.8369) and Decision Tree (AUROC: ≤ 0.8391), performed much worse on both datasets compared to other GNNs, the neural network and ensemble-based algorithms. Our results of the RUSBoost algorithm (AUROC: ≤ 0.8680) are consistent with the results of Steinbach et al. [26] (AUROC: ≤ 0.872).

Table 1. Comparison of GNNs against benchmarks on complete blood count data for sepsis classification (higher values represent better performance) and their required training time. We evaluated the classification performance on two datasets (internal and external dataset). Bold values represent the best values in each column. MCC (Matthews Correlation Coefficient), AUROC (Area Under Receiver Operating Curve).

| Group | Model | AUROC (internal) | AUROC (external) | F1-Macro (internal) | F1-Macro (external) | MCC (internal) | MCC (external) | Training time [s] |
|---|---|---|---|---|---|---|---|---|
| Tree-based benchmarks | Decision Tree [53] | 0.8391 | 0.7870 | 0.4313 | 0.4018 | 0.0432 | 0.0291 | 2.00 |
| Tree-based benchmarks | Random Forest [5] | 0.8700 | 0.8178 | 0.4770 | 0.4609 | 0.0605 | 0.0385 | 17.36 |
| Tree-based benchmarks | RUSBoost [36] | 0.8680 | 0.8153 | 0.4701 | 0.4497 | 0.0576 | 0.0361 | 212.88 |
| Tree-based benchmarks | XGBoost [3] | 0.8643 | 0.8121 | 0.4373 | 0.4184 | 0.0495 | 0.0327 | 0.54 |
| Non-tree-based benchmarks | Logistic regression [54] | 0.8369 | 0.7558 | 0.4412 | 0.3736 | 0.0442 | 0.0222 | 5.97 |
| Non-tree-based benchmarks | Neural Network [55] | 0.8806 | 0.8145 | 0.4479 | 0.4502 | 0.0521 | 0.0383 | 19.97 |
| Homogeneous graph learning | GraphSAGE [7] | 0.8741 | 0.8052 | 0.4411 | 0.3964 | 0.0499 | 0.0308 | 394.69 |
| Homogeneous graph learning | GAT [8] (single attention head) | 0.8726 | 0.8086 | 0.4440 | 0.4457 | 0.0501 | 0.0374 | 561.23 |
| Homogeneous graph learning | GAT [8] (two attention heads) | 0.8707 | 0.8114 | 0.4476 | 0.4502 | 0.0513 | 0.0393 | 1,008.311 |
| Homogeneous graph learning | GATv2 [25] (single attention head) | 0.8723 | 0.8057 | 0.4413 | 0.4442 | 0.04889 | 0.0371 | 553.55 |
| Homogeneous graph learning | GATv2 [25] (two attention heads) | 0.8746 | 0.8130 | 0.4469 | 0.4500 | 0.0511 | 0.0384 | 1,189.47 |
| Homogeneous graph learning | GIN [10] | 0.8649 | 0.8050 | 0.4499 | 0.4530 | 0.0502 | 0.0372 | 31.45 |
| Heterogeneous graph learning | GraphSAGE [7] | 0.8747 | 0.8176 | 0.4422 | 0.4020 | 0.0506 | 0.0326 | 981.10 |
| Heterogeneous graph learning | GAT [8] (single attention head) | 0.8426 | 0.8055 | 0.4396 | 0.4094 | 0.0464 | 0.0328 | 1,086.07 |
| Heterogeneous graph learning | GAT [8] (two attention heads) * | 0.8396 | 0.8069 | 0.4404 | 0.4164 | 0.0459 | 0.0334 | 679.95 |
| Heterogeneous graph learning | GATv2 [25] (single attention head) | 0.8402 | 0.8061 | 0.4384 | 0.4136 | 0.0455 | 0.0329 | 323.36 |
| Heterogeneous graph learning | GATv2 [25] (two attention heads) * | 0.8422 | 0.8067 | 0.4401 | 0.4144 | 0.0468 | 0.0326 | 463.59 |
| Heterogeneous graph learning | GIN [10] | 0.8696 | 0.8051 | 0.4215 | 0.3626 | 0.0456 | 0.0273 | 54.95 |
| Heterogeneous graph learning | HGT [12] * | 0.8317 | 0.7778 | 0.4509 | 0.4197 | 0.0479 | 0.0289 | 2,696.46 |
| Heterogeneous graph learning | HAN [11] * | 0.8401 | 0.8036 | 0.4397 | 0.4106 | 0.0457 | 0.0328 | 13,039.06 |
| | Minimum | 0.8317 | 0.7558 | 0.4215 | 0.3626 | 0.0432 | 0.0222 | 0.54 |
| | Mean | 0.8584 | 0.8040 | 0.4444 | 0.4221 | 0.0491 | 0.0335 | 1,166.10 |
| | Maximum | 0.8806 | 0.8178 | 0.4770 | 0.4609 | 0.0605 | 0.0393 | 13,039.06 |

* Hidden dimension reduced to 64 due to memory constraints.
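The threshold independence of AUROC follows from its rank-based interpretation: it equals the probability that a randomly chosen sepsis sample receives a higher score than a randomly chosen control sample. The sketch below computes it that way in plain Python. The function name `auroc` and the toy scores are assumptions, and the pairwise loop is only suitable for small inputs. The study itself presumably relied on a library implementation.

```python
def auroc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                # Ties count as half a win.
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical imbalanced example: four controls (0) and two sepsis cases (1).
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
value = auroc(labels, scores)
```

Because only the ordering of scores matters, the value does not change when a different decision threshold is chosen. It also does not depend on how many controls there are relative to sepsis cases, which is why it copes with the class imbalance in this dataset.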

XGBoost had the lowest required training time (0.54 s) followed by the Decision Tree (2.00 s) and Logistic Regression (5.97 s). The Random Forest (17.36 s) and one GNN (GIN: ≤ 54.95 s) are faster than the RUSBoost ensemble algorithm (212.88 s). However, nearly all homogeneous and heterogeneous GNNs had the highest required training times (up to 13,039.06 s) compared to all other considered algorithms.

Afterward, we compared the robustness of algorithms against 10 and 100 noisy features (Fig 2, S3 Table). The ensemble-based machine learning algorithms (Random Forest, RUSBoost, XGBoost) and Logistic Regression required more training time (2.74 s to 809.45 s) but had nearly the same performance (only up to 0.0184 worse AUROC) compared to the original datasets. However, the Decision Tree (up to 0.1487 worse AUROC), the neural network (up to 0.1154 worse AUROC) and GNNs (up to 0.0804 worse AUROC) lost performance which indicates a high sensitivity against noise.
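The noise experiment can be reproduced in spirit by appending uniformly distributed random columns to every row of the feature matrix. The sketch below is a minimal illustration. `add_noise_features`, the fixed seed and the toy rows are assumptions rather than details taken from the study.

```python
import random


def add_noise_features(rows, num_noise, seed=0):
    """Append `num_noise` random features in [0, 1) to every row."""
    rng = random.Random(seed)
    return [
        list(row) + [rng.random() for _ in range(num_noise)]
        for row in rows
    ]


# Hypothetical rows with two original features each.
rows = [[70.0, 15.0], [30.0, 6.0]]
noisy_10 = add_noise_features(rows, num_noise=10)
noisy_100 = add_noise_features(rows, num_noise=100)
```

Models are then retrained on the widened matrices. A robust algorithm should show nearly unchanged AUROC, because the added columns carry no information about the label.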

Fig 2. Evaluating the robustness of GNNs compared to benchmarks (A) and their required training time (B).

We evaluated the classification performance (A) of GNNs (GraphSAGE) on homogeneous and heterogeneous similarity graphs and the performance of other machine learning algorithms (neural network, Decision Tree, Logistic Regression, Random Forest, RUSBoost, XGBoost) by adding 10 or 100 noisy features to the complete blood count datasets (higher values represent better performance). Each model was trained on the noisy datasets and afterwards its classification performance was evaluated. Furthermore, we measured the required training time (B). We evaluated the performance on two datasets (internal and external dataset). Information about other evaluation metrics (F1-Score and MCC) is listed in S3 Table.

3.2 Partial dependence of graph and machine learning algorithms

After evaluating the performance, we evaluated partial dependence of each feature under the assumption of independent features. We plotted the average prediction (i.e., ratio of sepsis predictions) against different feature values (lowest to highest value) (Fig 3A–F) for the features age (Fig 3A), hemoglobin (Fig 3C), red blood cells (Fig 3E), white blood cells (Fig 3D), mean corpuscular volume (Fig 3F) and platelets (Fig 3G). If the resulting curve has a positive gradient, an increasing feature value (e.g., older people for the feature age) results in an increased probability of developing sepsis according to the model, and vice versa for a negative gradient. Overall, there was a positive gradient for the features age, white blood cells (Fig 3D), red blood cells (Fig 3E), and mean corpuscular volume (Fig 3F), showing increased sepsis risk with rising values of these parameters. Specifically, this means that older people with higher white blood cell counts, red blood cell counts and increased corpuscular volume have higher sepsis probabilities according to the trained models. White blood cells had a minimum at around 4–8 Gpt/l (gigaparticles per liter) for most algorithms (Fig 3D), which indicates a physiological white blood cell count in this range. In contrast to the positive gradient, we observed a negative gradient for platelets for all algorithms (Fig 3G), indicating decreased sepsis risk for rising values according to the models. Specifically, this means that patients with lower platelets (thrombocytopenia) have a higher sepsis probability according to the trained models. Tree-based algorithms (Decision Tree, RUSBoost, Random Forest, XGBoost) do not depend (gradient near zero) on the features hemoglobin (Fig 3C) and red blood cells (Fig 3E) in contrast to non-tree-based algorithms (Logistic Regression, Neural Network, Homogeneous GNN, Heterogeneous GNN). The latter showed an increased sepsis probability for low hemoglobin levels (anemia) (Fig 3C).
Finally, the feature “sex” is nearly irrelevant for all algorithms (i.e., low gradient), i.e., the sepsis probability does not significantly depend on sex (Fig 3B).
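The partial dependence computation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `partial_dependence`, the toy classifier and the toy rows are hypothetical and only show how one feature is forced to each grid value while all other features keep their observed values.

```python
from statistics import mean


def partial_dependence(predict, rows, feature_index, grid):
    """Average predicted label when one feature is forced to each grid value.

    predict: callable mapping one feature row (list of floats) to a 0/1 label.
    rows: list of feature rows; all other features keep their observed values,
          which is where the independence assumption enters.
    """
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)             # copy so the dataset stays untouched
            modified[feature_index] = value  # overwrite only the studied feature
            predictions.append(predict(modified))
        curve.append(mean(predictions))      # ratio of samples classified as sepsis
    return curve


def toy_model(row):
    """Hypothetical classifier on (age, white blood cells); illustration only."""
    age, wbc = row
    return 1 if age >= 60 and (wbc < 4.0 or wbc > 8.0) else 0


toy_rows = [[30.0, 6.0], [65.0, 5.0], [70.0, 12.0], [80.0, 3.0]]
wbc_grid = [2.0, 6.0, 10.0, 14.0]
wbc_curve = partial_dependence(toy_model, toy_rows, feature_index=1, grid=wbc_grid)
```

With this toy model, the curve is lowest for the grid value inside the 4–8 range, which mirrors the kind of minimum discussed for white blood cells.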

Fig 3. Partial dependence plots (A – G) and the resulting clustered feature importance (H) for each feature and trained model.

We plotted the partial dependence plots for the features age in years (A), sex as a binary feature encoded with one for women and zero for men (B), hemoglobin in millimole per liter (C), white blood cells in gigaparticles per liter (D), red blood cells in teraparticles per liter (E), mean corpuscular volume in femtoliter (F), and platelets in gigaparticles per liter (G). For each feature, we plotted the average predictions (average ratio of sepsis classification) made by the trained models across different feature values (i.e., grid values). Tree-based algorithms (i.e., Decision Tree, Random Forest, XGBoost, and RUSBoost) are visualized as dashed lines and non-tree-based algorithms (i.e., Logistic Regression, the neural network, the homogeneous GNN, and heterogeneous GNN) as solid lines. If the curve in the partial dependence plot has a positive gradient, an increasing feature value (e.g., older people for age) results in an increased probability of developing sepsis according to the model, and vice versa (negative gradient). Note that at 0.5 (black dotted line) there is an equal number of sepsis and control cases according to the model. Therefore, we cannot make a statement for values around 0.5. Furthermore, we highlighted the reference range of specific blood parameters with a green rectangle [56,57] (see S4 Table). In (H), we hierarchically clustered (Euclidean distance with average linkage) the feature importance resulting from the normalized variance in the partial dependence plots for each trained model. Tree-based algorithms (i.e., Decision Tree, Random Forest, XGBoost, and RUSBoost) are grouped together, indicating similar underlying mechanisms for the classification. However, their mechanisms differ from those of the non-tree-based algorithms (i.e., Logistic Regression, the neural network, the homogeneous GNN, and heterogeneous GNN), which are also grouped together.

The partial dependence plots (Fig 3A-G) contain the ratios of diseased cases for different feature values. We estimated the importance of each feature by calculating the variance of all plotted ratios for that feature. This feature importance is normalized by the sum of all features’ importance in the model to obtain values between zero and one that sum to one across all features (Fig 3H). For nearly all algorithms (except Logistic Regression and the homogeneous GNN), white blood cell count was the most important feature. Furthermore, tree-based algorithms mainly rely on white blood cell count for their prediction. In contrast, the feature “sex” has the lowest feature importance across all models.
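As a small illustration of this importance measure, the sketch below takes the variance of each partial dependence curve and normalizes it over all features. The curves are hypothetical toy values, and the use of the population variance (`statistics.pvariance`) is an assumption, since the text does not state which variance estimator was used.

```python
from statistics import pvariance


def normalized_importance(curves):
    """Turn partial dependence curves into importances that sum to one.

    curves: dict mapping a feature name to its list of plotted ratios.
    """
    raw = {name: pvariance(values) for name, values in curves.items()}
    total = sum(raw.values())
    if total == 0:
        # All curves are flat, so no feature influences the prediction.
        return {name: 0.0 for name in raw}
    return {name: value / total for name, value in raw.items()}


# Hypothetical curves for illustration only (not values from the study).
toy_curves = {
    "white_blood_cells": [0.2, 0.4, 0.8],
    "platelets": [0.6, 0.5, 0.4],
    "sex": [0.5, 0.5, 0.5],
}
importance = normalized_importance(toy_curves)
```

A flat curve, such as the toy curve for sex, has zero variance and therefore receives an importance of zero.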

3.3 Graph neural networks on patient-centric graphs

Since the similarity graphs do not consider multiple measurements of the same patient (Fig 4A-C), we created patient-centric graphs and applied graph attention networks on them (Table 2). The graph attention networks sample, weight, and aggregate information from other measurements of the same patient, either from previous (directed patient-graphs), subsequent (reversed directed patient-graphs) or all measurements (undirected patient-graphs). The reversed directed patient-graphs (Fig 4E, Table 2) with and without positional encodings achieved the highest performance on both datasets, with an AUROC of up to 0.9565. The undirected graph (Fig 4F, Table 2) also outperformed the similarity graphs with an AUROC of up to 0.9094. The directed graph (Fig 4D, Table 2) only achieved an AUROC of up to 0.8902. The performance without positional encodings (Table 2) was lower, with an AUROC of 0.9502, 0.8996 and 0.8734 on the internal dataset for the reversed directed graph, the undirected graph and the directed graph, respectively.
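The three patient-centric graph variants can be sketched as edge lists. This is a hedged illustration rather than the authors' construction: the function `patient_edges` is hypothetical, edges are written as (source, target) pairs in which the target aggregates from the source, and connecting every pair of measurements of a patient (instead of only consecutive ones) is an assumption.

```python
from collections import defaultdict


def patient_edges(patient_ids, times, mode):
    """Return (source, target) pairs; the target node aggregates from the source.

    mode: "directed" (aggregate from earlier measurements), "reversed"
    (aggregate from later measurements) or "undirected" (both directions).
    """
    if mode not in ("directed", "reversed", "undirected"):
        raise ValueError(f"unknown mode: {mode}")

    # Group node indices by patient together with their measurement time.
    by_patient = defaultdict(list)
    for node, (pid, time) in enumerate(zip(patient_ids, times)):
        by_patient[pid].append((time, node))

    edges = []
    for samples in by_patient.values():
        ordered = [node for _, node in sorted(samples)]  # chronological order
        for i, earlier in enumerate(ordered):
            for later in ordered[i + 1:]:
                if mode in ("directed", "undirected"):
                    edges.append((earlier, later))  # later node sees the earlier one
                if mode in ("reversed", "undirected"):
                    edges.append((later, earlier))  # earlier node sees the later one
    return edges


patient_ids = ["a", "a", "b", "a"]  # hypothetical patients
times = [1, 2, 1, 3]                # hypothetical measurement times
directed = patient_edges(patient_ids, times, "directed")
reversed_directed = patient_edges(patient_ids, times, "reversed")
undirected = patient_edges(patient_ids, times, "undirected")
```

In this toy example, the single measurement of patient "b" has no edges in any variant, so it is classified from its own features only.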

Table 2. Comparing AUROC, exploited biases, use cases and issues of Graph Attention Networks on similarity and patient-centric graphs (directed, reversed directed and undirected) for the classification of sepsis on complete blood count data (higher values represent better performance). We evaluated the classification performance on two datasets (internal and external dataset). Bold values represent the best values in each column.

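| Graph type | Graph | AUROC (internal) | AUROC (external) | Structure bias | Feature bias | Use case | Issue |
|---|---|---|---|---|---|---|---|
| Measurement-centric graphs | Homogeneous similarity graph (Fig 4B) | 0.8741 | 0.8052 | | | Time-specific diagnostic (e.g., to predict sepsis) | Does the patient have sepsis according to the available information? |
| Measurement-centric graphs | Heterogeneous similarity graph (Fig 4C) | 0.8747 | 0.8176 | | | Time-specific diagnostic (e.g., to predict sepsis) | Does the patient have sepsis according to the available information? |
| Patient-centric graphs | Directed graph (Fig 4D) | 0.8734 | 0.8114 | | | Time-specific diagnostic (e.g., to predict sepsis) | Does the patient have sepsis according to the available information? |
| Patient-centric graphs | Directed graph with positional encodings (Fig 4D) | 0.8902 | 0.8203 | | + | Time-specific diagnostic (e.g., to predict sepsis) | Does the patient have sepsis according to the available information? |
| Patient-centric graphs | Undirected graph (Fig 4F) | 0.8996 | 0.8628 | | | Retrospective diagnostics (e.g., for evaluating treatment strategies or disease causes) | When did the patient become diseased or recover? |
| Patient-centric graphs | Undirected graph with positional encodings (Fig 4F) | 0.9094 | 0.8652 | | + | Retrospective diagnostics (e.g., for evaluating treatment strategies or disease causes) | When did the patient become diseased or recover? |
| Patient-centric graphs | Reversed directed graph (Fig 4E) | 0.9502 | 0.9481 | + | | Retrospective diagnostics (e.g., for evaluating treatment strategies or disease causes) | When did the patient become diseased or recover? |
| Patient-centric graphs | Reversed directed graph with positional encodings (Fig 4E) | **0.9565** | **0.9498** | + | + | Retrospective diagnostics (e.g., for evaluating treatment strategies or disease causes) | When did the patient become diseased or recover? |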

To evaluate the contribution of the attention mechanism, we evaluated the classification performance of two other GNNs, GraphSAGE and GCN (Table 3). GraphSAGE achieved a slightly higher classification performance than the GAT on most graph structures, but the results remain comparable on both datasets. GCN performed slightly worse on the directed and undirected graph structures but still achieved a similar performance on the reversed directed graph on both datasets.

Table 3. Comparing AUROC across Graph Neural Networks on patient-centric graphs (directed, reversed directed and undirected) for the classification of sepsis on complete blood count data (higher values represent better performance). We evaluated the classification performance on two datasets (internal and external dataset). Bold values represent the best values in each column.

| Patient-centric graph | GAT (internal) | GAT (external) | GraphSAGE (internal) | GraphSAGE (external) | GCN (internal) | GCN (external) |
|---|---|---|---|---|---|---|
| Directed graph (Fig 4D) | 0.8734 | 0.8114 | 0.8890 | 0.8323 | 0.8582 | 0.7871 |
| Directed graph with positional encodings (Fig 4D) | 0.8902 | 0.8203 | 0.9001 | 0.8338 | 0.8613 | 0.7900 |
| Undirected graph (Fig 4F) | 0.8996 | 0.8628 | 0.9174 | 0.8831 | 0.8742 | 0.8083 |
| Undirected graph with positional encodings (Fig 4F) | 0.9094 | 0.8652 | 0.9336 | 0.9124 | 0.8955 | 0.8449 |
| Reversed directed graph (Fig 4E) | 0.9502 | 0.9481 | 0.9542 | 0.9499 | **0.9575** | **0.9520** |
| Reversed directed graph with positional encodings (Fig 4E) | **0.9565** | **0.9498** | **0.9545** | **0.9514** | 0.9517 | 0.9502 |

Other deep learning techniques (LSTM, Bi-LSTM, 1D-CNN and Transformer) differed considerably in their classification performance, with an AUROC between 0.8802 and 0.9535 on the internal dataset (Table 4). The performance of the LSTM (AUROC on the internal dataset: 0.8802) was comparable to or slightly worse than that of the GNNs on directed patient-centric graphs (AUROC up to 0.9001). However, the Bi-LSTM (AUROC on the internal dataset: 0.9535) achieved results comparable to the GNNs on reversed directed patient-centric graphs (AUROC up to 0.9575). The 1D-CNN (AUROC on the internal dataset: 0.9337) and the Transformer (AUROC on the internal dataset: 0.9257) had a slightly lower AUROC than the Bi-LSTM.

Table 4. Comparing AUROC across several deep learning techniques for the classification of sepsis on complete blood count data (higher values represent better performance). We evaluated the classification performance on two datasets (internal and external dataset). Bold values represent the best values in each column.

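| Model | AUROC (internal) | AUROC (external) | Use case |
|---|---|---|---|
| LSTM | 0.8802 | 0.8206 | Time-specific diagnostic (e.g., to predict sepsis) |
| Bi-LSTM | **0.9535** | **0.9557** | Retrospective diagnostics (e.g., for evaluating treatment strategies or disease causes) |
| 1D-CNN | 0.9337 | 0.9408 | Retrospective diagnostics (e.g., for evaluating treatment strategies or disease causes) |
| Transformer | 0.9257 | 0.9208 | Retrospective diagnostics (e.g., for evaluating treatment strategies or disease causes) |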
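All tables in this section report the AUROC. As a reminder of what this number measures, the following sketch computes it as the probability that a randomly chosen sepsis sample receives a higher score than a randomly chosen control sample (ties count as one half). The function `auroc` and the toy labels and scores are hypothetical; the quadratic pairwise loop is chosen for readability and is not how the study computed its results.

```python
def auroc(labels, scores):
    """Pairwise definition of the area under the ROC curve.

    labels: iterable of 0 (control) and 1 (sepsis).
    scores: predicted scores, higher means more likely sepsis.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one sample of each class")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # correctly ranked pair
            elif pos == neg:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: three of the four positive/negative pairs are ranked correctly.
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because only the ranking of scores matters, the value does not depend on a classification threshold, which is why it suits imbalanced data such as sepsis versus control cases.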

4. Discussion and future directions

In this study, we evaluated the performance of GNNs on medical data using sepsis prediction from blood count data as a use case. When GNNs are applied on similarity graphs (Table 1), they achieve a performance similar to ensemble-based machine learning algorithms (XGBoost, RUSBoost, Random Forest) and neural networks. GNNs are based on the message-passing framework, i.e., information from connected nodes is iteratively transformed and aggregated to update node embeddings. The architectures differ only in the transformation and aggregation process [14]. GraphSAGE is based only on a linear transformation, followed by a mean aggregation of neighboring nodes [7]. GAT [8] and GATv2 [25] are also based on a linear transformation but integrate an attention mechanism to aggregate the information from neighboring nodes by a weighted mean (weighted by the neighbor’s importance). However, GATv2 has a more expressive attention mechanism because it applies the attention score after the non-linearity (leaky rectified linear unit activation) instead of before [25]. The Graph Isomorphism Network aims to increase the expressive power of GNNs by first summing information from neighboring nodes and then passing this aggregated information to a multi-layer perceptron to update the node embeddings [10]. The Heterogeneous Graph Attention Network uses a node-level attention mechanism specific to each edge type and then aggregates information from different edge types by a semantic-level attention [11]. The Heterogeneous Graph Transformer performs an attention mechanism based on query, key, and value matrices, followed by a linear transformation, a Gaussian Error Linear Unit activation function and another linear transformation [12]. Nevertheless, all GNNs aggregate information from connected nodes to improve the performance while sharing the weights across nodes in the same layer [14].
However, when GNNs can only aggregate information from similar complete blood count data (measurement-centric graphs), their information gain is too small, resulting in a performance similar to the neural network and ensemble-based machine learning algorithms. Furthermore, the performance of some GNNs (e.g., Heterogeneous Graph Attention Network and Heterogeneous Graph Transformer) can drop below that of these algorithms due to the high number of introduced parameters, which leads to slight overfitting [11,12]. Nevertheless, GNNs on similarity graphs, the neural network and ensemble-based algorithms could outperform shallow algorithms (Decision Tree and Logistic Regression) due to more expressive representations of the underlying information (Table 1). However, this performance increase is associated with an increased computational complexity, which requires more training time compared to shallow algorithms. The increased computational complexity can be compensated by exploiting modern hardware (e.g., usage of multiple threads or GPUs). For example, training XGBoost on the GPU (NVIDIA A6000) requires even less time than training shallow algorithms. GNNs were also trained on a GPU but still required much more training time due to the high computational complexity of the underlying sampling, transformation, and aggregation steps (see Introduction). In general, the computational complexity of a single convolution layer of a GNN is O(VFF’ + EF’), where V represents the number of nodes in the graph, E the number of edges in the graph, F the number of input features and F’ the hidden dimension or number of output features of the convolution [8,43]. However, the computational complexity differs across architectures, depending on factors like the number of attention heads (e.g., in GAT and GATv2) and the multi-layer perceptrons used (e.g., in GIN).
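To make the two terms of O(VFF’ + EF’) concrete, the following sketch implements one strongly simplified message-passing layer in plain Python and counts the operations of each step. It is not one of the evaluated architectures: the function `mean_aggregation_layer`, the transform-then-mean order, the fallback for nodes without incoming edges and the toy values are all assumptions made for illustration.

```python
def mean_aggregation_layer(features, weight, edges):
    """One simplified layer: linear transformation, then mean over neighbors.

    features: V x F list of lists, weight: F x F_out list of lists,
    edges: (source, target) pairs, the target aggregates from the source.
    Returns (updated features, multiplications, additions over edges).
    """
    num_nodes = len(features)
    f_in, f_out = len(weight), len(weight[0])

    # Transformation step: V * F * F_out multiplications.
    multiplications = 0
    transformed = []
    for row in features:
        out = [0.0] * f_out
        for i in range(f_in):
            for j in range(f_out):
                out[j] += row[i] * weight[i][j]
                multiplications += 1
        transformed.append(out)

    # Aggregation step: E * F_out additions.
    sums = [[0.0] * f_out for _ in range(num_nodes)]
    degree = [0] * num_nodes
    edge_additions = 0
    for source, target in edges:
        for j in range(f_out):
            sums[target][j] += transformed[source][j]
            edge_additions += 1
        degree[target] += 1

    # Nodes without incoming edges keep their own transformed features.
    updated = [
        [s / degree[v] for s in sums[v]] if degree[v] else transformed[v]
        for v in range(num_nodes)
    ]
    return updated, multiplications, edge_additions


toy_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # V = 3, F = 2 (hypothetical)
toy_weight = [[1.0, 0.0], [0.0, 2.0]]                # F' = 2 (hypothetical)
toy_edges = [(0, 2), (1, 2)]                         # E = 2 (hypothetical)
updated, mults, adds = mean_aggregation_layer(toy_features, toy_weight, toy_edges)
```

For the toy graph, the counters equal V·F·F’ = 12 and E·F’ = 4, matching the two terms of the complexity expression.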

Besides computational time, we evaluated the robustness against noise. Tree-based ensemble algorithms (XGBoost, RUSBoost, Random Forest) are more robust against noise than the Decision Tree, GNNs and the neural network (Fig 2). The increased robustness might be the result of the aggregation of multiple tree-based algorithms (ensembles). It is noteworthy that the neural network and GNN required less training time with more noisy features, which is due to the faster convergence to new (but worse) local minima during training.

Afterward, we evaluated the partial dependence and importance of different features for the final classification of each model (Fig 3A-G). Tree-based algorithms (Decision Tree, Random Forest, RUSBoost, XGBoost) showed similar partial dependence plots, which results from a similar prediction mechanism (usage of a single or multiple Decision Tree(s)). However, these mechanisms differ from those of non-tree-based algorithms (Logistic Regression, neural networks, GNNs), which are based on linear transformations with (neural network and GNNs) or without (Logistic Regression) some kind of non-linearity (e.g., sigmoid or rectified linear unit). Additionally, tree-based algorithms create harder decision boundaries than non-tree-based algorithms. Our approach for increasing the interpretability of machine learning models assumes that all features are independent of each other. However, in reality, features depend on each other (e.g., red blood cells and hemoglobin). This simplification might skew the synthetic feature inputs for specific combinations. Future approaches could integrate existing feature dependencies to prevent distortions in the synthetic dataset.

Finally, we tested the performance of Graph Attention Networks on patient-centric graphs (i.e., graphs which integrate measurements of the same patient) (Table 2). The exploitation of time series information through the patient-centric graphs improved the classification performance over all previous models and achieved an AUROC of up to 0.9565 with Graph Attention Networks. One reason for this improvement is that the GNN on a patient-centric graph inherently reduces patient-specific fluctuations in the dataset. However, the performance improvement is also associated with the exploitation of a real-world bias in the underlying dataset. About 2/3 of the sepsis cases are not part of a sequence of examinations (i.e., they represent only a single measurement for a patient). The other 1/3 of the sepsis cases are part of examination sequences, and sepsis is diagnosed in the last positions in most cases (92.14%). This highlights the benefit of regular monitoring of patient data as baseline information for machine learning algorithms.
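How such a positional bias can be quantified is sketched below. The function name, the toy label sequences and the interpretation of "last positions" as only the final measurement of a sequence are assumptions; the sketch does not reproduce the 92.14% reported for the study data.

```python
def share_of_positives_at_last_position(sequences):
    """Fraction of sepsis labels that sit at the end of a multi-sample sequence.

    sequences: per-patient label lists in chronological order (1 = sepsis).
    Patients with a single measurement are skipped.
    """
    positives = 0
    positives_at_end = 0
    for labels in sequences:
        if len(labels) < 2:
            continue  # not a sequence of examinations
        for position, label in enumerate(labels):
            if label == 1:
                positives += 1
                if position == len(labels) - 1:
                    positives_at_end += 1
    return positives_at_end / positives if positives else 0.0


# Hypothetical toy sequences, not data from the study.
toy_sequences = [[0, 0, 1], [1], [0, 1, 0], [0, 1]]
toy_share = share_of_positives_at_last_position(toy_sequences)
```

In the toy data, two of the three sepsis labels inside sequences occur at the final position.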

The fact that most sepsis cases occur only at the last positions is exploited by biased feature attributes (feature-induced bias) and/or a biased underlying graph structure (structure-induced bias). When incorporating positional encodings, we represent later positions (i.e., measurements) with higher feature attributes and earlier ones with lower feature attributes (feature-induced bias). With a specific graph structure (reversed directed, Fig 4E), the underrepresented sepsis cases do not integrate feature information from control cases (structure-induced bias). However, the control cases can share information with each other, which reduces potential fluctuations. Although the control cases can also integrate information from sepsis cases, the attention mechanism reduces their influence (Fig 4H, S5 Table). In the directed and undirected graphs, control cases still share information with each other to reduce potential fluctuations. However, sepsis cases also integrate information from control cases, which reduces the differences between the two groups. The integration of information from control cases is partially compensated by the attention mechanism, which lowers the influence of control cases on sepsis cases (Fig 4G and I, S5 Table), but it cannot be fully compensated due to the high number of control cases in contrast to sepsis cases. Thereby, the reversed directed patient-graphs achieve a much higher classification performance (AUROC of up to 0.9565) compared to the undirected (AUROC of up to 0.9094) and directed graphs (AUROC of up to 0.8902).
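The feature-induced bias can be illustrated with a minimal positional feature. The text does not specify how the positional encodings were computed, so the linear scaling to the range [0, 1] and both function names below are assumptions for illustration only.

```python
def relative_positions(sequence_length):
    """Map positions 0..n-1 to [0, 1]; later measurements get larger values."""
    if sequence_length == 1:
        return [0.0]
    return [i / (sequence_length - 1) for i in range(sequence_length)]


def append_position_feature(rows):
    """Append the relative position to each feature row of one patient.

    rows: feature rows of a single patient in chronological order.
    """
    positions = relative_positions(len(rows))
    return [list(row) + [position] for row, position in zip(rows, positions)]


# Hypothetical measurements (e.g., hemoglobin and white blood cells) of one patient.
toy_rows = [[8.1, 6.0], [7.9, 9.5], [7.2, 14.0]]
encoded_rows = append_position_feature(toy_rows)
```

Because sepsis labels cluster at the end of sequences, a model can learn to associate large values of such a feature with sepsis, which is the bias discussed above.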

Furthermore, we evaluated the performance on other GNNs (GraphSAGE [7] and GCN [9]) to evaluate whether the improvement was the consequence of the underlying attention mechanism (Table 3). The similar classification performance of GraphSAGE indicates that the attention mechanism is not required to achieve the improvement. A slightly higher AUROC could be the result of a less complex GNN structure (only a linear transformation of aggregated nodes without attention mechanism). Furthermore, GraphSAGE was developed to increase the inductive capabilities of GNNs [7]. GCN achieved a slightly lower AUROC on the directed and undirected graph structures which might be the result of the symmetrical normalization and the linear transformation of data before the aggregation [9]. An aggregation before linear transformation (e.g., in GraphSAGE) might smooth the node features facilitating the classification with the following linear transformation.

Compared to other deep learning techniques, GNNs achieved similar or even higher classification performance (AUROC) (Table 4). The LSTM [40] achieved a performance similar to GNNs on directed patient-centric graphs. Like the GNNs on directed patient-centric graphs, the LSTM can only process information from past observations (i.e., complete blood count data), which limits the information available for complete blood counts at the beginning of a time series. However, the Bi-LSTM [41], 1D-CNN [44], and Transformer [13] can aggregate information independently of the order (i.e., from future and past observations), increasing the available information for all complete blood count samples. Thereby, the AUROC increased up to 0.9535 on the internal dataset and was similar to that of the best-performing GNN (AUROC up to 0.9575). The 1D-CNN and Transformer might have performed slightly worse because they incorporate information from both directions into one hidden representation, while the Bi-LSTM creates two separate hidden representations (one for the forward direction and one for the backward direction), which are combined at the end. Furthermore, it is noteworthy that GNNs can handle time series of different lengths natively through the defined edge index, while the other deep learning architectures required an additional padding and masking step.
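The padding and masking step mentioned for the sequence models can be sketched as follows. The function `pad_and_mask`, the padding value and the toy sequences are hypothetical and are not taken from the study's code.

```python
def pad_and_mask(sequences, pad_value=0.0):
    """Pad variable-length sequences to one length and mark the real entries.

    Returns (padded, mask); mask is True for real values and False for padding.
    """
    if not sequences:
        return [], []
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    mask = [[True] * len(seq) + [False] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Hypothetical per-patient series of one blood parameter.
toy_series = [[6.0, 9.5, 14.0], [5.2]]
padded_series, series_mask = pad_and_mask(toy_series)
```

A sequence model has to ignore the masked positions, whereas a graph representation simply contains no edges for measurements that do not exist.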

We can use undirected and reversed directed patient-graphs for retrospective analysis (e.g., to determine when a patient became diseased or recovered). This application might help to evaluate the success of a treatment (e.g., with specific antibiotics) or to evaluate potential causes of a disease (e.g., infection after a specific event). However, we cannot use the undirected and reversed directed patient-graphs when we want to diagnose sepsis at the current time point since they incorporate information from subsequent measurements (i.e., information not available at the current time point). The same holds for the Bi-LSTM, 1D-CNN and Transformer. Therefore, for this use case we can only use the directed patient-centric graphs with and without positional encodings or the LSTM, which achieved a lower classification performance compared to the GAT on the undirected and reversed directed graphs. However, the performance of the directed patient-centric graph with positional encodings (AUROC of up to 0.8902) is still better than that of the standard ML algorithms (AUROC of up to 0.8806), which did not use time-series information.

To sum up, we compared the classification performance of different graph learning and other machine learning algorithms on sepsis blood count data and revealed different classification mechanisms in the trained models. Furthermore, we evaluated the performance of Graph Attention Networks on several patient-centric graphs and reached an outstanding AUROC of up to 0.9565 for retrospective use cases.

We would suggest the following directions for future research:

  • I. Integrating additional features;

  • II. Integrating more samples;

  • III. Diagnoses of further diseases.

The integration of more features (I.) could include information from other laboratory measurements (e.g., specific biomarkers), vital signs of patients (e.g., body temperature and pulse rate), predisposing factors (e.g., genetic polymorphism [45] or chronic medical conditions like diabetes [46]), and previously administered drugs. These features might help to provide a more holistic view of a patient’s health status. Furthermore, sparse information like the existence of predisposing factors or previously administered drugs could be represented as a graph structure. However, data with more features must be collected for all patients, which could increase measurement times and costs. Furthermore, specific information like administered drugs could contain information clinicians might have only in retrospect. The integration of more samples (II.) in the dataset is time-consuming but could reduce the impact of outliers in the dataset. One promising direction might be the integration of samples from electronic health records like MIMIC-IV [47], the Amsterdam University Medical Center Database [48], the high time resolution ICU dataset (HiRID) [49] and the eICU Collaborative Research Database [50]. Additionally, complete blood count data could enable diagnosing further diseases (III.) like thrombosis [51] or leukemia [52]. For the classification of further diseases, labels for the respective diseases are required. However, this labeling process might be time-consuming and requires domain experts like clinicians.

Supporting information

S1 Note. Potential relevance of complete blood count data to sepsis. (DOCX)

S2 Fig. Distribution of each continuous feature in the train, internal and external test datasets as violin plots. (DOCX)

S3 Table. Evaluating the robustness of GNNs compared to benchmarks by adding 10 or 100 noisy features to the complete blood count datasets for sepsis classification (higher values represent better performance) and the required training time. (DOCX)

S4 Table. Reference values for blood parameters in complete blood count analysis. (DOCX)

S5 Table. Attention weights of the last layer of the trained graph attention networks on each graph (directed graph, reversed directed graph and undirected graph). (DOCX)

Data Availability

All Jupyter Notebooks, Python files and datasets of the methodology developed and used in this study are available at https://github.com/danielwalke/SBCDataAnalysis. The dataset containing the complete blood counts is available on Zenodo at https://zenodo.org/records/6922968 (DOI: 10.5281/zenodo.10122491).

Funding Statement

Grant-Numbers: HE 8077/2-1, SA 465/53-1 Funder: German Research Foundation/ Deutsche Forschungsgemeinschaft (DFG) (https://www.dfg.de/de). Funded Authors: D. W. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Budholiya K, Shrivastava SK, Sharma V. An optimized XGBoost based diagnostic system for effective prediction of heart disease. J King Saud Univ Comput Inf Sci. 2022;34(7):4514–23. doi: 10.1016/j.jksuci.2020.10.013
  • 2. Ogunleye A, Wang Q-G. XGBoost model for chronic kidney disease diagnosis. IEEE/ACM Trans Comput Biol Bioinform. 2020;17(6):2131–40. doi: 10.1109/TCBB.2019.2911071
  • 3. Chen T, Guestrin C. XGBoost. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM; 2016. p. 785–94. doi: 10.1145/2939672.2939785
  • 4. Walke D, Micheel D, Schallert K, Muth T, Broneske D, Saake G, et al. The importance of graph databases and graph learning for clinical applications. Database (Oxford). 2023;2023. doi: 10.1093/database/baad045
  • 5. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32. doi: 10.1023/a:1010933404324
  • 6. Wilson RJ. Introduction to graph theory. 3rd ed. London: Longman; 1985.
  • 7. Hamilton WL, Ying R, Leskovec J. Inductive Representation Learning on Large Graphs. 07.06.2017. In: NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017. doi: 10.5555/3294771.3294869
  • 8. Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph Attention Networks. 30.10.2017. In: International Conference on Learning Representations (ICLR). 2018.
  • 9. Kipf TN, Welling M. Semi-Supervised Classification with Graph Convolutional Networks. 09.09.2016. In: International Conference on Learning Representations (ICLR). 2017.
  • 10. Xu K, Hu W, Leskovec J, Jegelka S. How Powerful are Graph Neural Networks? 01.10.2018. In: International Conference on Learning Representations (ICLR). 2019.
  • 11. Wang X, Ji H, Shi C, Wang B, Ye Y, Cui P, et al. Heterogeneous Graph Attention Network. In: Liu L, White R, editors. The World Wide Web Conference. New York, NY, USA: ACM; 2019. p. 2022–32.
  • 12. Hu Z, Dong Y, Wang K, Sun Y. Heterogeneous Graph Transformer. In: Proceedings of The Web Conference 2020. New York, NY, USA: ACM; 2020. p. 2704–10. doi: 10.1145/3366423.3380027
  • 13. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. 12.06.2017. In: Advances in Neural Information Processing Systems 30 (NIPS’17). 2017.
  • 14. Veličković P. Message passing all the way up. 22.02.2022. In: International Conference on Learning Representations (ICLR). 2022.
  • 15. Perozzi B, Al-Rfou R, Skiena S. DeepWalk. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM; 2014. p. 701–10. doi: 10.1145/2623330.2623732
  • 16. Grover A, Leskovec J. node2vec: Scalable Feature Learning for Networks. 03.07.2016. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM; 2016. p. 855–64. doi: 10.1145/2939672.2939754
  • 17. Chen X, Zhang R, Tang X-Y. Towards real-time diagnosis for pediatric sepsis using graph neural network and ensemble methods. Eur Rev Med Pharmacol Sci. 2021;25(14):4693–701. doi: 10.26355/eurrev_202107_26380
  • 18. Fan W, Ma Y, Li Q, He Y, Zhao E, Tang J, et al. Graph Neural Networks for Social Recommendation. In: The World Wide Web Conference. New York, NY, USA: ACM; 2019. p. 417–26. doi: 10.1145/3308558.3313488
  • 19. Wieder O, Kohlbacher S, Kuenemann M, Garon A, Ducrot P, Seidel T, et al. A compact review of molecular property prediction with graph neural networks. Drug Discov Today Technol. 2020;37:1–12. doi: 10.1016/j.ddtec.2020.11.009
  • 20. Wu S, Sun F, Zhang W, Xie X, Cui B. Graph neural networks in recommender systems: a survey. ACM Comput Surv. 2022;55(5):1–37. doi: 10.1145/3535101
  • 21. Wang Y, Wang YG, Hu C, Li M, Fan Y, Otter N, et al. Cell graph neural networks enable the precise prediction of patient survival in gastric cancer. NPJ Precis Oncol. 2022;6(1):45. doi: 10.1038/s41698-022-00285-5
  • 22. Sun Z, Yin H, Chen H, Chen T, Cui L, Yang F. Disease prediction via graph neural networks. IEEE J Biomed Health Inform. 2021;25(3):818–26. doi: 10.1109/JBHI.2020.3004143
  • 23. Gao J, Lyu T, Xiong F, Wang J, Ke W, Li Z. Predicting the survival of cancer patients with multimodal graph neural network. IEEE/ACM Trans Comput Biol Bioinform. 2022;19(2):699–709. doi: 10.1109/TCBB.2021.3083566
  • 24. Li Y, Qian B, Zhang X, Liu H. Graph neural network-based diagnosis prediction. Big Data. 2020;8(5):379–90. doi: 10.1089/big.2020.0070
  • 25. Brody S, Alon U, Yahav E. How Attentive are Graph Attention Networks? 30.05.2021. In: International Conference on Learning Representations (ICLR). 2022.
  • 26. Steinbach D, Ahrens PC, Schmidt M, Federbusch M, Heuft L, Lübbert C, et al. Applying machine learning to blood count data predicts sepsis with ICU admission. Clin Chem. 2024;70(3):506–15. doi: 10.1093/clinchem/hvae001
  • 27. Gibb S, Ahrens P, Steinbach D, Schmidt M, Kaiser T. sbcdata: laboratory diagnostics from septic and non-septic patients used in the AMPEL project. R package version 1.0.0. 2023. doi: 10.5281/zenodo.10122491
  • 28. Delano MJ, Ward PA. The immune system’s role in sepsis progression, resolution, and long-term outcome. Immunol Rev. 2016;274(1):330–53. doi: 10.1111/imr.12499
  • 29. Póvoa P. C-reactive protein: a valuable marker of sepsis. Intensive Care Med. 2002;28(3):235–43. doi: 10.1007/s00134-002-1209-6
  • 30. Moor M, Rieck B, Horn M, Jutzeler CR, Borgwardt K. Early prediction of sepsis in the ICU using machine learning: a systematic review. Front Med (Lausanne). 2021;8:607952. doi: 10.3389/fmed.2021.607952
  • 31. Kaji DA, Zech JR, Kim JS, Cho SK, Dangayach NS, Costa AB, et al. An attention based deep learning model of clinical events in the intensive care unit. PLoS One. 2019;14(2):e0211057. doi: 10.1371/journal.pone.0211057
  • 32. Liu VX, Fielding-Singh V, Greene JD, Baker JM, Iwashyna TJ, Bhattacharya J, et al. The timing of early antibiotics and hospital mortality in sepsis. Am J Respir Crit Care Med. 2017;196(7):856–63. doi: 10.1164/rccm.201609-1848OC
  • 33. Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning. Data mining, inference, and prediction. 2nd ed. New York, NY: Springer; 2009. doi: 10.1007/978-0-387-84858-7
  • 34. scikit-learn. 4.1. Partial Dependence and Individual Conditional Expectation plots [updated 19 Jun 2024; cited 20 Jun 2024]. Available from: https://scikit-learn.org/stable/modules/partial_dependence.html#h2009
  • 35. Fey M, Lenssen JE. Fast graph representation learning with PyTorch geometric. arXiv:1903.02428; 06.03.2019.
  • 36. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A. RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst, Man, Cybern A. 2010;40(1):185–97. doi: 10.1109/tsmca.2009.2029559
  • 37. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, et al., editors. Advances in Neural Information Processing Systems 32. Curran Associates, Inc; 2019. p. 8024–35. doi: 10.5555/3454287.3455008
  • 38. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011.
  • 39. scikit-learn: machine learning in Python - scikit-learn 1.2.2 documentation [updated 19 Apr 2023; cited 24 Apr 2023]. Available from: https://scikit-learn.org/stable/
  • 40. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. doi: 10.1162/neco.1997.9.8.1735
  • 41. Graves A, Fernández S, Schmidhuber J. Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. In: Duch W, Kacprzyk J, Oja E, Zadrożny S, editors. Artificial Neural Networks: Formal Models and Their Applications ‐ ICANN 2005. Berlin, Heidelberg: Springer Berlin Heidelberg; 2005. p. 799–804. doi: 10.1007/11550907_126
  • 42. LeCun Y, Bengio Y. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks. Cambridge, MA, USA: MIT Press; 1998. p. 255–8. doi: 10.5555/303568.303704
  • 43. Blakely D, Lanchantin J, Qi Y. Time and Space Complexity of Graph Convolutional Networks.
  • 44. LeCun Y, Kavukcuoglu K, Farabet C. Convolutional networks and applications in vision. ISCAS 2010. 2010 IEEE International Symposium on Circuits and Systems, Nano-Bio Circuit Fabrics and Systems: May 30th-June 2nd 2010, Paris, France. [Piscataway, N.J.]: IEEE; 2010. p. 253–6. doi: 10.1109/ISCAS.2010.5537907
  • 45. Angus DC, Wax RS. Epidemiology of sepsis: an update. Crit Care Med. 2001;29. Available from: https://journals.lww.com/ccmjournal/Fulltext/2001/07001/Epidemiology_of_sepsis__An_update.35.aspx
  • 46. Koh GCKW, Peacock SJ, van der Poll T, Wiersinga WJ. The impact of diabetes on the pathogenesis of sepsis. Eur J Clin Microbiol Infect Dis. 2012;31(4):379–88. doi: 10.1007/s10096-011-1337-4
  • 47. Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1. doi: 10.1038/s41597-022-01899-x
  • 48. Thoral PJ, Peppink JM, Driessen RH, Sijbrands EJG, Kompanje EJO, Kaplan L, et al. Sharing ICU patient data responsibly under the society of critical care medicine/European society of intensive care medicine joint data science collaboration: the Amsterdam university medical centers database (AmsterdamUMCdb) example. Crit Care Med. 2021;49(6):e563–77. doi: 10.1097/CCM.0000000000004916
  • 49. Hyland SL, Faltys M, Hüser M, Lyu X, Gumbsch T, Esteban C, et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat Med. 2020;26(3):364–73. doi: 10.1038/s41591-020-0789-4
  • 50.Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU collaborative research database, a freely available multi-center database for critical care research. Sci Data. 2018;5:180178. doi: 10.1038/sdata.2018.178 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Jang HJ, Schellingerhout D, Kim J, Chung J, Kim D-E. Towards a biomarker for acute arterial thrombosis using complete blood count and white blood cell differential parameters in mice. Sci Rep. 2023;13(1):4043. doi: 10.1038/s41598-023-31122-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Davis AS, Viera AJ, Mead MD. Leukemia: an overview for primary care. Am Fam Physician. 2014;89(9):731–8. [PubMed] [Google Scholar]
  • 53.Belson WA. Matching and prediction on the principle of biological classification. Appl Stat. 1959;8(2):65. doi: 10.2307/2985543 [DOI] [Google Scholar]
  • 54.Wright RE. Logistic regression. In: Grimm LG, Yarnold PR, editors. Reading and understanding multivariate statistics. American Psychological Association. 1995. p. 217–44. [Google Scholar]
  • 55.Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–6. doi: 10.1038/323533a0 [DOI] [Google Scholar]
  • 56.Nebe T, Bentzien F, Bruegel M, Fiedler GM, Gutensohn K, Heimpel H, et al. Multizentrische ermittlung von referenzbereichen für parameter des maschinellen blutbildes/multicentric determination of reference ranges for automated blood counts. LaboratoriumsMedizin. 2011;35(1):3–28. doi: 10.1515/jlm.2011.004 [DOI] [Google Scholar]
  • 57.Gulati GL, Hyun BH. The automated CBC. A current perspective. Hematol Oncol Clin North Am. 1994;8(4):593–603. [PubMed] [Google Scholar]

Decision Letter 0

Hanna Landenmark

Dear Dr. Walke,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 04 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Hanna Landenmark

Staff Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“We thank all co-authors for contributing to the manuscript. Furthermore, we thank the German Research Foundation (DFG) for funding this project [grant numbers HE 8077/2-1, SA 465/53-1].”

We note that you have provided funding information that is currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“Grant-Numbers: HE 8077/2-1, SA 465/53-1

Funder: German Research Foundation/ Deutsche Forschungsgemeinschaft (DFG) (https://www.dfg.de/de)

Funded Authors: D. W.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: No

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

Reviewer #1: The authors emphasize the importance of making connections between samples.

However, the experimental results do not show that it is more effective to analyze samples from the perspective of graphs.

There are some concerns.

When analyzing similarity graphs, GNNs are comparable in performance to other algorithms, but are less robust and require longer training times.

When analyzing patient-centric graphs, it seems that the attention mechanism, rather than the graph structure, is responsible for the improvement. I do not think other GNNs (e.g., GCN) can achieve the same high performance. More experiments may be needed here.

In addition, I wonder how the attributes of various types of nodes in heterogeneous similarity graphs are defined, since the attributes in homogeneous similarity graphs are used as nodes.

Despite these concerns, I think it makes sense to use GNNs to analyze medical data. The authors may need to further explore the way they construct the graphs, and why GNNs are more efficient.

Reviewer #2: This manuscript titled “Edges are all you need: Potential of Medical Time Series Analysis with Graph Neural Networks” introduced an approach for incorporating time-series clinical diagnosis data efficiently, and showed that graph neural networks (GNNs) provide a better alternative than traditional machine learning (ML) algorithms. While I think this work is very well-written and well explained, the novelty and impact of this work fall short for acceptance in PLOS One. Therefore, I reject this manuscript. I addressed the concerns below:

1. Novelty issue: The main objective of this paper and the graph neural network based approach are not new for clinical data. Several previous works have already applied GNNs to clinical data. Some of the previous works are listed below [1 - 4]. Moreover, the feature importance calculation part is also not the invention of the authors, as their described process falls under a special form of ablation study commonly performed in GNN papers. Furthermore, the GNNs they used to evaluate performance were also not developed by the authors. Under these circumstances, I believe this work has not been able to meet the novelty criteria for PLOS One.

[1] Wang, Yanan, Yu Guang Wang, Changyuan Hu, Ming Li, Yanan Fan, Nina Otter, Ikuan Sam et al. "Cell graph neural networks enable the precise prediction of patient survival in gastric cancer." NPJ precision oncology 6, no. 1 (2022): 45.

[2] Sun, Zhenchao, Hongzhi Yin, Hongxu Chen, Tong Chen, Lizhen Cui, and Fan Yang. "Disease prediction via graph neural networks." IEEE Journal of Biomedical and Health Informatics 25, no. 3 (2020): 818-826.

[3] Gao, Jianliang, Tengfei Lyu, Fan Xiong, Jianxin Wang, Weimao Ke, and Zhao Li. "Predicting the survival of cancer patients with multimodal graph neural network." IEEE/ACM Transactions on Computational Biology and Bioinformatics 19, no. 2 (2021): 699-709.

[4] Li, Yang, Buyue Qian, Xianli Zhang, and Hui Liu. "Graph neural network-based diagnosis prediction." Big data 8, no. 5 (2020): 379-390.

2. Work amount issue: The authors only evaluated performances for one dataset which is quite inadequate for evaluating performances across different types of data. Without showing that their approach generalizes for multiple types of data, this work cannot be accepted.

3. Results issue: According to the authors' comments, if incorporating time series data improves performance for these types of data, then better alternatives are recurrent neural networks (RNNs) and Transformers. But no comparison was shown of the GNNs with these types of models. GNNs are more suited for data that has natural graph-like structures (e.g., crystals, proteins, molecules, RNA, etc.). So, without comparing it with at least an LSTM model (RNN) [1], I cannot understand the impact of this work.

[1] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9, no. 8 (1997): 1735-1780.

4. “However, this information is mostly neglected by state-of-the-art machine learning algorithms” - you need to cite some works.

5. “Furthermore, GNNs and other complex machine learning algorithms (e.g., XGBoost) are often treated as black-boxes limiting their interpretability and transparency which is essential for medical applications.” - this work also does not address this issue. The feature importance calculation does not address this issue as this refers to the interpretability of the neural network itself. For example, what sort of information the output of each layer (latent space) bears.

6. What is the validity of synthetic data generated? No explanation was provided.

7. “The reason for this similar performance is that the nodes of complete blood counts only sample information from similar node blood count measurements (measurement-centric graphs).” - not a strong reason, need to describe with respect to the GNN architecture.

8. No details on the GNN algorithms used. The readers need to know the scientific reasons why GNN is performing better than traditional ML models. The authors need to explain why particular GNN architecture performed better, and why particular GNN architecture performed worse. Because this work can be iteratively improved, but if they don’t delve into the GNN architecture, this becomes very hard to improve logically.

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Review-PlosOne.pdf

pone.0327636.s006.pdf (73.9KB, pdf)
PLoS One. 2025 Jul 8;20(7):e0327636. doi: 10.1371/journal.pone.0327636.r002

Author response to Decision Letter 1


24 Jun 2024

Reviewer 1

Thank you for your feedback. We hope that the extended results have improved our manuscript.

The authors emphasize the importance of making connections between samples.

However, the experimental results do not show that it is more effective to analyze samples from the perspective of graphs. There are some concerns.

#R1_1: When analyzing similarity graphs, GNNs are comparable in performance to other algorithms, but are less robust and require longer training times.

#R1_1 Answer: Yes, which is why we advise against using GNNs on similarity-based graph structures. However, we can achieve improvements with GNNs on patient-centric graphs (see #R1_2).

#R1_2: When analyzing patient-centric graphs, it seems that the attention mechanism, rather than the graph structure, is responsible for the improvement. I do not think other GNNs (e.g., GCN) can achieve the same high performance. More experiments may be needed here.

#R1_2 Answer: We performed further experiments with additional GNNs (GraphSAGE and GCN) on patient-centric graphs. The performance of GraphSAGE was comparable to that of GAT, indicating that the attention mechanism is not necessary for the observed improvements. GCN performed slightly worse on the directed and undirected graph structures, which might be a consequence of its symmetric normalization and of applying the linear transformation before message propagation (GraphSAGE, in contrast, first propagates and then transforms the data). However, on the reversed-directed graphs, its performance was similar to that of GAT (see #R1_2).
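
As an editorial illustration of the architectural difference described in this answer, the two update rules can be contrasted in a minimal sketch on scalar node features. This is not the authors' PyTorch Geometric implementation: the function names `gcn_layer` and `sage_mean_layer`, the scalar weights standing in for the linear transformations, and the three-node chain are assumptions made for this example only.

```python
import math


def gcn_layer(x, edges, w):
    """GCN-style update on scalar features (illustrative only).

    x: list of node features; edges: undirected (i, j) pairs;
    w: scalar weight standing in for the linear transformation.
    """
    n = len(x)
    # Neighbour sets including a self-loop for every node.
    nbrs = {i: {i} for i in range(n)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    deg = {i: len(nbrs[i]) for i in range(n)}
    h = [w * xi for xi in x]  # transform first ...
    # ... then propagate with symmetric normalisation 1 / sqrt(d_i * d_j).
    return [
        sum(h[j] / math.sqrt(deg[i] * deg[j]) for j in nbrs[i])
        for i in range(n)
    ]


def sage_mean_layer(x, edges, w_self, w_nbr):
    """GraphSAGE-style update with a mean aggregator (illustrative only)."""
    n = len(x)
    nbrs = {i: set() for i in range(n)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    out = []
    for i in range(n):
        # Propagate first (plain mean of the neighbours) ...
        mean = sum(x[j] for j in nbrs[i]) / len(nbrs[i]) if nbrs[i] else 0.0
        # ... then transform; the node's own feature gets a separate weight.
        out.append(w_self * x[i] + w_nbr * mean)
    return out


# Hypothetical chain of three measurements: 0 - 1 - 2
x = [1.0, 2.0, 3.0]
edges = [(0, 1), (1, 2)]
print(gcn_layer(x, edges, w=1.0))
print(sage_mean_layer(x, edges, w_self=1.0, w_nbr=1.0))  # [3.0, 4.0, 5.0]
```

In the GCN-style rule, a node's own value and its neighbours' values share one weight and are scaled by node degrees, whereas the GraphSAGE-style rule keeps the node's own value separate from the neighbourhood mean. Whether this accounts for the performance gap reported above is the authors' hypothesis, not something the sketch demonstrates.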

#R1_3: In addition, I wonder how the attributes of various types of nodes in heterogeneous similarity graphs are defined, since the attributes in homogeneous similarity graphs are used as nodes.

#R1_3 Answer: We have tried to clarify the explanation further (see #R1_3).

#R1_4: Despite these concerns, I think it makes sense to use GNNs to analyze medical data. The authors may need to further explore the way they construct the graphs, and why GNNs are more efficient.

#R1_4 Answer: Our main objective was to improve clinical decision support systems that use clinical routine data, such as complete blood count data, with the algorithms best suited to the underlying data. We now highlight this focus more clearly in our title. In the process, we found that incorporating time series information (patient-centric graphs) was much more promising than considering the complete blood count measurements independently of the patient (e.g., similarity graphs).
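
To make the notion of a patient-centric graph concrete, the following sketch builds edge lists for the directed, reversed-directed, and undirected variants mentioned in this response. It is an editorial illustration, not the authors' code: the function name `patient_centric_edges`, the `(patient_id, time)` input format, and in particular the choice to link each measurement only to the next measurement of the same patient are assumptions.

```python
from collections import defaultdict


def patient_centric_edges(samples, mode="directed"):
    """Build edges between measurements of the same patient.

    samples: list of (patient_id, time) tuples; the node index is the
             position in this list (hypothetical input format).
    mode: "directed" (earlier -> later), "reversed" (later -> earlier),
          or "undirected" (both directions).
    Assumption: only consecutive measurements of a patient are linked.
    """
    if mode not in ("directed", "reversed", "undirected"):
        raise ValueError(f"unknown mode: {mode}")
    by_patient = defaultdict(list)
    for node, (patient_id, time) in enumerate(samples):
        by_patient[patient_id].append((time, node))
    edges = []
    for visits in by_patient.values():
        visits.sort()  # order each patient's measurements by time
        for (_, earlier), (_, later) in zip(visits, visits[1:]):
            if mode in ("directed", "undirected"):
                edges.append((earlier, later))
            if mode in ("reversed", "undirected"):
                edges.append((later, earlier))
    return edges


# Hypothetical example: patient "A" has three measurements, "B" has one.
samples = [("A", 2), ("B", 1), ("A", 1), ("A", 3)]
print(patient_centric_edges(samples, "directed"))  # [(2, 0), (0, 3)]
```

Patients with a single measurement contribute an isolated node, so time series of any length fit into the same graph without further preprocessing.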

Reviewer 2

Thank you for your comprehensive feedback on our work. We hope that the changes and extended results have improved the manuscript.

This manuscript titled “Edges are all you need: Potential of Medical Time Series Analysis with Graph Neural Networks” introduced an approach for incorporating time-series clinical diagnosis data efficiently, and showed that graph neural networks (GNNs) provide a better alternative than traditional machine learning (ML) algorithms. While I think this work is very well-written and well explained, the novelty and impact of this work fall short for acceptance in PLOS One. Therefore, I reject this manuscript. I addressed the concerns below:

#R2_1: 1. Novelty issue: The main objective of this paper and the graph neural network based approach are not new for clinical data. Several previous works have already applied GNNs to clinical data. Some of the previous works are listed below [1 - 4]. Moreover, the feature importance calculation part is also not the invention of the authors, as their described process falls under a special form of ablation study commonly performed in GNN papers. Furthermore, the GNNs they used to evaluate performance were also not developed by the authors. Under these circumstances, I believe this work has not been able to meet the novelty criteria for PLOS One.

[1] Wang, Yanan, Yu Guang Wang, Changyuan Hu, Ming Li, Yanan Fan, Nina Otter, Ikuan Sam et al. "Cell graph neural networks enable the precise prediction of patient survival in gastric cancer." NPJ precision oncology 6, no. 1 (2022): 45.

[2] Sun, Zhenchao, Hongzhi Yin, Hongxu Chen, Tong Chen, Lizhen Cui, and Fan Yang. "Disease prediction via graph neural networks." IEEE Journal of Biomedical and Health Informatics 25, no. 3 (2020): 818-826.

[3] Gao, Jianliang, Tengfei Lyu, Fan Xiong, Jianxin Wang, Weimao Ke, and Zhao Li. "Predicting the survival of cancer patients with multimodal graph neural network." IEEE/ACM Transactions on Computational Biology and Bioinformatics 19, no. 2 (2021): 699-709.

[4] Li, Yang, Buyue Qian, Xianli Zhang, and Hui Liu. "Graph neural network-based diagnosis prediction." Big data 8, no. 5 (2020): 379-390.

#R2_1 Answer: We included the references in our introduction section, rewrote the relevant sections, and clarified our main objectives (see #R2_1). Our main objective is to improve clinical decision support systems that use clinical routine data, such as complete blood count data, with the algorithms best suited to the underlying data (time series of different lengths). Regardless of this objective, novelty is not a publication criterion for PLOS ONE, as indicated on its homepage: “The world’s first multidisciplinary Open Access journal, PLOS ONE accepts scientifically rigorous research, regardless of novelty.” (see https://everyone.plos.org/about-plos-one/)

#R2_2: 2. Work amount issue: The authors only evaluated performances for one dataset which is quite inadequate for evaluating performances across different types of data. Without showing that their approach generalizes for multiple types of data, this work cannot be accepted.

#R2_2 Answer: We neither intended nor claimed to evaluate the performance across multiple data types. Our objective was to evaluate the classification performance on complete blood count data from clinical routine (see answer #R2_1). An external validation of our results was provided with an external validation dataset from a different tertiary care center.

#R2_3: 3. Results issue: According to the authors' comments, if incorporating time series data improves performance for these types of data, then better alternatives are recurrent neural networks (RNNs) and Transformers. But no comparison was shown of the GNNs with these types of models. GNNs are more suited for data that has natural graph-like structures (e.g., crystals, proteins, molecules, RNA, etc.). So, without comparing it with at least an LSTM model (RNN) [1], I cannot understand the impact of this work.

[1] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9, no. 8 (1997): 1735-1780.

#R2_3 Answer: For the sake of completeness, we added an explicit decoder-only transformer architecture, a 1D-CNN, an LSTM, and a Bi-LSTM (see #R2_3). In contrast to the GNN architecture, these architectures (at least the Transformer and the 1D-CNN, and, for efficient batching, also the LSTM architectures) require an additional padding and masking step, which complicates their use for time series of different lengths.

#R2_4: 4. “However, this information is mostly neglected by state-of-the-art machine learning algorithms” - you need to cite some works.
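
The padding and masking step mentioned in this answer can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the authors' preprocessing: the helper names `pad_and_mask` and `masked_mean`, the padding value of 0.0, and the single-feature sequences are hypothetical, and every sequence is assumed to be non-empty.

```python
def pad_and_mask(sequences, pad_value=0.0):
    """Pad variable-length sequences to a common length.

    Returns the padded sequences and a 0/1 mask of the same shape,
    where 1 marks a real measurement and 0 marks padding.
    Assumes at least one sequence and no empty sequences.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


def masked_mean(padded, mask):
    """Average each padded sequence over its real (unmasked) entries only."""
    return [
        sum(value * m for value, m in zip(row, mask_row)) / sum(mask_row)
        for row, mask_row in zip(padded, mask)
    ]


# Hypothetical example: one patient with three measurements, one with a single one.
padded, mask = pad_and_mask([[1.0, 2.0, 3.0], [4.0]])
print(padded)                     # [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0]]
print(mask)                       # [[1, 1, 1], [1, 0, 0]]
print(masked_mean(padded, mask))  # [2.0, 4.0]
```

Without the mask, the padded zeros would be treated as measurements and distort any aggregation over the sequence, which is the extra bookkeeping the answer refers to.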

#R2_4 Answer: We clarified the statement (see #R2_4).

#R2_5: 5. “Furthermore, GNNs and other complex machine learning algorithms (e.g., XGBoost) are often treated as black-boxes limiting their interpretability and transparency which is essential for medical applications.” - this work also does not address this issue. The feature importance calculation does not address this issue as this refers to the interpretability of the neural network itself. For example, what sort of information the output of each layer (latent space) bears.

#R2_6: 6. What is the validity of synthetic data generated? No explanation was provided.

#R2_5+6 Answer: While writing the manuscript, we were unfortunately not aware of partial dependence plots (https://hastie.su.domains/ElemStatLearn/), which address precisely our goal (i.e., investigating how predictions change with different feature values). We therefore changed our methodology to partial dependence plots and plotted the average prediction across grid values between the 5th and 95th percentiles (see #R2_5+6). This lets us directly see the influence of different values of individual features on the overall prediction of a specific complete blood count measurement.
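
The partial dependence computation described in this answer can be sketched in plain Python. This is an editorial illustration, not the authors' analysis code: the helper names `percentile` and `partial_dependence`, the linear-interpolation percentile, the evenly spaced grid, and the toy model are assumptions. In practice, scikit-learn offers `sklearn.inspection.partial_dependence` for fitted estimators.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in [0, 100])."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def partial_dependence(predict, X, feature, n_grid=5):
    """Average prediction as one feature is swept over a grid.

    predict: function mapping one sample (list of floats) to a prediction.
    X: list of samples; feature: index of the feature to vary.
    The grid spans the 5th to 95th percentile of that feature (n_grid >= 2).
    """
    column = [row[feature] for row in X]
    low, high = percentile(column, 5), percentile(column, 95)
    grid = [low + (high - low) * k / (n_grid - 1) for k in range(n_grid)]
    averages = []
    for value in grid:
        predictions = []
        for row in X:
            modified = list(row)       # copy so the data stay unchanged
            modified[feature] = value  # overwrite only the swept feature
            predictions.append(predict(modified))
        averages.append(sum(predictions) / len(predictions))
    return grid, averages


# Hypothetical toy model and data, for illustration only.
toy_model = lambda row: 2 * row[0] + row[1]
X = [[0.0, 1.0], [10.0, 3.0], [20.0, 5.0]]
grid, averages = partial_dependence(toy_model, X, feature=0, n_grid=3)
print(grid)
print(averages)
```

For this toy model the averaged prediction rises by two units per unit increase of the swept feature, so the slope of the curve recovers the feature's coefficient. For a non-linear model such as a GNN or XGBoost, the curve shows how the average prediction changes across the feature's range.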

#R2_7+8: 7. “The reason for this similar performance is that the nodes of complete blood counts only sample information from similar node blood count measurements (measurement-centric graphs).” - not a strong reason, need to describe with respect to the GNN architecture.

8. No details on the GNN algorithms used. The readers need to know the scientific reasons why GNN is performing better than traditional ML models. The authors need to explain why particular GNN architecture performed better, and why particular GNN architecture performed worse. Because this work can be iteratively improved, but if they don’t delve into the GNN architecture, this becomes very hard to improve logically.

#R2_7+8 Answer: We added further explanations, including the individual architectures of the investigated GNNs (see #R2_7+8).

Attachment

Submitted filename: Response To Reviewers.docx

pone.0327636.s009.docx (21.5KB, docx)

Decision Letter 1

Giacomo Fiumara

Dear Dr. Walke,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 05 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Giacomo Fiumara, PhD

Academic Editor

PLOS ONE

Additional Editor Comments:

The second round of reviews is now complete. In my opinion, the manuscript must undergo a major revision before it can be considered for publication in PLOS ONE.

What emerges is that the manuscript lacks a unified style. In addition, the abstract and the introduction should (at least) mention all the algorithms used in the experiments. In this respect, the abstract and the introduction fail to present the research.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #3: (No Response)

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #3: Yes

Reviewer #4: No

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: Yes

Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #3: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #3: Yes

Reviewer #4: Yes

**********

Reviewer #3: The manuscript presents an original study that applies Graph Neural Networks (GNNs) to predict sepsis using Complete Blood Count (CBC) data. This is an interesting and promising approach, but the paper has several areas that require improvement. While the second half of the paper addresses many initial issues, there are gaps in clarity, the experimental setup, and interpretability that need to be tackled before the manuscript can be considered for publication.

Originality

The application of GNNs to patient-centric graphs for sepsis detection is novel and contributes to ongoing research into machine learning in healthcare. However, the work overlaps with previous studies like Chen et al. (Eur Rev Med Pharmacol Sci 2021; 25 (14): 4693-4701, DOI: 10.26355/eurrev_202107_26380), which used GNNs for pediatric sepsis. A more explicit comparison with this and similar works, along with a justification of the broader patient population, would strengthen the claim of originality.

Clinical Relevance

The selection of CBC parameters is reasonable, but the relevance of each feature to sepsis prediction should be better explained. For readers unfamiliar with clinical data, it would be helpful to briefly introduce why these specific markers are key indicators of sepsis and to support this with appropriate references.

Ethical Considerations

The manuscript states that ethical approval was not applicable due to the anonymization of the data. However, further elaboration is necessary regarding how the privacy of patient data was ensured. Data privacy is particularly sensitive in healthcare applications, and a brief explanation of how the dataset was anonymized would help meet ethical transparency standards.

Evaluation of Paper Sections

• Abstract

The abstract is concise but could be clearer. For instance, it mentions AUROC as the primary evaluation metric but doesn’t justify why this metric was chosen over others like the F1 score, which is more commonly used. A brief justification for focusing on AUROC would be helpful. Additionally, more details about the graph structure (why it was used over simpler models) and how GNNs handle time-series data would improve the clarity.

• Introduction

The introduction is somewhat lacking in depth, particularly concerning the clinical background of sepsis and the rationale for using GNNs. A more detailed explanation of why GNNs are suited for sepsis prediction, compared to simpler methods, would benefit the reader. The discussion of DeepWalk and Node2Vec compares them to GNNs, implying they are machine learning algorithms. These methods are more accurately described as graph embedding techniques used for feature representation, not prediction. This section would benefit from clarification and more technical precision. The introduction should also be expanded to give a clearer description of how the graph structure was defined. The definition of nodes and edges is somewhat imprecise and could confuse readers unfamiliar with graph theory terminology. For example, terms like "vertices" and "links" are interchangeable with "nodes" and "edges," but their use should be consistent.

• Methods

The methods section provides a detailed explanation of the experimental setup but lacks justification for some key decisions. For instance, why were tree-based algorithms chosen as baselines? Further, the reasoning behind certain hyperparameter choices, such as the number of epochs and learning rate, is unclear. More justification for selecting GraphSAGE as the representative GNN is necessary: were other GNN architectures considered, and if so, why was GraphSAGE chosen as the final model?

• Results

The results section is one of the paper’s strengths, providing a comprehensive evaluation of the models and a comparison between different machine learning approaches. However, the choice of AUROC as the primary metric remains problematic. The authors do mention F1 later in the paper but should provide more justification for emphasizing AUROC initially. The partial dependence plots in the results are well-executed, but the explanation of how the direction of individual features (e.g., an increase in white blood cell count) affects predictions could be clearer. While the paper mentions this limitation, providing more interpretability around feature impact is crucial, particularly in clinical settings where decisions depend on understanding how individual factors contribute to risk.

• Discussion

The discussion section does a good job of addressing the limitations of the study, including bias in the dataset and computational challenges. However, the paper would benefit from a more detailed analysis of the computational complexity of GNNs, which are known to be resource-intensive. Were any optimizations considered, such as using more efficient GNN architectures or limiting the number of layers to reduce computational overhead?

• Data and Code Availability

The transparency in making the code and data publicly available on GitHub and Zenodo is commendable and aligns with best practices in reproducibility. This adherence to open science is a strength of the paper, ensuring that the study can be replicated and verified by other researchers.

• Appendix

The appendix provides valuable supplementary materials, including tables, plots, and diagrams that enhance the main text. These materials clarify several points that are less detailed in the main sections.

Recommendation

The study presents a novel application of GNNs to sepsis prediction and shows strong experimental work. However, the manuscript requires revisions to improve its clarity and justification of choices in both methodology and evaluation metrics. I recommend a minor revision to address these issues before it can be considered for publication in PLOS ONE.

Reviewer #4: The article lacks coherence between the introduction and the conclusion and does not delve into the GNN algorithms used. It also fails to provide an explanation of why some GNN algorithms perform better than others, nor does it discuss the implemented architecture. This significantly reduces its scientific value. The absence of such crucial details makes it difficult for the reader to fully understand the work. Due to these reasons, as well as those outlined in the major comments (see attached file), I believe the article is not suitable for publication in the journal.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy

Reviewer #3: No

Reviewer #4: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: JNCA_D_24_00921_R1_reviewer.pdf

pone.0327636.s008.pdf (51.8KB, pdf)
PLoS One. 2025 Jul 8;20(7):e0327636. doi: 10.1371/journal.pone.0327636.r004

Author response to Decision Letter 2


27 Nov 2024

We sincerely thank the editor and the reviewers for their thoughtful and constructive feedback, which has helped us improve the quality and clarity of our work. We greatly appreciate the time and effort invested in reviewing our manuscript and providing detailed comments and suggestions. Below, we address each comment point by point and describe the corresponding changes made to the manuscript starting with Additional Editor Comments. We are confident that these revisions have significantly strengthened the work and hope they meet the reviewer’s expectations.

Additional Editor Comments

The second round of reviews is now complete. My opinion is that the manuscript must undergo a major revision before being considered for publication in PLOS ONE.

What emerges is that the manuscript lacks a unitary style. In addition, the abstract and the introduction should (at least) mention all the algorithms used in the experiments. In this respect, the abstract and the introduction fail in presenting the research.

We revised the style and writing according to the reviewers’ suggestions. We added the GNNs to the abstract and introduction (see Editor request).


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: (No Response)

Reviewer #4: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #3: Yes

Reviewer #4: No

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: Yes

Reviewer #4: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: Yes

Reviewer #4: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: Yes

Reviewer #4: Yes

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3

The manuscript presents an original study that applies Graph Neural Networks (GNNs) to predict sepsis using Complete Blood Count (CBC) data. This is an interesting and promising approach, but the paper has several areas that require improvement. While the second half of the paper addresses many initial issues, there are gaps in clarity, the experimental setup, and interpretability that need to be tackled before the manuscript can be considered for publication.

#R3_1: Originality

The application of GNNs to patient-centric graphs for sepsis detection is novel and contributes to ongoing research into machine learning in healthcare. However, the work overlaps with previous studies like Chen et al. (Eur Rev Med Pharmacol Sci 2021; 25 (14): 4693-4701, DOI: 10.26355/eurrev_202107_26380), which used GNNs for pediatric sepsis. A more explicit comparison with this and similar works, along with a justification of the broader patient population, would strengthen the claim of originality.

Answer #R3_1: We included the reference and highlighted that they classified pediatric sepsis instead of sepsis in adults, only used similarity graphs and incorporated multiple laboratory tests instead of using only data from the clinical routine like complete blood count data (“[GNNs] already showed promising results for predicting pediatric sepsis based on several groups of laboratory tests (e.g., medical history and serological tests) using similarity graphs [17].”, see #R3_1).

#R3_2: Clinical Relevance

The selection of CBC parameters is reasonable, but the relevance of each feature to sepsis prediction should be better explained. For readers unfamiliar with clinical data, it would be helpful to briefly introduce why these specific markers are key indicators of sepsis and to support this with appropriate references.

Answer #R3_2: We added a supplementary note to explain the functions of individual blood parameters and their potential relevance for sepsis. A text reference for the supplementary note was added in the manuscript (“The functions and the potential relevance of the individual blood parameters for sepsis is discussed in the supplement (Supplementary Note 1)”, see #R3_2).

#R3_3: Ethical Considerations

The manuscript states that ethical approval was not applicable due to the anonymization of the data. However, further elaboration is necessary regarding how the privacy of patient data was ensured. Data privacy is particularly sensitive in healthcare applications, and a brief explanation of how the dataset was anonymized would help meet ethical transparency standards.

Answer #R3_3: We added ethical approvals from the original study of Steinbach et al. (“The Ethics Committee at the Leipzig University Faculty of Medicine approved the initial study from Steinbach et al. [26] (reference number: 214/18-ek). The study was published in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) statement. This study is only re-evaluating the dataset from Steinbach et al. [26] by evaluating GNNs and incorporating time-series information”, see #R3_3).

Evaluation of Paper Sections

#R3_4: • Abstract

The abstract is concise but could be clearer. For instance, it mentions AUROC as the primary evaluation metric but doesn’t justify why this metric was chosen over others like the F1 score, which is more commonly used. A brief justification for focusing on AUROC would be helpful. Additionally, more details about the graph structure (why it was used over simpler models) and how GNNs handle time-series data would improve the clarity.

Answer #R3_4: We added respective details in the abstract (“Methods: In this study, we evaluated the performance and time consumption of several GNNs (e.g., Graph Attention Networks) on similarity graphs compared to simpler, state-of-the-art machine learning algorithms (e.g., XGBoost) on the classification of sepsis from blood count data as well as the importance and slope of each feature for the final classification. Additionally, we connected complete blood count samples of the same patient based on their measured time (patient-centric graphs) to incorporate time series information in the GNNs. As our main evaluation metric we used the Area Under Receiver Operating Curve (AUROC) to have a threshold independent metric that can handle class imbalance.

Results and Conclusion: Standard GNNs on evaluated similarity-graphs achieved an Area Under Receiver Operating Curve (AUROC)[...]”, see #R3_4).
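To illustrate what "threshold independent" means here, the following is a minimal sketch in plain Python, not the study's evaluation code. It computes AUROC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, and compares it with the F1 score, which changes with the decision threshold. The labels and scores are invented toy values, and the helper names `auroc` and `f1_at` are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at(labels, scores, threshold):
    """F1 score of the positive class after thresholding the scores."""
    pred = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(pred, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(pred, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(pred, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy, imbalanced example (8 negatives, 2 positives); values are made up.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.60, 0.55, 0.90]

print(auroc(labels, scores))       # 0.9375, no threshold involved
print(f1_at(labels, scores, 0.5))  # 0.8
print(f1_at(labels, scores, 0.7))  # about 0.667, same scores, different threshold
```

The single AUROC value summarizes the ranking quality of the scores, whereas F1 has to be reported for a specific operating point.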

#R3_5: • Introduction

The introduction is somewhat lacking in depth, particularly concerning the clinical background of sepsis and the rationale for using GNNs. A more detailed explanation of why GNNs are suited for sepsis prediction, compared to simpler methods, would benefit the reader. The discussion of DeepWalk and Node2Vec compares them to GNNs, implying they are machine learning algorithms. These methods are more accurately described as graph embedding techniques used for feature representation, not prediction. This section would benefit from clarification and more technical precision. The introduction should also be expanded to give a clearer description of how the graph structure was defined. The definition of nodes and edges is somewhat imprecise and could confuse readers unfamiliar with graph theory terminology. For example, terms like "vertices" and "links" are interchangeable with "nodes" and "edges," but their use should be consistent.

Answer #R3_5: We corrected our definition of a graph and clarified that DeepWalk and node2vec are embedding techniques. Then, we added more details regarding the clinical background of sepsis. Finally, we added more details on how we defined the graph structures and the rationale behind using GNNs (“A graph G is a non-empty finite set of elements called nodes V(G) and a finite set E(G) of distinct unordered pairs of distinct elements of V(G) called edges [6]. [...] GNNs have the advantage that they can utilize attached features and parallelize computations on modern hardware (e.g., GPUs) in contrast to using embedding techniques like DeepWalk [15] or Node2Vec [16]. [...] Sepsis is a life-threatening organ dysfunction caused by a dysregulated immune response to an infection [28]. The inflammatory response is driven through the release of cytokines from neutrophil granulocytes and macrophages. Blood parameters like white blood cells, red blood cells, platelets, hemoglobin and mean corpuscular volume might serve as easily available indicators for sepsis [26] (Supplementary Note 1).”, see #R3_5).
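The graph definition above, together with the patient-centric graphs described in the abstract, can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: the text only says that samples of the same patient were connected based on their measured time, so linking each sample to the next one of the same patient is an assumption, as are the function name `patient_centric_edges` and the tuple layout of `samples`. The undirected variant is stored as two directed pairs, a common edge-list convention.

```python
from collections import defaultdict


def patient_centric_edges(samples, directed=True):
    """Build an edge list linking consecutive samples of the same patient.

    samples: iterable of (sample_id, patient_id, time) tuples (hypothetical layout).
    Returns (source, target) pairs pointing from the earlier to the later sample;
    with directed=False the reverse pair is added as well.
    """
    by_patient = defaultdict(list)
    for sample_id, patient_id, time in samples:
        by_patient[patient_id].append((time, sample_id))

    edges = []
    for visits in by_patient.values():
        visits.sort()  # order each patient's samples by measurement time
        for (_, earlier), (_, later) in zip(visits, visits[1:]):
            edges.append((earlier, later))
            if not directed:
                edges.append((later, earlier))
    return edges


# Toy data: patient "a" has three samples, patient "b" has one (no edges).
samples = [(0, "a", 2), (1, "a", 1), (2, "b", 5), (3, "a", 3)]
print(patient_centric_edges(samples))
# [(1, 0), (0, 3)]
print(patient_centric_edges(samples, directed=False))
# [(1, 0), (0, 1), (0, 3), (3, 0)]
```

Nodes are the sample ids, and no sample is ever linked to itself or to a sample of another patient, which matches the requirement that edges join distinct elements of V(G).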

#R3_6: • Methods

The methods section provides a detailed explanation of the experimental setup but lacks justification for some key decisions. For instance, why were tree-based algorithms chosen as baselines? Further, the reasoning behind certain hyperparameter choices, such as the number of epochs and learning rate, is unclear. More justification for selecting GraphSAGE as the representative GNN is necessary: were other GNN architectures considered, and if so, why was GraphSAGE chosen as the final model?

Answer #R3_6: We used tree-based and non-tree-based algorithms (e.g., logistic regression and a neural network) as benchmarks to get a comprehensive overview of the performance across several algorithms. We also added justifications for the chosen hyperparameters and for why we chose GraphSAGE as the final GNN model for the interpretability analysis (“As benchmarks, we evaluated the performance (AUROC, F1-Macro Score, and MCC) on tree-based and non-tree-based algorithms (Fig. 1 C) to get a comprehensive performance evaluation across several algorithms. [...] The high epoch number in combination with a low learning rate should prevent under-fitting. Early-stopping on a separate validation set was used to prevent over-fitting. [...] GraphSAGE was chosen as the final model since it achieved a reliable classification performance on the homogeneous and heterogeneous similarity graphs.”, see #R3_6).
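The combination of a high epoch budget with early stopping on a validation set can be sketched generically as below. This is an illustrative loop, not the training code of the study: the function name, the `patience` parameter and the simulated validation losses are assumptions introduced for the example.

```python
def train_with_early_stopping(train_step, validate, max_epochs, patience):
    """Train for up to max_epochs, stopping once the validation loss has not
    improved for `patience` consecutive epochs. Returns (best_epoch, best_loss)."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_step(epoch)            # one pass over the training data
        val_loss = validate(epoch)   # loss on the separate validation set
        if val_loss < best_loss:
            best_loss, best_epoch, waited = val_loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                # validation loss stopped improving
    return best_epoch, best_loss


# Simulated validation losses standing in for a real model (made-up values).
simulated = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.7, 0.8]
epochs_run = []
result = train_with_early_stopping(
    train_step=epochs_run.append,
    validate=lambda epoch: simulated[epoch],
    max_epochs=len(simulated),
    patience=3,
)
print(result)           # (2, 0.6)
print(len(epochs_run))  # 6, training stopped before the epoch budget was used
```

A large `max_epochs` then acts only as an upper bound, while the validation loss decides when training actually ends.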

#R3_7: • Results

The results section is one of the paper’s strengths, providing a comprehensive evaluation of the models and a comparison between different machine learning approaches. However, the choice of AUROC as the primary metric remains problematic. The authors do mention F1 later in the paper but should provide more justification for emphasizing AUROC initially. The partial dependence plots in the results are well-executed, but the explanation of how the direction of individual features (e.g., an increase in white blood cell count) affects predictions could be clearer. While the paper mentions this limitation, providing more interpretability around feature impact is crucial, particularly in clinical settings where decisions depend on understanding how individual factors contribute to risk.

Answer #R3_7: We added justifications for why we chose AUROC as the primary evaluation metric (“In the following, we will mainly focus on AUROC as the primary evaluation metric to have a threshold-independent evaluation metric that can also incorporate the high class imbalance. By assessing model performance across different thresholds, AUROC enables clinicians to fine-tune the sensitivity and specificity of sepsis detection according to their needs, minimizing both the risk of missing septic patients and the potential harm from overdiagnosis, such as unnecessary antibiotic treatments.”, see #R3_7). Additionally, we provided clearer explanations for the partial dependence plots (“Specifically, that means older people with higher white blood cell counts, red blood cell counts and increased corpuscular volume have higher sepsis probabilities according to the trained models. White blood cells had a minimum of around 4–8 Gpt/l (Gigaparticles per liter) for most algorithms (Fig. 3 D) that indicates a physiological white blood cell count at this range. In contrast to the positive gradient, we observed a negative gradient for platelets for all algorithms (Fig. 3 G), indicating decreased sepsis risk for rising values according to the models. Specifically, this means that patients with lower platelets (thrombocytopenia) have a higher sepsis probability according to the trained models. Tree-based algorithms (Decision Tree, RUSBoost, Random Forest, XGBoost) do not depend (gradient near zero) on the features hemoglobin (Fig. 3 C) and red blood cells (Fig. 3 E) in contrast to non-tree-based algorithms (Logistic Regression, Neural Network, Homogeneous GNN, Heterogeneous GNN). The latter ones showed an increased sepsis probability for low hemoglobin levels (anemia) (Fig. 3 C). Finally, the feature “sex” is nearly irrelevant for all algorithms (i.e., low gradient), i.e., the sepsis probability does not significantly depend on sex (Fig. 3 B).”, see #R3_7).
In terms of interpretability, we are actively developing a novel graph learning framework aimed at significantly improving interpretability, which remains a focus of our ongoing work.
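For readers unfamiliar with partial dependence plots, the idea behind the gradients discussed above can be sketched as follows. This is a generic, minimal computation and not the procedure used in the study: `toy_model` is a hand-written stand-in rather than a trained classifier, and the rows, coefficients and grid values are invented for illustration.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """For each grid value, overwrite one feature in every row and
    average the model's predictions over the whole dataset."""
    curve = []
    for value in grid:
        modified = [row[:feature_index] + [value] + row[feature_index + 1:] for row in rows]
        preds = [predict(row) for row in modified]
        curve.append(sum(preds) / len(preds))
    return curve


def toy_model(row):
    """Hypothetical risk score: rises with white blood cells, falls with platelets."""
    wbc, platelets = row
    score = 0.3 + 0.03 * (wbc - 6.0) - 0.001 * (platelets - 250.0)
    return min(1.0, max(0.0, score))


# Each row is [white blood cells in Gpt/l, platelets in Gpt/l]; values are made up.
rows = [[5.0, 300.0], [12.0, 150.0], [8.0, 220.0]]
platelet_grid = [50.0, 150.0, 250.0, 350.0]

curve = partial_dependence(toy_model, rows, feature_index=1, grid=platelet_grid)
for value, mean_pred in zip(platelet_grid, curve):
    print(value, round(mean_pred, 3))
```

A curve that falls as the platelet value rises corresponds to the "negative gradient" described in the response: the model assigns lower risk to higher platelet counts, averaged over all other feature values in the data.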

#R3_8: • Discussion

The discussion section does a good job of addressing the limitations of the study, including bias in the dataset and computational challenges. However, the paper would benefit from a more detailed analysis of the computational complexity of GNNs, which are known to be resource-intensive. Were any optimizations considered, such as using more efficient GNN architectures or limiting the number of layers to reduce computational overhead?

Answer #R3_8: We added a description for the general time complexity of GNNs (“In general, the computational complexity of a single convolution layer of a GNN is O(VFF’ + EF’), where V represents the number of nodes in the graph, E the number of edges in the graph, F the number of input features and F’ the hidden dimension or output features for the convolution [8,43]. However, the computational complexity differs across different architectures, depending on factors like the number of attention heads (e.g., in GAT and GATv2), and used multi-layer perceptrons (e.g., in GIN).”, see #R3_8). The number of layers was already quite low with only two layers. One layer was evaluated in initial experiments but showed a significantly lower classification performance. In terms of reducing computational overhead, as stated above we are currently working on a new graph learning framework that reduces computational complexity compared to Graph Neural Networks (GNNs) while significantly enhancing interpretability of results.
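The two terms of O(VFF’ + EF’) can be made concrete with a small operation count. The sketch below only evaluates the expression from the response for given sizes; it ignores constant factors and architecture-specific costs such as attention heads, and the graph sizes used are hypothetical rather than taken from the study.

```python
def gnn_layer_cost(num_nodes, num_edges, in_features, out_features):
    """Approximate operation count of one graph convolution layer:
    V*F*F' for the per-node feature transformation plus
    E*F' for aggregating transformed features along the edges."""
    transform = num_nodes * in_features * out_features
    aggregate = num_edges * out_features
    return transform + aggregate


# Hypothetical sizes: 1,000 nodes, 5,000 edges, 7 input features, 128 hidden units.
base = gnn_layer_cost(1_000, 5_000, 7, 128)
denser = gnn_layer_cost(1_000, 10_000, 7, 128)  # same graph with twice as many edges

print(base)           # 1536000
print(denser - base)  # 640000, only the E*F' term grows with the edge count
```

This shows why the edge count of a similarity graph matters for runtime: adding edges leaves the V·F·F’ term unchanged and increases the cost linearly through the E·F’ term.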

• Data and Code Availability

The transparency in making the code and data publicly available on GitHub a

Attachment

Submitted filename: ResponseToReviewers_RH.docx

pone.0327636.s010.docx (44.4KB, docx)

Decision Letter 2

Qiang He

Edges are all you need: Potential of Medical Time Series Analysis on Complete Blood Count Data with Graph Neural Networks.

PONE-D-24-06777R2

Dear Dr. Walke,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Qiang He

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

We acknowledge that the paper now addresses the criticism raised by the reviewers.

We gladly recommend acceptance.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #4: All comments have been addressed

Reviewer #5: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #4: Yes

Reviewer #5: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #4: Yes

Reviewer #5: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #4: Yes

Reviewer #5: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #4: Yes

Reviewer #5: Yes

**********

Reviewer #4: The authors revised the paper according to the reviewers' suggestions. Therefore, I recommend publication in its current form.

Reviewer #5: The authors have effectively addressed my initial concerns. The study’s originality is clarified by distinguishing its focus on routine CBC data in adults and providing a meaningful comparison to prior work. Clinical relevance is improved with supplementary material on the role of CBC parameters, and ethical considerations are addressed through clarification of anonymization and ethical approval. The abstract and introduction now provide stronger justification for using AUROC and GNNs for time-series data. Methodological justifications, including the choice of GraphSAGE and baseline algorithms, enhance the rigor of the experimental setup. Results are more interpretable with improved explanations of partial dependence plots, and the discussion addresses computational complexity and planned optimizations.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy

Reviewer #4: No

Reviewer #5: No

**********

Acceptance letter

Qiang He

PONE-D-24-06777R2

PLOS ONE

Dear Dr. Walke,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Qiang He

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Note. Potential relevance of complete blood count data to sepsis.

    (DOCX)

    pone.0327636.s001.docx (39.2KB, docx)
    S2 Fig. Distribution of each continuous feature in the train, internal and external test datasets as violin plots.

    (DOCX)

    pone.0327636.s002.docx (114.8KB, docx)
    S3 Table. Evaluating the robustness of GNNs compared to benchmarks by adding 10 or 100 noisy features to the complete blood count datasets for sepsis classification (higher values represent better performance) and the required training time.

    (DOCX)

    pone.0327636.s003.docx (20KB, docx)
    S4 Table. Reference values for blood parameters in complete blood count analysis.

    (DOCX)

    pone.0327636.s004.docx (31.1KB, docx)
    S5 Table. Attention weights of the last layer of the trained graph attention networks on each graph (directed graph, reverse directed graph and undirected graph).

    (DOCX)

    pone.0327636.s005.docx (17KB, docx)
    Attachment

    Submitted filename: Review-PlosOne.pdf

    pone.0327636.s006.pdf (73.9KB, pdf)
    Attachment

    Submitted filename: Response To Reviewers.docx

    pone.0327636.s009.docx (21.5KB, docx)
    Attachment

    Submitted filename: JNCA_D_24_00921_R1_reviewer.pdf

    pone.0327636.s008.pdf (51.8KB, pdf)
    Attachment

    Submitted filename: ResponseToReviewers_RH.docx

    pone.0327636.s010.docx (44.4KB, docx)

    Data Availability Statement

    All Jupyter Notebooks, Python files and datasets of the methodology developed and used in this study are available at https://github.com/danielwalke/SBCDataAnalysis. The dataset containing the complete blood counts is available on Zenodo at https://zenodo.org/records/6922968 (DOI: 10.5281/zenodo.10122491).


    Articles from PLOS One are provided here courtesy of PLOS
