. 2025 Aug 29;28(10):113417. doi: 10.1016/j.isci.2025.113417

Enhancing healthcare analytics with prompt based probabilistic graphical model

Yan Zhuang 1,5, Junyan Zhang 1,5, Shiyuan Liu 2,5, Bing Wei 3, Lei Zheng 3, Jianwei Gao 4, Zaijian Zeng 4, Juan Xu 4,, Kunlun He 1,6,∗∗
PMCID: PMC12547830  PMID: 41142988

Summary

Predicting clinical outcomes is essential for effective healthcare management. Electronic medical records (EMRs) contain rich temporal and relational structures, yet conventional models often struggle to capture these patterns with interpretability. This study proposes the prompt-based pre-trained graph model (PPGM), which combines graph neural networks with prompt learning in a two-stage framework: pre-training on patient graphs and fine-tuning with gated mechanisms for edges, nodes, and labels. By preserving the intrinsic relationships in EMRs, PPGM improves accuracy in predicting outcomes such as mortality and length of stay, while enabling transparent, interpretable reasoning. The approach enhances the integration of structured medical knowledge with machine learning, offering a scalable framework for data-driven clinical decision-making across diverse healthcare settings.

Subject areas: Health sciences, Artificial intelligence, Machine learning

Graphical abstract


Highlights

  • Prompt-based PPGM integrates graph structures for enhanced medical prediction

  • Sorted ICD codes and learnable positional embeddings capture diagnosis relationships

  • Edge, label, and node prompts significantly boost model interpretability

  • Outperforms state-of-the-art in mortality and hospitalization duration prediction



Introduction

Integrating electronic medical record (EMR) data into predictive analytics is critical for enhancing clinical decision-making systems. The extensive proliferation of EMR datasets has enabled researchers to apply deep learning techniques for predicting diagnoses,1 mortality rates,2 and therapy outcomes.2 However, EMR data typically consist of records from each patient visit, encompassing various aspects of a patient’s medical history, such as diagnosis and treatment information. This information is usually represented in a flattened manner within EMRs, despite the inherent structured relationships in the data. For instance, treatments are generally prescribed based on diagnoses, and there is a sequential relationship between different visits. Presently, most deep learning models for diagnosis do not fully leverage these structural and sequential dependencies.

To leverage the sequential information across different visits in EMR data, deep learning models often employ recurrent neural networks (RNNs), such as long short-term memory (LSTM) networks3,4 and gated recurrent units (GRUs),5 to handle the temporal dynamics.6,7 Rajkomar et al.8 combined LSTMs with feedforward networks and time-based decision stumps for predicting critical medical metrics. Despite their effectiveness in capturing temporal features, these methods often fall short in fully analyzing the multifaceted nature of medical data. Medical records not only contain time-series information but also encompass complex relationships among patients, diseases, and treatments. Consequently, simple sequential models are inadequate for uncovering these intricate interactions. A more sophisticated architecture is therefore required to address the complex relational structures within medical data, beyond basic sequential approaches. This advancement is crucial for accurately reflecting real-world clinical interactions and enhancing the precision of diagnostic and treatment recommendations.

To fully exploit relationships within medical records, recent advancements highlight the transformer’s role9 in modeling sequential data with attention mechanisms. Models like HiTANet10 and RAPT11 enhance predictive accuracy but, similar to other transformer-based approaches,12,13 often lack the explainability required for clinical use.

To better learn the intrinsic relationships of different features in medical records for adapting to a wide range of downstream tasks, self-supervised frameworks, such as BEHRT14 and Hi-BEHRT,15 along with multi-task learning,16 enhance model performance without extensive labeling yet struggle to provide transparent decision-making processes. Large pre-trained models17,18 also excel in performance but face similar challenges in explainability.

Graph neural networks (GNNs), in contrast, are well-suited for modeling the hierarchical and relational structures present in EMR data. Homogeneous GNNs such as GraphSAGE,19 graph convolutional networks (GCNs),20 and graph attention networks (GAT)21 have demonstrated strong performance on graph-structured data. However, these methods are limited by their reliance on single-type nodes, which restricts their applicability to complex, heterogeneous medical graphs. Later works such as HSGNN22 and a multi-view framework23 extend these modeling capabilities through subgraph construction and multi-view learning, respectively, offering improved representation of complex medical relationships.

Building upon this, GCT24 introduces a graph convolutional network-based approach that automatically learns latent patient visit structures from EMR data. The method constructs a heterogeneous graph comprising multiple types of medical entities, such as diagnoses and treatments. Notably, GCT incorporates conditional probability distributions between diagnoses and treatments—derived from historical data—as prior knowledge, which is explicitly integrated into the graph structure. This allows the model to better capture clinically meaningful relationships during message passing.
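The co-occurrence-derived prior that GCT integrates can be sketched as a row-normalized count matrix. A minimal numpy sketch, assuming a simple row-normalization scheme; the counts and code names below are illustrative, not taken from the paper:

```python
import numpy as np

# Hypothetical co-occurrence counts: rows = diagnoses, cols = treatments.
# counts[i, j] = number of historical visits in which diagnosis i and
# treatment j appear together.
counts = np.array([
    [30.0, 5.0, 0.0],   # e.g., a respiratory diagnosis
    [2.0, 20.0, 8.0],   # e.g., a cardiac diagnosis
])

def conditional_prior(counts, eps=1e-8):
    """Row-normalize co-occurrence counts into P(treatment | diagnosis)."""
    return counts / (counts.sum(axis=1, keepdims=True) + eps)

P = conditional_prior(counts)
# Each row is now a valid conditional distribution over treatments and
# can be injected into the graph as prior edge weights.
```

In this reading, the prior matrix steers message passing toward diagnosis-treatment pairs that co-occur frequently in historical data.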

Unlike traditional graph structures, CACHE25 models each hospitalization as a hyperedge that includes multiple nodes representing diagnoses, treatments, and medications. This design better captures high-order interactions among medical entities. The framework combines hypergraph neural networks with counterfactual reasoning to simulate patient state transitions under different intervention strategies, enhancing the understanding of potential causal pathways and offering more insightful clinical interpretations.

GraphCare,26 on the other hand, integrates large language models (LLMs) with biomedical knowledge graphs. It first extracts medical knowledge from EMR data using a pre-trained language model to construct personalized, knowledge-enhanced patient graphs, which are then processed using GNNs and attention mechanisms for outcome prediction. While this approach improves predictive performance, it heavily relies on the accuracy of the generated knowledge.

Building upon these efforts, this work extends existing GNN-based methods by explicitly incorporating statistical prior knowledge—such as co-occurrence patterns between diagnoses and treatments—into the graph structure. Without introducing complex architectural components, the proposed method enhances both model expressiveness and interpretability.

We have developed an enhanced GCT model designed to integrate patient electronic health record (EHR) information and simultaneously perform multiple tasks, including predicting patient diagnoses, hospital stay durations, and mortality rates. This improved model is capable of effectively capturing and clearly presenting the structural relationships within patient EMR data. The initialization of the probability matrix incorporates a priori medical knowledge, integrating medical logic with data-driven models. For automatic ICD (International Classification of Diseases, Ninth Revision) coding, we employ two innovative approaches. In the pre-training phase, we sort ICD codes and use learnable positional embeddings in each encounter record to capture more relationships between diagnosis nodes. In the downstream task prediction phase, we utilize prompt learning to further improve predictive performance. This involves the pre-training of a BERT model on an extensive dataset comprising EMRs, thereby achieving standardization of data before modeling. Experiments were conducted on the public eICU dataset as well as a proprietary dataset to evaluate the model’s effectiveness in multi-task prediction. Comparisons with other state-of-the-art methods were made to demonstrate the model’s superiority.

The main contributions of this paper are summarized as follows.

  • (1)

    We have effectively integrated the structure of patient EHR information using a knowledge graph based on clinical data, incorporating a priori medical knowledge for foundation models.

  • (2)

    We have designed and constructed a unified foundation model based on graph embeddings, allowing for a comprehensive representation of clinical data, capturing the intricate relationships, and temporal dynamics inherent in EMR.

  • (3)

    The model is capable of supporting a variety of downstream medical tasks simultaneously, providing a robust multi-task framework.

  • (4)

    During pre-training, we use sorted ICD codes and learnable positional embeddings in the diagnostic mask prediction task. Sorting ICD codes numerically captures their inherent hierarchical structure, while learnable positional embeddings adaptively weigh interactions between diagnosis nodes. This enhances the extraction of internal hierarchical information and relationships among diagnoses.

  • (5)

    In downstream tasks, we employ prompt fine-tuning using edge prompts, label prompts, and code prompts to enrich node relationships. These prompts significantly improve the model’s understanding of node interactions and enhance the model’s interpretability. By establishing virtual nodes based on semantic similarity, they effectively capture initial condition matrix relationships and assign weights to edges. Furthermore, capturing node labels enhances the distinction between different node types. This approach significantly improves the model’s predictive performance in downstream tasks.

Results

Pre-training and fine-tuning a PPGM

PPGM leverages a pre-training strategy that encapsulates patient visits as graphs, with medical entities as nodes and their relationships as edges, depicted in (Figure 1). In this framework, sorted ICD codes and learnable positional encoding (PE) are integrated to enhance the pre-training task performance by better capturing the relationships within the adjacency matrix. This structure facilitates the understanding of complex medical interactions through an adjacency matrix. The model employs a graph convolutional transformer (GCT) with a gating mechanism, enhancing information flow and attention-based learning. As illustrated in (Figure 1), the EMR system for hospitalized patients provides a chronological overview of care from admission to discharge. The upper panel outlines key documentation milestones, from Day 0 through to Day n + 1, capturing the evolution of medical assessments and notes.

Figure 1.

Figure 1

Electronic medical records as graph

The lower panel of (Figure 1) highlights the integration of essential medical elements—diagnostics, treatments, medications, and consultations—demonstrating their interconnectedness throughout the hospitalization period. This cohesive record supports continuous care coordination. This structured visualization underscores the EMR’s critical role in streamlining healthcare delivery and supporting detailed retrospective analysis for quality improvement.

Our approach refines the initial conditional probability matrix via dynamic adjustments, introducing regularization to maintain consistency between learned relationships and observed co-occurrences. To tailor the model to specific downstream tasks, we have introduced prompt tuning, enriching the model with medical semantics through edge, label, and code prompts. Moreover, the incorporation of adaptive dynamic routing (ADR) from capsule networks provides a dynamic pathway for information, optimizing feature prioritization and pattern recognition within the model. We pre-trained the model on the eICU dataset’s EHRs using the masked diagnosis prediction (MDP) task, incorporating sorted ICD codes and learnable PE to better capture relationships within a foundational medical entity graph. The model then performs prompt tuning on this foundation graph for specific downstream tasks, allowing for fine-tuning of the graph structure to better capture node interactions and improve task-specific performance. This is accomplished through edge prompts, label prompts, and code prompts, which leverage pre-trained embeddings and semantic similarity to enrich the initial conditional probability matrix.
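The consistency regularization between learned relationships and observed co-occurrences can be sketched as a penalty added to the task loss. This is a minimal sketch assuming a squared-Frobenius penalty; the exact form used by PPGM may differ:

```python
import numpy as np

def prior_consistency_penalty(A_learned, A_prior, lam=0.5):
    """Squared Frobenius distance between the learned adjacency
    (attention) matrix and the co-occurrence prior, scaled by lam.
    Adding this to the task loss keeps learned edges close to
    observed co-occurrence statistics."""
    return lam * np.sum((A_learned - A_prior) ** 2)

# Illustrative 2x2 matrices (not values from the paper).
A_prior = np.array([[0.9, 0.1],
                    [0.3, 0.7]])
A_learned = np.array([[0.8, 0.2],
                      [0.4, 0.6]])

penalty = prior_consistency_penalty(A_learned, A_prior, lam=0.5)
# total_loss = task_loss + penalty, where task_loss comes from the
# MDP objective or a downstream prediction head.
```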

The overall framework is illustrated in (Figure 2). The overall framework of our approach is divided into three stages: preliminary graph construction and embedding initialization, prompt-enhanced model training, and predictive applications.

Figure 2.

Figure 2

The framework of PPGM

We integrate EMRs with expert knowledge to construct a preliminary graph structure that captures the nuanced relationships within medical data. The code initialization module subsequently generates embeddings that reflect the semantic and structural nuances of EMR data. Through graph structure initialization, we develop a prior-guided probability matrix, which plays a crucial role in calculating attention scores and guides the model toward focusing on key features, as depicted in (Figure 2-I). This phase enriches our model with expert insights through label prompts, node prompts, and edge prompts. Label prompts offer detailed descriptions of clinical labels informed by expert knowledge, while node prompts introduce new nodes pertinent to various hierarchical levels. Edge prompts establish connections between these nodes across different levels. These enhancements expedite training convergence. A dynamic gating mechanism adjusts the influence of distinct attention mechanisms, ensuring efficient incorporation of expert knowledge into the model’s learning process, illustrated in (Figure 2-II). Finally, the prompt-tuned model is applied in the predictive stage to specialized tasks such as mortality prediction, diagnosis prediction, and hospitalization duration prediction.
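One plausible reading of the edge-prompt construction is a similarity-thresholded edge weight computed over pre-trained code embeddings. The embeddings, code names, and threshold below are hypothetical, chosen only to illustrate the mechanism:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical pre-trained embeddings for two medical codes and a label.
emb = {
    "diag:428.0": np.array([0.9, 0.1, 0.2]),       # illustrative diagnosis
    "proc:diuretic": np.array([0.8, 0.2, 0.1]),    # illustrative treatment
    "label:mortality": np.array([0.1, 0.9, 0.3]),  # illustrative label node
}

def edge_prompt_weight(a, b, threshold=0.5):
    """Assign a prompt-edge weight from semantic similarity of
    pre-trained embeddings; below the threshold, no edge is added."""
    s = cosine(emb[a], emb[b])
    return s if s > threshold else 0.0

w = edge_prompt_weight("diag:428.0", "proc:diuretic")
# Semantically close codes receive a strong prompt edge; distant
# pairs (e.g., the diagnosis vs. the label node here) receive none.
```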

PPGM output foundation graph while predicting the masked diagnosis

Because MDP serves as our pre-training task, its results were obtained without prompt fine-tuning. We used the Philips eICU collaborative research dataset27 to test PPGM on real-world EHR records. To more effectively capture the relationships between diagnoses within each medical record, we systematically sorted the ICD codes and incorporated learnable positional embeddings.

The ICD coding system focuses on classifying and encoding diseases and health conditions, providing comprehensive disease information. Diagnoses, being standardized and relatively static, are well-suited for prediction tasks, facilitating the identification of potential health issues and supporting clinical decisions. In contrast, treatment methods are highly variable, depending on specific diagnoses, patient characteristics (e.g., age, gender, and comorbidities), physician judgment, and institutional resources. Treatments, encompassing drug therapies, surgeries, and physical therapies, evolve over time and are difficult to encapsulate within a fixed coding system. Given these complexities, our prediction task is limited to MDP, leveraging the structured nature of diagnostic data.
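The sorting step can be sketched as follows for purely numeric ICD-9 codes; the embedding table is a numpy stand-in for what is a learnable parameter in the actual model, and the codes are illustrative:

```python
import numpy as np

codes = ["428.0", "038.9", "250.00", "401.9"]

def sort_icd9(codes):
    """Numeric sort for purely numeric ICD-9 codes; V- and E-codes
    would need a separate ordering rule."""
    return sorted(codes, key=float)

sorted_codes = sort_icd9(codes)
# Codes from the same chapter (e.g., the 390-459 circulatory range)
# now sit next to each other, exposing the hierarchy positionally.

rng = np.random.default_rng(0)
# Stand-in for the learnable positional-embedding table: one row per
# position in the sorted sequence (trained by backpropagation in PPGM).
pos_table = rng.normal(size=(len(sorted_codes), 8))
code_pos_embedding = {c: pos_table[i] for i, c in enumerate(sorted_codes)}
```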

The eICU dataset consists of intensive care unit (ICU) records collected from multiple sites in the United States between 2014 and 2015. From the encounter records, medication orders, and procedure orders, we extracted diagnosis codes and treatment codes (i.e., medication and procedure codes). Similar to GCT,24 we did not use lab results.

We performed 5-fold cross-validation on the eICU dataset for the MDP task, and the results are shown in (Table 1). Due to the inclusion of the gating mechanism and ADR mechanism, the task performance has significantly improved compared to the original GCT.

Table 1.

Performance comparison in masked diagnosis prediction task on the eICU Dataset

Model Accuracy Loss
GCT 17.13% 1126.7
CACHE 2.31% 10.82
Graphcare 20.15% 4.79
PPGM 32.42% 2.73

Bold entries represent the best-performing configuration in each comparison.

Note that our model is capable of simultaneously accomplishing the pre-training task and generating a medical entity graph as depicted in (Figure 1). This entity graph illustrates the common medical knowledge that the model has learned, which is applicable for downstream task predictions. By examining the graph, we can discern the model’s acquisition of generalizable medical insights, which are crucial for enhancing the predictive capabilities of the model in subsequent clinical applications.

Sorted ICD codes and learnable positional embeddings enhanced the performance of masked diagnosis prediction

Table 2 evaluates the impact of various components on the model’s performance in the MDP task using real-world EMR records. Compared to the GCT baseline, PPGM demonstrates significant improvements in both loss and accuracy. The loss decreases from 5.82 to 1.41, and accuracy increases from 23.74% to 39.59%, highlighting the effectiveness of PPGM in this task.

Table 2.

Results of PPGM in masked diagnosis prediction task on real-world EMR records

metrics GCT PPGM w/o Sorted ICD w/o Learnable PE
Loss 5.82 1.41 2.06 2.13
Accuracy 23.74% 39.59% 28.53% 37.54%

Bold entries represent the best-performing configuration in each comparison.

When Sorted ICD codes are not utilized, there is a notable increase in loss to 2.06 and a decrease in accuracy to 28.53%. This indicates that sorted ICD codes are important for reducing loss and improving accuracy by effectively capturing the relationships between different diagnoses within each encounter record.

Removing the learnable positional embeddings results in an even higher loss of 2.13 and a drop in accuracy to 37.54%. This suggests that learnable positional embeddings play a crucial role in maintaining lower loss and achieving higher accuracy, underscoring their importance for the model’s performance.

PPGM outperforms other models in mortality and hospitalization duration prediction

Based on the foundation graph obtained from pre-training, we utilized three prompt fine-tuning methods to conduct experiments on mortality prediction and hospitalization duration prediction tasks using the eICU and real-world datasets. To ensure a fair comparison across all experiments, the same cross-entropy loss function was used consistently throughout all model configurations. This allows the observed performance differences to be attributed to architectural and design improvements, rather than variations in the training objective. We compared the performance with other models, and the results demonstrate that the PPGM outperforms other models in both tasks. All evaluation metrics are computed using standard implementations from scikit-learn. For the mortality prediction task, AUPRC and AUROC are calculated based on the precision-recall and ROC curves, respectively. Precision, recall, and F1 score are reported using a weighted average to account for class imbalance in the dataset. For the hospitalization duration prediction task, accuracy refers to the proportion of correctly predicted length-of-stay categories, while loss corresponds to the mean squared error (MSE) between predicted and actual durations. All metrics are reported with two decimal places for consistency.
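The metric pipeline described above can be sketched with the named scikit-learn implementations on toy data; the labels and scores here are illustrative, not from the paper:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, roc_auc_score,
                             precision_recall_fscore_support)

# Toy predictions for a binary mortality task (1 = in-hospital death),
# deliberately imbalanced like the eICU cohort.
y_true = np.array([0, 0, 1, 1, 0, 0, 0, 0, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.1, 0.3, 0.15, 0.05, 0.9])
y_pred = (scores >= 0.5).astype(int)

auprc = average_precision_score(y_true, scores)  # area under PR curve
auroc = roc_auc_score(y_true, scores)            # area under ROC curve
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
```

With `average="weighted"`, each class's precision/recall/F1 is weighted by its support, which is how the paper accounts for class imbalance.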

In the mortality prediction task on the eICU dataset, PPGM demonstrates superior performance compared to other models as shown in (Table 3). Specifically, PPGM achieves an AUPRC of 63.02%, notably higher than GCT at 53.24%, CACHE at 61.33%, and GraphCare at 16.72%. Its AUROC score of 62.98% also outperforms GCT at 51.43% and GraphCare at 60.32%. However, PPGM’s AUROC is lower than that of CACHE, which reports 83.91%. To better understand the characteristics of the dataset, we first examine the distribution of length of stay in (Figure 3A) and the top 10 diagnostic categories in (Figure 3B). These visualizations reveal the heterogeneity in patient trajectories and common clinical conditions. The performance discrepancy can be further attributed to the extreme class imbalance in the dataset, as illustrated in (Figure 3C), where only 9% of patients experienced in-hospital mortality.

Table 3.

Performance comparison in mortality prediction task on eICU dataset

Model AUPRC AUROC Precision Recall F1 score Accuracy
GCT 53.24% 51.43% 79.45% 75.38% 78.56% 78.56%
CACHE 61.33% 83.91% 71.92% 65.31% 83.21% 83.21%
GraphCare 16.72% 60.32% 54.32% 61.72% 85.63% 85.63%
PPGM 63.02% 62.98% 93.01% 94.22% 93.38% 95.30%

Bold entries represent the best-performing configuration in each comparison.

Figure 3.

Figure 3

Distribution of eICU dataset

(A) Distribution of patient hospitalization duration in the eICU dataset across the following intervals: 0–3 days, 4–7 days, 8–14 days, 15–30 days, and over 30 days.

(B) Distribution of the top 10 most frequent diagnoses in the eICU dataset.

(C) Distribution of discharge status (survival vs. death) in the eICU dataset.

In such imbalanced settings, AUROC may give an over-optimistic estimate and fail to reflect the model’s real-world performance—particularly when identifying high-risk cases is of primary clinical interest. To improve transparency and better align with the clinical objectives, we emphasize metrics that are more informative under such skewed distributions, including AUPRC, precision, recall, and the F1-score.

Notably, these metrics (precision, recall, and F1-score) were computed using a weighted average approach that accounts for class imbalance, further ensuring reliable performance evaluation across both majority and minority classes. As shown in (Table 3), PPGM achieves high precision and recall of 93.01% and 94.22%, respectively, leading to an F1-score of 93.38%, significantly outperforming all other models. These results indicate that PPGM effectively identifies high-risk cases with minimal false negatives and false positives, making it well-suited for clinical applications where accurate detection of critical outcomes is essential.

For the hospitalization duration prediction task on the eICU dataset (Table 4), PPGM again demonstrates superior performance compared to other models. Specifically, PPGM achieves an accuracy of 64.68%, significantly higher than GCT at 56.41%, CACHE at 57.90%, and GraphCare at 59.86%. The loss for PPGM is 1.42, notably lower than GCT at 58.26, CACHE at 4.26, and GraphCare at 2.45. This suggests that PPGM provides more precise estimates of hospital stay lengths, which is valuable for patient flow management and operational planning in healthcare institutions.

Table 4.

Performance comparison in hospitalization duration prediction task on eICU dataset

Model Accuracy Loss
GCT 56.41% 58.26
CACHE 57.90% 4.26
GraphCare 59.86% 2.45
PPGM 64.68% 1.42

Bold entries represent the best-performing configuration in each comparison.

The gating mechanism and prompt tuning helped improve the model’s performance

The gating mechanism employed in this model architecture serves to dynamically weigh the contributions of two different attention mechanisms, enhancing the model’s ability to focus on relevant information. By applying an average pooling operation over the hidden states and passing the result through a linear layer, the model generates gate scores that determine the relative importance of each attention mechanism. These scores are then normalized using a softmax function to ensure they represent a proper distribution of weights. During the computation, the attention probabilities are adjusted by these gate scores, allowing for a weighted combination of the standard attention output and a prior attention output. This adaptive weighting enables the model to better capture nuanced relationships within the data, improving its overall performance and adaptability.
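The gating computation described above can be sketched in numpy, with illustrative shapes and randomly initialized weights standing in for the trained linear layer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_attention_mix(hidden, attn_std, attn_prior, W, b):
    """Mean-pool the hidden states, map the pooled vector to two gate
    logits with a linear layer, softmax them into a proper weight
    distribution, and mix the standard and prior-guided attention
    outputs by those weights."""
    pooled = hidden.mean(axis=0)       # average pooling over nodes
    gates = softmax(W @ pooled + b)    # two gate scores summing to 1
    mixed = gates[0] * attn_std + gates[1] * attn_prior
    return mixed, gates

rng = np.random.default_rng(1)
hidden = rng.normal(size=(5, 4))       # 5 graph nodes, hidden size 4
attn_std = rng.normal(size=(5, 4))     # standard attention output
attn_prior = rng.normal(size=(5, 4))   # prior-guided attention output
W, b = rng.normal(size=(2, 4)), np.zeros(2)  # stand-in linear layer

mixed, gates = gated_attention_mix(hidden, attn_std, attn_prior, W, b)
```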

The dataset comprises real-world hospitalization records, with key characteristics illustrated in (Figure 4) and detailed in the STAR Methods.

Figure 4.

Figure 4

Distribution of real-world dataset

(A) Distribution of patient hospitalization duration in the real-world dataset across the following intervals: 0–3 days, 4–7 days, 8–14 days, 15–30 days, and over 30 days.

(B) Distribution of the top 10 most frequent diagnoses in the real-world dataset.

(C) Distribution of discharge status (survival vs. death) in the real-world dataset.

Table 5 presents a comprehensive performance evaluation of different models on the mortality prediction task using real-world EMRs. The metrics considered include loss, area under the precision-recall curve (AUPRC), area under the receiver operating characteristic curve (AUROC), precision, recall, and F1 score.

Table 5.

PPGM mortality prediction on real-world EMR data and ablation analysis of gates and prompts

metrics GCT PPGM w/o gate w/o pr
Loss 36.85 3.71 4.63 33.44
AUPRC 59.54% 64.12% 60.90% 62.95%
AUROC 80.43% 83.13% 80.81% 81.37%
Precision 93.81% 94.66% 93.90% 94.34%
Recall 92.33% 93.85% 92.69% 93.32%
F1 score 92.87% 94.18% 92.93% 93.86%
Accuracy 91.51% 96.17% 92.01% 95.54%

Bold entries represent the best-performing configuration in each comparison.

The performance of PPGM for mortality prediction on real-world EMR data, along with ablation analysis of the gating mechanism and prompts, is summarized as follows. Compared to GCT, the PPGM model demonstrates superior performance across all evaluation metrics, especially in terms of loss value, AUPRC, and AUROC. Specifically, the PPGM model achieves a loss value of 3.71, significantly lower than GCT’s 36.85. Moreover, the AUPRC reaches 64.12%, outperforming GCT’s 59.54%. The AUROC also improves to 83.13%, compared to GCT’s 80.43%. In terms of precision, recall, and F1 score, PPGM performs better with values of 94.66%, 93.85%, and 94.18%, respectively. Additionally, the accuracy of PPGM reaches 96.17%, significantly higher than GCT’s 91.51%.

The ablation studies, represented by “w/o gate” and “w/o pr”, show that removing these components leads to a decline in performance, although both ablated variants still outperform GCT. This underscores the importance of the gating mechanism and the prompt component in enhancing the predictive power of PPGM.

As shown in (Figure 5), the PPGM configuration exhibits the lowest loss throughout the training process, highlighting its inherent advantages and demonstrating robustness and efficiency. In comparison, the w/o pr setup starts with a higher loss but shows a steady decrease, indicating initial challenges followed by consistent improvement. The GCT configuration begins with high performance, similar to w/o pr, but maintains a relatively stable loss after an initial drop, suggesting it reaches a point of minimal further improvement. Conversely, the w/o gate configuration starts with the highest loss among all configurations and decreases rapidly at first, but then the reduction slows down significantly, indicating rapid early gains but limited long-term improvement potential. Overall, the trends emphasize PPGM’s effectiveness in minimizing loss and maintaining optimal performance levels, while also illustrating the varying dynamics and adaptation capabilities of each model configuration.

Figure 5.

Figure 5

Loss of hospitalization duration prediction task with different settings

Table 6 summarizes the performance comparison of different models in the task of predicting hospitalization duration using real-world EMRs. The metrics evaluated include loss and accuracy. The models compared are GCT, PPGM, a variant without the gate mechanism (w/o gate), and another without the prompt component (w/o Pr).

Table 6.

PPGM hospitalization duration prediction and ablation of gates/prompts on real-world EMR data

metrics GCT PPGM w/o gate w/o Pr
Loss 34.73 1.93 4.16 34.71
Accuracy 49.78% 56.96% 55.26% 50.66%

Bold entries represent the best-performing configuration in each comparison.

PPGM demonstrates superior performance with the lowest loss value of 1.93 and the highest accuracy of 56.96%. In contrast, the GCT model achieves an accuracy of 49.78% with a far higher loss of 34.73. Models lacking specific components, such as the gate or Pr mechanism, show diminished performance, particularly in accuracy, suggesting that these elements contribute significantly to the effectiveness of the PPGM model. Specifically, the absence of the gate mechanism results in a noticeable drop in accuracy to 55.26%, while removing the Pr component leads to a more substantial decrease in accuracy to 50.66%, along with a significant increase in loss to 34.71. This indicates that both the gate mechanism and the Pr component are crucial for optimizing the predictive power of the model.

Different prompt tuning helped improve the model’s performance

In this section, we conduct ablation studies to evaluate the contribution of different components of our model to its overall performance.

In (Table 7), compared to the model without the edge prompt, PPGM exhibits a decrease in performance when the edge prompt is removed. Specifically, AUPRC drops from 64.12% to 60.35%, AUROC decreases from 83.13% to 81.71%, Precision falls from 94.66% to 93.93%, Recall declines from 93.85% to 92.45%, and F1 score decreases from 94.18% to 93.03%. This indicates that the edge prompt plays a critical role in enhancing the model’s predictive capability by incorporating pre-trained semantic information of medical concepts and refining the initial conditional probability matrix.

Table 7.

PPGM mortality prediction on real-world EMR data and ablation of different prompts

metrics GCT PPGM w/o Edge Pr w/o Label Pr w/o Node Pr
Loss 36.85 3.71 0.41 35.18 1.25
AUPRC 59.54% 64.12% 60.35% 61.12% 60.02%
AUROC 80.43% 83.13% 81.71% 79.57% 81.20%
Precision 93.81% 94.66% 93.93% 90.82% 93.87%
Recall 92.33% 93.85% 92.45% 93.65% 92.48%
F1 score 92.87% 94.18% 93.03% 93.82% 93.03%
Accuracy 91.51% 96.25% 94.19% 95.48% 93.90%

Bold entries represent the best-performing configuration in each comparison.

When comparing PPGM to the model without the label prompt, we observe a drop in AUPRC from 64.12% to 61.12%, a more significant decline in AUROC from 83.13% to 79.57%, and a substantial decrease in precision from 94.66% to 90.82%. Recall remains relatively stable at 93.65%, while the F1 score decreases from 94.18% to 93.82%. The notable degradation in AUROC and precision highlights the importance of the label prompt in modeling diagnostic labels and reinforcing connections across relevant nodes in the graph.

In comparison to the model without the node prompt, removing this component results in a reduction in AUPRC from 64.12% to 60.02%, AUROC from 83.13% to 81.20%, precision from 94.66% to 93.87%, Recall from 93.85% to 92.48%, and F1 score from 94.18% to 93.03%. These results suggest that node prompts, which introduce virtual nodes between existing concept pairs, significantly contribute to capturing complex interactions and improving the model’s structure learning ability.

Table 8 provides a concise comparison of performance metrics for hospitalization duration prediction using different models on real-world EMR data. The configurations include the graph convolutional transformer (GCT), the prompt-based probabilistic graphical model (PPGM), and PPGM variants without the edge prompt (w/o Edge Pr), label prompt (w/o Label Pr), or node prompt (w/o Node Pr).

Table 8.

PPGM hospitalization duration prediction on real-world EMR data and ablation of different prompts

| Metrics | GCT | PPGM | w/o Edge Pr | w/o Label Pr | w/o Node Pr |
| --- | --- | --- | --- | --- | --- |
| Loss | 5.47 | 1.93 | 3.67 | 5.23 | 2.40 |
| Accuracy | 48.22% | 56.96% | 55.78% | 56.02% | 56.14% |

Bold entries represent the best-performing configuration in each comparison.

Results show that PPGM achieves the best performance, with the lowest loss (1.93) and highest accuracy (56.96%) among all models. When individual prompt components are removed, performance degrades to varying degrees: removing the edge prompt increases the loss to 3.67 and drops accuracy to 55.78%; removing the label prompt results in a loss of 5.23 and accuracy of 56.02%; and removing the node prompt increases the loss to 2.40 while maintaining a relatively high accuracy of 56.14%. This highlights the importance of edge, label, and node prompts in enhancing model accuracy and reducing loss. Despite performance drops when any prompt is removed, as indicated by higher losses and lower accuracies compared to the full PPGM, the model still outperforms GCT even in ablated settings, underlining its robustness and effectiveness in predicting hospitalization durations.

According to the results, each of the three prompt components (edge, label, and node prompts) contributes uniquely and significantly to the model's overall performance. Edge prompts have the most substantial impact, as their removal leads to a sharp increase in loss and the largest drops in accuracy, highlighting their critical role in capturing relational information among medical entities. Label prompts also play an important role, with moderate performance degradation observed when they are removed, indicating their effectiveness in aligning model outputs with task-specific semantics. Node prompts contribute more subtly, supporting the refinement of individual concept representations. The combined integration of these components enables more accurate and reliable predictions, and removing any one of them results in a notable decline in performance.

Comprehensive ablation of key structural components

We conduct comprehensive ablation studies to evaluate the impact of key structural components in PPGM, including sorted ICD codes, the gating mechanism, prompt tuning, and positional embedding (PE) strategies. Tables 9 and 10 summarize the performance of PPGM and its ablated variants on two critical clinical tasks: mortality prediction and hospitalization duration prediction.

In both tasks, the full PPGM model achieves the best performance, demonstrating the effectiveness of the integrated design. Removing both sorted ICD codes and the gate leads to a significant drop in all metrics, highlighting their role in capturing sequential dependencies. Disabling prompt tuning results in severe performance degradation, particularly in AUPRC and AUROC, indicating the importance of learnable task-specific adaptation. Replacing learned positional embeddings with fixed sinusoidal encoding causes moderate declines, suggesting that adaptive positional information better supports modeling of clinical sequences.

Table 9 shows the results of combined ablations for mortality prediction on real-world EMR data. PPGM achieves the best overall performance, with a loss of 3.71 and the highest accuracy (96.17%). Removing sorted ICD together with the gate or the prompt leads to performance degradation, especially in loss and AUPRC. Using a fixed PE yields a lower loss but slightly reduces the F1 score, with minimal impact on overall performance.

Table 9.

Mortality prediction on real-world EMR under combined ablations

| Metrics | PPGM | w/o Sorted ICD + Gate | w/o Sorted ICD + Prompt | w/ Fixed PE |
| --- | --- | --- | --- | --- |
| Loss | 3.71 | 5.98 | 36.23 | 1.24 |
| AUPRC | 64.12% | 59.95% | 62.03% | 63.16% |
| AUROC | 83.13% | 80.52% | 80.98% | 82.60% |
| Precision | 94.66% | 94.01% | 93.88% | 94.29% |
| Recall | 93.85% | 92.52% | 92.67% | 93.45% |
| F1 score | 94.18% | 91.98% | 93.10% | 94.07% |
| Accuracy | 96.17% | 91.77% | 95.41% | 95.90% |

Bold entries represent the best-performing configuration in each comparison.

Table 10 presents the combined ablation study for hospitalization duration prediction. PPGM again achieves the lowest loss (1.93) and highest accuracy (56.96%). Removing sorted ICD along with the gate or prompt significantly reduces performance, confirming their importance in modeling temporal and structural information.

Table 10.

Hospitalization duration prediction on real-world EMR under combined ablations

| Metrics | PPGM | w/o Sorted ICD + Gate | w/o Sorted ICD + Prompt | w/ Fixed PE |
| --- | --- | --- | --- | --- |
| Loss | 1.93 | 4.71 | 35.09 | 3.89 |
| Accuracy | 56.96% | 52.21% | 54.88% | 55.62% |

Bold entries represent the best-performing configuration in each comparison.

To evaluate the contribution of the ADR mechanism and capsule design in our model, we perform ablation studies on both mortality prediction and hospitalization duration prediction using real-world EMR data.

In mortality prediction, removing the ADR module leads to a noticeable drop in AUPRC and AUROC, indicating its critical role in capturing dynamic patient representations. Reducing the capsule dimension from 32 to 8 also results in performance degradation, suggesting that higher-dimensional capsules better preserve expressive clinical semantics. Similar trends are observed in hospitalization duration prediction, where disabling ADR or reducing capsule size consistently degrades accuracy, further confirming the importance of these architectural choices in clinical outcome modeling.
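As a rough illustration of the routing-by-agreement computation that underlies capsule-based aggregation, the sketch below implements a generic dynamic routing loop in numpy. This is our own minimal reconstruction under assumed shapes (6 input capsules, 4 output capsules, dimension 32, 3 iterations, matching the hyperparameters reported later), not the paper's ADR module itself:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Capsule squash nonlinearity: keeps direction, maps the norm into [0, 1)."""
    sq = (s ** 2).sum(axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iters=3):
    """Routing by agreement over prediction vectors u_hat[in_caps, out_caps, dim]."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                              # routing logits
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum per output capsule
        v = squash(s)                                         # squashed output capsules
        b = b + (u_hat * v[None]).sum(axis=-1)                # increase logits where inputs agree
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(6, 4, 32)), n_iters=3)   # 4 output capsules of dim 32
assert v.shape == (4, 32)
```

The coupling coefficients play the role of iteratively refined attention weights: inputs whose predictions agree with an output capsule's current state receive larger weight on the next iteration.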

Table 11 presents the ablation study on ADR and capsule dimension in mortality prediction. PPGM achieves the best performance with ADR enabled and a capsule dimension of 32, yielding 64.12% AUPRC and 83.13% AUROC. Removing ADR or reducing the capsule dimension leads to performance degradation, confirming their importance.

Table 11.

ADR and capsule ablation in real-world EMR mortality prediction

| Setting | ADR Used | Capsule Dim | AUPRC | AUROC |
| --- | --- | --- | --- | --- |
| PPGM | TRUE | 32 | 64.12% | 83.13% |
| w/o ADR | FALSE | 32 | 62.01% | 82.01% |
| Capsule Dim = 8 | TRUE | 8 | 63.23% | 82.69% |
| Capsule Dim = 16 | TRUE | 16 | 63.98% | 82.91% |

Bold entries represent the best-performing configuration in each comparison.

Table 12 shows the ablation results for hospitalization duration prediction. PPGM achieves the highest accuracy with ADR and capsule dimension 32. Disabling ADR or using smaller capsule dimensions reduces accuracy, demonstrating their positive impact on predictive performance.

Table 12.

ADR and capsule ablation in real-world EMR hospitalization duration prediction

| Setting | ADR Used | Capsule Dim | Accuracy |
| --- | --- | --- | --- |
| PPGM | TRUE | 32 | 56.96% |
| w/o ADR | FALSE | 32 | 53.12% |
| Capsule Dim = 8 | TRUE | 8 | 54.03% |
| Capsule Dim = 16 | TRUE | 16 | 55.16% |

Bold entries represent the best-performing configuration in each comparison.

Attention scores across features

To exemplify the practical implications of attention scores within clinical settings, we examine a particular patient visit wherein an array of features was systematically analyzed to determine their relevance to the patient’s condition.

Figure 6 reveals feature contributions across clinical contexts, including a patient visit and two diagnoses: atrial fibrillation (427.31) and chronic systolic heart failure (428.43). For the diagnosis of atrial fibrillation, losartan achieves the highest attention score, indicating its dominant therapeutic significance. Similarly, in chronic systolic heart failure diagnosis, losartan exhibits maximal attention, showing the highest relevance in both cardiovascular conditions. Comparative analysis shows that nasal cannula, insulin, and other medications display context-dependent attention scores, reflecting secondary yet condition-specific relevance. Diagnostic codes and label prompts maintain consistent but lower attention values, suggesting auxiliary roles in clinical interpretation. These findings demonstrate that losartan serves as a key discriminative feature for managing atrial fibrillation and chronic systolic heart failure, with attention scores surpassing those of all other features. The hierarchical feature importance patterns highlight the necessity for structured integration of diagnostic codes and treatments in precision cardiovascular care.
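The per-feature scores behind such a heatmap are typically softmax-normalized relevance weights over a visit's features. The sketch below illustrates this normalization; the feature names echo those discussed above, but the raw score values are invented for illustration and are not the paper's actual attention outputs:

```python
import numpy as np

# Invented relevance logits for one visit's features (illustrative only).
features = ["losartan", "nasal cannula", "insulin", "dx:427.31", "label prompt"]
raw = np.array([2.1, 0.6, 0.4, 0.2, 0.1])

# Softmax normalization: scores become a distribution over features.
attn = np.exp(raw) / np.exp(raw).sum()
top_feature = features[int(attn.argmax())]
assert np.isclose(attn.sum(), 1.0)
```

With these toy logits, `top_feature` is "losartan", mirroring the dominance pattern described for the two cardiovascular diagnoses.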

Figure 6. Heatmap of feature-specific attention scores across clinical contexts

To further illustrate the clinical utility of the attention scores in Figure 6, consider the following case study as a concrete example of how attention mechanisms guide clinical decision-making:

In a clinical prediction case involving a 65-year-old male with chronic systolic heart failure (ICD-9 code 428.43) comorbid with hypertension and acute kidney injury, the patient's EHRs were analyzed to assess in-hospital mortality risk and length of stay (LOS). The model constructed a heterogeneous graph structure, integrating diagnoses (e.g., heart failure, hypertension), treatments (losartan, furosemide), and laboratory data as nodes, with edges defined by historical conditional probability matrices. During pre-training, sorted ICD-9 coding and learnable positional embeddings captured hierarchical diagnostic relationships through numerical ordering, while dynamically adjusting node interaction weights. In prompt tuning, edge prompts leveraged bioClinicalBERT to compute semantic similarity between losartan and heart failure, refining the conditional probability matrix. Label prompts introduced virtual nodes linking patient states to risk labels, generating personalized risk scores. A dynamic routing mechanism iteratively calculated attention weights via capsule networks, predicting elevated mortality risk and moderate LOS (7–14 days). These predictions can guide physicians in adjusting medication regimens, while the LOS estimate facilitates proactive rehabilitation planning.

To validate the PPGM model’s attention mechanism, we conducted a structured evaluation using a Likert scale. Five senior cardiologists (each with over 10 years of ICU experience) independently assessed the clinical plausibility of the model’s top-ranked treatment associations—losartan with ICD-9 codes 427.31 (atrial fibrillation) and 428.43 (acute on chronic heart failure)—using a 1–5 scoring system (1 = completely unrelated, 2 = largely unrelated, 3 = possibly coincidental, 4 = mechanistically plausible indirect association, 5 = direct causal evidence). Without knowledge of the model’s predictions, all experts assigned the same scores: 4 for atrial fibrillation-losartan and 5 for heart failure-losartan.

The heart failure-losartan association received the highest score, reflecting the established mortality benefit of angiotensin receptor blockade. The atrial fibrillation-losartan link, while lacking direct causal evidence, was considered mechanistically plausible (e.g., blood pressure reduction mitigating arrhythmia triggers).

The consistency between expert judgment and model-derived attention scores underscores the interpretability and clinical face validity of the PPGM model in cardiovascular decision-making.

Discussion

The primary contribution of this study is the development and validation of a PPGM that outperforms other models in the field of medical informatics. By utilizing transformer architectures and GCNs, we have shown that complex temporal and relational data within EMRs can be effectively modeled. This speaks directly to clinicians and medical data scientists who are interested in improving predictive models for patient outcomes.

Our results indicate that label prompts are vital for maintaining high precision and AUROC, suggesting their critical role in structuring the model’s output space. Furthermore, the inclusion of edge prompts significantly enhances the model’s ability to capture and utilize semantic relationships between medical concepts, leading to superior performance. Node prompts help the model identify and understand interaction structures within the data, further improving prediction accuracy. These findings are particularly relevant for clinicians and medical data scientists who seek to leverage advanced computational models to improve patient outcomes.

The cross-specialty evaluation draws on the MIMIC-III dataset, with key characteristics illustrated in Figure 7 and detailed in the STAR Methods.

Figure 7. Distribution of neurological and oncological conditions in MIMIC-III

(A) Distribution of patient hospitalization duration in the MIMIC-III neurological and oncological cohort across the following intervals: 0–3 days, 4–7 days, 8–14 days, 15–30 days, and over 30 days.

(B) Distribution of the top 10 most frequent diagnoses in the MIMIC-III neurological and oncological cohort.

(C) Distribution of discharge status (survival vs. death) in the MIMIC-III neurological and oncological cohort.

Table 17 presents the experimental setup and hyperparameter configuration. The model uses the AdamW optimizer with a specified learning rate and hidden dropout probability. Training is conducted for a maximum number of steps. Dynamic routing is employed with a set number of iterations, output capsules, and capsule dimension. Early stopping is applied with a defined patience interval.

Table 13.

Performance comparison in mortality prediction task on subsets from the MIMIC-III dataset

| Model | AUPRC | AUROC | Precision | Recall | F1 score | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| GCT | 74.45% | 77.32% | 63.48% | 66.23% | 34.61% | 76.22% |
| CACHE | 59.92% | 81.62% | 67.89% | 71.59% | 38.10% | 82.10% |
| GraphCare | 20.45% | 72.74% | 57.65% | 56.90% | 55.13% | 86.59% |
| PPGM | 78.20% | 82.48% | 87.82% | 82.51% | 85.76% | 89.67% |

Bold entries represent the best-performing configuration in each comparison.

Several promising directions can be pursued based on our findings. One key area is the optimization of PPGM to reduce its computational demands, thereby enhancing its accessibility across diverse clinical settings. The dataset used in this study was sourced from the Cardiovascular Division of the Medical Big Data Research Center at the Chinese PLA General Hospital in Beijing. Given that the current real-world EHRs are limited to cardiology, future work should extend to multiple medical specialties. To this end, we evaluated the model's performance on representative neurology and oncology subsets from the MIMIC-III dataset, focusing on downstream predictive tasks such as in-hospital mortality and LOS. The results in Tables 13 and 14 demonstrate that PPGM exhibits a degree of generalizability across heterogeneous clinical data, confirming its adaptability to cross-domain medical scenarios. This finding provides valuable insights for future research. We plan to further explore the model's generalization capacity by incorporating data from additional specialties—such as pulmonology, endocrinology, and intensive care—and systematically assess its performance across broader healthcare contexts, with the aim of uncovering its potential for interdisciplinary modeling.

Table 14.

Performance comparison in hospitalization duration prediction task on subsets from the MIMIC-III dataset

| Model | Accuracy | Loss |
| --- | --- | --- |
| GCT | 52.12% | 69.71 |
| CACHE | 51.84% | 3.35 |
| GraphCare | 53.67% | 1.23 |
| PPGM | 56.76% | 1.12 |

Bold entries represent the best-performing configuration in each comparison.

To further investigate the capabilities of the PPGM model, Tables 15 and 16 illustrate its superior performance in mortality and hospitalization duration prediction on the eICU dataset compared to few-shot LLMs such as deepseek-7B and HuatuoGPT-7B, underscoring its effectiveness and potential in enhancing clinical decision-making.

Table 15.

Mortality prediction performance comparison on the eICU dataset

| Model | AUPRC | AUROC | Precision | Recall | F1 score | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek-7B | 53.90% | 51.29% | 60.16% | 56.21% | 62.30% | 75.28% |
| HuatuoGPT-7B | 47.26% | 49.28% | 46.13% | 49.90% | 51.10% | 78.10% |
| PPGM | 63.02% | 62.98% | 93.01% | 94.22% | 93.38% | 95.30% |

Bold entries represent the best-performing configuration in each comparison.

Table 16.

Hospitalization duration prediction performance comparison on the eICU dataset

| Model | Accuracy |
| --- | --- |
| deepseek-7B | 52.18% |
| HuatuoGPT-7B | 48.67% |
| PPGM | 64.68% |

Bold entries represent the best-performing configuration in each comparison.

We also encourage the exploration of additional prompts and their combinations to further enhance the model’s performance and interpretability. The integration of real-time data streams and the adaptation of the PPGM for other predictive tasks within healthcare, such as disease progression and treatment response, represent exciting opportunities for future work.

Table 15 compares mortality prediction performance on the eICU dataset. PPGM outperforms both deepseek-7B and HuatuoGPT-7B across all metrics with AUPRC of 63.02%, AUROC of 62.98%, precision of 93.01%, recall of 94.22%, F1 score of 93.38%, and accuracy of 95.30%.

Table 16 shows hospitalization duration prediction performance on the eICU dataset. PPGM again leads with an accuracy of 64.68%, significantly outperforming deepseek-7B at 52.18%, and HuatuoGPT-7B at 48.67%.

The hyperparameter settings used in our experiments are summarized in Table 17, including the optimizer, learning rate, and architectural parameters of the capsule network. The basic statistics of the pre-processed eICU dataset, such as the number of patients, visits, and average counts of diagnoses and treatments per patient, are presented in Table 18.

Table 17.

Experimental setup and hyperparameter configuration

| Hyperparameter/Setting | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning rate | 5×10⁻⁴ |
| Hidden dropout probability | 0.3 |
| Max training steps | 1×10⁶ |
| Number of dynamic routing iterations | 3 |
| Number of output capsules | 4 |
| Dimension of each capsule | 32 |
| Early stopping patience | 50 evaluation steps |

Table 18.

Statistics of pre-processed eICU datasets. “#”: “the number of”, “/patient”: “per patient”

| eICU | Train | Valid | Test |
| --- | --- | --- | --- |
| # Patients | 133084 | 33267 | 16633 |
| # Visits | 160687 | 40171 | 20085 |
| # Visits/Patient | 1.2074 | 1.1958 | 1.2113 |
| # Diagnoses/Patient | 22.6781 | 22.5322 | 22.7436 |
| # Treatments/Patient | 18.3648 | 18.4393 | 18.2965 |
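Per-patient averages such as visits per patient are simple ratios over the visit table. The toy computation below (with made-up patient IDs, not eICU data) shows how such statistics are derived:

```python
from collections import Counter

# One entry per visit; each entry is the visiting patient's ID (toy data).
visits = ["p1", "p1", "p2", "p3", "p3", "p3"]

per_patient = Counter(visits)             # patient ID -> number of visits
n_patients = len(per_patient)
n_visits = len(visits)
visits_per_patient = n_visits / n_patients
assert visits_per_patient == 2.0          # 6 visits over 3 patients
```

The diagnosis- and treatment-per-patient rows in Table 18 follow the same pattern, with codes counted in place of visits.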

Our comprehensive evaluation framework reveals that PPGM demonstrates remarkable cross-domain generalization, maintaining consistent performance across diverse clinical specialties despite significant variations in data distribution and disease pathophysiology. Unlike conventional deep learning approaches that often suffer from domain shift when applied to unseen specialties, PPGM’s integration of prompt-based fine-tuning with probabilistic graphical structures enables effective knowledge transfer through three key mechanisms: (1) the preservation of hierarchical relationships via sorted ICD codes, (2) the adaptive weighting of medical concepts through learnable positional embeddings, and (3) the incorporation of semantic-rich edge prompts derived from clinical knowledge bases. When benchmarked against state-of-the-art LLMs in few-shot learning scenarios, PPGM exhibited substantially superior performance in critical outcome prediction while operating with significantly lower computational overhead, highlighting its efficiency-accuracy trade-off advantage in resource-constrained clinical environments. This performance differential stems from PPGM’s explicit modeling of medical ontologies rather than relying solely on statistical patterns—a distinction that proves particularly valuable in low-data regimes where spurious correlations might mislead less structured approaches. These findings underscore that for healthcare applications, architectural designs that incorporate domain-specific structural priors may offer more sustainable performance gains than simply scaling model parameters, especially when interpretability and generalization across heterogeneous clinical contexts are paramount.

In conclusion, our study underscores the potential of prompt-based graph models in transforming medical data analytics. As the field of healthcare continues to evolve, we anticipate that our approach will facilitate a deeper understanding of complex patient data, leading to more accurate predictions and better clinical outcomes. We encourage further studies to address the limitations identified and explore the exciting future opportunities inspired by our work. This advancement represents a significant step toward more intelligent and effective healthcare solutions.

Conclusion

In this work, we introduce PPGM that operates in two stages: pre-training and fine-tuning. Our model achieves remarkable inference performance across multiple downstream tasks. Addressing the underutilization of EHR structures, we incorporate hierarchical information through conditional probability matrices to guide medical predictions. Specifically, for medical features of the same type, we propose sorted ICD codes and learnable positional embeddings, leveraging the characteristics of ICD codes to convert them into positional information, which significantly enhances pre-training effectiveness.

The PPGM not only thoroughly learns the structural nuances of EHR data but also adapts effectively to downstream tasks, achieving notable predictive outcomes. We employ prompt-tuning to establish latent connections between existing feature nodes and between nodes and prediction labels, and use gating mechanisms to enhance their weights. Dynamic routing and capsule networks are adopted according to task-specific features, leading to superior predictive performance.

Extensive experiments on the eICU open-source dataset and real-world records demonstrate that our PPGM outperforms state-of-the-art methods across different tasks.

Our approach sets a new standard in the utilization of EHR data for predictive analytics, showcasing the potential of structured probabilistic models in healthcare applications.

Limitations of the study

Despite integrating a priori medical knowledge and capturing complex relationships within EMR data, our model's generalizability across diverse demographic and geographic populations remains to be thoroughly evaluated. Variations in healthcare practices and patient characteristics may necessitate population-specific adjustments or retraining. Additionally, while the model incorporates temporal dynamics inherent in EMR, accurately predicting long-term outcomes or modeling disease progression over time remains challenging. Further research is essential to enhance the model's capability to effectively manage extended temporal dependencies, ensuring more reliable long-term predictions. Addressing these challenges will be crucial for improving the model's robustness and broad applicability in various healthcare settings.

Resource availability

Lead contact

Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Kunlun He (kunlunhe@plagh.org).

Materials availability

This study did not generate new unique reagents.

Data and code availability

  • Data: The Philips eICU Collaborative Research Dataset is available at GitHub (https://github.com/mit-lcp/eicu-code) and has been archived in reference 27. The MIMIC-III database, used for cross-specialty validation on neurology and oncology subsets, is available at https://physionet.org/content/mimiciii/1.4/ (access requires completion of required training and signing a Data Use Agreement). The real-world EMR records used in this study were obtained from the Cardiovascular Department of Chinese PLA General Hospital's Medical Big Data Research Center, Beijing, China. Due to the sensitivity of the hospital data, it cannot be made publicly available; data acquisition can be requested by contacting the email provided.

  • Code: Our source code is available at GitHub (https://github.com/PLA301dbgroup2/Probabilistic-Graphical-Model) with DOI (https://doi.org/10.5281/zenodo.16794545) to be provided upon acceptance.

  • Other: Part of downstream subtask data were under process of desensitization and approval; access can be requested through the lead contact.

Acknowledgments

The authors disclose support for the research of this work from research project (BHQ090003000X03).

Author contributions

All authors contributed to the conceptualization, analysis, and editing of the manuscript.

Declaration of interests

The authors declare that they have no competing interests.

STAR★Methods

Key resources table

Deposited data

| Reagent or Resource | Source | Identifier |
| --- | --- | --- |
| Philips eICU Collaborative Research Dataset | MIT Laboratory for Computational Physiology | GitHub: https://github.com/mit-lcp/eicu-code |
| MIMIC-III dataset | MIT Laboratory for Computational Physiology | https://physionet.org/content/mimiciii/1.4/ |
| Real-world EMR records | Cardiovascular Department of Chinese PLA General Hospital's Medical Big Data Research Center | Proprietary data, not publicly available |

Software and algorithms

| Reagent or Resource | Source | Identifier |
| --- | --- | --- |
| PPGM (Prompt Based Probabilistic Graphical Model) | This study | GitHub: https://github.com/PLA301dbgroup2/Probabilistic-Graphical-Model; DOI: https://doi.org/10.5281/zenodo.16794545 |
| bioClinicalBERT | Alsentzer et al. 2019 | https://github.com/EmilyAlsentzer/clinicalBERT |
| HuatuoGPT-7B | Zhang et al.18 | https://github.com/FreedomIntelligence/HuatuoGPT |
| Graph Convolutional Transformer (GCT) | Choi et al.24 | https://github.com/Google-Health/records-research |
| CACHE | Xu et al.25 | https://github.com/ritaranx/CACHE |
| GraphCare | Jiang et al.26 | https://github.com/pat-jj/GraphCare |

Experimental model and study participant details

Datasets

Our work utilizes the Philips eICU Collaborative Research Dataset27 and real-world EMR records.

The eICU dataset consists of Intensive Care Unit (ICU) records collected from multiple sites in the United States between 2014 and 2015. From the encounter records, medication orders, and procedure orders, we extracted diagnosis codes and treatment codes (i.e., medication and procedure codes). Following prior work,24 we did not use lab results.

In our methodology for processing medical encounter records, we handle diagnosis nodes using ICD-9 codes (International Classification of Diseases, Ninth Revision) extracted from the eICU dataset. The numeric structure of each ICD code, which consists of at least three digits, is central to this classification system. The first digit broadly classifies diseases or health conditions (e.g., infectious diseases, neoplasms), the second digit specifies more precise etiology or clinical manifestations within that category, and the third digit provides further detail concerning the type, subtype, or stage of the disease.

Figure 3A shows the distribution of diagnostic durations in a dataset across five intervals. Most cases (88,483) are within ‘0-3 days' (label 0), indicating a high prevalence of short-term conditions. The numbers decrease sharply with duration: 28,002 cases for ‘4-7 days' (label 1), 7,249 for ‘7-14 days' (label 2), 2,744 for ‘14-30 days' (label 3), and just 426 for ‘30+ days' (label 4). This highlights a predominance of brief illnesses, with fewer instances requiring extended care, informing healthcare resource allocation and patient management strategies.

As shown in (Figure 3B), the provided data lists the top 10 diagnoses by their frequency in a dataset, using ICD codes. ‘518.81' (other specified pulmonary issues) is the most common with 28,279 occurrences. Respiratory conditions are notably prevalent, including pneumonia (‘J18.9', 17,680 cases) and acute respiratory failure (‘J96.00', 27,680 cases). Hypertension (‘I10', 23,191 cases) and heart arrhythmias (‘I48.0', 16,261 cases) highlight the significant burden of cardiovascular diseases. Acute kidney injury (‘N17.9', 18,421 cases) also stands out. This distribution emphasizes the high prevalence of both acute and chronic diseases requiring critical care, particularly in cardiology and pulmonology fields.

As shown in Figure 3C, the hospital discharge status ratio shows that 91.0% of patients were discharged alive, while 9.0% of the cases resulted in expiration. This distribution is characterized by a predominant proportion of patients surviving to discharge, with a relatively smaller but notable proportion of mortality cases.

The real-world EMR records used in this study were obtained from the Cardiovascular Department of Chinese PLA General Hospital's Medical Big Data Research Center, Beijing, China, which consists of nine medical centers whose medical data are aggregated into the medical big data platform. Moreover, the hospital is a crucial center for the treatment of cardiovascular diseases, boasting numerous professional physicians and detailed medical records, which makes its data highly practical and representative. The data platform consists of EHRs aggregated from eight affiliated medical centers. A total of 41,151 patients' EMRs with diagnoses, medical orders, and laboratory results were extracted from the Cardiovascular Department.

The dataset under analysis provides valuable insights into hospitalization records, highlighting key aspects such as the duration of hospital stays, prevalent diagnoses, and discharge statuses. Specifically, regarding hospitalization duration (Figure 4A), a significant portion of patients experienced stays ranging from 7 to 14 days (n = 968), whereas shorter stays (0-3 days) were least common (n = 261). In terms of diagnostic categories (Figure 4B), ischemic heart diseases dominated, with I25.1 being the most frequently recorded diagnosis (n = 974). Acute myocardial infarction (I21.9, n = 424) followed as the second most common condition. Discharge statuses reveal that the majority of patients left the hospital alive (n = 2450), compared to those who expired during their stay (n = 672) (Figure 4C). These data delineate the primary characteristics of hospitalized patients within the current healthcare environment and also point the way towards further optimizing the quality and efficiency of healthcare services.

To evaluate the performance of the PPGM model across different clinical specialties, we selected neurological and oncological conditions from the MIMIC-III database. The data distribution is shown in Figure 7. Figure 7A displays the hospitalization duration, Figure 7B shows the diagnostic categories, and Figure 7C presents the discharge status. These results illustrate the clinical heterogeneity across patient groups.

To ensure a rigorous evaluation and prevent data leakage between pre-training and downstream tasks, we adopted a patient-level data partitioning strategy, assigning each patient’s complete visit records exclusively to training (80%), validation (10%), or test (10%) sets with no overlap. This approach effectively eliminates information leakage in longitudinal EMR analysis. During pre-training, the model learned graph representations using only ICD codes and structural information via the masked diagnosis prediction (MDP) task, without exposure to any downstream task labels (e.g., mortality, length-of-stay). The initial graph structure and positional embeddings were derived solely from training data, with no involvement of validation or test samples. Prompt-based adaptation was used for fine-tuning to further reduce leakage risk. Additionally, we conducted external validation on an independent EMR dataset from the Cardiovascular Department of Chinese PLA General Hospital, which reflects a distinct clinical setting compared to the eICU database. These strategies ensure a robust, unbiased evaluation and support the model’s reproducibility and clinical applicability.
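A patient-level 80/10/10 split of this kind can be sketched as follows; the helper name and the seed are our own illustrative choices, not the paper's exact implementation:

```python
import random

def patient_level_split(patient_ids, seed=42):
    """Assign each patient (and thus all their visits) to exactly one split.

    Shuffling unique patient IDs before slicing guarantees that no patient's
    records leak across the training (80%), validation (10%), and test (10%) sets.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))

train, val, test = patient_level_split(range(100))
assert not (train & val) and not (train & test) and not (val & test)
```

Because the split is made over patient IDs rather than visits, longitudinal records never straddle two sets, which is the leakage-prevention property described above.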

Method details

Pre-training process

Sorted ICD codes

Each di in the diagnosis nodes is represented using an ICD code. The consistent structure of ICD codes encapsulates a wealth of diagnostic information, and numerically proximate codes often represent diagnoses with shared characteristics. To harness this inherent structure, we implement a preprocessing step wherein all ICD codes associated with an encounter record are sorted in numerical order to create sorted ICD codes. This sorting positions diagnoses with common features closer together in the sequence, minimizing the distance between related codes and enhancing the representation of their similarities.
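A minimal sketch of this sorting step, assuming purely numeric ICD-9 codes (the example codes are illustrative, not drawn from the paper's data):

```python
def sort_icd_codes(codes):
    """Sort ICD-9 codes numerically so related diagnoses sit adjacently.

    Comparing codes by numeric value (e.g., 428.23 < 428.43) places subtypes
    of the same disease category next to each other in the sequence.
    """
    return sorted(codes, key=lambda c: float(c))

visit_codes = ["428.43", "401.9", "427.31", "428.23"]
print(sort_icd_codes(visit_codes))  # heart-failure codes 428.23 and 428.43 become adjacent
```

Real ICD-9 data also contains V- and E-prefixed codes, which a production version would need to handle separately before numeric comparison.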

Learnable positional embeddings

Building on this ordered representation of the sorted diagnosis codes, we introduce learnable positional embeddings into the embedding phase. This technique enables the model to better capture relationships between adjacent or nearby diagnosis nodes in the sequence. By incorporating positional information, the model learns to weight interactions between closely positioned codes more heavily, enhancing its ability to recognize and exploit the intrinsic patterns in the data. For each visit, the sorting function σ orders the diagnosis nodes numerically by their ICD codes. The embedding of the i-th diagnosis code d_i is denoted e_{d_i}, and the learnable positional embedding at position i is denoted p_{d_i}, whose parameters are trainable. The resulting embedding pe_{d_i} is the sum of e_{d_i} and p_{d_i}.

The process proceeds iteratively as follows:

Sorted ICD codes:

$\mathrm{sort}\big(d_1^{(t)}, \ldots, d_{|d^{(t)}|}^{(t)}\big) = d_{\sigma(1)}^{(t)}, \ldots, d_{\sigma(|d^{(t)}|)}^{(t)}$

Learnable positional embeddings:

$e_{d_i} = \mathrm{Embed}(d_i), \qquad p_{d_i} = \mathrm{PositionalEmbed}(i), \qquad pe_{d_i} = e_{d_i} + p_{d_i}$
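The sorting and embedding steps can be sketched in Python as follows (a toy numpy illustration; the ICD codes, dimensions, and random initialization are our own, and in the model both embedding tables are trainable parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary of ICD-9 codes mapped to indices (hypothetical codes).
vocab = {"250.00": 0, "401.9": 1, "414.01": 2, "428.0": 3}

embed_dim, max_len = 8, 16
code_embed = rng.normal(size=(len(vocab), embed_dim))  # Embed(.)            (trainable)
pos_embed = rng.normal(size=(max_len, embed_dim))      # PositionalEmbed(.)  (trainable)

def encode_visit(codes):
    """Sort the visit's ICD codes, then sum code and positional embeddings."""
    sorted_codes = sorted(codes)          # sigma: lexicographic order stands in for numeric ICD order
    idx = [vocab[c] for c in sorted_codes]
    e = code_embed[idx]                   # e_{d_i}
    p = pos_embed[: len(idx)]             # p_{d_i}
    return sorted_codes, e + p            # pe_{d_i} = e_{d_i} + p_{d_i}

codes, pe = encode_visit(["428.0", "250.00", "414.01"])
```

Because the codes are sorted before the position index is assigned, related diagnoses receive nearby positional embeddings regardless of the order in which they were recorded.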

The t-th visit V^{(t)} begins with the visit node v^{(t)} at the top. Beneath the visit node are diagnosis nodes d_1^{(t)}, …, d_{|d^{(t)}|}^{(t)}, which lead to a set of treatments m_1^{(t)}, …, m_{|m^{(t)}|}^{(t)}. Here, |d^{(t)}| and |m^{(t)}| denote the number of diagnosis and treatment codes in V^{(t)}, respectively. Some treatments produce lab results r_1^{(t)}, …, r_{|r^{(t)}|}^{(t)}.

Assuming all features d_i, m_i, r_i can be represented in the same latent space, we can view an encounter as a graph of |d| + |m| + |r| nodes with an adjacency matrix A that describes the connections between the nodes. We use c_i as the collective term for any of d_i, m_i, and r_i in the rest of the paper.

Our objective is to use pre-training tasks as a guiding direction to learn the structural relationships between these entities, specifically the adjacency matrix.

The original GCT paper24 uses P, with entries in [0.0, 1.0], to denote the matrix of conditional probabilities of all features based on entity co-occurrence, normalized such that each row sums to 1. The authors replace the attention mechanism in the first GCT block with the conditional probabilities P.

We further introduce a gating mechanism into the GCT architecture. The gating mechanism is an optimization strategy that dynamically adjusts the weights of the self-attention scores and the prior co-occurrence matrix based on the characteristics of the current data. This improves the model's ability to process electronic health record (EHR) data, particularly when structural information is incomplete, and its flexibility allows the model to adapt to varying data distributions. The iterative process is as follows:

Incorporate External Guide Information:

$P \leftarrow \begin{cases} P \odot G, & \text{if } G \text{ is not None} \\ P, & \text{otherwise} \end{cases}$

Apply Mask and Add Prior Information:

$M_{\mathrm{ext}} = M \cdot \mathbf{1}_N^{\top}, \quad [M_{\mathrm{ext}}]_{i,j} = m_i; \qquad M_{\mathrm{ext}}^{\top} = \mathbf{1}_N \cdot M^{\top}, \quad [M_{\mathrm{ext}}^{\top}]_{i,j} = m_j$

Apply the mask and add the identity matrix:

$I = I_N, \qquad P \leftarrow P \odot (M_{\mathrm{ext}} \odot M_{\mathrm{ext}}^{\top}) + s \cdot I$

Compute Degree and Normalize:

$D = \sum_{j=1}^{N} P_{:,j}, \qquad D_{\mathrm{ext}} = D \cdot \mathbf{1}_N^{\top}, \quad [D_{\mathrm{ext}}]_{i,j} = d_i$

Normalize P:

$P \leftarrow P \oslash D_{\mathrm{ext}}$ (element-wise division)

Note that for elements where D_ext is zero (i.e., d_i = 0), special handling is required in the implementation to avoid division by zero; for example, these rows can be set to a uniform distribution or left unchanged.

Summary formula: combining all the steps, the final prior guide matrix P is obtained as follows:

$P = \begin{cases} \dfrac{P \odot G \odot (M_{\mathrm{ext}} \odot M_{\mathrm{ext}}^{\top}) + s \cdot I}{D_{\mathrm{ext}}}, & \text{if } G \text{ is not None} \\[2ex] \dfrac{P \odot (M_{\mathrm{ext}} \odot M_{\mathrm{ext}}^{\top}) + s \cdot I}{D_{\mathrm{ext}}}, & \text{otherwise} \end{cases}$

where ⊙ denotes element-wise multiplication, 𝟏_N is a length-N vector of ones, D_ext is the extended degree matrix (the division by it is element-wise), and I is the N × N identity matrix.
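Under our reading of the steps above, the masking-and-normalization procedure can be sketched in numpy (the function name and the uniform fallback for zero-degree rows are our choices; m is a binary feature mask and s the self-loop scale):

```python
import numpy as np

def prior_guide(P, m, s=1.0, G=None):
    """Mask, add scaled identity, and row-normalize the co-occurrence prior P.

    P : (N, N) conditional-probability prior
    m : (N,) 0/1 mask of valid features
    G : optional (N, N) external guide, applied element-wise
    """
    if G is not None:
        P = P * G                              # incorporate external guide
    M_ext = np.outer(m, m)                     # [M_ext ⊙ M_ext^T]_{ij} = m_i m_j
    P = P * M_ext + s * np.eye(len(m))         # keep valid entries, add self-loops
    D = P.sum(axis=1, keepdims=True)           # row degrees
    out = np.divide(P, D, out=np.zeros_like(P), where=D > 0)
    out[D[:, 0] == 0] = 1.0 / len(m)           # zero-degree rows: uniform fallback
    return out
```

With s > 0 every row receives a self-loop, so each row of the result sums to one and can be consumed directly in place of attention probabilities.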

The roles of C^{(j)} and W_V^{(j)} can be summarized as follows:

In the j-th convolution layer, C^{(j)} represents the node embeddings, while W_V^{(j)} denotes the trainable parameters (the value matrix in the attention mechanism). The mask matrix M contains negative infinities where connections are prohibited and zeros where connections are allowed. The gating function G(Â^{(j)}, P) outputs a weighted sum of the self-attention scores and the prior co-occurrence matrix.

This formulation integrates the node embeddings with the attention mechanism's learnable parameters and applies a gating mechanism that considers both the adjacency structure (through Â^{(j)}) and the prior knowledge encoded in matrix P, all while respecting the connectivity constraints specified by the mask matrix M.

Attention Scores:

$A = \dfrac{X W_Q (X W_K)^{\top}}{\sqrt{d}} + M$

Gated Attention Probabilities:

$P_A = \mathrm{softmax}(A) \odot G_0 + P_{\mathrm{ext}} \odot G_1$

Note: the symbol ⊙ represents element-wise multiplication.

Context Vector Calculation:

$C = \mathrm{reshape}\!\left(\left[\mathrm{softmax}\!\left(\dfrac{X W_Q (X W_K)^{\top}}{\sqrt{d}} + M\right) \odot G_0 + P_{\mathrm{ext}} \odot G_1\right] X W_V\right) \in \mathbb{R}^{b \times l \times nd}$

The above equations encapsulate the core operations of the self-attention mechanism within a neural network layer. First, the attention scores A are calculated by transforming the input hidden states X into query, key, and value matrices via linear transformations and computing their scaled dot-product, with an additive attention mask M controlling which positions can be attended to. The raw scores are then converted into probabilities P_A through a softmax and further modulated by a gating mechanism that incorporates the prior knowledge P_ext. Finally, these probabilities are applied to the value matrix to compute the context vectors C, which are reshaped for compatibility with subsequent layers. This process enables the model to focus on relevant parts of the input sequence, enhancing its ability to capture complex dependencies in the data.

Each formula implicitly includes the tensor reshaping and dimensional adjustments performed in the actual implementation.
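As a concrete sketch of the gated fusion (single-head, numpy; the gates G_0 and G_1 are reduced to scalars for illustration, whereas in the model they are produced by a pooler and gate network):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(X, Wq, Wk, Wv, M, P_prior, g0, g1):
    """Fuse self-attention with a prior matrix via gate weights.

    M is an additive mask (0 where attention is allowed, -inf where it is
    prohibited); g0 and g1 weight the current attention and the prior.
    """
    d = Wq.shape[1]
    A = (X @ Wq) @ (X @ Wk).T / np.sqrt(d) + M   # attention scores
    PA = softmax(A) * g0 + P_prior * g1          # gated attention probabilities
    return PA @ (X @ Wv)                         # context vectors
```

With g0 = 0 and g1 = 1 the layer reduces to propagating the prior alone, which mirrors how the conditional probabilities replace attention in the first GCT block.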

Algorithm 2 presents the pseudo-code for the gating mechanism for attention fusion.

Prompt tuning process

The original graph structure is more medically reasonable, but it overlooks the edge structure information between nodes of the same type and does not connect visit nodes to treatment nodes, thus failing to comprehensively capture all the information of an encounter. However, the original graph structure learns a more generalizable medical graph, making it a stronger foundation. Therefore, we use it as the foundation graph learned during pre-training. When performing downstream tasks, we employ prompt tuning to more fully explore and utilize the structural information between nodes for prediction tasks, thereby improving task performance.

We propose three prompt fine-tuning methods for downstream tasks.

Edge prompt

We introduce the pre-trained bioClinicalBERT13 to learn the conditional probability matrix. The elements of the conditional probability matrix P are calculated as:

$P_{ij} = p(c_i \mid c_j)$

Instead of basing the calculation on the co-occurrence of c_i and c_j, we use the cosine similarity between bioClinicalBERT concept embeddings to estimate p(c_i | c_j). This approach integrates pre-trained knowledge and ensures the symmetry of the conditional probability matrix, yielding P_bert. P_bert serves as the probability prompt for edge connections, replacing the original P in the model, so that the initial iterations of the conditional probability matrix are more accurate and enriched with medical semantic information.
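A sketch of how such a symmetric edge-prompt matrix can be computed from concept embeddings (numpy; the rescaling of cosine similarity from [-1, 1] to [0, 1] is our assumption, and any embedding matrix stands in for bioClinicalBERT here):

```python
import numpy as np

def cosine_prompt_matrix(E):
    """Symmetric edge-prompt matrix P_bert from concept embeddings.

    E : (N, d) matrix of concept embeddings (bioClinicalBERT in the paper;
    any embedding matrix works in this sketch). Cosine similarity is
    rescaled to [0, 1] so entries can act as probability-like edge weights.
    """
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    return (En @ En.T + 1.0) / 2.0                     # symmetric by construction
```

Symmetry holds automatically because cosine similarity is symmetric in its arguments, which is the property the text highlights over a co-occurrence-based P.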

Label prompt

The second method is the label prompt, in which we augment each encounter with a label node connected to every node c_i but not to the visit node v_i. The embedding of the label node aggregates all information pertaining to the patient's visit. Given an encounter record, we train models to learn the label embedding l to perform the downstream tasks. Compared with the original GCT approach, which uses the visit node to represent the encounter for downstream predictions, the label node embedding improves information utilization and therefore better supports downstream prediction tasks that depend on all node information within an encounter.

Code prompt

We introduce a virtual code node between each pair of nodes c_i and c_j, connected to both c_i and c_j. In the original GCT setup, nodes of the same type are not connected, so the interaction structure of same-type nodes (such as concurrent symptoms) is not utilized in downstream tasks. By using virtual node prompts, we link information from nodes of the same type without altering the overall framework of the model. Unlike the original adjacency edges, the edges connected to virtual nodes are virtual weighted edges: in the corresponding adjacency matrix, their elements are not 0 or 1 but values in the interval (0, 1). As prompt information, the certainty of these edges is therefore weaker than that of the foundation graph. The weights of the virtual weighted edges are calculated from the similarity of the bioClinicalBERT embeddings of the corresponding entity pairs.
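The virtual-node augmentation can be sketched as follows (numpy; the function name is ours, and the weights would come from bioClinicalBERT similarities as described above):

```python
import numpy as np

def add_code_prompts(A, pairs, weights):
    """Augment a binary adjacency matrix with virtual code nodes.

    For each same-type pair (i, j) in `pairs`, append one virtual node
    connected to both i and j by weighted edges in (0, 1); the original
    0/1 edges of the foundation graph are left untouched.
    """
    N, K = A.shape[0], len(pairs)
    A2 = np.zeros((N + K, N + K))
    A2[:N, :N] = A                      # keep the foundation graph
    for k, ((i, j), w) in enumerate(zip(pairs, weights)):
        v = N + k                       # index of the new virtual node
        A2[i, v] = A2[v, i] = w
        A2[j, v] = A2[v, j] = w
    return A2
```

Because the same-type pair is linked only through the virtual node, information can flow between them in message passing while the original edge semantics remain intact.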

Prompt tuning enables our model to better adapt to specific downstream tasks, leveraging the structural information of patients and prior medical knowledge to a greater extent for multitasking. We applied these three prompt methods simultaneously to the tasks of mortality prediction and hospitalization duration prediction.

Dynamic routing aggregator

In the context of advancing healthcare analytics, particularly in analyzing patient visits, the Dynamic Routing Aggregator within neural network architectures enables a sophisticated method for capturing hierarchical and structured information. Unlike traditional convolutional layers that rely on static weights for feature extraction, the aggregator facilitates dynamic information aggregation from lower-level features through an iterative routing process. This allows capsules to learn complex spatial relationships and generate robust, abstract representations that are especially valuable for maintaining detailed pose information and part-whole relationships. The subsequent mathematical formulation elucidates how the Dynamic Routing Aggregator transforms input data into capsules, encapsulating rich, structured representations that are critical for deep insights into patient care patterns and outcomes.

$X \in \mathbb{R}^{B \times T \times D_{in}}$: input tensor.

$M \in \{0,1\}^{B \times T}$: mask tensor.

Parameters: $D_{in}$ (input dimension), $N$ (number of output capsules), $D_{out}$ (dimension of each capsule), $I$ (routing iterations).

Shared Fully Connected Layer:

$U = f(XW) \in \mathbb{R}^{B \times T \times (N \cdot D_{out})}$

where f is the activation function (tanh, ReLU, or GELU).

Reshape to align with capsule dimensions:

$U = \mathrm{reshape}(U, [B, T, N, D_{out}])$

Prepare Mask Tensor:

$B = \mathbf{0}_{B \times T \times N} + (1 - M) \cdot (-\infty)$

Iterative dynamic routing: for $i = 1, \ldots, I$:

Softmax over capsules:

$C = \mathrm{softmax}(B, \mathrm{dim}{=}2)$

Weighted sum of capsule inputs:

$S = \sum_{t=1}^{T} C_{:,t,:} \, U_{:,t,:,:}$

Squash operation:

$V = \mathrm{squash}(S, \mathrm{dim}{=}2)$

Update logits:

$B \leftarrow B + \sum_{d=1}^{D_{out}} U_{:,:,:,d} \, V_{:,:,d}$

After the final iteration, reshape the output to the specified format:

$V_{\mathrm{ret}} = \mathrm{reshape}(V, [B, N \cdot D_{out}])$

The dynamic routing aggregator processes an input tensor X ∈ ℝ^{B×T×D_in} through a series of transformations to produce capsules that aggregate information dynamically. Initially, X is fed into a shared fully connected layer with weights W, and an activation function f (such as tanh, ReLU, or GELU) is applied to generate U. This output is then reshaped to align with the capsule dimensions.

A mask tensor M ensures that certain elements are ignored during processing by setting their corresponding logits in B to −∞. The dynamic routing process begins with a softmax over the capsules along dimension 2 of B, which produces the coupling coefficients C. These coefficients are used in a weighted sum of the capsule inputs U to produce the capsule outputs S. A squash operation then normalizes S to obtain V, ensuring robustness to varying input norms. The logits B are subsequently updated based on the agreement between the input capsules and the output V, and this iterative process continues for a specified number of iterations.

Finally, the capsule outputs V are reshaped into V_ret to match the desired output format, integrating the learned features into a compact representation suitable for further processing or decision-making tasks. The dynamic routing aggregation process is presented in Algorithm 3.

Algorithm 1. Process Diagnosis Data and Sort ICD-9 Codes.

Require: infile, encounter_dict

Ensure: Updated encounter_dict with sorted ICD-9 codes

for all line in infile do

 encounter_id ← line[patientunitstayid]

 icd9code ← line[icd9code]

 if icd9code = None or icd9code = ε then

  continue

 end if

 codes ← Split(icd9code, ",")

 codes ← TrimWhitespace(codes)

 final_icd9s ← [c ∈ codes : c[0].isdigit()]

 if encounter_id ∉ encounter_dict then

  continue

 end if

 encounter_dict[encounter_id].dx_ids.extend(final_icd9s)

 encounter_dict[encounter_id].dx_ids.sort()

end for

Algorithm 2. Gating Mechanism for Attention Fusion.

Require: hidden_states, attention_mask, attention_probs, attention_probs_prior

Ensure: fused attention_probs

if gate mechanism is enabled then

 attention_mask1 ← Squeeze(attention_mask) == 0

 h ← Pooler(hidden_states, attention_mask1)

 gate_scores ← Gate(h)

 gate_scores ← Softmax(gate_scores / 2)

 g_curr, g_prior ← SelectAndReshape(gate_scores, dim=1)

 attention_probs ← attention_probs · g_curr + attention_probs_prior · g_prior

end if

Algorithm 3. Dynamic Routing Aggregation.

Require: input tensor $X \in \mathbb{R}^{B \times T \times D_{in}}$; mask tensor $M \in \{0,1\}^{B \times T}$; hyperparameters: number of output capsules $K$, dimension per capsule $D_{out}$, routing iterations $I$

Ensure: output tensor $V \in \mathbb{R}^{B \times (K \cdot D_{out})}$

 Initialize shared weight matrix $W \in \mathbb{R}^{D_{in} \times (K \cdot D_{out})}$

 Transform inputs: $\hat{u} \leftarrow \mathrm{Activation}(W \cdot X)$

 Reshape $\hat{u}$ to shape $B \times T \times K \times D_{out}$

 Extend the mask: $M \leftarrow M.\mathrm{unsqueeze}(1).\mathrm{repeat}(1, 1, K)$

 Initialize logits $B \leftarrow 0$, masked with $M$: $B[M = 0] \leftarrow -\infty$

for r = 1 to I do

 Compute coupling coefficients: $C \leftarrow \mathrm{Softmax}(B)$

 Apply mask: $C \leftarrow C \odot M$

 Weighted sum: $s \leftarrow \sum_{t=1}^{T} C_t \cdot \hat{u}_t$

 Squash vectors: $v \leftarrow \mathrm{Squash}(s)$

 Update routing logits: $B \leftarrow B + (\hat{u} \cdot v).\mathrm{detach}().\mathrm{sum}(1)$

end for

Final capsules: $v_{\mathrm{final}} \leftarrow v$

Flatten output if needed: $v_{\mathrm{final}} \leftarrow \mathrm{Flatten}(v_{\mathrm{final}})$

return $v_{\mathrm{final}}$
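A runnable numpy rendition of Algorithm 3 (our sketch: tanh is fixed as the activation, and the mask is applied to the coupling coefficients rather than to the logits, which has the same effect of excluding masked time steps):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash nonlinearity: keeps direction, maps the norm into [0, 1)."""
    n2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(X, M, W, K, D_out, iters=3):
    """Dynamic routing aggregator (numpy sketch of Algorithm 3).

    X : (B, T, D_in) inputs; M : (B, T) 0/1 mask; W : (D_in, K * D_out)
    shared weights. Returns (B, K * D_out) flattened capsules.
    """
    B_, T, _ = X.shape
    U = np.tanh(X @ W).reshape(B_, T, K, D_out)    # transformed inputs u-hat
    logits = np.zeros((B_, T, K))                  # routing logits
    for _ in range(iters):
        C = np.exp(logits - logits.max(axis=2, keepdims=True))
        C = C / C.sum(axis=2, keepdims=True)       # softmax over capsules
        C = C * M[:, :, None]                      # drop masked time steps
        S = (C[..., None] * U).sum(axis=1)         # weighted sum over T
        V = squash(S, axis=-1)                     # (B, K, D_out) capsules
        logits = logits + (U * V[:, None, :, :]).sum(axis=-1)  # agreement
    return V.reshape(B_, K * D_out)
```

Because the coupling coefficients of masked positions are zeroed before the weighted sum, visits of different lengths contribute only their observed time steps to the capsules.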

Quantification and statistical analysis

Evaluation Metrics: AUPRC, AUROC, Precision, Recall, and F1 Score (weighted average) for mortality prediction; Accuracy and MSE for hospitalization duration prediction.

Experimental Validation: All experiments were conducted using five-fold cross-validation to ensure robust performance estimation. For each fold, model performance was evaluated on the held-out test set. Performance improvements were assessed based on consistent gains in absolute performance metrics across multiple folds and datasets.

Statistical Considerations: Due to the imbalanced nature of clinical outcomes (e.g., only 9% in-hospital mortality rate in the eICU dataset), we emphasized AUPRC over AUROC as the primary evaluation metric, as AUPRC provides a more informative assessment in highly imbalanced scenarios. Weighted averaging was employed for multi-class metrics to account for class imbalance.
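As an illustration of the primary metric, AUPRC can be computed as non-interpolated average precision in a few lines (our sketch; tied scores are not handled specially):

```python
def average_precision(y_true, y_score):
    """AUPRC as non-interpolated average precision over ranked scores."""
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    tp, fp, ap = 0, 0, 0.0
    n_pos = sum(y_true)
    for i in order:
        if y_true[i]:
            tp += 1
            ap += tp / (tp + fp)   # precision at each true-positive rank
        else:
            fp += 1
    return ap / n_pos
```

Unlike AUROC, this quantity is anchored to the positive class only, which is why it is more informative when positives (e.g., in-hospital deaths) are rare.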

Published: August 29, 2025

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.isci.2025.113417.

Contributor Information

Juan Xu, Email: xujuan@dchealth.com.

Kunlun He, Email: kunlunhe@plagh.org.

Supplemental information

Document S1. Figures S1–S7
mmc1.pdf (758.7KB, pdf)

References

  • 1.Kopitar L., Kocbek P., Cilar L., Sheikh A., Stiglic G. Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Sci. Rep. 2020;10:11981. doi: 10.1038/s41598-020-68771-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Wu M., Hughes M., Parbhoo S., Zazzi M., Roth V., Doshi-Velez F. Beyond sparsity: Tree regularization of deep models for interpretability. Proc. AAAI Conf. Artif. Intell. 2018;32. [Google Scholar]
  • 3.Nie L., Wang M., Zhang L., Yan S., Zhang B., Chua T.-S. Disease inference from health-related questions via sparse deep learning. IEEE Trans. Knowl. Data Eng. 2015;27:2107–2119. doi: 10.1109/TKDE.2015.2399298. [DOI] [Google Scholar]
  • 4.Baytas I.M., Xiao C., Zhang X., Wang F., Jain A.K., Zhou J. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2017. Patient subtyping via time-aware LSTM networks; pp. 65–74. [Google Scholar]
  • 5.Al-Dailami A., Kuang H., Wang J. 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) IEEE; 2022. Attention-based memory fusion network for clinical outcome prediction using electronic medical records; pp. 902–907. [Google Scholar]
  • 6.Choi E., Bahadori M.T., Sun J., Kulas J., Schuetz A., Stewart W. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. Adv. Neural Inf. Process. Syst. 2016;29 doi: 10.48550/arXiv.1608.05745. [DOI] [Google Scholar]
  • 7.Ma F., Chitta R., Zhou J., You Q., Sun T., Gao J. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2017. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks; pp. 1903–1911. [Google Scholar]
  • 8.Rajkomar A., Oren E., Chen K., Dai A.M., Hajaj N., Hardt M., Liu P.J., Liu X., Marcus J., Sun M., et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 2018;1:18. doi: 10.1038/s41746-018-0029-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017;30 doi: 10.48550/arXiv.1706.03762. [DOI] [Google Scholar]
  • 10.Luo J., Ye M., Xiao C., Ma F. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20) Association for Computing Machinery; 2020. HiTANet: Hierarchical time-aware attention networks for risk prediction on electronic health records; pp. 647–656. [DOI] [Google Scholar]
  • 11.Ren H., Wang J., Zhao W.X., Wu N. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. KDD '21. Association for Computing Machinery; 2021. RAPT: Pre-training of time-aware transformer for learning robust healthcare representation; pp. 3503–3511. [DOI] [Google Scholar]
  • 12.Yang Z., Mitra A., Liu W., Berlowitz D., Yu H. TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nat. Commun. 2023;14:7857. doi: 10.1038/s41467-023-43715-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Rasmy L., Xiang Y., Xie Z., Tao C., Zhi D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit. Med. 2021;4:86. doi: 10.1038/s41746-021-00455-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Li Y., Rao S., Solares J.R.A., Hassaine A., Ramakrishnan R., Canoy D., Zhu Y., Rahimi K., Salimi-Khorshidi G. BEHRT: transformer for electronic health records. Sci. Rep. 2020;10:7155. doi: 10.1038/s41598-020-62922-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Li Y., Mamouei M., Salimi-Khorshidi G., Rao S., Hassaine A., Canoy D., Lukasiewicz T., Rahimi K. Hi-BEHRT: hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. IEEE J. Biomed. Health Inform. 2023;27:1106–1117. doi: 10.1109/JBHI.2022.3224727. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Ye M., Cui S., Wang Y., Luo J., Xiao C., Ma F. Proceedings of the Web Conference 2021. ACM; 2021. Medpath: Augmenting health risk prediction via medical knowledge paths; pp. 1397–1409. [Google Scholar]
  • 17.Saab K., Tu T., Weng W.-H., Tanno R., Stutz D., Wulczyn E., Zhang F., Strother T., Park C., Vedadi E., et al. Capabilities of gemini models in medicine. arXiv. 2024 doi: 10.48550/arXiv.2404.18416. Preprint at: [DOI] [Google Scholar]
  • 18.Zhang H., Chen J., Jiang F., Yu F., Chen Z., Li J., Chen G., Wu X., Zhang Z., Xiao Q., et al. Huatuogpt, towards taming language model to be a doctor. arXiv. 2023 doi: 10.48550/arXiv.2305.15075. Preprint at: [DOI] [Google Scholar]
  • 19.Hamilton W.L., Ying R., Leskovec J. Curran Associates Inc; 2017. Inductive Representation Learning on Large Graphs. [Google Scholar]
  • 20.Kipf T.N., Welling M. Semi-supervised classification with graph convolutional networks. arXiv. 2016 doi: 10.48550/arXiv.1609.02907. Preprint at. [DOI] [Google Scholar]
  • 21.Veličković P., Cucurull G., Casanova A., Romero A., Liò P., Bengio Y. Proceedings of the International Conference on Learning Representations (ICLR 2018) 2018. Graph attention networks.https://openreview.net/forum?id=rJXMpikCZ [Google Scholar]
  • 22.Liu Z., Li X., Peng H., He L., Yu P.S. Heterogeneous similarity graph neural network on electronic health records. IEEE; 2020. pp. 1196–1205. [Google Scholar]
  • 23.Wu J., He K., Mao R., Li C., Cambria E. MEGACare: Knowledge-guided multi-view hypergraph predictive framework for healthcare. Inf. Fusion. 2023;100 [Google Scholar]
  • 24.Choi E., Xu Z., Li Y., Dusenberry M., Flores G., Xue E., Dai A. Learning the graphical structure of electronic health records with graph convolutional transformer. Proc. AAAI Conf. Artif. Intell. 2020;34:606–613. [Google Scholar]
  • 25.Xu R., Yu Y., Zhang C., Ali M.K., Ho J.C., Yang C. Machine Learning for Health. PMLR; 2022. Counterfactual and factual reasoning over hypergraphs for interpretable clinical predictions on ehr; pp. 259–278. [PMC free article] [PubMed] [Google Scholar]
  • 26.Jiang P., Xiao C., Cross A.R., Sun J. GraphCare: Enhancing healthcare predictions with personalized knowledge graphs. 2023. https://openreview.net/forum?id=zPei0C6pxE
  • 27.Pollard T.J., Johnson A.E.W., Raffa J.D., Celi L.A., Mark R.G., Badawi O. The eICU collaborative research database, a freely available multi-center database for critical care research. Sci. Data. 2018;5 doi: 10.1038/sdata.2018.178. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.


Data Availability Statement

  • Data: The Philips eICU Collaborative Research Dataset is available at GitHub (https://github.com/mit-lcp/eicu-code) and has been archived at reference.27 The MIMIC-III database, used for cross-specialty validation on neurology and oncology subsets, is available at https://physionet.org/content/mimiciii/1.4/ (access requires completion of required training and signing a Data Use Agreement). The real-world EMR records used in this study were obtained from the Cardiovascular Department of Chinese PLA General Hospital’s Medical Big Data Research Center, Beijing, China. Due to the sensitivity of the hospital data, it cannot be made publicly available; data acquisition can be requested by contacting the email provided.

  • Code: Our source code is available at GitHub (https://github.com/PLA301dbgroup2/Probabilistic-Graphical-Model) with DOI (https://doi.org/10.5281/zenodo.16794545) to be provided upon acceptance.

  • Other: Part of downstream subtask data were under process of desensitization and approval; access can be requested through the lead contact.


Articles from iScience are provided here courtesy of Elsevier
