Abstract
Background
Traditional risk evaluation models have been applied to guide public health and clinical practice in various studies. However, the application of existing methods to data sets with missing and censored data, as is often the case in electronic health records, requires additional considerations. We aimed to develop and validate a predictive model that exhibits high performance with data sets that contain missing and censored data.
Methods and Results
This retrospective cohort study of coronary heart disease included unique patients aged 18 to 96 years who underwent routine health examinations at Weihai Municipal Hospital between 2013 and 2021. A total of 169 692 participants formed our study population, of whom 10 895 were diagnosed with coronary heart disease. Models were built for the risk of coronary heart disease based on demographic, laboratory, and medical history variables. All complete samples were assigned to the training set (n=110 325), whereas the remaining samples were assigned to the validation set (n=59 367). For the proposed model, the area under the receiver operating characteristic curve was 0.800 (95% CI, 0.794–0.805) and the C statistic was 0.796 (95% CI, 0.791–0.801) in the derivation cohort; the corresponding values in the validation cohort were 0.837 (95% CI, 0.821–0.853) and 0.838 (95% CI, 0.822–0.854). Calibration curves showed good agreement between predicted and observed risk, and decision curve analysis indicated clinical usefulness.
Conclusions
Our proposed risk prediction model performed well on electronic health record data, which often involve extensive missing data and censoring. This approach may help in using electronic health records to improve patient outcomes.
Keywords: Bayesian network, coronary heart disease, electronic health records, risk prediction, survival analysis
Subject Categories: Cardiovascular Disease, Epidemiology
Nonstandard Abbreviations and Acronyms
- CPH: Cox proportional hazards
- EHD: electronic health data
- IPCW: inverse probability of censoring weighted
- WSBN: weighted survival Bayesian network
Clinical Perspective.
What Is New?
We developed and validated a Bayesian network‐based prediction model, using electronic health records, to accurately forecast the probability of experiencing a coronary heart disease event.
This study demonstrated that our proposed model had good prediction ability in electronic health records with extensive missing and censored data.
What Are the Clinical Implications?
By using electronic health records, the clinical prediction model can provide patients and physicians with a quantified risk probability of experiencing coronary heart disease in the future based on the current health status of patients.
Moreover, the clinical prediction model can serve as a more intuitive and robust scientific tool for physicians by presenting them with the interdependencies among predictors, thus enabling them to offer guidance to patients about their health education and behavioral interventions.
Coronary heart disease (CHD) poses a significant threat to public health both in China and globally, with both genetic predisposition and environmental risk factors playing a crucial role in its development. In 2019, 197 million people worldwide experienced CHD, of whom 9.14 million died of CHD. China accounted for ≈38.2% of the global increase in CHD deaths over the past 30 years. 1 Furthermore, the prevalence of CHD‐related deaths in China is still increasing. 2
Clinical risk prediction scores or models play a crucial role in enhancing patient care management and elevating overall population outcomes. 3 The first coronary heart disease risk equation was developed in the FHS (Framingham Heart Study) in 1976. 4 Since then, researchers worldwide successively proposed several risk evaluation tools for cardiovascular disease, including the Framingham general cardiovascular disease equations in the United States, the systematic coronary risk evaluation model in Europe, the QRISK in the United Kingdom, and the most recent pooled cohort equations for atherosclerotic cardiovascular disease reported in the American College of Cardiology/American Heart Association guideline. 3 , 5 , 6 , 7 However, these models are all based on Western populations, which limits their generalizability to other populations. Hence, Barzi et al developed 8‐year cardiovascular disease risk equations for Asian populations without information on diabetes and high‐density lipoprotein cholesterol. 8 Liu et al developed the Framingham CHD risk assessment tool for the Chinese population. 9 Wang et al estimated a 10‐year risk of fatal and nonfatal ischemic cardiovascular diseases in Chinese adults. 10 Yang et al developed prediction equations for atherosclerotic cardiovascular disease risk in China (China‐PAR), which were recommended by the American College of Cardiology/American Heart Association guidelines to facilitate the primary prevention and management of cardiovascular disease in clinical practice for populations of Chinese ethnicity. 11 In recent years, machine learning methods have been increasingly used for disease prediction. Weng et al found that their machine learning models outperformed traditional models in terms of predictive accuracy, indicating that machine learning could be a valuable tool in improving the accuracy of cardiovascular risk prediction using routine clinical data. 12
Electronic health records (EHRs) serve as official digital health documents for individuals, accessible across various health care facilities and agencies. The significance of EHRs is steadily increasing, driven by the digitization of patient information. 13 However, the use of EHRs is accompanied by a set of challenges, including instances of missing and censored data, nonlinear connections between risk factors and outcomes, and varying impacts observed among different patient subgroups. These challenges necessitate the application of innovative strategies to develop refined risk models. 14 Although traditional risk evaluation models have played a crucial role in guiding public health and clinical practice across various studies, their application to data sets with missing and censored data, as is often the case in EHRs, requires additional considerations.
Bayesian networks are a powerful machine learning method for handling such scenarios. 15 By leveraging Bayes theorem and updating probabilities with observed evidence, Bayesian networks enable robust prediction and inference even with incomplete data. In addition, they represent dependencies among variables through a directed acyclic graph, where nodes represent variables and edges reflect conditional dependencies between variables. 16 , 17 This visual format aids clinical researchers in understanding complex statistical relationships involving numerous variables, such as patient demographics, medical history, and laboratory results. 18 In addition, the inverse probability of censoring weighted (IPCW) method is a highly effective technique for addressing bias caused by censored data. 19 It involves assigning weights to individuals based on the inverse probability of experiencing censoring events, such as loss to follow‐up. These weights are then used to adjust the contribution of each individual in the statistical analysis, correcting for biases introduced by censoring. 20
In this study, we built a predictive Bayesian network‐based model that demonstrates good performance with missing and censored data sets. We developed the prediction model using EHRs to determine the probability of a patient experiencing a CHD event within a 3‐year period. To assess the effectiveness of our approach, we compared the performance of our model with the Cox proportional hazards (CPH) model using both complete and incomplete data. The performance of the models was evaluated by using the area under the receiver operating characteristic curve (AUC), the C statistic, the calibration curve, and the decision curve.
Methods
The data and codes that support the findings of this study are available from the corresponding author upon reasonable request.
Study Population and Cohort Design
Weihai Municipal Hospital is a 3A‐level hospital located in Weihai City, Shandong Province, China. A total of 199 901 participants who underwent routine health examinations at Weihai Municipal Hospital from January 1, 2013, to December 31, 2021, were retrospectively enrolled. All participants had signed informed consent for this study. Among these participants, 30 209 were excluded according to the exclusion criteria. Exclusion criteria were as follows: (1) participants diagnosed with CHD at baseline, (2) participants diagnosed with CHD after December 31, 2021, and (3) participants aged <18 years. The remaining 169 692 participants formed our study population, of whom 10 895 participants were diagnosed with CHD.
Data extracted from the electronic medical records system of Weihai Municipal Hospital included demographic characteristics, laboratory findings, and medical history for diseases. Demographic characteristics included birth, age, sex, body mass index, identity card number, patient identifier, and physical examination date of patients. Laboratory findings included blood indicators and urine and vaginal secretion indicators. Medical histories for diseases of patients were extracted from the case registration system of Shandong Province and the case registration system of Weihai Municipal Hospital (Figure S1). All data were aggregated to Shandong University, and we used the identity card number as a unique identifier to match all the data (Figure S2).
This research was approved by the institutional review board and the ethics committee of the Weihai Municipal Hospital.
Predictors Considered
The risk factors to be studied were selected according to a screening process (see below). Risk factors incorporated in the prediction model included age, sex, body mass index, diabetes, obesity, hypertension, hyperuricemia, total cholesterol, triglycerides, low‐density lipoprotein cholesterol, and high‐density lipoprotein cholesterol. The blood test variables indicate the results of the corresponding tests, whereas the disease variables indicate whether the participants have been diagnosed with such diseases. In total, 11 variables were ultimately used to construct the prediction model and evaluate the potential risk of CHD. The summary statistics for the risk factors are given in Table 1.
Table 1.
Summary Statistics for the Risk Factors
| Data source | Variable type | Value* | Missing rate, % |
|---|---|---|---|
| Demographic characteristics | | | |
| Age, y | Continuous variable | 40 | 0 |
| Sex† | Categorical variable | 0, 1 | 0 |
| Body mass index, kg/m2 | Continuous variable | 24.96 | 16.57 |
| Laboratory findings | | | |
| Total cholesterol, mmol/L | Continuous variable | 4.70 | 24.17 |
| Triglycerides, mmol/L | Continuous variable | 1.31 | 25.72 |
| Low‐density lipoprotein cholesterol, mmol/L | Continuous variable | 2.85 | 26.14 |
| High‐density lipoprotein cholesterol, mmol/L | Continuous variable | 1.43 | 25.62 |
| Medical history‡ | | | |
| Hyperuricemia | Categorical variable | 0, 1 | 0.24 |
| Obesity | Categorical variable | 0, 1 | 15.23 |
| Diabetes | Categorical variable | 0, 1 | 15.23 |
| Hypertension | Categorical variable | 0, 1 | 15.23 |
*Continuous variables are summarized by their mean value, whereas categorical variables list the values of their categories.
†Sex=0 means male, whereas sex=1 means female.
‡Disease=0 means not experiencing the disease, whereas disease=1 means experiencing the disease.
Outcome Definitions
To identify patients with CHD, we used the International Classification of Diseases, Tenth Revision (ICD‐10), code I25. Patients who had been assigned this code were considered to have CHD. To conduct the statistical analysis, we used the survival time and follow‐up time data of the patients. Specifically, the follow‐up and survival time for both the derivation cohort and the validation cohort were determined on the basis of the dates of physical examination, disease diagnosis, and death.
Statistical Analysis
Grouping and Feature Selection
Figure 1 shows the flowchart of the whole study. The population was divided into a training set (n=110 325) and a validation set (n=59 367), according to whether the sample was complete. All complete samples were allocated to the training set, whereas the remaining samples were assigned to the validation set. This partitioning allows the newly developed model to be evaluated on incomplete data that were not used for training, which tests its ability to generalize and provides a check on overfitting to the training data.
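The completeness-based split can be sketched in a few lines of Python. This is only an illustration: the function name `split_by_completeness`, the field names, and the toy records are hypothetical and are not taken from the study data or code.

```python
def split_by_completeness(records, predictors):
    """Send complete records to training and incomplete ones to validation."""
    training, validation = [], []
    for record in records:
        # A record is complete when every predictor has a non-missing value.
        is_complete = all(record.get(name) is not None for name in predictors)
        (training if is_complete else validation).append(record)
    return training, validation


# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 52, "bmi": 24.1, "tc": 4.9},
    {"age": 61, "bmi": None, "tc": 5.3},
    {"age": 38, "bmi": 22.7, "tc": None},
]
training, validation = split_by_completeness(records, ["age", "bmi", "tc"])
```

With this rule the validation set contains only records with at least one missing predictor, which matches the stated purpose of testing the model on incomplete data.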
Figure 1. Flowchart of the study.

AUC indicates area under the receiver operating characteristic curve; CHD, coronary heart disease; CPH, Cox proportional hazards; and WSBN, weighted survival Bayesian network.
For feature selection, we considered various risk factors, including patient characteristics, blood and urine test indicators, and disease diagnosis indicators. Initially, variables with a missing data rate exceeding 50% were excluded to ensure an adequate sample size in the training set, considering the division between the training set and validation set. Subsequently, we screened the remaining candidate variables based on relevant medical literature, specifically retaining those documented to be associated with CHD. Finally, we applied lasso regression, which effectively identifies and eliminates irrelevant variables by penalizing the size of regression coefficients. The regression estimates are shrunk toward 0, with the degree of shrinkage determined by the lambda parameter (λ). To determine the optimal value for λ, we used a cross‐validation approach with 10‐fold validation and selected λ based on the minimum criterion. The results of lasso regression can be found in Table S1.
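For reference, the lasso estimate can be written in its generic penalized form. The text does not state which likelihood was used (for example, a logistic or a Cox partial likelihood), so $\ell(\beta)$ below is a placeholder log-likelihood and $p$ is the number of candidate predictors; both symbols are assumptions introduced for illustration.

```latex
\hat{\beta}(\lambda) \;=\; \arg\min_{\beta}
\left\{ -\ell(\beta) \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
```

Larger values of $\lambda$ shrink more coefficients to exactly 0, and the minimum criterion selects the $\lambda$ with the smallest cross‐validated error.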
Model Development
IPCW Method
Survival analysis techniques have expanded to cover increasingly diverse types of data. Although various methods have been developed for analyzing data in the presence of informative censoring, the only estimator developed for dependent censoring is the IPCW estimator. This technique corrects for dependent censoring, especially in right‐censored data, by assigning additional weight to subjects whose data are not censored. 19
We designated the duration of observation as $\gamma$, the interval between the initiation of the follow‐up period and the occurrence of an event as $T$, and we defined the span between the commencement of the follow‐up period and either disenrollment or the conclusion of the study period as $S$. In general, the IPCW method follows these steps: (1) Use the Kaplan‐Meier estimator for the survival distribution of censoring times to assess the function $G(t)=P(S_i>t)$, signifying the likelihood that the censoring time surpasses $t$, using the training data. 21 (2) Compute an inverse probability of censoring weight for each individual $i$ within the data set. Patients whose event status is unknown at $\gamma$ (ie, are censored before $\gamma$ and therefore have $S_i \le \min[T_i, \gamma]$) are assigned weight $w_i=0$. The remaining patients are assigned weights inversely proportional to the estimated probability of being censored after their observed follow‐up time. (3) Apply a prediction method to a weighted version of the training set where each member $i$ of the training set is weighted by a factor of $w_i$. For example, if $w_i=3$, it is as if the observation appeared 3 times in the data set. 22 , 23 In this article, we applied the Bayesian network model to a weighted version of the training set where each member of the training set was weighted by an inverse probability of censoring weight.
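Steps (1) and (2) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy follow‐up data, and the choice to evaluate the censoring distribution just before each time point (the left limit) are assumptions.

```python
def censoring_survival_before(times, events):
    """Kaplan-Meier estimate of G(t-) = P(S >= t) for the censoring times.

    Censoring (event == 0) is treated as the outcome of interest here.
    """
    steps = []       # (time, value of G just after that time)
    survival = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        censored = sum(1 for u, e in zip(times, events) if u == t and e == 0)
        survival *= 1.0 - censored / at_risk
        steps.append((t, survival))

    def g_before(t):
        value = 1.0
        for step_time, step_value in steps:
            if step_time < t:
                value = step_value
        return value

    return g_before


def ipcw_weights(times, events, horizon):
    """Weights for the binary outcome 'event occurred before the horizon'."""
    g_before = censoring_survival_before(times, events)
    weights = []
    for u, e in zip(times, events):
        if e == 0 and u < horizon:
            weights.append(0.0)   # censored early: status at horizon unknown
        else:
            weights.append(1.0 / g_before(min(u, horizon)))
    return weights


# Hypothetical follow-up times (years) and event indicators (1 = CHD, 0 = censored).
times = [2, 3, 4, 5, 6]
events = [1, 0, 1, 0, 1]
weights = ipcw_weights(times, events, horizon=4.5)
```

In this toy example the patient censored at year 3 receives weight 0, and the patients still under observation afterward are weighted up so that the weights again sum to the sample size.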
Bayesian Network
A Bayesian network is a probabilistic graphical model for representing knowledge about an uncertain domain, where each node corresponds to a random variable and each edge represents the conditional probability for the corresponding random variables. Bayesian networks satisfy the local Markov property, which states that a node is conditionally independent of its nondescendants given its parents. By leveraging Bayes theorem and updating probabilities with observed evidence, this model enables robust prediction and inference even with incomplete data. 15 Bayesian networks have gained significant popularity in biomedical applications because of their interpretability and effectiveness in assisting with reasoning under uncertainty. 18 They are particularly well suited for managing the complexities of risk prediction when using electronic health data (EHD).
Learning Bayesian networks from data involves 2 tasks: learning network structure (determining dependencies between variables; qualitative component) and learning the parameters (determining the strength of these dependencies between variables; quantitative component). 24 Commonly used structure learning algorithms for Bayesian networks include score‐ and constraint‐based algorithms. Score‐based methods aim to find the model structure that best fits the data by introducing a scoring function to evaluate each candidate model. On the other hand, constraint‐based methods use conditional independence constraints derived from statistical tests on the data. Once the structure of the Bayesian network is determined, the next step is parameter learning. This process involves estimating conditional probability tables that quantify the dependencies between variables in the network. 15 , 25 Various techniques are commonly used for parameter learning in Bayesian networks, including maximum likelihood estimation, the expectation‐maximization algorithm, Bayesian estimation, and others.
After constructing the Bayesian network model, inference algorithms can be used to perform probabilistic inference and prediction. They leverage the network's structure and conditional probability distributions for computing posterior probabilities of unobserved variables using observed evidence.
CPH Model
The Cox model, also referred to as the CPH model, is a statistical technique used in survival analysis to investigate the relationship between predictor variables and a time‐to‐event outcome. 26 The model provides estimates of regression coefficients, which indicate the significance and magnitude of the effects of predictors on the hazard rate. 27 , 28 In this study, we calculated the baseline risk function, which represents the risk in the absence of any influencing factors, by setting all covariates to 0. Using the hazard ratio (HR), we quantified the relative risk associated with a specific covariate compared with the reference group. By multiplying the baseline risk by the HR, we obtained the absolute risk, capturing the combined effect of both factors. We also evaluated the probability of a CHD event for each individual. First, we derived the estimated regression coefficients from the fitted Cox model and computed the baseline hazard function, which is the hazard when all predictor variable values are 0. Next, we computed a linear combination of the predictor variable values by multiplying each value by its regression coefficient and summing the products. We then calculated the hazard function at a specific time point t by multiplying the baseline hazard function by the exponential of this linear combination. Integrating the hazard function from time 0 to t gave the cumulative hazard of experiencing the event within that interval. The survival probability was obtained by exponentiating the negative of the cumulative hazard, and the probability of an individual experiencing a CHD event was obtained by subtracting the survival probability from 1.
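In standard notation, the risk calculation described in this section takes the following form. The symbols $h_0$, $H_0$, $\beta$, and $x_i$ are introduced here for illustration and do not appear in the original text; this is the textbook form of the Cox model rather than a transcript of the authors' code.

```latex
\begin{aligned}
h(t \mid x_i) &= h_0(t)\,\exp\!\left(\beta^{\top} x_i\right), \\
H_0(t) &= \int_0^{t} h_0(u)\,du, \\
S(t \mid x_i) &= \exp\!\left(-H_0(t)\,\exp\!\left(\beta^{\top} x_i\right)\right), \\
\text{risk}_i(t) &= 1 - S(t \mid x_i).
\end{aligned}
```

As a check on the relationship between coefficients and hazard ratios, the hypertension coefficient of 0.522 reported in Table 2 gives $\exp(0.522) \approx 1.685$, which matches the tabulated HR.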
Weighted Survival Bayesian Network Model
In this study, we developed a novel predictive model called the weighted survival Bayesian network (WSBN). This model combines the IPCW method and the CPH model with Bayesian networks to enable predictions based on incomplete data. Constructing the WSBN model involves 4 key steps: (1) using structure learning algorithms to determine the network structure for all predictors; (2) integrating the IPCW method and the Bayesian network to learn the parameters between predictors; (3) constructing the multivariate CPH model using the predictors and the disease outcome of interest; and (4) integrating the CPH model with the Bayesian network to create the final WSBN model.
We followed the above 4 steps to construct the WSBN model using the training set and to make predictions. Before constructing the model, it is necessary to convert continuous variables into categorical variables and subsequently transform all numerical variables into factor variables. Table S2 displays the categorized subgroups derived from the predictors. In the first step, any structure learning algorithm for Bayesian network models can be used. We specifically used the Tabu search algorithm, known for its speed advantage, to determine the structure of the Bayesian network in our study. In the second step, we addressed censoring by integrating the IPCW method with the parameter learning process of the Bayesian network. The IPCW method assigned weights to each sample based on the inverse probability of being censored, ensuring proper adjustment for bias during analysis. Unlike the traditional parameter learning approach, which estimates the conditional probability tables directly from the frequency distribution, our method calculates the conditional probability tables from the weighted data set produced by the IPCW method. To implement this process, we developed a new R function, as detailed in Data S1. In the third step, we constructed the multivariate CPH model using the predictors and the outcome variable, CHD. Finally, we fused the CPH model with the Bayesian network. The predictor variables in the CPH model correspond to nodes in the Bayesian network, enabling us to integrate the outcome variable from the CPH model into the Bayesian network. In terms of structure, the significant predictor nodes in the CPH model were directed toward the CHD node. In terms of parameters, the predicted outcomes of the CPH model were incorporated into the Bayesian network as conditional probabilities between the CHD node and the predictor nodes. Figure 2 shows the network structure of the WSBN model.
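The weighted parameter learning step can be illustrated with a short Python sketch. This is not the R function from Data S1; the function name `weighted_cpt`, the toy records, and the rule of skipping records with a missing child or parent value are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_cpt(records, weights, child, parents):
    """Estimate P(child | parents) from weighted counts.

    Each record contributes its IPCW weight instead of a count of 1.
    Records with a missing (None) child or parent value, or with weight 0,
    are skipped; the missing-value rule is an illustrative assumption.
    """
    joint = defaultdict(float)      # weight per (parent values, child value)
    marginal = defaultdict(float)   # weight per parent values
    for record, weight in zip(records, weights):
        parent_values = tuple(record.get(name) for name in parents)
        child_value = record.get(child)
        if weight == 0 or child_value is None or None in parent_values:
            continue
        joint[(parent_values, child_value)] += weight
        marginal[parent_values] += weight
    return {key: value / marginal[key[0]] for key, value in joint.items()}


# Hypothetical toy data: hypertension (HBP) given obesity (Obes).
records = [
    {"Obes": 1, "HBP": 1},
    {"Obes": 1, "HBP": 0},
    {"Obes": 0, "HBP": 0},
    {"Obes": 0, "HBP": None},   # missing child value, skipped
]
weights = [3.0, 1.0, 2.0, 1.0]
cpt = weighted_cpt(records, weights, child="HBP", parents=["Obes"])
```

Here the first record counts 3 times as much as the second, so the estimated probability of hypertension given obesity is 0.75 rather than the unweighted 0.5.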
Figure 2. The network structure of the weighted survival Bayesian network model.

BMI indicates body mass index; CHD, coronary heart disease; Diab, diabetes; HBP, hypertension; HDL‐C, high‐density lipoprotein cholesterol; HUA, hyperuricemia; LDL‐C, low‐density lipoprotein cholesterol; Obes, obesity; TC, total cholesterol; and TG, triglycerides.
After constructing the model, we used the WSBN model to make predictions using the likelihood weighting algorithm, a technique commonly used for probabilistic inference in Bayesian networks. The algorithm assigns weights to the samples based on their likelihood given the observed evidence, ensuring that the generated samples represent the posterior distribution of the unobserved variables. On top of the constructed WSBN model, we applied the likelihood weighting method to make predictions for individuals in the derivation and validation cohorts. The predicted results for individuals in the validation cohort can be found in Table S3.
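Likelihood weighting can be sketched on a small network. The three‐node structure (Age → HBP, Age → CHD, HBP → CHD) and every probability below are hypothetical values chosen for illustration; they are not parameters of the WSBN model.

```python
import random

# Hypothetical network: Age -> HBP, Age -> CHD, HBP -> CHD.
P_AGE = 0.3                                   # P(Age=1)
P_HBP = {0: 0.2, 1: 0.6}                      # P(HBP=1 | Age)
P_CHD = {(0, 0): 0.02, (0, 1): 0.10,
         (1, 0): 0.20, (1, 1): 0.40}          # P(CHD=1 | Age, HBP)


def likelihood_weighting(evidence, n_samples=20000, seed=0):
    """Estimate P(CHD=1 | evidence) when some predictors are unobserved."""
    rng = random.Random(seed)
    weighted_hits = 0.0
    total_weight = 0.0
    for _ in range(n_samples):
        weight = 1.0
        # Visit nodes in topological order: Age, then HBP, then CHD.
        if "Age" in evidence:
            age = evidence["Age"]
            weight *= P_AGE if age == 1 else 1.0 - P_AGE
        else:
            age = int(rng.random() < P_AGE)
        if "HBP" in evidence:
            hbp = evidence["HBP"]
            weight *= P_HBP[age] if hbp == 1 else 1.0 - P_HBP[age]
        else:
            hbp = int(rng.random() < P_HBP[age])
        chd = int(rng.random() < P_CHD[(age, hbp)])
        weighted_hits += weight * chd
        total_weight += weight
    return weighted_hits / total_weight


# Age is missing for this hypothetical patient; only hypertension is observed.
estimate = likelihood_weighting({"HBP": 1})
exact = (0.3 * 0.6 * 0.4 + 0.7 * 0.2 * 0.1) / (0.3 * 0.6 + 0.7 * 0.2)
```

Observed nodes are fixed to their evidence and contribute to the sample weight, whereas missing nodes are sampled from their conditional distributions, which is why no imputation step is needed before prediction. The sampled estimate approaches the exact posterior as the number of samples grows.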
Model Comparison
The multivariate Cox model was used as a benchmark in our study. Because the Cox model requires complete data for all variables, appropriate imputation techniques should be used to impute the missing values before fitting the Cox model. Any imputation method could be used for this purpose. Because of its high accuracy, effective resistance against overfitting, and efficiency in handling large‐scale data sets, we chose to use the random forest imputation method in our study. We used the randomForest package in R to implement this method. On the other hand, the WSBN model can be directly constructed using incomplete data, eliminating the need to fill in missing values before model construction. Table 2 shows the results of the multivariate Cox model.
Table 2.
Results of the Multivariate Cox Regression Model
| Variables | Subgroup | Coefficient | SE | HR (95% CI) | Wald value | P value |
|---|---|---|---|---|---|---|
| Total cholesterol, mmol/L | 5.72 | Reference | | | | |
| | 5.23–5.69 | 0.015 | 0.038 | 1.016 (0.942–1.095) | 0.403 | 0.687 |
| | >5.72 | 0.077 | 0.050 | 1.080 (0.980–1.190) | 1.552 | 0.121 |
| Triglycerides, mmol/L | <1.7 | Reference | | | | |
| | 1.7–2.3 | 0.047 | 0.031 | 1.048 (0.987–1.114) | 1.527 | 0.127 |
| | ≥2.3 | 0.084 | 0.032 | 1.088 (1.022–1.158) | 2.627 | 0.009 |
| High‐density lipoprotein cholesterol, mmol/L | ≥1.04 | Reference | | | | |
| | 0.91–1.04 | −0.106 | 0.027 | 0.899 (0.852–0.949) | −3.864 | <0.001 |
| | <0.91 | −0.081 | 0.032 | 0.922 (0.867–0.981) | −2.582 | 0.010 |
| Low‐density lipoprotein cholesterol, mmol/L | 2.6–3.3 | Reference | | | | |
| | 3.3–4.1 | 0.062 | 0.038 | 1.064 (0.987–1.148) | 1.617 | 0.106 |
| | ≥4.1 | 0.187 | 0.048 | 1.206 (1.098–1.325) | 3.899 | <0.001 |
| Body mass index, kg/m2 | ≤18.5 | Reference | | | | |
| | 18.5 to ≤25 | 0.131 | 0.033 | 1.140 (1.069–1.215) | 4.018 | <0.001 |
| | >25 | 0.282 | 0.033 | 1.326 (1.243–1.415) | 8.505 | <0.001 |
| Sex | Male | Reference | | | | |
| | Female | 0.394 | 0.023 | 1.483 (1.417–1.551) | 17.008 | <0.001 |
| Age, y | 18–45 | Reference | | | | |
| | 45–60 | 1.392 | 0.077 | 4.022 (3.457–4.679) | 18.019 | <0.001 |
| | >60 | 2.980 | 0.074 | 19.689 (17.030–22.764) | 40.255 | <0.001 |
| Hypertension | No | Reference | | | | |
| | Yes | 0.522 | 0.031 | 1.685 (1.587–1.789) | 17.034 | <0.001 |
| Diabetes | No | Reference | | | | |
| | Yes | 0.476 | 0.045 | 1.610 (1.474–1.759) | 10.542 | <0.001 |
| Obesity | No | Reference | | | | |
| | Yes | −0.270 | 0.064 | 0.764 (0.674–0.866) | −4.211 | <0.001 |
| Hyperuricemia | No | Reference | | | | |
| | Yes | 0.345 | 0.048 | 1.412 (1.286–1.550) | 7.241 | <0.001 |
HR indicates hazard ratio.
Assessment of Model Performance
To assess the performance of the model, several metrics were used, including the AUC and its 95% CI, the C statistic and its 95% CI, calibration curves, and decision curves.
We evaluated model discrimination using the AUC, which ranges between 0 and 1, with a higher value corresponding to better discrimination. The AUC is the proportion of all possible pairs of patients with and without an event in which the patient with the event is assigned the higher predicted probability of that event. We used the roc function in the pROC package to obtain the AUC and its 95% CI. By default, the 95% CI of the AUC was computed with 2000 stratified bootstrap replicates. The C statistic is also a measure of discrimination, and higher values indicate better discrimination. 29 The C statistic is similar to the AUC but accounts for the censored nature of the survival data. We used the concordance.index function in the survcomp package to obtain the C statistic and its 95% CI. The 95% CI for the C statistic was calculated as the C statistic ±1.96×SE. Calibration curves were used to assess the agreement between the predicted probability and the observed probability in the training and validation sets. 30 We calculated the calibration‐in‐the‐large (A) and the calibration slope (B). The closer A is to 0 and the closer B is to 1, the better the calibration. Decision curve analysis was performed to assess clinical utility, where a higher net benefit indicated better clinical usefulness. 31
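The definitions behind these metrics can be made concrete with a small Python sketch. It does not reproduce the pROC or survcomp computations (in particular, it ignores censoring and uses no bootstrap); the function names and the toy scores are hypothetical.

```python
def pairwise_auc(scores, labels):
    """Share of (event, non-event) pairs in which the event case scores higher.

    Ties count as half a concordant pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


def wald_ci(estimate, standard_error, z=1.96):
    """Symmetric 95% CI: estimate +/- 1.96 x SE."""
    return estimate - z * standard_error, estimate + z * standard_error


def net_benefit(scores, labels, threshold):
    """Decision curve net benefit at a single risk threshold."""
    n = len(labels)
    true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


# Hypothetical predicted risks and observed outcomes (1 = CHD event).
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
auc = pairwise_auc(scores, labels)
low, high = wald_ci(0.80, 0.01)
nb = net_benefit(scores, labels, threshold=0.5)
```

In the toy data, 3 of the 4 event and non‐event pairs are ordered correctly, so the AUC is 0.75. A decision curve is obtained by evaluating `net_benefit` over a range of thresholds.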
Results
Baseline Characteristics
According to the available data, 169 692 patients were included in the cohort and 10 895 patients were diagnosed with CHD (6.42%). Among 169 692 patients, the mean follow‐up time was 5.35 years, and the median age at baseline was 38 years (range, 18–96 years). Women accounted for 55.7% and men accounted for 44.3% of the cohort. Table S4 displays the baseline characteristics of this cohort, grouped by individual CHD status.
Incidence Rate and Density
A total of 10 895 new cases of CHD were diagnosed during the 1 201 783 600 person‐years of follow‐up in the cohort. The incidence rate of CHD was 6420.46 per 100 000 people, and the incidence density of CHD was 1201.78 per 100 000 person‐years. The incidence rate of CHD was 5152.33 per 100 000 people for men and 7428.00 for women. The incidence density of CHD was 992.72 per 100 000 person‐years for men and 1359.28 for women. Table 3 shows the incidence rate and incidence density (1/100 000) of the Weihai cohort grouped by age and sex. Table S5 presents the incidence rate and incidence density (1/100 000) of the Weihai cohort categorized by sex.
Table 3.
Incidence Rate and Incidence Density (1/100 000) of the Weihai CHD Cohort Grouped by Age and Sex
| Age, y | Sex=0: Case* | Sex=0: Total† | Sex=0: Rate‡ | Sex=0: Density§ | Sex=1: Case | Sex=1: Total | Sex=1: Rate | Sex=1: Density | All: Case | All: Total | All: Rate | All: Density |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0–39 | 450 | 38 635 | 1165 | 205 813 | 662 | 51 700 | 1280 | 285 372 | 1112 | 90 335 | 1231 | 491 186 |
| 40–44 | 238 | 8308 | 2865 | 48 257 | 677 | 10 926 | 6196 | 67 337 | 915 | 19 234 | 4757 | 115 595 |
| 45–49 | 372 | 7459 | 4987 | 39 203 | 937 | 9406 | 9962 | 53 602 | 1309 | 16 865 | 7762 | 92 806 |
| 50–54 | 499 | 6817 | 7320 | 33 958 | 1001 | 7341 | 13 636 | 38 871 | 1500 | 14 158 | 10 595 | 72 829 |
| 55–59 | 624 | 5505 | 11 335 | 26 871 | 1058 | 5686 | 18 607 | 29 285 | 1682 | 11 191 | 15 030 | 56 156 |
| 60–64 | 667 | 4305 | 15 494 | 19 218 | 1021 | 4464 | 22 872 | 20 690 | 1688 | 8769 | 19 250 | 39 908 |
| 65–69 | 489 | 2220 | 22 027 | 9330 | 714 | 2498 | 28 583 | 10 821 | 1203 | 4718 | 25 498 | 20 151 |
| ≥70 | 532 | 1882 | 28 268 | 7284 | 954 | 2540 | 37 559 | 10 649 | 1486 | 4422 | 33 605 | 17 934 |
| All | 3871 | 75 131 | 5152 | 993 | 7024 | 94 561 | 7428 | 1360 | 10 895 | 169 692 | 6420 | 1202 |
CHD indicates coronary heart disease.
*Case refers to the count of individuals with CHD within this age range and sex category.
†Total refers to the overall population count within this age range and sex category.
‡Rate refers to the incidence rate within this age range and sex category.
§Density refers to the incidence density within this age range and sex category.
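The incidence rates in the "All" row of Table 3 can be reproduced from the case and population counts with a one‐line calculation. The helper name `per_100k` is hypothetical; the same function would give the incidence density if person‐years were supplied as the denominator.

```python
def per_100k(cases, denominator):
    """Cases per 100 000 units of the denominator (people or person-years)."""
    return cases / denominator * 100_000


overall_rate = per_100k(10_895, 169_692)   # "All" columns of Table 3
sex0_rate = per_100k(3_871, 75_131)        # Sex=0 columns of Table 3
```

These evaluate to roughly 6420.46 and 5152.33 per 100 000 people, in agreement with the tabulated rates of 6420 and 5152.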
Validation of the Prediction Model
Discrimination Performance
Figure 3 shows the discrimination performance of the WSBN model and the Cox model for the derivation cohort and the validation cohort. The AUC value and its 95% CI were 0.800 (0.794–0.805), and the C statistic and its 95% CI were 0.796 (0.791–0.801) for the WSBN model in the derivation cohort, and the corresponding values for the WSBN model in the validation cohort were 0.837 (0.821–0.853) and 0.838 (0.822–0.854). The AUC value and its 95% CI were 0.824 (0.818–0.830) and the C statistic and its 95% CI were 0.821 (0.815–0.827) for the Cox model in the derivation cohort, and the corresponding values for the Cox model in the validation cohort were 0.826 (0.818–0.834) and 0.824 (0.816–0.831). The WSBN model, which can be applied directly to incomplete data, discriminated almost as well as the Cox model, which relies on complete data: its AUC was lower in the derivation cohort (0.800 versus 0.824) and slightly higher in the validation cohort (0.837 versus 0.826). The favorable performance of the WSBN model in the validation cohort suggests that it generalizes to new data and was not overfitted to the training data.
Figure 3. ROC curves of the WSBN model and the Cox model in the derivation cohort (blue line) and the validation cohort (red line).

A, The ROC curve of the WSBN model in the derivation cohort and the validation cohort. B, The ROC curve of the Cox model in the derivation cohort and the validation cohort. AUC indicates area under the ROC curve; ROC, receiver operating characteristic; and WSBN, weighted survival Bayesian network.
Calibration Performance
Figure 4 depicts the relationship between the predicted risk of CHD derived from the WSBN model and the actual proportion of CHD observed in both the derivation and validation cohorts. The horizontal axis represents the predicted probability of CHD, whereas the vertical axis represents the observed proportion of CHD. In the derivation cohort, the predicted and observed risks agree closely: the data points in Figure 4 lie near the ideal line with a slope of 1 and an intercept of 0, and the fitted calibration line has an intercept of A=−0.001 and a slope of B=0.951. The agreement is similar in the validation cohort, where A=0.004 and B=1.112. Figure S3 shows the relationship between the predicted risk of CHD derived from the Cox model and the actual proportion of CHD observed in both the derivation and validation cohorts. A comprehensive summary of all results for both the derivation and validation cohorts can be found in Table 4.
Figure 4. Calibration plots of the observed vs the predicted risk of coronary heart disease from the WSBN model in the derivation cohort and the validation cohort.

A, The calibration plot of the WSBN model in the derivation cohort. B, The calibration plot of the WSBN model in the validation cohort. The labels A to C within each plot indicate the intercept (A), the slope (B), and the area under the receiver operating characteristic curve value (C). WSBN indicates weighted survival Bayesian network.
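The calibration intercept and slope can be illustrated with a short Python sketch. The article does not state how these two values were estimated, so the approach below is an assumption: subjects are grouped by predicted risk, and the observed event proportion in each group is regressed on the mean predicted risk by ordinary least squares. The sketch also ignores censoring; with censored follow-up, the observed proportion in each group would need a survival estimate instead of a raw proportion. The function name and the toy data are hypothetical.

```python
def calibration_line(pred, outcome, n_groups=10):
    """Return (intercept, slope) of observed proportion vs mean predicted risk.

    Subjects are sorted by predicted risk and split into n_groups groups of
    equal size; the last group absorbs any remainder. Assumes at least
    n_groups subjects and at least two distinct group means.
    """
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    size = len(order) // n_groups
    xs, ys = [], []
    for g in range(n_groups):
        if g == n_groups - 1:
            idx = order[g * size:]
        else:
            idx = order[g * size:(g + 1) * size]
        xs.append(sum(pred[i] for i in idx) / len(idx))
        ys.append(sum(outcome[i] for i in idx) / len(idx))
    # Ordinary least squares fit of ys on xs.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


# Toy data (hypothetical): two risk groups that are perfectly calibrated.
pred = [0.1] * 10 + [0.5] * 10
outcome = [1] + [0] * 9 + [1] * 5 + [0] * 5
intercept, slope = calibration_line(pred, outcome, n_groups=2)
print(intercept, slope)
```

An intercept near 0 and a slope near 1 indicate that predicted and observed risks agree on average across the risk range; a slope below 1 suggests predictions that are too extreme, and a slope above 1 suggests predictions that are too conservative.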
Table 4.
Evaluation Results of the WSBN Model and the Cox Model in the Derivation Cohort and the Validation Cohort
| Indicators | WSBN model, derivation cohort | WSBN model, validation cohort | Cox model, derivation cohort | Cox model, validation cohort |
|---|---|---|---|---|
| Intercept of calibration curve | −0.001 | 0.004 | −0.003 | −0.002 |
| Slope of calibration curve | 0.951 | 1.112 | 1.014 | 1.009 |
| AUC (95% CI) | 0.800 (0.794–0.805) | 0.837 (0.821–0.853) | 0.824 (0.818–0.830) | 0.826 (0.818–0.834) |
| C index (95% CI) | 0.796 (0.791–0.801) | 0.838 (0.822–0.854) | 0.821 (0.815–0.827) | 0.824 (0.816–0.831) |
AUC indicates area under the receiver operating characteristic curve; and WSBN, weighted survival Bayesian network.
Decision Curve
Decision curve analysis calculates a clinical “net benefit” for prediction models in comparison to default strategies of treating all or no patients. A model can be recommended for clinical use if it has the highest level of benefit across a range of clinically reasonable preferences. Figure S4 shows the decision curves of the WSBN model in the derivation cohort and the validation cohort. The horizontal axis of this figure is the threshold probability: when a person's predicted risk of CHD reaches this threshold, the person is classified as positive and an intervention is undertaken. The vertical axis is the net benefit, that is, the benefit of intervening in true positives minus the harm of intervening in false positives. The net benefit is calculated across a range of threshold probabilities, defined as the minimum probability of disease at which further intervention is warranted, as net benefit = sensitivity × prevalence − (1 − specificity) × (1 − prevalence) × w, where w is the odds at the threshold probability, that is, threshold/(1 − threshold). 32 The blue line (representing intervention with the WSBN model) is always above the gray line (representing intervention for all) and the black line (representing intervention for none). Intervention guided by the WSBN prediction model therefore had a higher net benefit than intervention for all and intervention for none. This result indicates that the model developed in this study is clinically useful.
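The net benefit formula can be made concrete with a small Python sketch. It uses the equivalent count form, true positives/n − false positives/n × w, which follows from the formula above because sensitivity × prevalence equals true positives/n and (1 − specificity) × (1 − prevalence) equals false positives/n. The function names and the toy data are hypothetical, and censoring is ignored for simplicity.

```python
def net_benefit(pred, outcome, threshold):
    """Net benefit of intervening on everyone with pred >= threshold."""
    n = len(pred)
    tp = sum(1 for p, y in zip(pred, outcome) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(pred, outcome) if p >= threshold and y == 0)
    w = threshold / (1 - threshold)  # odds at the threshold probability
    return tp / n - fp / n * w


def net_benefit_treat_all(outcome, threshold):
    """Net benefit of intervening on every patient regardless of prediction."""
    prevalence = sum(outcome) / len(outcome)
    w = threshold / (1 - threshold)
    return prevalence - (1 - prevalence) * w


# Toy data (hypothetical). Intervening on no one always has net benefit 0.
pred = [0.9, 0.8, 0.3, 0.1]
outcome = [1, 0, 1, 0]
for threshold in (0.2, 0.5):
    print(threshold,
          net_benefit(pred, outcome, threshold),
          net_benefit_treat_all(outcome, threshold))
```

A decision curve is obtained by evaluating these functions over a grid of thresholds and plotting the three strategies (model, treat all, treat none) against the threshold probability.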
Discussion
EHRs are collected as a routine part of health care delivery, and they have great potential to be used to improve patient health outcomes. They contain multiple years of health information to be leveraged for risk prediction, disease detection, and treatment evaluation. 13 Although EHRs offer great possibilities for improving CHD risk prediction in modern patient groups, they also come with significant challenges. These challenges include, but are not limited to, missing data, variations among subgroups, and limited follow‐up with censored data. The first 2 challenges motivate the use of machine learning techniques, such as Bayesian networks, to analyze EHRs. However, conventional machine learning methods do not directly address the third challenge. We used the IPCW method to address this problem. Vock et al demonstrated that the performance of a Bayesian network model combined with IPCW was comparable to that of other machine learning methods but with greater explanatory power, which helps to avoid the concerns that often surround “black box” models. 20 This approach thus shows how Bayesian network models can offer clinicians an additional quantitative tool for predicting patient outcomes.
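The idea behind IPCW can be sketched in a few lines of Python. This is a generic illustration in the spirit of reference 20, not the authors' implementation: for a prediction horizon tau, subjects whose status at tau is known (an event by tau, or follow-up beyond tau) are up-weighted by the inverse probability of remaining uncensored, and subjects censored before tau receive weight 0. The function names, the toy data, and the simplified handling of ties between events and censoring times are assumptions.

```python
def censoring_survival(time, event):
    """Reverse Kaplan-Meier estimate of the censoring distribution.

    Censoring is treated as the "event". Returns a function g_before(t)
    giving the estimated probability of still being uncensored just before t.
    """
    steps = []  # (censoring time, estimate just after that time)
    g = 1.0
    for t in sorted(set(time)):
        at_risk = sum(1 for u in time if u >= t)
        censored = sum(1 for u, e in zip(time, event) if u == t and not e)
        if censored:
            g *= 1 - censored / at_risk
            steps.append((t, g))

    def g_before(t):
        value = 1.0
        for s, gs in steps:
            if s < t:
                value = gs
        return value

    return g_before


def ipcw_weights(time, event, tau):
    """Return (weights, labels) for the binary outcome "event by tau"."""
    g_before = censoring_survival(time, event)
    weights, labels = [], []
    for t, e in zip(time, event):
        if e and t <= tau:      # event observed by the horizon
            weights.append(1 / g_before(t))
            labels.append(1)
        elif t > tau:           # followed beyond the horizon, event-free at tau
            weights.append(1 / g_before(tau))
            labels.append(0)
        else:                   # censored before the horizon: status unknown
            weights.append(0.0)
            labels.append(None)
    return weights, labels


# Toy data (hypothetical): follow-up in years, 1 = event, 0 = censored.
weights, labels = ipcw_weights([1, 2, 3, 4, 5], [1, 0, 1, 0, 1], tau=3.5)
print(weights, labels)
```

The weighted subjects can then be passed to any learner that accepts case weights, so that the subjects with known status also stand in for comparable subjects who were censored early.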
In this article, our proposed risk prediction model was effective in handling the complexities of EHD, which often include extensive missing and censored data. The WSBN model combines the IPCW method, a Bayesian network, and the CPH model into a single approach. Integrating the IPCW method extends the Bayesian network technique to the analysis of censored event data, and the Cox model is used to identify the variables that are associated with the outcome of interest. Through this integration, the WSBN model achieves an improved predictive ability. Although the WSBN model was used in our study to predict the probability of CHD, it can also be extended to other diseases, such as hypertension and lung cancer.
The biggest difference from other risk prediction models is that the WSBN model can generate predictions directly in the presence of missing data, whereas other models cannot. Most previous methods either remove samples with missing data or impute the missing values (for example, with median, mean, K‐nearest neighbor, or random forest methods). However, removing samples with missing or censored data wastes data, and the imputed values may differ from the true values. In addition, these imputation methods rely on data that have well‐defined features and do not take dependencies between variables into account. Therefore, it is helpful to establish a new prediction model that combines good prediction performance with the capability of dealing with missing and censored data.
The study has certain limitations that need to be taken into consideration. (1) The data were collected from 1 province of China, which cannot represent the whole population of China. (2) We did not obtain data on the following risk factors: drinking and smoking. It is suggested that these risk factors be included in the construction of a predictive model if data on these factors are available. (3) There are too many combinations of data imputation methods and predictive models to compare our model with them all. The multivariate Cox model with imputed data was used as a benchmark in our study. The reason for this is that the Cox model is a classic survival analysis method, known for its ability to handle censored data with great efficacy. However, it cannot manage missing data. The missing data need to be processed (deleted or imputed) before using the Cox model. We used the random forest algorithm to impute the missing data and then used the Cox model to make predictions in our study for comparison with our model performance.
The proposed model also has certain limitations that need to be acknowledged. (1) The WSBN model is more suitable for short‐term forecasting because the Bayesian network method performs well in short‐term forecasting. We recommend against setting a long prediction horizon when building the model. In our study, we developed the WSBN model to predict the probability of having a CHD event within 3 years. (2) Converting continuous variables into categorical variables with a finite number of categories may cause information loss. Our future work involves extending the WSBN model so that its predictions can incorporate continuous variables. (3) The effectiveness of predictive models is constrained by the sample size used to train them. A larger sample size is generally more conducive to achieving better model performance, whereas a smaller sample size can have a detrimental effect on the utility of the model. Although the WSBN model can make predictions from incomplete data, a sufficient number of samples with complete data, especially for predictor variables, remains crucial for effective model training and optimal performance.
Overall, we developed and validated a new predictive model based on Bayesian networks in large cohorts. This study demonstrated that our model had good prediction ability in EHD with a large amount of missing and censored data. By using EHD, the clinical prediction model can provide patients and physicians with a quantified risk probability of experiencing CHD in the future based on the current health status of patients. Moreover, this model can serve as a more intuitive and robust scientific tool for physicians by presenting them with the interdependencies (network structure and conditional probability tables) among predictors, thus enabling them to offer guidance to patients about their health education and behavioral interventions. All in all, the WSBN model may contribute to the mining and use of EHD and help improve clinical decision‐making and patient outcomes.
Conclusion
In this study, we developed a new prediction model called the WSBN model based on the Bayesian network method. We trained the model on EHRs with missing and censored data to predict the probability of having a CHD event within 3 years. The predictive model exhibited good discrimination and calibration in both the derivation and validation cohorts. In addition, decision curve analysis supported its clinical utility. Overall, we provided a useful prediction tool for applying EHD to common diseases, which may help clinical decision‐making and improve outcomes.
Sources of Funding
This study is supported by the National Key Research and Development Program of China (grant 2020YFC2003500) from the Ministry of Science and Technology of the People's Republic of China and National Natural Science Foundation of China (grant 82173625).
Disclosures
None.
Supporting information
Data S1
Tables S1–S5
Figures S1–S4
This article was sent to Yen‐Hung Lin, MD, PhD, Associate Editor, for review by expert referees, editorial decision, and final disposition.
Supplemental Material is available at https://www.ahajournals.org/doi/suppl/10.1161/JAHA.123.029400
Contributor Information
Lijie Ding, Email: dljie369@163.com.
Fuzhong Xue, Email: xuefzh@sdu.edu.cn.
References
- 1. Moran A, Gu D, Zhao D, Coxson P, Wang YC, Chen C‐S, Liu J, Cheng J, Bibbins‐Domingo K, Shen Y‐M, et al. Future cardiovascular disease in China. Circ Cardiovasc Qual Outcomes. 2010;3:243–252. doi: 10.1161/circoutcomes.109.910711
- 2. Roth GA, Mensah GA, Johnson CO, Addolorato G, Ammirati E, Baddour LM, Barengo NC, Beaton AZ, Benjamin EJ, Benziger CP. Global burden of cardiovascular diseases and risk factors, 1990–2019: update from the GBD 2019 study. J Am Coll Cardiol. 2020;76:2982–3021. doi: 10.1016/j.jacc.2020.11.010
- 3. Goff DC Jr, Lloyd‐Jones DM, Bennett G, Coady S, D'agostino RB, Gibbons R, Greenland P, Lackland DT, Levy D, O'donnell CJ. ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association task force on practice guidelines. Circulation. 2013;2014(129):S53–S55. doi: 10.1161/01.cir.0000437741.48606.98
- 4. Kannel WB, McGee D, Gordon T. A general cardiovascular risk profile: the Framingham study. Am J Cardiol. 1976;38:46–51. doi: 10.1016/0002-9149(76)90061-8
- 5. Conroy R. Estimation of ten‐year risk of fatal cardiovascular disease in Europe: the SCORE project. Eur Heart J. 2003;24:987–1003. doi: 10.1016/s0195-668x(03)00114-3
- 6. Hippisley‐Cox J, Coupland C, Robson J, Brindle P. Derivation, validation, and evaluation of a new QRISK model to estimate lifetime risk of cardiovascular disease: cohort study using QResearch database. BMJ. 2010;341:c6624. doi: 10.1136/bmj.c6624
- 7. Wilson PWF, D'Agostino RB, Levy D, Belanger AM, Silbershatz H, Kannel WB. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97:1837–1847. doi: 10.1161/01.cir.97.18.1837
- 8. Barzi F, Patel A, Gu D, Sritara P, Lam T, Rodgers A, Woodward M. Cardiovascular risk prediction tools for populations in Asia. J Epidemiol Community Health. 2007;61:115–121. doi: 10.1136/jech.2005.044842
- 9. Liu J. Predictive value for the Chinese population of the Framingham CHD risk assessment tool compared with the Chinese multi‐provincial cohort study. JAMA. 2004;291:2591–2599. doi: 10.1001/jama.291.21.2591
- 10. Wang Y, Liu J, Wang W, Wang M, Qi Y, Xie W, Li Y, Sun J, Liu J, Zhao D. Lifetime risk for cardiovascular disease in a Chinese population: the Chinese Multi–Provincial Cohort Study. Eur J Prev Cardiol. 2013;22:380–388. doi: 10.1177/2047487313516563
- 11. Yang X, Li J, Hu D, Chen J, Li Y, Huang J, Liu X, Liu F, Cao J, Shen C, et al. Predicting the 10‐year risks of atherosclerotic cardiovascular disease in Chinese population. Circulation. 2016;134:1430–1440. doi: 10.1161/circulationaha.116.022367
- 12. Weng SF, Reps J, Kai J, Garibaldi JM, Qureshi N. Can machine‐learning improve cardiovascular risk prediction using routine clinical data? PLoS One. 2017;12:e0174944. doi: 10.1371/journal.pone.0174944
- 13. Yadav P, Steinbach M, Kumar V, Simon G. Mining electronic health records (EHRs) a survey. ACM Computing Surveys (CSUR). 2018;50:1–40. doi: 10.1145/3127881
- 14. Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012;13:395–405. doi: 10.1038/nrg3208
- 15. Heckerman D. A tutorial on learning with Bayesian networks. Innovat Bayesian Netw: Theory Appl. 2008;156:33–82.
- 16. Arora P, Boyne D, Slater JJ, Gupta A, Brenner DR, Druzdzel MJ. Bayesian networks for risk prediction using real‐world data: a tool for precision medicine. Value Health. 2019;22:439–445. doi: 10.1016/j.jval.2019.01.006
- 17. Heckerman D. Bayesian networks for data mining. Data Min Knowl Disc. 1997;1:79–119. doi: 10.1023/A:1009730122752
- 18. Lucas PJF, van der Gaag LC, Abu‐Hanna A. Bayesian networks in biomedicine and health‐care. Artif Intell Med. 2004;30:201–214. doi: 10.1016/j.artmed.2003.11.001
- 19. Robins JM, Finkelstein DM. Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log‐rank tests. Biometrics. 2000;56:779–788. doi: 10.1111/j.0006-341X.2000.00779.x
- 20. Vock DM, Wolfson J, Bandyopadhyay S, Adomavicius G, Johnson PE, Vazquez‐Benitez G, O'Connor PJ. Adapting machine learning techniques to censored time‐to‐event health record data: a general‐purpose approach using inverse probability of censoring weighting. J Biomed Inform. 2016;61:119–131. doi: 10.1016/j.jbi.2016.03.009
- 21. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457–481. doi: 10.1080/01621459.1958.10501452
- 22. Štajduhar I, Dalbelo‐Bašić B, Bogunović N. Impact of censoring on learning Bayesian networks in survival modelling. Artif Intell Med. 2009;47:199–217. doi: 10.1016/j.artmed.2009.08.001
- 23. Štajduhar I, Dalbelo‐Bašić B. Learning Bayesian networks from survival data using weighting censored instances. J Biomed Inform. 2010;43:613–622. doi: 10.1016/j.jbi.2010.03.005
- 24. Cheng J, Bell DA, Liu W. Learning belief networks from data: an information theory based approach. Paper/Poster Presented at: Proceedings of the Sixth International Conference on Information and Knowledge Management. 1997.
- 25. Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Mach Learn. 1992;9:309–347. doi: 10.1007/bf00994110
- 26. Clark TG, Bradburn MJ, Love SB, Altman DG. Survival analysis part I: basic concepts and first analyses. Br J Cancer. 2003;89:232–238. doi: 10.1038/sj.bjc.6601118
- 27. Selvin S. Survival Analysis for Epidemiologic and Medical Research. Cambridge University Press; 2008. doi: 10.1017/CBO9780511619809
- 28. Frank EH. Regression Modeling Strategies With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer; 2015.
- 29. Pencina MJ, D'agostino RB. Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Stat Med. 2004;23:2109–2123. doi: 10.1002/sim.1802
- 30. Xie S‐H, Lagergren J. A model for predicting individuals' absolute risk of esophageal adenocarcinoma: moving toward tailored screening and prevention. Int J Cancer. 2016;138:2813–2819. doi: 10.1002/ijc.29988
- 31. Steyerberg EW, Vergouwe Y. Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J. 2014;35:1925–1931. doi: 10.1093/eurheartj/ehu207
- 32. Vickers AJ, van Calster B, Steyerberg EW. A simple, step‐by‐step guide to interpreting decision curve analysis. Diagn Progn Res. 2019;3:1–8. doi: 10.1186/s41512-019-0064-7
