Abstract
Background and Objective. Current cardiovascular disease (CVD) risk models are typically based on traditional laboratory-based predictors. The objective of this research was to identify key risk factors that affect CVD risk prediction and to develop a 10-year CVD risk prediction model using the identified risk factors. Methods. A Cox proportional hazards regression method was applied to generate the proposed risk model. We used the dataset from the Framingham Original Cohort of 5079 men and women aged 30-62 years who had no overt symptoms of CVD at baseline; among the selected cohort, 3189 had a CVD event. Results. A 10-year CVD risk model based on multiple risk factors (age, sex, body mass index (BMI), hypertension, systolic blood pressure (SBP), cigarettes per day, pulse rate, and diabetes) was developed, in which heart rate was identified as a novel risk factor. The proposed model achieved good discrimination and calibration, with a C-index (area under the receiver operating characteristic (ROC) curve) of 0.71 in the validation dataset. We validated the model via statistical and empirical validation. Conclusion. The proposed CVD risk prediction model is based on standard risk factors, which could help reduce the cost and time required for clinical and laboratory tests. Healthcare providers, clinicians, and patients can use this tool to see the 10-year risk of CVD for an individual. Heart rate was incorporated as a novel predictor, which extends the predictive ability of existing risk equations.
1. Introduction
Cardiovascular disease (CVD) describes various conditions that affect the functioning of the heart and the cardiovascular system [1]. Owing to its high morbidity, CVD has become the leading cause of mortality around the world [2–4]. In New Zealand, statistics on CVD mortality in 2017 suggest that CVD accounts for 33% of deaths [4].
The majority of cardiovascular-related deaths are premature and preventable, and their number can be reduced through effective health management, including diet plans, lifestyle interventions, and drug interventions [5]. A useful approach to preventing CVD is to assess CVD risk regularly and then introduce lifestyle adjustments or clinical treatments accordingly.
In past decades, a great deal of research has been done on CVD risk estimation, including the Framingham risk scores from the Framingham Heart Study (FHS) [6, 7], the QRISK equations [8], the European SCORE risk equations [9], the ASSIGN scores from the Scottish Heart Health Extended Cohort (SHHEC) [10], the Prospective Cardiovascular Münster (PROCAM) equations [11], and the CUORE Cohort Study formulas [12]. These CVD risk prediction models have proved effective in health and disease management for clinicians and individuals [13–15]. The new PREDICT CVD risk assessment equation, developed for primary health care in New Zealand, has been integrated into electronic health records (EHRs), and a web-based software tool called PREDICT has been developed to help general practices manage CVD risk in primary care [13]. PREDICT has been used to assess CVD risk in 400,728 patients and is becoming a useful tool for decision support and health management for general practitioners.
However, challenges remain in developing CVD risk estimation models. Some CVD risk models [16–18] are based on a single risk factor and therefore cannot capture the simultaneous influence of multiple factors. Risk models [6, 8, 19] that use statistical regression methods [20–22] tend to rely on classic risk factors such as age, smoking, diabetes, sex, high blood pressure, and total cholesterol to estimate the risk score. Studies [18, 19, 23–27] that apply data mining or machine learning techniques to CVD risk estimation cannot provide an absolute risk estimate, although some of these models [18, 26] have tried to incorporate novel predictors. This research aims to identify novel risk factors for CVD detection alongside conventional predictors and then to enhance risk estimation by developing a multivariable risk prediction model that targets 5-year and 10-year CVD events.
2. Methods
2.1. Study Population
The study population was selected from the Framingham Original Cohort study dataset [28, 29]. We obtained ethics approval from NHLBI [30] and the Auckland University of Technology Ethics Committee (AUTEC) (Ref: 17/385 Early Detection and Self-Management of Cardiovascular Disease Using Artificial Intelligence-Based Model). The data from this cohort study include a total of 5079 men and women aged 30-74 years who were free of CVD at baseline, of whom 3189 eventually had a CVD event. The distribution of CVD events among male and female subjects in the study population is summarized in Table 1.
Table 1.
CVD event distribution in male and female.
| | Count | CVD Events | Age Range |
|---|---|---|---|
| Male | 2294 | 1560 | 30 - 74 |
| Female | 2785 | 1629 | 30 - 74 |
| Total | 5079 | 3189 | 30 - 74 |
2.2. Data Extraction
There are 32 exams in the Framingham Original Cohort study dataset, as shown in Appendix A. The data collected in the first exam (“Exam 1”) were chosen to develop the CVD prediction model because this exam has the largest number of samples (5209 subjects). Data from 130 subjects were removed for ethics-protection reasons. Five further exams, Exams 8 to 12 (marked in italic font in Table 7 of Appendix A), were used to validate the fitted model. Data for the candidate risk factors (listed in Table 2) used to create the risk model were extracted.
Table 7.
Exams in the Framingham Original Cohort study data set.
| Exams | Exam Date Range | Age Range | Mean Age | Attendees |
|---|---|---|---|---|
| Exam 1 | 1948 - 1953 | 28 - 74 | 44 | 5209 |
| Exam 2 | 1950 - 1955 | 31 - 65 | 46 | 4792 |
| Exam 3 | 1952 - 1956 | 32 - 67 | 48 | 4416 |
| Exam 4 | 1954 - 1958 | 34 - 69 | 50 | 4541 |
| Exam 5 | 1956 - 1960 | 37 - 70 | 52 | 4421 |
| Exam 6 | 1958 - 1963 | 38 - 72 | 54 | 4259 |
| Exam 7 | 1960 - 1964 | 40 - 74 | 55 | 4191 |
| *Exam 8* | 1962 - 1966 | 42 - 76 | 57 | 4030 |
| *Exam 9* | 1964 - 1968 | 44 - 78 | 59 | 3833 |
| *Exam 10* | 1966 - 1970 | 46 - 80 | 61 | 3595 |
| *Exam 11* | 1968 - 1971 | 49 - 81 | 62 | 2955 |
| *Exam 12* | 1971 - 1974 | 50 - 83 | 64 | 3261 |
| Exam 13 | 1972 - 1976 | 53 - 85 | 66 | 3133 |
| Exam 14 | 1975 - 1978 | 55 - 88 | 68 | 2871 |
| Exam 15 | 1977 - 1979 | 57 - 89 | 69 | 2632 |
| Exam 16 | 1979 - 1982 | 59 - 91 | 70 | 2351 |
| Exam 17 | 1981 - 1984 | 61 - 93 | 72 | 2179 |
| Exam 18 | 1983 - 1985 | 63 - 94 | 74 | 1825 |
| Exam 19 | 1985 - 1988 | 65 - 96 | 75 | 1541 |
| Exam 20 | 1986 - 1990 | 67 - 97 | 77 | 1401 |
| Exam 21 | 1988 - 1992 | 69 - 99 | 79 | 1319 |
| Exam 22 | 1990 - 1994 | 72 - 101 | 80 | 1166 |
| Exam 23 | 1992 - 1996 | 73 - 101 | 81 | 1026 |
| Exam 24 | 1995 - 1998 | 76 - 103 | 83 | 831 |
| Exam 25 | 1997 - 1999 | 78 - 104 | 84 | 703 |
| Exam 26 | 1999 - 2001 | 79 - 103 | 86 | 558 |
| Exam 27 | 2002 - 2003 | 82 - 104 | 87 | 414 |
| Exam 28 | 2004 - 2005 | 84 - 104 | 89 | 303 |
| Exam 29 | 2006 - 2007 | 85 - 102 | 91 | 218 |
| Exam 30 | 2008 - 2010 | 88 - 102 | 92 | 141 |
| Exam 31 | 2010 - 2011 | 90 - 99 | 92 | 91 |
| Exam 32 | 2012 - 2014 | 93 - 106 | 96 | 40 |
Table 2.
Description of candidate predictors.
| ORDER | PREDICTOR | UNITS / CODES | TYPE |
|---|---|---|---|
| 1 | AGE | YEARS | CONTINUOUS |
| 2 | SEX | 0001 MALE; 0002 FEMALE | CATEGORICAL |
| 3 | BMI | KG/M2 | CONTINUOUS |
| 4 | HYPERTENSION | 0000 NEGATIVE; 0001 TRANSIENT; 0002 PERMANENT; 0003 TYPE UNKNOWN; 0008 DOUBTFUL | CATEGORICAL |
| 5 | HISTORY OF NERVOUS HEART | 0000 NO; 0001 YES, DEFINITE | CATEGORICAL |
| 6 | HISTORY OF PERICARDITIS | 0000 NO; 0001 YES, DEFINITE | CATEGORICAL |
| 7 | HISTORY OF OTHER CVD | 0000 NO; 0001 YES, DEFINITE | CATEGORICAL |
| 8 | PREMATURE BEATS | 0000 NO; 0001 YES, DEFINITE; 0002 YES, DOUBTFUL | CATEGORICAL |
| 9 | HISTORY OF ATRIOVENTRICULAR BLOCK | 0000 NO; 0001 YES, DEFINITE; 0002 YES, DOUBTFUL | CATEGORICAL |
| 10 | HISTORY OF RHEUMATIC FEVER | 0000 NONE; 0001 YES; 0008 DOUBTFUL | CATEGORICAL |
| 11 | HISTORY OF ALLERGY OR ASTHMA | 0000 NEGATIVE; 0001 ALLERGY, ALONE; 0002 BRONCHIAL ASTHMA, ALONE; 0003 ALLERGY AND ASTHMA, TOGETHER | CATEGORICAL |
| 12 | HISTORY OF THYROID DISEASE | 0000 NEGATIVE; 0001 HYPERTHYROID ONLY; 0002 HYPOTHYROID ONLY | CATEGORICAL |
| 13 | HISTORY OF SUBACUTE ENDOCARDITIS | 0000 NO; 0001 YES | CATEGORICAL |
| 14 | BLOOD PRESSURE SYSTOLIC | MM HG | CONTINUOUS |
| 15 | BLOOD PRESSURE DIASTOLIC | MM HG | CONTINUOUS |
| 16 | CIGARETTES PER DAY | LAPSE, FORM 8/50 | CONTINUOUS |
| 17 | CIGARS PER DAY | LAPSE, FORM 8/50 | CONTINUOUS |
| 18 | PIPES PER DAY | LAPSE, FORM 8/50 | CONTINUOUS |
| 19 | PULSE RATE | PER MINUTE | CONTINUOUS |
| 20 | DIABETES | 0000 NO; 0001 YES, DEFINITE | CATEGORICAL |
2.3. Statistical Analysis
Cox proportional hazards regression analysis [22], a semiparametric method and one of the most accurate of its kind, was selected for developing the proposed risk model. This research aims to develop a prediction model that uses multiple parameters to estimate the probability that an individual develops CVD. There are three main statistical approaches in survival analysis: nonparametric, semiparametric, and parametric [31]. The nonparametric approaches can only perform univariate analysis with a single predictor and are therefore not suitable for the study of continuous variables [22, 32]. Both parametric and semiparametric approaches can perform multiple-parameter analysis, and both assume that the predictors and the log hazard rate are linearly related [33]. However, the Cox proportional hazards model has the advantage that only the rank orderings of the failure and censoring times are used to estimate and test the regression coefficients [22]. The Cox model remains efficient even when the assumptions of the parametric models are met. When those assumptions are not met, Cox regression analysis can still be used efficiently with an extended Cox regression from [34], whereas a parametric model such as the Weibull survival distribution would no longer be valid.
Statistical analyses were performed in RStudio [35]. Missing values for the candidate risk factors listed in Table 2 were imputed using multiple imputation [36]. Continuous and categorical variables were transformed and imputed using algorithms modified from Maximum Generalized Variance (MGV) in the SAS PRINQUAL procedure [37]. The R function transcan in the “Hmisc” package was used [35].
Variable selection from the candidate predictors listed in Table 2 was performed in two steps. The first step was conducted in a “Forward Selection” manner [38]; i.e., univariate Cox analysis was applied to each candidate variable, and predictors with a p value > 0.05 were filtered out as insignificant. In the second step, all variables selected in the univariate analysis were entered into a multivariate Cox regression analysis to see how the risk factors jointly affect the incidence rate of CVD. Risk factors with a p value less than 0.05 were retained in the final model.
In the validation stage, two approaches were used to assess the predictive ability of the fitted model: statistical validation and empirical validation. The statistical validation was performed with respect to both discrimination and calibration. The empirical validation was defined as an empirical comparison with a general CVD risk prediction model (the Framingham office-based risk equation [6]) from a horizontal and a longitudinal perspective. The horizontal comparison was conducted against the Framingham prognostic model using data collected from multiple samples at the same time point. The longitudinal comparison was conducted against the Framingham prognostic model using data collected from specific samples at different time points (follow-up at fixed time intervals) to observe the risk trend for an individual over time.
3. Results
3.1. Derivation of a 10-Year Risk Score for CVD
The risk factors included in the risk model are age, sex, body mass index (BMI), hypertension, systolic blood pressure (SBP), cigarettes per day, pulse rate, and diabetes status. Their characteristics are listed in Table 3, which summarizes the “Min.”, “1st Qu.”, “Median”, “Mean”, “3rd Qu.”, and “Max.” of each risk factor.
Table 3.
Summary statistics for risk factors used in risk model.
| Predictors | Variables | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|
| AGE | Age | 28 | 37 | 44 | 44.15 | 51 | 74 |
| SEX | Sex | 1 | 1 | 2 | 1.548 | 2 | 2 |
| BMI | Bmi | 14.12 | 22.66 | 25.17 | 25.61 | 27.92 | 56.68 |
| HYPERTENSION | Hyp | 0 | 0 | 0 | 0.147 | 0 | 1 |
| BLOOD PRESSURE SYSTOLIC | Bps | 84 | 122 | 136 | 138.6 | 150 | 270 |
| CIGARETTES PER DAY | Cgrpd | 0 | 5 | 20 | 16.26 | 20 | 60 |
| PULSE RATE | Pr | 37 | 67 | 75 | 75.61 | 83 | 170 |
| DIABETES | Dia | 0 | 0 | 0 | 0.0197 | 0 | 1 |
The regression coefficients, hazard ratios, and their lower and upper 95% confidence intervals (CI) were estimated and are presented in Table 4. The baseline hazard at ten years was also estimated and is shown in Table 5. The 10-year baseline hazard is 0.1023354 at the mean values of all covariates and 0.001863652 with all covariates equal to zero. Correspondingly, the survival probability (exp(−basehaz)) is 0.9027267 at the mean values and 0.9981381 with all covariates equal to zero.
Table 4.
Regression coefficients and hazard ratios in risk model.
| Predictors | Variables | coef∗ | Hazard Ratio | lower .95 | upper .95 |
|---|---|---|---|---|---|
| AGE | log of age | 2.083643 | 8.033686 | 6.4082 | 10.0716 |
| SEX | sex | -0.469719 | 0.625178 | 0.5787 | 0.6754 |
| BMI | log of bmi | 0.608864 | 1.838342 | 1.4368 | 2.3521 |
| HYPERTENSION | hyp | 0.241461 | 1.273108 | 1.1342 | 1.429 |
| BLOOD PRESSURE SYSTOLIC | log of bps | 1.682571 | 5.37937 | 3.7938 | 7.6277 |
| CIGARETTES PER DAY | cgrpd | 0.009669 | 1.009716 | 1.0065 | 1.013 |
| PULSE RATE | log of pr | -0.30209 | 0.739271 | 0.5879 | 0.9297 |
| DIABETES | dia | 1.087501 | 2.96685 | 2.3244 | 3.7869 |
∗ Estimated regression coefficient.
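As a consistency check on Table 4, the sketch below recomputes each hazard ratio as exp(coef) and backs out an approximate standard error, Wald z statistic, and two-sided p value from the reported 95% confidence interval. It assumes the intervals are Wald-type (symmetric on the log scale), which the text does not state; the dictionary and function names are illustrative only.

```python
import math

# Table 4: (coef, lower .95, upper .95) for each predictor's hazard ratio.
TABLE4 = {
    "log(age)": (2.083643, 6.4082, 10.0716),
    "sex":      (-0.469719, 0.5787, 0.6754),
    "log(bmi)": (0.608864, 1.4368, 2.3521),
    "hyp":      (0.241461, 1.1342, 1.4290),
    "log(bps)": (1.682571, 3.7938, 7.6277),
    "cgrpd":    (0.009669, 1.0065, 1.0130),
    "log(pr)":  (-0.302090, 0.5879, 0.9297),
    "dia":      (1.087501, 2.3244, 3.7869),
}

Z95 = 1.959964  # two-sided 95% quantile of the standard normal


def wald_summary(coef, lower, upper):
    """Return (hazard ratio, approximate SE, z, two-sided p) for one row."""
    hazard_ratio = math.exp(coef)
    # ASSUMPTION: Wald interval, so log(upper) - log(lower) = 2 * Z95 * SE.
    se = (math.log(upper) - math.log(lower)) / (2 * Z95)
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return hazard_ratio, se, z, p


summary = {name: wald_summary(*row) for name, row in TABLE4.items()}
for name, (hr, se, z, p) in summary.items():
    print(f"{name:9s} HR={hr:8.4f}  SE~{se:.4f}  z={z:7.2f}  p={p:.3g}")
```

Under this assumption every predictor in Table 4 has p < 0.05, which agrees with the selection rule described in Section 2.3.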
Table 5.
Baseline hazard and survival at 10 years.
| | Covariates at mean value | Covariates equal to zero |
|---|---|---|
| Baseline hazard estimate | 0.1023354 | 0.001863652 |
| Baseline survival estimate | 0.9027267 | 0.9981381 |
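The two rows of Table 5 are linked by the identity S0(t) = exp(−H0(t)), provided the “baseline hazard estimate” is read as a cumulative hazard at 10 years (our reading; the text does not say so explicitly). A minimal check, with illustrative names:

```python
import math

# Table 5: (10-year baseline hazard estimate, reported baseline survival estimate).
TABLE5 = {
    "covariates at mean value": (0.1023354, 0.9027267),
    "covariates equal to zero": (0.001863652, 0.9981381),
}


def survival_from_cumulative_hazard(hazard):
    """S(t) = exp(-H(t)) for a cumulative hazard H(t)."""
    return math.exp(-hazard)


computed = {label: survival_from_cumulative_hazard(h) for label, (h, _) in TABLE5.items()}
for label, (_, reported) in TABLE5.items():
    print(f"{label}: computed {computed[label]:.7f}, reported {reported:.7f}")
```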
The Cox model has an exponential form (see Equation (1)), where t represents the time at which the event occurs; $\lambda(t)$ is the hazard function for a subject at time t, determined by a set of k covariates ($X_1, X_2, \ldots, X_k$); $\beta_1, \beta_2, \ldots, \beta_k$ are the regression coefficients that measure the effect size of the covariates; exp is the exponential function ($\exp(X) = e^{X}$); and $\lambda_0(t)$ is the baseline hazard rate, an arbitrary (unknown) function corresponding to the value of the hazard when all $X_i$ equal zero.
$$\lambda(t) = \lambda_0(t)\,\exp\left(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k\right) \tag{1}$$
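So, the Cox model can be written as a survival function, where $S(t)$ is the survival probability at time t and $S_0(t)$ is the baseline survival:
$$S(t) = S_0(t)^{\exp\left(\sum_{i=1}^{k} \beta_i X_i\right)} \tag{2}$$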
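The step from the hazard form (1) to the survival form (2) can be spelled out as follows. This is the standard survival-analysis identity rather than a derivation given in the text; $\Lambda_0(t)$ (the cumulative baseline hazard) is notation introduced here, with $S_0(t) = \exp(-\Lambda_0(t))$.

$$
\begin{aligned}
S(t) &= \exp\left(-\int_0^t \lambda(u)\,du\right)
      = \exp\left(-\exp\Big(\sum_{i=1}^{k}\beta_i X_i\Big)\int_0^t \lambda_0(u)\,du\right) \\
     &= \exp\left(-\Lambda_0(t)\,\exp\Big(\sum_{i=1}^{k}\beta_i X_i\Big)\right)
      = \Big[\exp\big(-\Lambda_0(t)\big)\Big]^{\exp\left(\sum_{i=1}^{k}\beta_i X_i\right)}
      = S_0(t)^{\exp\left(\sum_{i=1}^{k}\beta_i X_i\right)}
\end{aligned}
$$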
A general formula for computing risk estimates has the following form:
$$H(t) = 1 - S_0(t)^{\exp\left(\sum_{i=1}^{k} \beta_i X_i - \sum_{i=1}^{k} \beta_i \bar{X}_i\right)} \tag{3}$$
where $H(t)$ is the CVD risk estimated for an individual; $S_0(t)$ is the baseline survival rate at follow-up time t, where t = 10 years (see Table 5); $\beta_i$ is the regression coefficient (see Table 4); $X_i$ is the value of the ith risk factor (log-transformed for age, BMI, SBP, and pulse rate, as indicated in Table 4); $\bar{X}_i$ is the corresponding mean; and k denotes the number of risk factors. The CVD risk function can be derived from (3) using the regression coefficients from Table 4 and the baseline estimates from Table 5; hence, we computed the probability of developing any type of CVD for an individual. A worked example of computing the absolute 10-year risk score is given in Appendix C.
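The risk formula, risk = 1 − S0(10)^exp(Σβi·Xi − Σβi·X̄i), can be sketched in a few lines of Python using the coefficients of Table 4 and the baseline survival of Table 5. One input is an assumption: the text does not report the means of the log-transformed covariates, so the sketch substitutes the logarithm of the raw means from Table 3. For that reason its output need not reproduce the 12.57% reported for the Appendix C subject. All names are illustrative.

```python
import math

# Regression coefficients from Table 4, keyed by the variable names of Table 3.
COEF = {"age": 2.083643, "sex": -0.469719, "bmi": 0.608864, "hyp": 0.241461,
        "bps": 1.682571, "cgrpd": 0.009669, "pr": -0.302090, "dia": 1.087501}
LOG_TRANSFORMED = {"age", "bmi", "bps", "pr"}  # entered as natural logs (Table 4)
S0_10Y_AT_MEAN = 0.9027267                     # Table 5, covariates at mean value

# ASSUMPTION: raw means from Table 3; the log of each raw mean stands in for
# the (unreported) mean of the log-transformed covariate.
RAW_MEANS = {"age": 44.15, "sex": 1.548, "bmi": 25.61, "hyp": 0.147,
             "bps": 138.6, "cgrpd": 16.26, "pr": 75.61, "dia": 0.0197}


def linear_predictor(values):
    """Sum of beta_i * X_i, taking logs where Table 4 uses them."""
    total = 0.0
    for name, beta in COEF.items():
        x = values[name]
        if name in LOG_TRANSFORMED:
            x = math.log(x)
        total += beta * x
    return total


def ten_year_risk(subject, means=RAW_MEANS, s0=S0_10Y_AT_MEAN):
    """1 - S0 ** exp(sum(beta * x) - sum(beta * mean))."""
    centred = linear_predictor(subject) - linear_predictor(means)
    return 1.0 - s0 ** math.exp(centred)


# Subject from Appendix C (Table 12).
subject = {"age": 44, "sex": 1, "bmi": 26.38689413, "hyp": 0,
           "bps": 120, "cgrpd": 40, "pr": 82, "dia": 0}
risk = ten_year_risk(subject)
print(f"10-year risk under the stated assumptions: {risk:.1%}")
```

A subject whose covariates equal the means has a centred linear predictor of zero, so the function returns 1 − S0(10), which is a convenient sanity check on any implementation.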
3.2. Nomograms
A nomogram is a two-dimensional diagram that represents a mathematical function involving several predictors [39]. It is a simple graphical tool for approximately predicting a particular event, based on conventional statistical regression methods such as the Cox proportional hazards model for survival analysis [40]. A nomogram for estimating individual 10-year survival and median survival time (in years) is depicted in Figure 1.
Figure 1.
Nomogram for predicting overall survival in 10 years.
In Figure 1, each predictor has its own scale, and each of these scales maps onto the “Points” scale. The scales at the bottom give the corresponding 10-year survival estimate and the median survival time (years). By adding up the points that correspond to a patient's specific combination of covariates, a clinician can manually obtain the predicted value of the event for that patient.
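The mapping from predictor scales to the “Points” scale can be illustrated with a common nomogram construction: the predictor whose effect spans the widest range of the linear predictor is assigned 100 points, and all other scales are proportional to it. The sketch below is not a reconstruction of Figure 1. The axis ranges of the figure are not recoverable from the text, so the minima and maxima of Table 3 are used as stand-ins, and all names are illustrative.

```python
import math

# (coefficient from Table 4, min from Table 3, max from Table 3, log-transformed?)
PREDICTORS = {
    "age":   (2.083643, 28, 74, True),
    "sex":   (-0.469719, 1, 2, False),
    "bmi":   (0.608864, 14.12, 56.68, True),
    "hyp":   (0.241461, 0, 1, False),
    "bps":   (1.682571, 84, 270, True),
    "cgrpd": (0.009669, 0, 60, False),
    "pr":    (-0.302090, 37, 170, True),
    "dia":   (1.087501, 0, 1, False),
}


def effect_span(beta, low, high, log_scale):
    """Width of the predictor's contribution to the linear predictor."""
    if log_scale:
        low, high = math.log(low), math.log(high)
    return abs(beta) * (high - low)


spans = {name: effect_span(*spec) for name, spec in PREDICTORS.items()}
largest = max(spans.values())
# Maximum points per predictor: the widest effect is scaled to 100.
max_points = {name: 100.0 * span / largest for name, span in spans.items()}


def points(name, value):
    """Points for one predictor value, measured from its lowest-risk end."""
    beta, low, high, log_scale = PREDICTORS[name]
    transform = math.log if log_scale else (lambda v: v)
    start = low if beta > 0 else high  # the lowest-risk end scores 0 points
    return 100.0 * beta * (transform(value) - transform(start)) / largest


for name in PREDICTORS:
    print(f"{name:6s} max points = {max_points[name]:6.1f}")
```

Summing `points(...)` over a patient's covariates gives a total that is a linear rescaling of the Cox linear predictor, which is why a total-points axis can be read off directly against survival.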
3.3. Validation
The proposed predictive risk model was validated using traditional statistics. The C-index (also called the receiver operating characteristic (ROC) area) [41] was used to assess the goodness of the risk model, based on bootstrap internal resampling validation. The statistical validation yielded a C-index (area under the receiver operating characteristic curve [AUROC]) of 0.71, indicating moderately good discrimination.
We then performed an empirical validation by comparing our risk model with the Framingham Heart Study model on an external dataset, both horizontally and longitudinally over time. In the horizontal validation, the external dataset contained 2786 samples, of which 1693 had a CVD event. Risk scores were computed separately using the FHS model and the proposed risk model. The min (lower whisker), 1st quartile (lower hinge), median, 3rd quartile (upper hinge), and max (extreme of the upper whisker) of the estimated risks for all samples are depicted in Figure 2. This box-and-whisker plot shows that the risks assessed by our Cox model are higher than those calculated by the Framingham model, but the difference for each summary statistic (min, 1st Qu., median, mean, 3rd Qu., and max) is approximately 0.02. For example, the median values of the FHS model and the Cox model are 0.1429475 and 0.1661985, respectively. For subjects with a CVD event, the Cox model is much more accurate than the FHS model, whereas for subjects without CVD, the Cox risk model overestimates the risk. Overall, the risk scale of the Cox model is consistent with that of the Framingham model, which indicates that the proposed Cox model is on par with the FHS model.
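For readers unfamiliar with the C-index, the sketch below computes Harrell's concordance for right-censored data: among all pairs in which the subject with the shorter follow-up time had an event, it counts the fraction in which that subject also received the higher predicted risk. The five-subject data set is hypothetical and purely illustrative; it is not taken from the Framingham data, and the bootstrap resampling step used in the study is not reproduced here.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C for right-censored data.

    A pair (i, j) is comparable when subject i has the shorter time and
    experienced the event. Ties in predicted risk count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: follow-up years, event indicator (1 = CVD event,
# 0 = censored), and a model's predicted risk for each subject.
toy_times = [2, 4, 6, 8, 10]
toy_events = [1, 1, 0, 1, 0]
toy_risks = [0.9, 0.4, 0.5, 0.3, 0.1]

c_index = concordance_index(toy_times, toy_events, toy_risks)
print(f"C-index on the toy data: {c_index:.3f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported 0.71 should be read.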
Figure 2.

Horizontal comparison between Cox model and FHS model.
In the longitudinal validation process, we selected four subjects (two male and two female), with or without CVD at the end of the Framingham Study. These four subjects, used for the longitudinal validation of the predicted CVD event, are summarized in Table 6.
Table 6.
Data summary for samples in the longitudinal validation.
| Samples | Gender | CVD | Diabetes |
|---|---|---|---|
| Sample 1 | Male | ✘ | ✘ |
| Sample 2 | Male | ✓ | ✓ |
| Sample 3 | Female | ✘ | ✘ |
| Sample 4 | Female | ✓ | ✓ |
For each sample, data at fixed time intervals (approximately two years) were extracted from the longitudinal follow-up. The data from five exams (Exam 8, Exam 9, Exam 10, Exam 11, and Exam 12) were extracted for comparison. Data summaries for sample 1, sample 2, sample 3, and sample 4 are listed in Appendix B. For each sample, the 10-year risk of developing CVD was computed separately from the data of each of the five selected exams, using both the Cox model and the Framingham model. The trend of risk over the years, with 5% error, is depicted in Figure 3. The figure shows that the risk trends of the two models are consistent and that the risk for a given sample increases over time; the dotted trend lines in each graph represent the increase in CVD risk over time. In addition, the samples (both male and female) with diabetes that developed CVD have a higher risk than those that did not develop CVD.
Figure 3.
Longitudinal validation.
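The longitudinal comparison can be partly illustrated without the baseline survival or the covariate means, because the hazard of one exam relative to another depends only on the difference in linear predictors. The sketch below applies the Table 4 coefficients to the Sample 1 data of Table 8 and expresses each exam's hazard relative to Exam 8. It shows relative hazards, not the absolute risks plotted in Figure 3, and the names are illustrative.

```python
import math

# Regression coefficients from Table 4.
COEF = {"age": 2.083643, "sex": -0.469719, "bmi": 0.608864, "hyp": 0.241461,
        "bps": 1.682571, "cgrpd": 0.009669, "pr": -0.302090, "dia": 1.087501}
SEX_MALE = 1  # coding from Table 2

# Table 8 (Sample 1: male without CVD): exam -> (age, bmi, bps, pr, cgrpd, hyp, dia)
SAMPLE1 = {
    "Exam 8":  (44, 26.386894, 120, 82, 40, 0, 0),
    "Exam 9":  (45, 26.826676, 120, 80, 0, 0, 0),
    "Exam 10": (47, 27.467643, 118, 70, 20, 0, 0),
    "Exam 11": (49, 28.222249, 110, 76, 44, 0, 0),
    "Exam 12": (52, 28.675012, 110, 80, 50, 0, 0),
}


def linear_predictor(age, bmi, bps, pr, cgrpd, hyp, dia, sex):
    """Sum of beta_i * X_i with age, BMI, SBP and pulse rate log-transformed."""
    return (COEF["age"] * math.log(age)
            + COEF["sex"] * sex
            + COEF["bmi"] * math.log(bmi)
            + COEF["hyp"] * hyp
            + COEF["bps"] * math.log(bps)
            + COEF["cgrpd"] * cgrpd
            + COEF["pr"] * math.log(pr)
            + COEF["dia"] * dia)


lp = {exam: linear_predictor(*row, sex=SEX_MALE) for exam, row in SAMPLE1.items()}
reference = lp["Exam 8"]
# Hazard at each exam relative to Exam 8: exp(lp_exam - lp_exam8).
relative_hazard = {exam: math.exp(value - reference) for exam, value in lp.items()}

for exam, ratio in relative_hazard.items():
    print(f"{exam:8s} relative hazard vs Exam 8 = {ratio:.3f}")
```

Because the reported cigarettes per day drop to 0 at Exam 9 for this subject, the per-exam values fluctuate; the comparison between the first and last exam is the more informative one for the overall trend.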
4. Discussion
It is widely accepted that CVD has become one of the most significant public health issues worldwide [42, 43] and contributes substantially to annual deaths. Previous studies have noted the importance of identifying associated risk factors and of the early detection and intervention of CVDs [44–48], and have investigated ways of reducing the risk of developing CVD at early stages. Consequently, CVD risk prediction tools based on a single variable or multiple variables have been devised to yield estimates of CVD risk [6, 8, 9, 14, 49–51].
Motivated by the objective of early detection and risk estimation of CVD, the present study was designed to identify novel CVD risk factors, determine the effect of these factors, and then develop a risk prediction model based on the identified factors. Although risk factors could vary from one specific CVD component to another, there is sufficient evidence that different types of CVD have commonalities of risk factors. We developed and validated a 10-year risk equation for CVD risk using follow-up data rigorously measured by the Framingham Heart Study.
This investigation extends the set of risk factors used in previous general CVD risk formulations by incorporating heart rate to estimate absolute CVD risk. The approach used in this research is based on advanced statistical techniques that reduce bias in the assessment of true CVD risk. The whole data analysis process strictly follows the guidelines of regression modelling strategies and survival analysis [34, 52].
We used continuous variables (age, BMI, SBP, and pulse rate) to generate a model that performs better than similar models developed using categorical variables. Compared with simpler approaches to 5-year and 10-year risk modelling, such as the model based on logistic regression analysis [53] and the CVD risk model using the Kaplan-Meier and log-rank tests [46], the proposed Cox risk model is more adequate and avoids severe errors of underestimation or overestimation [22, 34]. Moreover, this model was developed from a larger number of samples and events, suggesting a valid estimate of the real risk.
4.1. Comparison with Other CVD Risk Prediction Tools
The older Framingham general CVD risk function [53] is useful for identifying persons at high risk of CVD, but it was based on a limited number of risk factors (serum cholesterol, SBP, smoking history, electrocardiogram, and glucose intolerance). The newer Framingham laboratory-test-based formula [6] included HDL cholesterol in the risk function. The QRISK study investigators incorporated family history as a novel risk factor in addition to those in the Framingham general formulas [8]. Although researchers have published risk scores [6, 8, 53] for predicting general CVD, these functions did not include heart rate in the risk model.
Risk models formulated using machine learning or data mining techniques have incorporated heart rate as a risk factor, but few of these tools can predict absolute CVD risk. For example, one prediction tool [54] focuses on classifying CVD events by employing an ANN and a Bayesian classifier based on heart rate variability. The diagnostic CVD model [27] categorizes CVD risk into different levels, but an absolute risk score cannot be obtained. A supportive tool [19] does generate an estimate of a risk score, but the user cannot know the time horizon (in years) to which the score applies.
Some equations focus only on specific CVD outcomes. The European SCORE project equations were developed for fatal cardiovascular events [9]. Other risk estimation tools [7, 14, 30] apply only to coronary heart disease, and some risk models are aimed at stroke [16, 55]. In contrast to these disease-specific models, which estimate the risk of developing specific CVD outcomes, the present study generated a general CVD risk tool that can predict global CVD risk as well as the risk of developing individual components.
Moreover, compared with laboratory-based algorithms, the present research proposes a more straightforward way to estimate 10-year CVD risk based on risk factors. Individuals can assess their CVD risk during an office visit or by monitoring the combination of risk factors in the risk model themselves, either manually or using devices such as wearable sensors.
4.2. Implication
The CVD risk prediction model could be implemented in primary care for population analysis and for identifying high-risk individuals. This would transform the healthcare management of CVD at both the individual and the population level. However, because the number of diabetes events is small, this risk model must be applied with caution. Although we used multiple imputation to fill in the missing values for diabetes, the original data are imbalanced, so the imputed “diabetes” variable may still be imbalanced. More advanced imputation methods should be considered in the future to avoid unexpected outcomes caused by this imbalance in the diabetes data.
Our research aims to provide a CVD prediction model based on key risk factors so that it can be used at the point of care for better-informed decision making. Thus, risk factors that require a clinical test, such as total cholesterol and HDL cholesterol, were not included, although some of these risk factors have a substantial effect on the development of CVD. We have provided a valid framework for creating a risk model using Cox regression; future work should consider risk factors not currently included in our model. Expanding the set of predictors in the risk model is therefore an important issue for future research.
5. Conclusion
This study devised a risk prediction model based on multivariable predictors. A novel risk factor, “heart rate”, was incorporated into the risk equation alongside conventional risk factors. A satisfactory predictive ability, with a C-index (AUROC) of 0.71, was obtained, which supports the accuracy of the estimated risk scores. In contrast to studies that focus on specific diseases, the proposed algorithm can be applied to measure the 10-year risk of general CVD. Health care professionals, public health physicians, practice managers, and individuals can run the proposed model to quantify risk at a population level or during patient consultation and to identify high-risk individuals across the entire practice for further preventive health care.
Appendix
A. Exams in the Framingham Original Cohort Study Dataset
See Table 7.
B. Data Summary for Samples
Table 8.
Exam data for Sample 1: male without CVD.
| Exams | age | bmi | bps | pr | cgrpd | trt | hyp | dia | smk |
|---|---|---|---|---|---|---|---|---|---|
| Exam 8 | 44 | 26.386894 | 120 | 82 | 40 | 0 | 0 | 0 | 1 |
| Exam 9 | 45 | 26.826676 | 120 | 80 | 0 | 0 | 0 | 0 | 0 |
| Exam 10 | 47 | 27.467643 | 118 | 70 | 20 | 0 | 0 | 0 | 1 |
| Exam 11 | 49 | 28.222249 | 110 | 76 | 44 | 0 | 0 | 0 | 1 |
| Exam 12 | 52 | 28.675012 | 110 | 80 | 50 | 0 | 0 | 0 | 1 |
Table 9.
Exam data for Sample 2: male with CVD and diabetes.
| Exams | age | bmi | bps | pr | cgrpd | trt | hyp | dia | smk |
|---|---|---|---|---|---|---|---|---|---|
| Exam 8 | 45 | 27.74258 | 132 | 83 | 20 | 0 | 0 | 0 | 1 |
| Exam 9 | 47 | 26.26118 | 124 | 80 | 20 | 0 | 0 | 0 | 1 |
| Exam 10 | 49 | 27.664352 | 130 | 78 | 20 | 0 | 1 | 0 | 1 |
| Exam 11 | 51 | 27.121914 | 130 | 90 | 20 | 0 | 1 | 0 | 1 |
| Exam 12 | 53 | 24.816551 | 122 | 82 | 20 | 0 | 0 | 1 | 1 |
Table 10.
Exam data for Sample 3: female without CVD.
| Exams | age | bmi | bps | pr | cgrpd | trt | hyp | dia | smk |
|---|---|---|---|---|---|---|---|---|---|
| Exam 8 | 44 | 20.776333 | 110 | 70 | 20 | 0 | 0 | 0 | 1 |
| Exam 9 | 46 | 20.265439 | 120 | 70 | 20 | 0 | 0 | 0 | 1 |
| Exam 10 | 48 | 22.312012 | 118 | 73 | 20 | 0 | 0 | 0 | 1 |
| Exam 11 | 50 | 21.797119 | 114 | 82 | 20 | 0 | 0 | 0 | 1 |
| Exam 12 | 52 | 21.797119 | 130 | 76 | 20 | 0 | 0 | 0 | 1 |
Table 11.
Exam data for Sample 4: female with CVD and diabetes.
| Exams | age | bmi | bps | pr | cgrpd | trt | hyp | dia | smk |
|---|---|---|---|---|---|---|---|---|---|
| Exam 8 | 46 | 21.793044 | 130 | 65 | 3 | 0 | 1 | 0 | 1 |
| Exam 9 | 48 | 21.967388 | 170 | 75 | 16 | 0 | 1 | 0 | 1 |
| Exam 10 | 50 | 22.494583 | 140 | 60 | 8 | 0 | 1 | 0 | 1 |
| Exam 11 | 53 | 22.31746 | 140 | 63 | 8 | 0 | 1 | 0 | 1 |
| Exam 12 | 54 | 23.380197 | 160 | 58 | 2 | 1 | 1 | 1 | 1 |
C. Computation of Absolute Risk
Here, we take a specific subject to illustrate the process of risk score calculation. The subject is a 44-year-old man without diabetes or hypertension. He has a systolic blood pressure of 120 mm Hg, a pulse rate of 82 per minute, and a BMI of 26.38689413 kg/m2, and he is a current smoker who smokes 40 cigarettes per day, as shown in Table 12.
Table 12.
Data summary for the subject 15018644.
| PREDICTORS | VALUES | UNITS |
|---|---|---|
| AGE | 44 | YEARS |
| SEX | 1 | MALE |
| BMI | 26.38689413 | KG/M2 |
| HYPERTENSION | 0 | NO |
| TREATMENT OF HYPERTENSION | 0 | NO |
| BLOOD PRESSURE SYSTOLIC | 120 | MM HG |
| CIGARETTES PER DAY | 40 | LAPSE |
| SMOKING | 1 | YES |
| PULSE RATE | 82 | PER MINUTE |
| DIABETES | 0 | NO |
| COX MODEL RISK | 12.57% | |
| FHS MODEL RISK | 11.86% | |
The risk estimate based on the Cox model is calculated as follows:
| (C.1) |
| (C.2) |
| (C.3) |
Data Availability
The cardiovascular disease (CVD) data used to support the findings of this study were supplied by Framingham Heart Study-Cohort (FHS-Cohort) under license and so cannot be made freely available. Requests for access to these data should be made with Open BioLINCC Studies Group through this website https://biolincc.nhlbi.nih.gov/studies/framcohort/.
Additional Points
The main contribution of the present study is the development of a risk prediction model for the early detection of CVD. More specifically, the contribution can be summarized in four respects: firstly, a novel risk factor, “heart rate”, was identified as significant for the development of CVD; secondly, a CVD risk prediction model aimed at the early detection of CVD was developed based on various risk factors; thirdly, an absolute 10-year CVD risk score can be calculated using this risk model; lastly, multiple forms of CVD risk estimation, namely a risk equation and a nomogram, were developed.
Conflicts of Interest
The authors declare no conflicts of interest.
Authors' Contributions
All authors contributed equally.
References
- 1. Mendis S., Puska P., Norrving B., et al. Global Atlas on Cardiovascular Disease Prevention and Control. World Health Organization; 2011.
- 2. Mozaffarian D., Benjamin E. J., Go A. S., et al. Heart disease and stroke statistics update: a report from the American Heart Association. Circulation. 2015;131(4):e29–e322. doi: 10.1161/CIR.0000000000000152.
- 3. Chan W. C., Wright C., Riddell T., et al. Ethnic and socioeconomic disparities in the prevalence of cardiovascular disease in New Zealand. The New Zealand Medical Journal. 2008;121(1285).
- 4. Heart Foundation. General heart statistics in New Zealand. Heart Foundation; 2017. https://www.heartfoundation.org.nz/statistics.
- 5. McGill H. C., McMahan C. A., Gidding S. S. Preventing heart disease in the 21st century: implications of the pathobiological determinants of atherosclerosis in youth (PDAY) study. Circulation. 2008;117(9):1216–1227. doi: 10.1161/circulationaha.107.717033.
- 6. D'Agostino R. B., Sr., Vasan R. S., Pencina M. J., et al. General cardiovascular risk profile for use in primary care: the Framingham heart study. Circulation. 2008;117(6):743–753. doi: 10.1161/CIRCULATIONAHA.107.699579.
- 7. Lloyd-Jones D. M., Wilson P. W. F., Larson M. G., et al. Framingham risk score and prediction of lifetime risk for coronary heart disease. American Journal of Cardiology. 2004;94(1):20–24. doi: 10.1016/j.amjcard.2004.03.023.
- 8. Hippisley-Cox J., Coupland C., Vinogradova Y., Robson J., May M., Brindle P. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. British Medical Journal. 2007;335(7611):136–141. doi: 10.1136/bmj.39261.471806.55.
- 9. Conroy R. M., Pyörälä K., Fitzgerald A. P., et al. Estimation of ten-year risk of fatal cardiovascular disease in Europe: the SCORE project. European Heart Journal. 2003;24(11):987–1003. doi: 10.1016/S0195-668X(03)00114-3.
- 10. Woodward M., Brindle P., Tunstall-Pedoe H. Adding social deprivation and family history to cardiovascular risk assessment: the ASSIGN score from the Scottish Heart Health Extended Cohort (SHHEC). Heart. 2007;93(2):172–176. doi: 10.1136/hrt.2006.108167.
- 11. Assmann G., Cullen P., Schulte H. Simple scoring scheme for calculating the risk of acute coronary events based on the 10-year follow-up of the Prospective Cardiovascular Münster (PROCAM) study. Circulation. 2002;105(3):310–315. doi: 10.1161/hc0302.102575.
- 12. Ferrario M., Chiodini P., Chambless L. E., et al. Prediction of coronary events in a low incidence population. Assessing accuracy of the CUORE Cohort Study prediction equation. International Journal of Epidemiology. 2005;34(2):413–421. doi: 10.1093/ije/dyh405.
- 13. Wells S., Riddell T., Kerr A., et al. Cohort profile: the PREDICT cardiovascular disease cohort in New Zealand primary care (PREDICT-CVD 19). International Journal of Epidemiology. 2017;46(1):22–22. doi: 10.1093/ije/dyv312.
- 14. Wilson P. W. F., D'Agostino R. B., Levy D., Belanger A. M., Silbershatz H., Kannel W. B. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97(18):1837–1847. doi: 10.1161/01.CIR.97.18.1837.
- 15. Cardiovascular Disease Risk Assessment Steering Group and others. New Zealand primary care handbook 2012. Wellington: Ministry of Health; 2013 (2017).
- 16. Yu J., Dai L., Zhao Q., et al. Association of cumulative exposure to resting heart rate with risk of stroke in general population: the Kailuan cohort study. Journal of Stroke and Cerebrovascular Diseases. 2017;26(11):2501–2509. doi: 10.1016/j.jstrokecerebrovasdis.2017.05.037.
- 17. Han K. H., Park K. C., Kim M. J., Kim Y. S., Chun H. Association between heart rate variability and 10-year atherosclerotic cardiovascular disease risk score. Atherosclerosis. 2017;263:e190–e191. doi: 10.1016/j.atherosclerosis.2017.06.611.
- 18. Murukesan L., Murugappan M., Iqbal M., Saravanan K. Machine learning approach for sudden cardiac arrest prediction based on optimal heart rate variability features. Journal of Medical Imaging and Health Informatics. 2014;4(4):521–532. doi: 10.1166/jmihi.2014.1287.
- 19. Unnikrishnan P., Kumar D. K., Poosapadi Arjunan S., Kumar H., Mitchell P., Kawasaki R. Development of health parameter model for risk prediction of CVD using SVM. Computational and Mathematical Methods in Medicine. 2016;2016:7. doi: 10.1155/2016/3016245.
- 20. Cannon A. Reliability Data Banks. Springer Science & Business Media; 2012.
- 21. Kaplan E. L., Meier P. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association. 1958;53(282):457–481. doi: 10.1080/01621459.1958.10501452.
- 22. Cox D. R. Breakthroughs in Statistics. New York, NY, USA: Springer; 1992. Regression models and life-tables; pp. 527–541. (Springer Series in Statistics).
- 23. Hachesu P. R., Ahmadi M., Alizadeh S., Sadoughi F. Use of data mining techniques to determine and predict length of stay of cardiac patients. Health Informatics Journal. 2013;19(2):121–129. doi: 10.4258/hir.2013.19.2.121.
- 24. Kim J., Lee J., Lee Y. Data-mining-based coronary heart disease risk prediction model using fuzzy logic and decision tree. Health Informatics Journal. 2015;21(3):167–174. doi: 10.4258/hir.2015.21.3.167.
- 25. Kumari M., Godara S. Comparative study of data mining classification methods in cardiovascular disease prediction. Semantic Scholar. 2011.
- 26. Melillo P., Izzo R., Orrico A., et al. Automatic prediction of cardiovascular and cerebrovascular events using heart rate variability analysis. PLoS ONE. 2015;10(3):e0118504. doi: 10.1371/journal.pone.0118504.
- 27. Vaanathi S. Cardiovascular disease prediction using fuzzy logic expert system. IUP Journal of Computer Sciences. 2017;11(3).
- 28. Dawber T. R., Kannel W. B., Lyell L. P. An approach to longitudinal studies in a community: the Framingham Study. Annals of the New York Academy of Sciences. 1963;107(1):539–556. doi: 10.1111/j.1749-6632.1963.tb13299.x.
- 29. Kannel W. B., Feinleib M., McNamara P. M., Garrison R. J., Castelli W. P. An investigation of coronary heart disease in families: the Framingham Offspring Study. American Journal of Epidemiology. 1979;110(3):281–290. doi: 10.1093/oxfordjournals.aje.a112813.
- 30. Eckel R. H., Barouch W. W., Ershow A. G. Report of the National Heart, Lung, and Blood Institute-National Institute of Diabetes and Digestive and Kidney Diseases working group on the pathophysiology of obesity-associated cardiovascular disease. Circulation. 2002;105(24):2923–2928. doi: 10.1161/01.cir.0000017823.53114.4c.
- 31. Lee E. T., Wang J. Statistical Methods for Survival Data Analysis. Vol. 476. John Wiley & Sons; 2003.
- 32. Mantel N. Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemotherapy Reports. 1966;50(3):163–170.
- 33. Efron B. The efficiency of Cox's likelihood function for censored data. Journal of the American Statistical Association. 1977;72(359):557–565. doi: 10.1080/01621459.1977.10480613.
- 34. Harrell F. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer; 2015.
- 35. Ihaka R., Gentleman R. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996;5(3):299–314.
- 36. Van Buuren S. Flexible Imputation of Missing Data. CRC Press; 2012.
- 37. Kuhfeld W. F. The prinqual procedure, SAS/STAT Users Guide 2. pp. 1265–1323. 1990.
- 38. Chong I.-G., Jun C.-H. Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems. 2005;78(1-2):103–112. doi: 10.1016/j.chemolab.2004.12.011.
- 39. Kattan M. W. Nomograms are superior to staging and risk grouping systems for identifying high-risk patients: preoperative application in prostate cancer. Current Opinion in Urology. 2003;13(2):111–116. doi: 10.1097/00042307-200303000-00005.
- 40. Kattan M. W., Kantoff P. W., Kattan M., et al. Comparison of Cox regression with other methods for determining prediction models and nomograms. The Journal of Urology. 2003;170(6):S6–S10. doi: 10.1097/01.ju.0000094764.56269.2d.
- 41. Hanley J. A., McNeil B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29–36. doi: 10.1148/radiology.143.1.7063747.
- 42. Lopez A. D., Mathers C. D., Ezzati M., Jamison D. T., Murray C. J. Global and regional burden of disease and risk factors, 2001: systematic analysis of population health data. The Lancet. 2006;367(9524):1747–1757. doi: 10.1016/S0140-6736(06)68770-9.
- 43. Hay D. S. Cardiovascular Disease in New Zealand, 2004: A Summary of Recent Statistical Information. National Heart Foundation of New Zealand; 2004.
- 44. Hubert H. B., Feinleib M., McNamara P. M., Castelli W. P. Obesity as an independent risk factor for cardiovascular disease: a 26-year follow-up of participants in the Framingham Heart Study. Circulation. 1983;67(5):968–977. doi: 10.1161/01.cir.67.5.968.
- 45. Cupples L. Some risk factors related to the annual incidence of cardiovascular disease and death using pooled repeated biennial measurements. Framingham Heart Study. 1987.
- 46. Weiner D. E., Tighiouart H., Amin M. G., et al. Chronic kidney disease as a risk factor for cardiovascular disease and all-cause mortality: a pooled analysis of community-based studies. Journal of the American Society of Nephrology. 2004;15(5):1307–1315. doi: 10.1097/01.asn.0000123691.46138.e2.
- 47. Böhm M., Swedberg K., Komajda M., et al. Heart rate as a risk factor in chronic heart failure (SHIFT): the association between heart rate and outcomes in a randomised placebo-controlled trial. The Lancet. 2010;376(9744):886–894. doi: 10.1016/S0140-6736(10)61259-7.
- 48. Odden M. C., Shlipak M. G., Whitson H. E., et al. Risk factors for cardiovascular disease across the spectrum of older age: the Cardiovascular Health Study. Atherosclerosis. 2014;237(1):336–342. doi: 10.1016/j.atherosclerosis.2014.09.012.
- 49. De Ruijter W., Westendorp R. G. J., Assendelft W. J. J., et al. Use of Framingham risk score and new biomarkers to predict cardiovascular mortality in older people: population based observational cohort study. BMJ. 2009;338(7688):219–222. doi: 10.1136/bmj.a3083.
- 50. Pencina M. J., D'Agostino R. B., Larson M. G., Massaro J. M., Vasan R. S. Predicting the 30-year risk of cardiovascular disease: the Framingham Heart Study. Circulation. 2009;119(24):3078–3084. doi: 10.1161/CIRCULATIONAHA.108.816694.
- 51. Bannink L., Wells S., Broad J., Riddell T., Jackson R. Web-based assessment of cardiovascular disease risk in routine primary care practice in New Zealand: the first 18,000 patients (PREDICT CVD-1). The New Zealand Medical Journal. 2006;119(1245).
- 52. Kleinbaum D. G., Klein M. Survival Analysis. Vol. 3. Springer; 2010.
- 53. Kannel W. B., McGee D., Gordon T. A general cardiovascular risk profile: the Framingham study. American Journal of Cardiology. 1976;38(1):46–51. doi: 10.1016/0002-9149(76)90061-8.
- 54. Kim H., Ishag M. I., Piao M., Kwon T., Ryu K. H. A data mining approach for cardiovascular disease diagnosis using heart rate variability and images of carotid arteries. Symmetry. 2016;8(6, article 47). doi: 10.3390/sym8060047.
- 55. Parmar P., Krishnamurthi R., Ikram M. A., et al. The Stroke Riskometer™ app: validation of a data collection tool and stroke risk predictor. International Journal of Stroke. 2015;10(2):231–244. doi: 10.1111/ijs.12411.
Data Availability Statement
The cardiovascular disease (CVD) data used to support the findings of this study were supplied by the Framingham Heart Study-Cohort (FHS-Cohort) under license and so cannot be made freely available. Requests for access to these data should be made to the Open BioLINCC Studies Group at https://biolincc.nhlbi.nih.gov/studies/framcohort/.


