PLOS One. 2024 Mar 14;19(3):e0300201. doi: 10.1371/journal.pone.0300201

Machine learning-based models to predict the conversion of normal blood pressure to hypertension within 5-year follow-up

Aref Andishgar 1, Sina Bazmi 2, Reza Tabrizi 3,*, Maziyar Rismani 2, Omid Keshavarzian 4, Babak Pezeshki 5, Fariba Ahmadizar 6
Editor: Amir Hossein Behnoush 7
PMCID: PMC10939282  PMID: 38483860

Abstract

Background

Factors contributing to the development of hypertension exhibit significant variations across countries and regions. Our objective was to predict individuals at risk of developing hypertension within a 5-year period in a rural Middle Eastern area.

Methods

This longitudinal study utilized data from the Fasa Adults Cohort Study (FACS). The study initially included 10,118 participants aged 35–70 years in rural districts of Fasa, Iran, with a follow-up of 3,000 participants after 5 years using random sampling. A total of 160 variables were included in the machine learning (ML) models, and feature scaling and one-hot encoding were employed for data processing. Ten supervised ML algorithms were utilized, namely logistic regression (LR), support vector machine (SVM), random forest (RF), Gaussian naive Bayes (GNB), linear discriminant analysis (LDA), k-nearest neighbors (KNN), gradient boosting machine (GBM), extreme gradient boosting (XGB), CatBoost (CAT), and light gradient boosting machine (LGBM). Hyperparameter tuning was performed using various combinations of hyperparameters to identify the optimal model. The Synthetic Minority Over-sampling Technique (SMOTE) was used to balance the training data, and feature selection was conducted using SHapley Additive exPlanations (SHAP).

Results

Out of 2,288 participants who met the criteria, 251 individuals (10.9%) were diagnosed with new hypertension. The LGBM model (determined to be the optimal model) with the top 30 features achieved an AUC of 0.67, an f1-score of 0.23, and an AUC-PR of 0.26. The top three predictors of hypertension were baseline systolic blood pressure (SBP), gender, and waist-to-hip ratio (WHR), with AUCs of 0.66, 0.58, and 0.63, respectively. Hematuria in urine tests and family history of hypertension ranked fourth and fifth.

Conclusion

ML models have the potential to be valuable decision-making tools in evaluating the need for early lifestyle modification or medical intervention in individuals at risk of developing hypertension.

Introduction

Hypertension, a prevalent chronic multifactorial disease, remains a significant challenge in the modern world [1]. In 2021, the World Health Organization estimated that approximately one-third of the global population has hypertension, with two-thirds of those affected living in low- and middle-income countries [2]. Despite advancements in diagnosis and treatment, the prevalence of hypertension in these countries continues to rise [3]. Iran, for instance, reports a 25% prevalence of hypertension [4]. According to World Health Organization reports, hypertension contributes to an annual toll of 9.4 million deaths. In low- and middle-income countries, hypertension was responsible for approximately 8.5 million deaths in 2015, accounting for 88% of global hypertension-related mortality [5]. Specifically, hypertension stands as a primary cause of mortality in the Middle East [6]. Referred to as a silent killer, hypertension often becomes apparent only at a dangerous stage, when it leads to events such as heart attacks or strokes [7]. Despite being controllable through cost-effective medications and timely interventions [8], many hypertensive patients remain undiagnosed due to insufficient awareness of screening and risk factors [9]. Moreover, care episodes for hypertension in low- to middle-income countries incur costs ranging from $500 to $1500, with monthly treatment expenses averaging around $22 [10], and hypertension commonly develops among middle-aged individuals, impacting productivity and imposing additional burdens on economic systems [11]. Given the high costs and complications associated with this chronic disease, studies have aimed to estimate hypertension risks for more effective prevention and management of complications [1]. Among the most renowned risk assessment tools is the Framingham Risk Score for predicting cardiovascular diseases [12]. However, such models lack sufficient diversity in encompassing different ethnicities, necessitating the continued development of tailored risk prediction models for specific populations [13].

Machine learning (ML), an integral component of artificial intelligence (AI), has gained significant traction in recent years due to its superior performance in risk classification tools compared to conventional statistical techniques [14]. This technology enables computers to learn without direct programming and adeptly analyze intricate data interactions [15]. Typically, ML surpasses traditional statistical methods by reducing bias, autonomously handling missing variables with minimal intervention in original data, managing distorted variables, and ensuring balanced data, thereby yielding superior outcomes [15]. Furthermore, ML models have the capacity to represent nonlinear relationships and enhance overall predictive accuracy [16]. Consequently, ML methods serve as a valuable tool for automating disease prediction [15].

While the precise origins of hypertension remain elusive, factors such as genetics, excessive salt intake, reduced physical activity, and obesity are known contributors to its progression [17]. These factors, along with variables such as educational level and income, vary significantly across countries and regions [8], underscoring the need for further research to develop location-specific risk assessment tools. Numerous studies have sought to predict hypertension using AI-based ML models. However, the data from these studies have been primarily cross-sectional, and there is no evidence indicating the successful implementation of these algorithms in clinical settings in rural areas of the Middle East. Additionally, to date, no longitudinal hypertension prediction model has been established for the general population of these regions.

In this investigation, we aim to assess and compare the efficacy of various ML methods on a longitudinal rural Middle Eastern dataset to forecast which individuals are susceptible to developing hypertension within a 5-year span, thereby identifying individuals with a higher probability of benefiting from treatment. We compare ten ML techniques to derive the optimal model for predicting hypertension risk. The models are assessed with multiple metrics, using a range of validation techniques and evaluation criteria.

Methods

1. Data source

This is a retrospective longitudinal study based on the Fasa Adults Cohort Study (FACS) data. The FACS has 10,118 participants aged 35–70 years in the Sheshdeh and Qarabolagh districts of Fasa city. FACS was created to assess the risk factors that predispose Fasa’s rural residents to Non-Communicable Diseases (NCDs), including cardiovascular diseases. FACS enrollment began in October 2014 and ended in September 2016 in an area with 84% rural residents. Since September 2021, when the fourth follow-up was completed, the cohort study has entered the re-evaluation phase of the same variables as the registration phase, with 3000 of the first-phase participants scheduled to participate. Random sampling was used to select participants for this phase of the study. The re-evaluation phase includes all of the steps taken, clinical examinations performed, biological samples taken, and questionnaires administered during the registration phase [18].

2. Study population

Our study sample was selected from the FACS using the census method. Four inclusion criteria were applied: 1. Participants with 5 years of follow-up 2. Participants with 5 years of data available 3. Participants without hypertension at the first phase (with the same diagnostic criteria mentioned in the final outcome section) 4. Being alive at the end of the follow-up. Finally, 2288 participants were included. The study steps are summarized in a flowchart displayed in Fig 1.

Fig 1. Flowchart of this study.

3. Data preparation and preprocessing

Most variables had no missing data, and the remainder had < 10% missing data. For continuous variables, mean imputation was employed, while for categorical variables, median imputation was used to replace missing data. Finally, a total of 160 variables were included in the ML models. The list of all variables is included in S1 Table. Data must also be preprocessed before ML models are fitted. Two methods were employed: feature scaling for continuous variables and one-hot encoding for variables with more than two categories. In this study, standard scaling was used, which rescales each continuous variable to zero mean and unit variance. One-hot encoding was applied to produce dummy variables, each of which takes only the value 0 or 1.
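The preprocessing steps described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the study's actual pipeline (which was built with Scikit-Learn); the function names and the toy columns are hypothetical and do not come from the FACS data.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Standard scaling: subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Turn a categorical column into one 0/1 dummy column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical toy columns (assumed for illustration only).
sbp = [110.0, None, 130.0, 120.0]
marital = ["single", "married", "widowed", "married"]

sbp_scaled = standardize(impute_mean(sbp))
categories, marital_dummies = one_hot(marital)
```

After scaling, the column has mean 0 and standard deviation 1, and individual values are not confined to a fixed interval. Each one-hot row contains exactly one 1.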

4. Final outcome

In the FACS study, a person was diagnosed with hypertension if they had systolic blood pressure (SBP) ≥140 mmHg or diastolic blood pressure (DBP) ≥90 mmHg on at least two measurements (15 minutes apart), or were taking anti-hypertensive drugs due to a previous diagnosis [17]. In the current study, having hypertension after 5 years of follow-up was analyzed as a binary outcome (hypertensive participants / non-hypertensive participants).

5. Splitting data

To avoid overfitting, the dataset was divided into two parts: training (80%) and test (20%) data. The training data were used for training the models, hyper-parameter tuning, and 5-fold cross-validation. The test data were kept separate from the training data and were used for the final evaluation and internal validation of the ML models. Participants in both the training and test sets were followed for 5 years until the final outcome was ascertained (Fig 2).
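As a rough illustration of the 80/20 split, the sketch below shuffles participant indices and holds out one fifth of them. The random seed, the function name, and the absence of stratification are assumptions; the paper does not report these details.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and hold out a fraction of them for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


# 2288 is the number of included participants reported in the study.
train_idx, test_idx = train_test_split_indices(2288)
```

With 2288 participants this gives 1830 training and 458 test indices, and no index appears in both sets.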

Fig 2. Procedure of splitting dataset into training (80%) and test (20%) parts.

6. Machine learning algorithms

In this study, ten supervised ML algorithms were used: logistic regression (LR), support vector machine (SVM), random forest (RF), Gaussian naive Bayes (GNB), linear discriminant analysis (LDA), k-nearest neighbors (KNN), gradient boosting machine (GBM), extreme gradient boosting (XGB), CatBoost (CAT), and light gradient boosting machine (LGBM). We used a variety of ML methods to ensure that the dataset was thoroughly explored. Every algorithm has distinct advantages and disadvantages, and our objective was to evaluate each one’s performance independently. LR was implemented for its simplicity. SVM was selected for its capability in handling high-dimensional data and finding complex relationships. RF was leveraged as an ensemble learning model. GNB provided a computationally efficient approach. LDA offered interpretability. KNN was implemented to detect local patterns. GBM sequentially refines model performance. XGB can achieve high accuracy on large datasets. CAT is optimized for handling categorical features, and LGBM efficiently manages larger datasets with fast training. This multifaceted strategy sought to capitalize on the distinct advantages of each model. By not depending on a single model, we aimed to limit model-specific bias and obtain a comprehensive understanding of the data.

Anaconda (version 4.12.0), the Visual Studio Code platform (version 1.76.2), and Python (version 3.9.12) were used to implement all ML algorithms. Furthermore, the ML algorithms were run using the Scikit-Learn library (version 1.1.3) [19].

7. Model development

First, 5-fold cross-validation and hyper-parameter tuning were applied to the training data to find the optimal hyper-parameters. At this stage, all features were used. The 5-fold approach split the training data into 5 equal parts; in each round, one part served as validation data while the model was trained on the remaining four parts and its accuracy on the validation part was recorded, and the average of the 5 accuracies was then computed. Each ML model’s accuracy can be changed by adjusting its hyper-parameters. Various combinations of hyper-parameters were evaluated to find the best combination, using the grid search approach [20] (S2 Table).
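The combination of 5-fold cross-validation and grid search can be sketched as follows. This is a simplified illustration under stated assumptions: the "model" is a hypothetical one-feature threshold rule with a single hyper-parameter (`shift`), and the toy data are invented; the study itself tuned the ten algorithms listed above with the grids in S2 Table.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n, k=5):
    """Yield (train, validation) index lists for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def fit(x, y, shift):
    """Toy model: threshold at the midpoint of the class means, plus a shift."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == 0]
    return (mean(pos) + mean(neg)) / 2 + shift


def predict(threshold, x):
    return [1 if xi >= threshold else 0 for xi in x]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_val_accuracy(x, y, shift, k=5):
    """Average validation accuracy over k folds for one hyper-parameter value."""
    scores = []
    for train, val in k_fold_indices(len(x), k):
        threshold = fit([x[i] for i in train], [y[i] for i in train], shift)
        preds = predict(threshold, [x[i] for i in val])
        scores.append(accuracy([y[i] for i in val], preds))
    return mean(scores)


def grid_search(x, y, param_grid, k=5):
    """Try every combination in param_grid and keep the best mean accuracy."""
    names = list(param_grid)
    best_params, best_score = None, -1.0
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = cross_val_accuracy(x, y, k=k, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical toy data: one feature and a binary outcome.
x = [100, 135, 105, 140, 110, 145, 115, 150, 120, 155]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
best_params, best_score = grid_search(x, y, {"shift": [-30.0, 0.0, 30.0]})
```

Each candidate setting is scored only on folds that were not used to fit it, which is what makes the averaged accuracy a fair basis for comparing hyper-parameter combinations.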

Second, over-sampling was employed to balance the outcome classes. The Synthetic Minority Over-sampling Technique (SMOTE) was used to balance the training data. This technique oversamples the minority class by creating synthetic instances: it selects a sample from the minority class and creates synthetic samples along the line segments joining it to some or all of its k nearest minority-class neighbors [21]. Hypertensive participants were the minority class, and SMOTE generated 1428 instances to equalize the numbers of hypertensive and non-hypertensive individuals.
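The interpolation step at the core of SMOTE can be illustrated with a short sketch. This is a simplified version written for clarity, not the implementation used in the study; the function name, the seed, and the toy points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the chosen sample within the minority class
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples: the corners of a unit square.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class. In the study, 1428 such instances were generated for the training data only.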

Finally, ML models were trained using balanced training data and the best hyper-parameters.

8. Model evaluation

All trained ML models were applied to the test data. For the final evaluation and comparison of the ML models, three metrics were used: area under the receiver operating characteristic curve (AUC), f1-score, and area under the precision-recall curve (AUC-PR) (Fig 3A, S3 Table). In addition, for more detail, S3 Table includes measures such as accuracy, sensitivity, and specificity. The following equations are used to determine the evaluation metrics:

Fig 3. Comparative analysis of model performance indicators and diagnostic visualizations for full-featured and simplified models across ten algorithms.

(A), Comparison of the AUC, f1-score and AUC-PR among the models with ten algorithms using all features. (B), Comparison of the AUC, f1-score and AUC-PR among simplified models and the model with all features. (C-D), ROC curve and confusion matrix of the LGBM model with top-30 features. LR; Logistic Regression, SVM; Support Vector Machine, RF; Random Forest, GNB; Gaussian Naive Bayes, LDA; Linear Discriminant Analysis, KNN; K-Nearest Neighbors, GBM; Gradient Boosting Machine, XGB; Extreme Gradient Boosting, CAT; Cat boost, LGBM; Light Gradient Boosting Machine, AUC; Area Under the ROC Curve, ROC; Receiver operating characteristic, AUC-PR; Area Under the Precision-Recall curve.

Accuracy = (TP+TN)/(TP+FP+TN+FN)

Sensitivity = TP/(TP+FN)

Specificity = TN/(TN+FP)

F1-score = 2*TP/(2*TP+FP+FN)

TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Finally, the LGBM model was chosen as the best ML model based on AUC, f1-score and AUC-PR.
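The four equations above translate directly into code. The sketch below evaluates them for a hypothetical confusion matrix; the counts are invented for illustration and are not the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity and F1 from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for an imbalanced test set (50 positives, 400 negatives).
metrics = classification_metrics(tp=20, fp=60, tn=340, fn=30)
```

With these counts the accuracy is 0.80 while the F1-score is only about 0.31, which illustrates why accuracy alone can be misleading when the positive class is rare.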

9. Feature selection

To accomplish efficient data reduction, feature selection approaches can be utilized. This is helpful in identifying more accurate ML models and reducing computational costs. There are three types of feature selection: wrapper, filter, and embedded methods [22]. SHapley Additive exPlanations (SHAP) was used as a wrapper method. SHAP is a unified approach to explaining any ML model’s output. It combines game theory with local explanations, bringing together various earlier approaches, and represents the only consistent and locally correct additive feature attribution method based on expectations. It has come to be used as a feature selection method in recent years [23, 24].

Then, SHAP and the LGBM model were combined to determine the optimal number of features among 10, 15, 20, 25, 30, and 35. The best performance was achieved by a subset of 30 features (Fig 3B, S4 Table). Fig 4B shows the top 30 features and their importance in predicting hypertension.
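One plausible way to build the candidate subsets is to rank features by their mean absolute SHAP value and keep the top k. The sketch below assumes that ranking rule and a precomputed SHAP matrix; the function names, feature names, and values are hypothetical, and the paper does not specify the exact procedure.

```python
def rank_features_by_mean_abs_shap(shap_values, feature_names):
    """Order features by mean absolute SHAP value, most important first."""
    n_samples = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    return sorted(importance, key=importance.get, reverse=True)


def top_k(ranked, k):
    """Keep the k highest-ranked features."""
    return ranked[:k]


# Hypothetical SHAP matrix: one row per participant, one column per feature.
feature_names = ["sbp", "sex", "whr", "alp"]
shap_values = [
    [0.40, -0.20, 0.10, 0.01],
    [-0.30, 0.25, -0.15, 0.00],
    [0.50, -0.15, 0.20, -0.02],
]
ranked = rank_features_by_mean_abs_shap(shap_values, feature_names)
candidate_subsets = {k: top_k(ranked, k) for k in (2, 3)}
```

In a full workflow, a model would be retrained on each candidate subset and the subset size with the best validation metrics would be retained, as the study did for sizes between 10 and 35.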

Fig 4. Interpretation of the LGBM model with top-30 features and its performance.

(A), SHAP beeswarm plot for top features. The plot sorts features by the sum of SHAP value magnitudes over all samples and uses SHAP values to show the distribution of the impact each feature has on the model output. The color represents the feature value (red high, blue low). This reveals, for example, that a high systolic blood pressure raises the predicted risk of hypertension. (B), The SHAP values of top-30 variables. The input features on the y-axis are arranged in descending importance, and the values on the x-axis represent the mean influence of each feature on the size of the model output based on SHAP analysis. (C), Receiver operating characteristic (ROC) curves of top-3 features. LGBM; Light Gradient Boosting Machine, AUC; Area Under the ROC Curve, ROC; Receiver operating characteristic, SHAP; Shapley Additive exPlanations, SBP; Systolic Blood Pressure, WHR; Waist-to-Hip Ratio.

Table 1 displays a descriptive analysis of the selected characteristics. Statistical tests such as the independent-samples t-test, chi-square test, and Mann-Whitney test were utilized. Statistical significance was defined as a P-value less than 0.05. The data were analyzed using SPSS version 18 (IBM Corp., Armonk, N.Y., USA).

Table 1. General and top-30 important characteristics of the participants according to hypertension after 5 years of follow-up (Total number of participants = 2288).

| Variables | Hypertension: No (N = 2037) | Hypertension: Yes (N = 251) | P-value |
|---|---|---|---|
| Age, years* | 48.60±8.21 | 49.88±8.13 | 0.020 (a) |
| Sex, male | 1023 (50.2) | 87 (34.6) | ≤0.001 (b) |
| Sex, female | 1014 (49.8) | 164 (65.4) | |
| Physical activity, MET | 42.99±11.93 | 40.66±8.82 | ≤0.001 (a) |
| Systolic blood pressure, mmHg | 105 [93,117] | 113 [100,126] | ≤0.001 (c) |
| Diastolic blood pressure, mmHg | 71 [62,80] | 74 [65,83] | ≤0.001 (c) |
| Waist circumference, cm | 91.97±11.06 | 96.87±11.61 | ≤0.001 (a) |
| Waist-to-hip ratio | 0.92±0.06 | 0.95±0.05 | ≤0.001 (a) |
| Waist-to-height ratio | 0.56±0.07 | 0.60±0.07 | ≤0.001 (a) |
| Alkaline phosphatase level (ALP), U/L | 204.40±79.69 | 210.45±65.47 | 0.248 (a) |
| Ascorbic acid in urine test, yes | 288 (87.8) | 40 (12.2) | 0.443 (b) |
| Blood in urine test, yes | 620 (86.4) | 98 (13.6) | 0.006 (b) |
| Iron intake, mg/day | 24.03±10.98 | 22.27±9.88 | 0.016 (a) |
| Sodium intake, mg/day | 4782.3±2061.5 | 4447.1±1702.8 | 0.004 (a) |
| Cholesterol intake, mg/day | 278.14±79.69 | 256.20±169.25 | 0.043 (a) |
| Grain products consumption, gr/day | 704.23±331.86 | 641.76±284.17 | 0.004 (a) |
| Vegetable consumption, gr/day | 587.97±321.82 | 610.04±321.96 | 0.305 (a) |
| Dairy products consumption, gr/day | 212.48±179.60 | 231.17±188.67 | 0.122 (a) |
| Meat consumption, gr/day | 99.35±63.85 | 91.39±64.25 | 0.063 (a) |
| History of joint pain, yes | 828 (89.0) | 102 (11.0) | 0.997 (b) |
| History of oral aphthous, yes | 403 (85.9) | 66 (14.1) | 0.016 (b) |
| History of heartburn, yes | 721 (87.7) | 101 (12.3) | 0.131 (b) |
| History of back stiffness, yes | 497 (86.9) | 75 (13.1) | 0.058 (b) |
| History of chronic headaches, yes | 259 (84.6) | 47 (15.4) | 0.008 (b) |
| History of numbness, yes | 139 (85.3) | 24 (14.7) | 0.112 (b) |
| History of urinary problems, yes | 880 (88.4) | 116 (11.6) | 0.364 (b) |
| Past medical history of hospitalization, yes | 662 (93.0) | 50 (7.0) | ≤0.001 (b) |
| Family history of hypertension, yes | 986 (87.2) | 145 (12.8) | ≤0.001 (b) |
| Family history of epilepsy, yes | 120 (83.9) | 23 (16.1) | 0.043 (b) |
| Family history of pelvic or femoral fracture, yes | 122 (83.0) | 25 (17.0) | 0.015 (b) |
| Family history of stroke, yes | 257 (85.7) | 43 (14.3) | 0.046 (b) |
| Having job, yes | 1107 (91.8) | 99 (8.2) | ≤0.001 (b) |

Data are presented as Mean ± SD, Median [IQR], or Number (%). Statistical tests used: a: Independent Samples t-Test, b: Chi-Square Test, and c: Mann-Whitney Test. MET; Metabolic Equivalent

*Age is not one of top-30 variables and it is only presented as one of the demographic variables of the study.

10. Model interpretation

The ROC curve and confusion matrix of the LGBM model with top-30 features are displayed in Fig 3C & 3D. SHAP analysis was utilized to interpret the LGBM model. SHAP values for the top features were determined in detail (Fig 4B). The beeswarm plot is designed to display an information-dense summary of how the top features in a dataset impact the model’s output (Fig 4A).

11. Ethics approval and consent to participate

Our study protocol was approved by the Fasa University of Medical Sciences Research Council and Ethics Committee (approval code: IR.FUMS.REC.1402.133) and adhered to Helsinki guidelines. Furthermore, all subjects provided written informed consent before participating. The authors have no access to information that could identify individual participants during or after data collection.

Results

1. Comparison of the performance of machine learning algorithms

Fig 3A and S3 Table display the performance of all ML models based on various metrics. To make the final decision and identify the best ML model, the AUC, f1-score and AUC-PR metrics were considered. The highest AUC was achieved by RF (0.65). GBM and LGBM were ranked second and third, respectively, with AUC = 0.63. Among the ML models, CAT obtained the highest f1-score. GNB and SVM came in second and third, with f1-scores of 0.21 and 0.20, respectively. LGBM had the highest AUC-PR (0.20), while RF was second (AUC-PR = 0.19). Ultimately, the optimal model was determined to be LGBM. Finally, the top 30 features were chosen as the optimal subset for predicting hypertension with the LGBM model. The LGBM with top-30 features had AUC = 0.67, f1-score = 0.23 and AUC-PR = 0.26.

2. Descriptive analytics of top-30 variables of participants

In this research, 251 people (10.9%) were diagnosed with hypertension (Table 1). Women were more likely than men to develop hypertension (p-value<0.05). In hypertensive individuals, SBP, DBP, waist circumference, waist-to-hip ratio, and waist-to-height ratio were higher (p-value<0.05). Physical activity, blood in urine test, iron intake, sodium intake, cholesterol intake, grain products consumption, history of oral aphthous, history of chronic headaches, past medical history of hospitalization, family history of hypertension, family history of epilepsy, family history of pelvic or femoral fracture, family history of stroke, and having a job also differed significantly between hypertensive and non-hypertensive participants (p-value<0.05). There were no statistically significant differences in alkaline phosphatase level, ascorbic acid in urine test, vegetable consumption, dairy products consumption, meat consumption, history of joint pain, history of heartburn, history of back stiffness, history of numbness, and history of urinary problems (p-value>0.05).

3. Feature importance

Fig 4B shows the top-30 features in order of importance. The top three predictors of hypertension were SBP, gender, and waist-to-hip ratio, with AUCs of 0.66, 0.58, and 0.63, respectively. Blood in urine test and family history of hypertension ranked fourth and fifth, respectively. Fig 4A shows a beeswarm plot in which features are ordered by their SHAP values. In this plot, binary variables take the values zero and one, where one indicates that the characteristic is present. The plot indicates that higher SBP and a higher waist-to-hip ratio increase the predicted risk of developing hypertension, as do female gender and the presence of blood in the urine test.
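A single-feature AUC of the kind reported above can be read as the probability that a randomly chosen participant who developed hypertension has a higher feature value than a randomly chosen participant who did not. The sketch below computes the AUC from that pairwise definition; it is an illustration only, the toy values are hypothetical, and the paper does not state how its single-feature ROC curves were produced.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half a win."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical baseline SBP values (mmHg) and 5-year outcomes (1 = hypertension).
sbp = [105, 118, 112, 95, 125, 110, 100, 130]
hypertension = [0, 1, 0, 0, 1, 1, 0, 0]
auc_sbp = roc_auc(sbp, hypertension)
```

An AUC of 0.5 corresponds to a feature with no discriminative ability, so values such as 0.58 or 0.66 indicate modest separation between the two outcome groups.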

Discussion

This longitudinal study, based on the FACS with 10,118 participants aged 35–70, aimed to predict 5-year hypertension risk using ML. After selecting 2288 participants meeting specific criteria, 160 variables were processed and prepared for analysis. Various ML algorithms were employed, and Light Gradient Boosting Machine (LGBM) emerged as the optimal model. The study eventually introduced the top 30 features, highlighting the top 5 factors of SBP, gender, waist to hip ratio, hematuria, and family history of hypertension significantly associated with hypertension development. The model achieved an AUC of 0.67.

So far, three relatively robust studies have been conducted to predict hypertension using ML in the Middle East. AlKaabi et al. [1] implemented three supervised ML algorithms in a cross-sectional study involving 987 individuals aged over 18 in Qatar, where the random forest model demonstrated the best performance with an AUC of 0.869. Compared with that study, ours had a much larger sample size and a stronger methodology, utilized more models and variables, and, unlike that study, incorporated feature selection. Additionally, their study population was not entirely representative of the region, as it focused solely on individuals residing in Qatar within a specific timeframe. Sakr et al. [6] conducted a longitudinal study on 23,095 suspected cardiovascular patients referred for exercise testing and followed up for 10 years, implementing six ML models. The RTF model achieved the best performance with an AUC of 0.93. That study was longitudinal, had a larger sample size than ours, and attained a higher AUC. However, it did not encompass the general population, evaluating only patients referred for exercise testing, and focused mainly on factors related to cardiovascular diseases and exercise test results. Furthermore, Namatollahi et al. [25] designed a predictive model for hypertension based on factors associated with body composition, utilizing data from the same adult cohort in Fasa. That study also followed a cross-sectional design and focused exclusively on factors related to body structure. In contrast, taking advantage of the follow-up phase of this cohort, we conducted the current longitudinal study and incorporated a broader range of factors. Most studies of ML models for predicting hypertension [8, 26–28], including those by AlKaabi and Namatollahi [1, 25], were based on cross-sectional data. Firstly, cross-sectional studies cannot determine the timing of future hypertension development in patients. Secondly, cross-sectional data often include numerous hypertension-related complications in patients’ records, which give the ML models an unfair advantage and artificially inflate their accuracy scores. This issue, known as data leakage, undermines the predictive reliability of the results, making them fundamentally non-generalizable to real-world scenarios. In contrast, longitudinal data, like ours, begin with patients who are initially healthy, showing no signs of hypertension or its complications. Consequently, results derived from longitudinal data hold greater validity, and even lower scores are more valuable than the misleadingly elevated scores from cross-sectional models.

In our model, an interesting predictive factor whose relation to hypertension has received little attention in the literature was positive hematuria. Previously, only three studies [29–31] directly examined this connection, all exclusively in hemophiliac patients, a population with higher occurrences of hematuria and hypertension than the general population, and all with small sample sizes. Holme et al. [30] in their cross-sectional study did not find a significant correlation between the presence of hematuria and hypertension in these patients. Sun et al. [31], in their prospective study focusing solely on men, concluded that despite the high prevalence of hematuria and hypertension in hemophiliac patients, these two factors are not related, and hematuria is unlikely to lead to hypertension in the long term. Also, renal insufficiency was rare in these patients during follow-up, calling into question renal damage as an intermediary for this relationship. However, this study was conducted solely on hemophiliac male patients and had a small sample size. Nonetheless, Qvistad et al. [29] in a recent study found that the connection between hematuria and hypertension becomes significant in patients with a family history of hypertension. Our study results were adjusted for a family history of hypertension, yet hematuria was still selected as one of the top 5 predictive factors for hypertension in a 5-year model. Hematuria could be a sign of underlying kidney damage or dysfunction, which, although mild and overlooked, could, in the long term, alter blood pressure regulation by affecting sodium balance, increasing fluid retention, and disrupting hormonal equilibrium, such as the renin-angiotensin-aldosterone system, ultimately leading to hypertension [32–34]. Additionally, factors causing hematuria might trigger an inflammatory response and endothelial dysfunction. If chronic, this inflammation and dysfunction could potentially increase vascular resistance, subsequently raising blood pressure and leading to hypertension [35]. Of course, hypertension and hematuria share common risk factors such as obesity and smoking, but these factors are adjusted for in the models. Longitudinal studies based on this hypothesis are needed to examine and confirm the relationship between hematuria and the likelihood of developing hypertension over time.

Anthropometric indices have repeatedly been identified as risk factors for cardiovascular diseases [36], and various studies have reported a strong correlation between WHR (waist-to-hip ratio) and hypertension. However, WHR as a specific predictor for the occurrence of hypertension has been less discussed. Initially, a cross-sectional study by Feldstein et al. [37] demonstrated that WHR might predict the risk of hypertension better than other anthropometric indices. A meta-analysis of cross-sectional studies indicated that WHR is a better biomarker for cardiovascular diseases and hypertension risk [38]. Choi et al., in a large longitudinal study, concluded that WHR has a significant and strong relationship with the occurrence of hypertension over time [39]. The use of WHR, compared to popular anthropometric indices like BMI and WC, could be more useful as it is easier to measure, does not have a linear relationship with other indices, and has shown consistency across different age and ethnic groups [40]. In our study, WHR was the third top predictive factor for hypertension in the next 5 years, consistent with the literature cited above and with similar studies.

As with other diseases, a family history of hypertension is associated with a higher chance of developing the condition. Wang and colleagues’ extensive 54-year longitudinal cohort study demonstrated that a family history of hypertension, from either the father or the mother, has an independent and strong correlation with the occurrence of hypertension over time [41]. Similarly, a recent longitudinal study by Kunnas et al. [42] with a 15-year follow-up and a more precise design showed similar results. In our study, a family history of hypertension was the fifth top predictive factor for hypertension in the next 5 years, consistent with the literature cited above and with similar studies.

The occurrence and prevalence of hypertension differ between men and women [43]. Hypertension prevalence is generally higher in men than in women, but our model identified female gender as the second top predictive factor for hypertension in the next 5 years. Our study cohort included individuals aged 35 to 70 years. As age increases, especially beyond the sixth decade of life, the incidence of hypertension in women rises markedly more steeply [44]. Moreover, at older ages, hypertension risk factors specific to women, such as a history of pregnancy-induced hypertension and menopause, become evident and prevalent, increasing the chances of developing hypertension at these ages [45]. Additionally, socioeconomic disadvantage is more strongly associated with hypertension in women [46], which is plausible given that our study population lives in rural areas, predominantly with lower socioeconomic status. Considering that hypertension in women is a stronger risk factor for cardiovascular diseases [45], this result is important.

Two of the top predictive factors in our study, WHR and SBP, were also among the top predictive factors in a similar and robust study conducted in Canada [47]. This agreement could indicate a degree of similarity among different populations in predicting future hypertension.

Based on our ML model, individuals at high risk for developing hypertension can be advised to modify their lifestyles and behaviors (such as increasing physical activity, making dietary changes, stopping smoking, and reducing alcohol consumption) to avoid hypertension and prevent its associated complications and costs [47]. It is further recommended to employ new ML models in various geographical regions where there is a wide diversity in hypertension risk factors, as each model may reveal new predictive factors for hypertension [28].

Our study had several strengths that set it apart from previous research. First, it was a longitudinal study with a 5-year follow-up period, providing insight into the development of hypertension over time. Second, we employed feature selection and ten supervised ML algorithms, enhancing the robustness of our analysis. Third, we conducted hyperparameter tuning to optimize model performance. Finally, our study included a considerably larger number and broader scope of variables than most studies in this field, including the Canadian study [47].
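To make the hyperparameter tuning step concrete, the following is a minimal sketch of an exhaustive search over hyperparameter combinations. It is illustrative only: the grid values, the `toy_score` placeholder, and the choice of an exhaustive grid are assumptions made for this example and are not the settings used in the study (the tuned values are reported in S2 Table).

```python
from itertools import product

# Hypothetical search grid. The parameter names follow LightGBM conventions,
# but the values are illustrative and not taken from the study.
GRID = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300],
}


def iter_combinations(grid):
    """Yield every hyperparameter combination in the grid as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def toy_score(params):
    """Hypothetical stand-in for a cross-validated metric such as AUC.

    A real workflow would train a model with `params` and return its
    validation score; this made-up function only lets the sketch run
    without any data.
    """
    return (
        1.0
        - abs(params["num_leaves"] - 31) / 100
        - abs(params["learning_rate"] - 0.05)
        - abs(params["n_estimators"] - 300) / 1000
    )


def grid_search(grid, score_fn):
    """Return (best_params, best_score) over all combinations in the grid."""
    best_params, best_score = None, float("-inf")
    for params in iter_combinations(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search(GRID, toy_score)
print(best_params, round(best_score, 3))
```

In practice, `toy_score` would be replaced by a cross-validated evaluation on the training data only, so that the held-out test set plays no part in choosing the hyperparameters.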

Although our methodology was strong, we faced severe data limitations. Of the 3,000 individuals followed, 2,288 met the study criteria, and only 251 of them (10.9%) developed hypertension. Because of this limitation, the final model's F1 score was low and its AUC was only moderate. Despite this, we achieved an AUC of approximately 0.67, which we believe reflects the strength of our methodology. Once data collection for the follow-up phase of the Fasa cohort is complete in the coming years, we will be able to improve and strengthen our models. Furthermore, we could not perform external validation because we lacked access to complete datasets from other cohorts.
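With so few positive cases, the reported F1 score and AUC-PR are easier to read against simple reference points. The sketch below computes two such baselines from the counts given in the abstract (251 cases among 2,288 participants). It assumes that the evaluation data had roughly the same prevalence as the full cohort, which the paper does not state, so the numbers should be read as rough context rather than as a re-analysis.

```python
# Counts reported in the abstract: 251 incident cases among 2,288 participants.
N_POSITIVE = 251
N_TOTAL = 2288

# Metrics reported for the final LGBM model.
REPORTED_AUC_PR = 0.26
REPORTED_F1 = 0.23


def prevalence(n_pos, n_total):
    """Fraction of participants who developed hypertension."""
    return n_pos / n_total


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


p = prevalence(N_POSITIVE, N_TOTAL)

# A classifier with no skill has an expected AUC-PR equal to the prevalence.
baseline_auc_pr = p

# A rule that flags everyone as positive has precision equal to the
# prevalence and recall equal to 1.
baseline_f1_all_positive = f1_score(precision=p, recall=1.0)

auc_pr_lift = REPORTED_AUC_PR / baseline_auc_pr

print(f"prevalence (no-skill AUC-PR): {p:.3f}")
print(f"F1 of the 'all positive' rule: {baseline_f1_all_positive:.3f}")
print(f"reported AUC-PR relative to no-skill: {auc_pr_lift:.2f}x")
```

Under this assumption, the reported AUC-PR is a little more than twice the no-skill level, while the reported F1 score is only modestly above that of the trivial rule, which fits the authors' own description of the F1 score as low.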

Conclusion

ML models demonstrated effective performance in predicting hypertension and its related factors in our rural population. LGBM emerged as the optimal model. Its final version used the top 30 features, of which the top five (higher baseline SBP, female gender, higher WHR, positive hematuria, and a family history of hypertension) were significantly associated with the future development of hypertension. The model achieved an AUC of 0.67, an F1-score of 0.23, and an AUC-PR of 0.26. Individuals identified as high risk can be advised to modify their lifestyles and behaviors to prevent hypertension and its associated complications and costs.

Supporting information

S1 Table. All characteristics and clinical features of participants.

(DOCX)

pone.0300201.s001.docx (14.1KB, docx)
S2 Table. Finding the appropriate hyper-parameter values for each algorithm after hyper-parameter tuning.

#Abbreviations, LR; Logistic Regression, SVM; Support Vector Machine, RF; Random Forest, GNB; Gaussian Naive Bayes, LDA; Linear Discriminant Analysis, KNN; K-Nearest Neighbors, GBM; Gradient Boosting Machine, XGB; Extreme Gradient Boosting, CAT; Cat Boost, LGBM; Light Gradient Boosting Machine.

(DOCX)

pone.0300201.s002.docx (13.4KB, docx)
S3 Table. Performance of the ten machine learning algorithms using all features.

#Abbreviations, LR; Logistic Regression, SVM; Support Vector Machine, RF; Random Forest, GNB; Gaussian Naive Bayes, LDA; Linear Discriminant Analysis, KNN; K-Nearest Neighbors, GBM; Gradient Boosting Machine, XGB; Extreme Gradient Boosting, CAT; Cat Boost, LGBM; Light Gradient Boosting Machine, AUC; Area Under the ROC Curve, ROC; Receiver operating characteristic, AUC-PR; Area Under the Precision-Recall curve.

(DOCX)

pone.0300201.s003.docx (14.5KB, docx)
S4 Table. Performance of the LGBM model with different number of features.

#Abbreviations, LGBM; Light Gradient Boosting Machine, AUC; Area Under the ROC Curve, ROC; Receiver operating characteristic, AUC-PR; Area Under the Precision-Recall curve.

(DOCX)

pone.0300201.s004.docx (13.9KB, docx)

Acknowledgments

This project has been approved by the National Institutes for Medical Research Development (NIMAD), Tehran, Iran under code "4021292".

Data Availability

Our institutional policy does not require the data to be made public, and the data and material transfer agreement does not allow further transfer of data without the provider's prior written consent. However, the data can be made available upon request from the corresponding author, who is a member of this team. Additionally, the dataset generated for this study is available upon request from the Fasa Non-Communicable Diseases Research Center management team, who can be contacted by telephone at +987153314068 or by email at ncdrc.fums.ac.ir@gmail.com.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. AlKaabi LA, Ahmed LS, Al Attiyah MF, Abdel-Rahman ME. Predicting hypertension using machine learning: Findings from Qatar Biobank Study. PLoS One. 2020;15(10):e0240370. Epub 20201016. doi: 10.1371/journal.pone.0240370; PubMed Central PMCID: PMC7567367.
  • 2. Mamdouh H, Alnakhi WK, Hussain HY, Ibrahim GM, Hussein A, Mahmoud I, et al. Prevalence and associated risk factors of hypertension and pre-hypertension among the adult population: findings from the Dubai Household Survey, 2019. BMC Cardiovascular Disorders. 2022;22(1):18. doi: 10.1186/s12872-022-02457-4.
  • 3. Tang C, Jiang H, Zhao B, Lin Y, Lin S, Chen T, et al. The association between bilirubin and hypertension among a Chinese ageing cohort: a prospective follow-up study. Journal of Translational Medicine. 2022;20(1):108. doi: 10.1186/s12967-022-03309-7.
  • 4. Oori MJ, Mohammadi F, Norozi K, Fallahi-Khoshknab M, Ebadi A, Gheshlagh RG. Prevalence of HTN in Iran: meta-analysis of published studies in 2004–2018. Current hypertension reviews. 2019;15(2):113–22. doi: 10.2174/1573402115666190118142818.
  • 5. Berek PA, Irawati D, Hamid AYS. Hypertension: A global health crisis. Ann Clin Hypertens. 2021;5:8–11.
  • 6. Sakr S, Elshawi R, Ahmed A, Qureshi WT, Brawner C, Keteyian S, et al. Using machine learning on cardiorespiratory fitness data for predicting hypertension: The Henry Ford ExercIse Testing (FIT) Project. PLoS One. 2018;13(4):e0195344. Epub 20180418. doi: 10.1371/journal.pone.0195344; PubMed Central PMCID: PMC5905952.
  • 7. Rapport R. Hypertension Silent killer. New Jersey medicine: the journal of the Medical Society of New Jersey. 1999;96(3):41–3.
  • 8. Islam SMS, Talukder A, Awal MA, Siddiqui MMU, Ahamad MM, Ahammed B, et al. Machine Learning Approaches for Predicting Hypertension and Its Associated Factors Using Population-Level Data From Three South Asian Countries. Front Cardiovasc Med. 2022;9:839379. Epub 20220331. doi: 10.3389/fcvm.2022.839379; PubMed Central PMCID: PMC9008259.
  • 9. Mills KT, Stefanescu A, He J. The global epidemiology of hypertension. Nature Reviews Nephrology. 2020;16(4):223–37. doi: 10.1038/s41581-019-0244-2.
  • 10. Gheorghe A, Griffiths U, Murphy A, Legido-Quigley H, Lamptey P, Perel P. The economic burden of cardiovascular disease and hypertension in low-and middle-income countries: a systematic review. BMC public health. 2018;18(1):1–11. doi: 10.1186/s12889-018-5806-x.
  • 11. Wang G, Grosse SD, Schooley MW. Conducting research on the economics of hypertension to improve cardiovascular health. American journal of preventive medicine. 2017;53(6):S115–S7. doi: 10.1016/j.amepre.2017.08.005.
  • 12. D’Agostino RB Sr, Vasan RS, Pencina MJ, Wolf PA, Cobain M, Massaro JM, Kannel WB. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743–53. doi: 10.1161/CIRCULATIONAHA.107.699579.
  • 13. Goff DC Jr, Lloyd-Jones DM, Bennett G, Coady S, D’agostino RB, Gibbons R, et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation. 2014;129(25_suppl_2):S49–S73. doi: 10.1161/01.cir.0000437741.48606.98.
  • 14. Beunza J-J, Puertas E, García-Ovejero E, Villalba G, Condes E, Koleva G, et al. Comparison of machine learning algorithms for clinical event prediction (risk of coronary heart disease). Journal of biomedical informatics. 2019;97:103257. doi: 10.1016/j.jbi.2019.103257.
  • 15. Bi Q, Goodman KE, Kaminsky J, Lessler J. What is machine learning? A primer for the epidemiologist. American journal of epidemiology. 2019;188(12):2222–39. doi: 10.1093/aje/kwz189.
  • 16. Wang P, Li Y, Reddy CK. Machine learning for survival analysis: A survey. ACM Computing Surveys (CSUR). 2019;51(6):1–36.
  • 17. Bolívar JJ. Essential hypertension: an approach to its etiology and neurogenic pathophysiology. International journal of hypertension. 2013;2013:547809. Epub 2014/01/05. doi: 10.1155/2013/547809; PubMed Central PMCID: PMC3872229.
  • 18. Homayounfar R, Farjam M, Bahramali E, Sharafi M, Poustchi H, Malekzadeh R, et al. Cohort Profile: The Fasa Adults Cohort Study (FACS): a prospective study of non-communicable diseases risks. International journal of epidemiology. 2023. Epub 2023/01/03. doi: 10.1093/ije/dyac241.
  • 19. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12:2825–30.
  • 20. Schratz P, Muenchow J, Iturritxa E, Richter J, Brenning A. Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecological Modelling. 2019;406:109–20.
  • 21. Nakamura M, Kajiwara Y, Otsuka A, Kimura H. LVQ-SMOTE - Learning Vector Quantization based Synthetic Minority Over-sampling Technique for biomedical data. BioData mining. 2013;6(1):16. Epub 2013/10/04. doi: 10.1186/1756-0381-6-16; PubMed Central PMCID: PMC4016036.
  • 22. Jović A, Brkić K, Bogunović N, editors. A review of feature selection methods with applications. 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO); 2015. 25–29 May 2015.
  • 23. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Advances in neural information processing systems. 2017;30.
  • 24. Effrosynidis D, Arampatzis A. An evaluation of feature selection methods for environmental data. Ecological Informatics. 2021;61:101224. doi: 10.1016/j.ecoinf.2021.101224.
  • 25. Nematollahi MA, Jahangiri S, Asadollahi A, Salimi M, Dehghan A, Mashayekh M, et al. Body composition predicts hypertension using machine learning methods: a cohort study. Sci Rep. 2023;13(1):6885. Epub 20230427. doi: 10.1038/s41598-023-34127-6; PubMed Central PMCID: PMC10140285.
  • 26. Islam MM, Rahman MJ, Chandra Roy D, Tawabunnahar M, Jahan R, Ahmed N, Maniruzzaman M. Machine learning algorithm for characterizing risks of hypertension, at an early stage in Bangladesh. Diabetes Metab Syndr. 2021;15(3):877–84. Epub 20210420. doi: 10.1016/j.dsx.2021.03.035.
  • 27. Islam MM, Alam MJ, Maniruzzaman M, Ahmed N, Ali MS, Rahman MJ, Roy DC. Predicting the risk of hypertension using machine learning algorithms: A cross sectional study in Ethiopia. PLoS One. 2023;18(8):e0289613. Epub 20230824. doi: 10.1371/journal.pone.0289613; PubMed Central PMCID: PMC10449142.
  • 28. Guo S, Ge JX, Liu SN, Zhou JY, Li C, Chen HJ, et al. Development of a convenient and effective hypertension risk prediction model and exploration of the relationship between Serum Ferritin and Hypertension Risk: a study based on NHANES 2017-March 2020. Front Cardiovasc Med. 2023;10:1224795. Epub 20230906. doi: 10.3389/fcvm.2023.1224795; PubMed Central PMCID: PMC10510409.
  • 29. Qvigstad C, Sørensen LQ, Tait RC, de Moerloose P, Holme PA. Macroscopic hematuria as a risk factor for hypertension in ageing people with hemophilia and a family history of hypertension. Medicine (Baltimore). 2020;99(9):e19339. doi: 10.1097/MD.0000000000019339; PubMed Central PMCID: PMC7478422.
  • 30. Holme PA, Combescure C, Tait RC, Berntorp E, Rauchensteiner S, de Moerloose P. Hypertension, haematuria and renal functioning in haemophilia - a cross-sectional study in Europe. Haemophilia. 2016;22(2):248–55. Epub 20151216. doi: 10.1111/hae.12847.
  • 31. Sun HL, Yang M, Sait AS, von Drygalski A, Jackson S. Haematuria is not a risk factor of hypertension or renal impairment in patients with haemophilia. Haemophilia. 2016;22(4):549–55. Epub 20160331. doi: 10.1111/hae.12921.
  • 32. Orlandi PF, Fujii N, Roy J, Chen H-Y, Lee Hamm L, Sondheimer JH, et al. Hematuria as a risk factor for progression of chronic kidney disease and death: findings from the Chronic Renal Insufficiency Cohort (CRIC) Study. BMC nephrology. 2018;19:1–11.
  • 33. Remuzzi G, Perico N, Macia M, Ruggenenti P. The role of renin-angiotensin-aldosterone system in the progression of chronic kidney disease. Kidney Int Suppl. 2005;(99):S57–65. doi: 10.1111/j.1523-1755.2005.09911.x.
  • 34. Te Riet L, van Esch JH, Roks AJ, van den Meiracker AH, Danser AH. Hypertension: renin-angiotensin-aldosterone system alterations. Circ Res. 2015;116(6):960–75. doi: 10.1161/CIRCRESAHA.116.303587.
  • 35. Patrick DM, Van Beusecum JP, Kirabo A. The role of inflammation in hypertension: novel concepts. Curr Opin Physiol. 2021;19:92–8. Epub 20201013. doi: 10.1016/j.cophys.2020.09.016; PubMed Central PMCID: PMC7552986.
  • 36. Fuchs FD, Gus M, Moreira LB, Moraes RS, Wiehe M, Pereira GM, Fuchs SC. Anthropometric indices and the incidence of hypertension: a comparative analysis. Obes Res. 2005;13(9):1515–7. doi: 10.1038/oby.2005.184.
  • 37. Feldstein CA, Akopian M, Olivieri AO, Kramer AP, Nasi M, Garrido D. A comparison of body mass index and waist-to-hip ratio as indicators of hypertension risk in an urban Argentine population: a hospital-based study. Nutr Metab Cardiovasc Dis. 2005;15(4):310–5. doi: 10.1016/j.numecd.2005.03.001.
  • 38. Browning LM, Hsieh SD, Ashwell M. A systematic review of waist-to-height ratio as a screening tool for the prediction of cardiovascular disease and diabetes: 0·5 could be a suitable global boundary value. Nutr Res Rev. 2010;23(2):247–69. Epub 20100907. doi: 10.1017/s0954422410000144.
  • 39. Choi JR, Koh SB, Choi E. Waist-to-height ratio index for predicting incidences of hypertension: the ARIRANG study. BMC Public Health. 2018;18(1):767. Epub 20180619. doi: 10.1186/s12889-018-5662-8; PubMed Central PMCID: PMC6008942.
  • 40. Ashwell M, Gunn P, Gibson S. Waist-to-height ratio is a better screening tool than waist circumference and BMI for adult cardiometabolic risk factors: systematic review and meta-analysis. Obes Rev. 2012;13(3):275–86. Epub 20111123. doi: 10.1111/j.1467-789X.2011.00952.x.
  • 41. Wang NY, Young JH, Meoni LA, Ford DE, Erlinger TP, Klag MJ. Blood pressure change and risk of hypertension associated with parental hypertension: the Johns Hopkins Precursors Study. Arch Intern Med. 2008;168(6):643–8. doi: 10.1001/archinte.168.6.643.
  • 42. Kunnas T, Nikkari ST. Family history of hypertension enhances age-dependent rise in blood pressure, a 15-year follow-up, the Tampere adult population cardiovascular risk study. Medicine (Baltimore). 2023;102(39):e35366. doi: 10.1097/MD.0000000000035366; PubMed Central PMCID: PMC10545328.
  • 43. Connelly PJ, Casey H, Montezano AC, Touyz RM, Delles C. Sex steroids receptors, hypertension, and vascular ageing. Journal of human hypertension. 2022;36(2):120–5. doi: 10.1038/s41371-021-00576-7.
  • 44. Connelly PJ, Azizi Z, Alipour P, Delles C, Pilote L, Raparelli V. The importance of gender to understand sex differences in cardiovascular disease. Canadian Journal of Cardiology. 2021;37(5):699–710. doi: 10.1016/j.cjca.2021.02.005.
  • 45. Connelly PJ, Currie G, Delles C. Sex Differences in the Prevalence, Outcomes and Management of Hypertension. Curr Hypertens Rep. 2022;24(6):185–92. Epub 20220307. doi: 10.1007/s11906-022-01183-8; PubMed Central PMCID: PMC9239955.
  • 46. Neufcourt L, Deguen S, Bayat S, Zins M, Grimaud O. Gender differences in the association between socioeconomic status and hypertension in France: A cross-sectional analysis of the CONSTANCES cohort. PLoS One. 2020;15(4):e0231878. doi: 10.1371/journal.pone.0231878.
  • 47. Chowdhury MZI, Leung AA, Walker RL, Sikdar KC, O’Beirne M, Quan H, Turin TC. A comparison of machine learning algorithms and traditional regression-based statistical modeling for predicting hypertension incidence in a Canadian population. Sci Rep. 2023;13(1):13. Epub 20230102. doi: 10.1038/s41598-022-27264-x; PubMed Central PMCID: PMC9807553.

Decision Letter 0

Amir Hossein Behnoush

29 Jan 2024

PONE-D-23-44016

Machine learning-based models to predict the conversion of normal blood pressure to hypertension within 5-year follow-up

PLOS ONE

Dear Dr. Tabrizi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The reviewers have raised several critical concerns regarding the manuscript. The authors are encouraged to address these issues which might need extensive changes. The English language of the manuscript should also be enhanced. 

Please submit your revised manuscript by Mar 14 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Amir Hossein Behnoush

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. In this instance it seems there may be acceptable restrictions in place that prevent the public sharing of your minimal data. However, in line with our goal of ensuring long-term data availability to all interested researchers, PLOS’ Data Policy states that authors cannot be the sole named individuals responsible for ensuring data access (http://journals.plos.org/plosone/s/data-availability#loc-acceptable-data-sharing-methods).

Data requests to a non-author institutional point of contact, such as a data access or ethics committee, helps guarantee long term stability and availability of data. Providing interested researchers with a durable point of contact ensures data will be accessible even if an author changes email addresses, institutions, or becomes unavailable to answer requests.

Before we proceed with your manuscript, please also provide non-author contact information (phone/email/hyperlink) for a data access committee, ethics committee, or other institutional body to which data requests may be sent. If no institutional body is available to respond to requests for your minimal data, please consider whether there are any institutional representatives who did not collaborate in the study, and are not listed as authors on the manuscript, who would be able to hold the data and respond to external requests for data access. If so, please provide their contact information (i.e., email address). Please also provide details on how you will ensure persistent or long-term data storage and availability.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The study entitled "Machine learning-based models to predict the conversion of normal blood pressure to hypertension within 5-year follow-up", conducted by Andishgar et al., aimed to assess and contrast the efficacy of various machine learning methods in evaluating individuals' susceptibility to developing new-onset hypertension within 5 years. The study is well conducted, the aim is clear, and the authors developed the study in line with the main aim. However, I have some major concerns regarding the methods and results.

1- ML as a predictive tool should be consistent with previous clinical findings. The top 30 important features do not seem to agree with some established hypertension risk factors. In this study, ALP level has much more predictive power than absence of "physical activity", which is considered a major risk factor for HTN.

2- The authors reported hematuria in 630 out of 2300 normal individuals. This is a very high prevalence for this key feature.

3- I ask the authors to add a logistic model, since this simple model has shown better performance than other models in many studies.

Minor Comments:

1- In Table 1, the percentages are reported with the "row" as total; for example, the proportion of male sex is reported as 92.2% in individuals without HTN, which is incorrect. This should be changed to "column" as total.

2- The conclusion should reflect the aim of the study; please add a sentence or two explaining the findings for the best model.
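The distinction between row and column percentages raised in minor comment 1 can be illustrated with a small worked example. The counts below are hypothetical and are not the study's data; they only show how the same cell yields two very different percentages depending on which total is used as the denominator.

```python
# Hypothetical 2x2 counts (sex by hypertension status); NOT the study's data.
counts = {
    "male": {"no_htn": 900, "htn": 100},
    "female": {"no_htn": 1100, "htn": 150},
}


def row_percentages(table):
    """Percent of each row total, e.g. 'of all men, what share has no HTN?'"""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result


def column_percentages(table):
    """Percent of each column total, e.g. 'of the no-HTN group, what share is male?'"""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }


row_pct = row_percentages(counts)
col_pct = column_percentages(counts)
print(f"row %:    {row_pct['male']['no_htn']:.1f}% of men have no HTN")
print(f"column %: {col_pct['male']['no_htn']:.1f}% of the no-HTN group are men")
```

A baseline table that compares characteristics between outcome groups normally reports column percentages, so that each group's percentages for a given characteristic sum to 100.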

Reviewer #2: The study titled "Machine learning-based models to predict the conversion of normal blood pressure to hypertension within 5-year follow-up" conducted by Andishgar and colleagues used ML models for prediction of hypertension. The study is well-designed. I have some major comments for improvement:

1- Abbreviations should be defined in their first use. Please ONLY use abbreviated forms after the definition (e.g., you defined ML several times).

2- The introduction is too long. Make it more concise.

3- Line 116: change "5" to "five"

4- Methods section 2: Have you excluded patients receiving anti-hypertensive drugs?

5- If possible, add external validation; else, mention it clearly in the discussion and limitations sections.

6- I found several typos and grammatical errors.

Reviewer #3: This manuscript comprehensively explores using machine learning (ML) techniques to predict hypertension risk in a rural Middle Eastern area. The study adopts a longitudinal design with an impressive initial sample size (10,118 participants) and follows up with 3,000 participants after five years. The insights into ML applications for hypertension risk prediction in a specific population are valuable, emphasizing the potential for early intervention. However, the moderate AUC suggests room for improvement, and the practical implementation of these models in healthcare would require further validation and consideration of real-world factors.

I have identified some points that could enhance the manuscript:

1. Lines 85-87: Clarify the term "few risk factors" by providing a specific range. Additionally, elaborate on "common machine learning approaches" by offering examples from previous studies.

2. Line 92: Introduce the acronym FACS before using it in this section for better reader understanding.

3. Lines 93-94: Distinguish between "common" and "established" machine learning techniques. Clarify how these terms differ and reconsider the use of "common" to avoid potential underestimation of previous approaches.

4. Line 108: Define the acronym NCDs before its use.

5. Line 108: Specify examples of "the most common ones" regarding NCDs for better context.

6. Line 117: Reevaluate the necessity of explicitly stating "Age between 35-70 years" as an inclusion criterion, given that it aligns with the larger study's age range.

7. Lines 123: Clarify the apparent conflict between the statement about missing data and the inclusion criterion.

8. Lines 123-124: Provide details on how the multiple imputation method was implemented. Specify if different ML models were trained on various imputed datasets and how this influenced later stages.

9. Lines 144-148: Explain the rationale behind choosing specific ML algorithms. Provide insights into why these algorithms were deemed suitable for the study.

10. Lines 156-158: Elaborate on your approach to combining hyperparameters. Explain whether a grid search or any specific method was used.

Addressing these points can improve the manuscript's clarity and give readers a more detailed understanding of the study's methodology and findings.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********


PLoS One. 2024 Mar 14;19(3):e0300201. doi: 10.1371/journal.pone.0300201.r002

Author response to Decision Letter 0


3 Feb 2024

Reviewer #1

1. ML as a predictive tool should be consistent with previous clinical findings. The top 30 important features do not seem to agree with some established hypertension risk factors. In this study, ALP level has much more predictive power than absence of "physical activity", which is considered a major risk factor for HTN.

Response:

We appreciate the reviewer's attention to detail. We also found this finding to be intriguing initially, and we believe it could be one of the significant contributions of our work to the existing literature. By identifying new potential risk factors, our study opens avenues for future research. Although our models did not exhibit the highest accuracies, we also observed that certain anthropometric features related to physical activity, such as waist-hip ratio and waist-to-height ratio, ranked higher than factors like ALP. Recent studies have emphasized the role of diet in the development of cerebrovascular diseases and diabetes, suggesting that it may be more influential than physical activity in these conditions. Furthermore, the association between ALP and hypertension has been discussed in the literature [1]. A prospective cohort study using data from the Kharameh cohort study, which is part of the Prospective Epidemiological Studies in Iran (PERSIAN) database, similar to ours, demonstrated that higher levels of ALP were associated with an increased risk of developing hypertension [2]. It is possible that this association is more pronounced genetically in Iran. Exploring its generalizability in future studies could be an interesting avenue to pursue. ALP has been suggested to be positively associated with hypertension due to its potential link to atherosclerosis and endothelial dysfunction [3]. Additionally, it has been proposed to be inversely related to endothelium-dependent vasodilation [4].

2. The authors reported the prevalence of hematuria in normal individuals as 630 out of 2300. This is a very high number for the prevalence of this key feature.

Response:

Thanks for bringing up this point. In our study, we used a validated dataset that was introduced and published in the International Journal of Epidemiology. We cited this dataset in our method section [5]. Additionally, the prevalence of hematuria we observed was 630 out of 2300 cases, which corresponds to about 27 percent. This prevalence could be reasonable, considering that a review study has reported hematuria prevalence rates of up to 31 percent in various populations [6] (“The reported prevalence of asymptomatic microhematuria (aMH) ranges between 1.7% and 31.1%”).

3. I ask the authors to add a logistic regression model, since this simple model has shown better performance than other models in many published studies.

Response:

As you suggested, we added a logistic regression model, as can be seen in the figures and the manuscript. It did not perform better than the LGBM model, and our final analysis and results did not change.

4. In Table 1, the percentages are reported with the "row" as the total; for example, the proportion of male sex is reported as 92.2% in individuals without HTN, which is incorrect. This should be changed to use the "column" as the total.

Response:

Thank you for mentioning this. It has been corrected in the manuscript.
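To illustrate the distinction the reviewer draws, the following minimal sketch computes a row percentage and a column percentage from the same cell of a 2x2 table. The split of counts by sex is hypothetical and chosen only so that the totals match the 2,288 participants and 251 new hypertension cases reported in the article; the function names are also illustrative.

```python
# Hypothetical 2x2 counts (sex by outcome). Only the grand total (2,288) and
# the number with hypertension (251) come from the article; the split by sex
# is invented for illustration.
counts = {
    "male": {"no_htn": 920, "htn": 80},
    "female": {"no_htn": 1117, "htn": 171},
}


def row_percent(table, row, col):
    """Share of one row (e.g. all males) that falls in the given column."""
    return 100 * table[row][col] / sum(table[row].values())


def column_percent(table, row, col):
    """Share of one column (e.g. everyone without HTN) that falls in the given row."""
    column_total = sum(cells[col] for cells in table.values())
    return 100 * table[row][col] / column_total


row_pct = row_percent(counts, "male", "no_htn")
col_pct = column_percent(counts, "male", "no_htn")
print(f"row %: {row_pct:.1f}, column %: {col_pct:.1f}")
```

With row totals, the figure answers "what share of males remained without HTN"; with column totals, it answers "what share of participants without HTN were male", which is the reading a baseline characteristics table normally intends.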

5. The conclusion should be according to the aim of the study, please add a sentence or two explaining the findings for best model.

Response:

Thank you for bringing up this point. We have included this statement in the conclusion section:

“LGBM emerged as the optimal model. It eventually introduced the top 30 features, highlighting the top 5 factors of higher baseline SBP, female gender, higher WHR, positive hematuria, and family history of hypertension significantly associated with hypertension development in the future. The model achieved an AUC of 0.67, f1-score=0.23 and AUC-PR=0.26.”
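For readers less familiar with the metrics quoted in this statement, here is a minimal sketch of how an F1-score and a ROC AUC can be computed from labels and predicted scores. The labels, scores, and the 0.5 threshold are hypothetical and are not the study's data; AUC-PR is not covered here.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced example: 2 positives among 10 participants.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.5, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The example shows why the two numbers can diverge on imbalanced data: the AUC depends only on how positives rank against negatives, whereas the F1-score also depends on the chosen threshold and is pulled down by false positives among the much larger negative group.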

Reviewer #2

1. Abbreviations should be defined in their first use. Please ONLY use abbreviated forms after the definition (e.g., you defined ML several times).

Response:

Thanks for mentioning this point. The issue with the abbreviations has been fully corrected in the manuscript.

2. The introduction is too long. Make it more concise.

Response:

Thank you for your comment. We have revised the introduction section to make it a little shorter.

3. Line 116: change "5" to "five".

Response:

Thanks. It is corrected in the manuscript.

4. Methods section 2: Have you excluded patients receiving anti-hypertensive drugs?

Response:

We admire your accuracy. Yes, we excluded patients with hypertension who met the same diagnostic criteria mentioned in the final outcome section. It is now emphasized in the manuscript. Thanks.

“3. Participants without hypertension diseases at the first phase (with the same diagnostic criteria mentioned in the final outcome section)”

5. If possible, add external validation; else, mention it clearly in the discussion and limitations sections.

Response:

Due to limitations in accessing "full" datasets from different cohorts, we were unable to perform external validation. We acknowledge the significance of this comment and have included a statement addressing this limitation in our discussion section:

“Furthermore, we were unable to perform external validation with our models due to limitations in accessing complete datasets from different cohorts.”

6. I found several typos and grammatical errors.

Response:

Thank you for mentioning that. We have thoroughly reviewed the text and made the necessary corrections to fix any typos and grammatical errors.

Reviewer #3

1. Lines 85-87: Clarify the term "few risk factors" by providing a specific range. Additionally, elaborate on "common machine learning approaches" by offering examples from previous studies.

Response:

That's an important point. In contrast to many other studies, we included a large number of variables (up to 160) in our machine learning models. Although we did a comprehensive literature search and are confident in this statement, providing the exact range of risk factors included in machine learning models in the existing literature is challenging, as we might inadvertently miss a study and report an inaccurate range. Thus, we have omitted this sentence entirely.

“However, the data from these studies have been primarily cross-sectional, and there is no evidence indicating the successful implementation of these algorithms in clinical settings in the rural Middle East areas.”

2. Line 92: Introduce the acronym FACS before using it in this section for better reader understanding.

Response:

Thank you for raising this point. We have fully described FACS in the method section, so we have omitted it from the introduction section to avoid redundancy.

“In this investigation, we aim to assess and contrast the efficacy of various ML methods utilizing a longitudinal rural middle eastern dataset to forecast”

3. Lines 93-94: Distinguish between "common" and "established" machine learning techniques. Clarify how these terms differ and reconsider the use of "common" to avoid potential underestimation of previous approaches.

Response:

Thanks for mentioning this point. This is solely due to our limited English language proficiency. Based on your valuable comment, we have decided to remove the words "common" and "established" to avoid confusion and the underestimation of previous studies.

4. Line 108: Define the acronym NCDs before its use.

Response:

Thanks. We defined the abbreviation NCD.

5. Line 108: Specify examples of "the most common ones" regarding NCDs for better context.

Response:

Thank you for bringing this up. We defined the abbreviation NCD and provided the most important example of NCDs that is present in our dataset and related to our work. The inappropriate phrase "the most common ones" has been omitted.

“FACS was created to assess the risk factors that predispose Fasa's rural residents to Non-Communicable Diseases (NCDs), including cardiovascular diseases.”

6. Line 117: Reevaluate the necessity of explicitly stating "Age between 35-70 years" as an inclusion criterion, given that it aligns with the larger study's age range.

Response:

Thank you for pointing out this important mistake. Based on your valuable comment, we have removed the age range of 35-70 from our inclusion criteria.

7. Lines 123: Clarify the apparent conflict between the statement about missing data and the inclusion criterion.

Response:

Thanks for your accurate comment. We changed inclusion criterion number 3 (participants with complete data) to participants with 5 years data available.

“2. Participants with 5 years data available”

8. Lines 123-124: Provide details on how the multiple imputation method was implemented. Specify if different ML models were trained on various imputed datasets and how this influenced later stages.

Response:

We apologize for the misunderstanding. We used two imputation methods and called this ‘multiple imputation’. We used mean and median imputation for continuous and categorical variables, respectively. This approach did not produce multiple datasets; there was only one imputed dataset. We corrected the term ‘multiple imputation’ in the manuscript.

“For continuous variables, mean imputation was employed, while for categorical variables, median imputation was used to replace missing data.”
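The single-imputation scheme described above can be sketched as follows. This is only an illustration under assumptions: the column names and values are hypothetical, the categorical variable is assumed to be ordinal-coded so that a median is defined, and the authors' actual implementation is not shown in the response.

```python
from statistics import mean, median


def impute(values, strategy):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns; None marks a missing entry.
sbp = [118.0, None, 124.0, 112.0]        # continuous variable: mean imputation
education_level = [1, 3, None, 2, 3, 3]  # ordinal-coded categorical: median imputation

sbp_imputed = impute(sbp, "mean")
education_imputed = impute(education_level, "median")
```

Each column yields exactly one filled-in version, which is why this is single rather than multiple imputation. Note that with an even number of observed values the median can fall between two category codes; `statistics.median_low` is one way to avoid that.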

9. Lines 144-148: Explain the rationale behind choosing specific ML algorithms. Provide insights into why these algorithms were deemed suitable for the study.

Response:

We used a variety of ML methods to make sure the dataset was thoroughly explored. Every algorithm possesses distinct advantages and disadvantages, and our objective was to evaluate each one's performance independently in several research domains. SVM was selected for its capability in handling high-dimensional data and finding complex relationships. RF was leveraged as an ensemble learning model. GNB provided a computationally efficient approach. LDA offered interpretability. KNN was implemented to detect local patterns. GBM sequentially refined model performance. XGB can perform with high accuracy in large datasets. CAT optimized categorical feature handling, and LGBM efficiently managed larger datasets with swift training. This multifaceted strategy sought to capitalize on the distinct advantages of every model, guaranteeing a thorough examination of the dataset. By not depending only on a single model, we were able to prevent any bias and obtain a comprehensive comprehension of the data.

The whole paragraph mentioned above was added to the method (section 6).

10. Lines 156-158: Elaborate on your approach to combining hyperparameters. Explain whether a grid search or any specific method was used.

Response:

For the hyper-parameter tuning step, the grid search approach was employed. This sentence was added to the manuscript.
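As a rough illustration of what a grid search does, the sketch below scores every combination in a small parameter grid and keeps the best one. The grid values and the `toy_score` function are hypothetical stand-ins (in practice the score would be a cross-validated metric such as AUC); the grids actually searched are reported in S2 Table. Libraries such as scikit-learn provide this procedure ready-made as `GridSearchCV`.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: 3 x 3 = 9 combinations are evaluated exhaustively.
param_grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "num_leaves": [15, 31, 63],
}


def toy_score(params):
    """Stand-in for a cross-validated metric; peaks at learning_rate=0.1, num_leaves=31."""
    return 1.0 - abs(params["learning_rate"] - 0.1) - abs(params["num_leaves"] - 31) / 100


best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows multiplicatively with each added hyperparameter, which is the main trade-off of an exhaustive grid compared with random or adaptive search.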

Attachment

Submitted filename: Response to the Reviewers.docx

pone.0300201.s005.docx (30.2KB, docx)

Decision Letter 1

Amir Hossein Behnoush

23 Feb 2024

Machine learning-based models to predict the conversion of normal blood pressure to hypertension within 5-year follow-up

PONE-D-23-44016R1

Dear Dr. Tabrizi,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Amir Hossein Behnoush

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: (No Response)

Reviewer #3: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Many thanks for the precise responses. All comments are addressed properly and the manuscript meets acceptance criteria now.

Reviewer #2: (No Response)

Reviewer #3: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

Acceptance letter

Amir Hossein Behnoush

4 Mar 2024

PONE-D-23-44016R1

PLOS ONE

Dear Dr. Tabrizi,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Amir Hossein Behnoush

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. All characteristics and clinical features of participants.

    (DOCX)

    pone.0300201.s001.docx (14.1KB, docx)
    S2 Table. Finding the appropriate hyper-parameter values for each algorithm after hyper-parameter tuning.

    #Abbreviations, LR; Logistic Regression, SVM; Support Vector Machine, RF; Random Forest, GNB; Gaussian Naive Bayes, LDA; Linear Discriminant Analysis, KNN; K-Nearest Neighbors, GBM; Gradient Boosting Machine, XGB; Extreme Gradient Boosting, CAT; Cat Boost, LGBM; Light Gradient Boosting Machine.

    (DOCX)

    pone.0300201.s002.docx (13.4KB, docx)
    S3 Table. Performance of the ten machine learning algorithms using all features.

    #Abbreviations, LR; Logistic Regression, SVM; Support Vector Machine, RF; Random Forest, GNB; Gaussian Naive Bayes, LDA; Linear Discriminant Analysis, KNN; K-Nearest Neighbors, GBM; Gradient Boosting Machine, XGB; Extreme Gradient Boosting, CAT; Cat boost, LGBM; Light Gradient Boosting Machine, AUC; Area Under the ROC Curve, ROC; Receiver operating characteristic, AUC-PR; Area Under the Precision-Recall curve.

    (DOCX)

    pone.0300201.s003.docx (14.5KB, docx)
    S4 Table. Performance of the LGBM model with different number of features.

    #Abbreviations, LGBM; Light Gradient Boosting Machine, AUC; Area Under the ROC Curve, ROC; Receiver operating characteristic, AUC-PR; Area Under the Precision-Recall curve.

    (DOCX)

    pone.0300201.s004.docx (13.9KB, docx)
    Attachment

    Submitted filename: Response to the Reviewers.docx

    pone.0300201.s005.docx (30.2KB, docx)

    Data Availability Statement

    In our institutional policy, it is not stated that the data should be made public, and a data and material transfer agreement should not allow further transfer of data without the provider's prior written consent. However, the data can be made available upon request from the corresponding author, who is a member of this team. Additionally, the dataset generated for this study is available upon request to the Fasa Non-Communicable Diseases Research Center management team. They can be contacted via telephone at +987153314068 or via email at ncdrc.fums.ac.ir@gmail.com.


    Articles from PLOS ONE are provided here courtesy of PLOS
