Skip to main content
Digital Health logoLink to Digital Health
. 2025 Mar 14;11:20552076251326014. doi: 10.1177/20552076251326014

Advancing personalized medicine in digital health: The role of artificial intelligence in enhancing clinical interpretation of 24-h ambulatory blood pressure monitoring

Sreyoshi F Alam 1,, Charat Thongprayoon 1,, Jing Miao 1, Justin H Pham 1, Mohammad S Sheikh 1, Oscar A Garcia Valencia 1, Gary L Schwartz 1, Iasmina M Craici 1, Maria L Gonzalez Suarez *, Wisit Cheungpasitporn *,
PMCID: PMC11909660  PMID: 40093710

Abstract

Background: The use of artificial intelligence (AI) for interpreting ambulatory blood pressure monitoring (ABPM) data is gaining traction in clinical practice. Evaluating the accuracy of AI models, like ChatGPT 4.0, in clinical settings can inform their integration into healthcare processes. However, limited research has been conducted to validate the performance of such models against expert interpretations in real-world clinical scenarios. Methods: A total of 53 ABPM records from Mayo Clinic, Minnesota, were analyzed. ChatGPT 4.0's interpretations were compared with consensus results from two experienced nephrologists, based on the American College of Cardiology/American Heart Association (ACC/AHA) guidelines. The study assessed ChatGPT's accuracy and reliability over two rounds of testing, with a three-month interval between rounds. Results: ChatGPT achieved an accuracy of 87% for identifying hypertension, 89% for nocturnal hypertension, 81% for nocturnal dipping, and 94% for abnormal heart rate. ChatGPT correctly identified all conditions in 60% of ABPM records. The percentage agreement between the first and second round of ChatGPT's analysis was 81% in identifying hypertension, 85% in nocturnal hypertension, 89% in nocturnal dipping, and 94% in abnormal heart rate. There was no significant difference in accuracy between the first and second round (all p > 0.05). The Kappa statistic was 0.63 for identifying hypertension, 0.66 for nocturnal hypertension, 0.76 for nocturnal dipping, and 0.70 for abnormal heart rate. Conclusions: ChatGPT 4.0 demonstrates potential as a reliable tool for interpreting 24-h ABPM data, achieving substantial agreement with expert nephrologists. These findings underscore the potential for AI integration into hypertension management workflows, while highlighting the need for further validation in larger, diverse cohorts.

Keywords: Digital health, artificial intelligence, 24-h ambulatory blood pressure monitoring, hypertension, personalized medicine, chatGPT, clinical decision support, nocturnal hypertension, nocturnal dipping, heart rate analysis

Background

Hypertension is a global health concern, affecting approximately 1.13 billion people worldwide, according to the World Health Organization. 1 It is a leading risk factor for cardiovascular diseases, including stroke, myocardial infarction, and heart failure. The prevalence of hypertension is increasing, driven by factors such as aging populations, sedentary lifestyles, and unhealthy dietary habits. 1 Accurate and timely diagnosis of hypertension and related conditions is essential for effective management and prevention of adverse outcomes.

Ambulatory Blood Pressure Monitoring (ABPM) is a crucial tool in modern medicine, enabling the continuous measurement of blood pressure over a 24-h period. 2 This technique offers a comprehensive view of an individual's blood pressure fluctuations throughout the day and night, providing valuable insights that single-point measurements cannot. ABPM is particularly useful in diagnosing and managing conditions such as hypertension, hypotension, nocturnal hypertension, and other related cardiovascular anomalies. 2 However, the interpretation of ABPM data is often complex and time-consuming, requiring expert analysis to ensure accurate diagnosis and appropriate management. ABPM plays a vital role in the epidemiological landscape of hypertension. Studies have shown that ABPM provides more accurate and predictive information about cardiovascular risks compared to traditional office blood pressure measurements. It can help in identifying cases of white-coat hypertension, masked hypertension, and nocturnal hypertension, which are often missed during clinic visits. 3 Despite its benefits, the widespread adoption of ABPM is limited by the need for specialized interpretation, which is both resource-intensive and dependent on expert clinicians. 2

Artificial Intelligence (AI) has emerged as a transformative force in various fields, including healthcare.4,5 AI models, particularly those based on advanced neural networks, have shown remarkable potential in interpreting complex medical data. 6 It has increasingly been implemented in the field of hypertension. Machine learning (ML) models have shown superior performance over traditional methods in predicting hypertension, detecting early cardiac dysfunction, and accurately classifying heart failure (HF) stages, which can lead to more targeted and effective treatments. 7 The potential of AI involves prediction, diagnosis, and management of hypertension and HF, highlighting its ability to enhance patient care at every stage. ChatGPT and Natural language processing (NLP) AI have been tested in multiple fields ranging from answering board questions 8 to summarizing clinical guidelines, assisting in clinical decision support, and acting as an educational tool for both patients and healthcare providers. 9

The integration of AI in clinical practice promises to enhance diagnostic accuracy, streamline workflows, and ultimately improve patient outcomes. 10 Among these AI models, ChatGPT 4.0, developed by OpenAI, represents a significant advancement in natural language processing and understanding, with potential applications in the interpretation of medical data. 11 The utility of AI in interpreting ABPM data is increasingly recognized. AI models can analyze large volumes of data rapidly and consistently, potentially outperforming human experts in certain tasks. 11 Evaluating the accuracy of AI models, such as ChatGPT 4.0, in clinical settings can provide insights that facilitate their integration into healthcare processes, making ABPM analysis more accessible and reliable. Current guidelines recommend the use of ABPM for the diagnosis and classification of hypertension. 12 However, AI has not yet been widely applied to interpreting ambulatory blood pressure monitor readings. Image processing, one of AI's strong suits, can assist medical professionals, including hypertension specialists and nephrologists, in augmenting their interpretation of ABPM data. While AI cannot replace physicians, it can expedite the process, allowing more time for critical clinical tasks and enabling faster and more efficient extraction of data from 24-h blood pressure readings.

This study aims to assess the performance of ChatGPT 4.0 in interpreting 24-h ABPM records and compare its accuracy with expert interpretations, which could have significant implications for the future of hypertension management and the broader application of AI in healthcare. However, limited research has been conducted to validate the performance of these models against expert interpretations in real-world clinical scenarios. Most existing studies focus on the theoretical potential of AI, with few addressing practical implementation and validation. This gap in research hinders the adoption of AI in clinical practice, as clinicians and healthcare providers require robust evidence of the models’ accuracy and reliability before they can be integrated into patient care. This study addresses this critical gap by evaluating ChatGPT 4.0's performance in interpreting ABPM data from patients at Mayo Clinic, Minnesota.

2. Materials and methods

2.1. Case selection

The study was conducted at Mayo Clinic, Minnesota. The Mayo Clinic Institutional Review Board approved this study (IRB number 24-002829) and exempted the need for informed consent because this was a minimal risk study solely involving chart review.

To ensure internal validity, strict inclusion and exclusion criteria were applied to minimize confounding variables. Patients aged 18 years or older with complete 24-h ABPM recordings that met calibration standards were included. Pediatric patients, incomplete ABPM data, and recordings shorter than 24 h were excluded. A total of 53 ABPM records were randomly selected for this study. ABPM data were calibrated using a standardized protocol. Manual blood pressure measurements were taken in three positions (sitting, standing, and lying down) and compared to simultaneous automated measurements. Calibration was considered satisfactory if the difference between manual and automated systolic and diastolic averages was 8 mm Hg or lower. These calibrated readings ensured data accuracy and minimized variability. ABPM is routinely conducted in our department to diagnose hypertension, monitor management strategies, and assess nocturnal hypertension.

The data were deidentified to remove any patient identification information. Inclusion criteria required the availability of ABPM data at our facility for patients aged 18 years and older, with data collected over a one-month period and satisfactory ABPM results (as defined by the criteria stated above). Pediatric patients were excluded to maintain the homogeneity of the data. Additionally, 6- and 12-h ABPM readings were excluded from the study. Table 1 provides a detailed summary of the demographic characteristics of the study cohort. The mean age of the 53 patients was 66.74 years, with a gender distribution of 54.72% females and 45.28% males. The selection process ensured the inclusion of patients with diverse blood pressure profiles, including hypertension, nocturnal hypertension, normal nocturnal dipping, tachycardia, and bradycardia.

Table 1.

Patient sample demographics.

Age Gender Ethnicity Comorbidities
75 Male White Hypertension, Hyperlipidemia, Prediabetes,Psoriasis
70 Male White Hypertension,Obstructive sleep apnea,Type 2 DM,Hyperlipidemia, CKD3A, Atrial Fibrillation,Cardiomyopathy
84 Female White Hypertension, Hyperlipidemia, GERD,Hyponatremia
68 Male White Hypertension,Hyperlipidemia,Type 2 DM,Obesity,Osteoarthritis
67 Male White Hypertension,Cardiomyopathy,Type 2 DM, Osteoarthritis,Obesity
89 Female White Hypertension,Hyperlipidemia,ILD,Osteoarthritis
72 Male White Aortic Stenosis, Hypothyroidism,Prediabetes
83 Male White CAD,Hyperlipidemia,Hypertension,Hypothyroidism
73 Female White Hypertension,Hyperlipidemia,Connective Tissue Disease
68 Female White Hypertension,Hyperlipidemia, Cardiomyopathy,Type 2 DM
65 Female White Type 2 DM
71 Female White Hypertension,GERD,Osteoporsis,IBS,Migraine
89 Male White Hypertension,SNHL,GERD
86 Male White Hypertension,Mitral Valve Prolapse,Mitral Regurgitation, BPPV
56 Female White Hypertension, SVT, Obesity,Hyperlipidemia
84 Female White Hypertension,Obstructive sleep apnea, Osteoarthritis,CAD
62 Female Asian Hypothyr,POTS,Hyperlipidemia
82 Male White CAD,LBBB,Obstructive sleep apnea,GERD,Aortic Regurgitation,HFPEF
55 Female White Hyperlipidemia,Hypothyroidism,PTSD
74 Male White BPH, Pharyngeal Carcinoma,CAD
69 Female White Hypertension,Osteoarthritis,GERD,Hyperlipidemia,MDD
78 Female Asian Hypertension,CKD4, Osteoporosis,Cataract
77 Female White Lumbar stenosis,Compression fracture
82 Female White Hypertension,MGUS,CAD
77 Male White CKD3b, Hyperlipidemia, Obesity, Hypomagnesemia, Type 2 DM,HFREF
81 Male White Hypertension, Orthostatic Hypotension, Obstructive sleep apnea,Osteoarthritis,BPH,HIV,MDD,MCAD
45 Female White Orthostatic Hypotension, Obesity
79 Female Asian Hypertension,Hyperlipidemia,Aortic Stenosis
73 Female White Hypertension, Orthostatic Hypotension, Obesity,Hyperlipidemia,Atrial Fibrillation,AL Amyloidosis,Peripheral Neuropathy
74 Male Hispanic Orthostatic Hypotension, BPH, Failed renal allograft, Recurrent IgA nephropathy
76 Male White Hypertension,Atrial Fibrillation,AAA, Type 2 DM,Osteoporosis, Osteoarthritis
64 Male White Hypertension, Atrial Tachycardia,Hypothyroidism,Hyperlipidemia,Hypercalcemia,Hyperglycemia
81 Female White Hypertension, RAS, Prediabetes,Hypothyroidism, Obesity
65 Male White Atrial Fibrillation, GERD,Obstructive sleep apnea,Osteoarthritis
78 Male White Hypertension, Atrial Fibrillation,Hyperlipidemia,CAD,Peripheral Neuropathy
62 Male White Hypertension,PFO,CVA
74 Male White CAD,Atrial Fibrillation,Hypertension,IBS,Osteoarthritis,Hyperlipidemia,
64 Female Hispanic Hyperlipidemia, Type 2 DM,Carotid Stenosis
44 Male Asian Obstructive Sleep Apnea, Anxiety,Tobacco use,
42 Male Armenian Type 2 DM,CVA, Cardiac Transplant,Pulmonary Embolism
31 Female White Postural Orthostatic Tachycardia Syndrome, scoliosis,HSP,Preeclampsia
62 Female Black CAD,Orthostatic Hypotension,Hyperlipidemia
67 Female White Hyperlipidemia, Orthostatic Hypotension, Parkinsonism,
70 Male Asian Pituitary Macroadenoma
22 Female White Fibromyalgia,Anxiety Type 2 DM, Hypertryglyceridemia
25 Female White POTS, Addisons Disease
46 Female White Fibromyalgia, Asthma
76 Female White Hypertension, GERD,Obesity, Obstructive sleep apnea,Hyperlipidemia,Type 2 DM
48 Male White Autonomic dysfunction,Orthostatic Hypotension
44 Female White GERD, Migraine,MDD, Anxiety,PTSD
48 Female White GERD,Sjogren's Syndrome,Hyperlipidemia
72 Female White Hypertension,Hyperlipidemia,Parkinsons Disease
68 Male White Hypertension,Hyperlipidemia,CAD,RAS,BPH

Acronyms: DM: Diabetes Mellitus, CKD: Chronic Kidney Disease, GERD: Gastroesophageal Reflux Disease, CAD: Coronary Artery Disease, IBS: Irritable Bowel Syndrome, BPPV: Benign Paroxysmal Positional Vertigo, POTS: Postural Orthostatic Tachycardia Syndrome, MDD: Major Depressive Disorder, MGUS: Monoclonal Gammopathy Of Unknown Significance, BPH: Benign Prostatic Hyperplasia, HIV: Human Immunodeficiency Virus, MCAD: Mast Cell Activation Disease, AAA: Abdominal Aortic Aneurysm, RAS: Renal Artery Stenosis, PTSD: Posttraumatic Stress Disorder

2.2. Device and data acquisition

The ABPM data were collected using a validated 24-h ABPM device, specifically the Spacelabs ambulatory blood pressure monitors, which record blood pressure at regular intervals throughout the day and night. These devices provide a comprehensive dataset for each patient, including systolic and diastolic blood pressure readings, heart rate, and time-stamped annotations. The Spacelabs monitors are annually validated for calibration accuracy. Each year, all monitors are sent to our laboratory facility at the Mayo Clinic, where they undergo 24-h testing on specialized calibration devices to ensure proper functionality and accuracy before use.

For each patient, prior to initiating 24-h blood pressure monitoring, trained staff manually measure blood pressure in both arms across three positions (lying down, sitting, and standing) using a standard sphygmomanometer. These six manual measurements are compared with simultaneous readings from the ABPM device, and the device is considered calibrated if the difference in systolic and diastolic averages between manual and device measurements is ≤8 mmHg. Once calibrated, the cuff is connected to the patient, and 24-h readings are initiated. All ABPM data collected prior to this study were meticulously checked for completeness and consistency before being processed by ChatGPT.

2.2.1. Prompt development and structure

The prompt used with ChatGPT was carefully crafted to ensure clear and accurate alignment with the study's objectives. Our team undertook an iterative process to develop a structured and detailed prompt. It clearly defined parameters such as thresholds for hypertension, criteria for nocturnal dipping, and classifications for heart rate, following the American College of Cardiology/American Heart Association (ACC/AHA) guidelines. The goal was to create a straightforward, unambiguous command that would optimize ChatGPT's interpretative accuracy while minimizing irrelevant outputs. The exact prompt used in this study is provided in Table 2 for reference.

Table 2.

Prompt and parameter and definitions provided to chat GPT prior to interpretation of ABPM.

Parameter Data and Definitions
Prompt “We will provide ambulatory blood pressure readings accompanied by images, please provide evaluations based on the following criteria:
Hypertension Assessment Determine if the patient meets the criteria for hypertension. The criteria for hypertension are met if the average blood pressure is above or equal to 130/80 mmHg. Answer with “Yes” for hypertension, “No” if the criteria are not met.
Hypotension Assessment Evaluate whether the patient meets the criteria for hypotension, defined as an average blood pressure less than 90/60 mmHg. Answer with “Yes” for hypotension, “No” if the criteria are not met.
Nocturnal Hypertension Assessment If nocturnal blood pressure data is available, determine if the patient has nocturnal hypertension, defined as a nocturnal blood pressure greater than 120/70 mmHg. Answer “Yes” for nocturnal hypertension, “No” if the criteria are not met, and “Not Available” if nocturnal blood pressure readings are not provided.
Normal dipper pattern Assessment when nocturnal blood pressure data is available; normal dipper pattern is defined by nocturnal BP falls by >10% of the daytime average BP value. Answer with “Yes” for normal dipper pattern, “No” if the criteria are not met.
Heart Rate Concerns Identify any concerns regarding tachycardia or bradycardia based on averaged heart rate readings. Tachycardia is defined as an averaged heart rate greater than 100 beats per minute (BPM), and bradycardia is defined as an averaged heart rate less than 60 BPM. Provide your answer as “tachycardia” if the criteria for tachycardia are met, “bradycardia” if the criteria for bradycardia are met, or “normal range” if the heart rate does not meet the criteria for either condition.
End of prompt Please ensure the evaluation is thorough and based on the provided criteria, ensuring an accurate and detailed response.

2.2.2. Data extraction from ABPM recordings

The ABPM recordings provided both graphical and numerical data outputs. The graphical representation (Figure 1(a)) depicted 24-h blood pressure trends, while the numerical dataset (Figure 1(b)) included key parameters such as average systolic and diastolic blood pressure readings, heart rate, and timestamps. The graphical data were processed as images, while the numerical values were extracted into a tabular format for standardization. This dual-input approach ensured that ChatGPT received comprehensive data for accurate interpretation. All graphical and numerical data were meticulously verified for completeness and consistency prior to being processed by ChatGPT.

Figure 1.

Figure 1.

(a) ABPM graph, (b) average numerical data.

2.3. Ai model description

ChatGPT 4.0, developed by OpenAI, is an advanced NLP model designed to interpret and analyze data. For this study, this feature was utilized. The NLP model was provided with visual data of ABPM readings that included both the graphical representation of 24 h blood pressure readings as well as average numerical value (Figure 1). To standardize reporting,a set of predefined criteria (prompt) was provided.

2.4. Evaluation procedure

The definitions (Table 2) of hypertension, nocturnal hypertension, normal nocturnal dipping, normal heart rate, tachycardia, and bradycardia according to the ACC/AHA guidelines, were provided to ChatGPT 4 before the query. The data were collected, and the image/graph and the average readings were extracted from each of the ABPM data. Since ChatGPT is an NLP model capable of interpreting image data, the extracted image was provided (Figure 1). The interpretation parameters (Table 2) were set and provided to Chat GPT prior to interpretation. Each ABPM record was individually entered into ChatGPT for evaluation. The evaluation of ChatGPT's performance in the same 53 ABPM records was independently conducted in two separate rounds in April and August 2024 to observe the consistency and reliability of ChatGPT's response over time. The same prompt was used for both rounds.

The prompt used with ChatGPT was carefully developed to ensure that the model accurately understood and adhered to our clinical objectives. This process included extensive testing and refinement to establish a standardized input format. The prompt clearly defined parameters such as thresholds for hypertension, criteria for nocturnal dipping, and classifications for heart rate, following the guidelines of the ACC/AHA. To enhance reproducibility and minimize variability, we provided ChatGPT with both graphical representations of 24-h ABPM data and the corresponding average numerical values for analysis. The detailed prompt is provided in Table 2.

2.5. Reference standard for ABPM readings

To establish a reliable reference standard for assessing ChatGPT's performance, two experienced nephrologists independently reviewed the ABPM data using predefined criteria and guidelines. In cases of disagreement, a structured consensus process was implemented, involving joint discussions to resolve discrepancies and ensure accurate interpretations based on the established parameter definitions. The level of agreement between the nephrologists prior to reaching a consensus was quantified using the kappa statistic. They reached a consensus on the presence or absence of specific conditions following the ACC/AHA guidelines and institutional criteria. These consensus interpretations served as the reference standard for evaluating ChatGPT's accuracy.

2.6. Statistical analysis

The accuracy of ChatGPT's responses was evaluated by comparing its outputs to the consensus opinion of a panel of experienced nephrologists, which served as the reference standard. To assess potential improvement in performance over time, the accuracy, sensitivity, and specificity of ChatGPT in interpreting ABPM data were analyzed across two assessment rounds using McNemar's test. McNemar's test is specifically designed to compare paired proportions, making it well-suited for this study, as the same set of ABPM records was evaluated in both rounds under comparable conditions. This test evaluates whether there is a statistically significant difference in paired binary outcomes (e.g., correct vs. incorrect interpretations) between two time points. The assumptions of McNemar's test, including the requirement for paired data and binary outcomes, were met. P-values from the test were reported to identify any significant differences between the two rounds.

The level of agreement between ChatGPT's interpretations in the first and second rounds was quantified using both percentage agreement and the Kappa statistic. The Kappa statistic adjusts for chance agreement and provides a robust measure of reliability, with values ranging from 0 (no agreement) to 1 (perfect agreement). For interpretive clarity, Kappa values were categorized into five levels: 0.01–0.20 indicating slight agreement, 0.21–0.40 indicating fair agreement, 0.41–0.60 indicating moderate agreement, 0.61–0.80 indicating substantial agreement, and 0.81–1.00 indicating near-perfect agreement. Additionally, the prevalence-adjusted bias-adjusted Kappa (PABAK) was calculated to account for potential biases introduced by the prevalence of certain conditions in the dataset. All statistical analyses were performed using JMP statistical software, version 17.0 (SAS Institute Inc., Cary, NC), which facilitated comprehensive data management and analysis. The software enabled the application of McNemar's test and calculation of both Kappa and PABAK statistics, ensuring a thorough evaluation of the agreement and reliability of ChatGPT's performance.

3. Results

The 53 ABPM records were selected after implementing the inclusion and exclusion criteria.

3.1. Accuracy of ChatGPT response in interpreting ABPM compared with nephrologist review

In the first round, ChatGPT correctly interpreted the presence or absence of hypertension in 46 (87%) records, nocturnal hypertension in 47 (89%) records, nocturnal dipping in 43 (81%) records, and an abnormal heart rate in 50 (94%) records (Figure 2). It correctly interpreted all conditions in 32 (60%) records. The sensitivity was 79% in identifying hypertension, 82% in nocturnal hypertension, 68% in nocturnal dipping, 71% in abnormal heart rate. The specificity was 96% in identifying hypertension, 92% in nocturnal hypertension, 96% in nocturnal dipping, and 98% in abnormal heart rate (Table 3).

Figure 2.

Figure 2.

Percentage of agreement between ChatGPT and nephrologist ABPM interpretation by category for each round.

Table 3.

Accuracy, Sensitivity, and Specificity of ChatGPT in ABPM Interpretation:Comparison of ChatGPT’s accuracy, sensitivity, and specificity in interpreting ambulatory blood pressure monitoring (ABPM) data, as assessed in two independent rounds. The AI model’s performance is evaluated against consensus interpretations by expert nephrologists, with p-values indicating statistical differences between the two rounds.

Accuracy Sensitivity Specificity
1st round 2nd round p-value 1st round 2nd round p-value 1st round 2nd round p-value
Hypertension 46/53 (87%) 48/53 (91%) 0.75 23/29 (79%) 27/29 (93%) 0.22 23/24 (96%) 21/24 (88%) 0.63
Nocturnal hypertension 47/53 (89%) 49/53 (92%) 0.73 14/17 (82%) 16/17 (94%) 0.63 33/36 (92%) 33/36 (92%) 1.00
Nocturnal dip 43/53 (81%) 45/53 (85%) 0.69 19/28 (68%) 21/28 (75%) 0.63 24/25 (96%) 24/25 (96%) 1.00
Abnormal heart rate 50/53 (94%) 51/53 (96%) 0.98 5/7 (71%) 5/7 (71%) 1.00 45/46 (98%) 46/46 (100%) 1.00
All conditions 32/53 (60%) 36/53 (68%) 0.48 - - - - - -

In the second round, ChatGPT correctly interpreted the presence or absence of hypertension in 48 (91%) records, nocturnal hypertension 49 (92%) records, nocturnal dipping in 45 (85%) records, and an abnormal heart rate in 51 (96%) records. It correctly answered all conditions in 36 (68%) records in this round. The sensitivity was 93% in identifying hypertension, 94% in nocturnal hypertension, 75% in nocturnal dipping, and 71% in abnormal heart rate. The specificity was 88% in identifying hypertension, 92% in nocturnal hypertension, 96% in nocturnal dipping, and 100% in abnormal heart rate (Table 3).

There was no significant difference in accuracy, sensitivity, and specificity between the first and second round (all p > 0.05).

3.2. Agreement between the first and second rounds of ChatGPT response

The percentage agreement between the first and second round of ChatGPT was 81% in identifying hypertension, 85% in nocturnal hypertension, 89% in nocturnal dipping, and 94% in abnormal heart rate (Figure 3). The Kappa agreement statistic was 0.63 in identifying hypertension, 0.66 in nocturnal hypertension, 0.76 in nocturnal dipping, and 0.70 in abnormal heart rate. The prevalence-adjusted bias-adjusted Kappa statistic was 0.61 in identifying hypertension, 0.70 nocturnal hypertension, 0.78 in nocturnal dipping, and 0.88 in abnormal heart rate (Table 4).

Figure 3.

Figure 3.

Intra-rater percentage of agreement between ChatGPT round 1 and ChatGPT round 2.

Table 4.

Agreement Between ChatGPT’s First and Second Rounds of ABPM Interpretation: Assessment of intra-rater reliability in ChatGPT’s ABPM interpretations across two rounds. Agreement percentages, kappa statistics, and prevalence-adjusted bias-adjusted kappa values are reported for key diagnostic categories, including hypertension, nocturnal hypertension, nocturnal dipping, and abnormal heart rate.

Agreement percentage Kappa Prevalence adjusted bias-adjusted Kappa
Hypertension 43/53 (81%) 0.63 0.61
Nocturnal hypertension 45/53 (85%) 0.66 0.70
Nocturnal dip 47/53 (89%) 0.76 0.78
Abnormal heart rate 50/53 (94%) 0.70 0.88

4. Discussion

ChatGPT 4.0 demonstrated accuracy in detecting hypertension (84.91%) and hypotension (94.34%). These conditions have well-defined diagnostic thresholds, which likely contributed to the AI's strong performance. Hypertension is a significant risk factor for cardiovascular diseases, and accurate diagnosis is critical for effective management. The AI's ability to reliably identify hypertension could aid in early diagnosis and intervention, potentially reducing the burden of cardiovascular diseases.

Hypotension, although less common than hypertension, can also have serious health implications if left untreated. The AI's high accuracy in detecting hypotension highlights its potential utility in identifying this condition, which is often overlooked due to its less symptomatic nature. This capability is particularly valuable in clinical settings where quick and accurate diagnosis is essential.

The AI achieved an accuracy of 81.13% in detecting nocturnal hypertension and 75.47% in identifying a normal nocturnal dip. Nocturnal hypertension, characterized by elevated blood pressure during sleep, is a critical condition often associated with increased cardiovascular risks. Accurate identification of nocturnal hypertension is crucial as it often goes undetected with standard daytime measurements. The AI's performance in this area indicates a reliable level of accuracy, although further improvements are necessary. The normal nocturnal dip, the natural decrease in blood pressure during sleep, is an important indicator of cardiovascular health. 13 Deviations from the normal pattern can signal potential health issues. Enhancing the AI's accuracy in identifying a normal nocturnal dip could further improve its utility in clinical settings, aiding in the diagnosis of various cardiovascular conditions.

ChatGPT 4.0 showed good performance in identifying normal heart rates, with an accuracy of 90.57%. This result underscores the AI's capability in recognizing standard physiological parameters, which is fundamental for distinguishing between normal and abnormal conditions. High accuracy in this area is essential for ensuring that the AI does not flag normal readings as pathological, thereby reducing false positives and unnecessary interventions. The AI's performance in detecting tachycardia (73.58%) and bradycardia (71.70%) was comparatively lower. Tachycardia, defined as an abnormally high heart rate, can be indicative of various underlying conditions, including cardiac arrhythmias and other systemic issues. Bradycardia, characterized by an abnormally low heart rate, can sometimes be benign but may also indicate serious health problems, particularly if symptomatic. While the AI's accuracy in detecting tachycardia and bradycardia is promising, targeted improvements are necessary to enhance its reliability in these areas. Enhancing accuracy in detecting tachycardia and bradycardia would be beneficial, especially in preventing misdiagnosis and ensuring timely treatment.

The findings of this study align with previous research that highlights the potential of AI in medical diagnostics. 14 Prior studies have demonstrated the ability of AI models to accurately interpret various types of medical data, including imaging and laboratory results.6,14,15 However, most existing studies focus on the theoretical potential of AI, with few addressing practical implementation and validation in clinical settings.16,17 This study contributes to bridging this gap by providing evidence of ChatGPT 4.0's performance in a real-world clinical scenario. The accuracy rates for hypertension and hypotension are consistent with earlier findings that highlight AI's potential to enhance diagnostic accuracy. 18 However, the lower performance in tachycardia and bradycardia detection indicates specific areas where AI models still need improvement. This divergence from previous studies underscores the importance of validating AI models in diverse clinical settings to identify and address potential limitations.16,19

Several factors could explain the varied accuracy rates observed in this study. For conditions like hypertension and hypotension, clear diagnostic thresholds allow the AI to make accurate determinations. The well-defined nature of these conditions means that the AI can reliably identify patterns associated with elevated or reduced blood pressure. In contrast, conditions like tachycardia and bradycardia involve more nuanced patterns and variability in heart rate, which may challenge the AI's pattern recognition capabilities. The data used for the AI model may have been insufficient or imbalanced, limiting its ability to generalize to these more complex conditions. Additionally, the complexity of interpreting heart rate data, which can be influenced by a wide range of factors including physical activity, stress, and underlying health conditions, may contribute to the AI's lower performance in these areas.

The findings of this study have several implications for clinical practice. The high accuracy of ChatGPT 4.0 in detecting hypertension and hypotension suggests that AI can be a valuable tool in diagnosing these conditions. Integrating AI into clinical workflows could enhance diagnostic accuracy and efficiency, allowing healthcare providers to identify and manage cardiovascular conditions more effectively. 14 However, the lower performance in detecting tachycardia and bradycardia indicates that AI models need further refinement before they can be fully trusted in these areas. Continued development and training on more diverse datasets are necessary to improve the AI's accuracy and reliability. Additionally, incorporating feedback from clinical experts during the AI's training phase could help to fine-tune its diagnostic capabilities. Although we need to take into consideration the human errors in data interpretation.

This study has several strengths, but the limitations should be acknowledged. This study utilized 53 ABPM recordings to assess the performance of ChatGPT 4.0 in interpreting 24-h ABPM data compared to expert nephrologists. The sample size was limited due to the additional resources required to obtain 24-h ABPM data, which is less commonly performed compared to 6-h or 12-h recordings. While this smaller sample size reflects the constraints of this exploratory, proof-of-concept study, it serves as a foundation for evaluating the feasibility of leveraging AI in clinical settings. The patient cohort included individuals from diverse demographic and geographic backgrounds, spanning national and international origins, which enhances the representativeness of the dataset despite the sample size limitations. A random selection process was employed to minimize bias, and all data were de-identified before analysis. To ensure the broader applicability of these findings, future studies with larger sample sizes and more diverse populations are planned to validate and expand upon these preliminary results.

The relatively small sample size of 53 cases may limit the generalizability of the findings. A larger sample size would provide more robust data and potentially yield different results. This was an initial exploratory study, and in the future, larger studies with more patient data over longer periods of time should be evaluated. Additionally, the study was conducted at a single institution (Mayo Clinic), which may not represent the diversity of patient populations seen in other clinical settings, although at Mayo Clinic we have a wide array of patients from both national and international backgrounds; however, 83% of the patient sample was White. This study has several limitations that impact its internal and external validity. The small sample size and single-center design limit the generalizability of the findings. Additionally, while confounding variables were minimized through strict inclusion criteria and standardized ABPM calibration, unmeasured factors such as patient comorbidities and medication use were not explicitly controlled. Future studies will address these limitations by incorporating larger, multi-center datasets and more comprehensive patient-level data to enhance both internal and external validity. Multi-center studies involving varied patient demographics are necessary to validate the findings. Another limitation is the potential bias in the selection of cases. Although the cases were randomly selected, there may still be inherent biases in the dataset that could influence the results. Ensuring a truly representative sample of ABPM records, with balanced representation of different conditions, is crucial for accurately assessing the AI's performance. The potential general limitations of using AI in healthcare, such as data quality, generalizability, model transparency, complexity, resource constraints, and regulatory challenges, are also applicable in this setting. AI models can also have inherent bias based on training datasets, and continuous validation is resource-intensive. Automation bias and overreliance on AI are additional concerns that must be addressed.16,19

Future research should focus on expanding the dataset to include a more diverse patient population and a greater variety of clinical scenarios. This would help in training the AI model more comprehensively, potentially improving its accuracy in detecting conditions like tachycardia and bradycardia. Multi-center studies involving larger sample sizes and varied patient demographics are necessary to validate the findings and ensure the generalizability of the results. Additionally, studies should explore the practical implementation of AI in clinical workflows, assessing its impact on diagnostic accuracy, efficiency, and patient outcomes. Understanding how AI can be seamlessly integrated into existing healthcare processes is crucial for maximizing its benefits. Future research should also investigate the potential of AI to assist in other aspects of patient care, such as treatment planning and monitoring.20,21 AI-based technologies could leverage clinical BP records and apply methods like deep learning to enhance the accuracy of BP measurements and provide individualized cardiovascular risk assessments. 22

5. Conclusion

In summary, this study provides a comprehensive evaluation of ChatGPT 4.0's performance in interpreting 24-h ABPM data, comparing its accuracy with expert clinical interpretations. The findings suggest that AI has potential to assist in the interpretation of complex clinical data, with high accuracy in several key areas. The high accuracy rates for hypertension and hypotension indicate that AI can aid in diagnosing these common conditions, potentially improving patient outcomes through early diagnosis and intervention. However, the lower accuracy in detecting tachycardia and bradycardia highlights areas where further development is needed. The integration of AI in interpreting ABPM data could revolutionize hypertension management and improve patient outcomes. By comparing the performance of AI models against expert interpretations, this study contributes to the growing body of evidence supporting the use of AI in clinical settings. As AI technology continues to evolve, its role in healthcare is expected to expand, offering new opportunities for enhancing diagnostic accuracy and efficiency. However, continued research and development are necessary to address the limitations identified in this study and ensure the reliable performance of AI models across a wide range of clinical scenarios.

Acknowledgements

None

Contributorship: Author Contributions: Conceptualization, S.F.A., W.C.; Data curation, J.H.P.; Formal analysis, J.H.P.; Funding acquisition, J.H.P., W.C.; Investigation, J.H.P.; Methodology, J.H.P., J.M., I.M.C.; Project administration, C.T., S.F.A.; Resources, J.H.P.; Supervision, C.T., J.M., I.M.C.; Validation, W.C.; Visualization, J.H.P., W.C.; Writing – original draft, J.H.P., W.C.; Writing – review & editing, C.T., S.F.A., J.M., I.M.C., M.S.S., O.A.G.V., G.L.S., M.L.G.S. All authors have read and agreed to the published version of the manuscript.

Disclosure: The authors have nothing to disclose

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement: The data underlying this article will be shared on reasonable request to the corresponding author.

Disclosure: All the authors declared no competing interest.

Ethics approval: The study was conducted at Mayo Clinic, Minnesota. The Mayo Clinic Institutional Review Board approved this study (IRB number 24-002829) and exempted the need for informed consent because this was a minimal risk study solely involving chart review.

Funding: The authors received no financial support for the research, authorship, and/or publication of this article.

Funding: None

Guarantor: C.T.

Language model use: The use of ChatGPT in this study was strictly limited to the response-generating protocol described in the methods section. ChatGPT was not used for data analysis, writing, or any other aspects of the production of this manuscript.

References

  • 1.WHO. WHOWWHO. Hypertension: https://www.who.int/news-room/fact-sheets/detail/hypertension March 16 2023
  • 2.Harianto H, Valente M, Hoetomo Set al. et al. The clinical utility of ambulatory blood pressure monitoring (ABPM): a review. Curr Hypertens Rev 2014; 10: 189–204. [DOI] [PubMed] [Google Scholar]
  • 3.Mizuno H, Choi E, Kario K, et al. Diagnostic accuracy of office blood pressure measurement and home blood pressure monitoring for hypertension screening among adults: results from the IDH study. J Am Heart Assoc 2023; 12: e030150. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Pham JH, Thongprayoon C, Suppadungsuk S, et al. Digital health tools in nephrology: a comparative analysis of AI and professional opinions via online polls. Digital Health 2024; 10: 20552076241277458. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Sheikh MS, Barreto EF, Miao J, et al. Evaluating ChatGPT's efficacy in assessing the safety of non-prescription medications and supplements in patients with kidney disease. Digital Health 2024; 10: 20552076241248082. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Alsharqi M, Lapidaire W, Iturria-Medina Y, et al. A machine learning-based score for precise echocardiographic assessment of cardiac remodelling in hypertensive young adults. Eur Heart J Imaging Methods Pract 2023; 1: qyad029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Cai A, Zhu Y, Clarkson SAet al. et al. The use of machine learning for the care of hypertension and heart failure. JACC: Asia 2021; 1: 162–172. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Miao J, Thongprayoon C, Garcia Valencia OA, et al. Performance of ChatGPT on nephrology test questions. Clin J Am Soc Nephrol 2024; 19: 35–43. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Miao J, Thongprayoon C, Fülöp Tet al. et al. Enhancing clinical decision-making: optimizing ChatGPT's performance in hypertension care. J Clin Hypertens 2024; 26: 588–593. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Ho YS, Fülöp T, Krisanapan P, et al. Artificial intelligence and machine learning trends in kidney care. Am J Med Sci 2024; 367: 281–295. [DOI] [PubMed] [Google Scholar]
  • 11.Arshad HB, Butt SA, Khan SU, et al. ChatGPT and artificial intelligence in hospital level research: potential, precautions, and prospects. Methodist Debakey Cardiovasc J 2023; 19: 77–84. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Padmanabhan S, Tran TQB, Dominiczak AF. Artificial intelligence in hypertension. Circ Res 2021; 128: 1100–1118. [DOI] [PubMed] [Google Scholar]
  • 13.Xu Y, Barnes VA, Harris RA, et al. Sleep variability, sleep irregularity, and nighttime blood pressure dipping. Hypertension 2023; 80: 2621–2626. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Alam SF, Gonzalez Suarez ML. Transforming healthcare: the AI revolution in the comprehensive care of hypertension. Clin Pract 2024; 14: 1357–1374. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Wu L, Huang L, Li M, et al. Differential diagnosis of secondary hypertension based on deep learning. Artif Intell Med 2023; 141: 102554. [DOI] [PubMed] [Google Scholar]
  • 16.Chomutare T, Tejedor M, Svenning TO, et al. Artificial intelligence implementation in healthcare: a theory-based scoping review of barriers and facilitators. Int J Environ Res Public Health 2022; 19: 16359. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Myllyaho L, Raatikainen M, Männistö T, et al. Systematic literature review of validation methods for AI systems. J Syst Softw 2021; 181: 111050. [Google Scholar]
  • 18.Visco V, Izzo C, Mancusi C, et al. Artificial intelligence in hypertension management: an ace up your sleeve. J Cardiovasc Dev Dis 2023; 10: 74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Ahmed MI, Spooner B, Isherwood J, et al. A systematic review of the barriers to the implementation of artificial intelligence in healthcare. Cureus 2023; 15: e46454. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Wang T, Yan Y, Xiang S, et al. A comparative study of antihypertensive drugs prediction models for the elderly based on machine learning algorithms. Front Cardiovasc Med 2022; 9: 1056263. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Koren G, Nordon G, Radinsky Ket al. et al. Machine learning of big data in gaining insight into successful treatment of hypertension. Pharmacol Res Perspect 2018; 6: e00396. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Mousavi SS, Reyna MA, Clifford GDet al. et al. A survey on blood Pressure measurement technologies: addressing potential sources of bias. Sensors (Basel) 2024; 24: 1730. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Digital Health are provided here courtesy of SAGE Publications

RESOURCES