Towards improving risk prediction through clustering analysis

Ricardo Henao; Michael Pencina

doi:10.1093/eurjpc/zwaf164

Author manuscript; available in PMC: 2025 Sep 5.

Published before final editing as: Eur J Prev Cardiol. 2025 Mar 20:zwaf164. doi: 10.1093/eurjpc/zwaf164

PMCID: PMC12409752 NIHMSID: NIHMS2108298 PMID: 40108943

The ability to accurately predict an individual’s risk of developing a disease or experiencing a specific health outcome is paramount in modern medicine. Effective risk prediction models empower health professionals to personalize treatment strategies, optimize preventative resource allocation, and ultimately improve patient outcomes. However, the pursuit of reliable risk prediction is fraught with challenges, which require a dedicated investment in the development of new methodologies and a critical examination of existing paradigms in view of the changing landscape in healthcare instigated by the staggering growth in the collection, aggregation, and availability of granular patient data.

Several factors contribute to the difficulty in constructing accurate risk prediction models. Firstly, limited access to relevant covariates poses a significant hurdle. Comprehensive patient data, including longitudinal, environmental, and lifestyle factors, are often unavailable or incomplete. This necessitates reliance on readily accessible but potentially less informative variables (predictors), thus compromising the model’s predictive power. Secondly, traditional models are often static (or cross-sectional), providing a snapshot of risk at a single point in time. Consequently, they fail to account for the dynamic nature of health, where risk profiles evolve over time due to various internal and external factors. Incorporating longitudinal data that reflect a patient’s health history is crucial for capturing these temporal changes and enhancing predictive accuracy. Thirdly, many widely used models aim to predict outcomes over long horizons, such as 10 years. This requires extrapolating trends and making assumptions about future exposures and behaviours, which introduces considerable uncertainty and makes accurate prediction of events far into the future a formidable challenge.

Nevertheless, the need for accurate risk prediction remains undeniable. The ability to identify individuals at high risk for specific conditions allows for targeted interventions and preventative measures. This is particularly crucial in the context of treatment effectiveness, where identifying those most likely to benefit can significantly improve outcomes. There are several effective treatment options in the cardiovascular disease space,¹ which can be administered by targeting known modifiable risk factors.² However, identifying the individuals most likely to benefit from treatments (or interventions) remains an open problem, which can be addressed either directly or by proxy. Directly estimating individual treatment effects is highly desirable, but it often requires complex statistical modelling and large data sets, posing significant logistical and computational challenges, especially when working with observational data.³ Proxy approaches, such as using risk scores to identify high-risk individuals, are more common and can be highly effective when coupled with accurate models and clear explanations. However, they require the healthcare professional to connect the individual’s presentation, their risk, and the explanation provided by the model with the expected treatment response in order to make treatment decisions.

Despite the advancements in machine learning (and artificial intelligence), simple regression-based models such as the Cox proportional hazards model continue to play a significant role in risk prediction,⁴ with the main algorithms recommended by major European (SCORE2⁵) and US (PCE⁶ and PREVENT⁷) expert societies relying on this methodology. Their appeal lies in their interpretability, ease of deployment, and generalizability. Clinicians can readily understand the relationships between risk factors and outcomes, facilitating clinical decision-making. Moreover, these models often require fewer computational resources and can be readily implemented in diverse settings. However, simple models suffer from several limitations. They often ignore complex interactions between variables and fail to capture non-linear relationships. They also tend to over-simplify the heterogeneity of patient populations by assuming that risk factors have uniform effects across all individuals.

To address the limitations of simple models, researchers have explored more sophisticated approaches, such as machine learning algorithms. These models, including random forests, boosting algorithms, and neural networks, can capture complex interactions and non-linear relationships, potentially leading to improved predictive accuracy. However, the increased complexity of these models comes at a cost. They often require larger data sets to prevent over-fitting and may be less interpretable than simple models. This lack of transparency can hinder clinical adoption, as clinicians may be reluctant to rely on models they do not fully understand. Moreover, when these more complex models are applied to simpler, structured types of data, they have often not demonstrated improvements in performance on validation samples. For instance, Hong et al.⁸ showed that machine learning models developed de novo on data from the Framingham Offspring, Atherosclerosis Risk in Communities, and Multi-Ethnic Study of Atherosclerosis studies did not improve model performance over that of the existing Framingham Stroke, REGARDS self-report, and pooled cohort equations (PCE).

In this context, cluster analysis presents an intriguing alternative to risk prediction models. Originally designed in the cross-sectional context, it was meant to determine which variables ‘cluster together’ among the population under study to potentially identify new phenotypes of existing disease. This intent was then connected to risk prediction by estimating the risk of onset of future adverse outcomes or disease manifestations. If successful, this would be attractive: in cardiovascular prevention, it could point to cardio-metabolic phenotypes with different risk characteristics for future CVD events. In recent years, the increasing availability of data needed for better risk prediction has been precipitated by the maturity and widespread deployment of electronic health record systems, better and safer approaches for sharing data for the purpose of building these models (protected computing environments and federated learning), and ambitious initiatives to make data at scale more openly available (UK Biobank, All of Us, etc.). As a result, sample size is becoming less of a constraint. This is one of the key reasons the focus has started to shift towards model generalization to subpopulations of interest. From this perspective, cluster analysis is an interesting proposition because it allows one to identify more homogeneous subpopulations within a larger, more diverse population, over which tailored risk prediction models can be developed and applied. This approach, which in principle could address some of the most important limitations of risk prediction models highlighted above, has been extended from the original two-step approach (clustering, then risk prediction) to integrated approaches, where a single model estimates both the subpopulation (cluster) to which the individual belongs and the risk of the outcome of interest given the characteristics of that subpopulation. Using the latter clustering approach, Chapfuwa et al.⁹ achieved strong performance based on model discrimination and calibration in cardiovascular risk estimates on the Framingham Offspring study relative to standard and machine learning risk prediction models.
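The original two-step approach mentioned above (clustering, then risk prediction) can be illustrated with a minimal sketch. It is not the integrated method of Chapfuwa et al., nor the algorithm used in any study cited here. It assumes plain k-means for step 1 and the observed event proportion per cluster as a crude risk estimate for step 2. The function names, the two standardized features, and all data values are hypothetical and chosen only for illustration.

```python
def kmeans(points, k, iters=20):
    """Step 1: plain k-means on a list of equal-length numeric tuples."""
    # Simplification for reproducibility: the first k points are the initial
    # centers. Real analyses typically use several random restarts.
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment: each point goes to the nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update: each center moves to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_event_rates(labels, events, k):
    """Step 2: observed event proportion within each cluster."""
    rates = []
    for c in range(k):
        outcomes = [e for lab, e in zip(labels, events) if lab == c]
        rates.append(sum(outcomes) / len(outcomes) if outcomes else float("nan"))
    return rates


# Hypothetical standardized risk factors for ten individuals (two features each)
# and a binary indicator of whether an event occurred during follow-up.
points = [
    (0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2), (-0.2, -0.2),
    (3.0, 3.1), (2.9, 2.8), (3.2, 3.0), (3.1, 3.3), (2.8, 2.9),
]
events = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]

labels = kmeans(points, k=2)
rates = cluster_event_rates(labels, events, k=2)
for cluster, rate in enumerate(rates):
    print(f"cluster {cluster}: observed event rate {rate:.2f}")
```

In practice the second step would usually fit a separate risk model within each cluster rather than report a raw proportion. The sketch only shows how cluster membership, not an individual risk score, becomes the unit of risk stratification.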

In this issue of the journal, Yacamán Méndez et al.¹⁰ contrast the model performance of a clustering algorithm with that of guideline-recommended risk prediction models (SCORE2, PCE, and PREVENT) using data from the Stockholm Diabetes Prevention Programme. They conclude that cluster analysis performs ‘comparably’ with existing models, with better sensitivity (more events detected) but lower specificity (more false positives). These main observations are based on comparing the high-risk cluster from cluster analysis with a high-risk group defined, for each guideline-recommended prediction model, by exceeding a pre-defined, model-specific threshold. This resulted in unequal high-risk groups: larger for the cluster analysis and smaller for the prediction models. Moreover, cluster analysis was performed on the target data, while the risk prediction models were applied to what can be considered an external sample. These methodological differences might explain some of the apparent differences in performance.
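The effect of unequal high-risk groups can be seen in a small worked example. The sketch below uses entirely hypothetical scores and outcomes, not data from the study discussed above. It applies one and the same risk ranking twice, flagging first a small and then a large high-risk group, and computes sensitivity and specificity for each. The helper names `flag_top_n` and `sensitivity_specificity` are introduced here for illustration only.

```python
def sensitivity_specificity(flagged, events):
    """Return (sensitivity, specificity) for a binary high-risk flag."""
    tp = sum(1 for f, e in zip(flagged, events) if f and e)
    fn = sum(1 for f, e in zip(flagged, events) if not f and e)
    tn = sum(1 for f, e in zip(flagged, events) if not f and not e)
    fp = sum(1 for f, e in zip(flagged, events) if f and not e)
    return tp / (tp + fn), tn / (tn + fp)


def flag_top_n(scores, n):
    """Flag the n highest-scoring individuals as high risk."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:n])
    return [i in chosen for i in range(len(scores))]


# Hypothetical data: one shared risk score for ten people, three of whom have an event.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
events = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

sens_small, spec_small = sensitivity_specificity(flag_top_n(scores, 2), events)
sens_large, spec_large = sensitivity_specificity(flag_top_n(scores, 6), events)

print(f"small high-risk group: sensitivity {sens_small:.2f}, specificity {spec_small:.2f}")
print(f"large high-risk group: sensitivity {sens_large:.2f}, specificity {spec_large:.2f}")
```

Because the ranking is identical in both cases, any gain in sensitivity and loss in specificity here comes only from the size of the flagged group. A fairer comparison between methods would therefore match the size of the high-risk groups or compare threshold-free measures of discrimination.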

Overall, the observed similar performance of cluster analysis and risk prediction models is consistent with what we would expect. Risk prediction models are optimized for the best risk ranking, while cluster analysis adds the aim of creating interpretable subpopulations of individuals. It is this latter feature, i.e. the creation of interpretable clusters, rather than improvements in risk prediction, that should motivate researchers to employ cluster analysis. If successful, this could be a viable option to create new risk prediction frameworks that bring us closer to fulfilling the goal of precision medicine, namely that in which interventions are tailored to the individual by accounting for the unique combination of characteristics that contribute to their risk and, importantly, whether interventions are likely to modify such risk.

Funding

This work was supported by grant 1R61-NS120246-02 from the National Institute of Neurological Disorders and Stroke (NINDS).

Footnotes

The opinions expressed in this article are not necessarily those of the Editors of the European Journal of Preventive Cardiology or of the European Society of Cardiology.

Conflict of interest: M. P. Personal consulting for Eli Lilly; personal research relationship with McGill University Health Centre; past consulting for Cleerly Inc.; and board member for Coalition for Health AI. R. H. has nothing to disclose.

References

1. Arnett DK, Blumenthal RS, Albert MA, Buroker AB, Goldberger ZD, Hahn EJ, et al. 2019 ACC/AHA guideline on the primary prevention of cardiovascular disease: a report of the American College of Cardiology/American Heart Association task force on clinical practice guidelines. J Am Coll Cardiol 2019;74:e177–e232.
2. Khera AV, Emdin CA, Drake I, Natarajan P, Bick AG, Cook NR, et al. Genetic risk, adherence to a healthy lifestyle, and coronary disease. N Engl J Med 2016;375:2349–2358.
3. D’Agostino RB. Estimating treatment effects using observational data. JAMA 2007;297:314–316.
4. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol 2019;110:12–22.
5. SCORE2 Working Group and ESC Cardiovascular Risk Collaboration. SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. Eur Heart J 2021;42:2439–2454.
6. Goff DC Jr, Lloyd-Jones DM, Bennett G, Coady S, D’Agostino RB, Gibbons R, et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association task force on practice guidelines. Circulation 2014;129:S49–S73.
7. Khan SS, Matsushita K, Sang Y, Ballew SH, Grams ME, Surapaneni A, et al. Development and validation of the American Heart Association’s PREVENT equations. Circulation 2024;149:430–449.
8. Hong C, Pencina MJ, Wojdyla DM, Hall JL, Judd SE, Cary M, et al. Predictive accuracy of stroke risk prediction models across black and white race, sex, and age groups. JAMA 2023;329:306–317.
9. Chapfuwa P, Li C, Mehta N, Carin L, Henao R. Survival cluster analysis. In Proceedings of the ACM Conference on Health, Inference, and Learning, 2020 Apr 2, Toronto, Ontario, Canada. 60–68. Association for Computing Machinery, New York, NY, USA.
10. Yacamán Méndez D, Zhou M, Brynedal B, Gudjonsdottir H, Tynelius P, Lagerros YT, et al. Risk stratification for cardiovascular disease: a comparative analysis of cluster analysis and traditional prediction models. Eur J Prev Cardiol 2025.
