Abstract
Background:
Although statistical models have been commonly used to identify patients at risk of cardiovascular disease (CVD) for preventive therapy, these models tend to over-recommend therapy. Moreover, in populations with pre-existing diseases, the current approach is to indiscriminately treat all, as modelling in this context is currently inadequate.
Methods:
We developed and validated the Transformer-based Risk assessment survival model (TRisk), a novel deep learning model, for predicting 10-year risk of CVD in both the primary prevention population and individuals with diabetes. An open cohort of 3 million adults aged 25 to 84 years was identified using linked electronic health records from general practices in England; records from 291 practices were used for model development and records from 98 practices for validation. TRisk was compared against QRISK3 and a deep learning derivation of QRISK3. Additional analyses compared discriminatory performance in other age groups, by sex, and across categories of socioeconomic status.
Findings:
TRisk demonstrated superior discrimination (C-index in the primary prevention population: 0.910; 95% confidence interval [CI]: 0.906 to 0.913). TRisk’s performance was less sensitive to population age range than that of the benchmark models, and TRisk also outperformed other models in analyses stratified by age, sex, or socioeconomic status. All models were well calibrated overall. In decision curve analyses, TRisk demonstrated greater net benefit than benchmark models across the range of relevant thresholds. At the widely recommended 10% risk threshold and the higher 15% threshold, TRisk reduced both the total number of patients classified at high risk (by 21% and 35% respectively) and the number of false negatives as compared with currently recommended strategies. TRisk similarly outperformed other models in patients with diabetes. Compared with the widely recommended treat-all policy approach for patients with diabetes, TRisk at a 10% risk threshold would lead to deselection of 24% of individuals with a small fraction of false negatives (0.2% of cohort).
Interpretation:
TRisk enabled a more targeted selection of individuals at risk of CVD in both the primary prevention population and cohorts with diabetes, compared to benchmark approaches. Incorporation of TRisk into routine care could potentially reduce treatment-eligible patients by approximately one-third while preventing at least as many events as with currently adopted approaches.
Funding:
None.
Introduction
Blood pressure (BP) and LDL-cholesterol lowering are well-established strategies for preventing cardiovascular diseases (CVD). Randomised trials have shown that relying on single risk factors is inadequate for selecting individuals for treatment, as the benefits of antihypertensives and statins remain consistent regardless of baseline BP, LDL-cholesterol, age, sex, or pre-existing vascular disease.1–4 Given this, predicting CVD risk as a function of multiple factors has become crucial for identifying those who will benefit most from preventive treatment.5 Consequently, multivariable CVD risk modelling now routinely guides preventive therapy decisions. This approach is strongly supported by trial evidence of greater risk reductions for those with higher predicted CVD risk.2,6
While methods for identifying high-risk individuals vary globally, guidelines generally utilise risk scores for preventive therapy recommendation. The UK, mainland Europe, and USA generally use QRISK2, SCORE2 (and SCORE-OP), and ASCVD models respectively, typically recommending treatment for those with predicted risk above specific risk thresholds.7 Guidelines also classify certain groups (e.g., those with prior CVD, diabetes, or chronic renal disease) as automatically high-risk.7 Though more efficient than simpler criteria like age or BP thresholds, current approaches remain crude. For instance, QRISK2-driven recommendation, although cost-effective in large part due to low drug costs, allocates approximately one-third of UK adults aged 30–79 years as treatment-eligible,8 yet most of these individuals would not experience an event even without treatment.9 Similar issues affect patients with pre-existing conditions like diabetes, where blanket treatment recommendations ignore substantial variability in individual risk.5,7 Indeed, current policies mandating BP and LDL-cholesterol lowering treatment for all diabetic patients fail to account for this heterogeneity in risk.7
Emerging as a potentially promising solution, the Transformer, a deep learning (DL) architecture originally designed for natural language processing, has become central to artificial intelligence (AI) research.10 Unlike traditional statistical approaches and machine learning methods (e.g., random forest) that often rely on expert-driven feature engineering, Transformers can process complex sequential patterns in minimally processed data and automatically extract relevant features. By efficiently analysing multimodal sequential information, Transformers have shown promise in computer vision and more recently clinical risk prediction.11 Given their potential, it is crucial to fully evaluate the utility of Transformers for clinical decision-making. To this end, we conducted a study to develop and validate the Transformer-based risk assessment model (TRisk) for 10-year prediction of CVD risk.
Methods
Study framework
Our model, TRisk, was an adaptation of the Transformer-based DL model, bidirectional electronic health records Transformer (BEHRT).11 BEHRT has demonstrated promising performance for prediction of many diseases including CVDs.11,12 In this study, we extended and modified BEHRT from a binary outcome prediction model into a survival model to additionally account for censoring and refer to it as TRisk.13 We compared TRisk with two established expert-guided models: the QRISK3 model14 and a locally derived non-linear DL derivation of Cox proportional hazards (CPH) modelling called DeepSurv, a multi-layer perceptron (MLP) neural network model.15 All models were validated in a primary prevention population cohort (i.e., patients without CVD at baseline). The TRIPOD+AI statement was used for study reporting.16
Data source and validation strategy
We used Clinical Practice Research Datalink (CPRD) as the study data source.17 CPRD includes detailed patient records such as demographics (age, sex, ethnicity), diagnoses, prescribed treatments, tests, and health-related lifestyle factors collected from participating general practices (GPs) across the UK, covering around 7% of the UK population. With linkage to Hospital Episode Statistics (HES) and Office for National Statistics for data about hospital admission and death registration, respectively, CPRD is one of the most comprehensive deidentified EHR datasets broadly representative of the UK population.17
We included data from 389 contributing GPs that met acceptable standards of research quality. Prior to cohort selection, as our validation strategy, we randomly allocated three quarters (i.e., 291) of practices to the derivation dataset and the rest (i.e., 98) of the practices to the validation dataset (details in Supplementary Methods: Clarification on validation study), ensuring evaluation on distinct practices rather than individual patients.14
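The practice-level allocation described above can be illustrated with a minimal Python sketch. The practice identifiers, the seed, and the function name `split_practices` are hypothetical; the source does not describe how the randomisation was implemented, only that 291 of 389 practices went to derivation and 98 to validation.

```python
import random


def split_practices(practice_ids, n_derivation, seed=0):
    """Randomly allocate whole practices to derivation or validation.

    Splitting by practice (not by patient) means every patient from a
    given practice lands in exactly one of the two datasets.
    """
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = sorted(practice_ids)    # sort first so input order is irrelevant
    rng.shuffle(shuffled)
    derivation = set(shuffled[:n_derivation])
    validation = set(shuffled[n_derivation:])
    return derivation, validation


# Hypothetical identifiers standing in for the 389 contributing practices.
practice_ids = [f"practice_{i:03d}" for i in range(389)]
derivation, validation = split_practices(practice_ids, n_derivation=291)
print(len(derivation), len(validation))
```

Patients would then inherit the dataset of their registered practice, so that validation is performed on practices the model has never seen.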
Cohort selection
We identified an open cohort for our analysis consisting of individuals without prior CVD, in whom treatment is recommended when their predicted risk of CVD is above a certain threshold.5,7 Prior CVD was defined as a composite of coronary heart disease, ischaemic stroke, or transient ischaemic attack captured in both primary and secondary care diagnostic records. We adopted the validated CALIBER repository for CVD phenotyping.18 We included individuals who had records between 1 January 1998 and 31 December 2015, were aged between 25 and 84 years, and had been registered with a GP for at least one year. The index date (baseline) was randomly selected from the patient medical history for each individual. This method for index date selection was implemented to ensure the model was trained and evaluated on a more representative spread of age and calendar year at baseline. By adopting this approach, we simulate the calculation of a patient’s risk score at any point during the eligible period, as opposed to fixing the index date as the first date in the eligible period, closely mirroring its real-world application in clinical settings.19
We aimed to adhere closely to the criteria for selection of the cohort as reported for the QRISK3 study.14 Patients with a prior history of CVD (as defined earlier), prescription of a statin, or no available records at the index date were excluded from the datasets. Additional exclusions included those without IMD information or HES-linkage. Patients were censored at the earliest of the last date in practice, last collection from practice (i.e., most recent date on which data was collected from a GP practice), death, incident CVD, 10 years after the index date (i.e., truncation after 120-month mark), or the study end date (31 December 2015).
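As an illustration of the censoring rule above, the following Python sketch derives the end of follow-up and the event indicator for one patient. The function and argument names are hypothetical, and the handling of ties (an incident CVD event recorded on the same day as another end point counts as an event) is an assumption, not something stated in the source.

```python
from datetime import date

STUDY_END = date(2015, 12, 31)  # study end date stated in the text


def add_years(d, years):
    """Shift a date by whole years, mapping 29 February to 28 February."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def follow_up(index_date, last_in_practice, last_collection, death=None, cvd=None):
    """Return (end_date, event) for one patient.

    Follow-up ends at the earliest of: last date in practice, last data
    collection from the practice, death, incident CVD, 10 years after the
    index date, or the study end date. `event` is True only when incident
    CVD is what ends follow-up.
    """
    candidates = [last_in_practice, last_collection,
                  add_years(index_date, 10), STUDY_END]
    if death is not None:
        candidates.append(death)
    administrative_end = min(candidates)
    if cvd is not None and cvd <= administrative_end:  # assumed tie rule
        return cvd, True
    return administrative_end, False


with_event = follow_up(date(2008, 3, 1), date(2014, 6, 30), date(2015, 10, 1),
                       cvd=date(2011, 5, 17))
censored = follow_up(date(2008, 3, 1), date(2014, 6, 30), date(2015, 10, 1))
print(with_event, censored)
```

The first patient contributes an observed event; the second is censored on leaving the practice, before either the 10-year mark or the study end date.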
Outcome definition
The outcome of interest was the 10-year risk of major CVD events. The CVD outcome phenotype was similarly adopted from the validated CALIBER repository as a composite of coronary heart disease, ischaemic stroke, or transient ischaemic attack captured in both primary and secondary care diagnostic records.18 In line with TRIPOD+AI guidelines, the outcome was assessed blind to pre-baseline information to prevent potential bias.
Model derivation
TRisk incorporated all recorded information from the following modalities in both primary and secondary care (HES) records: 3,858 distinct diagnoses, 390 categories of medications, 1,439 laboratory tests, and 679 procedure codes. As a Transformer model, TRisk considers a patient’s medical history up to baseline as a sequence which is typically of variable length. Each unit of information from the captured modalities is mapped to the patient’s age and the health service encounter, thus providing rich temporal annotations to the sequence of records (Figure S1). The information captured is as recorded in the structured EHR without any imputation of missing values. Free-text data and demographic data such as sex and socioeconomic status were not incorporated in TRisk. As a data-driven model, TRisk was trained on raw or minimally processed EHR (Supplementary Methods: EHR pre-processing and modelling of TRisk; Table S1).
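To make the sequence representation concrete, the sketch below turns a small set of pre-baseline records into three parallel sequences: the recorded code, the patient's age at recording, and an encounter index. This is an illustrative assumption about the input layout, not the authors' pre-processing pipeline (which is described in the Supplementary Methods); the `Record` class, the code strings, and the approximate age calculation are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    code: str       # hypothetical code for a diagnosis, medication, test, or procedure
    recorded: date  # date of the health service encounter


def encode_history(records, birth_date, baseline):
    """Build parallel code / age / encounter sequences from pre-baseline records.

    Records dated after baseline are dropped; records sharing a date are
    treated as one encounter (an assumption made for this sketch).
    """
    history = sorted((r for r in records if r.recorded <= baseline),
                     key=lambda r: r.recorded)
    codes, ages, encounters = [], [], []
    encounter_index, previous_date = -1, None
    for r in history:
        if r.recorded != previous_date:
            encounter_index += 1
            previous_date = r.recorded
        codes.append(r.code)
        ages.append((r.recorded - birth_date).days // 365)  # approximate age in years
        encounters.append(encounter_index)
    return codes, ages, encounters


records = [
    Record("DIAG:hypertension", date(2005, 3, 14)),
    Record("MED:ace_inhibitor", date(2005, 3, 14)),
    Record("TEST:hba1c", date(2009, 11, 2)),
    Record("DIAG:asthma", date(2016, 1, 5)),  # after baseline, so excluded
]
codes, ages, encounters = encode_history(records, date(1960, 6, 1), date(2010, 1, 1))
print(codes, ages, encounters)
```

The resulting sequences vary in length from patient to patient, which is why a sequence model is a natural fit here, in contrast to the fixed-length predictor vectors used by the benchmark models.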
Benchmark modelling
We implemented sex-specific QRISK3 equations using source code from ClinRisk Limited and derived a non-linear multi-layer perceptron DL version of QRISK3 (DeepSurv) on our primary prevention cohort.14,15 Predictor extraction and transformation for QRISK3 followed previously published specifications. Unlike TRisk, which captures variable-length longitudinal data, DeepSurv uses the same fixed-length cross-sectional variables as QRISK3 (Table S2). Missing data was imputed separately for derivation and validation datasets; details on predictor selection, imputation, and implementation can be found in Supplementary Methods: Predictor selection for benchmark models and Implementation details.
Performance analysis
We evaluated models on the validation dataset using complementary approaches: Discrimination was assessed using concordance index (C-index) and area under the precision-recall curve (AUPRC). Calibration was evaluated graphically using smoothed calibration curves with restricted cubic splines with three knots.20 Individual-level prediction consistency between models was examined using scatterplots to elucidate model agreement. Decision curve analysis (accounting for censoring) assessed the trade-off between true positives and false positives across risk thresholds.21 We analysed clinical impact by calculating the number of patients considered high risk, true positives, and false negatives at illustrative thresholds. While the selection of an optimal threshold for decision-making cannot be reduced to performance metrics only, a superior strategy would maximise capture of those who are in need of preventative therapy (i.e., minimising false negatives) whilst minimising false alarms. For all analyses, patient-level risk was derived from estimated survival functions, focusing on the 10th-year risk estimates.14 While some models (e.g., QRISK3) are validated separately by sex, we have presented aggregate results for the full validation dataset.
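For readers unfamiliar with the C-index under censoring, the following is a minimal pure-Python sketch of Harrell's estimator. The source does not state which C-index estimator was used, so this is one common choice rather than the study's implementation; the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when subject i has an observed event and a
    shorter follow-up time than subject j. The pair is concordant when the
    subject with the earlier event also has the higher predicted risk; ties
    in predicted risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: follow-up in years, event indicator, predicted 10-year risk.
times = [2, 4, 6, 8]
events = [1, 0, 1, 0]
risks = [0.9, 0.3, 0.2, 0.4]
c_index = concordance_index(times, events, risks)
print(c_index)  # 3 of 4 comparable pairs are concordant -> 0.75
```

This quadratic-time version is only meant to show the definition; at the scale of the validation dataset, an optimised library implementation would be needed.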
In accordance with the prior QRISK3 study, our main analyses were restricted to patients aged 25 to 84 years.14 However, in additional analyses, we assessed model discrimination in terms of C-index in the age range of 40 to 69 years and 40 to 84 years. Model discrimination was further assessed separately in men and women, and at different levels of socioeconomic deprivation denoted by quintiles of IMD.
To provide additional comparative analysis, we derived a sex-agnostic CPH model using SCORE2 predictors.22 Our aim was to assess the SCORE2 predictors performance in our specific context. Detailed methods for predictor extraction, imputation, and model implementation are provided in Supplementary Methods: Predictor selection for benchmark models and Implementation details.
We repeated all aforementioned analyses on patients with a diagnosis of diabetes prior to or on the index date (i.e., diabetes cohort) to assess the usefulness of TRisk in a ‘high-risk’ population. By utilising a transfer learning approach, the proposed TRisk model that was initially trained on the primary prevention population cohort was transferred, fine-tuned, and validated in patients with diabetes at baseline (Supplementary Methods: Modelling in diabetes cohort).23
Study approval was given by the CPRD Independent Scientific Advisory Committee of UK (protocol number: 16_049R).
Role of the funding source
The funders of the study had no role in the study design, data collection, data analysis, data interpretation, writing of the report, or the decision to submit the report for this research. Drs Rao and Li had access to the data; Drs Rahimi and Rao were responsible for the decision to submit the manuscript.
Results
A total of 2.97 million patients were included in the primary prevention population cohort (747,076 patients in validation), with a median follow-up of 2.5 (interquartile range [IQR]: [0.8, 5.9]) years and approximately 9% of patients having 10 or more years of follow-up. In total, 4.6% of patients were diagnosed with CVD during follow-up (4.7% in derivation; 4.5% in validation) (Table 1; population selection flowchart in Figure S2). As expected, there was a noticeable variation in the distribution of regions between the derivation and validation datasets in the cohort.
Table 1.
Population characteristics for derivation and validation datasets of primary prevention population cohort.
| | Derivation (n=2,224,701) | Validation (n=747,076) |
|---|---|---|
| CVD cases (%) | 104,058 (4.68) | 33,248 (4.45) |
| Women (%) | 1,202,114 (54.03) | 402,962 (53.94) |
| Mean age in years (SD) | 48 (16) | 47 (16) |
| Ethnicity | ||
| Unknown (%) | 1,481,771 (66.61) | 498,763 (66.76) |
| White (%) | 691,797 (31.10) | 231,088 (30.93) |
| Other Asian (%) | 5,399 (0.24) | 1,804 (0.24) |
| Pakistani (%) | 5,673 (0.26) | 1,358 (0.18) |
| Indian (%) | 9,707 (0.44) | 2,528 (0.34) |
| Other (%) | 9,324 (0.42) | 3,492 (0.47) |
| Caribbean (%) | 4,187 (0.19) | 2,154 (0.29) |
| Mixed (%) | 3,615 (0.16) | 1,396 (0.19) |
| Bangladeshi (%) | 1,707 (0.08) | 449 (0.06) |
| Chinese (%) | 2,352 (0.11) | 805 (0.11) |
| Other Black (%) | 2,715 (0.12) | 1,021 (0.14) |
| Black African (%) | 6,454 (0.29) | 2,218 (0.30) |
| Index of multiple deprivation (IMD) | ||
| IMD 1 (%) | 526,997 (23.69) | 168,810 (22.60) |
| IMD 2 (%) | 500,928 (22.52) | 155,571 (20.82) |
| IMD 3 (%) | 454,225 (20.42) | 167,026 (22.36) |
| IMD 4 (%) | 396,469 (17.82) | 147,640 (19.76) |
| IMD 5 (%) | 346,082 (15.56) | 108,029 (14.46) |
| Region | ||
| North East (%) | 31,767 (1.43) | 26,764 (3.58) |
| North West (%) | 331,324 (14.89) | 68,397 (9.16) |
| Yorkshire and the Humber (%) | 78,998 (3.55) | 42,732 (5.72) |
| East Midlands (%) | 60,070 (2.70) | 37,876 (5.07) |
| West Midlands (%) | 247,430 (11.12) | 92,491 (12.38) |
| East of England (%) | 271,342 (12.20) | 84,811 (11.35) |
| South West (%) | 239,212 (10.75) | 123,501 (16.53) |
| South Central (%) | 322,056 (14.48) | 55,236 (7.39) |
| London (%) | 363,896 (16.36) | 115,077 (15.40) |
| South East Coast (%) | 278,606 (12.52) | 100,191 (13.41) |
| Mean systolic blood pressure†, mmHg (SD) | 129.13 (14.04) | 128.77 (14.07) |
| Mean body mass index†, kg/m2 (SD) | 27.00 (3.89) | 26.93 (3.90) |
| Mean high density lipoprotein†, mmol/L (SD) | 1.39 (0.66) | 1.34 (0.70) |
| Mean total cholesterol†, mmol/L (SD) | 4.93 (3.15) | 5.08 (3.96) |
| Smoking status† (number of cigarettes per day) | ||
| Non-smoker (%) | 1,145,267 (51.48) | 388,858 (52.05) |
| Ex-smoker (%) | 580,394 (26.09) | 187,931 (25.16) |
| Light (<10 cigarettes/day) (%) | 145,747 (6.55) | 51,860 (6.94) |
| Moderate (10–20 cigarettes/day) (%) | 211,134 (9.49) | 71,394 (9.56) |
| Heavy (>20 cigarettes/day) (%) | 142,159 (6.39) | 47,033 (6.30) |
| Comorbidities at baseline | ||
| Diabetes (%) | 44,668 (2.01) | 14,518 (1.94) |
| Rheumatoid arthritis (%) | 10,838 (0.49) | 3,602 (0.48) |
| Atrial fibrillation (%) | 26,889 (1.21) | 8,565 (1.15) |
| CKD (%) | 7,351 (0.33) | 2,470 (0.33) |
| Migraine (%) | 78,559 (3.53) | 26,263 (3.52) |
| Lupus erythematosus (%) | 1,755 (0.08) | 557 (0.07) |
| Mental illness (%) | 17,007 (0.76) | 6,223 (0.83) |
| HIV/AIDS (%) | 4,139 (0.19) | 1,484 (0.20) |
| Erectile dysfunction (%) | 44,744 (2.01) | 15,386 (2.06) |
| Medication use at baseline | ||
| Antihypertensives (%) | 99,411 (4.47) | 31,039 (4.15) |
| Antipsychotics (%) | 4,222 (0.19) | 1,374 (0.18) |
| Corticosteroids (%) | 29,362 (1.32) | 9,207 (1.23) |
SD: standard deviation; %: percentage; CVD: cardiovascular disease; IMD: Index of multiple deprivation; CKD: chronic kidney disease
†Missing observations: smoking status (48% missing in general population cohort), systolic blood pressure (39%), standard deviation of systolic blood pressure (64%), body mass index (59%), total cholesterol (73%), and high-density lipoprotein cholesterol (81%).
Model performance analyses on validation data
In terms of discrimination, TRisk demonstrated higher C-index and AUPRC compared to QRISK3 and DeepSurv models (Table 2; Table S3). Assessing calibration, while QRISK3 showed some deviation from perfect calibration, all models generally exhibited acceptable calibration within the relevant threshold range (0–20%) for decision making (Figure 1A; Figure S3). Comparison of the predictive risk distributions (Figure 1B) showed that benchmark models had a narrower distribution than TRisk, which classified a higher fraction of patients into very high and very low risk ranges. Comparing individual-level prediction consistency, with QRISK3 as the reference, benchmark models largely ranked patients consistently with each other. However, there was poor correlation with TRisk, which generally predicted higher probabilities for true positives than benchmark approaches (Figure S4).
Table 2.
Discrimination performance of models in primary prevention population cohort
| Model | Concordance-index (95% CI) |
|---|---|
| QRISK3 | 0.831 (0.826, 0.835) |
| DeepSurv | 0.846 (0.841, 0.850) |
| TRisk | 0.910 (0.906, 0.913) |
CI: confidence interval
Figure 1. Calibration plots and distribution of predicted risk of models.

Smoothed calibration plots (A) and distribution of predicted risk (B) are presented for models implemented on primary prevention population cohort.
In analyses investigating populations in different age bands, the gap in discrimination performance between TRisk and the benchmark models was found to become larger as the age range was narrowed (Table S4). Despite not using sex and IMD as predictors, TRisk outperformed benchmark models in subgroups defined by these variables, with no significant differences between strata (Tables S5, S6).
Decision curve analysis demonstrated that across relevant thresholds for decision making, TRisk provided greater net benefit than benchmark models (Figure 2). In supplementary modelling analysis, we found that our locally derived CPH risk model with SCORE2 predictors performed similarly to other benchmark models across all analyses (Figures S4–S7; Tables S3–S6).
Figure 2. Decision curve analysis for analysed models.

Decision curve analysis (including censored observations) has been conducted for models in primary prevention population cohort. 10% decision threshold used by various clinical guidelines for preventative treatment recommendation in primary prevention population is illustrated with dotted red line. Threshold probability is shown on the x-axis and the net benefit, a function of threshold probability, is shown on the y-axis and is the difference between the proportion of true positives and false positives weighted by odds of the respective decision threshold.
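The net benefit described in the caption can be written out explicitly. The first expression is the standard uncensored definition; the second is one common way of accounting for censoring, using the Kaplan-Meier survival estimate among those classified as high risk. The source does not specify which censoring adjustment was applied, so the second expression is an assumption, and the symbols $p_t$, $\hat{p}$, $N$, and $\hat{S}$ are introduced here for illustration.

```latex
% Uncensored net benefit at decision threshold p_t, with N patients:
\mathrm{NB}(p_t) \;=\; \frac{\mathrm{TP}(p_t)}{N} \;-\; \frac{\mathrm{FP}(p_t)}{N}\cdot\frac{p_t}{1-p_t}

% Censoring-adjusted rates at horizon t = 10 years, where \hat{p} is the
% predicted risk and \hat{S} is the Kaplan-Meier estimate among patients
% with \hat{p} \ge p_t:
\frac{\mathrm{TP}(p_t)}{N} \;=\; \bigl[\,1-\hat{S}(t \mid \hat{p}\ge p_t)\,\bigr]\cdot \Pr(\hat{p}\ge p_t),
\qquad
\frac{\mathrm{FP}(p_t)}{N} \;=\; \hat{S}(t \mid \hat{p}\ge p_t)\cdot \Pr(\hat{p}\ge p_t)
```

At the 10% threshold the weight $p_t/(1-p_t)$ equals $1/9$, meaning one true positive is valued the same as avoiding nine false positives.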
In analysis of the diabetes cohort, 59,186 patients with diabetes (14,518 patients in validation) were identified. Over a median follow-up of 2.3 years (IQR: [0.9, 5.0]), 12.5% suffered a CVD event (12.8% in derivation; 11.7% in validation) (Table S7). Model performance metrics observed in the diabetes cohort were overall concordant with those in the primary prevention population cohort (Figures S8–S11; Tables S8–S11).
The clinical impact analysis of risk assessment approaches at different risk thresholds standardised to a population of 1,000 patients is shown in Table 3 (select strategies in Figure S11; analysis on non-standardised cohorts in Supplementary Results).
Table 3.
Comparison of the clinical impact of models on selected outcomes at different risk thresholds, standardised to 1,000 patients in primary prevention population and diabetes cohorts.
| Cohort | Strategy (model) | Number of people classified as high risk | Number of events in people at high risk | Number of events in people at low risk |
|---|---|---|---|---|
| Primary prevention population | Treat all | 1000 | 45 | 0 |
| | QRISK3 at 10% threshold (recommended) | 272 | 36 | 9 |
| | TRisk at 10% threshold | 216 | 40 | 5 |
| | QRISK3 at 15% threshold | 187 | 29 | 15 |
| | TRisk at 15% threshold | 178 | 37 | 8 |
| | QRISK3 at 20% threshold | 131 | 24 | 21 |
| | TRisk at 20% threshold | 152 | 35 | 10 |
| | Treat none | 0 | 0 | 45 |
| Diabetes | Treat all (recommended) | 1000 | 117 | 0 |
| | QRISK3 at 10% threshold | 866 | 114 | 3 |
| | TRisk at 10% threshold | 757 | 115 | 2 |
| | QRISK3 at 15% threshold | 764 | 110 | 7 |
| | TRisk at 15% threshold | 682 | 112 | 5 |
| | QRISK3 at 20% threshold | 649 | 103 | 14 |
| | TRisk at 20% threshold | 615 | 109 | 8 |
| | Treat none | 0 | 0 | 117 |
In the primary prevention population cohort, QRISK3 at the recommended 10% threshold5 classified 272 per 1,000 individuals as treatment-eligible, with 36 true positive cases (13%) (Table 3; Figure S12A). Of the remaining 728 low-risk patients, 719 (99%) were true negatives. Operating at the 15% threshold would reduce treatment-eligible patients to 187 but increase false negatives from 9 to 15 patients (66% increase).
At the 10% threshold, TRisk achieved improved true positive capture whilst reducing the treatment-eligible population by 56 patients (21% reduction) compared to the status-quo strategy (QRISK3 at 10%). At the 15% threshold, TRisk effectively maintained true positive capture compared to QRISK3 at the 10% threshold while reducing high-risk classifications from 272 to 178 patients (35% reduction) (Table 3; Figure S12B). Thus, TRisk improved upon QRISK3 at both thresholds: delivering better true positive capture at 10% and more selective high-risk classification at 15%.
For the diabetes cohort, TRisk demonstrated improved positive and negative capture compared to conventional approaches (Table 3; Figure S12C). At 10% threshold, TRisk identified 757 patients as high-risk with 115 true positives (15%). Compared to the traditional approach of treating all diabetes patients, this represents 243 fewer treatment recommendations (24% reduction) with minimal false negatives (0.2% of cohort).
Recently, revised UK guidelines have recommended QRISK3 for treatment recommendation in diabetes patients (Table 3; Figure S12D).5 At the 10% threshold, QRISK3 selected 866 patients for treatment with 114 true positives (13%), with natural degradation in true positive capture at higher thresholds. In comparison, TRisk recommended 109 fewer patients (13% reduction) while identifying one additional true positive case (Table 3; Figure S12E).
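The percentages quoted in the preceding paragraphs follow directly from the standardised counts in Table 3. The short Python sketch below reproduces that arithmetic; the `Strategy` class and helper names are hypothetical, and the counts are copied from Table 3 as printed (rounded to whole patients per 1,000), so results may differ slightly from figures computed on the non-standardised cohorts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    high_risk: int    # people classified as high risk, per 1,000
    events_high: int  # events among those classified as high risk (true positives)
    events_low: int   # events among those classified as low risk (false negatives)


def positive_predictive_value(s):
    """Share of high-risk classifications that go on to have an event."""
    return s.events_high / s.high_risk


def relative_reduction(reference, candidate):
    """Relative fall in high-risk classifications versus a reference strategy."""
    return (reference.high_risk - candidate.high_risk) / reference.high_risk


# Primary prevention population rows of Table 3.
qrisk3_10 = Strategy("QRISK3 at 10%", 272, 36, 9)
trisk_10 = Strategy("TRisk at 10%", 216, 40, 5)
trisk_15 = Strategy("TRisk at 15%", 178, 37, 8)
# Diabetes rows of Table 3.
treat_all_dm = Strategy("Treat all", 1000, 117, 0)
qrisk3_10_dm = Strategy("QRISK3 at 10%", 866, 114, 3)
trisk_10_dm = Strategy("TRisk at 10%", 757, 115, 2)

print(f"QRISK3 10% PPV: {positive_predictive_value(qrisk3_10):.1%}")
print(f"TRisk 10% vs QRISK3 10%: {relative_reduction(qrisk3_10, trisk_10):.1%} fewer")
print(f"TRisk 15% vs QRISK3 10%: {relative_reduction(qrisk3_10, trisk_15):.1%} fewer")
print(f"TRisk 10% vs treat-all (diabetes): {relative_reduction(treat_all_dm, trisk_10_dm):.1%} fewer")
print(f"TRisk 10% vs QRISK3 10% (diabetes): {relative_reduction(qrisk3_10_dm, trisk_10_dm):.1%} fewer")
```

Each comparison pairs a reduction in high-risk classifications with the change in `events_low`, which is the trade-off the clinical impact analysis is designed to expose.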
Discussion
Our study shows that TRisk, a novel Transformer-based survival model, significantly outperformed widely recommended benchmark models in identifying individuals at risk of CVD. TRisk was also less dependent on age than our benchmark models. Due to its higher discriminatory performance, TRisk enabled a potential upwards shift in the commonly recommended risk thresholds for more targeted selection of individuals for treatment. Application of TRisk would lead to selection of about one-third and one-fourth fewer individuals in general and diabetes populations respectively than application of existing policies without any material trade-offs.
Risk-based selection of individuals for CVD preventative therapy is widely used in clinical practice.5 Major risk models include the USA’s ASCVD score, Europe’s SCORE2 and SCORE-OP, and UK’s QRISK3.7 One limitation, however, is that these approaches classify a large number of individuals as eligible for treatment, even though many of them will not experience an event.8 Indeed, in a survey of general physicians in the Netherlands, a key barrier to uptake of CVD risk prediction models was their potential for over-treatment.24 In line with this, we found QRISK3 to classify over a quarter of the adult population as eligible for treatment in the primary prevention population8; however, seven out of eight individuals predicted to have a preventable CVD event did not experience such an event. We found that increasing the risk threshold for such models would not overcome their prediction inefficiency; while the proportion of true positives among those selected would increase, the number of false negatives would also increase, due to their relatively low sensitivity. Recent AI models like NeuralCVD and UKCRP have been developed on UK data in part to address these issues but either fail to materially improve upon existing models like QRISK3 or do not consider censoring.25 Additionally, these approaches lack validation in representative primary care data (e.g., QResearch, CPRD), a critical gap as primary care is the implementation target.
TRisk has the potential to lead to substantial efficiency gains without a trade-off in increased false negatives. At both 10% and 15% risk thresholds, it reduced high-risk classifications while simultaneously decreasing false negatives. In diabetes patients, where guidelines traditionally recommended universal preventative therapy, TRisk enables a 24% reduction in treatment recommendations while missing very few cases (0.2% of the cohort). Even compared to the recent guidelines revision adopting QRISK3 at a 10% threshold, TRisk classifies 13% fewer diabetes patients as high-risk while capturing more true positives.5
Despite the increasing use of preventive statins due to their low cost and safety, there have been calls for better matching of individuals to treatments.26 TRisk addresses this by more accurately identifying high-risk individuals who most require medical intervention. In the UK, this could reduce preventive treatment prescriptions by 35% while maintaining or improving event prevention rates compared to QRISK3. Importantly, this is not to restrict treatment access; rather TRisk can enable more precise targeting, achieved using only readily available EHR data without requiring additional tests or biomarkers.
While a key concern with AI in medicine is potential bias against particular patient groups, TRisk demonstrated lower variation across sex and socioeconomic subgroups, despite not explicitly including these features. Additionally, TRisk was less reliant on age than benchmark models. Traditional models’ performance decreased substantially in narrower age ranges;14,22 however, TRisk’s performance improved, suggesting it captures rich temporal features from EHR data rather than relying on age as a proxy for unmeasured risk factors (as traditional models often do).
How does TRisk conduct prediction and achieve superior performance compared to expert-driven models? TRisk analyses the entire patient EHR in its temporal context. While the model’s complexity limits immediate explainability, unlike other machine learning approaches (e.g., tree-based models), previous analyses of the underlying BEHRT framework have revealed its ability to infer patient characteristics and identify both established and novel risk factors.11,12 For example, BEHRT identified traditionally underappreciated risk factors for heart failure like iron deficiency anaemia and COPD treatments.12 More impressively, it captured temporal shifts in risk associations, such as how the relationship between treatment of glaucoma and heart failure evolved as treatments shifted from beta-blockers to prostaglandin analogues.12 Looking ahead, incorporating additional data types such as omics could enhance prediction performance for specific patient subgroups, while formal explainability studies could better illuminate TRisk’s predictive mechanisms and lead to knowledge discovery.
Lastly, while established models like SCORE2 and QRISK3 perform well in the primary prevention population, they tend to perform poorly in disease-specific groups like diabetes, where risk factor relationships differ from the primary prevention population.14,22,27 TRisk overcomes this limitation through knowledge transfer: by first learning patterns from the larger primary prevention population and then fine-tuning these learned relationships on the diabetes cohort, it achieves superior predictive performance without requiring separate models for different patient subgroups. This successful approach suggests potential applications for knowledge transfer in other clinical contexts.23
Strengths and limitations
Our study provides comprehensive model comparisons (statistical and deep learning) across general and high-risk groups, evaluating discrimination, calibration, net benefit, and impact analyses at various risk thresholds.
Although TRisk was validated using UK population data (akin to other widely accepted benchmark approaches such as QRISK314), it was deliberately engineered to be transferable across various data settings through the use of standardised clinical dictionaries (e.g., SNOMED CT), so that it can be adapted to new settings with minimal friction. Indeed, TRisk’s underlying foundational BEHRT model surpasses several other approaches in multi-outcome prediction of over 300 clinical outcomes, hence providing initial evidence that TRisk can serve as a multipurpose risk assessment tool.11 Additionally, independent research groups have successfully applied BEHRT variants across multiple US healthcare systems for diverse prediction tasks, suggesting TRisk’s wider applicability.28,29 Nevertheless, future research should validate TRisk on external EHR datasets against additional comparator models, and evaluate its performance for predicting CVD and other outcomes.
Compared with other CVD risk prediction studies, the median follow-up of 2.5 years in our study aligns with prior research using similar methods and designs, with 91% of patients lacking complete 10-year follow-up, comparable to the 88% reported in a recent QRISK3 validation study.19 Our study’s follow-up patterns are generally consistent with UK primary prevention risk prediction studies, supporting the suitability of our data for validating CVD risk scores. However, unlike previous QRISK3 studies, we lacked access to Townsend scores for modelling.14,19 Additionally, excluding statin users at baseline, while consistent with the QRISK3 studies, may introduce bias as patients often initiate statins soon after baseline.14,30 Indeed, future research should explore methods to account for treatment “drop-in”.30
Finally, TRisk depends on access to the entire EHR of an individual and cannot be reduced to a simple scoring algorithm. While this is a practical limitation, there are effective tools that have been robustly tested to facilitate the transition of AI from development to deployment in settings with limited computational capacities. Future efforts could explore leveraging TRisk as a first-line screening tool for population-level risk assessment, alerting GPs to high-risk patients requiring clinical attention. Computing risk scores offline using comprehensive EHR data would overcome the computational constraints of running complex models at point-of-care. This approach would also minimise burden on stretched primary care services by communicating only pertinent information about at-risk patients.
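The offline screening workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes 10-year risk scores have already been computed in a batch job, and the `RiskRecord` type, the `build_alert_list` function, and the patient identifiers are all hypothetical. The default 10% threshold mirrors the widely recommended threshold discussed in this paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RiskRecord:
    """One precomputed prediction (hypothetical structure)."""
    patient_id: str
    ten_year_risk: float  # predicted probability in [0, 1]


def build_alert_list(records: Iterable[RiskRecord], threshold: float = 0.10) -> List[RiskRecord]:
    """Return only patients at or above the threshold, highest risk first.

    Patients below the threshold are omitted, so that only pertinent
    information is passed on to the practice.
    """
    flagged = [r for r in records if r.ten_year_risk >= threshold]
    return sorted(flagged, key=lambda r: r.ten_year_risk, reverse=True)


if __name__ == "__main__":
    # Toy scores for illustration only; not derived from the study.
    batch = [
        RiskRecord("patient-a", 0.04),
        RiskRecord("patient-b", 0.17),
        RiskRecord("patient-c", 0.10),
        RiskRecord("patient-d", 0.31),
    ]
    for record in build_alert_list(batch):
        print(record.patient_id, record.ten_year_risk)
```

Because the scores are computed ahead of time, the point-of-care system only needs to read a short, pre-sorted list rather than run the model itself.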
Conclusion
TRisk, a novel Transformer-based survival model, outperformed standard models and recommendations for selection of individuals at high risk of CVD. Implementation of TRisk into routine practice could improve allocative efficiency by reducing the number of patients offered treatment by about one-third in the general population and about one-fourth in the diabetes population, as compared with status-quo recommendation strategies.
Supplementary Material
Conflict of interest
KR is editor-in-chief for BMJ Heart; he has previously received consulting fees from Medtronic CRDN, and honoraria or fees from BMJ Heart, PLoS Medicine, AstraZeneca MEA Region, Medscape, Radcliffe Cardiology, and WebMD Medscape UK. KR reports grants from National Institute for Health and Care Research (NIHR), Medical Research Council (MRC), British Heart Foundation (BHF), Novo Nordisk Foundation, Horizon Europe, and Roche. SR is methodological advisor for the BMJ Heart; he has previously received consultancy fees from Lucem Health. SR reports grants from Oxford University Hospital (OUH) Trust. GSC is a statistics editor for the BMJ; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years. GSC is a National Institute for Health and Care Research (NIHR) Senior Investigator. The views expressed in this article are those of the author(s) and not necessarily those of the NIHR, the BMJ, BMJ Heart, PLoS Medicine, AstraZeneca MEA Region, Medscape, and WebMD Medscape UK.
Footnotes
Public and patient involvement
Patients were not involved during the design, conduct, reporting, interpretation, or dissemination of the study.
Data sharing
This study was approved by the CPRD Independent Scientific Advisory Committee (protocol number: 16_049R). The CPRD data used in this study are available to researchers through a licence agreement following protocol approval from the Independent Scientific Advisory Committee (ISAC). These data are not publicly available due to licensing restrictions. Details about data access and sharing policies can be found at www.cprd.com. The code for the modelling and analysis can be found at https://github.com/srn284/TRisk.
References
- 1. The Blood Pressure Lowering Treatment Trialists’ Collaboration. Pharmacological blood pressure lowering for primary and secondary prevention of cardiovascular disease across different levels of blood pressure: an individual participant-level data meta-analysis. The Lancet 2021; 397: 1625–1636.
- 2. Mihaylova B, Emberson J, Blackwell L, Keech A, Simes J, Barnes EH, Voysey M, Gray A, Collins R, Baigent C, De Lemos J, Braunwald E, Blazing M, Murphy S, Downs JR, Gotto A, Clearfield M, Holdaas H, Gordon D, et al. The effects of lowering LDL cholesterol with statin therapy in people at low risk of vascular disease: Meta-analysis of individual data from 27 randomised trials. The Lancet 2012; 380: 581–590.
- 3. Bidel Z, Nazarzadeh M, Canoy D, Copland E, Gerdts E, Woodward M, Gupta AK, Reid CM, Cushman WC, Wachtell K, Teo K, Davis BR, Chalmers J, Pepine CJ, Rahimi K. Sex-Specific Effects of Blood Pressure Lowering Pharmacotherapy for the Prevention of Cardiovascular Disease: An Individual Participant-Level Data Meta-Analysis. Hypertension. Epub ahead of print 24 July 2023. DOI: 10.1161/HYPERTENSIONAHA.123.21496.
- 4. Fulcher J, O’Connell R, Voysey M, Emberson J, Blackwell L, Mihaylova B, Simes J, Collins R, Kirby A, Colhoun H, Braunwald E, La Rosa J, Pedersen TR, Tonkin A, Davis B, Sleight P, Franzosi MG, Baigent C, Keech A, et al. Efficacy and safety of LDL-lowering therapy among men and women: Meta-analysis of individual data from 174 000 participants in 27 randomised trials. The Lancet; 385. Epub ahead of print 2015. DOI: 10.1016/S0140-6736(14)61368-4.
- 5. National Institute for Health and Care Excellence. Lipid modification: cardiovascular risk assessment and the modification of blood lipids for the primary and secondary prevention of cardiovascular disease. NICE, 2014.
- 6. Sundström J, Arima H, Woodward M, Jackson R, Karmali K, Lloyd-Jones D, Baigent C, Emberson J, Rahimi K, Macmahon S, Patel A, Perkovic V, Turnbull F, Neal B, Agodoa L, Estacio R, Schrier R, Lubsen J, Chalmers J, et al. Blood pressure-lowering treatment based on cardiovascular risk: A meta-analysis of individual patient data. The Lancet 2014; 384: 591–598.
- 7. Lloyd-Jones DM, Braun LT, Ndumele CE, Smith SC, Sperling LS, Virani SS, Blumenthal RS. Use of Risk Assessment Tools to Guide Decision-Making in the Primary Prevention of Atherosclerotic Cardiovascular Disease: A Special Report from the American Heart Association and American College of Cardiology. Circulation; 139. Epub ahead of print 2019. DOI: 10.1161/CIR.0000000000000638.
- 8. Herrett E, Gadd S, Jackson R, et al. Eligibility and subsequent burden of cardiovascular disease of four strategies for blood pressure-lowering treatment: a retrospective cohort study. Lancet 2019; 394: 663–671.
- 9. Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, Minhas R, Sheikh A, Brindle P. Predicting cardiovascular risk in England and Wales: Prospective derivation and validation of QRISK2. BMJ 2008; 336: 1475–1482.
- 10. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention Is All You Need. Adv Neural Inf Process Syst.
- 11. Li Y, Rao S, Solares JRA, Hassaine A, Ramakrishnan R, Canoy D, Zhu Y, Rahimi K, Salimi-Khorshidi G. BEHRT: Transformer for Electronic Health Records. Sci Rep 2020; 10: 7155.
- 12. Rao S, Li Y, Ramakrishnan R, Hassaine A, Canoy D, Cleland JG, Lukasiewicz T, Salimi-Khorshidi G, Rahimi K. An explainable Transformer-based deep learning model for the prediction of incident heart failure. IEEE J Biomed Health Inform 2022; 1–1.
- 13. Tang W, Ma J, Mei Q, Zhu J. SODEN: A Scalable Continuous-Time Survival Model through Ordinary Differential Equation Networks, http://arxiv.org/abs/2008.08637 (2020).
- 14. Hippisley-Cox J, Coupland C, Brindle P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: Prospective cohort study. BMJ; 357.
- 15. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol; 18.
- 16. Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, Ghassemi M, Liu X, Reitsma JB, Van Smeden M, Boulesteix AL, Camaradou JC, Celi LA, Denaxas S, Denniston AK, Glocker B, Golub RM, Harvey H, Heinze G, et al. TRIPOD+AI statement: Updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. Epub ahead of print 2024. DOI: 10.1136/bmj-2023-078378.
- 17. Herrett E, Gallagher AM, Bhaskaran K, Forbes H, Mathur R, van Staa T, Smeeth L. Data Resource Profile: Clinical Practice Research Datalink (CPRD). Int J Epidemiol 2015; 44: 827–836.
- 18. Kuan V, Denaxas S, Gonzalez-Izquierdo A, Direk K, Bhatti O, Husain S, Sutaria S, Hingorani M, Nitsch D, Parisinos CA, Lumbers RT, Mathur R, Sofat R, Casas JP, Wong ICK, Hemingway H, Hingorani AD. A chronological map of 308 physical and mental health conditions from 4 million individuals in the English National Health Service. Lancet Digit Health 2019; 1: e63–e77.
- 19. Li Y, Sperrin M, Ashcroft DM, Van Staa TP. Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: Longitudinal cohort study using cardiovascular disease as exemplar. The BMJ; 371.
- 20. Austin PC, Harrell FE, van Klaveren D. Graphical calibration curves and the integrated calibration index (ICI) for survival models. Stat Med; 39. Epub ahead of print 2020. DOI: 10.1002/sim.8570.
- 21. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006; 26: 565–574.
- 22. SCORE2 risk prediction algorithms: New models to estimate 10-year risk of cardiovascular disease in Europe. Eur Heart J; 42. Epub ahead of print 2021. DOI: 10.1093/eurheartj/ehab309.
- 23. Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, Xiong H, He Q. A Comprehensive Survey on Transfer Learning. Proceedings of the IEEE.
- 24. Eichler K, Zoller M, Tschudi P, Steurer J. Barriers to apply cardiovascular prediction rules in primary care: a postal survey. BMC Fam Pract 2007; 8: 1.
- 25. Steinfeldt J, Buergel T, Loock L, Kittner P, Ruyoga G, zu Belzen JU, Sasse S, Strangalies H, Christmann L, Hollmann N, Wolf B, Ference B, Deanfield J, Landmesser U, Eils R. Neural network-based integration of polygenic and clinical information: development and validation of a prediction model for 10-year risk of major adverse cardiac events in the UK Biobank cohort. Lancet Digit Health; 4. Epub ahead of print 2022. DOI: 10.1016/S2589-7500(21)00249-1.
- 26. The Lancet. Personalised medicine in the UK. The Lancet 2018; 391: e1.
- 27. Rao S, Li Y, Nazarzadeh M, Canoy D, Mamouei M, Hassaine A, Salimi-Khorshidi G, Rahimi K. Systolic Blood Pressure and Cardiovascular Risk in Patients with Diabetes: A Prospective Cohort Study. Hypertension 2023; 80: 598–607.
- 28. Huang D, Cogill S, Hsia RY, Yang S, Kim D. Development and external validation of a pretrained deep learning model for the prediction of non-accidental trauma. NPJ Digit Med; 6. Epub ahead of print 2023. DOI: 10.1038/s41746-023-00875-y.
- 29. Pang C, Jiang X, Kalluri KS, Spotnitz M, Chen R, Perotte A, Natarajan K. CEHR-BERT: Incorporating temporal information from structured EHR data to improve prediction tasks. Proceedings of Machine Learning Research 2021; 158: 239–260.
- 30. Sperrin M, Martin GP, Pate A, Van Staa T, Peek N, Buchan I. Using marginal structural models to adjust for treatment drop-in when developing clinical prediction models. Stat Med; 37. Epub ahead of print 2018. DOI: 10.1002/sim.7913.
