This is a preprint.

It has not yet been peer reviewed by a journal.

medRxiv [Preprint]. 2025 Feb 3:2025.02.01.25321518. [Version 1] doi: 10.1101/2025.02.01.25321518

Genetics of cardiometabolic disease progression

Johanne M Justesen 1,2, Guhan Venkataraman 1, Yosuke Tanigawa 1, Ruilin Li 3, Trevor Hastie 1,4, Robert Tibshirani 1,4, Joshua W Knowles 2, Manuel A Rivas 1
PMCID: PMC11838626  PMID: 39974115

Abstract

Background:

Genome-wide association studies have been crucial in gaining insights into the genetics of cardiometabolic diseases. However, little is known about the genetics of cardiometabolic disease progression, which may have both a different genetic architecture and significant implications for treatment decisions. Disease progression can be ascertained by the time from the first disease diagnosis to a second qualifying event (e.g. diagnostic lab, code or procedure). While data of this nature have been available in large repositories such as the UK Biobank, large-scale genome-wide screens in a time-to-event setting have been extremely difficult because of computational and statistical challenges.

Methods and Results:

We applied our method, snpnet-Cox, which has proven effective for simultaneous variable selection and estimation in high-dimensional settings, to examine the genetic contributions to cardiometabolic disease progression, measured as the time from disease diagnosis to a complication or comorbidity diagnosis or a procedure in the UK Biobank. We applied a Cox regression model in a time-to-event setting to compute polygenic hazard scores (PHS). We identified ten new PHS that significantly predict disease progression. One example is the PHS that significantly predicts the time from hyperlipidemia diagnosis to coronary artery bypass graft (CABG) surgery (hazard ratio 1.3 per PHS standard deviation; p = 4.5×10⁻⁹). In this PHS, we identified a common variant, rs11041816 (downstream of LMO1), which protects against this disease progression (beta = −0.05).

Conclusion:

snpnet-Cox is a fast and reliable tool to compute PHS capturing genetics in the time-to-event setting. The computed PHS can be used to stratify individuals with an underlying diagnosis (e.g. hyperlipidemia) into different trajectories of disease progression (e.g. CABG), thereby identifying potential points of intervention. As more time-to-event data are released, this approach can provide insight into disease progression at a fraction of the computational cost otherwise required. We make available ten polygenic hazard scores that we find to be significant predictors of cardiometabolic disease progression.

Keywords: Genetics, Cardiometabolic traits, Polygenic score, disease progression, lipids, lasso, Cox model, polygenic hazards score, prediction, hyperlipidemia, operation

Introduction

Cardiometabolic diseases are an increasing global health burden affecting about a third of the world population 1,2. These common, chronic metabolic conditions, such as cardiovascular disease (including hypertension, dyslipidemia, atherosclerosis and heart failure) and type 2 diabetes, are a major source of morbidity and mortality but are highly heterogeneous in disease progression. There are many well-known clinical modifiers of disease progression (e.g. smoking for atherosclerosis and obesity for type 2 diabetes)2. Similarly, genome-wide association studies (GWAS) have identified thousands of SNPs that are associated with incident and prevalent cardiometabolic disease status3. However, the genetic factors that alter the course of disease progression are not well understood. Furthermore, it is unclear whether the genetic risk factors that alter initial disease presentation are the same as those that alter disease progression. From a clinical perspective, it would be highly relevant to identify individuals who are genetically predisposed to a specific prognosis and could therefore particularly benefit from targeted interventions.

Indeed, a key prerequisite for precision health and medicine is a deeper understanding of disease state progression. Populations in biobanks with longitudinal medical record data are useful in this regard. For instance, one measure of disease progression is the time from initial disease diagnosis to a comorbidity diagnosis (e.g. time from first diagnosis of type 2 diabetes to the diagnosis of a relevant comorbidity such as diabetic neuropathy). Another measure of progression, reflecting disease severity, is captured by the disease-relevant procedures performed after the initial diagnosis (e.g. time from initial diagnosis to subsequent surgery, as recorded by the Office of Population Censuses and Surveys (OPCS) Classification of Interventions and Procedures version 4 (OPCS-4)).

The UK Biobank provides an opportunity to study genetics in longitudinal settings, providing times of disease diagnoses as well as dates of surgical procedures. This cohort has genotypic data on about 500,000 individuals coupled with unique phenotype information from electronic health records (EHRs) comprising primary care, hospital inpatient, cancer registry and death records 4.

Statistical models for disease progression have been challenging to develop. The Cox proportional hazards model provides a flexible mathematical framework to describe the relationship between the time to an event and various independent variable features, allowing for the calculation of a time-dependent baseline hazard useful in modeling disease progression. This model is regularly used to analyze time-to-event data in prospective epidemiological settings. However, these survival models face computational and statistical challenges when the predictors are ultra-high dimensional (i.e. feature dimension is far greater than the number of observations) and in large-scale settings where the data exceeds memory limits. To overcome this issue, we have previously developed a batch-screening iterative lasso (BASIL) algorithm to fit a Cox proportional hazard model by maximizing the Lasso partial likelihood function (snpnet-Cox)5. The Lasso is an effective tool for high-dimensional variable selection and prediction 6. We have previously applied snpnet-Cox to examine genetics of common complex diseases in a time-to-event setting 7.

Here, we apply snpnet-Cox to compute polygenic hazard scores (PHS) that examine the genetic variation of cardiometabolic disease progression in white British individuals from the UK Biobank (n = 337,129). Disease progression is defined as the time from disease diagnosis to either the next recorded disease or a surgical procedure. We replicate our findings using these PHS in a separate, non-British white population in UK Biobank (n = 28,134). The output of our algorithm is the full Lasso path, i.e. the parameter estimates at all predefined regularization parameters, together with their validation accuracy measured using the concordance index (C-index) (Figure 1). We show that ten PHS significantly predict disease progression. Several of the PHS variants are known from GWAS to be associated with cardiometabolic disease status. However, we identify additional variants, such as a common variant downstream of LMO1 (rs11041816) that protects against progression from a hyperlipidemia diagnosis to coronary artery bypass graft (CABG) surgery.

Figure 1: Study overview.

A: Summary of the UK Biobank genotype and phenotype data used in the study. We used the white British subset of the UK Biobank for discovery and the non-British white subset for replication, analyzing LD-pruned and quality-controlled variants in relation to cardiometabolic disease progression measures based on electronic health records (hospital in-patient, primary care, and death records) and questionnaires with date of event. B: We created cardiometabolic surgical phenotypes by combining in-hospital operational codes from the OPCS-4 (Supplementary Table 1). C: We study the genetics of disease progression in a time-to-event setting in which disease progression is measured by time from disease diagnosis (International Classification of Diseases (ICD)-10 code) to disease comorbidity (next ICD-10 code) or having operation performed (next OPCS-4 code). D: We applied our R package, snpnet-Cox, to fit the Cox proportional Hazard model on the large genotype-disease progression dataset and compute polygenic hazard scores (PHS). The output of our algorithm is the full Lasso path.

Methods

Study population

The UK Biobank is a large, prospective population-based cohort study including individuals collected from multiple sites across the United Kingdom 4,8. It contains extensive genotypic and phenotypic information for 500,000 individuals aged 40–69 years when recruited in 2006–2010, including genome-wide genotyping, questionnaires, and physical measures for a wide range of health-related outcomes. In addition, information has been linked to registries from primary care, in-hospital records, death records, and the cancer registry. The discovery analyses were performed in white British individuals (n=337,129), and the replication in non-British white individuals (n=28,134).

Genetic data preparation

We used genotype data from the UK Biobank dataset release version 2 and the hg19 human genome reference for all analyses in the study. To minimize variability due to population structure in our dataset, we restricted our discovery analyses to 337,129 unrelated white British individuals. In replication analyses, we included 28,134 non-British white individuals. Using the 337,129 individuals and the 147,604 variants that are marked as “in_PCA” in the variant QC file (ukb_snp_qc.txt) from UK Biobank (Sudlow et al. (2015)4), we computed the principal components of genetic variants using the “--pca approx” sub-command in PLINK2 9. We used the top 10 principal components, sex, genotyping array and age at initial disease diagnosis as covariates.

The genetic variants used in this study were a combination of the directly genotyped variants from the UK Biobank (release version 2)4, imputed human leukocyte antigen (HLA) allelotypes 10, and copy number variations (CNVs)11, resulting in a genotype matrix of 1,080,968 variants, as described in Sinnott-Armstrong et al. (2021)12. As a preprocessing step, we excluded variants with a missing rate greater than 10% and variants with a minor allele frequency below 0.001, which left approximately 700,000 variants as features for our progression analyses.
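The two preprocessing filters can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `passes_qc`, the encoding of genotypes as alternate-allele counts with `None` for missing calls, and the restriction to biallelic variants are all assumptions made for illustration. HLA allelotypes and CNVs may need a different encoding.

```python
from typing import Optional, Sequence

MAX_MISSING_RATE = 0.10  # from the text: exclude variants with missing rate > 10%
MIN_MAF = 0.001          # from the text: exclude variants with MAF < 0.001


def passes_qc(genotypes: Sequence[Optional[int]]) -> bool:
    """Return True if a biallelic variant passes both filters.

    `genotypes` holds alternate-allele counts (0, 1 or 2) per individual,
    with None marking a missing call (hypothetical encoding).
    """
    n = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n == 0 or not called:
        return False  # nothing to estimate a frequency from
    missing_rate = 1 - len(called) / n
    alt_freq = sum(called) / (2 * len(called))  # two alleles per individual
    maf = min(alt_freq, 1 - alt_freq)           # frequency of the rarer allele
    return missing_rate <= MAX_MISSING_RATE and maf >= MIN_MAF
```

In practice such filters would normally be applied with a genotype toolkit rather than in pure Python; the sketch only makes the two thresholds explicit.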

Disease phenotypes ICD-10

Time-to-event phenotypes were derived from First Occurrence of Health Outcomes data as defined by 3-character ICD-10 codes in UK Biobankʼs Category 1712. The First Occurrence data-fields were generated by combining: read code information in the primary care data (Category 3000); ICD-9 and ICD-10 codes in hospital inpatient data (Category 2000); ICD-10 codes in death registry records (Field 40001, Field 40002); and self-reported medical condition codes (Field 20002), reported at baseline or subsequent UK Biobank assessment center visits as 3-character ICD-10 codes (censoring date March 1st 2020). A group of ʻSpecial event datesʼ from primary care data were changed similarly to UK Biobank self-reported events.

  • 1902-02-02 was changed to date of birth (total of 831 events)

  • 1903-03-03 was changed to date of birth + 6 months (total of 651 events)

  • 2037-07-07 is a date in the future and was removed from analysis (total of 6 events)
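The recoding rules in the list above can be sketched in a few lines of Python. The helper names `add_months` and `recode_special_date` are hypothetical, and returning `None` to signal "remove this event" is a design choice made for this sketch, not something taken from the authors' code.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's end."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def recode_special_date(event: date, dob: date) -> Optional[date]:
    """Apply the three special-date rules; None means the event is dropped."""
    if event == date(1902, 2, 2):
        return dob                 # placeholder for "date of birth"
    if event == date(1903, 3, 3):
        return add_months(dob, 6)  # placeholder for "date of birth + 6 months"
    if event == date(2037, 7, 7):
        return None                # future placeholder date: removed
    return event                   # ordinary dates pass through unchanged
```

All other dates are returned unchanged, so the function can be mapped over every event date in a record.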

Algorithmically-defined outcomes (based on data from Category 42) include phenotypes of select health-related events obtained through algorithmic combinations of coded information from the UK Biobankʼs baseline assessment data collection. The data were derived from self-reported medical conditions, operations and medications together with linked data from hospital admissions and death registries (censoring date March 1st 2019).

  • 334 items have an unknown date value, coded as 1900-01-01, and were removed from analyses

To calculate age at disease diagnosis, death, and censoring, we computed the dates of birth (DOB) using the Month of Birth (Data-Field 52) and Year of Birth (Data-Field 34). All DOB were set to the first day of the birth month to avoid negative ages at diagnosis. All ages at event equal to 0 were changed to an age at event of 1 month, since snpnet-Cox requires values greater than 0. There were 40 event dates that fell one month before birth; these were changed to the DOB.
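A small Python sketch of this age calculation follows. Several details are assumptions rather than facts from the text: ages are expressed in years, a year is taken as 365.25 days, "1 month" is taken as 1/12 of a year, and any event dated before the DOB (not only those one month before) is moved to the DOB.

```python
from datetime import date

DAYS_PER_YEAR = 365.25       # assumption: average year length
ONE_MONTH_IN_YEARS = 1 / 12  # assumption: "1 month" expressed in years


def date_of_birth(birth_year: int, birth_month: int) -> date:
    """DOB is set to the first day of the birth month."""
    return date(birth_year, birth_month, 1)


def age_at_event(birth_year: int, birth_month: int, event: date) -> float:
    """Age at an event in years, strictly greater than zero."""
    dob = date_of_birth(birth_year, birth_month)
    if event < dob:
        event = dob  # events dated before birth are moved to the DOB
    age = (event - dob).days / DAYS_PER_YEAR
    # An age of exactly 0 is replaced by one month so the value is positive.
    return age if age > 0 else ONE_MONTH_IN_YEARS
```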

The data were structured as an n by p matrix of covariate values, where each row corresponded to an individual from the UK Biobank and each column to a covariate. y was an n-length vector of event/death/censoring times, and status was an n-length vector in which 0 was assigned if the entry in y indicated right censoring (i.e. the event had not yet happened at the time the data were collected) or death, and 1 was assigned if the event occurred.
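As an illustration of how one entry of y and status could be derived for a progression analysis, here is a hedged Python sketch. The function `progression_record` and its arguments are hypothetical. Measuring time from the baseline diagnosis (rather than from birth) and excluding individuals whose follow-up does not extend past the diagnosis are assumptions based on the study design described elsewhere in the text.

```python
from typing import Optional, Tuple


def progression_record(
    age_dx: float,
    age_event: Optional[float],
    age_death: Optional[float],
    age_censor: float,
) -> Optional[Tuple[float, int]]:
    """Return (y, status) for one individual, or None if excluded.

    y is the time from the baseline diagnosis to the outcome event, death,
    or censoring, whichever comes first. status is 1 only if the outcome
    event was observed; death and censoring both give status 0.
    """
    # End of follow-up when no outcome event is observed.
    end = age_censor if age_death is None else min(age_death, age_censor)

    if age_event is not None and age_event <= end:
        if age_event <= age_dx:
            return None  # outcome not after the diagnosis: excluded
        return (age_event - age_dx, 1)

    if end <= age_dx:
        return None      # no follow-up after the diagnosis
    return (end - age_dx, 0)
```

Collecting these pairs over all included individuals gives the y and status vectors that accompany the n by p covariate matrix.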

Phenotypes generated from surgical procedures OPCS-4

In the UK, the OPCS Classification of Interventions and Procedures version 4.7 is the system used to code interventions, and ICD-10 is the system used for diagnoses. Each admission may contain several episodes, each corresponding to the care provided during a hospitalization. Using operational procedure codes (OPCS-4), we constructed phenotypes for common medical procedures for cardiometabolic conditions and complications. Most codes relevant to cardiac surgery belong to OPCS chapters K (Heart) and L (Arteries and veins). Multiple OPCS-4 codes were combined to create each phenotype; the phenotypes created include those described in Supplementary Table 1. The two surgical phenotypes with the highest number of cases were percutaneous coronary intervention (angioplasty) and coronary artery bypass grafting (CABG).

Percutaneous coronary intervention (PCI, formerly known as coronary angioplasty and stent implantation) is a procedure that improves blood flow to the heart and thereby decreases heart-related chest pain (angina). By using a catheter (thin flexible tube) to place a small structure called a stent, it opens up blood vessels in the heart that have been narrowed by plaque buildup (a condition known as atherosclerosis).

Coronary artery bypass grafting (CABG) is an open chest procedure to perform direct revascularization of the heart by using a suitable vein from the chest, arm or leg for grafting to the coronary artery, thereby allowing the blood to bypass narrowings or blockages in the artery and reducing angina.

Cardiometabolic disease progression phenotypes

We selected seven common cardiometabolic disease phenotypes with more than 15,000 cases in the UK Biobank to use as baseline diseases: essential hypertension (ICD-10: I10), lipoprotein disorder (E78; hereafter referred to as hyperlipidemia, given common clinical usage), chronic ischemic heart disease (I25), angina pectoris (I20), obesity (), non-insulin-dependent diabetes (type 2 diabetes) (E11) and atrial fibrillation (I48) (Supplementary Table 2). For disease progression outcomes, we included 21 disease and 12 operation phenotype outcomes that had more than 400 cases (Supplementary Tables 3 and 4).

BASIL algorithm to fit a Cox Proportional Hazard model on genotype-disease progression time dataset

To compute PHS, we used the R package snpnet, which is based on the batch-screening iterative lasso (BASIL) algorithm that fits the full lasso solution path for very large and high-dimensional datasets. The method was previously described in Qian et al. (2020)5 and its Cox model application in Li et al. (2021)7.

Briefly, this method is particularly suitable for large-scale and high-dimensional data that does not fit entirely in memory. Loading our UK Biobank data matrix with 1.08 million variants into R takes around 2.4 Terabytes of memory, which exceeds the size of most typical machinesʼ RAM. The Lasso is an effective tool for high-dimensional variable selection and prediction 6. In each iteration, we are able to effectively prune out variables (genetic variants) that are not relevant to disease progression, thereby eventually determining a “path” (of selected variants) relevant to disease.
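For orientation, the optimization problem behind a Lasso-penalized Cox model can be written as below. This is the standard textbook form, given as a sketch rather than the exact objective implemented in snpnet-Cox. The notation is assumed here: x_i is the feature vector of individual i, δ_i is the status indicator, y_i the observed time, g_ik the genotype of individual i at variant k, and λ the regularization parameter. Tied event times are ignored, and the 1/n scaling varies between implementations.

```latex
% Cox proportional hazards model for individual i with feature vector x_i
h(t \mid x_i) = h_0(t)\, \exp\!\left(x_i^{\top} \beta\right)

% Lasso-penalized negative log partial likelihood (no tied event times)
\hat{\beta}(\lambda) = \arg\min_{\beta}
  \left\{ -\frac{1}{n} \sum_{i:\, \delta_i = 1}
    \left[ x_i^{\top}\beta
      - \log \sum_{j:\, y_j \ge y_i} \exp\!\left(x_j^{\top}\beta\right) \right]
  + \lambda \lVert \beta \rVert_1 \right\}

% Polygenic hazard score from the genetic part of the fitted coefficients
\mathrm{PHS}_i = \sum_{k \in \text{variants}} \hat{\beta}_k\, g_{ik}
```

The L1 penalty sets most coefficients exactly to zero, which is why each fitted PHS contains only a small set of active variables. Solving the problem over a decreasing sequence of λ values gives the Lasso path.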

We apply snpnet-Cox to the time-to-event setting where we analyze time from disease diagnosis to comorbidity event. Individuals were only included in the analysis if the age at disease diagnosis was prior to age of comorbidity diagnosis / surgical procedure. We split the dataset into a 70% training, 10% validation and 20% held out test set and apply snpnet-Cox with 50 iterations.
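A 70/10/20 split of individuals can be sketched as follows. The function name `split_individuals`, the use of a seeded shuffle, and the handling of rounding (any remainder goes to the test set) are assumptions for illustration; the text does not state how the split was randomized.

```python
import random
from typing import Dict, List, Sequence


def split_individuals(ids: Sequence[str], seed: int = 0) -> Dict[str, List[str]]:
    """Randomly assign individuals to 70% train, 10% validation, 20% test."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 7 // 10                  # integer arithmetic avoids float rounding
    n_val = n // 10
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder, about 20%
    }
```

The model is fit on the training set, the regularization parameter is chosen on the validation set, and the reported statistics come from the held-out test set.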

First, we assessed the predictive power of the PHS on time-to-event in the individuals in the held-out test set, thereby obtaining a p-value for each disease progression analysis. Second, we computed the hazard ratio (HR) per standard deviation (SD) of the PHS, as well as the HR for individuals within different percentile thresholds of the PHS distribution (top 1%, 5%, 10% and bottom 10%, each compared with the 40–60% group). Third, we computed the concordance index (C-index), which is a validation accuracy measure 13.
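To make the C-index concrete, the sketch below computes Harrell's concordance index with a simple quadratic loop. It assumes that the C-index used in the study follows this common definition. The function name is hypothetical, pairs with tied times are skipped, and the loop is far too slow for biobank-scale data.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], status: Sequence[int], scores: Sequence[float]
) -> float:
    """Harrell's C: share of comparable pairs ranked correctly by the score.

    A pair (i, j) is comparable when i had an observed event (status 1) and
    j was still under observation after i's event time. The pair is
    concordant when the earlier event has the higher score. Ties in score
    count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if status[i] != 1:
            continue  # censored individuals cannot anchor a pair
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported range of 0.53 to 0.58 in context for scores built from genetic variants alone.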

Results

The ten cardiometabolic disease progression analyses with PHS p-values < 0.01 are listed in Table 1. Among all disease progression analyses, the significantly predictive PHS contained between 2 and 188 active variables, highlighting the sparsity of the Lasso solution in the Cox model (Table 1). The genetic predictive accuracy, measured by the C-index, ranged from 0.53 to 0.58, and the HR per SD of PHS ranged from 1.13 to 1.31 (Table 1).

Table 1: Significant polygenic hazard scores (p < 0.01) computed using snpnet-Cox for cardiometabolic disease progression in white British individuals from UK Biobank

Phenotypes | N cases | Active variables | Scale HR | Scale p-value | Scale C | Top 1% HR (p-value) | Top 5% HR (p-value) | Top 10% HR (p-value) | Bottom 10% HR (p-value)
Hyperlipidemia to CABG | 2393 | 147 | 1.31 | 4.49×10⁻⁹ | 0.58 | 2.60 (0.0019) | 1.61 (0.023) | 1.74 (0.0040) | 0.57 (0.01)
Hyperlipidemia to chronic ischaemic heart disease | 9197 | 27 | 1.13 | 3.02×10⁻⁷ | 0.53 | 1.25 (0.31) | 1.25 (0.055) | 1.12 (0.32) | 0.79 (0.014)
Hypertension to angioplasty | 3075 | 126 | 1.22 | 2.03×10⁻⁶ | 0.56 | 1.97 (0.025) | 1.08 (0.72) | 1.71 (0.0016) | 0.81 (0.21)
Hypertension to T2D | 5747 | 3 | 1.15 | 2.90×10⁻⁶ | 0.54 | 1.13 (0.68) | 1.18 (0.27) | 1.31 (0.11) | 0.96 (0.76)
Hyperlipidemia to myocardial infarction | 4201 | 188 | 1.16 | 2.34×10⁻⁵ | 0.54 | 1.78 (0.045) | 1.14 (0.045) | 1.33 (0.077) | 0.90 (0.47)
Hyperlipidemia to angina | 5728 | 2 | 1.13 | 3.92×10⁻⁵ | 0.53 | 1.51 (0.0006), 0.92 (0.60), 0.90 (0.26) [three of the four percentile values reported]
Hyperlipidemia to angioplasty | 3636 | 10 | 1.13 | 7.53×10⁻⁴ | 0.54 | 1.30 (0.42) | 1.04 (0.86) | 1.35 (0.072) | 0.80 (0.15)
Chronic ischaemic heart disease to CABG | 3548 | 5 | 1.13 | 0.0019 | 0.53 | 1.12 (0.54), 1.34 (0.085), 0.85 (0.27) [three of the four percentile values reported]
Hypertension to CABG | 1978 | 13 | 1.16 | 0.0040 | 0.55 | 1.84 (0.12) | 1.05 (0.85) | 1.24 (0.36) | 0.67 (0.078)
Angina to CABG | 2648 | 6 | 1.13 | 0.0078 | 0.53 | 1.24 (0.61) | 1.04 (0.89) | 1.75 (0.0023) | 0.93 (0.70)

Scaled values (HR, p-value and C-index) are per standard deviation unit. N cases is the number of cases of the outcome (disease or surgery). HR: hazard ratio; CABG: coronary artery bypass graft; T2D: type 2 diabetes.

All results are provided on the Global Biobank Engine (GBE)ʼs snpnet-Cox Disease Progression application (https://biobankengine.shinyapps.io/disease_progression/).

Polygenic hazard score predicts progression from hyperlipidemia to coronary artery bypass surgery

Hyperlipidemia (“Disorders of lipoprotein metabolism and other lipidemias”, ICD-10 code E78) is a general code applied to disorders with a spectrum of abnormalities in the levels of blood lipids (primarily cholesterol and triglycerides) carried by lipoproteins (e.g. low-density lipoprotein cholesterol (LDL-C) and high-density lipoprotein cholesterol (HDL-C)) in the blood. Over time, increased levels of lipids in the blood, particularly LDL-C, can result in buildup of plaque in, narrowing of, or in the worst cases, blockage of arterial blood vessels, especially in the heart (coronary artery disease). While mild coronary artery disease can be effectively managed with medications alone, advanced coronary artery disease may require percutaneous coronary interventions (including angioplasty and stent placement) and, when especially severe, coronary artery bypass graft (CABG) surgery to restore blood flow to the heart14. Susceptibility to and progression of coronary artery disease in the setting of hyperlipidemia are highly heterogeneous, and the heterogeneity is only partly explained by standard clinical characteristics (e.g. age, absolute lipid levels, use of medications to lower lipid levels).

In the UK Biobank, we estimated a HR of 1.31 per SD of PHS (p = 4.49×10⁻⁹) for the disease progression from hyperlipidemia to CABG (Table 1). This PHS was composed of 147 active variables and had a C-index of 0.58 (Table 1).

For individuals in the top 1%, 5% and 10% of the PHS distribution compared to the 40–60%, we estimated a HR of 2.6, 1.6 and 1.7 respectively, whereas individuals in the bottom 10% of the PHS distribution had significantly lower risk of having CABG surgery, with a HR of 0.57 (p = 0.01) (Figure 2A and Supplementary table 5).

Figure 2: Kaplan-Meier curves and PHS extremes.

A: Kaplan-Meier curves for percentiles of polygenic hazard scores (PHS) for variants selected by snpnet-Cox, in the held-out test set (red - top 1%, green - top 5%, light blue - top 10%, blue - 40–60%, and orange - bottom 10%; ticks represent censored observations). B: The zoom box highlights the proportion of coronary artery bypass graft (CABG) surgeries 1, 10 and 15 years after hyperlipidemia diagnosis across the percentile scores. CABG case count = 2393. C: Hazard ratio of CABG for the top 1% of the PHS distribution.

Just one year after a hyperlipidemia diagnosis, we found that 5.24% of individuals in the top 1% of the PHS distribution had CABG performed, whereas only 0.53% of the bottom 10% and 1.61% of the 40–60 percentile of the PHS underwent CABG surgery (Figure 2B).

Further, we found the risk of having CABG fifteen years after hyperlipidemia diagnosis was more than doubled for individuals in the top 10% of the PHS distribution (6.54%) compared with the bottom 10% (2.59%) (Figure 2B).

In 2008, the global prevalence of increased plasma cholesterol levels was estimated to be ~39% among individuals 25 years and older15 and more than one-third caused by CAD and ischemic stroke were attributable to increased plasma LDL-cholesterol levels 16. Compared with the general population, individuals in the UK Biobank are generally healthier due to the “healthy volunteer” selection bias 17; we find that 21% of all UK Biobank individuals have hyperlipidemia, with a mean age of 58 years at diagnosis. However, this condition is also increasingly diagnosed earlier in life, and in the US, 20% of children (age 6–19 years) have adverse lipid levels18. In the UK Biobank, we find that 22 years after hyperlipidemia diagnosis, around 1 in 5 (19.7%) of individuals in the top 1% of the PHS distribution will have had CABG surgery performed. In contrast, for the bottom 10% and 40–60% of the PHS distribution, 1 in 26 (3.79%) and 1 in 19 (5.29%) will have had CABG surgery, respectively (Figure 2A and 2C). This highlights the potential relevance of applying PHS in the context of screening patients for further follow-up examinations.

Active variables of the polygenic hazards score for hyperlipidemia to coronary artery bypass surgery

We applied the BASIL algorithm to generate predictive models with sparse solutions using the genotype and phenotype data, thereby identifying the features (genetic variants) that are most relevant for disease progression in the time-to-event setting.

In the analysis of progression from hyperlipidemia to CABG, snpnet-Cox identified 147 genetic variants (active variables), of which several have been previously identified as risk loci in GWAS of coronary artery disease, hyperlipidemia and related traits. We identified rs10455872 (MAF = 0.08), an intron variant in LPA, with an effect size (beta) of 0.12, corresponding to a hazard ratio of exp(0.12) ≈ 1.13, i.e. an approximately 13% increased hazard of disease progression (Figure 3). This gene encodes lipoprotein(a), a large lipoprotein made by the liver, and is known to be an independent risk factor for cardiovascular diseases and a causal factor for atherosclerosis, heart attacks, strokes, and heart failure 19. Among other known loci are PHACTR1 (beta = 0.06) 20, CDKN2B-AS1 (beta = 0.09) 21–23 and ATP2B1 (beta = 0.02) 24,25. Additionally, we identified a missense variant, rs1990760, in IFIH1, that increases the risk of progressing to CABG (beta = 0.02) (Figure 3). This variant has previously been associated with reduced risk of hypothyroidism (OR 0.92, CI 0.90–0.94; p = 9.3 × 10⁻¹⁷) and was suggested to influence coronary artery disease risk (OR 0.97, CI 0.96–0.99; p = 2.5 × 10⁻⁵) 26. Furthermore, we found a common variant (MAF = 0.46) downstream of LMO1, rs11041816, which showed a protective effect against progression to CABG (beta = −0.05). This gene encodes a transcriptional regulator that has been found to regulate transcription by competitively binding to specific DNA-binding transcription factors 27. LMO1 has previously been associated with other metabolic traits such as systolic and diastolic blood pressure 28, fasting blood glucose 29, body mass index 30 and birth weight 31, but not with coronary artery disease.
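The relation between a Cox coefficient and a hazard ratio, and the way coefficients combine into a score, can be sketched in Python. Two betas are taken from the text. The function names, the additive allele-count encoding, and the treatment of an absent genotype as zero are assumptions for illustration, not the authors' implementation.

```python
import math
from typing import Mapping


def hazard_ratio(beta: float) -> float:
    """A Cox coefficient is a log hazard ratio, so HR = exp(beta)."""
    return math.exp(beta)


def polygenic_hazard_score(
    betas: Mapping[str, float], allele_counts: Mapping[str, float]
) -> float:
    """Weighted sum of allele counts over the active variables.

    Variants without a genotype are treated as a count of 0 here, which is
    a simplifying assumption of this sketch.
    """
    return sum(b * allele_counts.get(v, 0.0) for v, b in betas.items())


# Two of the 147 active variables reported in the text.
example_betas = {"rs10455872": 0.12, "rs11041816": -0.05}
lpa_hr = hazard_ratio(example_betas["rs10455872"])   # above 1: higher hazard
lmo1_hr = hazard_ratio(example_betas["rs11041816"])  # below 1: protective
```

For small coefficients, exp(beta) is close to 1 + beta, which is why a beta of 0.12 is often read as roughly a 12 to 13% higher hazard.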

Figure 3. snpnet-Cox selected variables.

Plot of snpnet-Cox coefficients in analysis of progression from hyperlipidemia to coronary artery bypass graft (CABG) surgery, with 147 active variables. Green dots represent protein-altering variants. https://biobankengine.shinyapps.io/disease_progression/

A phenome-wide association study (PheWAS) was performed for rs11041816 using the Global Biobank Engine, in which we found an association with vascular heart problems diagnosed by doctor - high blood pressure (https://biobankengine.stanford.edu/RIVAS_HG19/variant/11-8243798-A-G and Supplementary Table 6). Additionally, we found that an independent variant, rs4480535, upstream of LMO1, protected against CABG in the FinnGen study (beta = −0.12, p = 1.2×10⁻⁶) (http://r4.finngen.fi/region/I9_CABG/gene/LMO1).

Replication of significant PHS in non-British white individuals of UK Biobank

Next, we examined the significant PHS in the non-British white individuals from UK Biobank (n=28,134), where four of the ten PHS significantly replicated (p < 0.05, Supplementary Table 3). Because of the smaller size of this cohort, we had very few cases for some traits, which affected the significance of the other six PHS. Nevertheless, the six PHS that did not significantly replicate had the same direction of effect as observed in the discovery analyses.

TCF7L2 is the main driver of progression from hypertension to type 2 diabetes

In the UK Biobank, we estimated a HR of 1.15 per SD of PHS (p = 2.90×10⁻⁶) for the disease progression from hypertension to T2D. This PHS was composed of only 3 active variables and had a C-index of 0.54 (Table 1). TCF7L2 rs7903146 has a beta of 0.136. It is known to influence insulin secretion and glucose production and is the main locus known in relation to T2D32.

Applicability of snpnet-Cox

When biobanks have longer follow-up time and include more individuals, these types of analyses will have more power to identify loci. Even though the PHS computed for hyperlipidemia to embolectomy was not significant, it included interesting SNPs such as the GPR107 3 prime UTR variant rs1306 (beta −0.058), which had a protective effect on disease progression. This has previously been … Additionally, an intron variant of TBC1D4 (rs517130, beta −0.028) also protected against progression from hyperlipidemia to embolectomy. TBC1D4 confers muscle insulin resistance and type 2 diabetes33.

Furthermore, the method could also be applied to study other disease progression patterns like those observed in cancer.

Discussion

In this study, we applied the batch-screening iterative Lasso (BASIL) algorithm implemented in the R snpnet package to study the genetic architecture of disease progression by computing PHS in a time-to-event setting within the UK Biobank. We used the BASIL algorithm to fit a Cox proportional hazards model on our large-scale, high-dimensional dataset, generating predictive models with sparse solutions, in which most variants have a coefficient of zero. The output of this method is a list of the genetic variants that associate with the progression measure of interest. We generated a PHS using these genetic variants, identifying ten PHS that were significantly associated with disease progression in a white British cohort, of which four were replicated in a smaller, non-British white cohort. All results from snpnet-Cox have been made available in our Global Biobank Engine (https://biobankengine.shinyapps.io/disease_progression/) 34.

The best predictive PHS model (HR: 1.31, p = 4.49×10⁻⁹) was obtained for the severity progression from hyperlipidemia to CABG. In this process, hyperlipidemia causes coronary artery disease, which eventually progresses to a stage where surgery is needed to restore blood flow to the heart. Several of the variants are known from GWAS to associate with CAD. However, we also identified a common variant downstream of LMO1 that protects against this progression to CABG. This variant has not been described in relation to CAD in GWAS (NHGRI GWAS Catalog).

Applying the BASIL algorithm, we computed PHS for disease progression, which importantly takes time to event into account. Prior studies have examined disease progression by recurrent events of MI and revascularization, where a PRS based on risk loci from GWAS is associated with risk of subsequent events 35–38. However, variants that influence disease onset may not necessarily influence disease progression. By computing … As we move towards whole genome sequencing, it is an advantage that this algorithm is extremely fast and can handle larger-than-memory datasets.

For cardiometabolic diseases the main determinant of patient well-being is not the diagnosis itself but instead the progression to a variety of complications. This varies substantially between patients. This is most likely due to a combination of lifestyle factors together with some unknown degree of genetic predisposition. However, in this setting, we cannot take either medication, physical activity or diet into account. We applied phenotypes which mainly are derived from linking to electronic health records, and the UK Biobank lifestyle measures are mainly from enrollment in this study. Additionally, we had the ICD-10 in 3-character codes and not the more specific 4-character codes.

A substantial challenge in detecting loci that associate with disease progression is sufficient sample size. Overall, we observed that for the ten PHS which significantly predicted cardiometabolic disease progression in the test sets, eight out of the ten were for the two phenotypes having most cases at baseline - hypertension (n=83,727) and hyperlipidemia (n=68,477). These exploratory settings may be underpowered for more of the cardiometabolic disease progression analyses. However, with even more time-to-event data coming online as biobanks continue to gather follow-up data, as well as the growing number of biobanks becoming available, this method can prove useful, especially with its ability to handle out-of-memory datasets.

For some traits, there are too few cases. However, even with a small number of cases, snpnet-Cox is able to select variables that are associated with disease, albeit with less confidence (as evidenced by the replication cohort's inability to confirm all the PHS selected in the discovery cohort).

Genetics could become part of precision medicine, adding to the traditional risk factors for disease. Using PHS for individuals who receive a cardiometabolic diagnosis early in life could potentially help. Previously, other studies have computed polygenic scores from GWAS and tested them in a time-to-event setting [39,40], whereas here we compute the PHS directly in the time-to-event setting.

In the usual GWAS setting, cases are defined as individuals who have the disease or trait of interest at any recorded time point. However, taking into account the time of diagnosis and, in this case, the time for disease to progress is relevant, since this time is highly variable between patients. With more biobanks able to link data to electronic health records, it will become increasingly important to develop methods and statistical approaches that can take advantage of the longitudinal nature of these observations. This could potentially facilitate improved prediction and drug development, and thereby better precision medicine [41].
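The contrast between an ever/never case definition and a time-to-event outcome can be sketched as follows. The function name, the choice of days as the time unit, and the exclusion rule for second events recorded on or before the first diagnosis are assumptions for illustration and are not taken from the study's phenotype definitions.

```python
from datetime import date
from typing import Optional, Tuple


def progression_outcome(
    first_dx: date,
    second_event: Optional[date],
    end_of_followup: date,
) -> Optional[Tuple[int, int]]:
    """Return (days_at_risk, event_indicator) for one individual.

    event_indicator is 1 if the second qualifying event was observed during
    follow-up and 0 if the individual was censored at end_of_followup.
    Returns None when the second event is not after the first diagnosis
    (assumed here to be excluded from a progression analysis).
    """
    if second_event is not None and second_event <= first_dx:
        return None
    if second_event is not None and second_event <= end_of_followup:
        return (second_event - first_dx).days, 1
    return (end_of_followup - first_dx).days, 0


# Hypothetical individuals: one progresses after two years, one is censored.
print(progression_outcome(date(2010, 1, 1), date(2012, 1, 1), date(2020, 1, 1)))
print(progression_outcome(date(2010, 1, 1), None, date(2010, 1, 31)))
```

A binary case definition would treat two individuals who progress after one year and after fifteen years identically, whereas the (time, event) pair retains that difference and also keeps censored individuals in the analysis.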

In conclusion, we applied the batch-screening iterative Lasso (BASIL) algorithm to find the lasso path of Cox proportional hazards models and study disease progression using the genotype-phenotype dataset from UK Biobank. We selected 7 common complex cardiometabolic traits to identify genetic variants that are associated with time-to-event disease and surgical outcomes. We found that 10 of the computed PHS predict cardiometabolic disease progression. The genetics of disease status and disease progression overlap. However, we do identify putative genetic markers potentially important for disease progression, such as the variant downstream of LMO1 that protects against progression from hyperlipidemia to CABG surgery. Future studies are needed to illuminate the biological mechanism of this gene.

Supplementary Material

Supplement 1

Acknowledgements and funding sources

This research has been conducted using the UK Biobank Resource under Application Number 24983, “Generating effective therapeutic hypotheses from genomic and hospital linkage data” (http://www.ukbiobank.ac.uk/wp-content/uploads/2017/06/24983-Dr-Manuel-Rivas.pdf). We thank all of the participants in the UK Biobank study. This work was supported by the National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) under award R01HG010140. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Some of the computing for this project was performed on the Sherlock cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results.

J.M.J. is supported by grant NNF17OC0025806 from the Novo Nordisk Foundation and the Stanford Bio-X Program. J.W.K. is supported by NIH grants U41HG009649, R01 DK116750, R01 DK120565, and P30DK116074. M.A.R. is supported by Stanford University and a National Institutes of Health Center for Multi- and Trans-ethnic Mapping of Mendelian and Complex Diseases grant (5U01 HG009080).

Footnotes

Disclosures

MAR is a co-Founder of Broadwing Bio.

References

  • 1. Townsend N, Nichols M, Scarborough P. Cardiovascular disease in Europe—epidemiological update 2015. Eur Heart J 2015.
  • 2. Mathers C, World Health Organization. The Global Burden of Disease: 2004 Update. World Health Organization; 2008.
  • 3. Cerezo M, Sollis E, Ji Y, Lewis E, Abid A, Bircan KO, et al. The NHGRI-EBI GWAS Catalog: standards for reusability, sustainability and diversity. Nucleic Acids Res 2025;53:D998–D1005.
  • 4. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 2015;12:e1001779.
  • 5. Qian J, Tanigawa Y, Du W, Aguirre M, Chang C, Tibshirani R, et al. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genet 2020;16:e1009141.
  • 6. Tibshirani R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 1996;58:267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x.
  • 7. Li R, Chang C, Justesen JM, Tanigawa Y, Qian J, Hastie T, et al. Fast Lasso method for Large-scale and Ultrahigh-dimensional Cox Model with applications to UK Biobank n.d. doi: 10.1101/2020.01.20.913194.
  • 8. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 2018;562:203–209.
  • 9. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 2015;4:7.
  • 10. Venkataraman GR, Olivieri JE, DeBoever C, Tanigawa Y, Justesen JM, Dilthey A, et al. Pervasive additive and non-additive effects within the HLA region contribute to disease risk in the UK Biobank. Cold Spring Harbor Laboratory 2020:2020.05.28.119669. doi: 10.1101/2020.05.28.119669.
  • 11. Aguirre M, Rivas MA, Priest J. Phenome-wide Burden of Copy-Number Variation in the UK Biobank. Am J Hum Genet 2019;105:373–383.
  • 12. Sinnott-Armstrong N, Tanigawa Y, Amar D, Mars N, Benner C, Aguirre M, et al. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat Genet 2021;53:185–194.
  • 13. Harrell FE Jr, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA 1982;247:2543–2546.
  • 14. Deconinck OG, Sharman JE, Bishop W, Lees CF, Dare L, Hardikar A, et al. Familial Hypercholesterolemia and Cardiovascular Outcomes Amongst Younger Patients Undergoing Coronary Bypass Surgery. Heart Lung Circ 2025;34:77–83.
  • 15. Noncommunicable diseases: Risk factors and conditions n.d. https://www.who.int/data/gho/data/themes/topics/topic-details/GHO/ncd-risk-factors (accessed February 1, 2025).
  • 16. Pirillo A, Casula M, Olmastroni E, Norata GD, Catapano AL. Global epidemiology of dyslipidaemias. Nat Rev Cardiol 2021;18:689–700.
  • 17. Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, et al. Comparison of Sociodemographic and Health-Related Characteristics of UK Biobank Participants With Those of the General Population. Am J Epidemiol 2017;186:1026–1034.
  • 18. Dai S, Yang Q, Yuan K, Loustalot F, Fang J, Daniels SR, et al. Non-high-density lipoprotein cholesterol: distribution and prevalence of high serum levels in children and adolescents: United States National Health and Nutrition Examination Surveys, 2005–2010. J Pediatr 2014;164:247–253.
  • 19. Wilson DP, Jacobson TA, Jones PH, Koschinsky ML, McNeal CJ, Nordestgaard BG, et al. Use of Lipoprotein(a) in clinical practice: A biomarker whose time has come. A scientific statement from the National Lipid Association. J Clin Lipidol 2019;13:374–392.
  • 20. Myocardial Infarction Genetics Consortium, Kathiresan S, Voight BF, Purcell S, Musunuru K, Ardissino D, et al. Genome-wide association of early-onset myocardial infarction with single nucleotide polymorphisms and copy number variants. Nat Genet 2009;41:334–341.
  • 21. McPherson R, Pertsemlidis A, Kavaslar N, Stewart A, Roberts R, Cox DR, et al. A common allele on chromosome 9 associated with coronary heart disease. Science 2007;316:1488–1491.
  • 22. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007;447:661–678.
  • 23. Helgadottir A, Thorleifsson G, Manolescu A, Gretarsdottir S, Blondal T, Jonasdottir A, et al. A common variant on chromosome 9p21 affects the risk of myocardial infarction. Science 2007;316:1491–1493.
  • 24. Nikpay M, Goel A, Won H-H, Hall LM, Willenborg C, Kanoni S, et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet 2015;47:1121–1130.
  • 25. Nelson CP, Goel A, Butterworth AS, Kanoni S, Webb TR, Marouli E, et al. Association analyses based on false discovery rate implicate new loci for coronary artery disease. Nat Genet 2017;49:1385–1391.
  • 26. Emdin CA, Khera AV, Chaffin M, Klarin D, Natarajan P, Aragam K, et al. Analysis of predicted loss-of-function variants in UK Biobank identifies variants protective for disease. Nat Commun 2018;9:1613.
  • 27. Oram SH, Thoms J, Sive JI, Calero-Nieto FJ, Kinston SJ, Schütte J, et al. Bivalent promoter marks and a latent enhancer may prime the leukaemia oncogene LMO1 for ectopic expression in T-cell leukaemia. Leukemia 2013;27:1348–1357.
  • 28. Surendran P, Drenos F, Young R, Warren H, Cook JP, Manning AK, et al. Trans-ancestry meta-analyses identify rare and common variants associated with blood pressure and hypertension. Nat Genet 2016;48:1151–1161.
  • 29. Manning AK, Hivert M-F, Scott RA, Grimsby JL, Bouatia-Naji N, Chen H, et al. A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Nat Genet 2012;44:659–669.
  • 30. Schlauch KA, Read RW, Lombardi VC, Elhanan G, Metcalf WJ, Slonim AD, et al. A Comprehensive Genome-Wide and Phenome-Wide Examination of BMI and Obesity in a Northern Nevadan Cohort. G3 2020;10:645–664.
  • 31. Warrington NM, Beaumont RN, Horikoshi M, Day FR, Helgeland Ø, Laurin C, et al. Maternal and fetal genetic effects on birth weight and their relevance to cardio-metabolic risk factors. Nat Genet 2019;51:804–814.
  • 32. Hattersley AT. Prime suspect: the TCF7L2 gene and type 2 diabetes risk. J Clin Invest 2007;117:2077–2079.
  • 33. Moltke I, Grarup N, Jørgensen ME, Bjerregaard P, Treebak JT, Fumagalli M, et al. A common Greenlandic TBC1D4 variant confers muscle insulin resistance and type 2 diabetes. Nature 2014;512:190–193.
  • 34. McInnes G, Tanigawa Y, DeBoever C, Lavertu A, Olivieri JE, Aguirre M, et al. Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics. Bioinformatics 2019;35:2495–2497.
  • 35. Tragante V, Doevendans PAFM, Nathoe HM, van der Graaf Y, Spiering W, Algra A, et al. The impact of susceptibility loci for coronary artery disease on other vascular domains and recurrence risk. Eur Heart J 2013;34:2896–2904.
  • 36. Christiansen MK, Nyegaard M, Larsen SB, Grove EL, Würtz M, Neergaard-Petersen S, et al. A genetic risk score predicts cardiovascular events in patients with stable coronary artery disease. Int J Cardiol 2017;241:411–416.
  • 37. Vaara S, Tikkanen E, Parkkonen O, Lokki M-L, Ripatti S, Perola M, et al. Genetic Risk Scores Predict Recurrence of Acute Coronary Syndrome. Circ Cardiovasc Genet 2016;9:172–178.
  • 38. Weijmans M, de Bakker PIW, van der Graaf Y, Asselbergs FW, Algra A, Jan de Borst G, et al. Incremental value of a genetic risk score for the prediction of new vascular events in patients with clinically manifest vascular disease. Atherosclerosis 2015;239:451–458.
  • 39. Sun L, Pennells L, Kaptoge S, Nelson CP, Ritchie SC, Abraham G, et al. Polygenic risk scores in cardiovascular risk prediction: A cohort study and modelling analyses. PLOS Medicine 2021;18:e1003498.
  • 40. A multilocus genetic risk score for coronary heart disease: case-control and prospective cohort analyses. The Lancet 2010;376:1393–1400.
  • 41. van Zuydam NR, Ladenvall C, Voight BF, Strawbridge RJ, Fernandez-Tajes J, William Rayner N, et al. Genetic Predisposition to Coronary Artery Disease in Type 2 Diabetes Mellitus. Circulation: Genomic and Precision Medicine 2020. doi: 10.1161/CIRCGEN.119.002769.


Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
