Biostatistics (Oxford, England). 2020 Sep 29;23(2):522–540. doi: 10.1093/biostatistics/kxaa038

Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank

Ruilin Li 1, Christopher Chang 2, Johanne M Justesen 3, Yosuke Tanigawa 3, Junyang Qian 3, Trevor Hastie 3, Manuel A Rivas 3, Robert Tibshirani 3
PMCID: PMC9007437  PMID: 32989444

Summary

We develop a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the $\ell_1$-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do not fit in memory. The output of our algorithm is the full Lasso path: the parameter estimates at all predefined regularization parameters, as well as their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.

Keywords: Concordance index, Cox proportional hazard model, LASSO, Time-to-event data, UK Biobank

1. Introduction

Survival analysis involves predicting time-to-event, such as the survival time of a patient, from a set of features of the subject, as well as identifying the features that are most relevant to the time-to-event. The Cox proportional hazard model (Cox, 1972) provides a flexible mathematical framework to describe the relationship between the survival time and the features, allowing a time-dependent baseline hazard. Survival analysis faces computational and statistical challenges when the predictors are ultrahigh-dimensional (the feature dimension is greater than the number of observations) and large scale (the data matrix does not fit in memory). Based on the Batch Screening Iterative Lasso (BASIL), we develop an algorithm to fit a Cox proportional hazard model by maximizing the Lasso partial likelihood function. We apply the method to 306 time-to-event disease outcomes from the UK Biobank combined with genetic data. We generate improved predictive models with sparse solutions, with the number of selected variables ranging from a single active variable for some outcomes to almost 2000 for others. We note that our algorithm can easily be adapted to other applications with arbitrarily large datasets, provided that the Lasso solution is sufficiently sparse.

1.1. Cox proportional hazard model

Given a numerical predictor $x \in \mathbb{R}^p$, the Cox model assumes that there exists a baseline hazard function $h_0(t)$ and a parameter vector $\beta \in \mathbb{R}^p$ such that the hazard function for the survival time has the form:

$$h(t \mid x) = h_0(t)\exp(x^\top \beta). \tag{1.1}$$

Intuitively, the hazard function at time $t$ measures the relative risk of death around time $t$, given that the patient survives up to time $t$. Under the Cox proportional hazard model, the hazard ratio between two subjects with covariates $x_1$ and $x_2$ can be written as:

$$\frac{h(t \mid x_1)}{h(t \mid x_2)} = \exp\big((x_1 - x_2)^\top \beta\big). \tag{1.2}$$

When a covariate $x_j$ is an indicator for a treatment, the hazard ratio can be interpreted as the risk of the event occurring in the treatment group compared to the risk in the control group, and the regression coefficient $\beta_j$ is the log-hazard ratio.

To describe the distribution of the survival time, we can equivalently use its cumulative distribution function:

$$F(t \mid x) = 1 - \exp\left(-\int_0^t h(s \mid x)\,\mathrm{d}s\right) = 1 - \exp\left(-\exp(x^\top \beta)\int_0^t h_0(s)\,\mathrm{d}s\right). \tag{1.3}$$

In practice, it is often the case that the survival time is right-censored; that is, the event has not yet happened at the time the data were collected. Therefore, for the $i$th individual we observe a tuple $(x_i, y_i, \delta_i)$, where $x_i \in \mathbb{R}^p$ is the vector of predictors and $\delta_i \in \{0, 1\}$ is the event indicator. If $\delta_i = 1$, then $y_i$ is the actual survival time of the $i$th individual. If $\delta_i = 0$, then we only know that the true survival time of the $i$th individual is longer than $y_i$. Throughout this article, we assume that the censoring is non-informative, meaning that the time of censoring is independent of the (possibly unobserved) event time conditional on $x_i$.

One advantage of the Cox model is that, while it is a semi-parametric model (the baseline hazard is non-parametric), we can still estimate the parameter $\beta$ without estimating the baseline function. This is achieved by choosing $\beta$ to maximize the log-partial likelihood function:

$$f(\beta) = \sum_{i:\,\delta_i = 1}\left[x_i^\top \beta - \log\left(\sum_{j:\,y_j \ge y_i}\exp(x_j^\top \beta)\right)\right]. \tag{1.4}$$

We use the C-index (concordance index) to evaluate a fitted $\hat\beta$:

$$C(\hat\beta) = \frac{\sum_{i:\,\delta_i = 1}\ \sum_{j:\,y_j > y_i}\left[\mathbf{1}\{x_j^\top\hat\beta < x_i^\top\hat\beta\} + \tfrac{1}{2}\,\mathbf{1}\{x_j^\top\hat\beta = x_i^\top\hat\beta\}\right]}{\sum_{i:\,\delta_i = 1}\ \sum_{j:\,y_j > y_i} 1}. \tag{1.5}$$

The C-index is the proportion of comparable pairs for which the fitted $\hat\beta$ predicts the correct order of the events, with each tie in the predictions counted as half of a correct prediction. For a more complete description of the C-index, see Harrell and others (1982) and Li and Tibshirani (2019).
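As a point of reference for the faster algorithm developed in Section 2.4, the following is a minimal sketch that evaluates (1.5) directly as a double loop over comparable pairs; the function and argument names are illustrative and not part of the released package.

```python
import numpy as np

def c_index_naive(eta, y, delta):
    """Direct O(n^2) evaluation of (1.5).
    eta = X @ beta_hat (predicted risk scores), y = observed times,
    delta = event indicators (1 = event, 0 = censored)."""
    eta, y, delta = map(np.asarray, (eta, y, delta))
    num = den = 0.0
    for i in np.flatnonzero(delta == 1):
        comparable = y > y[i]                              # individuals with a later observed time
        den += comparable.sum()
        num += (eta[comparable] < eta[i]).sum()            # correctly ordered pairs
        num += 0.5 * (eta[comparable] == eta[i]).sum()     # tied predictions count one half
    return num / den
```

Its $O(n^2)$ cost at biobank-scale $n$ is what motivates the faster implementation described in Section 2.4.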

1.2. Computational challenges in large-scale and high-dimensional survival analysis

In today’s applications, it is common to have datasets with millions of observations and variables. For example, the UK Biobank dataset (Sudlow and others, 2015) contains millions of genetic variants for over 500 000 individuals. Loading this data matrix into R requires terabytes of memory, far exceeding the RAM of a typical machine. While memory-mapping techniques allow users to perform computation on data outside of RAM (out-of-core computation) relatively easily (Kane and others, 2013), popular optimization algorithms require repeatedly computing matrix-vector multiplications involving the entire data matrix, resulting in slow overall speed.

The Lasso (Tibshirani, 1996) is an effective tool for high-dimensional variable selection and prediction. R packages such as glmnet (Friedman and others, 2010), penalized (Goeman, 2010), coxpath (Park and Hastie, 2007), and glcoxph (Sohn and others, 2009) solve the Lasso Cox regression problem using various strategies. However, all of these packages require loading the entire data matrix into memory, which is infeasible for Biobank-scale data. To the best of the authors’ knowledge, our method is the first to solve $\ell_1$-regularized Cox regression with larger-than-memory data. On the other hand, most optimization strategies used in these packages can also be incorporated in the fitting step of our algorithm. In particular, snpnet-Cox uses the cyclical coordinate descent implemented in glmnet.

Even if these packages did support out-of-core computation, using them directly would be computationally inefficient. To be more concrete, in one of our simulation studies on the UK Biobank data, the training data take about 2 TB. With the highly optimized out-of-core matrix-vector multiplication that PLINK2 provides, we are able to run a single such operation in about 2–3 min. Without variable screening, cyclic coordinate descent (or proximal gradient descent) would require from a few to tens or even hundreds of such matrix-vector multiplications for a single $\lambda$ value. Our algorithm exploits the sparsity structure of the problem to reduce the frequency of this operation to mostly once or twice for several $\lambda$ values. Most of these expensive, out-of-core matrix-vector multiplications are replaced with fast, in-memory ones that work on much smaller subsets of the data.

2. Methods

2.1. Preliminaries

We first introduce the following notation:

  • Let $n$ and $p$ be the number of observations and the number of features, respectively. Let $X \in \mathbb{R}^{n \times p}$ be the matrix of predictors. To simplify notation, we use $X$ for the training, test, and validation sets alike; whether $X$ comes from the training, test, or validation data can be inferred from the context.

  • Let $x_i \in \mathbb{R}^p$ be the $i$th row of $X$.

  • Let $X_j \in \mathbb{R}^n$ be the $j$th column of $X$.

  • Denote the log-partial likelihood function by $f(\beta)$. That is,
    $$f(\beta) = \sum_{i:\,\delta_i = 1}\left[x_i^\top \beta - \log\left(\sum_{j:\,y_j \ge y_i}\exp(x_j^\top \beta)\right)\right]. \tag{2.6}$$
  • Denote the set $[m] = \{1, 2, \ldots, m\}$ for a positive integer $m$.

We focus on survival analysis in the high-dimensional regime, where the number of predictors is greater than the number of observations ($p > n$), although the same procedure can easily be applied to low-dimensional cases. We use the Lasso to perform variable selection and estimation at the same time. In particular, we optimize the $\ell_1$-regularized log-partial likelihood:

$$\hat\beta(\lambda) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p}\; -f(\beta) + \lambda\|\beta\|_1, \tag{2.7}$$

where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$. More generally, we allow each parameter and each observation to have a different weight in the objective function on the right-hand side of (2.7). In particular, given a vector of penalty factors $v \in \mathbb{R}_{\ge 0}^p$ and observation weights $c \in \mathbb{R}_{\ge 0}^n$, we define the general objective function to be

$$\hat\beta(\lambda) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p}\; -\sum_{i:\,\delta_i = 1} c_i\left[x_i^\top \beta - \log\left(\sum_{j:\,y_j \ge y_i} c_j \exp(x_j^\top \beta)\right)\right] + \lambda\sum_{j=1}^p v_j |\beta_j|. \tag{2.8}$$

This can be particularly useful when there are genetic variants that we would like to up-weight during variable selection, e.g., coding variants in a region of perfect linkage disequilibrium. To simplify the notation, we describe our algorithm assuming that the parameters and the observations are unweighted.

2.2. Hyperparameter selection

To find the optimal hyperparameter $\lambda$, we start with a sequence of $L$ candidate regularization parameters $\lambda_1 > \lambda_2 > \cdots > \lambda_L$ and compute the corresponding parameter estimates as well as the validation metric. The optimal regularization parameter $\lambda^\star$ is then selected as the $\lambda$ that maximizes the validation metric, and $\hat\beta$ is set to $\hat\beta(\lambda^\star)$. The sequence of regularization parameters can be chosen by setting $\lambda_1$ to be sufficiently large so that the optimal $\hat\beta(\lambda_1)$ is exactly zero, and taking $L$ values of $\lambda$ equally spaced on the log scale.
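For illustration, a minimal sketch of this construction is shown below; the number of values $L$ and the ratio between the smallest and largest $\lambda$ are illustrative defaults, not those of the package.

```python
import numpy as np

def lambda_sequence(lambda_max, L=100, ratio=0.01):
    """L candidate regularization parameters, equally spaced on the log scale,
    decreasing from lambda_max to ratio * lambda_max."""
    return np.exp(np.linspace(np.log(lambda_max), np.log(ratio * lambda_max), L))
```

Here `lambda_max` is the smallest value at which all penalized coefficients are zero; by the KKT condition below, it can be read off the gradient of the log-partial likelihood at $\beta = 0$, which is computed anyway in the first screening step of Section 2.3.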

Applying this procedure naively requires solving $L$ optimization problems, each of which reads the entire predictor matrix $X$. To effectively reduce the number of computations involving the entire data matrix, we exploit the sparsity of the Lasso solution. The key components of our algorithm that significantly speed up the computation are the following observations, adapted from Qian and others (2019).

2.3. Batch screening procedure

The Karush–Kuhn–Tucker (KKT) condition of (2.7) indicates that the optimal $\hat\beta(\lambda)$ must satisfy:

$$|\nabla_j f(\hat\beta(\lambda))| \le \lambda \ \text{ for all } j \text{ with } \hat\beta_j(\lambda) = 0, \qquad \nabla_j f(\hat\beta(\lambda)) = \lambda\,\operatorname{sign}(\hat\beta_j(\lambda)) \ \text{ otherwise}. \tag{2.9}$$

When $\lambda$ is sufficiently large, $\hat\beta(\lambda)$ is sparse, so our strategy is to solve the optimization problem (2.7) using only a small subset of the features, assuming all the others have coefficient zero. Then, we verify that the solution satisfies the KKT condition (2.9). We apply this strategy iteratively for $\lambda_1 > \lambda_2 > \cdots > \lambda_L$ to obtain the entire Lasso path. To determine which predictors to include in the model, we adopt the screening rules used in BASIL, which are inspired by the strong rules proposed in Tibshirani and others (2012). In the Cox model, the strong rules assume $\hat\beta_j(\lambda_k) = 0$ (discard the $j$th predictor when fitting $\hat\beta(\lambda_k)$) if

$$|\nabla_j f(\hat\beta(\lambda_{k-1}))| < 2\lambda_k - \lambda_{k-1}. \tag{2.10}$$

By convention we set $\lambda_0 = \lambda_{\max}$, the smallest $\lambda$ for which $\hat\beta(\lambda) = 0$. Although it is possible for the strong rules to fail, this rarely happens when $p > n$.
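Both rules reduce to simple thresholding of the gradient vector. A minimal sketch, assuming the full gradient has already been computed (the function names are illustrative):

```python
import numpy as np

def kkt_violations(grad, beta, lam, tol=1e-7):
    """Indices j that violate the KKT condition (2.9): beta_j = 0 yet |grad_j| > lambda."""
    return np.flatnonzero((beta == 0) & (np.abs(grad) > lam + tol))

def strong_set(grad_prev, lam_k, lam_prev):
    """Sequential strong rule (2.10): keep variable j for lambda_k only if
    |grad_j| at the previous solution is at least 2 * lambda_k - lambda_prev."""
    return np.flatnonzero(np.abs(grad_prev) >= 2 * lam_k - lam_prev)
```

The equality part of (2.9) for the active coordinates is enforced (up to tolerance) by the inner solver, so in practice only the inactive coordinates need to be checked.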

Before we describe the full algorithm, we first write the gradient of the log-partial likelihood in a simple form. The gradient of the log-partial likelihood function can be written as:

$$\nabla f(\beta) = \sum_{i:\,\delta_i = 1}\left[x_i - \frac{\sum_{j:\,y_j \ge y_i}\exp(x_j^\top \beta)\,x_j}{\sum_{j:\,y_j \ge y_i}\exp(x_j^\top \beta)}\right]. \tag{2.11}$$

Let $w(\beta) \in \mathbb{R}^n$ be defined componentwise as

$$w_k(\beta) = \delta_k - \exp(x_k^\top \beta)\sum_{i:\,\delta_i = 1,\ y_i \le y_k}\frac{1}{\sum_{j:\,y_j \ge y_i}\exp(x_j^\top \beta)}. \tag{2.12}$$

Then by direct computation one can show that

$$\nabla f(\beta) = X^\top w(\beta). \tag{2.13}$$

Here, $w_k(\beta)$ is the martingale residual when the baseline survival function comes from a Breslow estimator (Barlow and Prentice, 1988; Breslow, 1974), and it can be computed using only the variables currently in the model. However, computing $w(\beta)$ directly from its definition requires summing over the risk set for each $k$ (the denominator in the second term), which in total takes $O(n^2)$ operations. Our solution is to first sort the individuals by $y_i$, so that $y_1 \le y_2 \le \cdots \le y_n$. Then $w_k(\beta)$ can be obtained for all $k$ in $O(n)$ using cumulative sums (a short numerical sketch of this computation is given at the end of this subsection). The only expensive part is then the matrix multiplication in (2.13), which involves the large data matrix $X$. Our full algorithm follows the same structure as BASIL (Qian and others, 2019): at each iteration we look for Lasso solutions at multiple consecutive $\lambda$ values on the path, so that the large dataset is not read too frequently. Suppose the solutions $\hat\beta(\lambda_1), \ldots, \hat\beta(\lambda_k)$ have been computed in the first $\ell$ iterations and the KKT condition for these solutions has been verified. The $(\ell+1)$th iteration has the following parts:

  1. (Screening) In the last iteration, we obtained $w(\hat\beta(\lambda_k))$ as defined in (2.12), which we use to compute the full gradient through $\nabla f(\hat\beta(\lambda_k)) = X^\top w(\hat\beta(\lambda_k))$. We include two types of variables in the fitting step of this iteration:
    • The set of variables $j$ with $\hat\beta_j(\lambda_{k'}) \neq 0$ for some $k' \le k$. We call this the ever-active set at iteration $\ell+1$, denoted $\mathcal{A}_{\ell+1}$.
    • The top $M$ variables with the largest partial-derivative magnitudes $|\nabla_j f(\hat\beta(\lambda_k))|$ among those that are not ever-active, where $M$ is a predefined batch size.

    The union of these two types of variables is denoted $\mathcal{S}_{\ell+1}$, and we use this set of variables in the fitting step.

  2. (Fitting) In this step, we solve problem (2.7) using only the variables in $\mathcal{S}_{\ell+1}$, for the next few $\lambda$ values (a predefined number per iteration) starting at $\lambda_{k+1}$. The set of $\lambda$ values used in this iteration is denoted $\Lambda_{\ell+1}$. The solutions are obtained through coordinate descent, with iterates initialized from the most recent Lasso solution (warm start). The coefficients of the variables not in $\mathcal{S}_{\ell+1}$ are set to $0$.

  3. (Checking) In this step, we verify that the solutions obtained in the fitting step under the assumption $\hat\beta_j = 0$ for $j \notin \mathcal{S}_{\ell+1}$ are indeed valid. To do this, for each $\lambda \in \Lambda_{\ell+1}$ we obtain the martingale residuals $w(\hat\beta(\lambda))$ as in (2.12), compute the gradient $\nabla f(\hat\beta(\lambda)) = X^\top w(\hat\beta(\lambda))$, and verify the KKT condition (2.9). If $\hat\beta(\lambda)$ satisfies the KKT condition, then it is a valid solution. Otherwise, we go back to the screening step and continue from the largest $\lambda$ for which the KKT condition fails.

  4. (Early stopping) When the regularization parameter $\lambda$ becomes small, the model tends to overfit and the solution we obtain becomes unstable. We keep a separate validation set to determine the optimal hyperparameter. For each $\lambda \in \Lambda_{\ell+1}$, if $\hat\beta(\lambda)$ satisfies the KKT condition, we evaluate its validation C-index. Once the validation C-index starts to decrease, we stop computing solutions for smaller $\lambda$ values. A naive implementation of the C-index requires comparing $O(n^2)$ pairs of individuals in the study, which is not scalable. In the next section, we describe an $O(n \log n)$ C-index algorithm.

We emphasize that, in each iteration, only one matrix multiplication involving the entire dataset $X$ is needed (except the first iteration, which requires an additional multiplication to obtain the gradient at $\beta = 0$ for the initial screening). Algorithm 1 summarizes this procedure; the early stopping step is omitted to save space.
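The screening and checking steps both rely on the gradient identity (2.13). The following is a minimal numerical sketch of the $O(n)$ residual computation, not the package's implementation; it assumes `eta = X @ beta` is available and gives ties in the observed times no special treatment.

```python
import numpy as np

def martingale_residuals(eta, y, delta):
    """Breslow-type martingale residuals w(beta) of (2.12), computed in O(n)
    after an O(n log n) sort of the observed times."""
    order = np.argsort(y)                          # sort individuals by observed time
    eta_s, delta_s = eta[order], delta[order]
    exp_eta = np.exp(eta_s)
    risk = np.cumsum(exp_eta[::-1])[::-1]          # sum_{j: y_j >= y_i} exp(eta_j)
    cum_hazard = np.cumsum(delta_s / risk)         # Breslow cumulative baseline hazard
    w_sorted = delta_s - exp_eta * cum_hazard
    w = np.empty_like(w_sorted)
    w[order] = w_sorted                            # undo the sort
    return w
```

The full gradient (2.13) is then `X.T @ martingale_residuals(X @ beta, y, delta)`, which is the only step in each iteration that touches the entire data matrix.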

Algorithm 1: BASIL for the Cox model
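To make the structure of Algorithm 1 concrete, the following schematic sketch strings the screening, fitting, and checking steps together. The inner solver `fit_cox_lasso` is a hypothetical placeholder for an in-memory Lasso Cox routine (for example, glmnet-style coordinate descent), the batch size and the number of $\lambda$ values per iteration are illustrative rather than the package defaults, and early stopping is omitted as in Algorithm 1; `martingale_residuals` refers to the sketch above.

```python
import numpy as np

def basil_cox_path(X, y, delta, lambdas, fit_cox_lasso,
                   batch_size=1000, lams_per_iter=10):
    """Schematic BASIL loop for the Cox Lasso path (a sketch of Algorithm 1).
    fit_cox_lasso(X_sub, y, delta, lams, warm) is a hypothetical in-memory solver
    returning one coefficient vector (over the columns of X_sub) per lambda."""
    n, p = X.shape
    betas = {}
    ever_active = np.zeros(p, dtype=bool)
    grad = X.T @ martingale_residuals(np.zeros(n), y, delta)   # gradient at beta = 0
    k = 0
    while k < len(lambdas):
        # Screening: ever-active variables plus the top batch by |gradient|
        score = np.where(ever_active, -np.inf, np.abs(grad))
        strong = np.union1d(np.flatnonzero(ever_active),
                            np.argsort(-score)[:batch_size])
        # Fitting: solve (2.7) restricted to the strong set, warm-started
        lams = lambdas[k:k + lams_per_iter]
        warm = betas[k - 1][strong] if k > 0 else None
        sub_solutions = fit_cox_lasso(X[:, strong], y, delta, lams, warm)
        # Checking: accept a solution only if the KKT condition (2.9) holds
        for lam, b_sub in zip(lams, sub_solutions):
            beta = np.zeros(p)
            beta[strong] = b_sub
            grad = X.T @ martingale_residuals(X @ beta, y, delta)
            if np.any((beta == 0) & (np.abs(grad) > lam + 1e-7)):
                break              # violation: rescreen starting at this lambda
            betas[k] = beta
            ever_active |= beta != 0
            k += 1
    return betas
```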

2.4. Fast C-index computation

Several frequently used C-index implementations, including the first algorithm we tried, have time complexity $O(n^2)$. As population-scale cohorts, like UK Biobank, Million Veterans Program, and FinnGen, aggregate time-to-event data for survival analysis, it is increasingly important to consider the computational cost of statistics such as the C-index when building and evaluating predictive models. The time-to-event data used in survival analysis include the age of disease onset and the progression from disease diagnosis to another, more severe outcome such as surgery or death. Here, we present an implementation with $O(n \log n)$ time complexity and $O(n)$ space complexity that achieves an over 10 000$\times$ speedup for biobank-scale data relative to several R packages, and an over 10$\times$ speedup compared to the existing $O(n \log n)$ algorithm implemented in the survival package (Therneau and Lumley, 2014).

We first assume that there are no tied predictions or tied event times. We evaluate the C-index of the estimator $\hat\beta$ on the data $\{(x_i, y_i, \delta_i)\}_{i=1}^n$ through the following steps:

  1. First relabel the data in increasing order of $y_i$; this takes $O(n \log n)$. After relabeling, we have
    $$y_1 < y_2 < \cdots < y_n.$$
    Define
    $$m_i = \#\{j : y_j > y_i\} = n - i.$$

    $m_i$ is the size of the risk set immediately after $y_i$. Clearly, computing all the $m_i$ takes $O(n)$.

  2. Define
    $$r_i = \#\{j \in [n] : x_j^\top\hat\beta \le x_i^\top\hat\beta\}.$$

    That is, $r_i$ is the number of individuals that $\hat\beta$ predicts to have lower or equal risk of the event than individual $i$. We have $1 \le r_i \le n$ for all $i$, and $r_i = r_j$ is equivalent to $x_i^\top\hat\beta = x_j^\top\hat\beta$. The $r_i$ can be computed in linear time after sorting $X\hat\beta$ (here we assume $X\hat\beta$ has already been computed and is given as an input to the C-index function). The total time complexity of this step is $O(n \log n)$.

  3. Using the above definitions, the C-index (1.5) can be equivalently written as
    $$C(\hat\beta) = \frac{\sum_{i:\,\delta_i = 1} \#\{j > i : r_j < r_i\}}{\sum_{i:\,\delta_i = 1} m_i}.$$

    The denominator clearly can be computed in linear time. In the next steps, we focus on computing the numerator.

  4. This step is the key part of our algorithm. For each $i$, define a binary vector (bitarray) $b^{(i)} \in \{0,1\}^n$ by setting, for each $k \in [n]$:
    $$b^{(i)}_k = \begin{cases} 1 & \text{if } k = r_j \text{ for some } j > i, \\ 0 & \text{otherwise.}\end{cases}$$
    $b^{(i)}$ is well defined since the $r_j$ are distinct when there are no tied predictions. In addition, it has two nice properties:
    • (a)
      $$\#\{j > i : r_j < r_i\} = \sum_{k=1}^{r_i - 1} b^{(i)}_k, \tag{2.14}$$
      where the summation on the right-hand side is computed through an array popcount on the bitarray $b^{(i)}$.
    • (b) We can update $b^{(i)}$ from $b^{(i+1)}$ simply by setting $b^{(i)}_{r_{i+1}}$ from $0$ to $1$.

    In our implementation, we represent these binary vectors as bitarrays. Bitarrays are compact and very efficient to work with. (The exact arithmetic and bitwise operations we used were primarily informed by Knuth (2011) and Muła and others (2018).) However, we need to perform $O(n)$ array-popcount operations, so the top-level algorithm is still $O(n^2)$ if each popcount takes $O(n)$ time. Here, we provide a high-level description of how we bring each array-popcount operation down to $O(\log n)$. To simplify the discussion, we assume $n$ is an integer power of $2$.

    For each $i$, we define a binary tree $T^{(i)}$ with $n$ leaves, each at distance $\log_2 n$ from the root. At the $d$th level of $T^{(i)}$ there are $2^d$ nodes, and the $k$th node among them stores the sum
    $$\sum_{m = (k-1)n/2^d + 1}^{\,k n/2^d} b^{(i)}_m.$$

    For example, the root of $T^{(i)}$ stores the sum of all entries of $b^{(i)}$, the left child of the root stores the sum of the first half of $b^{(i)}$, and its left child stores the sum of the first quarter of $b^{(i)}$. The $k$th leaf of $T^{(i)}$ is exactly $b^{(i)}_k$.

    With $T^{(i)}$, computing (2.14) can be done in the same time complexity as traversing from the root of $T^{(i)}$ to the $r_i$th leaf. Updating $T^{(i+1)}$ to $T^{(i)}$ can be done by setting $b^{(i)}_{r_{i+1}}$ from $0$ to $1$ and traversing back to the root, updating the stored sums along the way. Both operations are $O(\log n)$. We describe them with the pseudocode in Algorithm 2.

    In our implementation, each internal node in these trees has more than two children (rather than two) to better utilize the memory hierarchy. We do not actually build the tree data structures, but use them as a concept to describe our algorithm. In the package, these trees are represented by a stack of arrays, and accessing a node's children, its parent, or a particular leaf takes $O(1)$.

  5. When there are tied predictions, we keep the definitions from steps 1–3. The count in step 4 then misses $\tfrac{1}{2}$ times the number of tied predictions at $r_i$. If, for some $i$, the bit at position $r_{i+1}$ is already flipped before the update in step 4 is performed, then we know there is a tie at $r_{i+1}$, and the distance from $r_{i+1}$ to the next unflipped bit is the number of ties already seen, so we can adjust the count accordingly. The tie-heavy version of the function maintains an extra table listing the number of times each $r$ value has been seen; by looking up that table, it can immediately find the first unflipped bit instead of performing a potentially $O(n)$ scan. (A compact end-to-end sketch of the whole procedure follows Algorithm 2 below.)

Algorithm 2: $O(\log n)$ array popcount and tree update algorithm to compute (2.14)
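For reference, the following compact sketch implements the same $O(n \log n)$ strategy end to end, substituting a Fenwick (binary indexed) tree for the trees $T^{(i)}$ described above. It is a stand-in for, not a copy of, the released C implementation; tied predictions are counted as one half and tied observed times are handled by grouping.

```python
import numpy as np

class Fenwick:
    """Fenwick (binary indexed) tree: point updates and prefix sums in O(log n)."""
    def __init__(self, n):
        self.t = np.zeros(n + 1)
    def add(self, k, v=1.0):              # add v at 1-based position k
        while k < len(self.t):
            self.t[k] += v
            k += k & (-k)
    def prefix(self, k):                  # sum over positions 1..k
        s = 0.0
        while k > 0:
            s += self.t[k]
            k -= k & (-k)
        return s

def c_index_fast(eta, y, delta):
    """O(n log n) C-index equivalent to (1.5); tied predictions count one half."""
    eta, y, delta = map(np.asarray, (eta, y, delta))
    n = len(y)
    # Step 2: r_i = number of individuals with predicted risk <= that of i
    order = np.argsort(eta, kind="mergesort")
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(1, n + 1)
    for pos in range(n - 2, -1, -1):      # collapse tied predictions onto one rank
        if eta[order[pos]] == eta[order[pos + 1]]:
            rank[order[pos]] = rank[order[pos + 1]]
    tree = Fenwick(n)
    tie_count = np.zeros(n + 1)           # inserted individuals sharing each rank
    num = den = 0.0
    # Sweep observed times in decreasing order so that the tree always holds
    # exactly the ranks of {j : y_j > y_i} when individual i is processed.
    by_time = np.argsort(-y, kind="mergesort")
    pos = 0
    while pos < n:
        end = pos
        while end < n and y[by_time[end]] == y[by_time[pos]]:
            end += 1                      # group of tied observed times
        for idx in by_time[pos:end]:      # query the whole group first
            if delta[idx] == 1:
                den += tree.prefix(n)                  # comparable pairs for i
                num += tree.prefix(rank[idx] - 1)      # strictly lower predicted risk
                num += 0.5 * tie_count[rank[idx]]      # tied predictions
        for idx in by_time[pos:end]:      # then insert the group
            tree.add(rank[idx])
            tie_count[rank[idx]] += 1
        pos = end
    return num / den
```

On the same inputs this agrees with the naive sketch `c_index_naive` from Section 1.1, while scaling to biobank-sized $n$.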

3. Results

3.1. UK Biobank age of diagnosis data preparation

We prepared an age-of-diagnosis dataset from the UK Biobank derived from Category 1712, the category containing data on the “first occurrence” of any code mapped to a 3-character ICD-10 (International Classification of Diseases, 10th Revision) code (see Supplementary material available at Biostatistics online).

Briefly, the data-fields have been generated by mapping the following to 3-character ICD-10 codes: Read code information in the Primary Care data (Category 3000); ICD-9 and ICD-10 codes in the Hospital inpatient data (Category 2000); ICD-10 codes in Death Register records (Field 40001, Field 40002); and Self-reported medical condition codes (Field 20002) reported at the baseline or a subsequent UK Biobank assessment center visit.

For each code, two data-fields are available: the date the code was first recorded across any of the sources listed above, and the source in which the code was first recorded (together with information on whether the code was subsequently recorded in at least one other source).

We used these data to compute an age of diagnosis using the Month of Birth (Data-Field 52) and Year of Birth (Data-Field 34) fields.

3.2. Genetic data preparation

Here, we used genotype data from the UK Biobank dataset release version 2 and the hg19 human genome reference for all analyses in the study. To minimize variability due to population structure in our dataset, we restricted our analyses to unrelated White British individuals, and used sex, array (UK Biobank was genotyped on two different platforms), and 10 principal components derived from the genotype data as covariates (described in detail in the Supplementary material available at Biostatistics online).

Algorithm 3: Proposed C-index algorithm

We focused our analysis on directly genotyped variants passing a minor allele frequency (MAF) threshold in either array, in addition to the human leukocyte antigen alleles (Bycroft and others, 2018) and the copy number variants described in Aguirre and others (2019), for a total of 1.08 million variants.

We split our dataset into a training set, a validation set, and a held-out test set, and applied snpnet-Cox with 50 iterations. We focus our analysis on the 306 ICD10 codes with a sufficient number of cases in this dataset.

3.3. snpnet-Cox results

We summarize the results across the 306 ICD10 codes, and focus our detailed analysis on four of them:

  1. asthma (ICD10 code: J45),

  2. gout (M10),

  3. disorders of porphyrin and bilirubin metabolism (E80), and

  4. atrial fibrillation and flutter (I48).

The Lasso paths for these phenotypes are illustrated in Figure 1, where the estimated individual parameter values are plotted against the $\ell_1$ norm of $\hat\beta$ for a decreasing sequence of $\lambda$. For an individual with genotype $x$, we define the polygenic hazard score (PHS) to be $x^\top\hat\beta$, where $\hat\beta$ is the vector of fitted regression coefficients obtained from snpnet-Cox. We assess the predictive power of the PHS on survival time using the individuals in the held-out test set. We applied several procedures to give a high-level overview of the results. First, we assessed whether the PHS was significantly associated with the time-to-event data in the held-out test set (so that we obtained a $p$-value for each ICD10 code). Second, we computed the hazard ratio (HR) per standard deviation unit of the PHS, as well as for thresholded percentiles of the PHS distribution (top 1%, 5%, 10%, and bottom 10%, each compared to the 40–60 percentile). Third, we computed the C-index (Harrell and others, 1982).
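To make the evaluation pipeline explicit, here is a hedged sketch of how the held-out assessment above could be reproduced for a single phenotype: the PHS is the linear predictor $x^\top\hat\beta$, the HR per standard deviation and the association $p$-value come from a univariate Cox fit of the test-set survival times on the standardized PHS (using the lifelines package as one possible tool), and the C-index reuses `c_index_fast` from the sketch in Section 2.4. The function and variable names are illustrative; this is not the snpnet-Cox evaluation code.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def evaluate_phs(genotypes_test, beta_hat, time, event):
    """Held-out evaluation of a polygenic hazard score (illustrative sketch)."""
    phs = np.asarray(genotypes_test @ beta_hat)       # PHS = x^T beta_hat
    df = pd.DataFrame({
        "phs_sd": (phs - phs.mean()) / phs.std(),     # standardized PHS
        "time": np.asarray(time),
        "event": np.asarray(event),
    })
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    hr_per_sd = float(np.exp(cph.params_["phs_sd"]))  # hazard ratio per SD of PHS
    p_value = float(cph.summary.loc["phs_sd", "p"])   # association p-value
    c_index = c_index_fast(phs, df["time"].values, df["event"].values)
    return hr_per_sd, p_value, c_index
```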

Fig. 1. snpnet-Cox paths. Each line in these plots corresponds to a variable in the best model. The horizontal axis represents the $\ell_1$ norm of the estimated coefficients and the vertical axis represents the value of each coefficient; the path is computed at various levels of the regularization parameter. The labels along the top of each plot indicate the number of variables selected. The first 12 variables are the covariates, including age, sex, and PC1–PC10.

The C-indices for the ICD10 codes with a significantly associated PHS range from 0.511 to 0.884 (see the Global Biobank Engine snpnet-Cox page, https://biobankengine.stanford.edu/snpnetcox), and the HR per standard deviation of the PHS ranges from 1.042 to 13.167. The results further highlight the sparsity property of the Lasso Cox model implemented in snpnet-Cox, with some ICD10 codes having a single active variable and others almost 2000 active variables (e.g., non-insulin-dependent diabetes mellitus).

3.3.1. Asthma - J45

Motivated by the varying age of onset of asthma, a common disease that affects a substantial fraction of young adults, we hypothesized that a PHS could identify individuals who are not only at higher risk of disease onset but also at higher risk of developing asthma at a younger age.

Here, we estimate an HR of 1.428 per SD of the PHS (C-index of 0.605), and HRs of 2.740, 2.137, and 1.825 for the top 1%, 5%, and 10% of the PHS distribution compared to the 40–60 percentile. Further, we find that a substantially larger proportion of individuals in the top percentiles of the PHS developed asthma by age 20.5 than in the bottom 10% or the 40–60 percentile of the PHS (see Figure 2), which underscores the relevance of the PHS in the context of early onset of common diseases that are hypothesized to have a monogenic signature (Kelsen and Baldassano, 2017). The asthma PHS is composed of 1567 active variables, some of which are known from previous genome-wide association studies (GWAS) of traits related to asthma. As an example, we identify rs2381416 (MAF = 0.26), upstream of GTF3AP1, as associated with asthma with an effect size of $-0.11$. This variant has previously been found to be associated with eosinophil count (Gudbjartsson and others, 2009) and severity of childhood asthma (Smith and others, 2017).

Fig. 2. Asthma. (A) Kaplan–Meier curves for percentiles of the PHS built from the variants selected by snpnet-Cox, in the held-out test set (orange: top 1%, green: top 5%, red: top 10%, blue: 40–60%, and brown: bottom 10%); ticks represent censored observations. Highlighted are the proportions of asthma events by age 20 across the percentile scores. (B) Plot of snpnet-Cox coefficients for asthma with 1567 active variables. Green dots represent protein-altering variants.

3.3.2. Gout - M10

Gout is a common disease, affecting at least 1% of men in Western countries, with a strong male-to-female imbalance (Terkeltaub, 2003). It is a form of arthritis caused by excess uric acid in the bloodstream and is characterized by severe pain, redness, and tenderness in the joints.

In the UK Biobank study, we estimate an HR of 1.679 per SD of the PHS (C-index of 0.649), and HRs of 3.70, 2.502, and 2.073 for the top 1%, 5%, and 10% of the PHS distribution compared to the 40–60 percentile. Further, we find that a substantially larger proportion of individuals in the top percentiles of the PHS developed gout by age 50.1 than in the bottom 10% or the 40–60 percentile of the PHS (see Figure 3). The gout PHS consists of 1970 active variables, and we identify loci that have been reported in prior GWAS (Dehghan and others, 2008).

Fig. 3. Gout. (A) Kaplan–Meier curves for percentiles of the PHS built from the variants selected by snpnet-Cox, in the held-out test set (orange: top 1%, green: top 5%, red: top 10%, blue: 40–60%, and brown: bottom 10%); ticks represent censored observations. Highlighted are the proportions of gout events by age 50 across the percentile scores. (B) Plot of snpnet-Cox coefficients for gout with 1970 active variables. Green dots represent protein-altering variants.

3.3.3. Disorders of porphyrin and bilirubin metabolism - E80

Bilirubin, the principal component of bile pigments, is the end product of the catabolism of the heme moiety of hemoglobin and other hemoproteins. If bilirubin is produced in excessive amounts, or if hepatic excretion of bilirubin into bile is defective, the concentration of bilirubin in the blood and tissues increases, which may result in jaundice (Bosma, 2003), a well-recognized symptom of liver disease.

We estimate an HR of 13.167 per SD of the PHS (C-index of 0.884). Here, with only two active variables, the snpnet-Cox algorithm finds a very sparse solution (see Figure 4). One of the active variables is an intron variant (rs6742078, MAF = 0.31) of UGT1A4, which encodes a UDP-glucuronosyltransferase (uridine diphosphate glucuronosyltransferase), an enzyme that transforms small lipophilic molecules such as bilirubin (Tukey and Strassburg, 2000).

Fig. 4. Disorders of porphyrin and bilirubin metabolism. (A) Kaplan–Meier curves for percentiles of the PHS built from the variants selected by snpnet-Cox, in the held-out test set (orange: top 1%, green: top 5%, red: top 10%, blue: 40–60%, and brown: bottom 10%); ticks represent censored observations. Highlighted are the proportions of disorders of porphyrin and bilirubin metabolism events by age 60 across the percentile scores. (B) Plot of snpnet-Cox coefficients for disorders of porphyrin and bilirubin metabolism with two active variables. Green dots represent protein-altering variants.

3.3.4. Atrial fibrillation and flutter - I48

Atrial fibrillation is the most common type of arrhythmia in adults. Its prevalence increases from less than 1% in persons younger than 60 years of age to substantially higher levels in those older than 80 years of age (McNamara and others, 2003). Earlier onset of atrial fibrillation is believed to have a strong genetic component, and whether that component has more of a polygenic or monogenic flavor is currently unknown.

In the UK Biobank study, we estimate an HR of 1.466 per SD of the PHS (C-index of 0.618), and HRs of 3.883, 2.319, and 1.861 for the top 1%, 5%, and 10% of the PHS distribution compared to the 40–60 percentile. Further, we find that a substantially larger proportion of individuals in the top percentiles of the PHS developed atrial fibrillation by age 60 than in the bottom 10% or the 40–60 percentile of the PHS (see Figure 5), which underscores the relevance of the PHS in the context of early onset of atrial fibrillation.

Fig. 5. Atrial fibrillation. (A) Kaplan–Meier curves for percentiles of the PHS built from the variants selected by snpnet-Cox, in the held-out test set (orange: top 1%, green: top 5%, red: top 10%, blue: 40–60%, and brown: bottom 10%); ticks represent censored observations. Highlighted are the proportions of atrial fibrillation events by age 60 across the percentile scores. (B) Plot of snpnet-Cox coefficients for atrial fibrillation. Green dots represent protein-altering variants.

4. Discussion

In this article, we extended the batch screening iterative Lasso (BASIL) algorithm (Qian and others, 2019) to compute the Lasso path of the Cox proportional hazard model. We also implemented an optimized C-index function, which computes the C-index of a fitted Cox model in $O(n \log n)$ time with an excellent constant factor. Our method was applied to the UK Biobank dataset to identify genetic variants that are associated with time-to-event phenotypes and to build polygenic hazard scores. Visualizations of the snpnet-Cox results across 306 ICD10 codes are available in the Global Biobank Engine (https://biobankengine.stanford.edu/snpnetcox) (McInnes and others, 2019).

Our current approach does have limitations, which we hope to resolve in future work. First, we assume that individuals have independent survival times (conditional on the features). This may become a limitation as population-scale cohorts, especially in population isolates such as Finland, sample related individuals. Second, we do not provide procedures to estimate confidence intervals for the selected variables, which may be useful for communicating confidence in a single active variable (Taylor and Tibshirani, 2015). Third, as we move towards whole-genome sequencing data, where a large fraction of the discovered variants are rare, i.e., observed in only a handful of individuals, the validation accuracy measure used to evaluate a fitted $\hat\beta$ may need to be redefined. Fourth, we do not consider time-varying coefficients or time-varying covariates, which may improve inference in settings where features are measured multiple times over time. These are areas of future work that we anticipate addressing.

Supplementary Material

kxaa038_Supplementary_Data

Acknowledgments

Conflict of Interest: None declared.

5. Software

We provide an implementation of our approach in the publicly available package snpnet, available at https://github.com/rivas-lab/snpnet, with the cindex package dependency available at https://github.com/chrchang/plink-ng/tree/master/2.0/cindex. The analysis results are published on figshare at https://figshare.com/articles/snpnet-Cox_results/12368294.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

Funding

Stanford University to R.L.; The Two Sigma Graduate Fellowship to J.Q., in part; Funai Overseas Scholarship from Funai Foundation for Information Technology and the Stanford University School of Medicine to Y.T. Stanford University and a National Institute of Health center for Multi and Trans-ethnic Mapping of Mendelian and Complex Diseases grant (5U01 HG009080) to M.A.R.; National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) under awards (R01HG010140). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health; NIH (5R01 EB001988-16) and NSF (19 DMS1208164) to R.T., in part; the National Science Foundation (DMS-1407548) to T.H., in part; The National Institutes of Health (5R01 EB 001988-21).

References

  1. Aguirre, M., Rivas, M. and Priest, J. (2019). Phenome-wide burden of copy number variation in UK Biobank. American Journal of Human Genetics 105, 373–383.
  2. Barlow, W. E. and Prentice, R. L. (1988). Residuals for relative risk regression. Biometrika 75, 65–74.
  3. Bosma, P. J. (2003). Inherited disorders of bilirubin metabolism. Journal of Hepatology 38, 107–117.
  4. Breslow, N. (1974). Covariance analysis of censored survival data. Biometrics 30, 89–99.
  5. Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L. T., Sharp, K., Motyer, A., Vukcevic, D., Delaneau, O., O’Connell, J. et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203.
  6. Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34, 187–220.
  7. Dehghan, A., Köttgen, A., Yang, Q., Hwang, S.-J., Kao, W. H. L., Rivadeneira, F., Boerwinkle, E., Levy, D., Hofman, A., Astor, B. C. et al. (2008). Association of three genetic loci with uric acid concentration and risk of gout: a genome-wide association study. The Lancet 372, 1953–1961.
  8. Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1–22.
  9. Goeman, J. J. (2010). L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal 52, 70–84.
  10. Gudbjartsson, D. F., Bjornsdottir, U. S., Halapi, E., Helgadottir, A., Sulem, P., Jonsdottir, G. M., Thorleifsson, G., Helgadottir, H., Steinthorsdottir, V., Stefansson, H. et al. (2009). Sequence variants affecting eosinophil numbers associate with asthma and myocardial infarction. Nature Genetics 41, 342.
  11. Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L. and Rosati, R. A. (1982). Evaluating the yield of medical tests. JAMA 247, 2543–2546.
  12. Kane, M., Emerson, J. and Weston, S. (2013). Scalable strategies for computing with massive data. Journal of Statistical Software 55, 1–19.
  13. Kelsen, J. R. and Baldassano, R. N. (2017). The role of monogenic disease in children with very early onset inflammatory bowel disease. Current Opinion in Pediatrics 29, 566–571.
  14. Knuth, D. E. (2011). The Art of Computer Programming, Volume 4A: Combinatorial Algorithms, Part 1. Pearson Education India.
  15. Li, R. and Tibshirani, R. (2019). On the use of C-index for stratified and cross-validated Cox model. arXiv preprint arXiv:1911.09638.
  16. McInnes, G., Tanigawa, Y., DeBoever, C., Lavertu, A., Olivieri, J. E., Aguirre, M. and Rivas, M. (2019). Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics. Bioinformatics 35, 2495–2497.
  17. McNamara, R. L., Tamariz, L. J., Segal, J. B. and Bass, E. B. (2003). Management of atrial fibrillation: review of the evidence for the role of pharmacologic therapy, electrical cardioversion, and echocardiography. Annals of Internal Medicine 139, 1018–1033.
  18. Muła, W., Kurz, N. and Lemire, D. (2018). Faster population counts using AVX2 instructions. The Computer Journal 61, 111–120.
  19. Park, M. Y. and Hastie, T. (2007). L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69, 659–677.
  20. Qian, J., Du, W., Tanigawa, Y., Aguirre, M., Tibshirani, R., Rivas, M. A. and Hastie, T. (2019). A fast and flexible algorithm for solving the lasso in large-scale and ultrahigh-dimensional problems. bioRxiv. doi: 10.1101/630079.
  21. Smith, D., Helgason, H., Sulem, P., Bjornsdottir, U. S., Lim, A. C., Sveinbjornsson, G., Hasegawa, H., Brown, M., Ketchem, R. R., Gavala, M. et al. (2017). A rare IL33 loss-of-function mutation reduces blood eosinophil counts and protects from asthma. PLoS Genetics 13, e1006659.
  22. Sohn, I., Kim, J., Jung, S.-H. and Park, C. (2009). Gradient lasso for Cox proportional hazards model. Bioinformatics 25, 1775–1781.
  23. Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M. et al. (2015). UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine 12, 1–10.
  24. Taylor, J. and Tibshirani, R. J. (2015). Statistical learning and selective inference. Proceedings of the National Academy of Sciences of the United States of America 112, 7629–7634.
  25. Terkeltaub, R. A. (2003). Gout. New England Journal of Medicine 349, 1647–1655.
  26. Therneau, T. M. and Lumley, T. (2014). Package ‘survival’. Survival Analysis, published on CRAN.
  27. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288.
  28. Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J. and Tibshirani, R. J. (2012). Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74, 245–266.
  29. Tukey, R. H. and Strassburg, C. P. (2000). Human UDP-glucuronosyltransferases: metabolism, expression, and disease. Annual Review of Pharmacology and Toxicology 40, 581–616.
