PLOS Genetics. 2020 Oct 23;16(10):e1009141. doi: 10.1371/journal.pgen.1009141

A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank

Junyang Qian 1, Yosuke Tanigawa 2, Wenfei Du 1, Matthew Aguirre 2, Chris Chang 3, Robert Tibshirani 1,2, Manuel A Rivas 2, Trevor Hastie 1,2,*
Editor: Xiaofeng Zhu
PMCID: PMC7641476  PMID: 33095761

Abstract

The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proved to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and Cox model, and also extends to the elastic net with mixed ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.

Author summary

With the advent and evolution of large-scale and comprehensive biobanks, unprecedented opportunities arise for researchers to further uncover the complex landscape of human genetics. One major direction of long-standing interest is the investigation of the relationships between genotypes and phenotypes. This includes, but is not limited to, the identification of genotypes that are significantly associated with the phenotypes, and the prediction of phenotypic values based on genotypic information. Genome-wide association studies (GWAS) constitute a very powerful and widely used framework for the former task and have produced a number of very impactful discoveries. However, when it comes to the latter, their performance is fairly limited by their univariate nature. To address this, multiple regression methods have been suggested to fill in the gap. That said, challenges emerge as both the dimension and the size of modern datasets become large. In this paper, we present a novel computational framework that enables us to efficiently solve the entire lasso or elastic-net solution path on large-scale and ultrahigh-dimensional data, and therefore perform simultaneous variable selection and prediction. Our approach can build on any existing lasso solver for small or moderate-sized problems, scale it up to a big-data solution, and incorporate other extensions easily. We provide a package, snpnet, that extends the glmnet package in R and optimizes for large phenotype-genotype data. We observe competitive prediction performance of the lasso and the elastic-net for all four phenotypes considered from the UK Biobank. That said, the scope of our approach goes beyond genetic studies. It can be applied to general sparse regression problems to build scalable solutions for a variety of distribution families based on existing solvers.

Introduction

The past two decades have witnessed rapid growth in the amount of data available to us. Many areas such as genomics, neuroscience, economics and Internet services are producing big datasets that have high dimension, large sample size, or both. A variety of statistical methods and computing tools have been developed to accommodate this change. See, for example, [1–5] and the references therein for more details.

In high-dimensional regression problems, we have a large number of predictors, and it is likely that only a subset of them have a relationship with the response and will be useful for prediction. Identifying such a subset is desirable both for scientific interest and for the ability to predict outcomes in the future. The lasso [6] is a widely used and effective method for simultaneous estimation and variable selection. Given a continuous response y ∈ ℝ^n and a model matrix X ∈ ℝ^{n×p}, it solves the following regularized regression problem:

\hat{\beta}(\lambda) = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1, \qquad (1)

where ‖x‖_q = (∑_{i=1}^n |x_i|^q)^{1/q} is the vector ℓ_q norm of x ∈ ℝ^n and λ ≥ 0 is the tuning parameter. The ℓ_1 penalty on β allows for selection as well as estimation. Normally there is an unpenalized intercept in the model, but for ease of presentation we leave it out, or we may assume that both X and y have been centered with mean 0. One typically solves the entire lasso solution path over a grid of λ values λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_L and chooses the best λ by cross-validation or by predictive performance on an independent validation set. In R [7], several packages, such as glmnet [8] and ncvreg [9], provide efficient procedures to obtain the solution path for the Gaussian model (1), and for other generalized linear models with the residual sum of squares replaced by the negative log-likelihood of the corresponding model. Among them, glmnet, equipped with highly optimized Fortran subroutines, is widely considered the fastest off-the-shelf lasso solver. It can, for example, fit a sequence of 100 logistic regression models on a sparse dataset with 54 million samples and 7 million predictors within only 2 hours [10].
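As a small illustration of this workflow (not taken from the paper), the R snippet below fits a full lasso path with glmnet on simulated data and selects λ by R² on a held-out validation set; the data sizes and the 80/20 split are arbitrary choices for the example.

# Illustrative only: fit a lasso path with glmnet and pick lambda on a validation set.
library(glmnet)
set.seed(1)
n <- 1000; p <- 5000
X <- matrix(rnorm(n * p), n, p)
beta <- c(rnorm(20), rep(0, p - 20))                        # 20 truly active predictors
y <- as.vector(X %*% beta + rnorm(n))
train <- 1:800; val <- 801:1000

fit <- glmnet(X[train, ], y[train], family = "gaussian")    # entire solution path
pred <- predict(fit, newx = X[val, ])                       # one column per lambda
r2 <- 1 - colSums((y[val] - pred)^2) / sum((y[val] - mean(y[val]))^2)
best <- which.max(r2)
c(lambda = fit$lambda[best], validation_R2 = r2[best])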

However, as the data become increasingly large, many existing methods and tools may not be able to serve the need, especially if the size exceeds the memory size. Most packages, including the ones mentioned above, assume that the data, or at least its sparse representation, can be fully loaded in memory and that the remaining memory is sufficient to hold other intermediate results. This becomes a real bottleneck for big datasets. For example, in our motivating application, the UK Biobank genotypes and phenotypes dataset [11] contains about 500,000 individuals and more than 800,000 genotyped single nucleotide polymorphism (SNP) and small indel measurements per person. This provides unprecedented opportunities to explore more comprehensive genotypic relationships with phenotypes of interest. For polygenic traits such as height and body mass index (BMI), specific variants discovered by genome-wide association studies (GWAS) used to explain only a small proportion of the estimated heritability [12], an upper bound on the proportion of phenotypic variance explained by the genetic components. While GWAS with larger sample sizes on the UK Biobank can be used to detect more SNPs and rare variants, their prediction performance is fairly limited by the univariate models. It is very interesting to see if full-scale multiple regression methods such as the lasso or elastic-net can improve the prediction performance and simultaneously select relevant variants for the phenotypes. That being said, the computational challenges are twofold. First is the memory bound. Even though each bi-allelic SNP value can be represented by only two bits, and the PLINK software and its bed/pgen formats [13, 14] store such SNP datasets in a binary compressed format, statistical packages such as glmnet and ncvreg require that the data be loaded in memory in a standard double-precision format. Given its sample size and dimension, the genotype matrix itself would take up around one terabyte of space, which may well exceed the size of the memory available and is infeasible for these packages. Second is the efficiency bound. A larger-than-RAM dataset has to sit on disk, and we may only read part of it into memory. In such a scenario, the overall efficiency of the algorithm is determined not only by the number of basic arithmetic operations but also by the disk I/O (data transfer between memory and disk), an operation several orders of magnitude slower than in-memory operations.

In this paper, we propose an efficient and scalable meta-algorithm for the lasso called Batch Screening Iterative Lasso (BASIL), which is applicable to larger-than-RAM datasets and designed to tackle the memory and efficiency bounds. It computes the entire lasso path and can easily build on any existing package to make it a scalable solution. As the name suggests, the computation is done in an iterative fashion on an adaptively screened subset of variables. At each iteration, we exploit an efficient, parallelizable screening operation to significantly reduce the problem to one of manageable size, solve the resulting smaller lasso problem, and then reconstruct and validate a full solution through another efficient, parallelizable step. In other words, the iterations have a screen-solve-check substructure. That being said, it is the goal and also the guarantee of the BASIL algorithm that the final solution exactly solves the full lasso problem (1) rather than any approximation, even if the intermediate steps work repeatedly on subsets of variables.

The screen-solve-check substructure is inspired by [15] and especially the proposed strong rules. The strong rules state: assuming β̂(λ_{k−1}) is the lasso solution in (1) at λ_{k−1}, the jth predictor is discarded at λ_k if

\left| x_j^\top \big( y - X\hat{\beta}(\lambda_{k-1}) \big) \right| < \lambda_k - (\lambda_{k-1} - \lambda_k). \qquad (2)

The key idea is that the inner product above is almost “non-expansive” in λ and that the lasso solution is characterized equivalently by the Karush-Kuhn-Tucker (KKT) condition [16]. For the lasso, the KKT condition states that β̂ ∈ ℝ^p is a solution to (1) if for all 1 ≤ j ≤ p,

\frac{1}{n} \, x_j^\top (y - X\hat{\beta}) \;\; \begin{cases} = \lambda \cdot \operatorname{sign}(\hat{\beta}_j), & \text{if } \hat{\beta}_j \neq 0, \\ \in [-\lambda, \lambda], & \text{if } \hat{\beta}_j = 0. \end{cases} \qquad (3)

The KKT condition suggests that the variables discarded by the strong rules would have coefficient 0 at the next λ_k. The checking step comes into play because this is not a guarantee. The strong rules can fail, though failures occur rarely when p > n. In any case, the KKT condition will be checked to see if the coefficients of the left-out variables are indeed 0 at λ_k. If the check fails, we add in the violated variables and repeat the process. Otherwise, we successfully reconstruct a full solution and move to the next λ. This is the iterative algorithm proposed by these authors and has been implemented efficiently in the glmnet package.
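To make the screen-solve-check logic concrete, here is a minimal R sketch of the two bookkeeping steps for the Gaussian lasso under the 1/(2n) loss scaling of (1); it is illustrative only, not the glmnet or snpnet implementation.

# Strong-rule screening: keep variables whose inner product with the current
# residual survives the threshold 2*lambda_k - lambda_{k-1}.
strong_set <- function(X, y, beta_prev, lambda_prev, lambda_k) {
  score <- abs(crossprod(X, y - X %*% beta_prev)) / nrow(X)   # (1/n)|x_j' r|
  which(score >= 2 * lambda_k - lambda_prev)
}

# KKT check: left-out variables must satisfy (1/n)|x_j' r| <= lambda_k; any
# violation means the screened fit is not yet a valid full solution.
kkt_violations <- function(X, y, beta_hat, lambda_k, screened) {
  score <- abs(crossprod(X, y - X %*% beta_hat)) / nrow(X)
  setdiff(which(score > lambda_k), screened)
}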

The BASIL algorithm proceeds in a similar way but is designed to optimize for datasets that are too big to fit into memory. Since screening and KKT checking need to scan through the entire data and are thus costly in disk input/output (I/O) operations, we attempt to do batch screening and solve a series of models (at different λ values) in each iteration, so that a single sweep over the full data suffices. Together with a checking step, we can then obtain the lasso solution for multiple λ's in one iteration. This effectively reduces the total number of iterations needed to compute the full solution path and thus the expensive disk read operations that often cause significant delay in the computation. The process is illustrated in Fig 1 and will be detailed in the next section.

Fig 1. The lasso coefficient profile that shows the progression of the BASIL algorithm.


The horizontal axis represents the index ℓ of the lambda values, 1 ≤ ℓ ≤ L, which correspond to the sequence of regularization parameters λ_1 > λ_2 > ⋯ > λ_L. The previously finished part of the path is colored grey, the newly completed and verified part is in sky blue, and the part that is newly computed but failed the verification is colored orange. The largest lambda index with a verified model is highlighted with a vertical dotted gray line.

Results

Overview of the BASIL algorithm

For convenience, we first introduce some notation. Let Ω = {1, 2, …, p} be the universe of variable indices. For 1 ≤ ℓ ≤ L, let β̂(λ_ℓ) be the lasso solution at λ = λ_ℓ, and A(λ_ℓ) = {1 ≤ j ≤ p : β̂_j(λ_ℓ) ≠ 0} be the active set. When X is a matrix, we use X_S to represent the submatrix including only columns indexed by S. Similarly, when β is a vector, β_S represents the subvector including only elements indexed by S. Given any two vectors a, b ∈ ℝ^n, the dot product or inner product can be written as a^⊤b = ⟨a, b⟩ = ∑_{i=1}^n a_i b_i. Throughout the paper, we use predictors, features, variables and variants interchangeably. We use the strong set to refer to the screened subset of variables on which the lasso fit is computed at each iteration, and the active set to refer to the subset of variables with nonzero lasso coefficients.

Remember that our goal is to compute the exact lasso solution (1) for larger-than-RAM datasets over a grid of regularization parameters λ_1 > λ_2 > ⋯ > λ_L ≥ 0. We describe the procedure for the Gaussian family in this section and discuss extensions to general problems in the next. A common choice is L = 100 and λ_1 = max_{1≤j≤p} |x_j^⊤ r^{(0)}|/n, the largest λ at which the estimated coefficients start to deviate from zero. Here r^{(0)} = y if we do not include an intercept term and r^{(0)} = y − ȳ if we do. In general, r^{(0)} is the residual of regressing y on the unpenalized variables, if any. The other λ's can be determined, for example, by an equally spaced array on the log scale. The solution path is found iteratively with a screening-solving-checking substructure similar to the one proposed in [15]. Designed for large-scale and ultrahigh-dimensional data, the BASIL algorithm can be viewed as a batch version of the strong rules. At each iteration we attempt to find valid lasso solutions for multiple λ values on the path and thus reduce the burden of disk reads of the big dataset. Specifically, as summarized in Algorithm 1, we start with an empty strong set S^{(0)} = ∅ and active set A^{(0)} = ∅. Each of the following iterations consists of three steps: screening, fitting and checking.

Algorithm 1 BASIL for the Gaussian Model

1: Initialization: active set A^{(0)} = ∅, initial residual r^{(0)} (with respect to the intercept or other unpenalized variables) at λ_1 = λ_max, a short list of initial parameters Λ^{(0)} = {λ_1, …, λ_{L^{(0)}}}.

2: for k = 0 to K do

3:  Screening: for each 1 ≤ j ≤ p, compute the inner product with the current residual, c_j^{(k)} = ⟨x_j, r^{(k)}⟩. Construct the strong set

S^{(k)} = A^{(k)} ∪ E_M^{(k)},

 where E_M^{(k)} is the set of M variables in Ω ∖ A^{(k)} with the largest |c_j^{(k)}|.

4:  Fitting: for all λ ∈ Λ^{(k)}, solve the lasso only on the strong set S^{(k)}, and find the coefficients β̂^{(k)}(λ) and the residuals r^{(k)}(λ).

5:  Checking: search for the smallest λ such that the KKT conditions are satisfied, i.e.,

λ̄^{(k)} = min{ λ ∈ Λ^{(k)} : max_{j ∈ Ω∖S^{(k)}} (1/n) |x_j^⊤ r^{(k)}(λ)| < λ }.

If this set is empty, we define λ̄^{(k)} to be the λ immediately preceding Λ^{(k)} and increment M by ΔM. Let the current active set A^{(k+1)} and residual r^{(k+1)} be defined by the solution at λ̄^{(k)}. Define the next parameter list Λ^{(k+1)} = {λ ∈ Λ^{(k)} : λ < λ̄^{(k)}}. Extend this list if it consists of too few elements. For λ ∈ Λ^{(k)} ∖ Λ^{(k+1)}, we obtain exact lasso solutions for the full problem:

β̂_{S^{(k)}}(λ) = β̂^{(k)}(λ),  β̂_{Ω∖S^{(k)}}(λ) = 0.

6: end for

In the screening step, an updated strong set is found as the candidate for the subsequent fitting. Suppose that so far (valid) lasso solutions have been found for λ_1, …, λ_ℓ but not for λ_{ℓ+1}. The new set will be based on the lasso solution at λ_ℓ. In particular, we will select the top M variables with the largest absolute inner products |⟨x_j, y − Xβ̂(λ_ℓ)⟩|. They are the variables most likely to be active in the lasso model for the next λ values. In addition, we include the ever-active variables at λ_1, …, λ_ℓ because they have been “important” variables and might continue to be important at a later stage.

In the fitting step, the lasso is fit on the updated strong set for the next λ values λ_{ℓ+1}, …, λ_{ℓ′}. Here ℓ′ is often smaller than L because we do not have to solve for all of the remaining λ values on this strong set. The full lasso solutions at much smaller λ's are very likely to have active variables outside of the current strong set. In other words, even if we were to compute solutions for those very small λ values on the current strong set, they would probably fail the KKT test. These λ's are left to later iterations when the strong set is expanded.

In the checking step, we check whether the newly obtained solutions on the strong set can be a valid part of the full solutions by evaluating the KKT condition. Given a solution β̂_S ∈ ℝ^{|S|} to the sub-problem at λ, if we can verify for every left-out variable j that (1/n)|⟨x_j, y − X_S β̂_S⟩| < λ, we can then safely set their coefficients to 0. The full lasso solution β̂(λ) ∈ ℝ^p is then assembled by letting β̂_S(λ) = β̂_S and β̂_{Ω∖S}(λ) = 0. We look for the λ value prior to the one that causes the first failure down the λ sequence and use its residual as the basis for the next screening. Nevertheless, there is still a chance that none of the solutions on the current strong set passes the KKT check for the λ subsequence considered in this iteration. That suggests the number of variables added in the current iteration was not sufficient. In this case, we are unable to move forward along the λ sequence, but fall back to the λ value where the strong set was last updated and include ΔM more variables based on the sorted absolute inner products.

The three steps above can be applied repeatedly to roll out the complete lasso solution path for the original problem. However, if our goal is to choose the best model along the path, we can stop fitting once an optimal model is found, as evidenced by performance on a validation set. At a high level, we run the iterative procedure on the training data, monitor the error on the validation set, and stop when the model starts to overfit, or in other words, when the validation error shows a clear upward trend.
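The following R sketch summarizes the outer loop just described. It is schematic rather than the snpnet implementation: compute_inner_products(), fit_lasso_on_subset(), kkt_pass() and extend_lambda_list() are hypothetical helpers standing in for the chunked disk reads, the in-memory solver and the bookkeeping of the λ list.

# Schematic BASIL outer loop for the Gaussian model (illustrative pseudocode in R).
basil_gaussian <- function(lambda_seq, M, dM, max_iter = 100) {
  active <- integer(0)                       # A^(0)
  resid  <- NULL                             # r^(0), residual w.r.t. unpenalized covariates
  Lambda <- head(lambda_seq, 10)             # initial short list Lambda^(0)
  path   <- list()

  for (k in seq_len(max_iter)) {
    # Screening: one pass over the full data, parallelizable across feature blocks
    score  <- compute_inner_products(resid)                      # c_j = <x_j, r> for all j
    extras <- setdiff(order(abs(score), decreasing = TRUE), active)[seq_len(M)]
    strong <- union(active, extras)

    # Fitting: small in-memory lasso over the strong set for the whole batch of lambdas
    fit <- fit_lasso_on_subset(strong, Lambda)

    # Checking: keep solutions up to the lambda preceding the first KKT failure
    ok         <- kkt_pass(fit, strong, Lambda)                  # logical vector over Lambda
    first_fail <- which(!ok)[1]
    last       <- if (is.na(first_fail)) length(Lambda) else first_fail - 1
    if (last == 0) { M <- M + dM; next }                         # none valid: enlarge the batch

    path   <- c(path, fit$solutions[seq_len(last)])
    active <- fit$active[[last]]
    resid  <- fit$residuals[[last]]
    Lambda <- extend_lambda_list(lambda_seq, last_solved = fit$lambda[last])
    if (length(Lambda) == 0) break                               # full path completed
  }
  path
}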

Extension to general problems

It is straightforward to extend the algorithm from the Gaussian case to more general problems. In fact, the only changes we need to make are in the screening step and the strong set update step. Wherever the strong rules can be applied, we have a corresponding version of the iterative algorithm. In [15], the general problem is

\hat{\beta}(\lambda) = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \; f(\beta) + \lambda \sum_{j=1}^{r} c_j \|\beta_j\|_{p_j}, \qquad (4)

where f is a convex differentiable function, and for all 1 ≤ j ≤ r, c_j ≥ 0 is a separate penalty factor, p_j ≥ 1, and β_j can be a scalar or vector whose ℓ_{p_j}-norm is denoted ‖β_j‖_{p_j}. If all c_j's are positive, we can derive that the starting value of the λ sequence (i.e. the minimum value of λ such that all coefficients are 0) is λ_max = max_{1≤j≤r} ‖∇_j f(0)‖_{q_j} / c_j. The general strong rule discards predictor j if

\|\nabla_j f(\hat{\beta}(\lambda_{k-1}))\|_{q_j} < c_j (2\lambda_k - \lambda_{k-1}), \qquad (5)

where 1/p_j + 1/q_j = 1. Hence, our algorithm can adapt and screen by choosing variables with large values of ‖∇_j f(β̂(λ_{k−1}))‖_{q_j} that are not in the current active set. We describe in more detail two important applications of the general rule: logistic regression and Cox's proportional hazards model in survival analysis.

Logistic regression

In lasso-penalized logistic regression [17], where the observed outcome y ∈ {0, 1}^n, the convex differentiable function in (4) is

f(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \big( y_i \log p_i + (1 - y_i) \log(1 - p_i) \big),

where p_i = 1/(1 + exp(−x_i^⊤β)) for all 1 ≤ i ≤ n. The rule in (5) reduces to

\left| x_j^\top \big( y - \hat{p}(\lambda_{k-1}) \big) \right| < \lambda_k - (\lambda_{k-1} - \lambda_k),

where p̂(λ_{k−1}) is the vector of predicted probabilities at λ = λ_{k−1}. Similar to the Gaussian case, we can still fit the relaxed lasso [24] and allow adjustment covariates in the model to adjust for confounding effects.
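As a small illustration (not the snpnet code), the screening statistic above can be computed in R by swapping the Gaussian residual for y − p̂; here p_hat would come from the logistic fit on the current strong set, for example via predict(fit, newx, type = "response") in glmnet.

# Illustrative screening scores for the logistic case: |x_j' (y - p_hat)| for every j.
logistic_screen_scores <- function(X, y, p_hat) {
  abs(crossprod(X, y - p_hat))
}
# Variables with the largest scores (outside the current active set) enter the next strong set.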

Cox’s proportional hazards model

In the usual survival analysis framework, for each sample, in addition to the predictors x_i ∈ ℝ^p and the observed time y_i, there is an associated right-censoring indicator δ_i ∈ {0, 1} such that δ_i = 0 if failure and δ_i = 1 if right-censored. Let t_1 < t_2 < … < t_m be the increasing list of unique failure times, and j(i) denote the index of the observation failing at time t_i. Cox's proportional hazards model [18] assumes the hazard for the ith individual to be h_i(t) = h_0(t) exp(x_i^⊤β), where h_0(t) is a shared baseline hazard at time t. We can let f(β) be the negative log partial likelihood in (4) and screen based on its gradient at the most recent lasso solution as suggested in (5). In particular,

f(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \Big( x_{j(i)}^\top \beta - \log\Big( \sum_{j \in R_i} \exp(x_j^\top \beta) \Big) \Big),

where R_i is the set of indices j with y_j ≥ t_i (those at risk at time t_i). We can derive the associated rule based on (5) and thus the survival BASIL algorithm. Further discussion and comprehensive experiments are included in a follow-up paper [19].

Extension to the elastic net

Our discussion so far has focused solely on the lasso penalty, which aims to achieve a sparse set of linear coefficients. Despite good performance in many high-dimensional settings, it has limitations. For example, when there is a group of highly correlated variables, the lasso will often pick out one of them and ignore the others, which makes interpretation harder. Moreover, it has been empirically observed that ridge regression can outperform the lasso when the predictors are highly correlated [6].

The elastic net, proposed in [20], extends the lasso and tries to find a sweet spot between the lasso and the ridge penalty. It can capture the grouping effect of highly correlated variables and sometimes performs better than both methods, especially when the number of variables is much larger than the number of samples. In particular, instead of imposing the ℓ_1 penalty, the elastic net solves the following regularized regression problem:

\hat{\beta}(\lambda) = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \; f(\beta) + \lambda \big( \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 / 2 \big), \qquad (6)

where the mixing parameter α ∈ [0, 1] determines the proportion of lasso and ridge in the penalty term.

It is straightforward to adapt the BASIL procedure to the elastic net. This follows from the gradient motivation of the strong rules and the KKT condition of convex optimization. We take the Gaussian family as an example; the others are similar. In the screening step, it is easy to derive that we can still rank the currently inactive variables by their absolute inner product with the residual, |x_j^⊤(y − Xβ̂(λ_{k−1}))|, to determine the next candidate set. In the checking step, to verify that all the left-out variables indeed have zero coefficients, we need to make sure that (1/n)|x_j^⊤(y − Xβ̂(λ))| ≤ λα holds for all such variables. It turns out that in our UK Biobank applications, the elastic-net results (after selection of α and λ on the validation set) do not differ significantly from the lasso results, as will be seen in the next section.
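For illustration only, the in-memory glmnet analogue of this tuning scheme looks as follows (snpnet performs the same search in the larger-than-RAM setting); the simulated data and the validation-set selection mirror the earlier sketch.

# Illustrative elastic-net tuning: separate paths at alpha = 0.1, 0.5, 0.9; the best
# (alpha, lambda) pair is chosen by validation R^2.
library(glmnet)
set.seed(1)
n <- 1000; p <- 5000
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X[, 1:20] %*% rnorm(20) + rnorm(n))
train <- 1:800; val <- 801:1000

alphas <- c(0.1, 0.5, 0.9)
fits <- lapply(alphas, function(a) glmnet(X[train, ], y[train], alpha = a))

val_r2 <- function(fit) {
  pred <- predict(fit, newx = X[val, ])
  1 - colSums((y[val] - pred)^2) / sum((y[val] - mean(y[val]))^2)
}
scores <- lapply(fits, val_r2)
best_a <- which.max(sapply(scores, max))       # index of the best alpha
best_l <- which.max(scores[[best_a]])          # index of the best lambda within that path
c(alpha = alphas[best_a], lambda = fits[[best_a]]$lambda[best_l])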

UK Biobank analysis

We describe a real-data application on the UK Biobank that in fact motivates our development of the BASIL algorithm.

The UK Biobank [11] is a very large, prospective population-based cohort study with individuals collected from multiple sites across the United Kingdom. It contains extensive genotypic and phenotypic detail such as genomewide genotyping, questionnaires and physical measures for a wide range of health-related outcomes for over 500,000 participants, who were aged 40-69 years when recruited in 2006-2010. In this study, we are interested in the relationship between an individual’s genotype and his/her phenotypic outcome. While GWAS focus on identifying SNPs that may be marginally associated with the outcome using univariate tests, we would like to find relevant SNPs in a multivariate prediction model using the lasso. A recent study [21] fits the lasso on a subset of the variables after one-shot univariate p-value screening and suggests improvement in explaining the variation in the phenotypes. However, the left-out variants with relatively weak marginal association may still provide additional predictive power in a multiple regression environment. The BASIL algorithm enables us to fit the lasso model at full scale and gives further improvement in the explained variance over the alternative models considered.

We focused on 337,199 White British unrelated individuals out of the full set of over 500,000 from the UK Biobank dataset [11] that satisfy the same set of population stratification criteria as in [22]. The dataset is partitioned randomly into training (60%), validation (20%) and test (20%) subsets (Methods). Each individual has up to 805,426 measured variants, and each variant is encoded by one of four levels, where 0 corresponds to homozygous major alleles, 1 to heterozygous alleles, 2 to homozygous minor alleles and NA to a missing genotype. In addition, we have available covariates such as age, sex, and forty pre-computed principal components of the SNP matrix.

To evaluate the predictive performance for a quantitative response, we use the common measure R-squared (R²). Given a linear estimator β̂ and data (y, X), it is defined as

R^2 = 1 - \frac{\|y - X\hat{\beta}\|_2^2}{\|y - \bar{y}\|_2^2}.

We evaluate this criterion on the training, validation and test sets. For a dichotomous response, misclassification error could be used, but it would depend on the calibration. Instead, the receiver operating characteristic (ROC) curve provides more information and illustrates the tradeoff between true positive and false positive rates under different thresholds. The AUC computes the area under the ROC curve; a larger value indicates a generally better classifier. Therefore, we evaluate AUCs on the training, validation and test sets for dichotomous responses.
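The two metrics can be computed with a few lines of R; the helpers below are a minimal sketch matching the definitions above (the AUC uses the rank-based Mann-Whitney formula) and are not the code used for the reported results.

# R^2 as defined above.
r_squared <- function(y, y_hat) {
  1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
}

# AUC via the rank (Mann-Whitney) statistic: the probability that a random case is
# ranked above a random control by the score.
auc <- function(y, score) {            # y in {0, 1}, score = predicted risk
  r  <- rank(score)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}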

We compare the performance of the lasso with related methods to get a sense of the contribution of different components. Starting from the baseline, we fit a linear model that includes only age and sex (Model 1 in the tables below), and then one that additionally includes the top 10 principal components (Model 2). These are the adjustment covariates used in our main lasso fitting, and we use these two models to highlight the contribution of the SNP information over and above that of age, sex and the top 10 PCs. In addition, the strongest univariate model is also evaluated (Model 3). This includes the 12 adjustment covariates together with the single SNP that is most correlated with the outcome after adjustment. Moving toward multivariate models, we first compare with a univariate method with some multivariate flavor (Models 4 and 5). We select a subset of the K most marginally significant variants (after adjusting for the covariates), construct a new variable by linearly combining these variants using their univariate coefficients, and fit an ordinary least squares (OLS) regression on the new variable together with the adjustment variables. It is similar to a one-step partial least squares [23] with p-value based truncation. We take K = 10,000 and 100,000 in the experiments. We further compare with a hierarchical sequence of multivariate models where each is fit on a subset of the most significant SNPs. In particular, the ℓ-th model selects the ℓ × 1000 SNPs with the smallest univariate p-values, and a multivariate linear or logistic regression is fit on those variants jointly. The sequence of models is evaluated on the validation set, and the one with the smallest validation error is chosen. We call this method Sequential LR or SeqLR (Model 6) for convenience in the rest of the paper. As a byproduct of the lasso, the relaxed lasso [24] fits a debiased model by refitting an OLS on the variables selected by the lasso. This can potentially remove some of the bias introduced by lasso shrinkage. For the elastic-net, we fit separate solution paths with varying λ's at α = 0.1, 0.5, 0.9, and evaluate their performance (R² or AUC) on the validation set. The best pair of hyperparameters (α, λ) is selected and the corresponding test performance is reported.

There are thousands of measured phenotypes in the dataset. For demonstration purposes, we analyze four phenotypes that are known to be highly or moderately heritable and polygenic. For these complex traits, univariate studies may not find SNPs with smaller effects, but the lasso model may include them and predict the phenotype better. We look at two quantitative traits, standing height and body mass index (BMI), which are defined as the non-NA median of up to 3 measurements [25], and two qualitative traits, asthma and high cholesterol (HC) [22].

We first summarize the test performance of the methods above in Fig 2. The lasso and elastic net show significant improvement in test R² and AUC over the others. Details of the model for height are given in the next section and for the other phenotypes (BMI, asthma and high cholesterol) in Supporting Information. A comparison of the univariate p-values and the lasso coefficients for all these traits is shown in the form of Manhattan plots and coefficient plots in the Supporting Information.

Fig 2. Comparison of the predictive performance of the different polygenic prediction methods evaluated on the test set.


R² is evaluated for the continuous phenotypes height and BMI, and AUC for the binary phenotypes asthma and high cholesterol. The error bars represent 2 standard errors to show the statistical significance.

There are many other well-established methods for constructing polygenic risk scores from large-scale cohorts. Among them, we compare with ridge regression, pruning and thresholding (P + T), clumping, and summary statistics-based Bayesian regression methods such as PRS-CS [26] and SBayesR [27]. Ridge regression, also known as BLUP in the quantitative genetics field, fits a multiple linear regression model with an ℓ2-penalty. It is a special case of the elastic-net with α = 0 in (6). While it is simple and has been widely used in a variety of prediction tasks, the fact that the resulting model always includes all the variables can pose great computational challenges (for example, memory) in large-scale problems. In our experiments, the size of the data prevents us from doing exact, full-scale ridge regression. Instead, we approximate its performance by fitting the elastic-net with a very small α = 10^{-3}, which can be easily handled by our snpnet package. For P + T, we first identified an LD independent set of variants using the --indep-pairwise 50 5 0.5 subcommand in PLINK2.0. We subsequently applied univariate genome-wide association analysis (--glm firth-fallback), focused on the LD independent variants, imposed different p-value thresholds (1 × 10^{-3}, 1 × 10^{-4}, and 1 × 10^{-5}), and extracted the univariate BETAs for the remaining variants to construct the PRS [13]. For clumping, we applied the --clump subcommand to the GWAS summary statistics with a varying p-value threshold (--clump-p1; 1 × 10^{-3}, 1 × 10^{-4}, and 1 × 10^{-5}), and extracted the univariate BETAs for the identified lead SNPs [13]. For each of those two methods, we computed the PRS for each individual and fit an additional model consisting of covariates and the genotype PRS to report the predictive performance of the model consisting of both the genetic features and covariates. For PRS-CS, we first characterized the GWAS summary statistics using the combined set of training and validation individuals (n = 269,927) with age, sex, and the top 10 PCs as covariates using PLINK v2.00a3LM (9 Apr 2020) [13, 28]. Using the LD reference dataset precomputed for European ancestry from the 1000 Genomes samples (https://github.com/getian107/PRScs), we applied PRS-CS with the default options. We took the posterior effect size estimates and computed the polygenic risk scores using PLINK2's --score subcommand [13]. For SBayesR, we computed the sparse LD matrix using the combined set of training and validation individuals (n = 269,927) using the --make-sparse-ldm subcommand implemented in GCTB version 2.0.1 [29]. Using the GWAS summary statistics computed on the same set of individuals and following GCTB's recommendations, we applied SBayesR with the following options: gctb --sbayes R --ldm [the LD matrix] --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --chain-length 10000 --burn-in 4000 --gwas-summary [the GWAS summary statistics]. We report the model performance on the test set.

We train PRS-CS and SBayesR on the combined training and validation dataset with recommended settings. To make it a fair comparison, for the other methods with tuning parameter(s), we follow a refitting mechanism:

  1. fit models on the training set under different parameters;

  2. choose the optimal parameter based on the metric (R2/AUC) on the validation set;

  3. refit the model with the chosen parameter on a combined training and validation set.

This is often recommended for methods with tuning parameters to make the most of the validation set, and for the lasso/elastic-net we demonstrate those steps with a code example in the vignette of our snpnet package. The predictive performance is compared in Fig 3. SBayesR seems fairly competitive on binary phenotypes, achieving higher test AUC on asthma and high cholesterol. For continuous phenotypes, the lasso and the elastic-net seem to have some advantage, with the lasso doing the best for height and the elastic-net doing the best for BMI. Aside from predictive performance, SBayesR always includes all the variables in the model, while the lasso/elastic-net class often ends up using only a small fraction of the variables. While prediction is key to the relevance of PRS methods, the sparsity of the solution achieved by the lasso/elastic-net class is also very important for scientific understanding of the underlying genetics.
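For illustration, the three refitting steps look as follows with glmnet on an in-memory matrix (the snpnet vignette contains the authoritative large-scale version); the simulated X, y and the train/val split below are placeholders.

# Illustrative refitting workflow: tune on train/validation, then refit on both.
library(glmnet)
set.seed(1)
n <- 1000; p <- 2000
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X[, 1:10] %*% rnorm(10) + rnorm(n))
train <- 1:600; val <- 601:800                     # remaining rows would be the test set

# 1. fit the solution path on the training set
fit_tr <- glmnet(X[train, ], y[train])

# 2. choose lambda by validation R^2
pred <- predict(fit_tr, newx = X[val, ])
r2   <- 1 - colSums((y[val] - pred)^2) / sum((y[val] - mean(y[val]))^2)
lam  <- fit_tr$lambda[which.max(r2)]

# 3. refit at the chosen lambda on the combined training and validation set
tv      <- c(train, val)
fit_ref <- glmnet(X[tv, ], y[tv], lambda = lam)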

Fig 3. Comparison of the test set predictive performance of the different polygenic risk score (PRS) methods with refitting on the training and the validation set.


R² is evaluated for the continuous phenotypes height and BMI, and AUC for the binary phenotypes asthma and high cholesterol. The error bars represent 2 standard errors to show the statistical significance.

Height is a polygenic and heritable trait that has been studied for a long time. It has been used as a model for other quantitative traits, since it is easy to measure reliably. From twin and sibling studies, the narrow-sense heritability is estimated to be 70-80% [30–32]. Recent estimates controlling for shared environmental factors present in twin studies calculate the heritability at 0.69 [33, 34]. A linear model based on common SNPs explains 45% of the variance [35] and a model including imputed variants explains 56% of the variance, almost matching the estimated heritability [36]. So far, GWAS have discovered 697 associated variants that explain one fifth of this heritability [37, 38]. Recently, a large-sample study was able to identify more variants with low frequencies that are associated with height [39]. Using the lasso with the larger UK Biobank dataset allows both a better estimate of the proportion of variance that can be explained by genomic predictors and simultaneous selection of SNPs that may be associated. The results are summarized in Table 1, where for each model class (row), the reported numbers are based on the fitted model on the training set that achieves the best validation performance (optimizing over hyper-parameters, if any). The associated R² curves for the lasso and the relaxed lasso are shown in Fig 4. The residuals of the optimal lasso prediction are plotted in Fig 5.

Table 1. R² values for height (without refitting).

For sequential LR, lasso and relaxed lasso, the chosen model is based on the maximum R² on the validation set. Models (3) to (9) each include Model (2) plus their own specification as stated in the Form column. The elastic-net picks α = 0.9 based on validation performance.

Model  Form  R²(train)  R²(val)  R²(test)  Size
(1) Age + Sex 0.5300 0.5260 0.5288 2
(2) Age + Sex + 10 PCs 0.5344 0.5304 0.5336 12
(3) Strong Single SNP 0.5364 0.5323 0.5355 13
(4) 10K Combined 0.5482 0.5408 0.5444 10,012
(5) 100K Combined 0.5833 0.5515 0.5551 100,012
(6) Sequential LR 0.7416 0.6596 0.6601 17,012
(7) Lasso 0.8304 0.6992 0.6999 47,673
(8) Relaxed Lasso 0.7789 0.6718 0.6727 13,395
(9) Elastic Net 0.8282 0.6991 0.6998 48,268

Fig 4. R² plot for height.


The primary horizontal axis on the bottom represents the index ℓ of the lambda values, 1 ≤ ℓ ≤ L, which correspond to the sequence of regularization parameters λ_1 > λ_2 > ⋯ > λ_L. The top axis shows the number of active variables in the model. ReLasso: relaxed lasso.

Fig 5.


Left: actual height versus predicted height on 5000 random samples from the test set. A regression line with its 95% confidence band is also added on top of the dots. The correlation between actual height and predicted height is 0.9416. Right: histogram of the lasso residuals for height. Standard deviation of the residual is 5.05 (cm).

A large number (47,673) of SNPs need to be selected in order to achieve the optimal R²(test) = 0.6999 for the lasso, and similarly for the elastic-net, though this is only a small fraction of the entire variant set. Comparatively, the relaxed lasso sacrifices some predictive performance by including a much smaller subset of variables (13,395). Past the optimal point, the additional variance introduced by refitting such large models may be larger than the reduction in bias. The large models confirm the extreme polygenicity of standing height.

In comparison to the other models, the lasso performs significantly better in terms of R²(test) than all univariate methods, and outperforms multivariate methods based on univariate p-value ordering. This demonstrates the value of simultaneous variable selection and estimation from a multivariate perspective, and enables us to predict height to within 10 cm about 95% of the time based only on SNP information (together with age and sex). We also notice that the sequential linear regression approach does a good job, with performance close to that of the relaxed lasso. It is straightforward and easy to implement using existing software such as PLINK [13].

Recently, [21] applied a lasso-based method to predict height and other phenotypes on the UK Biobank. Instead of fitting on all QC-satisfying SNPs (as stated in the experiment details paragraph of Materials and methods), they pre-screen the 50K or 100K most significant SNPs in terms of univariate p-value and apply the lasso on that set only. In addition, although both datasets come from the same UK Biobank, the subset of individuals they used is larger than ours. While we restrict the analysis to the unrelated individuals who have self-reported White British ancestry, they look at Europeans including British, Irish and Any Other White. For a fair comparison, we follow their procedure (pre-screening 100K SNPs) but run it on our subset of the dataset. The results are shown in Table 2. We see that the improvement of the full lasso over the prescreened lasso is almost 0.5% in test R², and 1% relative to the proportion of residual variance explained after covariate adjustment.

Table 2. Comparison of prediction results on height with the model trained following the same procedure as ours except for an additional prescreening step as done in [21].

In addition to R², the proportion of residual variance explained (denoted h²(test)) and the correlation between the fitted and actual values are computed. We also compute an adjusted correlation between the residual after regressing age and sex out from the prediction and the residual after regressing age and sex out from the true response, both on the test set.

Method  R²(val)  R²(test)  h²(test)  Cor(test)  Cor(test − {age, sex})
Lasso 69.92% 69.99% 35.66% 0.8366 0.4079
Prescreened lasso 69.40% 69.56% 34.73% 0.8340 0.4025

In addition, we compare the full lasso coefficients and the univariate p-values from GWAS in Fig 6. The vertical grey dotted line indicates the top-100K cutoff in terms of p-value. Although a general decreasing trend appears in the magnitude of the lasso coefficients with respect to increasing p-values (decreasing −log10(p)), there are a number of spikes even in the large p-value region that is considered marginally insignificant. This shows that variants beyond the strongest univariate ones contribute to prediction.

Fig 6. Comparison of the lasso coefficients and univariate p-values for height.


The index on the horizontal axis represents the SNPs sorted by their univariate p-values. The red curve, associated with the left vertical axis, shows the −log10 of the univariate p-values. The blue bars, associated with the right vertical axis, show the corresponding lasso coefficients for each (sorted) SNP. The horizontal dotted lines in gray identify lasso coefficients of ±0.05. The vertical one represents the 100K cutoff used in [21].

We also run the lasso and the elastic-net with the refitting mechanism and compare them with the other well-established PRS methods. From Table 3, we see that the lasso and the elastic-net do the best job and also use only a small fraction of the variables.

Table 3. R² values for height by different PRS methods with refitting.

Model  Form  R²(test)  Size
(1) Lasso 0.7127 45,653
(2) Elastic Net 0.7128 45,549
(3) Ridge 0.6986 175,012
(4) PRS-CS 0.5615 148,064
(5) SBayesR 0.7019 667,057
(6) P + T 0.5912 15,544
(7) Clumping 0.6181 17,433

Discussion

In this paper, we propose a novel batch screening iterative lasso (BASIL) algorithm to fit the full lasso solution path for very large and high-dimensional datasets. It can be used, among others, for the Gaussian linear model, logistic regression and Cox regression, and can be easily extended to fit the elastic-net with mixed ℓ1/ℓ2 penalty. It enjoys the advantages of high efficiency, flexibility and easy implementation. For SNP data as in our applications, we develop an R package, snpnet, that incorporates SNP-specific optimizations and is able to process datasets of wide interest from the UK Biobank.

In our algorithm, the choice of M is important for the practical performance. It trades off between the number of iterations and the computation per iteration. With a small M, or a small update of the strong set, it is very likely that we are unable to proceed quickly along the λ sequence in each iteration. Although the design of the BASIL algorithm guarantees that for any M, ΔM > 0 we are able to obtain the full solution path after sufficiently many iterations, many iterations will be needed if M is chosen too small, and the disk I/O cost will be dominant. In contrast, a large M will incur more memory burden and more expensive lasso computation, but with the hope of finding more valid lasso solutions in one iteration, saving iterations and disk I/O. It is hard to identify the optimal M a priori. It depends on the computing architecture, the size of the problem, the nature of the phenotype, etc. For this reason, we tend to leave it as a subjective parameter to the user's choice. In the meantime, we do plan to provide a more systematic option to determine M, which leverages the strong rules again. Recall that in the simple setting with no intercept and no covariates, the initial strong set is constructed as {j : |x_j^⊤y| ≥ 2λ − λ_max}. Since the strong rules rarely make mistakes and are fairly effective in discarding inactive variables, we can guide the choice of the batch size M by the number of λ values we want to cover in the first iteration. For example, one may want the strong set to be large enough to solve for the first 10 λ's in the first iteration. We can then let M = |{1 ≤ j ≤ p : |x_j^⊤y| > 2λ_{10} − λ_max}|. Despite being adaptive to the data in some sense, this approach is by no means computationally optimal. It is based more on the heuristic that each iteration should make reasonable progress along the path.
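A minimal R sketch of this heuristic is given below; it assumes a centered response, centered columns of X and no unpenalized covariates, and it scales both sides by 1/n so that λ_max matches the definition used earlier.

# Heuristic batch size: number of variables passing the strong-rule threshold for the
# n_lambda-th value of the path (illustrative sketch, not the snpnet default).
choose_M <- function(X, y, lambda_seq, n_lambda = 10) {
  score      <- abs(crossprod(X, y)) / nrow(X)     # (1/n)|x_j' y|
  lambda_max <- max(score)
  sum(score > 2 * lambda_seq[n_lambda] - lambda_max)
}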

Our numerical studies demonstrate that the iterative procedure effectively reduces a big-n-big-p lasso problem into one that is manageable by in-memory computation. In each iteration, we are able to use parallel computing when applying screening rules to filter out a large number of variables. After screening, we are left with only a small subset of data on which we are able to conduct intensive computation like cyclical coordinate descent all in memory. For the subproblem, we can use existing fast procedures for small or moderate-size lasso problems. Thus, our method allows easy reuse of previous software with lightweight development effort.

When a large number of variables is needed in the optimal predictive model, it may still require either large memory or long computation time to solve the smaller subproblem. In that case, we may consider more scalable and parallelizable methods like proximal gradient descent [40] or dual averaging [41, 42]. One may ask: why not directly use these methods for the original full problem? First, the ultrahigh dimension makes the evaluation of gradients, even on a mini-batch, very expensive. Second, it can take many more steps for such first-order methods to converge to a good objective value. Moreover, the speed of convergence depends on the choice of other parameters such as the step size and additional constants in dual averaging. For those reasons, we still prefer the tuning-free and fast coordinate descent methods when the subproblem is manageable.

The lasso has nice variable selection and prediction properties if the linear model assumption together with some additional assumptions such as the restricted eigenvalue condition [43] or the irrepresentable condition [44] holds. In practice, such assumptions do not always hold and are often hard to verify. In our UK Biobank application, we don’t attempt to verify the exact conditions, and the selected model can be subject to false positives. However, we demonstrate relevance of the selection via empirical consistency with the GWAS results. We have seen superior prediction performance by the lasso as a regularized regression method compared to other methods. More importantly, by leveraging the sparsity property of the lasso, we are able to manage the ultrahigh-dimensional problem and obtain a computationally efficient solution.

When comparing with other methods in the UK Biobank experiments, due to the large number of test samples (60,000+), we are confident that the lasso and the elastic-net are able to do significantly better than all other methods on height and BMI, and are as competitive as SBayesR on asthma and high cholesterol. In fact, the standard error of R² can be easily derived by the delta method, and the standard error of the AUC can be estimated and upper bounded by √(1/(4 min(m, n))) [45, 46], where m, n represent the numbers of positive and negative samples. For height and BMI, it turns out that the standard errors are roughly 0.001, or 0.1%. For asthma and high cholesterol, considering the case rate of around 12%, the standard errors can be upper bounded by 0.005, or 0.5%. The estimated standard errors are reflected in the error bars in Figs 2 and 3. Therefore, in terms of predictive performance, on height and BMI the lasso/elastic-net class performs significantly better than the other methods, while on asthma and high cholesterol the lasso/elastic-net and SBayesR are both fairly competitive; their difference is not statistically significant. Moreover, the lasso/elastic-net method builds parsimonious models using only a small fraction of the variants. It is more interpretable and can have meaningful implications for the underlying genetics.

Materials and methods

Variants in the BASIL framework

Some other very useful components can be easily incorporated into the BASIL framework. We will discuss debiasing using the relaxed lasso and the inclusion of adjustment covariates.

The lasso is known to shrink coefficients so as to exclude noise variables, but sometimes such shrinkage can degrade the predictive performance due to its effect on actual signal variables. [24] introduces the relaxed lasso to correct for the potential over-shrinkage of the original lasso estimator. They propose a refitting step on the active set of the lasso solution with less regularization, and a common way of using it is to fit a standard OLS on the active set. The active set coefficients are then set to

\hat{\beta}_{A, \text{Relax}}(\lambda) = \operatorname*{argmin}_{\beta_A \in \mathbb{R}^{|A|}} \; \|y - X_A \beta_A\|_2^2,

whereas the coefficients for the inactive set remain at 0. This refitting step can revert some of the shrinkage bias introduced by the vanilla lasso. It doesn't always reduce prediction error, due to the accompanying increase in variance when there are many variables in the model or when the signals are weak. That being said, we can still insert a relaxed lasso step with little effort in our iterative procedure: once a valid lasso solution is found for a new λ, we may refit with OLS. As we iterate, we can monitor the validation error for the lasso and the relaxed lasso. The relaxed lasso will generally end up choosing a smaller set of variables than the lasso solution in the optimal model.
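A minimal sketch of such a refit, for an in-memory matrix and a single lasso solution, is shown below; it is illustrative rather than the snpnet code.

# OLS refit on the active set of a lasso solution; inactive coefficients stay at 0.
relaxed_refit <- function(X, y, beta_lasso) {
  active <- which(beta_lasso != 0)
  beta_relaxed <- numeric(length(beta_lasso))
  if (length(active) > 0) {
    ols <- lm.fit(x = X[, active, drop = FALSE], y = y)
    coefs <- ols$coefficients
    coefs[is.na(coefs)] <- 0               # guard against rank deficiency
    beta_relaxed[active] <- coefs
  }
  beta_relaxed
}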

In some applications such as GWAS, there may be confounding variables Z ∈ ℝ^{n×q} that we want to adjust for in the model. Population stratification, defined as the existence of a systematic ancestry difference in the sample data, is one of the common factors in GWAS that can lead to spurious discoveries. This can be controlled for by including some leading principal components of the SNP matrix as variables in the regression [47]. In the presence of such variables, we instead solve

(\hat{\alpha}(\lambda), \hat{\beta}(\lambda)) = \operatorname*{argmin}_{\alpha \in \mathbb{R}^q, \, \beta \in \mathbb{R}^p} \; \frac{1}{2n} \|y - Z\alpha - X\beta\|_2^2 + \lambda \|\beta\|_1. \qquad (7)

This variation can be easily handled with small changes in the algorithm. Instead of initializing the residual with the response y, we set r^{(0)} equal to the residual from the regression of y on the covariates. In the fitting step, in addition to the variables in the strong set, we include the covariates but leave their coefficients unpenalized as in (7). Notice that if we want to find the relaxed lasso fit in the presence of adjustment covariates, we need to include those covariates in the OLS as well, i.e.,

(\hat{\alpha}_{\text{Relax}}(\lambda), \hat{\beta}_{A, \text{Relax}}(\lambda)) = \operatorname*{argmin}_{\alpha \in \mathbb{R}^q, \, \beta_A \in \mathbb{R}^{|A|}} \; \|y - Z\alpha - X_A \beta_A\|_2^2. \qquad (8)
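For an in-memory problem, one way to obtain fits of the form (7) with glmnet is to give the covariates a penalty factor of 0 so they are never penalized; this is an illustrative alternative to the handling described above, not the snpnet implementation.

# Adjustment covariates Z enter the model unpenalized via penalty.factor = 0.
library(glmnet)
fit_with_covariates <- function(X, Z, y) {
  XZ <- cbind(Z, X)                                  # covariates first, then SNPs
  pf <- c(rep(0, ncol(Z)), rep(1, ncol(X)))          # 0 = unpenalized, 1 = usual lasso penalty
  glmnet(XZ, y, penalty.factor = pf, standardize = FALSE)
}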

UK Biobank experiment details

We focused on 337,199 White British unrelated individuals out of the full set of over 500,000 from the UK Biobank dataset [11] that satisfy the same set of population stratification criteria as in [22]: (1) self-reported White British ancestry, (2) used to compute principal components, (3) not marked as outliers for heterozygosity and missing rates, (4) do not show putative sex chromosome aneuploidy, and (5) have at most 10 putative third-degree relatives. These criteria are meant to reduce the effects of confounding and unreliable observations.

The number of samples is large in the UK Biobank dataset, so we can afford to set aside an independent validation set without resorting to costly cross-validation to find an optimal regularization parameter. We also leave out a subset of observations as a test set to evaluate the final model. In particular, we randomly partition the original dataset so that 60% is used for training, 20% for validation and 20% for test. The lasso solution path is fit on the training set, the desired regularization is selected on the validation set, and the resulting model is evaluated on the test set.

We now discuss some further details of our application that one might also encounter in practice: adjustment for confounders, missing-value imputation and variable standardization in the algorithm.

In genetic studies, spurious associations are often found due to confounding factors. Among others, one major source is so-called population stratification [48]. To adjust for that effect, it is common to introduce the top principal components and include them in the regression model. Therefore, in the lasso method, we solve (7), where in addition to the SNP matrix X, we let Z include covariates such as age, sex and the top 10 PCs of the SNP matrix.

Missing values are present in the dataset. As in standard quality control in genetics, we first discard observations whose phenotypic value of interest is not available. We further exclude variants whose missing rate is greater than 10% or whose minor allele frequency (MAF) is less than 0.1%, which results in around 685,000 SNPs: specifically, 685,362 for height, 685,371 for BMI, 685,357 for asthma and 685,357 for HC. The number varies because the criteria are evaluated on the subset of individuals whose phenotypic value is observed (after excluding the missing ones), which can differ across phenotypes. For the remaining variants, mean imputation is conducted to fill in the missing SNP values; that is, the missing values in every SNP are imputed with the mean observed level of that SNP in the population under study.
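A minimal in-memory sketch of these filtering and imputation steps is shown below for an n × p dosage matrix G with NA marking missing genotypes; it is illustrative only, since in practice the filtering is done with PLINK-style tooling on the compressed genotype files.

# Variant QC (missing rate and MAF thresholds) followed by per-variant mean imputation.
qc_and_impute <- function(G, max_missing = 0.10, min_maf = 0.001) {
  miss <- colMeans(is.na(G))
  af   <- colMeans(G, na.rm = TRUE) / 2              # allele frequency from 0/1/2 dosages
  maf  <- pmin(af, 1 - af)                           # fold to the minor allele
  G    <- G[, miss <= max_missing & maf >= min_maf, drop = FALSE]
  for (j in seq_len(ncol(G))) {                      # mean imputation, one variant at a time
    mu <- mean(G[, j], na.rm = TRUE)
    G[is.na(G[, j]), j] <- mu
  }
  G
}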

When it comes to the lasso fitting, there are some subtleties that can affect its variable selection and prediction performance. One of them is variable standardization. It is often a step done without much thought to deal with heterogeneity in variables so that they are treated fairly in the objective. However, in our studies standardization may create some undesired effects. To see this, notice that all the SNPs can only take values in 0, 1, 2 and NA; they are already on the same scale by nature. Standardization would use the current standard deviation of each predictor as the divisor to equalize the variance across all predictors in the lasso fitting that follows. In this case, standardization would unintentionally inflate the magnitude of rare variants and give them an advantage in the selection process, since their coefficients effectively receive less penalty after standardization. In Fig 7, we show the distribution of standard deviations across all variants in our dataset. Hence, to avoid potential spurious findings, we choose not to standardize the variants in the experiments.

Fig 7. Histogram of the standard deviations of the SNPs.


They are computed after mean imputation of the missing values because they would be the exact standardization factors to be used if the lasso were applied with variable standardization on the mean-imputed SNP matrix.

Computational optimization in software implementation

Among the iterative steps in BASIL, screening and checking are where we need to deal with the full dataset. To deal with the memory bound, we can use memory-mapped I/O. In R, bigmemory [49] provides a convenient implementation for this purpose. That being said, we do not want to rely on it for intensive computation modules such as cyclical coordinate descent, because frequent visits to the on-disk data would still be slow. Instead, since the subset of strong variables is small, we can afford to bring them into memory and do fast lasso fitting there. We only use the full memory-mapped dataset in KKT checking and screening. Moreover, since checking in the current iteration can be done together with the screening in the next iteration, effectively only one expensive pass over the full dataset is needed per iteration.
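The snippet below sketches how such a pass can look with a file-backed bigmemory matrix, streaming columns in chunks to compute the inner products used for screening and checking; it is illustrative, and the file name genotype.desc is a placeholder (snpnet itself works with PLINK's compressed formats).

# Chunked inner products against a file-backed matrix; only one chunk of columns is
# ever resident in memory.
library(bigmemory)
X <- attach.big.matrix("genotype.desc")              # descriptor of a file-backed matrix

inner_products <- function(X, r, chunk = 10000) {
  p   <- ncol(X)
  out <- numeric(p)
  for (start in seq(1, p, by = chunk)) {
    cols <- start:min(start + chunk - 1, p)
    out[cols] <- crossprod(X[, cols], r)             # X[, cols] materializes just this chunk
  }
  out
}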

In addition, we use a set of techniques to speed up the computation. First, the KKT check can be easily parallelized by splitting on the features when multi-core machines are available. The speedup of this part is immediate and (slightly less than) proportional to the number of cores available. Second, specific to the application, we exploit the fact that there are only 4 levels for each SNP value and design a faster inner product routine to replace normal floating-point multiplication in the KKT check step. In fact, given any SNP vector x ∈ {0, 1, 2, μ}^n, where μ is the imputed value for the missing entries, we can write the dot product with a vector r ∈ ℝ^n as

x^\top r = \sum_{i=1}^{n} x_i r_i = 1 \cdot \sum_{i: x_i = 1} r_i + 2 \cdot \sum_{i: x_i = 2} r_i + \mu \cdot \sum_{i: x_i = \mu} r_i. \qquad (9)

We see that the terms corresponding to a 0 SNP value can be ignored because they don't contribute to the final result. This significantly reduces the number of arithmetic operations needed to compute the inner product with rare variants. Further, we only need to set up 3 registers, one for each nonzero SNP value, accumulating the corresponding terms in r. A series of multiplications is thereby converted to summations. In our UK Biobank studies, although the SNP matrix is not sparse enough to exploit a sparse matrix representation, it still has around 70% 0's. We conduct a small experiment to compare the time needed to compute XR, where X ∈ {0, 1, 2, 3}^{n×p} and R ∈ ℝ^{p×k}. The proportions of the levels in X are about 70%, 10%, 10%, 10%, similar to the distribution of SNP levels in our study, and R resembles the residual matrix when checking the KKT condition. The number of residual vectors is k = 20. The mean time over 100 repetitions is shown in Table 4.
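For illustration, here is the same level-bucketed computation written in R; the timing comparison in Table 4 uses C++ implementations, so this sketch only demonstrates the arithmetic of (9).

# Level-bucketed inner product: skip the (roughly 70%) entries equal to 0.
snp_dot <- function(x, r, mu) {          # x takes values 0, 1, 2 or mu (imputed missing)
  sum(r[x == 1]) + 2 * sum(r[x == 2]) + mu * sum(r[x == mu])
}

# Quick check against the naive computation.
set.seed(1)
x <- sample(c(0, 1, 2, 0.1), 1000, replace = TRUE, prob = c(0.7, 0.1, 0.1, 0.1))
r <- rnorm(1000)
all.equal(snp_dot(x, r, mu = 0.1), sum(x * r))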

Table 4. Timing performance (milliseconds) on multiplication of SNP matrix and residual matrix.

The methods are all implemented in C++ and run on a Macbook with 2.9 GHz Intel Core i7 and 8 GB 1600 MHz DDR3.

Multiplication Method | n = 200, p = 800 | n = 2000, p = 8000
Standard              | 3.20             | 306.01
SNP-Optimized         | 1.32             | 130.21
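
As a concrete illustration of Eq (9), the following sketch computes the inner product of a SNP vector with a residual vector by accumulating the residual sums at the three nonzero levels and skipping the zeros. It is a vectorized, didactic analogue of the C++ routine used in the package; the data and the imputed value mu below are placeholders.

```r
# Inner product of a SNP vector x (values 0, 1, 2, or mu for imputed missing entries)
# with a residual vector r, following Eq (9): level-0 entries contribute nothing, so
# only the residual sums at the three nonzero levels are accumulated.
snp_dot <- function(x, r, mu) {
  s1 <- sum(r[x == 1])        # accumulate residuals where the SNP value is 1
  s2 <- sum(r[x == 2])        # ... where it is 2
  sm <- sum(r[x == mu])       # ... where the missing value was imputed to mu
  s1 + 2 * s2 + mu * sm       # only a handful of multiplications remain
}

# Quick check against the ordinary dot product on placeholder data.
set.seed(1)
mu <- 0.1
x  <- sample(c(0, 1, 2, mu), 1000, replace = TRUE, prob = c(0.7, 0.1, 0.1, 0.1))
r  <- rnorm(1000)
all.equal(snp_dot(x, r, mu), sum(x * r))   # TRUE
```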

We implement the procedure with all the optimizations above in an R package called snpnet, which is currently available at https://github.com/junyangq/snpnet and will be submitted to the CRAN repository of R packages. While most of the numerical experiments throughout the paper are based on an earlier version of the package (available in the V1.0 branch of the repository) that assumes the bed file format provided by PLINK 1.9, we highly recommend using the current version, which works with the pgen file format provided by PLINK 2.0 [13, 28]. It takes advantage of PLINK 2.0's new R interface as well as its efficient --variant-score module for matrix multiplication. The module combines multithreading, a good linear algebra library, and an alternate code path for very-low-MAF SNPs (similar to the one proposed in (9)) to make the computation even faster. To achieve better efficiency in each lasso fitting, we suggest using snpnet together with glmnetPlus, a warm-started version of glmnet, which is currently available at https://github.com/junyangq/glmnetPlus. It allows one to provide a good initialization of the coefficients and fit part of the solution path, instead of always starting from the all-zero solution as glmnet does.
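
For orientation, a minimal usage sketch is shown below. The argument names (genotype.pfile, phenotype.file, phenotype, covariates, family) reflect our reading of the package interface at the time of writing and should be treated as assumptions; the documentation in the repository above is the authoritative reference, and the paths are placeholders.

```r
library(snpnet)

# Fit a lasso path on a PLINK 2.0 pgen/pvar/psam fileset (all paths are placeholders).
fit <- snpnet(
  genotype.pfile = "data/ukb_genotype",   # prefix of the .pgen/.pvar/.psam files
  phenotype.file = "data/phenotype.tsv",  # phenotype and covariate table
  phenotype      = "height",              # column name of the response
  covariates     = c("age", "sex", paste0("PC", 1:10)),
  family         = "gaussian"             # binomial and Cox families are also supported
)
```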

Related methods and packages

There are a number of existing screening rules for solving big lasso problems. [50] use a screened set to scale down the logistic lasso problem and check the KKT condition to validate the solution. Their focus, however, is on selecting a lasso model of a particular size, and only the initial screened set is expanded if the KKT condition is violated. In contrast, we are interested in finding the whole solution path (before overfitting). We adopt a sequential approach and keep updating the screened set at each iteration, which potentially allows us to keep the screened set small as we move along the solution path. Other rules include the SAFE rule [51], Sure Independence Screening [52], and the DPP and EDPP rules [53].

We expand a bit on these screening rules. [52] exploits marginal correlation information to conduct screening, but the focus there is not the optimization algorithm. Most of the screening rules mentioned above (except for EDPP) use the inner product with the current residual vector to measure the importance of each predictor at the next λ; those under a threshold can be ignored. The key difference across these rules is the threshold used and whether the resulting discard is safe. If it is safe, one can guarantee that only one iteration is needed for each λ value, whereas other rules would need more rounds if an active variable were falsely discarded. Though the strong rules rarely make this mistake, safe screening is still a nice feature to have in single-λ solutions. However, under the batch mode we adopt to reduce the number of full passes over the dataset, the advantage of a safe threshold is less pronounced. In fact, one way we might leverage the safe rules in batch mode is to first find, based on the current inner products and the rules' safe threshold, the set of candidate predictors for the several λ values up to λk that we wish to solve in the next iteration, and then solve the lasso for these parameters. Since these rules can often be conservative, we would then have a strong incentive to solve for, say, one further λ value λk+1: if the current screening turns out to be valid there as well, we obtain one more lasso solution and move one step further along the λ sequence. This can potentially save one iteration of the procedure and thus one expensive pass over the dataset. The only cost is computing the lasso solution at one more value λk+1 and computing the inner products with one more residual vector at λk+1 (to check the KKT condition). The latter can be done in the same pass as computing the inner products at λk in preparation for the next iteration's screening, so no additional pass is needed. Thus, under the batch mode, the safety of the screening may not be as important given the incentive for aggressive model fitting. Nevertheless, it would be interesting to explore EDPP-type batch screening in the future; EDPP uses inner products with a modification of the residual vector, whereas our algorithm relies on inner products with the vanilla residual vector.
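
To make the inner-product threshold concrete, the sketch below applies the sequential strong rule of [15] in its simplest Gaussian form: a predictor survives the screen for the next value λk+1 only if the absolute inner product of its column with the current residual exceeds 2λk+1 − λk. This is a schematic single-λ version with glmnet's 1/n objective scaling assumed, not the batch variant used in BASIL.

```r
# Sequential strong rule (Gaussian case, 1/n objective scaling as in glmnet):
# given the residual r_k from the fit at lambda_k, predictor j is discarded for
# lambda_{k+1} whenever |x_j' r_k| / n < 2 * lambda_{k+1} - lambda_k.
strong_rule_keep <- function(X, r_k, lambda_k, lambda_k1) {
  score <- abs(crossprod(X, r_k)) / nrow(X)   # |x_j' r_k| / n for every column j
  which(score >= 2 * lambda_k1 - lambda_k)    # indices of predictors that survive the screen
}
```

BASIL applies the same kind of inner-product screening, but to a batch of λ values at once, and then verifies the KKT conditions in the same full pass, as described above.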

To address the large-scale lasso problems, several packages have been developed such as biglasso [54], bigstatsr [55], oem [56] and the lasso routine from PLINK 1.9 [13, 14].

Among them, oem specializes in tall data (big n) and can be slow when p > n. In many real-data applications, including ours, the data are both large-sample and high-dimensional. However, we might still be able to use oem for the small lasso subroutine, since a large number of variables will have already been excluded. The other packages, biglasso, bigstatsr and PLINK 1.9, all provide efficient implementations of pathwise coordinate descent with warm starts. PLINK 1.9 is specifically developed for genetic datasets and is widely used in GWAS and population genetics research. In bigstatsr, the big_spLinReg function adapts the biglasso function from biglasso and incorporates a Cross-Model Selection and Averaging (CMSA) procedure, a variant of cross-validation that saves computation by directly averaging the results from different folds instead of retraining the model at the chosen optimal parameter. Both packages use memory mapping to process larger-than-RAM, on-disk datasets as if they were in memory, and on top of that implement coordinate descent with strong rules and warm starts.

The main difference between BASIL and the algorithms these packages use is that BASIL tries to solve a series of models in every full scan of the dataset (at checking and screening) and thus effectively reduces the number of passes over the dataset. This difference may not be significant in small or moderate-sized problems, but can be critical in big-data applications, especially when the dataset cannot be fully loaded into memory. A full scan of a larger-than-RAM dataset incurs a lot of swapping between memory and disk, and thus a lot of disk I/O operations, which are known to be orders of magnitude slower than in-memory operations. Reducing the number of full scans can therefore greatly improve the overall performance of the algorithm.

Aside from efficiency considerations, all of the aforementioned packages have to re-implement, for the big-data context, a variety of features that already exist in many small-data solutions, and currently they don't provide as much functionality as needed in our real-data application. First, in the current implementations, PLINK 1.9 only supports the Gaussian family, and biglasso and bigstatsr only support the Gaussian and binomial families, whereas snpnet can easily extend to other regression families and already has the Gaussian, binomial and Cox families built in. Also, biglasso, bigstatsr and PLINK 1.9 all standardize the predictors beforehand, but in many applications, such as our UK Biobank studies, it is more reasonable to leave the predictors unstandardized. In addition, it can take some effort to convert the data to the format required by these packages. This becomes a headache if the raw data is in some special format and one cannot afford to first convert the full dataset into an intermediate format for which biglasso or bigstatsr provides a conversion tool. This can happen, for example, if the raw data is highly compressed in a special format. For the BED binary format we work with in our application, the readRAW_big.matrix function from BGData can convert a raw file to the big.matrix object required by biglasso, and the snp_readBed function from bigsnpr [55] allows one to convert it to the FBM object required by bigstatsr. However, bigsnpr doesn't take input data with any missing values, which can be prevalent in an SNP matrix (as in our application). Although PLINK 1.9 works directly with the BED binary file, its lasso solver currently only supports the Gaussian family and doesn't return the full solution path. Instead, it returns the solution at the smallest λ value computed and needs a good heritability estimate as input from the user, which may not be immediately available.

We summarize the main advantages of the BASIL algorithm:

  • Input data flexibility. Our algorithm allows one to deal directly with any data type as long as the screening and checking steps are implemented, which is often very lightweight development work such as matrix multiplication. This can be important in large-scale applications, especially when the data is stored in a compressed format or distributed across machines, since we would not need to unpack the full data and can conduct the KKT check and screening in its original format. Only a small screened subset of the data needs to be converted to the format required by the lasso solver in the fitting step.

  • Model flexibility. We can easily transfer to the big-data context the modeling flexibility provided by existing packages such as glmnet, including options for standardization, sample weights, lower/upper coefficient limits, and other families of generalized linear models. This can be useful, for example, when we do not want to standardize predictors that are already in the same unit, to avoid unintentionally penalizing them differently because of differences in their variance.

  • Effortless development. The BASIL algorithm allows one to maximally reuse existing lasso solvers for small or moderate-sized problems. The main extra work is an implementation of batch screening and the KKT check for a particular data type. For example, in the snpnet package, we are able to quickly extend the in-memory glmnet solution to large-scale, ultrahigh-dimensional SNP data. Moreover, the convenient data interface provided by the BEDMatrix package further facilitates our implementation.

  • Computational efficiency. Our design reduces the number of visits to the original data that sits on disk, which is crucial to overall efficiency since disk reads can be orders of magnitude slower than reads from RAM. The key to achieving this is to bring batches of promising variables into main memory, aiming to find the lasso solutions for more than one λ value per iteration and to check the KKT condition for those λ values in a single pass over the entire dataset.

Lastly, we provide a timing comparison with existing packages. As mentioned in previous sections, these packages provide different functionalities and place different restrictions on the dataset. For example, most of them (biglasso, bigstatsr) assume that there are no missing values or that the missing ones have already been imputed; bigsnpr, for example, additionally disallows SNPs with 0 MAF. Some packages always standardize the variants before fitting the lasso. To provide a common playground, we create a synthetic dataset with no missing values and follow a standardized lasso procedure in the fitting stage, simply to test the computation. The dataset has 50,000 samples and 100,000 variables, and each entry takes a value in the SNP range, i.e., 0, 1, or 2. We fit the first 50 lasso solutions along a prefix of a λ sequence containing 100 initial values (mimicking the early stopping seen for most phenotypes). The total time spent is displayed in Table 5. For bigstatsr, we include two versions since it performs cross-validation by default: in one we make it comply with our single train/val/test split, while in the other we use its default 10-fold cross-validation, Cross-Model Selection and Averaging (CMSA). Notice that the final solution of CMSA differs from the exact lasso solution on the full data, because the returned coefficient vector is a linear combination of the coefficient vectors from the 10 folds rather than from a model retrained on the full data. We use 128 GB of memory and 16 cores for the computation.

Table 5. Timing comparison on a synthetic dataset of size n = 50, 000 and p = 100, 000.

The time for bigstatsr and biglasso has two components: one for the conversion to the desired data type and the other for the actual computation. The experiments are all run with 16 cores and 64 GB memory.

R Package             | Elapsed Time (minutes)
bigstatsr [55]        | 2.93 + 56.80
bigstatsr + CMSA [55] | 2.93 + 101.75
biglasso [54]         | 4.55 + 54.27
PLINK [13, 14]        | 53.52
snpnet                | 44.79

From the table, we see that snpnet is about 20% faster than the other packages considered. The numbers before the “+” sign are the time spent converting the raw data to the format required by those packages; the second numbers are the time spent on the actual computation.

It is important to note, though, that the performance relies not only on the algorithm but also heavily on the implementation. The other packages in the comparison all have their major computation done in C++ or Fortran. Ours, designed as a meta-algorithm that users can easily combine with any lasso solver in R, still has a significant portion (the iterations) in R and involves multiple rounds of cross-language communication, which can degrade timing performance to some degree. If further speed is desired, there is still room for improvement with a more dedicated implementation.
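
For reference, the sketch below generates a down-scaled synthetic SNP matrix of the kind used in this benchmark and fits an early-stopped lasso path with glmnet; the dimensions, level proportions, and effect sizes are placeholders and not those behind Table 5.

```r
set.seed(1)
n <- 2000; p <- 5000                          # down-scaled from n = 50,000 and p = 100,000
X <- matrix(sample(0:2, n * p, replace = TRUE,
                   prob = c(0.7, 0.2, 0.1)),  # SNP-like levels with no missing values
            nrow = n, ncol = p)
beta <- c(rnorm(50), rep(0, p - 50))          # a sparse set of true effects (placeholder)
y <- as.numeric(X %*% beta + rnorm(n))

library(glmnet)
# Compute a 100-value lambda path and keep only the first 50 solutions, mirroring the
# early-stopped paths timed in the comparison; standardization matches the common
# benchmark protocol described above.
fit <- glmnet(X, y, family = "gaussian", nlambda = 100, standardize = TRUE)
n_kept <- min(50, length(fit$lambda))
beta_hat <- coef(fit, s = fit$lambda[n_kept])  # coefficients at the 50th lambda value
```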

Supporting information

S1 Appendix. Results for body mass index (BMI).

(PDF)

S2 Appendix. Results for asthma.

(PDF)

S3 Appendix. Results for high cholesterol.

(PDF)

S4 Appendix. Manhattan plots.

(PDF)

S1 Table. R2 values for BMI (without refitting).

For lasso and relaxed lasso, the chosen model is based on maximum R2 on the validation set. Model (3) to (9) each includes Model (2) plus their own specification as stated in the Form column. The elastic-net picks α = 0.1 based on the validation performance.

(PDF)

S2 Table. R2 values for body mass index by different PRS methods with refitting.

(PDF)

S3 Table. AUC values for asthma (without refitting).

For lasso and relaxed lasso, the chosen model is based on maximum AUC on the validation set. Model (3) to (9) each includes Model (2) plus their own specification as stated in the Form column. The elastic-net picks α = 0.1 based on the validation performance.

(PDF)

S4 Table. AUC values for asthma by different PRS methods with refitting.

(PDF)

S5 Table. AUC values for high cholesterol (without refitting).

For lasso and relaxed lasso, the chosen model is based on maximum AUC on the validation set. Model (3) to (9) each includes Model (2) plus their own specification as stated in the Form column. The elastic-net picks α = 0.9 based on the validation performance.

(PDF)

S6 Table. AUC values for high cholesterol by different PRS methods with refitting.

(PDF)

S1 Fig. R2 plot for body mass index.

The primary horizontal axis on the bottom represents the index of the lambda values, from 1 to L, corresponding to the sequence of regularization parameters λ1 > λ2 > ⋯ > λL. The top axis shows the number of active variables in the model. ReLasso: relaxed lasso.

(TIF)

S2 Fig. Actual body mass index (BMI) versus predicted BMI on 5000 random samples from the test set.

A regression line with its 95% confidence band is also added on top of the dots. The correlation between actual BMI and predicted BMI is 0.3256.

(TIF)

S3 Fig. Residuals of lasso prediction for body mass index.

Standard deviation of the residual is 4.51 kg/m2.

(TIF)

S4 Fig. AUC plot for asthma.

The primary horizontal axis on the bottom represents the index of the lambda values, from 1 to L, corresponding to the sequence of regularization parameters λ1 > λ2 > ⋯ > λL. The top axis shows the number of active variables in the model. ReLasso: relaxed lasso.

(TIF)

S5 Fig. Box plot of the percentile of the linear prediction score among cases versus controls for asthma.

This is based on the optimal lasso model.

(TIF)

S6 Fig. Stratified prevalence across different percentile bins based on the predicted scores for asthma.

This is based on the optimal lasso model.

(TIF)

S7 Fig. AUC plot for high cholesterol.

The primary horizontal axis on the bottom represents the index of the lambda values, from 1 to L, corresponding to the sequence of regularization parameters λ1 > λ2 > ⋯ > λL. The top axis shows the number of active variables in the model. ReLasso: relaxed lasso.

(TIF)

S8 Fig. Box plot of the percentile of the linear prediction score among cases versus controls for high cholesterol.

This is based on the optimal lasso model.

(TIF)

S9 Fig. Stratified prevalence across different percentile bins based on the predicted scores for high cholesterol.

This is based on the optimal lasso model.

(TIF)

S10 Fig. ROC curve for asthma.

This is based on the optimal lasso model.

(TIF)

S11 Fig. ROC curve for high cholesterol.

This is based on the optimal lasso model.

(TIF)

S12 Fig. Manhattan plot of the univariate p-values for height.

This is based on the optimal lasso model. The vertical axis shows −log10(p) for each SNP. The red horizontal line represents a reference level of p = 5 × 10−8.

(TIF)

S13 Fig. Manhattan plot of the univariate p-values for body mass index.

This is based on the optimal lasso model. The vertical axis shows −log10(p) for each SNP. The red horizontal line represents a reference level of p = 5 × 10−8.

(TIF)

S14 Fig. The lasso coefficients for height.

This is based on the optimal lasso model. The vertical axis shows the magnitude of the coefficients from snpnet. The SNPs with relatively large lasso coefficients are highlighted in green.

(TIF)

S15 Fig. The lasso coefficients for body mass index.

This is based on the optimal lasso model. The vertical axis shows the magnitude of the coefficients from snpnet. The SNPs with relatively large lasso coefficients are highlighted in green.

(TIF)

S16 Fig. Manhattan plot of the univariate p-values for asthma.

This is based on the optimal lasso model. The vertical axis shows −log10(p) for each SNP. The red horizontal line represents a reference level of p = 5 × 10−8.

(TIF)

S17 Fig. Manhattan plot of the univariate p-values for high cholesterol.

This is based on the optimal lasso model. The vertical axis shows −log10(p) for each SNP. The red horizontal line represents a reference level of p = 5 × 10−8.

(TIF)

S18 Fig. The lasso coefficients for asthma.

This is based on the optimal lasso model. The vertical axis shows the magnitude of the coefficients from snpnet. The SNPs with relatively large lasso coefficients are highlighted in green.

(TIF)

S19 Fig. The lasso coefficients for high cholesterol.

This is based on the optimal lasso model. The vertical axis shows the magnitude of the coefficients from snpnet. The SNPs with relatively large lasso coefficients are highlighted in green.

(TIF)

Acknowledgments

We thank Balasubramanian Narasimhan for helpful discussions on the package development, and Kenneth Tay and the members of the Rivas lab for insightful feedback. We thank the SBayesR and GCTB authors for their invaluable feedback and help with our SBayesR experiments.

This research has been conducted using the UK Biobank Resource under application number 24983. We thank all the participants in the study. The primary and processed data used to generate the analyses presented here are available in the UK Biobank access management system (https://amsportal.ukbiobank.ac.uk/) for application 24983, “Generating effective therapeutic hypotheses from genomic and hospital linkage data” (http://www.ukbiobank.ac.uk/wp-content/uploads/2017/06/24983-Dr-Manuel-Rivas.pdf), and the results are displayed in the Global Biobank Engine (https://biobankengine.stanford.edu).

Some of the computing for this project was performed on the Sherlock cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results.

Data Availability

The analyses presented in this study were based on data accessed through the UK Biobank: http://www.ukbiobank.ac.uk. Experiment scripts are available on GitHub (https://github.com/junyangq/scripts_snpnet_paper).

Funding Statement

This research has been conducted using the UK Biobank Resource under application number 24983, "Generating effective therapeutic hypotheses from genomic and hospital linkage data" (http://www.ukbiobank.ac.uk/wp-content/uploads/2017/06/24983-Dr-Manuel-Rivas.pdf). This work was supported by National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) under awards R01HG010140. This work was funded in part by the Two Sigma Graduate Fellowship (to J.Q.), Funai Overseas Scholarship from Funai Foundation for Information Technology and the Stanford University School of Medicine (to Y.T.). NIH grants 5U01HG009080 (to M.A.R.), 5R01EB001988-16 (to R.T.), 5R01EB 001988-21 (to T.H.). NSF grants 19DMS1208164 (to R.T.) and DMS-1407548 (to T.H.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funders; funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Friedman J, Hastie T, Tibshirani R. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. Springer Series in Statistics. Springer-Verlag; 2009.
  • 2. Efron B, Hastie T. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. vol. 5. Cambridge University Press; 2016.
  • 3. Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. Commun ACM. 2008;51(1):107–113. doi:10.1145/1327452.1327492
  • 4. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: Cluster Computing with Working Sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud'10. Berkeley, CA, USA: USENIX Association; 2010. p. 10–10. Available from: http://dl.acm.org/citation.cfm?id=1863103.1863113.
  • 5. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, et al. TensorFlow: A System for Large-scale Machine Learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation. OSDI'16. Berkeley, CA, USA: USENIX Association; 2016. p. 265–283. Available from: http://dl.acm.org/citation.cfm?id=3026877.3026899.
  • 6. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B (Methodological). 1996;58(1):267–288. doi:10.1111/j.2517-6161.1996.tb02080.x
  • 7. R Core Team. R: A Language and Environment for Statistical Computing; 2017. Available from: https://www.R-project.org/.
  • 8. Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software; 2010. doi:10.18637/jss.v033.i01
  • 9. Breheny P, Huang J. Coordinate Descent Algorithms for Nonconvex Penalized Regression, with Applications to Biological Feature Selection. The Annals of Applied Statistics. 2011;5(1):232–253. doi:10.1214/10-AOAS388
  • 10. Hastie T. Statistical Learning with Big Data; 2015. Presentation at Data Science at Stanford Seminar. Available from: https://web.stanford.edu/~hastie/TALKS/SLBD_new.pdf.
  • 11. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank Resource with Deep Phenotyping and Genomic Data. Nature. 2018;562(7726):203–209. doi:10.1038/s41586-018-0579-z
  • 12. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. The American Journal of Human Genetics. 2017;101(1):5–22. doi:10.1016/j.ajhg.2017.06.005
  • 13. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4(1). doi:10.1186/s13742-015-0047-8
  • 14. Purcell S, Chang C. PLINK 1.9; 2015. Available from: www.cog-genomics.org/plink/1.9/.
  • 15. Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, et al. Strong Rules for Discarding Predictors in Lasso-Type Problems. Journal of the Royal Statistical Society Series B (Statistical Methodology). 2012;74(2):245–266. doi:10.1111/j.1467-9868.2011.01004.x
  • 16. Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; 2004.
  • 17. Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Articles. 2010;33(1):1–22.
  • 18. Cox DR. Regression Models and Life-Tables. Journal of the Royal Statistical Society Series B (Methodological). 1972;34(2):187–220. doi:10.1111/j.2517-6161.1972.tb00899.x
  • 19. Li R, Chang C, Justesen JM, Tanigawa Y, Qian J, Hastie T, et al. Fast Lasso method for Large-scale and Ultrahigh-dimensional Cox Model with applications to UK Biobank. Biostatistics, kxaa038. 2020. doi:10.1093/biostatistics/kxaa038
  • 20. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–320. doi:10.1111/j.1467-9868.2005.00503.x
  • 21. Lello L, Avery SG, Tellier L, Vazquez AI, de los Campos G, Hsu SDH. Accurate Genomic Prediction of Human Height. Genetics. 2018;210(2):477–497. doi:10.1534/genetics.118.301267
  • 22. DeBoever C, Tanigawa Y, Lindholm ME, McInnes G, Lavertu A, Ingelsson E, et al. Medical Relevance of Protein-Truncating Variants across 337,205 Individuals in the UK Biobank Study. Nature Communications. 2018;9(1):1612. doi:10.1038/s41467-018-03910-9
  • 23. Wold H. Soft Modelling by Latent Variables: The Non-Linear Iterative Partial Least Squares (NIPALS) Approach. Journal of Applied Probability. 1975;12(S1):117–142. doi:10.1017/S0021900200047604
  • 24. Meinshausen N. Relaxed Lasso. Computational Statistics & Data Analysis. 2007;52(1):374–393. doi:10.1016/j.csda.2006.12.019
  • 25. Tanigawa Y, Li J, Justesen JM, Horn H, Aguirre M, DeBoever C, et al. Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology. Nature Communications. 2019;10(1):4064. doi:10.1038/s41467-019-11953-9
  • 26. Ge T, Chen CY, Ni Y, Feng YCA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature Communications. 2019;10(1):1776. doi:10.1038/s41467-019-09718-5
  • 27. Lloyd-Jones LR, Zeng J, Sidorenko J, Yengo L, Moser G, Kemper KE, et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nature Communications. 2019;10(1):5086. doi:10.1038/s41467-019-12653-0
  • 28. Purcell S, Chang C. PLINK 2.0; 2020. Available from: www.cog-genomics.org/plink/2.0/.
  • 29. Zeng J, De Vlaming R, Wu Y, Robinson MR, Lloyd-Jones LR, Yengo L, et al. Signatures of negative selection in the genetic architecture of human complex traits. Nature Genetics. 2018;50(5):746–753. doi:10.1038/s41588-018-0101-4
  • 30. Silventoinen K, Sammalisto S, Perola M, Boomsma DI, Cornes BK, Davis C, et al. Heritability of Adult Body Height: A Comparative Study of Twin Cohorts in Eight Countries. Twin Research. 2003;6(5):399–408. doi:10.1375/136905203770326402
  • 31. Visscher PM, Medland SE, Ferreira MAR, Morley KI, Zhu G, Cornes BK, et al. Assumption-Free Estimation of Heritability from Genome-Wide Identity-by-Descent Sharing between Full Siblings. PLOS Genetics. 2006;2(3):e41. doi:10.1371/journal.pgen.0020041
  • 32. Visscher PM, McEvoy B, Yang J. From Galton to GWAS: Quantitative Genetics of Human Height. Genetics Research. 2010;92(5-6):371–379. doi:10.1017/S0016672310000571
  • 33. Zaitlen N, Kraft P, Patterson N, Pasaniuc B, Bhatia G, Pollack S, et al. Using Extended Genealogy to Estimate Components of Heritability for 23 Quantitative and Dichotomous Traits. PLOS Genetics. 2013;9(5):e1003520. doi:10.1371/journal.pgen.1003520
  • 34. Hemani G, Yang J, Vinkhuyzen A, Powell J, Willemsen G, Hottenga JJ, et al. Inference of the Genetic Architecture Underlying BMI and Height with the Use of 20,240 Sibling Pairs. The American Journal of Human Genetics. 2013;93(5):865–875. doi:10.1016/j.ajhg.2013.10.005
  • 35. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs Explain a Large Proportion of the Heritability for Human Height. Nature Genetics. 2010;42:565. doi:10.1038/ng.608
  • 36. Yang J, Bakshi A, Zhu Z, Hemani G, Vinkhuyzen AAE, Lee SH, et al. Genetic Variance Estimation with Imputed Variants Finds Negligible Missing Heritability for Human Height and Body Mass Index. Nature Genetics. 2015;47:1114. doi:10.1038/ng.3390
  • 37. Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, et al. Hundreds of Variants Clustered in Genomic Loci and Biological Pathways Affect Human Height. Nature. 2010;467:832. doi:10.1038/nature09410
  • 38. Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S, et al. Defining the Role of Common Variation in the Genomic and Biological Architecture of Adult Human Height. Nature Genetics. 2014;46:1173. doi:10.1038/ng.3097
  • 39. Marouli E, Graff M, Medina-Gomez C, Lo KS, Wood AR, Kjaer TR, et al. Rare and Low-Frequency Coding Variants Alter Human Adult Height. Nature. 2017;542:186. doi:10.1038/nature21039
  • 40. Parikh N, Boyd S. Proximal Algorithms. Foundations and Trends in Optimization. 2014;1(3):127–239. doi:10.1561/2400000003
  • 41. Xiao L. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization. Journal of Machine Learning Research. 2010;11(88):2543–2596.
  • 42. Duchi JC, Agarwal A, Wainwright MJ. Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. IEEE Transactions on Automatic Control. 2012;57(3):592–606. doi:10.1109/TAC.2011.2161027
  • 43. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of Lasso and Dantzig selector. Ann Statist. 2009;37(4):1705–1732. doi:10.1214/08-AOS620
  • 44. Zhao P, Yu B. On model selection consistency of Lasso. Journal of Machine Learning Research. 2006;7(90):2541–2563.
  • 45. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics. 1988;44(3):837–845. doi:10.2307/2531595
  • 46. Cortes C, Mohri M. Confidence intervals for the area under the ROC curve. In: Advances in Neural Information Processing Systems; 2005. p. 305–312.
  • 47. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal Components Analysis Corrects for Stratification in Genome-Wide Association Studies. Nature Genetics. 2006;38:904. doi:10.1038/ng1847
  • 48. Patterson N, Price AL, Reich D. Population Structure and Eigenanalysis. PLOS Genetics. 2006;2(12):1–20. doi:10.1371/journal.pgen.0020190
  • 49. Kane MJ, Emerson J, Weston S. Scalable Strategies for Computing with Massive Data. Journal of Statistical Software. 2013;55(14):1–19. doi:10.18637/jss.v055.i14
  • 50. Sobel E, Lange K, Wu TT, Hastie T, Chen YF. Genome-Wide Association Analysis by Lasso Penalized Logistic Regression. Bioinformatics. 2009;25(6):714–721. doi:10.1093/bioinformatics/btp041
  • 51. El Ghaoui L, Viallon V, Rabbani T. Safe Feature Elimination for the Lasso and Sparse Supervised Learning Problems. arXiv preprint arXiv:1009.4219. 2010.
  • 52. Fan J, Lv J. Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008;70(5):849–911. doi:10.1111/j.1467-9868.2008.00674.x
  • 53. Wang J, Wonka P, Ye J. Lasso Screening Rules via Dual Polytope Projection. Journal of Machine Learning Research. 2015;16:1063–1101.
  • 54. Zeng Y, Breheny P. The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. arXiv preprint arXiv:1701.05936. 2017.
  • 55. Privé F, Blum MGB, Aschard H, Ziyatdinov A. Efficient Analysis of Large-Scale Genome-Wide Data with Two R packages: bigstatsr and bigsnpr. Bioinformatics. 2018;34(16):2781–2787. doi:10.1093/bioinformatics/bty185
  • 56. Huling JD, Qian PZ. Fast Penalized Regression and Cross Validation for Tall Data with the oem Package. arXiv preprint arXiv:1801.09661. 2018.
  • 57. Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, Jackson AU, et al. Association Analyses of 249,796 Individuals Reveal 18 New Loci Associated with Body Mass Index. Nature Genetics. 2010;42:937. doi:10.1038/ng.686
  • 58. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, et al. Genetic Studies of Body Mass Index Yield New Insights for Obesity Biology. Nature. 2015;518:197. doi:10.1038/nature14177
  • 59. Turner SD. qqman: An R Package for Visualizing GWAS Results Using Q-Q and Manhattan Plots. Journal of Open Source Software. 2018;3(25):731. doi:10.21105/joss.00731

Decision Letter 0

Gregory P Copenhaver, Xiaofeng Zhu

20 Mar 2020

Dear Dr Hastie,

Thank you very much for submitting your Research Article entitled 'A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Multivariate Genome-wide Predictive Modeling with Application to the UK Biobank' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. That is, we will consider a revised manuscript that robustly demonstrates marked improvement over the existing approaches (i.e. Bayesian approaches, BLUP and polygenic risk scores), as Reviewer 1 pointed out. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Xiaofeng Zhu

Associate Editor

PLOS Genetics

Gregory P. Copenhaver

Editor-in-Chief

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Review of Qian et al

Summary:

This paper describes an efficient algorithm for fitting the lasso regression model to large data sets, along with an implementation (R package snpnet) and application to the UK biobank data to obtain predictors (effectively performing genomic prediction, or computing polygenic risk scores, PRS) for several different phenotypes. The paper compares prediction accuracy with some other simpler methods.

The lasso is, in general, a widely studied, and also quite widely used method. As such an algorithm and implementation for very large datasets are of potential interest to a general audience both inside and outside genetics. However, for readers of PloS genetics the interest is going to stand or fall on the application: is Lasso a good method to do genomic prediction? I am skeptical of this: the Lasso has never been the method of choice for genomic prediction in smaller data sets, with the field generally preferring other large-scale regression methods, including very simple methods (eg "ridge regression", usually known as BLUP in the quantitative genetics literature) or very computationally intensive methods (Bayesian regression, usually fit via MCMC). The Elastic Net is also sometimes used. But the Lasso, rarely. While I will keep an open mind on whether this could change for biobank-sized data, the current paper is unconvincing on this because none of the comparisons are with state-of-the-art approaches to this problem.

Overall then I think the main contribution of this paper is the algorithmic ideas, whose main appeal is their simplicity and generality: I like the fact that the design allows the algorithm to maximally exploit previous implementations, rather than having to reimplement the coordinate ascent steps for example. However, unless the resulting method is really competitive with state of the art for genomic prediction then this seems better suited to another journal.

Detailed comments:

1. Comparisons with other methods:

The methods used here do not seem to represent a reasonable selection of state-of-the-art approaches to forming predictors for genetic data, on which there is a large literature. Historically genomic prediction has been done using multiple linear regression fit either using very simple methods (eg "ridge regression", usually known as BLUP in the quantitative genetics literature) or very computationally intensive methods (Bayesian regression, usually fit via MCMC). More recently, motivated by the difficulty of accessing/sharing genotype data, as well as computational considerations, a literature has sprung up around methods that attempt to build predictors based on summary statistics only (and LD from a reference panel). For example, Ge et al and Lloyd-Jones et al:

https://www.nature.com/articles/s41467-019-09718-5

https://www.nature.com/articles/s41467-019-12653-0

are recent examples, and include comparisons with other methods, the latter specifically on the UK biobank data with some of the same phenotypes considered here.

To take a quick example, in Fig 2 of Lloyd Jones, looking at R2 for BMI, the performance among the methods they consider ranges from 0.1 to 0.126. In this paper (Table 3) Lasso achieves 0.103. I realize these numbers are not directly comparable, being based on different protocols (CV splits etc) for analyzing the UK biobank data, but it illustrates my concern that Lasso may not be competitive with the best existing methods.

2. Algorithmic description

I found much of the algorithmic description in the overview overly long and hard to follow. The basic idea seems rather simple (which is a good thing!) but the presentation seems to obscure the simplicity rather than highlighting it. The formal presentation of Algorithm 1 in section 4 helps a lot, and I suggest it should be moved to the overview section. This should allow the text in the algorithmic overview to be shortened, since much of the words seems to be repeating, in less precise terms, what is given in Algorithm 1.

Also:

- the algorithm and text did not seem to address what happens if the "checking" step fails. That is, in step 5 of Algorithm 1, what if no lambda satisfies the KKT conditions? Or is this guaranteed not to happen?

- How is M chosen? Does it matter?

- the algorithm seems to rely on the fact that marginal screening is going to be effective at identifying the correct variables to add in. In some cases with complex correlations among variables this may not be true - one can construct problems where the best pair of variables to include are not among the marginally strongest. How does the algorithm cope with that kind of situation? Is it guaranteed to converge in practice?

3. Standardization

The question of whether or not to standardize variables is usually phrased in terms of modeling assumptions -- if rare SNPs have bigger effects than common SNPs then standardization could be appropriate and improve predictive performance. The paper suggests that standardization will produce worse performance but this is not obvious a priori - it should be shown empirically.

4. Implementation

The software implementation does not appear to be quite ready for widespread distribution (e.g. the R package on github has no man pages, and I could not find a minimal working example).

Other

- the use of the term "multivariate" in the context of a multiple regression with univariate outcome is rather confusing. From the title I expected the paper to deal with multivariate outcomes. Better to stick to "multiple regression", or perhaps "multi-SNP regression" if you prefer.

- references to heritability were also confusing. E.g. the abstract refers to "state-of-the-art heritability estimation", when the goal here seems not to be heritability estimation but building a predictor, which are different things. Heritability provides an upper bound on prediction accuracy from genetic data, but building a predictor is not the same as "estimating heritability", and most approaches to estimating heritability do not explicitly build predictors. I think you can (and probably should) write the whole paper without mentioning heritability, and focussing entirely on PRS and prediction accuracy.

- the presentation of result is much longer than it need be. The main results for different phenotypes and methods could probably be shown in a single figure (e.g. Lloyd-Jones Fig 2). Many of the other figures did not seem essential to the main story.

Refs:

Ge, T., Chen, C., Ni, Y. et al. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun 10, 1776 (2019). https://doi.org/10.1038/s41467-019-09718-5

Lloyd-Jones, L.R., Zeng, J., Sidorenko, J. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat Commun 10, 5086 (2019). https://doi.org/10.1038/s41467-019-12653-0

Reviewer #2: How to build the best predictive model using large-scale genetic data is important in health and disease studies. This paper provides a true regression approach for this problem, an important alternative to the polygenic risk scores. The results from analysis of the UK Biobank are convincing and interesting. The algorithm seems to be quite reasonable.

I only have a few minor comments - (1) since Lasso results in biased estimates of the regression coefficients. Do the authors think that by performing further debased estimation, one can further improve the prediction performance? (2) since a very large number of SNPs are selection for each of the data examples, would the consistency results still hold? Lasso theory requires that the model has to be very sparse. (3) Why univariate screening + Lasso does not perform as well as fitting Lasso using all the SNPs? Does this mean that the univariate screening as proposed by Jianqin Fan etc does not really work in the settings considered in this paper?

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Matthew Stephens

Reviewer #2: No

Decision Letter 1

Gregory P Copenhaver, Xiaofeng Zhu

13 Jul 2020

* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. *

Dear Dr Hastie,

Thank you very much for submitting your Research Article entitled 'A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some aspects of the manuscript that should be improved.

We therefore ask you to modify the manuscript according to the review recommendations before we can consider your manuscript for acceptance. Your revisions should address the specific points made by reviewers. Reviewer #1 raised important issues regarding the results of SBayesR.  This issue will need to be fully resolved in the revision and the editors agree that it may be very helpful for you to reach out the authors of SBayesR during your revision process, but we leave that up to you to decide.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Xiaofeng Zhu

Associate Editor

PLOS Genetics

Gregory P. Copenhaver

Editor-in-Chief

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have done a thorough job responding to my comments, and I believe the whole paper is much improved.

Just one new substantive issue has arisen during this revision: the results reported for the SBayesR method are very poor, and seem to strongly contradict the original publication on this method. Indeed, it is a bit hard to believe that it performs quite so poorly, and the reasons for its poor performance need to be understood and either corrected or explained. For example, for height, SBayesR does no better than just Age + Sex in predicting height - so it essentially has a 0% R2 when you consider the genetic component only. In contrast, LLoyd-Jones et al report that SBayesR achieved an R2 of >35% for height in the UK biobank. Something is clearly wrong, either with the SBayes software or with the way it has been applied. (Other traits show a similar pattern, but the height result is particularly striking.) Of course, I don't know what the problem is, but I suggest a first step would be to ask the SBayesR authors if they have suggestions, and/or get their original code and see if you can reproduce their published results.

Other items:

in SBayes i noticed you excluded the MHC. Maybe this is recommended by SBayes software, but it seems likely to hurt R2 and AUC for many traits as the MHC has a strong effect on many traits. To make results comparable across methods it seems necessary to either exclude or include MHC for all methods. (It seems unlikely that this issue explains the poor performance on height noted above.)

Reviewer #2: my previous comments were minor and the authors have addressed these comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: No: UK Biobank data can't be provided

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Matthew Stephens

Reviewer #2: No

Decision Letter 2

Gregory P Copenhaver, Xiaofeng Zhu

4 Sep 2020

Dear Dr Hastie,

We are pleased to inform you that your manuscript entitled "A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional accept, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about one way to make your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Xiaofeng Zhu

Associate Editor

PLOS Genetics

Gregory P. Copenhaver

Editor-in-Chief

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-20-00068R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Gregory P Copenhaver, Xiaofeng Zhu

13 Oct 2020

PGENETICS-D-20-00068R2

A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank

Dear Dr Hastie,

We are pleased to inform you that your manuscript entitled "A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Results for body mass index (BMI).

    (PDF)

    S2 Appendix. Results for asthma.

    (PDF)

    S3 Appendix. Results for high cholesterol.

    (PDF)

    S4 Appendix. Manhattan plots.

    (PDF)

    S1 Table. R2 values for BMI (without refitting).

    For lasso and relaxed lasso, the chosen model is based on maximum R2 on the validation set. Models (3) to (9) each include Model (2) plus their own specification as stated in the Form column. The elastic-net picks α = 0.1 based on the validation performance. (See the illustrative sketch after the Data Availability Statement below.)

    (PDF)

    S2 Table. R2 values for body mass index by different PRS methods with refitting.

    (PDF)

    S3 Table. AUC values for asthma (without refitting).

    For lasso and relaxed lasso, the chosen model is based on maximum AUC on the validation set. Models (3) to (9) each include Model (2) plus their own specification as stated in the Form column. The elastic-net picks α = 0.1 based on the validation performance.

    (PDF)

    S4 Table. AUC values for asthma by different PRS methods with refitting.

    (PDF)

    S5 Table. AUC values for high cholesterol (without refitting).

    For lasso and relaxed lasso, the chosen model is based on maximum AUC on the validation set. Models (3) to (9) each include Model (2) plus their own specification as stated in the Form column. The elastic-net picks α = 0.9 based on the validation performance.

    (PDF)

    S6 Table. AUC values for high cholesterol by different PRS methods with refitting.

    (PDF)

    S1 Fig. R2 plot for body mass index.

    The primary horizontal axis on the bottom represents the index of the lambda values, 1, 2, …, L, which corresponds to the sequence of regularization parameters λ1 > λ2 > ⋯ > λL. The top axis shows the number of active variables in the model. ReLasso: relaxed lasso.

    (TIF)

    S2 Fig. Actual body mass index (BMI) versus predicted BMI on 5000 random samples from the test set.

    A regression line with its 95% confidence band is also added on top of the dots. The correlation between actual BMI and predicted BMI is 0.3256.

    (TIF)

    S3 Fig. Residuals of lasso prediction for body mass index.

    Standard deviation of the residual is 4.51 kg/m2.

    (TIF)

    S4 Fig. AUC plot for asthma.

    The primary horizontal axis on the bottom represents the index of the lambda values, 1, 2, …, L, which corresponds to the sequence of regularization parameters λ1 > λ2 > ⋯ > λL. The top axis shows the number of active variables in the model. ReLasso: relaxed lasso.

    (TIF)

    S5 Fig. Box plot of the percentile of the linear prediction score among cases versus controls for asthma.

    This is based on the optimal lasso model.

    (TIF)

    S6 Fig. Stratified prevalence across different percentile bins based on the predicted scores for asthma.

    This is based on the optimal lasso model.

    (TIF)

    S7 Fig. AUC plot for high cholesterol.

    The primary horizontal axis on the bottom represents the index of the lambda values, 1, 2, …, L, which corresponds to the sequence of regularization parameters λ1 > λ2 > ⋯ > λL. The top axis shows the number of active variables in the model. ReLasso: relaxed lasso.

    (TIF)

    S8 Fig. Box plot of the percentile of the linear prediction score among cases versus controls for high cholesterol.

    This is based on the optimal lasso model.

    (TIF)

    S9 Fig. Stratified prevalence across different percentile bins based on the predicted scores for high cholesterol.

    This is based on the optimal lasso model.

    (TIF)

    S10 Fig. ROC curve for asthma.

    This is based on the optimal lasso model.

    (TIF)

    S11 Fig. ROC curve for high cholesterol.

    This is based on the optimal lasso model.

    (TIF)

    S12 Fig. Manhattan plot of the univariate p-values for height.

    This is based on the optimal lasso model. The vertical axis shows −log10(p) for each SNP. The red horizontal line represents a reference level of p = 5 × 10⁻⁸.

    (TIF)

    S13 Fig. Manhattan plot of the univariate p-values for body mass index.

    This is based on the optimal lasso model. The vertical axis shows −log10(p) for each SNP. The red horizontal line represents a reference level of p = 5 × 10⁻⁸.

    (TIF)

    S14 Fig. The lasso coefficients for height.

    This is based on the optimal lasso model. The vertical axis shows the magnitude of the coefficients from snpnet. The SNPs with relatively large lasso coefficients are highlighted in green.

    (TIF)

    S15 Fig. The lasso coefficients for body mass index.

    This is based on the optimal lasso model. The vertical axis shows the magnitude of the coefficients from snpnet. The SNPs with relatively large lasso coefficients are highlighted in green.

    (TIF)

    S16 Fig. Manhattan plot of the univariate p-values for asthma.

    This is based on the optimal lasso model. The vertical axis shows −log10(p) for each SNP. The red horizontal line represents a reference level of p = 5 × 10⁻⁸.

    (TIF)

    S17 Fig. Manhattan plot of the univariate p-values for high cholesterol.

    This is based on the optimal lasso model. The vertical axis shows −log10(p) for each SNP. The red horizontal line represents a reference level of p = 5 × 10⁻⁸.

    (TIF)

    S18 Fig. The lasso coefficients for asthma.

    This is based on the optimal lasso model. The vertical axis shows the magnitude of the coefficients from snpnet. The SNPs with relatively large lasso coefficients are highlighted in green.

    (TIF)

    S19 Fig. The lasso coefficients for high cholesterol.

    This is based on the optimal lasso model. The vertical axis shows the magnitude of the coefficients from snpnet. The SNPs with relatively large lasso coefficients are highlighted in green.

    (TIF)

    Attachment

    Submitted filename: response_to_reviewers.pdf

    Attachment

    Submitted filename: response_to_reviewers_v2.pdf

    Data Availability Statement

    The analyses presented in this study were based on data accessed through the UK Biobank: http://www.ukbiobank.ac.uk. Experiment scripts are available on GitHub (https://github.com/junyangq/scripts_snpnet_paper).
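
    The S1, S3, and S5 Table legends above describe a validation-based model selection: along the regularization path λ1 > λ2 > ⋯ > λL, the reported lasso and relaxed-lasso fits are the ones maximizing R2 (or AUC for the binary traits) on the validation set. The following R sketch illustrates that kind of selection with glmnet on simulated data; it is not the authors' snpnet pipeline, and the variable names and simulated data are assumptions for illustration only.

        ## Illustrative sketch (not the snpnet pipeline): fit a lasso path on a
        ## training set, score every lambda on a held-out validation set, and
        ## keep the lambda index with the largest validation R^2.
        library(glmnet)

        set.seed(1)
        n <- 2000; p <- 500
        X <- matrix(rnorm(n * p), n, p)
        beta <- c(rnorm(20), rep(0, p - 20))   # sparse signal; simulated data only
        y <- drop(X %*% beta + rnorm(n))

        train <- sample(n, 1200)
        val <- setdiff(seq_len(n), train)

        fit <- glmnet(X[train, ], y[train])    # lasso path, lambda_1 > ... > lambda_L
        pred <- predict(fit, X[val, ])         # n_val x L matrix of predictions

        mse <- colMeans((y[val] - pred)^2)
        r2 <- 1 - mse / mean((y[val] - mean(y[val]))^2)
        best <- which.max(r2)                  # chosen lambda index

        plot(seq_along(r2), r2, type = "l",
             xlab = "Lambda index (1, ..., L)", ylab = "Validation R^2")
        abline(v = best, lty = 2)

    For the binary endpoints (asthma, high cholesterol), the same procedure would use family = "binomial" in glmnet and score each lambda by validation AUC instead of R2.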

