Summary
LDpred2 is a widely used Bayesian method for building polygenic scores (PGSs). LDpred2-auto can infer the two parameters from the LDpred model, the SNP heritability $h^2$ and polygenicity p, so that it does not require an additional validation dataset to choose best-performing parameters. The main aim of this paper is to properly validate the use of LDpred2-auto for inferring multiple genetic parameters. Here, we present a new version of LDpred2-auto that adds an optional third parameter α to its model, for modeling negative selection. We then validate the inference of these three parameters (or two, when using the previous model). We also show that LDpred2-auto provides per-variant probabilities of being causal that are well calibrated and can therefore be used for fine-mapping purposes. We also introduce a formula to infer the out-of-sample predictive performance $r^2$ of the resulting PGS directly from the Gibbs sampler of LDpred2-auto. Finally, we extend the set of HapMap3 variants recommended to use with LDpred2 with 37% more variants to improve the coverage of this set, and we show that this new set of variants captures 12% more heritability and provides 6% more predictive performance, on average, in UK Biobank analyses.
Keywords: LDpred2, inference
This study extends the utility of LDpred2-auto, a widely used method for deriving predictive polygenic scores. It validates the use of the latest version of LDpred2-auto for inferring key genetic parameters that inform us about the genetic architecture of traits, for identifying causal genetic variants, and for estimating predictive performance.
Introduction
Most traits and diseases in humans are heritable. What differs is the genetic architecture of each trait, which can be parameterized by three key terms: the heritability (i.e., the proportion of phenotypic variation explained by genetics), the polygenicity (i.e., the fraction of genomic variants that have a non-zero effect on the trait), and the causal effect distribution (i.e., how the effect size distribution varies across causal variants). Some phenotypes, such as schizophrenia or height, are highly heritable and highly polygenic.1,2,3,4 Causal effects are larger when a trait is more heritable and smaller when the trait is more polygenic. As for the distribution of causal effects relative to their allele frequencies, it is often investigated through a single parameter, usually called α or S, to model the effect of negative selection on complex traits whereby variants with lower frequencies are expected to have higher causal effect sizes.5 In this model, the expected phenotypic variance explained by a genetic variant is proportional to $[f(1-f)]^{1+\alpha}$, where f is the allele frequency of this variant. Many methods have been developed to estimate the SNP heritability (referred to as $h^2$ for brevity) and polygenicity (p), either globally for the whole genome or locally for specific regions of the genome, as well as α. These methods (non-exhaustively) include GCTA6 ($h^2$), BOLT-REML7 ($h^2$ and p), LD Score regression8 ($h^2$), FINEMAP9 (per-variant p, also called posterior inclusion probabilities [PIPs], used for fine-mapping), HESS10 (local $h^2$), LDAK-SumHer11,12 ($h^2$ and α), S-LD4M3 (p), GRM-MAF-LD13 (α), SuSiE14 (PIPs), SBayesS15 ($h^2$, p, and a third parameter S, similar to α), and BEAVR16 (local p). Not all methods have the same modeling assumptions; for example, LDAK-SumHer assumes a different prior than SBayesS and LDpred2-auto and does not estimate all the same parameters. It can estimate the SNP heritability and α. However, it can estimate neither the polygenicity nor the per-variant probabilities of being causal (since it inherently assumes an infinitesimal model, i.e., $p = 1$). Moreover, since it does not sample effects, it cannot be used to estimate the predictive performance with the formula we propose in this paper.
As previously shown by Daetwyler et al.,17 $h^2$ and p can also be used to determine how well we can predict a phenotype using genetic variants alone, with $r^2 = \dfrac{h^2}{1 + (1 - r^2) \dfrac{M_c}{N h^2}}$, where $r^2$ is the maximum achievable squared correlation between the genetic predictor and the phenotype, $r^2 / h^2$ is the maximum achievable squared correlation between the genetic predictor and the genetic component, N is the sample size, M is the number of variants, and $M_c = M p$ is the number of causal variants. Such genetic predictors are called polygenic scores (PGSs), and they are getting closer to being included as part of existing clinical risk models for diseases.18,19,20 LDpred2 is a widely used PGS method that can directly build PGSs using summary statistics from genome-wide association studies (GWASs), making it highly applicable.21,22,23 LDpred2 is a Bayesian approach that uses the SNP heritability $h^2$ and polygenicity p as parameters of its model. LDpred2-auto, one variant of LDpred2, can directly estimate these two parameters from GWAS summary statistics, making it applicable even when no validation data are available for tuning these two model parameters.21
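The relation between sample size, genetic architecture, and maximum achievable prediction accuracy can be explored numerically. The following is a minimal sketch that assumes the relation takes the implicit form $r^2 = h^2 / (1 + (1 - r^2) M_c / (N h^2))$ with $M_c = M p$; the function name `max_r2` and the example values are hypothetical and chosen only for illustration.

```python
import math


def max_r2(h2: float, n: int, m_causal: float) -> float:
    """Solve r2 = h2 / (1 + (1 - r2) * m_causal / (n * h2)) for r2.

    The implicit form of the relation is an assumption of this sketch.
    """
    c = m_causal / (n * h2)
    # Rearranging gives the quadratic c*r2^2 - (1 + c)*r2 + h2 = 0.
    # Its smaller root is the one that lies between 0 and h2.
    disc = (1 + c) ** 2 - 4 * c * h2
    return ((1 + c) - math.sqrt(disc)) / (2 * c)


# Hypothetical example: M = 1,000,000 variants, p = 0.01, h2 = 0.5
r2_small_n = max_r2(h2=0.5, n=20_000, m_causal=1_000_000 * 0.01)
r2_large_n = max_r2(h2=0.5, n=200_000, m_causal=1_000_000 * 0.01)
print(f"N = 20,000:  r2 = {r2_small_n:.3f}")
print(f"N = 200,000: r2 = {r2_large_n:.3f}")
```

Under this assumed form, the achievable $r^2$ increases with N and approaches $h^2$ as the ratio $M_c / (N h^2)$ goes to zero.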
The main aim of this paper is to properly validate the use of LDpred2-auto for inferring multiple genetic parameters. Here we extend LDpred2-auto and show that it is a reliable method for estimating $h^2$ (global and local), p (also per-variant probabilities [PIPs] used for fine-mapping purposes), and α (by extending its model to also include this third parameter). So, on top of providing competitive PGSs, LDpred2-auto can also provide all these estimates of genetic architecture. Moreover, we show how it can now also reliably estimate the predictive ability of the PGSs it derives, allowing for direct assessment of the usefulness of the derived PGSs, without requiring an independent test set. An overview of what LDpred2-auto can now provide is presented in Figure 1. Finally, we extend the set of HapMap3 variants recommended for use with LDpred2, which enables us to capture around 12% more SNP heritability and achieve around 6% more predictive performance $r^2$, on average, in UK Biobank (UKBB) analyses. We call this extended set of 1,444,196 SNPs “HapMap3+” and recommend using it when power is sufficient (i.e., with a large sample size, large $h^2$, and/or small p).
Material and methods
We have extensively used R packages bigstatsr and bigsnpr26 for analyzing large genetic data, packages from the future framework27 for easy scheduling and parallelization of analyses on the high-performance computing cluster, and packages from the tidyverse suite28 for shaping and visualizing results. We have extensively used the UKBB data,29 which is available through a procedure described at https://www.ukbiobank.ac.uk/using-the-resource/. The UKBB received ethical approval from the NHS National Research Ethics Service North West (11/NW/0382). The present analyses were conducted under UKBB data application number 58024.
Data for simulations
For simulations, we use the UKBB imputed (BGEN) data, read as allele dosages with function snp_readBGEN from R package bigsnpr.26,29 We use the set of 1,054,330 HapMap3 variants recommended to use for LDpred2.21 Since we run lots of different models, we restrict the simulations to chromosomes 3, 6, 9, 12, 15, 18, and 21, resulting in a set of 322,805 SNPs. We restrict individuals to the ones used for computing the principal components (PCs) in the UKBB (field 22020). These individuals are unrelated and have passed some quality control such as removing samples with a missing rate on autosomes larger than 0.02, having a mismatch between inferred sex and self-reported sex, and outliers based on heterozygosity (more details can be found in Bycroft et al.29). To get a set of genetically homogeneous individuals, we compute a robust Mahalanobis distance based on the first 16 PCs (field 22009) and further restrict individuals to those within a log distance of 4.5.30 This results in 356,409 individuals of Northwestern European ancestry. We randomly sample 200,000 individuals to form a training set (to run the GWAS) and use the remaining individuals to form a test set (to evaluate the predictive models).
Data for the UKBB analyses
We use the set of 1,054,330 HapMap3 variants recommended to use for LDpred221 and the same 356,409 individuals of Northwestern European ancestry as in the simulations. We randomly sample 50,000 individuals to form a test set (to evaluate the predictive models) and use the remaining individuals to form a training set (to run the GWAS).
We construct and use the same phenotypes as in Privé, Aschard, et al.31 About half of these consist of phecodes mapped from ICD10 and ICD9 codes using R package PheWAS.32,33 The other half consist of phenotypes defined in UKBB fields based on manual curation.31 As covariates, we first recompute PCs for the homogeneous subset of individuals previously defined using function snp_autoSVD from R package bigsnpr and keep four PCs based on visual inspection.26,30 We also use sex (field 22001), age (field 21022), birth date (combining fields 34 and 52), and deprivation index (field 189) as additional covariates (to a total of eight).
We use the LD matrix with independent LD blocks computed in Privé, Arbel, et al.34 We design two other LD matrices: one using a smaller random subset of 2,000 individuals from the previously selected ones (which we call “hm3_small”), and one based on 10,000 individuals from around South Europe by using the “Italy” center defined in Privé, Aschard, et al.31 (“hm3_altpop”). We apply the optimal algorithm developed in Privé35 to obtain independent LD blocks, as recommended in Privé, Arbel, et al.34 We finally define a fourth LD reference by extending the set of HapMap3 variants (see next section) and using 20,000 individuals from the previously selected ones (“hm3_plus”).
Extending the set of HapMap3 variants used
The HapMap3 variants generally provide a good coverage of the whole genome. We recall that the set of 1,054,330 HapMap3 variants recommended to use for LDpred221 is a subset of the original set of HapMap3 variants, which does not include duplicated positions (e.g., multi-allelic variants), nor ambiguous variants (e.g., “A” and “T,” or “C” and “G”), and which includes SNPs only (e.g., no indel). Here we propose to extend this set of 1,054,330 HapMap3 variants to make sure many genetic variants are well tagged by the extended set. To design this new set, we first read all variants from the UKBB with a minor allele frequency (MAF) larger than 0.005 in the whole data (i.e., the MAF from the MFI files). There are around 11.5 million such variants. Then we restrict to unrelated UKBB individuals that are not listed as White British (field 22006) and use these individuals of diverse ancestries29,36 to compute all pairwise correlations between variants within a 1 Mb distance, restricting to squared correlations larger than 0.3. Finally, we design an algorithm that aims at maximizing the tagging of all these variants read. We want to maximize $\sum_j \max_k R_{j,k}^2$, where j spans the whole set of variants read and k spans the variants kept in the new set, which we call HapMap3+. We start by including all previously used HapMap3 variants. Then, for the sake of simplicity, we use a greedy approach, where we repeatedly include the variant that increases this sum most, until no variant improves it by more than 2. Note that we only allow non-ambiguous SNPs to be included. This results in an extended set of 1,444,196 SNPs, for which we compute the LD matrix (within a 3 cM window) and apply the optimal algorithm developed in Privé (2021)35 to obtain 431 independent LD blocks. Since we use individuals of diverse ancestries for computing the pairwise variant correlations used for constructing this extended set of variants, we expect this new set of variants to be beneficial across diverse ancestries. Indeed, from the 11.5 million variants we aimed at tagging, 82.1% (respectively, 69.1%), 80.0% (respectively, 66.7%), 79.3% (respectively, 66.1%), 75.4% (respectively, 65.3%), and 66.6% (respectively, 48.6%) are tagged at a lower (respectively, higher) $r^2$ threshold by at least one HapMap3+ variant in Northwestern Europeans, Middle Easterners, South Asians, East Asians, and West Africans, respectively (results extrapolated from variants in chromosome 22).
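The greedy selection described above can be illustrated on a toy example. This is only a sketch of the general idea (maximize the sum over variants j of the best squared correlation with any kept variant k), not the authors' implementation; the function name `greedy_tagging`, the dense matrix input, and the toy stopping threshold of 0.5 (the text uses 2 on genome-wide data) are assumptions made for illustration.

```python
def greedy_tagging(r2, initial, min_gain):
    """Greedily add variants that most increase sum_j max_k r2[j][k].

    r2: square matrix (list of lists) of squared correlations, diagonal = 1.
    initial: indices of variants kept from the start.
    min_gain: stop when no candidate improves the objective by more than this.
    """
    n = len(r2)
    kept = list(initial)
    # best[j] = how well variant j is currently tagged by the kept set
    best = [max((r2[j][k] for k in kept), default=0.0) for j in range(n)]
    while True:
        top_gain, top_k = 0.0, None
        for k in range(n):
            if k in kept:
                continue
            gain = sum(max(0.0, r2[j][k] - best[j]) for j in range(n))
            if gain > top_gain:
                top_gain, top_k = gain, k
        if top_k is None or top_gain <= min_gain:
            break
        kept.append(top_k)
        best = [max(best[j], r2[j][top_k]) for j in range(n)]
    return kept


# Toy data: variants 0-1 are in strong LD, 2-3 are in strong LD, 4 is isolated
toy_r2 = [
    [1.0, 0.9, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8, 0.0],
    [0.0, 0.0, 0.8, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
selected = greedy_tagging(toy_r2, initial=[0], min_gain=0.5)
print(selected)
```

In this toy run, variant 1 is never added because it is already well tagged by variant 0, which mirrors why the extended set can stay much smaller than the full set of variants read.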
Optional extended model and inference with LDpred2-auto
LDpred2 originally assumed the following model for effect sizes:
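$$\beta_j = S_j \gamma_j \sim \begin{cases} \mathcal{N}\left(0, \dfrac{h^2}{M p}\right) & \text{with probability } p, \\ 0 & \text{otherwise,} \end{cases} \qquad \text{(Equation 1)}$$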
where p is the proportion of causal variants, M the number of variants, $h^2$ the (SNP) heritability, $\gamma$ the effect sizes on the allele scale, $S$ the standard deviations of the genotypes, and $\beta$ the effects of the scaled genotypes.21 In LDpred2-auto, p and $h^2$ are directly estimated within the Gibbs sampler, as opposed to testing several values of p and $h^2$ from a grid of hyper-parameters (as in LDpred2-grid). This makes LDpred2-auto a method free of hyper-parameters that can therefore be applied directly without requiring a validation dataset to choose best-performing hyper-parameters.21 $h^2$ is estimated by $\beta^T R \beta$, where $R$ is the correlation matrix between variants and $\beta$ is a vector of causal effect sizes (after scaling) from one iteration of the Gibbs sampler. As for p, it is sampled from $\text{Beta}(1 + M_c, 1 + M - M_c)$, where $M_c = \sum_j (\beta_j \neq 0)$. Note a small change: we now sample p from $\text{Beta}(1 + M_c / \bar{l^2}, 1 + (M - M_c) / \bar{l^2})$, where $\bar{l^2}$ is the average LD score, to add more variability in the sampling in order to account for a reduced effective number of (independent) variants.
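The two updates described above can be sketched on toy values. This is a minimal illustration assuming that the heritability estimate is the quadratic form $\beta^T R \beta$ and that p is drawn from a Beta distribution whose counts are divided by the average LD score; the helper names `h2_estimate` and `sample_p` and all input values are hypothetical.

```python
import random


def h2_estimate(beta, R):
    """Quadratic form beta^T R beta for one sampled vector of scaled effects."""
    m = len(beta)
    return sum(beta[i] * R[i][j] * beta[j] for i in range(m) for j in range(m))


def sample_p(beta, mean_ld_score, rng):
    """Draw p from Beta(1 + Mc / l, 1 + (M - Mc) / l), l = average LD score."""
    m = len(beta)
    m_causal = sum(1 for b in beta if b != 0)
    return rng.betavariate(1 + m_causal / mean_ld_score,
                           1 + (m - m_causal) / mean_ld_score)


# Toy example: 4 variants, 2 of them causal and in LD with each other
beta = [0.1, 0.0, -0.2, 0.0]
R = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
rng = random.Random(1)
h2 = h2_estimate(beta, R)
p_draw = sample_p(beta, mean_ld_score=2.0, rng=rng)
print(h2, p_draw)
```

Dividing both Beta parameters by the average LD score widens the distribution that p is drawn from, which is the extra variability the text refers to.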
Here we provide an extended model and sampling scheme for LDpred2-auto that can be optionally used by setting parameter use_MLE = TRUE (otherwise it is run as described in the previous paragraph when using use_MLE = FALSE). In this new option, we extend LDpred2-auto with a third parameter α that controls the relationship between MAFs (or, equivalently, standard deviations) of genotypes and expected effect sizes; the model becomes
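$$\beta_j = S_j \gamma_j \sim \begin{cases} \mathcal{N}\left(0, \sigma_\beta^2 \cdot (S_j^2)^{(\alpha + 1)}\right) & \text{with probability } p, \\ 0 & \text{otherwise.} \end{cases} \qquad \text{(Equation 2)}$$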
Therefore, it was earlier assumed that $\alpha = -1$ and $\sigma_\beta^2 = h^2 / (M p)$ in Equation 1. This extended model in Equation 2 is similar to the model assumed by SBayesS, where α is denoted by S instead.15 In SBayesS, α and $\sigma_\beta^2$ are estimated by maximizing the likelihood of the normal distribution (over the causal variants from the Gibbs sampler). When using this 3-parameter model in LDpred2-auto, in order to add some sampling to these two parameters, we first sample causal variants with replacement before computing the maximum likelihood estimators. This maximum likelihood estimation (MLE) is implemented using R package roptim (see the web resources), and we bound the estimate of α to be between −1.5 and 0.5 (the default, but can be modified) and the estimate of $\sigma_\beta^2$ to be between 0.5 and 2 times the one from the previous iteration of the Gibbs sampler. Note that we still estimate $h^2 = \beta^T R \beta$, but that it is not used in the variance of sampled effect sizes anymore (Equation 2). Note also that, to get local heritability estimates (e.g., for a single LD block), this estimation ($\beta^T R \beta$) is simply restricted to the variants from this LD block. Finally, in both models and sampling schemes now implemented in LDpred2-auto, we now detect strong divergence by comparing $\beta$, the vector of scaled effect sizes from one iteration of the Gibbs sampler, with $\hat{\beta}$, the marginal scaled effect sizes; corresponding chains are stopped, and missing values are returned for effect sizes and estimates of missing iterations.
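To make the estimation of α and the effect-size variance more concrete, here is a simplified sketch on simulated causal effects. It assumes the Gaussian model $\beta_j \sim \mathcal{N}(0, \sigma_\beta^2 (S_j^2)^{\alpha + 1})$ for causal variants, resamples them with replacement, and then maximizes the likelihood. Note that it swaps the bounded numerical optimizer (roptim in the text) for a plain grid search over α with the closed-form variance at each grid point; the function name `profile_mle` and all simulated values are hypothetical.

```python
import math
import random


def profile_mle(beta, s2, alpha_grid):
    """Grid search over alpha; sigma2 has a closed-form MLE for each alpha."""
    n = len(beta)
    best = None
    for alpha in alpha_grid:
        v = [s ** (alpha + 1) for s in s2]
        sigma2 = sum(b * b / vj for b, vj in zip(beta, v)) / n
        # Gaussian log-likelihood at the profiled sigma2, up to a constant
        loglik = -0.5 * sum(math.log(sigma2 * vj) for vj in v) - 0.5 * n
        if best is None or loglik > best[0]:
            best = (loglik, alpha, sigma2)
    return best[1], best[2]


rng = random.Random(1)
true_alpha, true_sigma2, n_causal = -0.5, 1e-4, 4000

# Simulated causal variants: allele frequency f, genotype variance 2f(1 - f)
freqs = [rng.uniform(0.01, 0.5) for _ in range(n_causal)]
s2 = [2 * f * (1 - f) for f in freqs]
beta = [rng.gauss(0, math.sqrt(true_sigma2 * s ** (true_alpha + 1))) for s in s2]

# Resample causal variants with replacement before computing the MLE
idx = [rng.randrange(n_causal) for _ in range(n_causal)]
beta_boot = [beta[i] for i in idx]
s2_boot = [s2[i] for i in idx]

# alpha bounded between -1.5 and 0.5, as in the default described in the text
alpha_grid = [-1.5 + 0.01 * i for i in range(201)]
alpha_hat, sigma2_hat = profile_mle(beta_boot, s2_boot, alpha_grid)
print(alpha_hat, sigma2_hat)
```

Setting α to −1 in this parameterization makes the variance constant across variants, which corresponds to the earlier 2-parameter model.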
Inference of predictive performance
To infer the out-of-sample predictive performance $r^2$ (and its confidence interval [CI]) of the resulting PGS from LDpred2-auto, we use the distribution of $\beta_1^T R \beta_2$, where $\beta_1$ and $\beta_2$ are two sampled vectors of causal effect sizes (after scaling) from two different chains of the Gibbs sampler. Intuitively, if the prediction is perfect, then $\beta_1 = \beta_2$ and $\beta_1^T R \beta_2 = h^2$; when power is very low, $\beta_1$ and $\beta_2$ are almost uncorrelated and $\beta_1^T R \beta_2 \approx 0$. Others have proposed to estimate $r^2$ from a reference genotype set37 or from an additional set of external GWAS summary statistics and LD38,39; here we only use the summary statistics that we input to LDpred2-auto. These previous works have shown that $r^2$ can be approximated from the predictive effects from the training set and the marginal effects from the test set (after scaling). Building on this approximation, and assuming that the test sample has the same genetic ancestry as the GWAS used for training (to get the summary statistics), we propose to use $\beta_1^T R \beta_2$ as a sample of $r^2$. This can also be computed for a specific chain by taking two samples $\beta_1$ and $\beta_2$ that are far enough apart on the same chain to remove the possible autocorrelation. This is what we use for SBayesS here, and also as an alternative means for post-filtering chains for prediction (see next material and methods section). Note that Ding et al.24 investigated autocorrelation in LDpred2(-grid) and showed that it decays very quickly. Therefore, picking two LDpred2-auto samples that are 100 iterations apart should be more than enough to ensure quasi-independence of these samples.
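As an illustration of how such a distribution could be assembled, the sketch below computes the bilinear form $\beta_1^T R \beta_2$ for samples taken from pairs of different chains and then summarizes them with a mean and a 95% interval. How samples are paired across chains here (same iteration index, all pairs of chains) is an assumption of this sketch, and the names `bilinear`, `r2_samples`, and `summarize` are hypothetical.

```python
from itertools import combinations


def bilinear(b1, R, b2):
    """Compute b1^T R b2 for two vectors of scaled effect sizes."""
    return sum(b1[i] * R[i][j] * b2[j]
               for i in range(len(b1)) for j in range(len(b2)))


def r2_samples(chains, R):
    """One r2 sample per pair of chains and per stored iteration."""
    out = []
    for c1, c2 in combinations(chains, 2):
        for b1, b2 in zip(c1, c2):
            out.append(bilinear(b1, R, b2))
    return out


def summarize(samples):
    """Mean and empirical 2.5% / 97.5% quantiles of the samples."""
    s = sorted(samples)
    n = len(s)
    return sum(s) / n, s[int(0.025 * n)], s[min(n - 1, int(0.975 * n))]


# Toy example: 3 chains, 2 stored samples each, 2 variants in LD (r = 0.2)
R = [[1.0, 0.2], [0.2, 1.0]]
chains = [
    [[0.10, 0.20], [0.12, 0.18]],
    [[0.09, 0.22], [0.11, 0.19]],
    [[0.10, 0.21], [0.08, 0.20]],
]
samples = r2_samples(chains, R)
mean_r2, low, high = summarize(samples)
print(mean_r2, low, high)
```

When chains agree closely, as in this toy example, the samples concentrate near the quadratic form of any single chain; when they disagree, the cross-chain products shrink toward zero.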
In this paper, we check this approximation using extensive simulations (across many genetic architectures) and real data analyses (across 248 different phenotypes). These are compared to the partial-$r^2$ (on individual-level data from a separate test set). The partial correlation is computed with function pcor from R package bigstatsr, adjusting for the same eight covariates as in the GWAS, then squared (while keeping the sign). Corresponding 95% CIs are estimated through bootstrapping individual indices.
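The evaluation metric can be sketched as follows. This simplified example handles a single covariate through the standard first-order partial correlation formula (the analyses in the text adjust for eight covariates with pcor from bigstatsr), squares it while keeping the sign, and builds a percentile bootstrap interval by resampling individuals. All function names and the simulated data are hypothetical.

```python
import math
import random


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_cor(x, y, z):
    """Partial correlation of x and y, adjusting for one covariate z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def signed_square(r):
    """Square a correlation while keeping its sign."""
    return math.copysign(r * r, r)


def bootstrap_ci(x, y, z, n_boot, rng):
    """Percentile 95% CI obtained by resampling individual indices."""
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(signed_square(partial_cor(
            [x[i] for i in idx], [y[i] for i in idx], [z[i] for i in idx])))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Simulated test set: a polygenic score, one covariate, and a phenotype
rng = random.Random(42)
covariate = [rng.gauss(0, 1) for _ in range(200)]
pgs = [rng.gauss(0, 1) for _ in range(200)]
pheno = [0.5 * g + 0.3 * c + rng.gauss(0, 1) for g, c in zip(pgs, covariate)]

estimate = signed_square(partial_cor(pgs, pheno, covariate))
ci_low, ci_high = bootstrap_ci(pgs, pheno, covariate, 200, rng)
print(estimate, ci_low, ci_high)
```

Keeping the sign means that a score that is negatively correlated with the phenotype yields a negative value instead of being counted as predictive.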
Post-filtering of chains in LDpred2-auto
Because a Gibbs sampler can be unstable, with so many variants and with possible mismatches between, e.g., the GWAS summary statistics and the LD reference used, we have always recommended running multiple chains in LDpred2-auto and post-filtering some of them as quality control.21 We originally proposed to filter chains by keeping the ones providing PGSs with the largest variances. Then we tested an almost equivalent and simpler alternative,34 which is to keep only chains that provide top imputed marginal scaled effect sizes $R \beta$, where $R$ is the correlation matrix and $\beta$ are the PGS scaled effect sizes. This is the default post-filtering of LDpred2-auto chains that we use here, for both prediction and inference.
Moreover, here we test two alternative filtering criteria in the first simulations based on continuous outcomes (and call this “LDpred2-auto_altfilter”). First, for prediction, we test filtering on the average of $\beta_1^T R \beta_2$ within each chain, which is an estimate of $r^2$ (cf. previous material and methods section). Second, for inference, we filter on some convergence criterion. The split-Rhat statistic is a popular metric to test for good mixing and convergence of Markov chains.40 However, we have found this statistic to perform poorly when, e.g., one parameter gets stuck and is constant; in this case, the chain does not mix well, but a perfect Rhat of 1.0 is obtained. Instead, we have found that a two-sample Cramér-von Mises statistic,41 by similarly using both parts of the chain after burn-in, is often highly correlated with the split-Rhat statistic while not suffering from the previous issue. We therefore chose to use this statistic and to average the three statistics computed for $h^2$, p, and α for each chain. We use a threshold of 4, above which we filter out the chain, because we have found that a value of 4 for this statistic often corresponds to a value close to 1.05 for the split-Rhat.
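The idea of comparing both halves of a chain with a two-sample Cramér-von Mises statistic can be sketched as below. This uses one common textbook form of the statistic (sum of squared differences between the two empirical distribution functions over the pooled sample, scaled by nm/(n+m)^2); the exact variant and scaling used by the authors may differ, so the threshold of 4 mentioned above should not be assumed to apply to this sketch. The function names are hypothetical.

```python
from bisect import bisect_right


def cvm_two_sample(x, y):
    """Two-sample Cramer-von Mises statistic (one common form)."""
    n, m = len(x), len(y)
    xs, ys = sorted(x), sorted(y)
    total = 0.0
    for v in xs + ys:
        f = bisect_right(xs, v) / n  # empirical CDF of the first sample at v
        g = bisect_right(ys, v) / m  # empirical CDF of the second sample at v
        total += (f - g) ** 2
    return total * n * m / (n + m) ** 2


def split_chain_stat(chain):
    """Compare the first and second halves of a chain (after burn-in)."""
    half = len(chain) // 2
    return cvm_two_sample(chain[:half], chain[half:])


# A chain whose two halves share the same distribution
stable_chain = [0.0, 1.0] * 50
# A chain that keeps drifting, so its two halves do not overlap
drifting_chain = [float(i) for i in range(100)]

stable_stat = split_chain_stat(stable_chain)
drifting_stat = split_chain_stat(drifting_chain)
print(stable_stat, drifting_stat)
```

A small value indicates that the two halves look alike, while a drifting parameter produces a large value and would flag the chain for removal.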
Results
Validating the inference with simulations
For simulations, we use the UKBB imputed data.29 We use 356,409 individuals of Northwestern European ancestry and 322,805 SNPs across seven chromosomes (material and methods). We first simulate continuous phenotypes using function snp_simuPheno from R package bigsnpr,26 varying three parameters: the SNP heritability $h^2$, the polygenicity p (i.e., the proportion of causal variants), and the parameter α in Equation 2 that controls the relationship between MAFs and expected effect sizes. This function first picks a proportion p of causal variants at random, samples effect sizes γ using the variance component parameterized by α, and then scales the effect sizes so that the genetic component $G \gamma$ has a variance $h^2$, where G is the genotype matrix. Finally, some Gaussian noise is added so that the final phenotype has a variance of 1. Then, we run a GWAS to obtain summary statistics using N individuals (either the 200,000 dedicated to this, or a random subset of 20,000), using fast linear regressions implemented in big_univLinReg from R package bigstatsr.26 Finally, we run LDpred2-auto with and without the option allow_jump_sign, which was proposed in Privé, Arbel, et al.34 for robustness (when disabled, it prevents effect sizes from changing sign without going through 0 first), and with and without the extended model including a third parameter α (using option use_MLE; material and methods). LDpred2-auto is run with 50 Gibbs sampler chains with different starting values for p (from 0.0001 to 0.2, equally spaced on a log scale). Then some of these chains are filtered out for quality control (material and methods).
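The 50 starting values for p, equally spaced on a log scale between 0.0001 and 0.2, can be generated as in the short sketch below. The helper name `seq_log` is chosen here for illustration.

```python
import math


def seq_log(start, stop, n):
    """Return n values from start to stop, equally spaced on a log scale."""
    log_start, log_stop = math.log(start), math.log(stop)
    step = (log_stop - log_start) / (n - 1)
    return [math.exp(log_start + i * step) for i in range(n)]


# One starting value of p per Gibbs sampler chain
p_init = seq_log(1e-4, 0.2, 50)
print(p_init[0], p_init[1], p_init[-1])
```

Spacing the values on a log scale puts as many starting points between 0.0001 and 0.001 as between 0.01 and 0.1, so that very sparse architectures are explored as thoroughly as more polygenic ones.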
First, LDpred2-auto generally reliably infers the three parameters from its model, i.e., the SNP heritability $h^2$, polygenicity p, and α (Figures 2, 3, 4, and S1–S6). Compared to LD Score regression, heritability estimates are as precise when power is low, and much more precise when power is large, especially for small polygenicity values (Figures 2, S1, and S2). When power is low (i.e., when $h^2$ and N are small), LDpred2-auto_noMLE (with only two model parameters, as in previous versions of LDpred2-auto) and SBayesS are both over-confident in their estimate of $h^2$ (i.e., CIs are small and do not contain the true parameter value; Figure S7), and LDpred2-auto_jump overestimates the heritability (Figure S1). LDpred2-auto_nojump looks very reliable for estimating $h^2$. When power is low, LDpred2-auto can overestimate the polygenicity when the true value is very small and underestimate it when the polygenicity is large (Figure S3). SBayesS, which uses a similar model with the same three parameters, often overestimates the polygenicity. The α estimate of LDpred2-auto (with the extended 3-parameter model) can become very imprecise when power is too low, which can be detected by a small number of chains kept from LDpred2-auto. Estimates of both p and α from SBayesS are often over-confident with small CIs that do not include the true simulated values (Figures S9 and S11). Generally, 95% CIs for all three parameters ($h^2$, p, and α) obtained from LDpred2-auto_nojump contain the true simulated value (Figures S7–S12), therefore confirming the validity of these CIs; this is the preferred method we recommend using. Finally, we also investigate alternative ways of post-filtering chains in LDpred2-auto, for both prediction and inference (LDpred2-auto_altfilter compared to LDpred2-auto_nojump; material and methods); results remain practically unchanged (Figures S1–S6).
Then, LDpred2-auto can also infer per-variant probabilities of being causal and local per-block heritability estimates, which are well calibrated (Figures S13 and S14). We recall that calibrated per-variant probabilities of being causal (also known as PIPs) can be used for fine-mapping purposes.14 LDpred2-auto provides PIPs that are better calibrated than with, e.g., SuSiE-RSS,42 which we run assuming there are 10 causal variants per LD block by default (Figure S15). Finally, LDpred2-auto can also be used to reliably infer the predictive performance $r^2$ of its resulting polygenic score, directly from within the Gibbs sampler (material and methods), even when power is low, and we show it works with results from SBayesS’s Gibbs sampler as well (Figures S16 and S17). CIs for this estimate very often encompass the true simulated value, except for LDpred2-auto_jump when both $h^2$ and N are small (Figures S18 and S19).
We then run simulations with binary outcomes where the simulated continuous liabilities are transformed to binary outcomes using a threshold corresponding to the prevalence. Results are very similar to when using the continuous phenotypes above (Figures S20–S23), and they are similar whether we use a linear regression GWAS and the total sample size N, or a logistic regression GWAS and the effective sample size (i.e., $N_\text{eff} = 4 / (1 / N_\text{case} + 1 / N_\text{control})$). The main difference is that the $h^2$ and $r^2$ estimates must be transformed to the liability scale,43 where a sample prevalence of 0.5 should be used for transforming estimates of $h^2$ and $r^2$ when using $N_\text{eff}$ in inference methods.44
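The effective sample size and the conversion to the liability scale can be sketched as follows. The conversion shown is the widely used coefficient based on the population prevalence K and the sample prevalence P; that this is the exact transformation of the cited reference, and that P = 0.5 is the value to pair with the effective sample size, are assumptions of this sketch. The function names and example numbers are hypothetical.

```python
from statistics import NormalDist


def n_eff(n_case, n_control):
    """Effective sample size of a case-control GWAS."""
    return 4 / (1 / n_case + 1 / n_control)


def liability_scale_factor(pop_prevalence, sample_prevalence):
    """Multiplier turning an observed-scale estimate into a liability-scale one."""
    k, p = pop_prevalence, sample_prevalence
    std_normal = NormalDist()
    # Height of the standard normal density at the liability threshold
    z = std_normal.pdf(std_normal.inv_cdf(1 - k))
    return (k * (1 - k)) ** 2 / (p * (1 - p) * z ** 2)


# Hypothetical disease: 5,000 cases, 95,000 controls, 5% population prevalence
effective_n = n_eff(5_000, 95_000)
factor = liability_scale_factor(pop_prevalence=0.05, sample_prevalence=0.5)
h2_observed = 0.10  # hypothetical estimate obtained when using effective_n
h2_liability = h2_observed * factor
print(effective_n, factor, h2_liability)
```

The effective sample size equals the total sample size only for a balanced design, which is why a balanced sample prevalence is the natural companion of this quantity.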
Genetic architectures of 248 phenotypes from the UKBB
We use the same 356,409 unrelated individuals of Northwestern European ancestry as in the simulations. To form the test set, we randomly select 50,000 of these, while the other 306,409 are used to run a GWAS using linear regression (with function big_univLinReg from R package bigstatsr) for each of all 248 phenotypes and using eight covariates (material and methods). We first use the set of 1,054,330 HapMap3 variants recommended to use for LDpred2.21 Here, if not otherwise specified, we use options use_MLE = TRUE (i.e., the extended 3-parameter model and sampling scheme) and allow_jump_sign = FALSE (when disabled, this prevents effect sizes from changing sign without going through 0 first and has been proposed for extra robustness in Privé, Arbel, et al.34).
Consistent with simulations, inferred SNP heritability $h^2$ estimates from LDpred2-auto closely match those from LD Score regression, while generally being more precise, especially for phenotypes with a small polygenicity (Figure S24). Note that these $h^2$ estimates (and later the $r^2$ estimates) have not been transformed to the liability scale (i.e., are on the observed scale). Most phenotypes have an estimated polygenicity p between 0.001 and 0.04; these have, therefore, a very polygenic architecture, but not an infinitesimal one (Figure S25). Most phenotypes have an estimated α between −1.0 and −0.3 with a mode at −0.65 (Figure S26). As for the inferred predictive performance $r^2$ (from the Gibbs sampler of LDpred2-auto; material and methods), the estimates are highly consistent with the predictive performance in the test set; only for standing height are they overestimated (Figures 5 and S27). Heritability estimates for height are probably overestimated as well since we use similar formulas for estimating $h^2$ and $r^2$ (material and methods), and because the SNP heritability estimate for standing height is higher than values reported in the literature (also see section application to height).
To investigate whether estimates from LDpred2-auto are robust to some misspecifications, we test using two alternative LD references (material and methods). Using a smaller number of individuals for computing the LD matrix results in a slightly overestimated p and $h^2$ (and $r^2$) with LDpred2-auto, while the α estimate remains consistent, and the predictive performance in the test set remains mostly similar, except for three phenotypes for which none of the LDpred2-auto chains is usable (Figure S28). When using an LD reference from an alternative population (South Europe instead of Northwest Europe), p, $h^2$, and $r^2$ are slightly overestimated as well, and a few phenotypes have lower predictive performance, while there are four phenotypes for which none of the LDpred2-auto chains is usable (distinct from the previous three; Figure S29). Interestingly, using the previous approach (use_MLE = FALSE) seems to provide more robust results, where we can always get some chains not to diverge (and therefore get non-zero predictive performance) for the seven (three and four) phenotypes mentioned before when using the previous alternative LD references (Figure S30).
Then, we investigate the effect of disabling the LDpred2-auto parameter allow_jump_sign on the estimates from LDpred2-auto; when disabled, this prevents effect sizes from changing sign without going through 0 first and has been proposed for extra robustness in Privé, Arbel, et al.34 Consistent with simulations, p estimates from LDpred2-auto are conservatively lower than when allowing effects to “jump” sign (i.e., using a standard sampling; Figure S31). $h^2$ estimates can also be slightly lower, while α estimates are broadly consistent. Predictive performance (on the test set) is similar, suggesting there is no problem of robustness here (when using the Northwestern European LD reference) and a standard sampling can be used in this case (Figure S31).
Finally, we investigate different transformations to apply to some continuous phenotypes used here. Indeed, 49 of the phenotypes used here seem log-normally distributed or heavy-tailed (when visualizing their histogram); we therefore log-transform them. However, we do investigate alternative transformations here to decide which one should be preferred and to check how this impacts the inference and prediction from LDpred2-auto. We first compare to using raw (untransformed) phenotypes in Figure S32; estimates of p and α are highly consistent. However, $h^2$ estimates and predictive performance (in the test set) are generally larger with the log-transformation; it probably makes sense to transform these phenotypes. We then compare to using the rank-based inverse normal (RIN) transformation in Figure S33; estimates for p and α are also highly consistent. Except for bilirubin and lipoprotein(a) concentration, higher $h^2$ estimates and predictive performance are generally obtained with the RIN transformation than the log-transformation.
More heritability and predictive accuracy with the new set of variants
Here we use the same individuals as in the previous section. We investigate using the extended set of HapMap3 variants proposed here, HapMap3+ (material and methods), which includes ∼37% more variants on top of the HapMap3 variants recommended to use for LDpred2 (i.e., 1,054,330 + 389,866 variants), to improve the genome coverage of this set. As expected, compared to HapMap3, higher $h^2$ (average increase of 12.3% [95% CI: 10.8, 13.7]) and lower p (decrease of 11.5% [10.7, 12.3]) estimates are obtained with this extended set HapMap3+ (Figure 6). This is consistent with higher predictive performance $r^2$ in the test set (increase of 6.1% [4.1, 8.2]). In particular, a much larger $h^2$ estimate is obtained for lipoprotein(a) concentration (0.508 [0.502, 0.514] instead of 0.324 [0.320, 0.329]), which is also reflected in a larger predictive performance ($r^2$ in the test set of 0.516 [0.508, 0.524] instead of 0.344 [0.335, 0.353]). Interestingly, when using this extended set of HapMap3 variants, more chains are kept on average, which is a sign of better convergence of the models (Figure S34). However, running LDpred2 with this extended set of variants takes around 50% more time; yet we remind the reader that LDpred2 has been made much faster in Privé, Arbel, et al.34 and now runs in less than 1 h for 50 chains parallelized over 13 cores (Figure S35), instead of 4–12 h before.
Local heritability and polygenicity
In this section, we use the extended set of variants constructed here, HapMap3+, for which we define 431 independent LD blocks (material and methods). We compute local per-block $h^2$ estimates and report the UKBB phenotypes for which one block contributes to at least 10% of the total heritability of all blocks in Figure S36. For lipoprotein(a) concentration, “red hair” and “disorders of iron metabolism” (phecode 275.1), almost all heritability comes from one LD block only. We also perform the same analysis with external GWAS summary statistics for 90 cardiovascular proteins45; 22 (respectively, 8) of them have at least 50% (respectively, 80%) of their heritability explained by a single block (Figure S37).
Across 169 UKBB phenotypes with more than 25 chains kept, we compute the median heritability per block and compare it to the number of variants in these blocks; the median heritability explained by a block of variants is largely proportional to the number of variants in this block (Figure S38). The outlier block explaining a much larger heritability contains the HLA region. Across the same phenotypes, we then compute per-variant median probabilities of being causal and report them in a Manhattan-like plot in Figure S39. Some variants in multiple small regions across the genome have a larger probability of being causal across many phenotypes; interestingly, these are mapped to genes that are known to be associated with many different traits (up to more than 300) in the GWAS catalog.46 To verify that this is not driven by population structure, we compute pcadapt chi-squared statistics that quantify whether a variant is associated with population structure47; the log-statistics have only a small correlation of 5.9% with the probabilities of being causal. To verify that this does not correspond to regions of low LD, we compute LD scores; the median probabilities of being causal have a correlation of 32.0% with the LD scores and of 26.4% with the log of LD scores. Therefore, the posterior probabilities of being causal obtained from LDpred2-auto tend to slightly increase with LD scores.
Finally, since the HapMap3+ variants still represent a rather small proportion of common variants, we showcase running LDpred2-auto using a more dense set of variants from a small region, as usually done for fine-mapping. We identify the most significant HapMap3+ variant for height, rs2871960 on chromosome 3, and read all the variants within a 500 kb distance that have a MAF larger than 0.005 and INFO score larger than 0.5 in the UKBB; there are 3,881 such variants. Then, we perform a GWAS for these variants using the same 305,338 training individuals as used before (Figure S40). Finally, using the resulting GWAS summary statistics and an LD reference from the same set of individuals, we run both SuSiE-RSS and LDpred2-auto (_nojump). We test two different values for L, the maximum number of causal variants in SuSiE-RSS (10 and 100), and three different values for the maximum value of the estimated p (1, 0.01, and 0.001). Per-variant posterior probabilities of being causal (also known as PIPs) are very similar with both methods, especially when restricting p in LDpred2-auto, which is similar to what SuSiE-RSS assumes (more conservative; Figure S41).
Application to height
Here we run three LDpred2-auto models for height, one from the same 305K training UKBB individuals used before (with available height, out of 306K), one based on 100K UKBB individuals (as a random subset of the previous 305K), and one from a large GWAS meta-analysis of 1.6 million individuals of European genetic ancestries.48 We first infer the genetic ancestry proportions of individuals included in the meta-analysis using the method proposed in Privé36 and find that 81.9% are from Northwestern Europe, 9.5% are from Eastern Europe, 6.5% are from Finland, 1.5% are of Ashkenazi genetic ancestry, 0.3% are from Southwest Europe, and 0.2% are from West Africa. For this set of external GWAS summary statistics, we therefore use the same Northwestern European LD matrix as used in UKBB analyses. Note that we use the HapMap3+ set of 1,444,196 SNPs here; however, for the GWAS meta-analysis, only 1,013,499 SNPs (out of 1,373,020) are overlapping with HapMap3+ and passing quality control.
As expected,49 intercepts from LD Score regression increase with sample size: 1.02 (standard error [SE]: 0.008) with N = 100K, 1.11 (0.015) with N = 305K, and 2.31 (0.068) with N = 1.6 million. SNP heritability estimates are 64.6% (SE: 2.7), 59.7% (2.2), and 39.2% (1.7) with LD Score regression, respectively, and 60.2% (95% CI: 57.2, 63.2), 63.2% (62.0, 64.4), and 54.2% (53.9, 54.5) with LDpred2-auto. As expected, the estimated predictive performance $r^2$ (from the Gibbs sampler of LDpred2-auto) increases with sample size: 29.6% (28.7, 30.5), 42.7% (42.2, 43.1), and 47.0% (46.8, 47.1), respectively. Note that these $r^2$ estimates are probably overestimated by the same margin as the $h^2$ (SNP heritability) estimates and correspond to ∼49%, ∼67.5%, and ∼87% of $h^2$, respectively. Even with 1.6 million individuals in the meta-analysis, the predictive performance corresponds to only around 87% of the SNP heritability; therefore, an even larger sample size is required to better predict height. Polygenicity estimates from LDpred2-auto also increase with sample size, at 1.2% (1.0, 1.5), 2.3% (2.0, 2.5), and 5.9% (5.6, 6.3), consistent with results of simulations with a large polygenicity (p = 10%). Therefore, we estimate that height has at least 50,000 causal variants. These results are similar irrespective of whether allow_jump_sign is used or not, which is surprising to us. We also identify 1,753 SNPs with a greater than 95% probability of being causal (fine-mapping), which are spread over the entire genome (Figure S42). As for α estimates from LDpred2-auto, they remain consistent, at −0.71 (−0.75, −0.67), −0.74 (−0.76, −0.72), and −0.78 (−0.82, −0.76), respectively. Finally, we compute per-annotation heritability estimates from LDpred2-auto results to investigate functional enrichment. 
We perform this analysis using 50 non-flanking binary annotations from the baselineLD v.2.2 model.50 Heritability enrichments that we obtain from LDpred2-auto results are rather modest, ranging from 0.7 to 2.5 with a GWAS sample size of N = 305,000, and of slightly smaller magnitude with N = 100,000, and of slightly larger magnitude with N = 1.6 million (Figure S43).
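The ratios of predictive performance to SNP heritability quoted above for height can be checked with a few lines of arithmetic. The sketch below uses only the point estimates reported in the text. The rough count of causal variants at the end is an illustrative assumption (lower bound of the polygenicity interval multiplied by the number of matched variants) and is not necessarily the calculation the authors performed.

```python
# Point estimates for height quoted in the text (in %), ordered by GWAS sample size.
labels = ["N = 100K", "N = 305K", "N = 1.6M"]
h2_percent = [60.2, 63.2, 54.2]  # SNP heritability from LDpred2-auto
r2_percent = [29.6, 42.7, 47.0]  # predictive performance inferred from the Gibbs sampler

# Fraction of the SNP heritability captured by the polygenic score.
ratios = [r2 / h2 for r2, h2 in zip(r2_percent, h2_percent)]
for label, ratio in zip(labels, ratios):
    print(f"{label}: r2 / h2 = {ratio:.1%}")

# Rough lower bound on the number of causal variants (assumption of this sketch):
# lower bound of the polygenicity interval times the number of matched variants.
n_matched_variants = 1_013_499
p_lower_bound = 0.056
n_causal_lower = p_lower_bound * n_matched_variants
print(f"approximate lower bound on causal variants: {n_causal_lower:,.0f}")
```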
Application to other external GWAS summary statistics
A description of the eight external GWAS summary statistics used is provided in Table 1; these do not include UKBB individuals. Quality control of these summary statistics is performed as described in Privé, Arbel, et al.34 We run LDpred2-auto using either the HapMap3 or HapMap3+ variants, with either the extended or previous model and sampling (via parameter use_MLE). Because of the increased mismatch between external GWAS summary statistics and the LD reference we use here (compared to simulations and UKBB analyses), we also explore multiple values for parameter coef_shrink, which is a multiplicative coefficient for shrinking/regularizing off-diagonal elements of the LD matrix in LDpred2-auto. Note that to transform $h^2$ and $r^2$ estimates to the liability scale (except for vitamin D, which is a continuous trait), we use the prevalence in the UKBB as the population prevalence, which may be slightly biased.51,52 Results are presented in Figure S44. In terms of predictive performance, using the HapMap3+ variants provides equal or better predictive performance compared to using the HapMap3 variants, except for type 1 diabetes (T1D); therefore it seems more useful to use this new set of variants for larger sample sizes. Using the extended model and sampling scheme (with option use_MLE) provides equal or better predictive performance except for vitamin D, but proves to be less robust, especially when using coef_shrink close to 1 (low or no regularization of the LD matrix). The 3-parameter model also generally provides better predictive performance in the UKBB analyses, where power and robustness are often not an issue (Figure S45). Maximum $r^2$ values are achieved at different LD regularization coefficients coef_shrink across phenotypes, reflecting possible substantial mismatches between the GWAS summary statistics and LD reference used. 
However, results are virtually unchanged when regularizing the LD matrix (“hm3_plus_regul”) by multiplying correlations between variants i and j by a factor that depends on the distance in cM between the two variants (similar to Wen and Stephens53), which is surprising to us. As for estimates of $h^2$ and $r^2$ (inferred from the Gibbs sampler), they tend to increase with smaller values of coef_shrink. This is also the case for estimates of p when using use_MLE = TRUE. Estimates of α are largely constant but can become very small (capped at −1.5) in the case of robustness issues when using, e.g., use_MLE = TRUE and almost no regularization on the LD matrix (i.e., coef_shrink close to 1).
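To make the role of coef_shrink concrete, the following is a minimal sketch of shrinking the off-diagonal elements of a small correlation matrix by a multiplicative coefficient. It is a dense, pure-Python illustration under the description given in the text; the function `shrink_off_diagonal` is hypothetical and is not the bigsnpr implementation, which operates on much larger LD matrices.

```python
def shrink_off_diagonal(corr, coef_shrink):
    """Return a copy of a square correlation matrix in which every off-diagonal
    entry is multiplied by coef_shrink and the diagonal is left unchanged."""
    if not 0.0 <= coef_shrink <= 1.0:
        raise ValueError("coef_shrink is expected to lie in [0, 1]")
    n = len(corr)
    return [
        [corr[i][j] if i == j else coef_shrink * corr[i][j] for j in range(n)]
        for i in range(n)
    ]


# Toy 3 x 3 LD (correlation) matrix, for illustration only.
toy_ld = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
regularized = shrink_off_diagonal(toy_ld, coef_shrink=0.9)
```

With coef_shrink = 1 the matrix is returned unchanged (no regularization), while smaller values pull all pairwise correlations toward zero.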
Table 1.
Trait | GWAS citation | Effective GWAS sample size | # GWAS variants | # matched variants with HapMap3+ |
---|---|---|---|---|
Asthma | Demenais et al.54 | 67,341.2 | 2,001,280 | 946,092 |
Breast cancer (BRCA) | Michailidou et al.55 | 254,862.6 | 11,792,542 | 1,411,710 |
Coronary artery disease (CAD) | Nikpay et al.56 | 129,014.3 | 9,455,778 | 1,325,052 |
Depression (MDD) (without UKBB) | Wray et al.57 | 168,040.2 | 9,874,289 | 1,314,499 |
Prostate cancer (PRCA) | Schumacher et al.58 | 135,316.1 | 20,370,946 | 1,411,710 |
Type 1 diabetes (T1D) | Censin et al.59 | 13,497.6 | 8,996,866 | 1,127,489 |
Type 2 diabetes (T2D) | Scott et al.60 | 72,143.0 | 12,056,346 | 1,408,283 |
Vitamin D | Jiang et al.61 | 79,366 | 2,579,296 | 1,028,171 |
These do not include UKBB individuals. Note that some of them may contain a substantial amount of non-European genetic ancestry (e.g., >20% for CAD).
Discussion
LDpred2-auto was originally developed for building polygenic scores.21 Here we have extended the LDpred2-auto model and shown that it can be used to reliably infer genetic architecture parameters such as the SNP heritability (both genome-wide and more locally), polygenicity (and per-variant probabilities of being causal, also known as PIPs, useful for fine-mapping), and the selection-related parameter α. We remind readers that LDpred2 can also be used to infer the uncertainty of individual polygenic scores.24,25 We also introduce a way to infer the out-of-sample predictive performance $r^2$ of the resulting PGS, assuming the target sample has the same genetic ancestry as the GWAS used for training. Others have proposed to estimate $r^2$ from a reference genotype set37 or from an additional set of external GWAS summary statistics and LD38,39; here we only use the summary statistics that we input to LDpred2-auto. Results across 248 phenotypes demonstrate that most of these phenotypes are very polygenic, yet do not have an infinitesimal architecture (i.e., not all variants are causal); this is consistent with LDpred2-inf (assuming an infinitesimal architecture, i.e., p = 1) generally providing lower predictive performance than LDpred2-grid (testing different values for parameter p) or LDpred2-auto (estimating p).21 We also obtain widespread signatures of negative selection, with most α estimates between −1.0 and −0.3 and a mode at −0.65, consistent with previous findings from SBayesS, which reported a median S (same as α) of −0.578 (standard deviation [SD]: 0.096).15 That paper found consistent results with BayesS, although the original BayesS publication reported weaker effects, with a median of −0.37 (SD: 0.11).62 Schoech et al.13 found an average α of −0.38 (SE: 0.02) with GRM-MAF-LD, but they also noted that LDAK estimates were upward biased by 0.4 (originally centered at −0.25),12 suggesting that a likely value would be close to −0.65, which is what we obtain here.
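To give a sense of what an α of about −0.65 implies, here is a minimal sketch under a commonly used parameterization, assumed here rather than quoted from this section: the expected variance explained by a variant scales as [f(1 − f)]^(1 + α), so that the expected squared per-allele effect scales as [f(1 − f)]^α, where f is the allele frequency. Both function names are hypothetical.

```python
def relative_variance_explained(f: float, alpha: float) -> float:
    """Unnormalized expected variance explained by a variant with allele
    frequency f, ASSUMING it scales as [f(1 - f)]^(1 + alpha)."""
    if not 0.0 < f < 1.0:
        raise ValueError("allele frequency must be strictly between 0 and 1")
    return (f * (1.0 - f)) ** (1.0 + alpha)


def relative_effect_variance(f: float, alpha: float) -> float:
    """Unnormalized expected squared per-allele effect under the same
    assumption, i.e., [f(1 - f)]^alpha."""
    if not 0.0 < f < 1.0:
        raise ValueError("allele frequency must be strictly between 0 and 1")
    return (f * (1.0 - f)) ** alpha


alpha = -0.65  # mode of the estimates reported in the text
rare, common = 0.01, 0.5

# Rarer variants have larger expected per-allele effects when alpha < 0 ...
effect_ratio = relative_effect_variance(rare, alpha) / relative_effect_variance(common, alpha)
# ... but still explain less variance per variant as long as alpha > -1.
variance_ratio = relative_variance_explained(rare, alpha) / relative_variance_explained(common, alpha)
print(f"per-allele effect variance, rare vs common: {effect_ratio:.2f}x")
print(f"variance explained, rare vs common: {variance_ratio:.2f}x")
```

Under this assumed form, α = −1 would make the expected variance explained independent of allele frequency, and α = 0 would make the expected per-allele effect independent of it.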
However, when looking at the heritability enrichments of several functional annotations for height, we obtain much smaller magnitudes with LDpred2-auto than with stratified LD Score regression (S-LDSC).50 For example, Yengo et al.48 report fold enrichments of more than 10× for, e.g., coding and conserved variants, while we get less than 2×. This is partly due to LDpred2-auto estimates being more conservative, as they are shrunk toward no enrichment (the prior); however, we do use a very large sample size here, so the prior should not matter much. Another possible reason comes from using variants that capture the causal effects by LD, while these tagging variants may be annotated differently from the causal variants, which could cause functional enrichments to be diluted.63 We also note that this heritability partitioning is performed after running LDpred2-auto for each annotation independently; therefore, unlike S-LDSC, the LDpred2-auto heritability partitioning does not depend on the set of annotations used.
Here we have also extended the set of HapMap3 variants recommended for use with LDpred2, making it 37% larger to offer better coverage of the genome. Since we used individuals of diverse ancestries for computing the pairwise variant correlations used for constructing this extended set of variants, this new set of variants is beneficial across diverse ancestries (material and methods). Increasing the genome coverage enables us to capture more of the heritability of phenotypes and therefore reduce the missing heritability, i.e., the difference between the family-based heritability and the SNP-based heritability. Using the new HapMap3+ set also improves predictive performance by an average of 6.1% in UKBB analyses here, and particularly for lipoprotein(a) concentration, with an $r^2$ of 0.516 instead of 0.344. However, we note that we are able to achieve an $r^2$ of 0.677 (0.671, 0.682) when using the penalized regression implementation of Privé et al.64 on the UKBB individual-level data while restricting to all variants within a 1 Mb window of the LPA gene. This means that this extended SNP set is still not tagging all variants perfectly, and that it might be preferable to use a more localized set of variants for phenotypes for which most of the heritability is contained in a single region of the genome. When using external GWAS summary statistics, using HapMap3+ instead of HapMap3 variants was more beneficial for larger sample sizes. Our intuition and conclusion is that using more variants is beneficial when power is sufficient; however, when power is low (e.g., small N, small $h^2$, and/or large p), it may actually be detrimental.
Our proposed method has limitations. First, when power is low (i.e., when is low), estimates of α and p become less reliable. Therefore, we recommend using the 3-parameter model (with α, when using use_MLE = TRUE), but only when power is sufficient and when robustness is not an issue (e.g., without substantial misspecifications such as an ancestry mismatch between the GWAS and LD panels). We also recommend using option allow_jump_sign = FALSE for robustness34 and for getting accurate or conservative p estimates. When still obtaining a large p estimate () with this option, we recommend rerunning LDpred2-auto without this option to get a less conservative estimate (cf. simulations). However, LDpred2-auto estimates of $h^2$ and $r^2$ seem always reliable, except for height, for which they are probably overestimated. We think this is likely due to assortative mating, which causes GWAS effects to be inflated,65,66 which in turn causes our estimates to be inflated as well. Second, the $h^2$ from LDpred2-auto is also slightly overestimated when using a small LD reference panel or when the reference panel does not closely match the ancestry of the GWAS summary statistics. Future work could focus on correcting these issues. Third, when using external GWAS summary statistics, it is often beneficial to regularize the LD matrix (via parameter coef_shrink, especially when using the extended model and sampling); however, this leads to inflated estimates of, e.g., $h^2$ and $r^2$. Future work could focus on correcting these estimates and also on identifying the best shrinkage regularization coefficient to apply based on, e.g., some distance metric (mismatch) between GWAS summary statistics and the LD reference used. Fourth, because we use a limited set of variants as input for LDpred2, causal variants identified by LDpred2-auto are probably tagging variants that are highly correlated with unobserved causal variants close by. 
LDpred2-auto can also miss some causal variants when they are poorly tagged by the set of variants used. Future work will focus on scaling LDpred2-auto to using more variants to alleviate this limitation. LDpred2-auto runtime currently depends on the number of causal variants (or, equivalently, the polygenicity). When the polygenicity is close to 0, it runs linearly with the number of variants. When the polygenicity is close to 1, it runs quadratically. However, the polygenicity is almost always lower than 0.1 for human traits and diseases (cf. UKBB results here), which makes running LDpred2-auto efficient (we report runtimes always under 1 h, even with HapMap3+ variants). Currently, the main issue is the large size of the LD matrix that grows quadratically with the number of variants used. For now, to identify causal variants, one can concentrate on small regions of the genome, where LDpred2-auto can be rerun using a much more dense set of variants, as typically done in fine-mapping analyses (cf. results section local heritability and polygenicity).
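The quadratic growth of the LD matrix mentioned above can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a fully dense matrix of 8-byte values, which is a worst case used only to show the scaling; LD matrices used in practice are typically stored far more compactly (for example, by keeping only correlations between nearby variants), so the absolute numbers are not those of the actual software.

```python
def dense_ld_size_gb(n_variants: int, bytes_per_value: int = 8) -> float:
    """Size in gigabytes of a dense n x n matrix (worst-case assumption)."""
    return n_variants ** 2 * bytes_per_value / 1e9


n_hm3_plus = 1_444_196  # number of HapMap3+ variants quoted in the text
size_hm3_plus = dense_ld_size_gb(n_hm3_plus)
size_doubled = dense_ld_size_gb(2 * n_hm3_plus)

print(f"dense LD matrix for HapMap3+: about {size_hm3_plus:,.0f} GB")
print(f"doubling the variants multiplies the size by {size_doubled / size_hm3_plus:.0f}")
```

Doubling the number of variants quadruples the size in this dense setting, which is why restricting to a small region with a denser set of variants, as done for fine-mapping, remains tractable.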
Nevertheless, LDpred2-auto users can now get much from running a single method. The reliable estimates provided by LDpred2-auto are very encouraging to further extend LDpred2-auto in multiple directions. As future research directions, we are interested in using LDpred2-auto for GWAS summary statistics imputation,67,68 for genetic correlation estimation,11,69,70,71,72 for multi-ancestry prediction and inference,73,74,75,76 as well as extending it to use more variants and to leverage functional annotations.63,77,78
Data and code availability
- All code used for this paper is available at https://github.com/privefl/paper-infer/tree/main/code
- Descriptions of UK Biobank phenotypes used here can be found at https://github.com/privefl/paper-infer/blob/main/phenotype-description.tsv
- Simulation and UKBB results from this study can be found at https://github.com/privefl/paper-infer/tree/main/results
Acknowledgments
The authors thank the members of the NCRR/QGG StatGen group and Marc-André Legault for helpful discussions, as well as the reviewers for their useful comments. The authors also thank GenomeDK and Aarhus University for providing computational resources and support that contributed to these research results. This research has been conducted using the UK Biobank Resource under application number 58024; the authors thank all the UK Biobank participants for contributing to such useful data for research. F.P., C.A. and B.J.V. are supported by the Danish National Research Foundation (Niels Bohr Professorship to Prof. John McGrath), the Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH (R102-A9118, R155-2014-1724, R248-2017-2003), and a Lundbeck Foundation Fellowship (R335-2019-2339 to B.J.V.).
Declaration of interests
B.J.V. is on Allelica’s international advisory board. The other authors have no competing interests to declare.
Published: November 8, 2023
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2023.10.010.
Web resources
LD matrices for HapMap3+ variants computed from the Northwestern European UKBB data used in this paper, https://doi.org/10.6084/m9.figshare.21305061
Official tutorial on running LDpred2 with R package bigsnpr (for both prediction and inference), https://privefl.github.io/bigsnpr/articles/LDpred2.html
roptim: an R package for general purpose optimization with C++, https://cran.r-project.org/package=roptim
References
- 1.Sullivan P.F., Kendler K.S., Neale M.C. Schizophrenia as a complex trait: evidence from a meta-analysis of twin studies. Arch. Gen. Psychiatry. 2003;60:1187–1192. doi: 10.1001/archpsyc.60.12.1187. [DOI] [PubMed] [Google Scholar]
- 2.Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.A., Heath A.C., Martin N.G., Montgomery G.W., et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.O’Connor L.J., Schoech A.P., Hormozdiari F., Gazal S., Patterson N., Price A.L. Extreme polygenicity of complex traits is explained by negative selection. Am. J. Hum. Genet. 2019;105:456–476. doi: 10.1016/j.ajhg.2019.07.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Trubetskoy V., Pardiñas A.F., Qi T., Panagiotaropoulou G., Awasthi S., Bigdeli T.B., Bryois J., Chen C.-Y., Dennison C.A., Hall L.S., et al. Mapping genomic loci implicates genes and synaptic biology in schizophrenia. Nature. 2022;604:502–508. doi: 10.1038/s41586-022-04434-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Speed D., Hemani G., Johnson M.R., Balding D.J. Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 2012;91:1011–1021. doi: 10.1016/j.ajhg.2012.10.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Yang J., Lee S.H., Goddard M.E., Visscher P.M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Loh P.-R., Bhatia G., Gusev A., Finucane H.K., Bulik-Sullivan B.K., Pollack S.J., Schizophrenia Working Group of Psychiatric Genomics Consortium. de Candia T.R., Lee S.H., Wray N.R., et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nat. Genet. 2015;47:1385–1392. doi: 10.1038/ng.3431. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Bulik-Sullivan B.K., Loh P.-R., Finucane H.K., Ripke S., Yang J., Schizophrenia Working Group of the Psychiatric Genomics Consortium. Patterson N., Daly M.J., Price A.L., Neale B.M. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Benner C., Spencer C.C.A., Havulinna A.S., Salomaa V., Ripatti S., Pirinen M. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics. 2016;32:1493–1501. doi: 10.1093/bioinformatics/btw018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Shi H., Kichaev G., Pasaniuc B. Contrasting the genetic architecture of 30 complex traits from summary association data. Am. J. Hum. Genet. 2016;99:139–153. doi: 10.1016/j.ajhg.2016.05.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Speed D., Balding D.J. SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat. Genet. 2019;51:277–284. doi: 10.1038/s41588-018-0279-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Speed D., Holmes J., Balding D.J. Evaluating and improving heritability models using summary statistics. Nat. Genet. 2020;52:458–462. doi: 10.1038/s41588-020-0600-y. [DOI] [PubMed] [Google Scholar]
- 13.Schoech A.P., Jordan D.M., Loh P.-R., Gazal S., O’Connor L.J., Balick D.J., Palamara P.F., Finucane H.K., Sunyaev S.R., Price A.L. Quantification of frequency-dependent genetic architectures in 25 UK Biobank traits reveals action of negative selection. Nat. Commun. 2019;10:790–810. doi: 10.1038/s41467-019-08424-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Wang G., Sarkar A., Carbonetto P., Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B Stat. Methodol. 2020;82:1273–1300. doi: 10.1111/rssb.12388. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Zeng J., Xue A., Jiang L., Lloyd-Jones L.R., Wu Y., Wang H., Zheng Z., Yengo L., Kemper K.E., Goddard M.E., et al. Widespread signatures of natural selection across human complex traits and functional genomic categories. Nat. Commun. 2021;12:1164–1212. doi: 10.1038/s41467-021-21446-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Johnson R., Burch K.S., Hou K., Paciuc M., Pasaniuc B., Sankararaman S. Estimation of regional polygenicity from gwas provides insights into the genetic architecture of complex traits. PLoS Comput. Biol. 2021;17 doi: 10.1371/journal.pcbi.1009483. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Daetwyler H.D., Villanueva B., Woolliams J.A. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One. 2008;3:e3395. doi: 10.1371/journal.pone.0003395. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Torkamani A., Wineinger N.E., Topol E.J. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 2018;19:581–590. doi: 10.1038/s41576-018-0018-x. [DOI] [PubMed] [Google Scholar]
- 19.Lambert S.A., Abraham G., Inouye M. Towards clinical utility of polygenic risk scores. Hum. Mol. Genet. 2019;28:R133–R142. doi: 10.1093/hmg/ddz187. [DOI] [PubMed] [Google Scholar]
- 20.Kumuthini J., Zick B., Balasopoulou A., Chalikiopoulou C., Dandara C., El-Kamah G., Findley L., Katsila T., Li R., Maceda E.B., et al. The clinical utility of polygenic risk scores in genomic medicine practices: a systematic review. Hum. Genet. 2022;141:1697–1704. doi: 10.1007/s00439-022-02452-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Privé F., Arbel J., Vilhjálmsson B.J. LDpred2: better, faster, stronger. Bioinformatics. 2020;36:5424–5431. doi: 10.1093/bioinformatics/btaa1029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Pain O., Glanville K.P., Hagenaars S.P., Selzam S., Fürtjes A.E., Gaspar H.A., Coleman J.R.I., Rimfeld K., Breen G., Plomin R., et al. Evaluation of polygenic prediction methodology within a reference-standardized framework. PLoS Genet. 2021;17 doi: 10.1371/journal.pgen.1009021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Kulm S., Marderstein A., Mezey J., Elemento O. A systematic framework for assessing the clinical impact of polygenic risk scores. medRxiv. 2021 doi: 10.1101/2020.04.06.20055574. Preprint at. [DOI] [Google Scholar]
- 24.Ding Y., Hou K., Burch K.S., Lapinska S., Privé F., Vilhjálmsson B., Sankararaman S., Pasaniuc B. Large uncertainty in individual polygenic risk score estimation impacts PRS-based risk stratification. Nat. Genet. 2022;54:30–39. doi: 10.1038/s41588-021-00961-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Ding Y., Hou K., Xu Z., Pimplaskar A., Petter E., Boulier K., Privé F., Vilhjálmsson B.J., Olde Loohuis L.M., Pasaniuc B. Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature. 2023;618:774–781. doi: 10.1038/s41586-023-06079-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Privé F., Aschard H., Ziyatdinov A., Blum M.G.B. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics. 2018;34:2781–2787. doi: 10.1093/bioinformatics/bty185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Bengtsson H. A unifying framework for parallel and distributed processing in R using futures. The R Journal. 2021;13:208–291. [Google Scholar]
- 28.Wickham H., Averick M., Bryan J., Chang W., McGowan L., François R., Grolemund G., Hayes A., Henry L., Hester J., et al. Welcome to the tidyverse. J. Open Source Softw. 2019;4:1686. [Google Scholar]
- 29.Bycroft C., Freeman C., Petkova D., Band G., Elliott L.T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Privé F., Luu K., Blum M.G.B., McGrath J.J., Vilhjálmsson B.J. Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics. 2020;36:4449–4457. doi: 10.1093/bioinformatics/btaa520. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Privé F., Aschard H., Carmi S., Folkersen L., Hoggart C., O’Reilly P.F., Vilhjálmsson B.J. Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. Am. J. Hum. Genet. 2022;109:12–23. doi: 10.1016/j.ajhg.2021.11.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Carroll R.J., Bastarache L., Denny J.C. R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment. Bioinformatics. 2014;30:2375–2376. doi: 10.1093/bioinformatics/btu197. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Wu P., Gifford A., Meng X., Li X., Campbell H., Varley T., Zhao J., Carroll R., Bastarache L., Denny J.C., et al. Mapping ICD-10 and ICD-10-CM codes to phecodes: workflow development and initial evaluation. JMIR Med. Inform. 2019;7 doi: 10.2196/14325. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Privé F., Arbel J., Aschard H., Vilhjálmsson B.J. Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. HGG Adv. 2022;3 doi: 10.1016/j.xhgg.2022.100136. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Privé F. Optimal linkage disequilibrium splitting. Bioinformatics. 2021;38:255–256. doi: 10.1093/bioinformatics/btab519. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Privé F. Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics. Bioinformatics. 2022;38:3477–3480. doi: 10.1093/bioinformatics/btac348. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Mak T.S.H., Porsch R.M., Choi S.W., Zhou X., Sham P.C. Polygenic scores via penalized regression on summary statistics. Genet. Epidemiol. 2017;41:469–480. doi: 10.1002/gepi.22050. [DOI] [PubMed] [Google Scholar]
- 38.Pattee J., Pan W. Penalized regression and model selection methods for polygenic scores on summary statistics. PLoS Comput. Biol. 2020;16 doi: 10.1371/journal.pcbi.1008271. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Witteveen M.J., Pedersen E.M., Meijsen J., Andersen M.R., Privé F., Speed D., Vilhjálmsson B.J. Publicly available privacy-preserving benchmarks for polygenic prediction. bioRxiv. 2022 doi: 10.1101/2022.10.10.510645. Preprint at. [DOI] [Google Scholar]
- 40.Vehtari A., Gelman A., Simpson D., Carpenter B., Bürkner P.-C. Rank-normalization, folding, and localization: An improved rhat for assessing convergence of mcmc (with discussion) Bayesian Analysis. 2021;16:667–718. [Google Scholar]
- 41.Anderson T.W. On the distribution of the two-sample Cramer-von Mises criterion. Ann. Math. Statist. 1962;33:1148–1159. [Google Scholar]
- 42.Zou Y., Carbonetto P., Wang G., Stephens M. Fine-mapping from summary data with the “Sum of Single Effect” model. PLoS Genet. 2022;18 doi: 10.1371/journal.pgen.1010299. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Lee S.H., Wray N.R., Goddard M.E., Visscher P.M. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 2011;88:294–305. doi: 10.1016/j.ajhg.2011.02.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Grotzinger A.D., Fuente J.d.l., Privé F., Nivard M.G., Tucker-Drob E.M. Pervasive downward bias in estimates of liability-scale heritability in genome-wide association study meta-analysis: a simple solution. Biol. Psychiatry. 2023;93:29–36. doi: 10.1016/j.biopsych.2022.05.029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Folkersen L., Gustafsson S., Wang Q., Hansen D.H., Hedman Å.K., Schork A., Page K., Zhernakova D.V., Wu Y., Peters J., et al. Genomic and drug target evaluation of 90 cardiovascular proteins in 30,931 individuals. Nat. Metab. 2020;2:1135–1148. doi: 10.1038/s42255-020-00287-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Buniello A., MacArthur J.A.L., Cerezo M., Harris L.W., Hayhurst J., Malangone C., McMahon A., Morales J., Mountjoy E., Sollis E., et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics. Nucleic Acids Res. 2019;47:D1005–D1012. doi: 10.1093/nar/gky1120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Privé F., Luu K., Vilhjálmsson B.J., Blum M.G.B. Performing highly efficient genome scans for local adaptation with R package pcadapt version 4. Mol. Biol. Evol. 2020;37:2153–2154. doi: 10.1093/molbev/msaa053. [DOI] [PubMed] [Google Scholar]
- 48.Yengo L., Vedantam S., Marouli E., Sidorenko J., Bartell E., Sakaue S., Graff M., Eliasen A.U., Jiang Y., Raghavan S., et al. A saturated map of common genetic variants associated with human height. Nature. 2022;610:704–712. doi: 10.1038/s41586-022-05275-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Loh P.-R., Kichaev G., Gazal S., Schoech A.P., Price A.L. Mixed-model association for biobank-scale datasets. Nat. Genet. 2018;50:906–908. doi: 10.1038/s41588-018-0144-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Finucane H.K., Reshef Y.A., Anttila V., Slowikowski K., Gusev A., Byrnes A., Gazal S., Loh P.-R., Lareau C., Shoresh N., et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat. Genet. 2018;50:621–629. doi: 10.1038/s41588-018-0081-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Fry A., Littlejohns T.J., Sudlow C., Doherty N., Adamska L., Sprosen T., Collins R., Allen N.E. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am. J. Epidemiol. 2017;186:1026–1034. doi: 10.1093/aje/kwx246. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.van Alten S., Domingue B.W., Galama T.J., Marees A.T. Reweighting the UK Biobank to Reflect its Underlying Sampling Population Substantially Reduces Pervasive Selection Bias Due to Volunteering. medRxiv. 2022 doi: 10.1101/2022.05.16.22275048. Preprint at. [DOI] [Google Scholar]
- 53. Wen X., Stephens M. Using linear predictors to impute allele frequencies from summary or pooled genotype data. Ann. Appl. Stat. 2010;4:1158–1182. doi: 10.1214/10-aoas338.
- 54. Demenais F., Margaritte-Jeannin P., Barnes K.C., Cookson W.O.C., Altmüller J., Ang W., Barr R.G., Beaty T.H., Becker A.B., Beilby J., et al. Multiancestry association study identifies new asthma risk loci that colocalize with immune-cell enhancer marks. Nat. Genet. 2018;50:42–53. doi: 10.1038/s41588-017-0014-7.
- 55. Michailidou K., Lindström S., Dennis J., Beesley J., Hui S., Kar S., Lemaçon A., Soucy P., Glubb D., Rostamianfar A., et al. Association analysis identifies 65 new breast cancer risk loci. Nature. 2017;551:92–94. doi: 10.1038/nature24284.
- 56. Nikpay M., Goel A., Won H.-H., Hall L.M., Willenborg C., Kanoni S., Saleheen D., Kyriakou T., Nelson C.P., Hopewell J.C., et al. A comprehensive 1000 genomes–based genome-wide association meta-analysis of coronary artery disease. Nat. Genet. 2015;47:1121–1130. doi: 10.1038/ng.3396.
- 57. Wray N.R., Ripke S., Mattheisen M., Trzaskowski M., Byrne E.M., Abdellaoui A., Adams M.J., Agerbo E., Air T.M., Andlauer T.M.F., et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat. Genet. 2018;50:668–681. doi: 10.1038/s41588-018-0090-3.
- 58. Schumacher F.R., Al Olama A.A., Berndt S.I., Benlloch S., Ahmed M., Saunders E.J., Dadaev T., Leongamornlert D., Anokian E., Cieza-Borrella C., et al. Association analyses of more than 140,000 men identify 63 new prostate cancer susceptibility loci. Nat. Genet. 2018;50:928–936. doi: 10.1038/s41588-018-0142-8.
- 59. Censin J.C., Nowak C., Cooper N., Bergsten P., Todd J.A., Fall T. Childhood adiposity and risk of type 1 diabetes: A Mendelian randomization study. PLoS Med. 2017;14 doi: 10.1371/journal.pmed.1002362.
- 60. Scott R.A., Scott L.J., Mägi R., Marullo L., Gaulton K.J., Kaakinen M., Pervjakova N., Pers T.H., Johnson A.D., Eicher J.D., et al. An expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes. 2017;66:2888–2902. doi: 10.2337/db16-1253.
- 61. Jiang X., O’Reilly P.F., Aschard H., Hsu Y.-H., Richards J.B., Dupuis J., Ingelsson E., Karasik D., Pilz S., Berry D., et al. Genome-wide association study in 79,366 European-ancestry individuals informs the genetic architecture of 25-hydroxyvitamin D levels. Nat. Commun. 2018;9:260–312. doi: 10.1038/s41467-017-02662-2.
- 62. Zeng J., De Vlaming R., Wu Y., Robinson M.R., Lloyd-Jones L.R., Yengo L., Yap C.X., Xue A., Sidorenko J., McRae A.F., et al. Signatures of negative selection in the genetic architecture of human complex traits. Nat. Genet. 2018;50:746–753. doi: 10.1038/s41588-018-0101-4.
- 63. Zheng Z., Liu S., Sidorenko J., Yengo L., Turley P., Ani A., Wang R., Nolte I.M., Snieder H., Yang J., et al. Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries. Preprint at bioRxiv. 2022. doi: 10.1101/2022.10.12.510418.
- 64. Privé F., Aschard H., Blum M.G.B. Efficient implementation of penalized regression for genetic risk prediction. Genetics. 2019;212:65–74. doi: 10.1534/genetics.119.302019.
- 65. Border R., O’Rourke S., de Candia T., Goddard M.E., Visscher P.M., Yengo L., Jones M., Keller M.C. Assortative mating biases marker-based heritability estimators. Nat. Commun. 2022;13:660–710. doi: 10.1038/s41467-022-28294-9.
- 66. Herzig A.F., Noûs C., Pierre A.S., Perdry H. A model for co-occurrent assortative mating and vertical cultural transmission and its impact on measures of genetic associations. Preprint at bioRxiv. 2023. doi: 10.1101/2023.04.08.536101.
- 67. Rüeger S., McDaid A., Kutalik Z. Evaluation and application of summary statistic imputation to discover new height-associated loci. PLoS Genet. 2018;14 doi: 10.1371/journal.pgen.1007371.
- 68. Julienne H., Shi H., Pasaniuc B., Aschard H. RAISS: robust and accurate imputation from summary statistics. Bioinformatics. 2019;35:4837–4839. doi: 10.1093/bioinformatics/btz466.
- 69. Bulik-Sullivan B., Finucane H.K., Anttila V., Gusev A., Day F.R., Loh P.-R., ReproGen Consortium. Psychiatric Genomics Consortium. Genetic Consortium for Anorexia Nervosa of the Wellcome Trust Case Control Consortium 3. Duncan L., et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 2015;47:1236–1241. doi: 10.1038/ng.3406.
- 70. Shi H., Mancuso N., Spendlove S., Pasaniuc B. Local genetic correlation gives insights into the shared genetic architecture of complex traits. Am. J. Hum. Genet. 2017;101:737–751. doi: 10.1016/j.ajhg.2017.09.022.
- 71. Frei O., Holland D., Smeland O.B., Shadrin A.A., Fan C.C., Maeland S., O’Connell K.S., Wang Y., Djurovic S., Thompson W.K., et al. Bivariate causal mixture model quantifies polygenic overlap between complex traits beyond genetic correlation. Nat. Commun. 2019;10:2417–2511. doi: 10.1038/s41467-019-10310-0.
- 72. Werme J., van der Sluis S., Posthuma D., de Leeuw C.A. An integrated framework for local genetic correlation analysis. Nat. Genet. 2022;54:274–282. doi: 10.1038/s41588-022-01017-y.
- 73. Brown B.C., Asian Genetic Epidemiology Network Type 2 Diabetes Consortium. Ye C.J., Price A.L., Zaitlen N., et al. Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet. 2016;99:76–88. doi: 10.1016/j.ajhg.2016.05.001.
- 74. Shi H., Burch K.S., Johnson R., Freund M.K., Kichaev G., Mancuso N., Manuel A.M., Dong N., Pasaniuc B. Localizing components of shared transethnic genetic architecture of complex traits from GWAS summary data. Am. J. Hum. Genet. 2020;106:805–817. doi: 10.1016/j.ajhg.2020.04.012.
- 75. Ruan Y., Lin Y.-F., Feng Y.-C.A., Chen C.-Y., Lam M., Guo Z., Stanley Global Asia Initiatives. He L., Sawa A., Martin A.R., et al. Improving polygenic prediction in ancestrally diverse populations. Nat. Genet. 2022;54:573–580. doi: 10.1038/s41588-022-01054-7.
- 76. Lu Z., Gopalan S., Yuan D., Conti D.V., Pasaniuc B., Gusev A., Mancuso N. Multi-ancestry fine-mapping improves precision to identify causal genes in transcriptome-wide association studies. Am. J. Hum. Genet. 2022;109:1388–1404. doi: 10.1016/j.ajhg.2022.07.002.
- 77. Zhang Q., Privé F., Vilhjálmsson B., Speed D. Improved genetic prediction of complex traits from individual-level data or summary statistics. Nat. Commun. 2021;12:1–9. doi: 10.1038/s41467-021-24485-y.
- 78. Márquez-Luna C., Gazal S., Loh P.-R., Kim S.S., Furlotte N., Auton A., Price A.L. Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nat. Commun. 2021;12:1–11. doi: 10.1038/s41467-021-25171-9.
Data Availability Statement
- All code used for this paper is available at https://github.com/privefl/paper-infer/tree/main/code
- Descriptions of UK Biobank phenotypes used here can be found at https://github.com/privefl/paper-infer/blob/main/phenotype-description.tsv
- Simulation and UKBB results from this study can be found at https://github.com/privefl/paper-infer/tree/main/results