Skip to main content
PLOS Genetics logoLink to PLOS Genetics
. 2020 May 19;16(5):e1008612. doi: 10.1371/journal.pgen.1008612

Beyond SNP heritability: Polygenicity and discoverability of phenotypes estimated with a univariate Gaussian mixture model

Dominic Holland 1,2,*, Oleksandr Frei 3, Rahul Desikan 4, Chun-Chieh Fan 1,5,6, Alexey A Shadrin 3, Olav B Smeland 3,7, V S Sundar 1,5, Paul Thompson 8, Ole A Andreassen 3,7, Anders M Dale 1,2,5,9
Editor: Xiaofeng Zhu10
PMCID: PMC7272101  PMID: 32427991

Abstract

Estimating the polygenicity (proportion of causally associated single nucleotide polymorphisms (SNPs)) and discoverability (effect size variance) of causal SNPs for human traits is currently of considerable interest. SNP-heritability is proportional to the product of these quantities. We present a basic model, using detailed linkage disequilibrium structure from a reference panel of 11 million SNPs, to estimate these quantities from genome-wide association studies (GWAS) summary statistics. We apply the model to diverse phenotypes and validate the implementation with simulations. We find model polygenicities (as a fraction of the reference panel) ranging from ≃ 2 × 10−5 to ≃ 4 × 10−3, with discoverabilities similarly ranging over two orders of magnitude. A power analysis allows us to estimate the proportions of phenotypic variance explained additively by causal SNPs reaching genome-wide significance at current sample sizes, and map out sample sizes required to explain larger portions of additive SNP heritability. The model also allows for estimating residual inflation (or deflation from over-correcting of z-scores), and assessing compatibility of replication and discovery GWAS summary statistics.

Author summary

There are ∼10 million common variants in the genome of humans with European ancestry. For any particular phenotype a number of these variants will have some causal effect. It is of great interest to be able to quantify the number of these causal variants and the strength of their effect on the phenotype.

Genome wide association studies (GWAS) produce very noisy summary statistics for the association between subsets of common variants and phenotypes. For any phenotype, these statistics collectively are difficult to interpret, but buried within them is the true landscape of causal effects. In this work, we posit a probability distribution for the causal effects, and assess its validity using simulations. Using a detailed reference panel of ≃11 million common variants – among which only a small fraction are likely to be causal, but allowing for non-causal variants to show an association with the phenotype due to correlation with causal variants—we implement an exact procedure for estimating the number of causal variants and their mean strength of association with the phenotype. We find that, across different phenotypes, both these quantities—whose product allows for lower bound estimates of heritability—vary by orders of magnitude.

Introduction

The genetic components of complex human traits and diseases arise from hundreds to likely many thousands of single nucleotide polymorphisms (SNPs) [1], most of which have weak effects. As sample sizes increase, more of the associated SNPs are identifiable (they reach genome-wide significance), though power for discovery varies widely across phenotypes. Of particular interest are estimating the proportion of common SNPs from a reference panel (polygenicity) involved in any particular phenotype; their effective strength of association (discoverability, or causal effect size variance); the proportion of variation in susceptibility, or phenotypic variation, captured additively by all common causal SNPs (approximately, the narrow sense heritability), and the fraction of that captured by genome-wide significant SNPs—all of which are active areas of research [29]. The effects of population structure [10], combined with high polygenicity and linkage disequilibrium (LD), leading to spurious degrees of SNP association, or inflation, considerably complicate matters, and are also areas of much focus [1113]. Despite these challenges, there have been recent significant advances in the development of mathematical models of polygenic architecture based on GWAS [14, 15]. One of the advantages of these models is that they can be used for power estimation in human phenotypes, enabling prediction of the capabilities of future GWAS.

Here, in a unified approach explicitly taking into account LD, we present a model relying on genome-wide association studies (GWAS) summary statistics (z-scores for SNP associations with a phenotype [16]) to estimate polygenicity (π1, the proportion of causal variants in the underlying reference panel of approximately 11 million SNPs from a sample size of 503) and discoverability (σβ2, the causal effect size variance), as well as elevation of z-scores due to any residual inflation of the z-scores arising from variance distortion (σ02, which for example can be induced by cryptic relatedness), which remains a concern in large-scale studies [10]. We estimate π1, σβ2, and σ02, by postulating a z-score probability distribution function (pdf) that explicitly depends on them, and fitting it to the actual distribution of GWAS z-scores.

Estimates of polygenicity and discoverability allow one to estimate compound quantities, like narrow-sense heritability captured by the SNPs [17]; to predict the power of larger-scale GWAS to discover genome-wide significant loci; and to understand why some phenotypes have higher power for SNP discovery and proportion of heritability explained than other phenotypes.

In previous work [18] we presented a related model that treated the overall effects of LD on z-scores in an approximate way. Here we take the details of LD explicitly into consideration, resulting in a conceptually more basic model to predict the distribution of z-scores. We apply the model to multiple phenotype datasets, in each case estimating the three model parameters and auxiliary quantities, including the overall inflation factor λ, (traditionally referred to as genomic control [19]) and narrow sense heritability, h2. We also perform extensive simulations on genotypes with realistic LD structure in order to validate the interpretation of the model parameters. A discussion of the relation of the present paper to other work is provided in the first section of the S1 Appendix (pp. S2-S3).

Materials and methods

Overview

Our basic model is a simple postulate for the distribution of causal effects (denoted β below) [20]. Our model assumes that only a fraction of all SNPs are in some sense causally related to any given phenotype. We work with a reference panel of approximately 11 million SNPs with 503 samples, and assume that all common causal SNPs (minor allele frequency (MAF) > 0.002) are contained in it. Any given GWAS will have z-scores for a subset of these reference SNPs (we use the term “typed” below to refer to GWAS SNPs with z-scores, whether they were directly genotyped or their genotype was imputed). When a z-score partially involves a latent causal component (i.e., not pure noise), we assume that it arises through LD with neighboring causal SNPs, or that it itself is causal.

We construct a pdf for z-scores that directly follows from the underlying distribution of effects. For any given typed SNP’s z-score, it is dependent on the other SNPs the focal SNP is in LD with (SNPs that are “tagged” by the focal SNP), taking into account their LD with the focal SNP and their heterozygosity (i.e., it depends not just on the focal typed SNP’s total LD and heterozygosity, but also on the distribution of neighboring reference SNPs in LD with it and their heterozygosities). We present two ways of constructing the model pdf for z-scores, using multinomial expansion, and using convolution. The former is perhaps more intuitive, but the latter is more numerically tractable, yielding an exact solution, and is used here to obtain all reported results. The problem then is finding the three model parameters that give a maximum likelihood best fit for the model’s prediction of the distribution of z-scores to the actual distribution of z-scores. Because we are fitting three parameters typically using ≳106 data points, it is appropriate to incorporate some data reduction to facilitate the computations. To that end, we bin the data (z-scores) into a 10 × 10 grid of heterozygosity-by-total LD (having tested different grid sizes to ensure convergence of results). Also, when building the LD and heterozygosity structures of reference SNPs, we fine-grained the LD range (0 ≤ r2 ≤ 1), again ensuring that bins were small enough that results were well converged. To fit the model to the data we bin the z-scores (within each heterozygosity/total LD window) and calculate the multinomial probability for having the actual distribution of z-scores (numbers of z-scores in the z-score bins) given the model pdf for the distribution of z-scores, and adjusting the model parameters using a multidimensional unconstrained nonlinear minimization (Nelder-Mead), so as to maximize the likelihood of the data, given the parameters.

A visual summary of the predicted and actual distribution of z-scores is obtained by making quantile-quantile plots showing, for a wide range of significance thresholds going well beyond genome-wide significance, the proportion (x-axis) of typed SNPs exceeding any given threshold (y-axis) in the range. It is important also to assess the quantile-quantile sub-plots for SNPs in the heterozygosity-by-total LD grid elements (see S1 Appendix).

With the pdf in hand, various quantities can be calculated: the number of causal SNPs; the expected genetic effect (denoted δ below, where δ2 is the non-centrality parameter of a Chi-squared distribution) at the current sample size for a typed SNP given the SNP’s z-score and its full LD and heterozygosity structure; the estimated SNP heritability, hSNP2 (excluding contributions from rare reference SNPs, i.e., with MAF<0.2%); and the sample size required to explain any percentage of that with genome-wide significant SNPs. The model can easily be extended using a more complex distribution for the underlying β’s, with multiple-component mixtures for small and large effects, and incorporating selection pressure through both heterozygosity dependence on effect sizes and linkage disequilibrium dependence on the prior probability of a SNP’s being causal—issues we will address in future work.

The model: Probability distribution for z-scores

To establish notation, we consider a bi-allelic genetic variant, i, and let βi denote the effect size of allele substitution of that variant on a given quantitative trait. We assume a simple additive generative model (simple linear regression, ignoring covariates) relating genotype to phenotype [18, 21]. That is, assume a linear vector equation (no summation over repeated indices)

y=giβi+ei (1)

for phenotype vector y over N samples (mean-centered and normalized to unit variance), mean-centered genotype vector gi for the ith of n SNPs (vector over samples of the additively coded number of reference alleles for the ith variant), true fixed effect βi (regression coefficient) for the SNP, and residual vector ei containing the effects of all the other causal SNPs, the independent random environmental component, and random error. Variants with non-zero fixed effect βi are said to be “causal”. For SNP i, the estimated simple linear regression coefficient is

β^i=giTy/(giTgi)=cov(gi,y)/var(gi), (2)

where T denotes transpose and giTgi/N=var(gi)=Hi is the SNP’s heterozygosity (frequency of the heterozygous genotype): Hi = 2pi(1 − pi) where pi is the frequency of either of the SNP’s alleles.

Consistent with the work of others [11, 15], we assume the causal SNPs are distributed randomly throughout the genome (an assumption that can be relaxed when explicitly considering different SNP categories, but that in the main is consistent with the additive variation explained by a given part of the genome being proportional to the length of DNA [22]). In a Bayesian approach, we assume that the parameter β for a SNP has a distribution (in that specific sense, this is similar to a random effects model), representing subjective information on β, not a distribution across tangible populations [23]. Specifically, we posit a normal distribution for β with variance given by a constant, σβ2:

βN(0,σβ2). (3)

This is also how the β are distributed across the set of causal SNPs. Therefore, taking into account all SNPs (the remaining ones are all null by definition), this is equivalent to the two-component Gaussian mixture model we originally proposed [20]

βπ1N(0,σβ2)+(1-π1)N(0,0) (4)

where N(0,0) is the Dirac delta function, so that considering all SNPs, the net variance is var(β)=π1σβ2. If there is no LD (and assuming no source of spurious inflation), the association z-score for a SNP with heterozygosity H can be decomposed into a fixed effect δ and a residual random environment and error term, ϵN(0,1), which is assumed to be independent of δ [18]:

z=δ+ϵ (5)

with

δ=NHβ (6)

so that

var(z)=var(δ)+var(ϵ)σ2+1 (7)

where

σ2=σβ2NH. (8)

By construction, under null, i.e., when there is no genetic effect, δ = 0, so that var(ϵ) = 1.

If there is no source of variance distortion in the sample, but there is a source of bias in the summary statistics for a subset of markers (e.g., the sample is composed of two or more subpopulations with different allele frequencies for a subset of markers—pure population stratification in the sample [24]), the marginal distribution of an individual’s genotype at any of those markers will be inflated. The squared z-score for such a marker will then follow a non-central Chi-square distribution (with one degree of freedom); the non-centrality parameter will contain the causal genetic effect, if any, but biased up or down (confounding or loss of power, depending on the relative sign of the genetic effect and the SNP-specific bias term). The effect of bias shifts, arising for example due to stratification, is nontrivial, and currently not explicitly in our model; it is usually accounted for using standard methods [25].

Variance distortion in the distribution of z-scores can arise from cryptic relatedness in the sample (drawn from a population mixture with at least one subpopulation with identical-by-descent marker alleles, but no population stratification) [19]. If zu denotes the uninflated z-scores, then the inflated z-scores are

z=σ0zu, (9)

where σ0 ≥ 1 characterizes the inflation. Thus, from Eq 7, in the presence of inflation in the form of variance distortion

var(z)=σ02(σ2+1)σ˜2+σ02σ˜β2NH+σ02 (10)

where σ˜β2σ02σβ2, so that var(δ)=σ˜2σ˜β2NH and N(0,σ02). In the presence of variance distortion one is dealing with inflated random variables β˜N(0,σ˜β2), but we will drop the tilde on the β’s in what follows.

Since variance distortion leads to scaled z-scores [19], then, allowing for this effect in some of the extremely large data sets, we can assess the ability of the model to detect this inflation by artificially inflating the z-scores (Eq 9), and checking that the inflated σ^02 is estimated correctly while the other parameter estimates remain unchanged.

Implicit in Eq 8 is approximating the denominator, 1 − q2, of the χ2 statistic non-centrality parameter to be 1, where q2 is the proportion of phenotypic variance explained by the causal variant, i.e., qHβ. So a more correct δ is

δ=Nq/1-q2. (11)

Taylor expanding in q and then taking the variance gives

var(δ)=σβ2NH[1+(15/4)σβ4H2+O(σβ8H4)]. (12)

The additional terms will be vanishingly small and so do not contribute in a distributional sense; (quasi-) Mendelian or outlier genetic effects represent an extreme scenario where the model is not expected to be accurate, but SNPs for such traits are by definition easily detectable. So Eq 8 remains valid for the polygenicity of complex traits.

Now consider the effects of LD on z-scores. The simple linear regression coefficient estimate for typed SNP i, β^i, and hence the GWAS z-score, implicitly incorporates contributions due to LD with neighboring causal SNPs. (A typed SNP is a SNP with a z-score, imputed or otherwise; generally these will compose a smaller set than that available in reference panels like 1000 Genomes used here for calculating the LD structure of typed SNPs.) In Eq 1, ei = ∑ji gj βj + ε, where gj is the genotype vector for SNP j, βj is its true regression coefficient, and ε is the independent true environmental and error residual vector (over the N samples). Thus, explicitly including all causal true β’s, Eq 2 becomes

β^i=jgiTgjβjNHi+giTεgiTgiβi+εi (13)

(the sum over j now includes SNP i itself). This is the simple linear regression expansion of the estimated regression coefficient for SNP i in terms of the independent latent (true) causal effects and the latent environmental (plus error) component; βi is the effective simple linear regression expression for the true genetic effect of SNP i, with contributions from neighboring causal SNPs mediated by LD. Note that giTgj/N is simply cov(gi, gj), the covariance between genotypes for SNPs i and j. Since correlation is covariance normalized by the variances, βi in Eq 13 can be written as

βi=jHjHirijβj. (14)

where rij is the correlation between genotypes at reference SNP j and typed SNP i. Then, from Eq 5, the z-score for the typed SNP’s association with the phenotype is given by:

zi=NHiβi+ϵi=NjHjrijβj+ϵi. (15)

We noted that in the absence of LD, the distribution of the residual in Eq 5 is assumed to be univariate normal. But in the presence of LD (Eq 15) there are induced correlations. Letting ϵ denote the vector of residuals (with element ϵi for SNP i, i = 1…n), and M denote the (sparse) n × n LD-r2 matrix, then, ignoring inflation, ϵN(0,M) [26]. Since the genotypes of two unrelated individuals are marginally independent, this multivariate normal distribution for ϵ is contingent on the summary statistics for all SNPs being determined from the same set of individuals, which generally is overwhelmingly, if not in fact entirely, the case (in the extreme, with an independent set of individuals for each SNP, M would be reduced to the identity matrix). A limitation of the present work is that we do not consider this complexity. This may account for the relatively minor misfit in the simulation results for cases of high polygenicity—see below.

Thus, for example, if the SNP itself is not causal but is in LD with k causal SNPs that all have heterozygosity H, and where its LD with each of these is the same, given by some value r2 (0 < r2 ≤ 1), then σ˜2 in Eq 10 will be given by

σ˜2=kr2σ˜β2NH. (16)

For this idealized case, the marginal distribution, or pdf, of z-scores for a set of such associated SNPs is

f1(z;N,H,σβ,σ0)=ϕ(z;0,kr2σ˜β2NH+σ02) (17)

where ϕ(⋅;μ, σ2) is the normal distribution with mean μ and variance σ2, and H is shorthand for the LD and heterozygosity structure of such SNPs (in this case, denoting exactly k causal SNPs with LD given by r2 and heterozygosity given by H). If a proportion α of all typed SNPs are similarly associated with the phenotype while the remaining proportion are all null (not causal and not in LD with causal SNPs), then the marginal distribution for all SNP z-scores is the Gaussian mixture

f(z)=(1-α)ϕ(z;0,σ02)+αf1(z), (18)

dropping the parameters for convenience.

For real genotypes, however, the LD and heterozygosity structure is far more complicated, and of course the causal SNPs are generally numerous and unknown. Thus, more generally, for each typed SNP H will be a two-dimensional histogram over LD (r2) and heterozygosity (H), each grid element giving the number of SNPs falling within the edges of that (r2, H) bin. Alternatively, for each typed SNP it can be built as two one-dimensional histograms, one giving the LD structure (counts of neighboring SNPs in each LD r2 bin), and the other giving, for each r2 bin, the mean heterozygosity for those neighboring SNPs, which will be accurate for sufficiently fine binning—within a bin, the heterozygosities of the tagged referene SNPs wll be in a vary narrow range. We use the latter in what follows. We present two consistent ways of expressing the a posteriori pdf for z-scores, based on multinomial expansion and on convolution, that provide complementary views. The multinomial approach perhaps gives a more intuitive feel for the problem, but the convolution approach is considerably more tractable numerically and is used here to obtain all reporter results. All code used in the analyses, including simulations, is publicly available on GitHub [27].

Model PDF: Multinomial expansion

As in our previous work, we incorporate the model parameter π1 for the fraction of all SNPs that are causal [18]. Additionally, we calculate the actual LD and heterozygosity structure for each SNP. That is, for each SNP we build a histogram of the numbers of other SNPs in LD with it for w equally-spaced r2-windows between rmin2 and 1 where rmin2=0.05 (approximately the noise floor for correlation when LD is calculated from the 503 samples in 1000 Genomes), and record the mean heterozygosity for each bin; as noted above, we use H as shorthand to represent all this. We find that w ≃ 20 is sufficient for converged results. For any given SNP, the set of SNPs thus determined to be in LD with it constitute its LD block, with their number given by n (LD with self is always 1, so n is at least 1). The pdf for z-scores, given N,H, and the three model parameters π1, σβ, σ0, will then be given by the sum of Gaussians that are generalizations of Eq 17 for different combinations of numbers of causal SNPs among the w LD windows, each Gaussian scaled by the probability of the corresponding combination of causal SNPs among the LD windows, i.e., by the appropriate multinomial distribution term.

For w r2-windows, we must consider the possibilities where the typed SNP is in LD with all possible numbers of causal SNPs in each of these windows, or any combination thereof. There are thus w + 1 categories of SNPs: null SNPs (which r2-windows they are in is irrelevant), and causal SNPs, where it does matter which r2-windows they reside in. If window i has ni SNPs (i=1wni=n) and mean heterozygosity Hi, and the overall fraction of SNPs that are causal is π1, then the probability of having simultaneously k0 null SNPs, k1 causal SNPs in window 1, and so on through kw causal SNPs in window w, for a nominal total of K causal SNPs (i=1wki=K and k0 = nK), is given by the multinomial distribution, which we denote M(k0, …, kw; n0, …, nw; π1). For an LD block of n SNPs, the prior probability, pi, for a SNP to be causal and in window i is the product of the independent prior probabilities of a SNP being causal and being in window i: pi = π1 ni/n. The prior probability of being null (regardless of r2-window) is simply p0 = (1 − π1). The probability of a given breakdown k0, …, kw of the neighboring SNPs into the w + 1 categories is then given by

M(k0,,kw;n0,,nw;π1)=n!k0!kw!p0k0pwkw (19)

and the corresponding Gaussian is

ϕ(z;0,(k1H1r12++kwHwrw2)σ˜β2N+σ02). (20)

For a SNP with LD and heterozygosity structure H, the pdf for its z-score, given N and the model parameters, is then given by summing over all possible numbers of total causal SNPs in LD with the SNP, and all possible distributions of those causal SNPs among the w r2-windows:

pdf(z;N,H,π1,σβ,σ0)=K=0Kmaxk1,,kwn!k0!kw!p0k0pwkw×ϕ(z;0,(k1H1r12++kwHwrw2)σ˜β2N+σ02), (21)

where Kmax is bounded above by n. Note again that H is shorthand for the heterozygosity and linkage-disequilibrium structure of the SNP, giving the set {ni} (as well as {Hi}), and hence, for a given π1, pi. Also there is the constraint i=1wki=K on the second summation, and, for all i, max(ki) = max(K, ni), though generally Kmaxni. The number of ways of dividing K causal SNPs amongst w LD windows is given by the binomial coefficient (ab), where aK + w − 1 and bw − 1, so the number of terms in the second summation grows rapidly with K and w. However, because π1 is small (often ≤10−3), the upper bound on the first summation over total number of potential causal SNPs K in the LD block for the SNP can be limited to Kmax < min(20, n), even for large blocks with n ≃ 103. That is,

K=0Kmaxk1,,kwM(k0,,kw;n0,,nw;π1)1. (22)

Still, the number of terms is large; e.g., for K = 10 and w = 10 there are 92,378 terms.

For any given typed SNP (whose z-score we are trying to predict), it is important to emphasize that the specific LD r2 and the heterozygosity of each underlying causal (reference) SNP tagged by it need to be taken into account, at least in an approximate sense that can be controlled to allow for arbitrary finessing giving converged results. This is the purpose of our w = 20 LD-r2 windows, which inevitably leads to the multinomial expansion. Which window the causal SNP is in matters, leading to w+ 1 SNP categories, as noted above. Setting w = 1 would result in only a very rough approximation for the model pdf, reducing our multinomial to a binomial involving just two categories of SNPs: null and causal, with all causal SNPs treated the same, regardless of their LD with the tag SNP and their heterozygosity, as is done for the “M2” and “M3” models in [14]. The effects of this are demonstrated in the S1 Appendix (pp. S2-S3, and p. 14).

Model PDF: Convolution

From Eq 15, there exists an efficient procedure that allows for accurate calculation of a z-score’s a posteriori pdf (given the SNP’s heterozygosity and LD structure, and the phenotype’s model parameters). Any GWAS z-score is a sum of unobserved random variables (LD-mediated contributions from neighboring causal SNPs, and the additive environmental component), and the pdf for such a composite random variable is given by the convolution of the pdfs for the component random variables. Since convolution is associative, and the Fourier transform of the convolution of two functions is just the product of the individual Fourier transforms of the two functions, one can obtain the a posteriori pdf for z-scores as the inverse Fourier transform of the product of the Fourier transforms of the individual random variable components.

From Eq 15 z is a sum of correlation- and heterozygosity-weighted random variables {βj} and the random variable ϵ, where {βj} denotes the set of true causal parameters for each of the SNPs in LD with the typed SNP whose z-score is under consideration. The Fourier transform F(k) of a Gaussian f(x) = c × exp(−ax2) is F(k)=cπ/a×exp(-π2k2/a). From Eq 4, for each SNP j in LD with the typed SNP (1 ≤ jb, where b is the typed SNP’s block size),

NHjrjβjπ1N(0,NHjrj2σ˜β2)+(1-π1)N(0,0). (23)

The Fourier transform (with variable k—see below) of the first term on the right hand side is

F(k)=π1exp(-2π2k2NHjrj2σ˜β2), (24)

while that of the second term is simply (1 − π1). Additionally, the environmental term is ϵN(0,σ02) (ignoring LD-induced correlation, as noted earlier), and its Fourier transform is exp(-2π2σ02k2). For each typed SNP, one could construct the a posteriori pdf based on these Fourier transforms. However, it is more practical to use a coarse-grained representation of the data. Thus, in order to fit the model to a data set, we bin the typed SNPs whose z-scores comprise the data set into a two-dimensional heterozygosity/total LD grid (whose elements we denote “H-L” bins), and fit the model with respect to this coarse grid instead of with respect to every individual typed SNP z-score; in the section “Parameter Estimation” below we describe using a 10 × 10 grid. Additionally, for each H-L bin the LD r2 and heterozygosity histogram structure for each typed SNP is built, using wmax equally-spaced r2 bins for rmin2r21 (this is a change in notation from the previous section: wmax here plays the role of w there; in what follows, w will be used as a running index, 1 ≤ wwmax); wmax = 20 is large enough to allow for converged results; rmin2=0.05 is generally small enough to capture true causal associations in weak LD while large enough to exclude spurious contributions to the pdf arising from estimates of r2 that are non-zero due to noise. This points up a minor limitation of the model stemming from the small reference sample size (NR = 503 for 1000 Genomes) from which H is built. Larger NR would allow for more precision in handling very low LD (r2 < 0.05), but this is an issue only for situations with extremely large σβ2 (high heritability with low polygenicity) that we do not encounter for the 16 phenotypes we analyze here. In any case, this can be calibrated for using simulations.

We emphasize again that setting wmax = 1 would result in only an approximation for the model pdf (see “Relation to Other Work” in the S1 Appendix).

For any H-L bin with mean heterozygosity H and mean total LD L there will be an average LD and heterozygosity structure with a mean breakdown for the typed SNPs having nw reference SNPs (not all of which necessarily are typed SNPs, i.e., have a z-score) with LD r2 in the wth r2 bin whose average heterozygosity is Hw. Thus, one can re-express z-scores for an H-L bin as

z=Nw=1wmax(Hwrwj=0nwβj)+ϵ (25)

where βj and ϵ are unobserved random variables.

In the spirit of the discrete Fourier transform (DFT), discretize the set of possible z-scores into the ordered set of n (equal to a power of 2) values z1, …, zn with equal spacing between neighbors given by Δz (zn = −z1 − Δz, and zn/2+1 = 0). Taking z1 = −38 allows for the minimum p-values of 5.8 × 10−316 (near the numerical limit); with n = 210, Δz = 0.0742. Given Δz, the Nyquist critical frequency is fc=12Δz, so we consider the Fourier transform function for the z-score pdf at n discrete values k1, …, kn, with equal spacing between neighbors given by Δk, where k1 = −fc (kn = −k1 − Δk, and kn/2+ 1 = 0; the DFT pair Δz and Δk are related by ΔzΔk = 1/n). Define

Aw-2π2NHwrw2σ˜β2. (26)

(see Eq 24). Then the product (over r2 bins) of Fourier transforms for the genetic contribution to z-scores, denoted GjG(kj), is

G(kj)=w=1wmax(π1exp(Awkj2)+(1-π1))nw. (27)

Recall that H denotes the LD and heterozygosity structure of a particular SNP (or representative SNP in an average sense for an H-L grid element), a shorthand for the set of values {nw, Hw, Lw: w = 1, …, wmax} that characterize the SNP. Let M denote the set of model parameters. The Fourier transform of the environmental contribution, denoted EjE(kj), is

E(kj)=exp(-2π2σ02kj2). (28)

Let Fz = (G1 E1,…,GnEn) denote the vector of products of Fourier transform values, and let F-1 denote the inverse Fourier transform operator. Then for the SNP in question, the vector of pdf values, pdfz, for the uniformly discretized possible z-score outcomes z1, …, zn described above, i.e., pdfz = (f1,…,fn) where fipdf(zi|H,M,N), is

pdfz=F-1[Fz]. (29)

Thus, the ith element pdfzi=fi is the a posteriori probability of obtaining a z-score value zi for the SNP, given the SNP’s LD and heterozygosity structure, the model parameters, and the sample size.

Data preparation

For real phenotypes, we calculated SNP minor allele frequency (MAF) and LD between SNPs using the 1000 Genomes phase 3 data set for 503 subjects/samples of European ancestry [2830]. In order to carry out realistic simulations (i.e., with realistic heterozygosity and LD structures for SNPs), we used HAPGEN2 [3133] to generate genotypes; we calculated SNP MAF and LD structure from 1000 simulated samples. We elected to use the same intersecting set of SNPs for real data and simulation. For HAPGEN2, we eliminated SNPs with MAF<0.002; for 1000 Genomes, we eliminated SNPs for which the call rate (percentage of samples with useful data) was less than 90%. This left nsnp = 11,015,833 SNPs. See S1 Appendix for further details.

We analyzed summary statistics for sixteen phenotypes (S1 DataSourceList.; in what follows, where sample sizes varied by SNP, we quote the median value): (1) major depressive disorder (Ncases = 59,851, Ncontrols = 113,154) [34]; (2) bipolar disorder (Ncases = 20,352, Ncontrols = 31,358) [35]; (3) schizophrenia (Ncases = 35,476, Ncontrols = 46,839) [36]; (4) coronary artery disease (Ncases = 60,801, Ncontrols = 123,504) [37]; (5) ulcerative colitis (Ncases = 12,366, Ncontrols = 34,915) and (6) Crohn’s disease (Ncases = 12,194, Ncontrols = 34,915) [38]; (7) late onset Alzheimer’s disease (LOAD; Ncases = 17,008, Ncontrols = 37,154) [39] (in the S1 Appendix we present results for a more recent GWAS with Ncases = 71,880 and Ncontrols = 383,378 [40]); (8) amyotrophic lateral sclerosis (ALS) (Ncases = 12,577, Ncontrols = 23,475) [41]; (9) number of years of formal education (N = 293,723) [42]; (10) intelligence (N = 262,529) [43, 44]; (11) body mass index (N = 233,554) [45]; (12) height (N = 251,747) [46]; (13) putamen volume (normalized by intracranial volume, N = 11,598) [47]; (14) low- (N = 89,873) and (15) high-density lipoprotein (N = 94,295) [48]; and (16) total cholesterol (N = 94,579) [48]. Most participants were of European ancestry.

For height, we focused on the 2014 GWAS [46], not the more recent 2018 GWAS [49], although we also report below model results for the latter. There are issues pertaining to population structure in the various height GWAS [50, 51], and the 2018 GWAS is a combination of GIANT and UKB GWAS, so some caution is warranted in interpreting results for these data.

For the ALS GWAS data, there is very little signal outside chromosome 9: the data QQ plot essentially tracks the null distribution straight line. The QQ plot for chromosome 9, however, shows a significant departure from the null distribution. Of 471,607 SNPs on chromosome 9 a subset of 273,715 have z-scores, of which 107 are genome-wide significant, compared with 114 across the full genome. Therefore, we restrict ALS analysis to chromosome 9.

A limitation in the current work is that we have not taken account of imputation inaccuracy, where lower MAF SNPs are, through lower LD, less certain. Thus, the effects from lower MAF causal variants will be noisier than for higher MAF variants.

Simulations

We generated genotypes for 105 unrelated simulated samples using HAPGEN2 [33]. For narrow-sense heritability h2 equal to 0.1, 0.4, and 0.7, we considered polygenicity π1 equal to 10−5, 10−4, 10−3, and 10−2. For each of these 12 combinations, we randomly selected ncausal = π1 × nsnp “causal” SNPs and assigned them β-values drawn from the standard normal distribution (i.e., independent of H), with all other SNPs having β = 0. We repeated this ten times, giving ten independent instantiations of random vectors of β’s. Defining YG = , where G is the genotype matrix and β here is the vector of true coefficients over all SNPs, the total phenotype vector is constructed as Y = YG+ε, where the residual random vector ε for each instantiation is drawn from a normal distribution such that h2 = var(YG)/var(Y). For each of the instantiations this implicitly defines the “true” value σβ2.

The sample simple linear regression slope, β^, and the Pearson correlation coefficient, r^, are assumed to be t-distributed. These quantities have the same t-value: t=β^/se(β^)=r^/se(r^)=r^N-2/1-r^2, with corresponding p-value from Student’s t cumulative distribution function (cdf) with N − 2 degrees of freedom: p = 2 × tcdf(−|t|, N − 2) (see S1 Appendix). Since we are not here dealing with covariates, we calculated p from correlation, which is slightly faster than from estimating the regression coefficient. The t-value can be transformed to a z-value, giving the z-score for this p: z=-Φ-1(p/2)×sign(r^), where Φ is the normal cdf (z and t have the same p-value).

Parameter estimation

We randomly pruned SNPs using the threshold r2 > 0.8 to identify “synonymous” SNPs, performing ten such iterations. That is, for each of ten iterations, we randomly selected a SNP (not necessarily the one with largest z-score) to represent each subset of synonymous SNPs. For schizophrenia, for example, pruning resulted in approximately 1.3 million SNPs in each iteration.

The postulated pdf for a SNP’s z-score depends on the SNP’s LD and heterozygosity structure (histogram), H. Given the data–the set of z-scores for available SNPs, as well as their LD and heterozygosity structure—and the H-dependent pdf for z-scores, the objective is to find the model parameters that best predict the distribution of z-scores. We bin the SNPs with respect to a grid of heterozygosity and total LD; for any given H-L bin there will be a range of z-scores whose distribution the model it intended to predict. We find that a 10 × 10 grid of equally spaced bins is adequate for converged results. (Using equally-spaced bins might seem inefficient because of the resulting very uneven distribution of z-scores among grid elements—for example, orders of magnitude more SNPs in grid elements with low total LD compared with high total LD. However, the objective is to model the effects of H and L: using variable grid element sizes so as to maximize balance of SNP counts among grid elements means that the true H- and L-mediated effects of the SNPs in a narrow range of H and L get subsumed with the effects of many more SNPs in a much wider range of H and L—a misspecification of the pdf leading to some inaccuracy.) In lieu of or in addition to total LD (L) binning, one can bin SNPs with respect to their total LD block size (total number of SNPs in LD, ranging from 1 to ≃1,500).

To find the model parameters that best fit the data, for a given H-L bin we binned the selected SNPs z-scores into equally-spaced bins of width dz = 0.0742 (between zmin = −38 and zmax = 38, allowing for p-values near the numerical limit of 10−316), and from Eq 29 calculated the probability for z-scores to be in each of those z-score bins (the prior probability for “success” in each z-score bin). Then, knowing the actual numbers of z-scores (numbers of “successes”) in each z-score bin, we calculated the multinomial probability, pm, for this outcome. The optimal model parameter values will be those that maximize the accrual of this probability over all H-L bins. We constructed a cost function by calculating, for a given H-L bin, −ln(pm) and averaging over prunings, and then accumulating this over all H-L bins. Model parameters minimizing the cost were obtained from Nelder-Mead multidimensional unconstrained nonlinear minimization of the cost function, using the Matlab function fminsearch().

Posterior effect sizes

Model posterior effect sizes, given z (along with N, H, and the model parameters), were calculated using numerical integration over the random variable δ:

δexpectedE(δ|z)=P(δ|z)δdδ=1P(z)P(z|δ)P(δ)δdδ. (30)

Here, since z|δN(δ,σ02), the posterior probability of z given δ is simply

P(z|δ)=ϕ(z;δ,σ02). (31)

P(z) is shorthand for pdf(z|N,H,π1,σβ,σ0), given by Eq 29. P(δ) is calculated by a similar procedure that lead to Eq 29 but ignoring the environmental contributions {Ej}. Specifically, let Fδ = (G1,…,Gn) denote the vector of products of Fourier transform values. Then, the vector of pdf values for genetic effect bins (indexed by i; numerically, these will be the same as the z-score bins) in the H-L bin, pdfδ = (f1,…,fn) where fipdf(δi|H), is

pdfδ=F-1[Fδ]. (32)

Similarly,

δexpected2E(δ2|z)=P(δ|z)δ2dδ=1P(z)P(z|δ)P(δ)δ2dδ, (33)

which is used in power calculations.

GWAS replication

A related matter has to do with whether z-scores for SNPs reaching genome-wide significance in a discovery-sample are compatible with the SNPs’ z-scores in a replication-sample, particularly if any of those replication-sample z-scores are far from reaching genome-wide significance, or whether any apparent mismatch signifies some overlooked inconsistency. The model pdf allows one to make a principled statistical assessment in such cases. We present the details for this application, and results applied to studies of bipolar disorder, in the S1 Appendix (pp. S7-S11).

GWAS power

Chip heritability, hSNP2, is the proportion of phenotypic variance that in principle can be captured additively by the nsnp SNPs under study [17]. It is of interest to estimate the proportion of hSNP2 that can be explained by SNPs reaching genome-wide significance, p≤5 × 10−8 (i.e., for which |z|>zt = 5.45), at a given sample size [52, 53]. In Eq 1, for SNP i with genotype vector gi over N samples, let ygigiβi. If the SNP’s heterozygosity is Hi, then var(ygi)=βi2Hi. If we knew the full set {βi} of true β-values, then, for z-scores from a particular sample size N, the proportion of SNP heritability captured by genome-wide significant SNPs, A(N), would be given by

A(N)=i:|zi|>ztβi2Hialliβi2Hi. (34)

Now, from Eq 15, δi=NjHjrijβj. If SNP i is causal and sufficiently isolated so that it is not in LD with other causal SNPs, then δi=NHiβi, and var(ygi) = δi2/N. When all causal SNPs are similarly isolated, Eq 34 becomes

A(N)=i:|zi|>ztδi2alliδi2. (35)

Of course, the true βi are not known and some causal SNPs will likely be in LD with others. Furthermore, due to LD with causal SNPs, many SNPs will have a nonzero (latent or unobserved) effect size, δ. Nevertheless, we can formulate an approximation to A(N) which, assuming the pdf for z-scores (Eq 29) is reasonable, will be inaccurate to the degree that the average LD structure of genome-wide significant SNPs differs from the overall average LD structure. As before (see the subsection “Model PDF: Convolution”), consider a fixed set of n equally-spaced nominal z-scores covering a wide range of possible values (changing from the summations in Eq 35 to the uniform summation spacing Δz now requires bringing the probability density into the summations). For each z from the fixed set (and, as before, employing data reduction by averaging so that H and L denote values for the 10 × 10 grid), use E(δ2|z, N, H, L) given in Eq 33 to define

C(z|N,H,L)E(δ2|z,N,H,L)P(z|N,H,L) (36)

(emphasizing dependence on N, H, and L). Then, for any N, A(N) can be estimated by

A(N)=H,Lz:|z|>ztC(z,N,H,L)H,LallzC(z,N,H,L) (37)

where ∑H,L denotes sum over the H-L grid elements. The ratio in Eq 37 should be accurate if the average effects of LD in the numerator and denominator cancel—which will always be true as the ratio approaches 1 for large N. Plotting A(N) gives an indication of the power of future GWAS to capture chip heritability.

Quantile-quantile plots and genomic control

One of the advantages of quantile-quantile (QQ) plots is that on a logarithmic scale they emphasize behavior in the tails of a distribution, and provide a valuable visual aid in assessing the independent effects of polygenicity, strength of association, and variance distortion—the roles played by the three model parameters–as well as showing how well a model fits data. QQ plots for the model were constructed using Eq 29, replacing the normal pdf with the normal cdf, and replacing z with an equally-spaced vector znom of length 10,000 covering a wide range of nominal |z| values (0 through 38). SNPs were divided into a 10 × 10 grid of H × L bins, and the cdf vector (with elements corresponding to the z-values in znom) accumulated for each such bin (using mean values of H and L for SNPs in a given bin).

For a given set of samples and SNPs, the genomic control factor, λ, for the z-scores is defined as the median z2 divided by the median for the null distribution, 0.455 [19]. This can also be calculated from the QQ plot. In the plots we present here, the abscissa gives the -log10 of the proportion, q, of SNPs whose z-scores exceed the two-tailed significance threshold p, transformed in the ordinate as -log10(p). The median is at qmed = 0.5, or −log10(qmed) ≃ 0.3; the corresponding empirical and model p-value thresholds (pmed) for the z-scores—and equivalently for the z-scores-squared—can be read off from the plots. The genomic inflation factor is then given by

λ=[Φ-1(pmed/2)]2/0.455.

Note that the values of λ reported here are for pruned SNP sets; these values will be lower than for the total GWAS SNP sets.

Knowing the total number, ntot, of p-values involved in a QQ plot (number of GWAS z-scores from pruned SNPs), any point (q, p) (log-transformed) on the plot gives the number, np = qntot, of p-values that are as extreme as or more extreme than the chosen p-value. This can be thought of as np “successes” out of ntot independent trials (thus ignoring LD) from a binomial distribution with prior probability q. To approximate the effects of LD, we estimate the number of independent SNPs as ntot/f where f ≃ 10. The 95% binomial confidence interval for q is calculated as the exact Clopper-Pearson 95% interval [54], which is similar to the normal approximation interval, q±1.96q(1-q)/ntot/f.

Number of causal SNPs

The estimated number of causal SNPs is given by the polygenicity, π1, times the total number of SNPs, nsnp: ncausal = π1 nsnp. nsnp is given by the total number of SNPs that went into building the heterozygosity/LD structure, H in Eq 29, i.e., the approximately 11 million SNPs selected from the 1000 Genomes Phase 3 reference panel, not the number of typed SNPs in the particular GWAS. The parameters estimated are to be seen in the context of the reference panel, which we assume contains all common causal variants. Stable quantities (i.e., fairly independent of the reference panel size. e.g., using the full panel or ignoring every second SNP), are the estimated effect size variance and number of causal variants—which we demonstrate below—and hence the heritability. Thus, the polygenicity will scale inversely with the reference panel size. A reference panel with a substantially larger number of samples would allow for inclusion of more SNPs (non-zero MAF), and thus the actual polygenicity estimated would change slightly.

Narrow-sense chip heritability

Since we are treating the β coefficients as fixed effects in the simple linear regression GWAS formalism, with the phenotype vector standardized with mean zero and unit variance, from Eq 1 the proportion of phenotypic variance explained by a particular causal SNP whose reference panel genotype vector is g, q2 = var(y;g), is given by q2 = β2 H. The proportion of phenotypic variance explained additively by all causal SNPs is, by definition, the narrow sense chip heritability, h2. Since E(β2)=σβ2 and ncausal = π1 nsnp, and taking the mean heterozygosity over causal SNPs to be approximately equal to the mean over all SNPs, H¯, the chip heritability can be estimated as

h2=π1nsnpH¯σβ2. (38)

Mean heterozygosity from the ≃11 million SNPs is H¯=0.2165.

For all-or-none traits like disease status, the estimated h2 from Eq 38 for an ascertained case-control study is on the observed scale and is a function of the prevalence in the adult population, K, and the proportion of cases in the study, P. The heritability on the underlying continuous liability scale [55], hl2, is obtained by adjusting for ascertainment (multiplying by K(1 − K)/(P(1 − P)), the ratio of phenotypic variances in the population and in the study) and rescaling based on prevalence [6, 56]:

hl2=h2K(1-K)P(1-P)×K(1-K)a2, (39)

where a is the height of the standard normal pdf at the truncation point zK defined such that the area under the curve in the region to the right of zK is K.

Confidence intervals

Confidence intervals for parameters were estimated using the inverse of the observed Fisher information matrix (FIM). The full FIM was estimated for all three parameters used in the model. For the derived quantity h2, which depends on all parameters, the covariances among the parameters, given by the off-diagonal elements of the inverse of the FIM, were incorporated. Numerical values are in S1 Appendix (p. S12).

Results

Simulations

Table 1 shows the simulation results, comparing true and estimated values for the model parameters, heritability, and the number of causal SNPs, for twelve scenarios where π1 and σβ2 both range over three orders of magnitude, encompassing the range of values for the phenotypes; in S1 Appendix (p. S15) are QQ plots for a randomly chosen (out of 10) β-vector and phenotype instantiation for each of the twelve (π1, h2) scenarios. Most of the π^1 estimates are in very good agreement with the true values, though for the extreme scenario of high heritability and low polygenicity it is overestimated by factors of two-to-three. The numbers of estimated causal SNPs (out of ≃11 million) are in correspondingly good agreement with the true values, ranging in increasing powers of 10 from 110 through 110,158. The estimated discoverabilities (σ^β2) are also in good agreement with the true values. In most cases, σ^02 is close to 1, indicating little or no global inflation, though it is elevated for high heritability with high polygenicity, suggesting it is capturing some ubiquitous effects.

Table 1. Simulation results: Comparison of mean (std) true and estimated (^) model parameters and derived quantities.

Results for each line, for specified heritability h2 and fraction π1 of causal SNPs, are from 10 independent instantiations with random selection of the ncausal causal SNPs that are assigned a β-value from the standard normal distribution. Defining Yg = , where G is the genotype matrix, the total phenotype vector is constructed as Y = Yg+ε, where the residual random vector ε for each instantiation is drawn from a normal distribution such that var(Y) = var(Yg)/h2 for predefined h2. For each of the instantiations, i, this implicitly defines the true value σβi2, and σβ2 is their mean. An example QQ plot for each line entry is shown in in S1 Appendix (p. S15).

h2 h^2 π1 π^1 σβ2 σ^β2 σ^02 ncausal n^causal
0.1 0.12 (0.01) 1E-5 1.4E-5 (2E-6) 4.3E-3 (7E-4) 3.6E-3 (5E-4) 1.01 (0.002) 110 151 (20)
0.1 0.10 (0.01) 1E-4 1.0E-4 (2E-5) 4.2E-4 (2E-5) 4.1E-4 (5E-5) 1.01 (0.003) 1101 1130 (206)
0.1 0.09 (0.01) 1E-3 0.9E-3 (1E-4) 4.2E-5 (5E-7) 4.1E-5 (4E-6) 1.02 (0.003) 11015 10340 (1484)
0.1 0.09 (0.01) 1E-2 0.8E-2 (2E-3) 4.2E-6 (4E-8) 5.6E-6 (2E-6) 1.02 (0.002) 110158 83411 (25448)
0.4 0.52 (0.05) 1E-5 2.3E-5 (2E-6) 1.7E-2 (3E-3) 9.1E-3 (1E-3) 1.02 (0.002) 110 259 (20)
0.4 0.45 (0.02) 1E-4 1.2E-4 (8E-6) 1.7E-3 (7E-5) 1.5E-3 (9E-5) 1.04 (0.002) 1101 1310 (92)
0.4 0.39 (0.01) 1E-3 1.0E-3 (5E-5) 1.7E-4 (2E-6) 1.6E-4 (8E-6) 1.05 (0.003) 11015 10607 (578)
0.4 0.37 (0.01) 1E-2 0.9E-2 (1E-3) 1.7E-5 (2E-7) 1.7E-5 (2E-6) 1.06 (0.003) 110158 95135 (10851)
0.7 0.91 (0.09) 1E-5 2.9E-5 (2E-6) 3.0E-2 (5E-3) 1.3E-2 (2E-3) 1.02 (0.003) 110 324 (24)
0.7 0.82 (0.02) 1E-4 1.4E-4 (7E-6) 2.9E-3 (1E-4) 2.4E-3 (1E-4) 1.05 (0.002) 1101 1493 (79)
0.7 0.70 (0.01) 1E-3 1.0E-3 (4E-5) 2.9E-4 (4E-6) 2.8E-4 (1E-5) 1.08 (0.003) 11015 10866 (406)
0.7 0.66 (0.01) 1E-2 0.9E-2 (7E-4) 2.9E-5 (3E-7) 2.9E-5 (2E-6) 1.09 (0.003) 110158 95067 (8191)

In the S1 Appendix (pp. S5-S7) we examine the issue of model misspecification. Specifically, we assign causal effects β drawn from a Gaussian whose variance is not simply a constant but depends on heterozygosity, such that rarer causal SNPs will tend to have larger effects [15]. The results—see S1 Appendix (p. S5)–show that the model still makes reasonable estimates of the underlying genetic architecture. Additionally, we tested the scenario where true causal effects are distributed with respect to two Gaussians [14], a situation that allows for a small number of the causal SNPs to have quite large effects—see S1 Appendix (p. S6). We find that heritabilities are still reasonably estimated using our model. In all these scenarios the overall data QQ plots were accurately reproduced by the model. As a counter example, we simulated summary statistics where the prior probability of a reference SNP being causal decreased linearly with total LD (see S1 Appendix (p. S7)). In this case, our single Gaussian fit (which assumes no LD dependence on the prior probability of a reference SNP being causal) did not produce model QQ plots that accurately tracked the data QQ plots (see S1 Appendix (p. S19)). The model parameters and heritabilities were also poor. But this scenario is highly artificial; in contrast, in situations where the data QQ plots were accurately reproduced by the model, the estimated model parameters and heritability were plausible.

Phenotypes

Figs 1 and 2 show QQ plots for the pruned z-scores for eight qualitative and eight quantitative phenotypes, along with model estimates (S1 Appendix (pp. S20-S38) show a 4 × 4 grid breakdown with respect to heterozygosity × total-LD of QQ plots for all phenotypes studied here; the 4 × 4 grid is a subset of the 10 × 10 grid used in the calculations). In all cases, the model fit (yellow) closely tracks the data (dark blue). For the sixteen phenotypes, estimates for the model polygenicity parameter (fraction of reference panel, with ≃11 million SNPs, estimated to have non-null effects) range over two orders of magnitude, from π1 ≃ 2 × 10−5 to π1 ≃ 4 × 10−3. The estimated SNP discoverability parameter (variance of β, or expected β2, for causal variants) also ranges over two orders of magnitude from σβ27×10-6 to σβ22×10-3 (in units where the variance of the phenotype is normalized to 1).

Fig 1. QQ plots of (pruned) z-scores for qualitative phenotypes (dark blue, 95% confidence interval in light blue) with model prediction (yellow): (A) major depressive disorder; (B) bipolar disorder; (C) schizophrenia; (D) coronary artery disease (CAD); (E) ulcerative colitis (UC); (F) Crohn’s disease (CD); (G) late onset Alzheimer’s disease (AD), excluding APOE (see also S1 Appendix (p. S17)); and (H) amyotrophic lateral sclerosis (ALS), restricted to chromosome 9 (see also S1 Appendix (p. S18)).

Fig 1

The dashed line is the expected QQ plot under null (no SNPs associated with the phenotype). p is a nominal p-value for z-scores, and q is the proportion of z-scores with p-values exceeding that threshold. λ is the overall nominal genomic control factor for the pruned data (which is accurately predicted by the model in all cases). The three estimated model parameters are: polygenicity, π^1; discoverability, σ^β2 (corrected for inflation); and SNP association χ2-statistic inflation factor, σ^02. h^2 is the estimated narrow-sense chip heritability, re-expressed as hl2 on the liability scale for these case-control conditions assuming a prevalence of: MDD 7.1% [57], BIP 0.5% [58], SCZ 1% [59], CAD 3% [60], UC 0.1% [61], CD 0.1% [61], AD 14% (for people aged 71 and older in the USA [62, 63]), and ALS 5 × 10−5 [64]. The estimated number of causal SNPs is given by n^causal=π^1nsnp where nsnp = 11, 015, 833 is the total number of SNPs, whose LD structure and MAF underlie the model; the GWAS z-scores are for subsets of these SNPs. Neff is the effective case-control sample size–see text. Reading the plots: on the vertical axis, choose a p-value threshold (more extreme values are further from the origin), then the horizontal axis gives the proportion of SNPs exceeding that threshold (higher proportions are closer to the origin). Numerical values for the model parameters are also given in Table 2. See also S1 Appendix (pp. S20-S28).

Fig 2. this area for notes QQ plots of (pruned) z-scores and model fits for quantitative phenotypes: (A) educational attainment; (B) intelligence; (C) body mass index (BMI); (D) height; (E) putamen volume; (F) low-density lipoprotein (LDL); (G) high-density lipoprotein (HDL); and (H) total cholesterol (TC).

Fig 2

N is the sample size. See Fig 1 for further description. Numerical values for the model parameters are also given in Table 2. See also S1 Appendix (pp. S29-S38).

We find that schizophrenia and bipolar disorder appear to be similarly highly polygenic, with model polygenicities ≃ 2.84 × 10−3 and ≃ 2.70 × 10−3, respectively. The model polygenicity of major depressive disorder, however, is 40% higher, π1 ≃ 4 × 10−3—the highest value among the sixteen phenotypes. In contrast, the model polygenicities of late onset Alzheimer’s disease and Crohn’s disease are almost thirty times smaller than that of schizophrenia.

In S1 Appendix (p. S17) we show results for Alzheimer’s disease exclusively for chromosome 19 (which contains APOE), and for all autosomal chromosomes excluding chromosome 19. We also show results with the same chromosomal breakdown for a recent GWAS involving 455,258 samples that included 24,087 clinically diagnosed LOAD cases and 47,793 AD-by-proxy cases (individuals who were not clinically diagnosed with LOAD but for whom at least one parent had LOAD) [65]. These GWAS give consistent estimates of polygenicity: π1 ∼ 1 × 10−4 excluding chromosome 19, and π1 ∼ 6 × 10−5 for chromosome 19 exclusively.

Of the quantitative traits, educational attainment has the highest model polygenicity, π1 = 3.2 × 10−3, similar to intelligence, π1 = 2.2 × 10−3. Approximately two orders of magnitude lower in polygenicity are the endophenotypes putamen volume and low- and high-density lipoproteins.

The model effective SNP discoverability for schizophrenia is σ^β2=5.51×10-5, similar to that for bipolar disorder. Major depressive disorder, which has the highest polygenicity, has the lowest SNP discoverability, approximately one-eighth that of schizophrenia; it is this low value, combined with high polygenicity that leads to the weak signal in Fig 1 (A) even though the sample size is relatively large. In contrast, SNP discoverability for Alzheimer’s disease is almost four times that of schizophrenia. The inflammatory bowel diseases, however, have much higher SNP discoverabilities, 16 and 31 times that of schizophrenia respectively for ulcerative colitis and Crohn’s disease—the latter having the second highest value of the sixteen phenotypes: σ^β2=1.7×10-3.

Additionally, for Alzheimer’s disease we show in the S1 Appendix (p. S17) that the discoverability is two orders of magnitude greater for chromosome 19 than for the remainder of the autosome. Note that since two-thirds of the 2018 “cases” are AD-by-proxy, the discoverabilities for the 2018 data are, as expected, reduced relative to the values for the 2013 data (approximately 3.5 times smaller).

The narrow sense SNP heritability from the ascertained case-control schizophrenia GWAS is estimated as h2 = 0.37. Taking adult population prevalence of schizophrenia to be K = 0.01 [66, 67] (but see also [68], for K = 0.005), and given that there are 51,900 cases and 71,675 controls in the study, so that the proportion of cases in the study is P = 0.42, the heritability on the liability scale for schizophrenia from Eq 39 is h^l2=0.21. For bipolar disorder, with K = 0.005 [58], 20,352 cases and 31,358 controls, h^l2=0.16. Major depressive disorder appears to have a much lower model-estimated SNP heritability than schizophrenia: h^l2 = 0.07. The model estimate of SNP heritability for height is 17%, lower than the oft-reported value ≃50% (see Discussion). However, despite the huge differences in sample size, we find the same value, 17%, for the 2010 GWAS (N = 133,735 [69]), and 19% for the 2018 GWAS (N = 707,868 [46, 49])—see Table 2.

Table 2. Summary of model results for phenotypes shown in Figs 1 and 2.

The subscript in h(l)2 indicates that for the qualitative phenotypes (the first eight) the reported SNP heritability is on the liability scale. MDD: Major Depressive Disorder; CAD: coronary artery disease; AD: Alzheimer’s Disease (excluding APOE locus; *for the full autosomal reference panel, i.e., including APOE, hl2=0.15 for AD—see S1 Appendix (p. S17)); BMI: body mass index; ALS: amyotrophic lateral sclerosis, restricted to chromosome 9; LDL: low-density lipoproteins; HDL: high-density lipoproteins. $In addition to the 2014 height GWAS (N = 251,747 [46]), we include here model results for the 2010 (N = 133,735 [69]) and 2018 (N = 707,868 [49]) height GWAS; there is remarkable consistency for the 2010 and 2014 GWAS despite very large differences in the sample sizes—see S1 Appendix (p. S17). Confidence intervals are in S1 Appendix (p. S12).

Phenotype π1 σβ2 σ02 ncausal h(l)2
MDD 4.01E-3 7.20E-6 1.06 4.4E4 0.07
Bipolar Disorder 2.70E-3 5.25E-5 1.05 3.0E4 0.16
Schizophrenia 2.84E-3 5.51E-5 1.14 3.1E4 0.21
CAD 1.14E-4 1.47E-4 0.97 1.3E3 0.03
Ulcerative Colitis 1.26E-4 8.82E-4 1.14 1.4E3 0.11
Crohn’s Disease 9.56E-5 1.70E-3 1.17 1.1E3 0.18
AD (no APOE)* 1.11E-4 2.22E-4 1.05 1.2E3 0.08
ALS 1.43E-5 3.04E-3 1.02 7 0.00
Education 3.20E-3 1.57E-5 1.00 3.5E4 0.12
Intelligence 2.20E-3 2.32E-5 1.28 2.4E4 0.13
BMI 6.44E-4 4.28E-5 0.88 7.5E3 0.07
Height (2010)$ 4.32E-4 1.66E-4 0.94 4.8E3 0.17
Height (2014) 5.66E-4 1.23E-4 1.66 6.2E3 0.17
Height (2018)$ 8.56E-4 9.46E-5 2.50 9.4E3 0.19
Putamen Volume 4.94E-5 9.72E-4 1.00 540 0.11
LDL 3.58E-5 6.61E-4 0.96 390 0.06
HDL 2.37E-5 1.25E-3 0.97 260 0.07
TC 4.26E-5 8.99E-4 0.96 469 0.09

Fig 3 shows the sample size required so that a given proportion of chip heritability is captured by genome-wide significant SNPs for the phenotypes (assuming equal numbers of cases and controls for the qualitative phenotypes: Neff = 4/(1/Ncases + 1/Ncontrols), so that when Ncases = Ncontrols, Neff = Ncases + Ncontrols = N, the total sample size, allowing for a straightforward comparison with quantitative traits). At current sample sizes, only 4% of narrow-sense chip heritability is captured for schizophrenia and only 1% for bipolar disorder; using current methodologies, a sample size of Neff ∼ 1 million would be required to capture the preponderance of SNP heritability for these phenotypes. Major depressive disorder GWAS currently is greatly under-powered, as shown in Fig 3(A). For education, we predict that 3.5% of phenotypic variance would be explained at N = 1.1 million, in good agreement with the value found from direct computation of 3.2% [70]. For other phenotypes, the proportions of total SNP heritability captured at the available sample sizes are given in Fig 3.

Fig 3. Proportion of narrow-sense chip heritability, A(N) (Eq 37), captured by genome-wide significant SNPs as a function of sample size, N, for phenotypes shown in Figs 1 and Fig 2.

Fig 3

Values for current sample sizes are shown in parentheses. Left-to-right curve order is determined by decreasing σβ2. The prediction for education at sample size N = 1.1 million is A(N) = 0.27, so that the proportion of phenotypic variance explained is predicted to be 3.5%, in good agreement with 3.2% reported in [70]. (The curve for AD excludes the APOE locus. For HDL, see S1 Appendix (p. S9) for additional notes).

The sample size for ALS was quite low, and we restricted the analysis to chromosome 9, which had most of the genome-wide significant typed SNPs; we estimate that there are ≃7 causal SNPs with high discoverability on chromosome 9 [71, 72], with very high discoverability, σβ2 ≃ 0.003. In contrast, for AD restricted to chromosome 19, there were an estimated 14 causal SNPs with discoverability σβ2 ≃ 0.02 (see the S1 Appendix (p. S17)).

In this study, we assume that population stratification in the raw data has been corrected for in the publicly-available summary statistics. However, given that some of the sample sizes are extremely large, we allow for the possibility of residual cryptic relatedness. This would result in a scaling of the z-scores, Eq 9 [19]. Thus, to test the modeling of inflation due to cryptic relatedness, we scaled the simulation z-scores as described earlier (z = σ0 zu with σ0 > 1, where zu are the original z-scores, i.e., not artificially inflated) and reran the model. E.g., for education and schizophrenia we inflated the z-scores by a factor of 1.2. For schizophrenia we found σ02=1.366, which is almost exactly as predicted (1.14 × 1.2 = 1.368), while the polygenicity and discoverability parameters are essentially unchanged: π1 = 2.81 × 10−3, and σβ2=5.56×10-5. For education we found σ02=1.206, which again is almost exactly as predicted (1.0 × 1.2 = 1.2), while the polygenicity and discoverability parameters are again essentially unchanged: π1 = 3.19 × 10−3, and σβ2=1.57×10-5.

A comparison of our results with those of [14] and [15] is in the S1 Appendix (p. S13). Critical methodological differences with model M2 in [14] are that we use a full reference panel of 11 million SNPs from 1000 Genomes Phase 3, we allow for the possibility of inflation in the data, and we provide an exact solution, based on Fourier Transforms, for the z-score pdf arising from the posited distribution of causal effects, resulting in better fits of the model and the data QQ plots—as can be seen by comparing our QQ plots with those reported in S1 Appendix (p. S13). Although our estimated number of causal are often within a factor of two of those from the nominally equivalent model M2 of Zhang et al, there is no clear pattern to the mismatch.

GWAS replication

In the S1 Appendix (pp. S7-S11) we provide an extensive example of testing the compatibility of summary statistics from two large bipolar disorder GWASs. Because z-scores are so noisy, it is possible for a typed SNP with a highly significant p-values in one GWAS to completely fail to reach significance in a subsequent GWAS, and for these outcomes to be statistically consistent. SNP heterozygosity and total LD, as well as sample sizes, are relevant in making such assessments.

Dependence on reference panel

Given a liberal MAF threshold of 0.002, our reference panel should contain the vast majority of common SNPs for European ancestry. However, it does not include other structural variants (such as small insertions/deletions, or haplotype blocks) which may also be causal for phenotypes. To validate our parameter estimates for an incomplete reference, we reran our model on real phenotypes using the culled reference where we exclude every other SNP. The result is that all estimated parameters are as before except that π^1 doubles, leaving the estimatde number of causal SNPs and heritability as before. For example, for schizophrenia we get π1 = 5.3 × 10−3 and σβ2=5.8×10-5 for the reduced reference panel, versus π1 = 2.8 × 10−3 and σβ2=5.5×10-5 for the full panel, with heritability remaining essentially the same (37% on the observed scale).

Discussion

Here we present a unified method based on GWAS summary statistics, incorporating detailed LD structure from an underlying reference panel of SNPs with MAF>0.002, for estimating: phenotypic polygenicity, π1, expressed as the fraction of the reference panel SNPs that have a non-null true β value, i.e., are “causal”; and SNP discoverability or mean strength of association (the variance of the underlying causal effects), σβ2. In addition the model can be used to estimate residual inflation of the association statistics due to variance distortion induced by cryptic relatedness, σ02. The model assumes that there is very little, if any, inflation in the GWAS summary statistics due to population stratification (bias shift in z-scores due to ethnic variation).

We apply the model to sixteen diverse phenotypes, eight qualitative and eight quantitative. From the estimated model parameters we also estimate the number of causal common-SNPs in the underlying reference panel, ncausal, and the narrow-sense common-SNP heritability, h2 (for qualitative phenotypes, we re-express this as the proportion of population variance in disease liability, hl2, under a liability threshold model, adjusted for ascertainment); in the event rare SNPs (i.e., not in the reference panel) are causal, h2 will be an underestimate of the true SNP heritability. In addition, we estimate the proportion of SNP heritability captured by genome-wide significant SNPs at current sample sizes, and predict future sample sizes needed to explain the preponderance of SNP heritability.

We find that schizophrenia is highly polygenic, with π1 = 2.8 × 10−3. This leads to an estimate of ncausal ≃ 31, 000, which is in reasonable agreement with a recent estimate that the number of causal SNPs is >20,000 [73]. The SNP associations, however, are characterized by a narrow distribution, σβ2=6.27×10-5, indicating that most associations are of weak effect, i.e., have low discoverability. Bipolar disorder has similar parameters. The smaller sample size for bipolar disorder has led to fewer SNP discoveries compared with schizophrenia. However, from Fig 3, sample sizes for bipolar disorder are approaching a range where rapid increase in discoveries becomes possible. For educational attainment [42, 74, 75], the polygenicity is somewhat greater, π1 = 3.2 × 10−3, leading to an estimate of ncausal ≃35, 000, half a recent estimate, ≃ 70, 000, for the number of loci contributing to heritability [74]. The variance of the distribution for causal effect sizes is a quarter that of schizophrenia, indicating lower discoverability. Intelligence, a related phenotype [43, 76], has a larger discoverability than education while having lower polygenicity (∼ 10, 000 fewer causal SNPs).

In marked contrast are the lipoproteins and putamen volume which have very low polygenicity: π1 < 5 × 10−5, so that only 250 to 550 SNPs (out of ≃11 million) are estimated to be causal. However, causal SNPs for putamen volume and HDL appear to be characterized by relatively high discoverability, respectively 17-times and 23-times larger than for schizophrenia (see S1 Appendix (p. S9) for additional notes on HDL, and [77] for a relevant comparison with our work).

The QQ plots (which are sample size dependent) reflect these differences in genetic architecture. For example, the early departure of the schizophrenia QQ plot from the null line indicates its high polygenicity, while the steep rise for putamen volume after its departure corresponds to its high SNP discoverability.

For Alzheimer’s disease, our estimate of the liability-scale SNP heritability for the full 2013 dataset [39] is 15% for prevalence of 14% for those aged 71 older, half from APOE, while the recent “M2” and “M3” models of Zhang et al [14] gave values of 7% and 10% respectively–see S1 Appendix (p. S13). A recent report from two methods, LD Score Regression (LDSC) and SumHer [77], estimated SNP heritability of 3% for LDSC and 12% for SumHer (assuming prevalence of 7.5%). A raw genotype-based analysis (GCTA), including genes that contain rare variants that affect risk for AD, reported SNP heritability of 53% [7, 78]; an earlier related study that did not include rare variants and had only a quarter of the common variants estimated SNP heritability of 33% for prevalence of 13% [79]. GCTA calculations of heritability are within the domain of the so-called infinitesimal model where all markers are assumed to be causal. Our model suggests, however, that phenotypes are characterized by polygenicities less than 5 × 10−3; for AD the polygenicity is ≃ 10−4. Nevertheless, the GCTA approach yields a heritability estimate closer to the twin-based (broad sense) value, estimated to be in the range 60-80% [80]. The methodology appears to be robust to many assumptions about the distribution of effect sizes [81, 82]; the SNP heritability estimate is unbiased, though it has larger standard error than methods that allow for only sparse causal effects [69, 83]. For the 2013 data analyzed here [39], a summary-statistics-based method applied to a subset of 54,162 of the 74,046 samples gave SNP heritability of almost 7% on the observed scale [12, 84]; our estimate is 12% on the observed scale—see S1 Appendix (p. S17).

Onset and clinical progression of sporadic Alzheimer’s disease is strongly age-related [85, 86], with prevalence in differential age groups increasing at least up through the early 90s [62]. Thus, it would be more accurate to assess heritability (and its components, polygenicity and discoverability) with respect to, say, five-year age groups beginning with age 65 years, and using a consistent control group of nonagenarians and centenarians. By the same token, comparisons among current and past AD GWAS are complicated because of potential differences in the age distributions of the respective case and the control cohorts. Additionally, the degree to which rare variants are included will affect heritability estimates. The summary-statistic-based estimates of polygenicity that we report here are, however, likely to be robust for common SNPs: π1 ≃ 1.1 × 10−4, with only a few causal SNPs on chromosome 19.

Our point estimate for the liability-scale SNP heritability of schizophrenia is hl2=0.21 (assuming a population risk of 0.01), and that 4% of this (i.e., 1% of overall disease liability) is explainable based on common SNPs reaching genome-wide significance at the current sample size. This hl2 estimate is in reasonable agreement with a recent result, hl2=0.27 [73, 87], also calculated from the PGC2 data set but using raw genotype data for 472,178 markers for a subset of 22,177 schizophrenia cases and 27,629 controls of European ancestry; and with an earlier result of hl2=0.23 from PGC1 raw genotype data for 915,354 markers for 9,087 schizophrenia cases and 12,171 controls [7, 88]. The recent “M2” (single non-null Gaussian) model estimate is hl2=0.29 [14] (see S1 Appendix (p. S13)). No QQ plot was available for the M2 model fit to schizophrenia data, but such plots (truncated on the y-axis at −log10(p) = 10) for many other phenotypes were reported [14]. We note that for multiple phenotypes (height, LDL cholesterol, total cholesterol, years of schooling, Crohn’s disease, coronary artery disease, and ulcerative colitis) our single causal Gaussian model appears to provide a better fit to the data than M2: many of the M2 plots show a very early and often dramatic deviation between prediction and data, as compared with our model QQ plots which are also built from a single causal Gaussian, suggesting an upward bias in polygenicity and/or variance of effect sizes, and hence heritability as measured by the M2 model for these phenotypes. The LDSC liability-scale (1% prevalence) SNP heritability for schizophrenia has been reported as hl2=0.555 [12] and more recently as 0.19 [77], the latter in very good agreement with our estimate; on the observed scale it has been reported as 45% [12, 84], in contrast to our corresponding value of 37%. Our estimate of 1% of overall variation on the liability scale for schizophrenia explainable by genome-wide significant loci compares reasonably with the proportion of variance on the liability scale explained by Risk Profile Scores (RPS) reported as 1.1% using the “MGS” sample as target (the median for all 40 leave-one-out target samples analyzed is 1.19%—see Extended Data Figure 5 and Supplementary Tables 5 and 6 in [36]; this was incorrectly reported as 3.4% in the main paper). These results show that current sample sizes need to increase substantially in order for RPSs to have predictive utility, as the vast majority of associated SNPs remain undiscovered. Our power estimates indicate that ∼500,000 cases and an equal number of controls would be needed to identify these SNPs (note that there is a total of approximately 3 million cases in the US alone).

A subtle but important issue is downward bias of large-sample maximum-likelihood estimates of SNP heritability, due to over-ascertainment of cases in case-control studies [87]; it has been examined in the context of restricted maximum likelihood (REML) in GCTA, which assumes a polygenicity of 1, i.e., every SNP is causal. For schizophrenia, this has been assessed in the context of BOLT-REML, which assumes a mixture distribution of small (‘spike’) and large (‘slab’) effects [73]: from 22,177 cases and 27,629 controls, the observed-scale heritability is reported as h2 = 0.415, equivalent to hl2=0.23 on the liability scale, assuming 1% disease prevalence. However, using “phenotype correlation-genetic correlation” (PCGC) regression, a moments-based approach requiring raw-genotype data which produces unbiased estimates for case-control studies of disease traits [87], the unbiased liability-scale heritability is reported as hg2=0.27, indicating that the likelihood-maximization estimate is biased down by 15% of the unbiased value (the degree of underestimation decreases for smaller sample sizes). Our estimate for the liability-scale heritability of schizophrenia, from a larger sample than in [73], is hl2=0.21. This at least would be consistent with downward bias operating in point-normal causal distributions, in a manner similar to that in GCTA and BOLT-REML. This would then translate into either an underestimate of the number of causal SNPs, or more likely an underestimate of the variance of the distribution of causal effects.

For educational attainment, we estimate SNP heritability h2 = 0.12, in good agreement with the estimate of 11.5% given in [42]. As with schizophrenia, this is substantially less than the estimate of heritability from twin and family studies of ≃40% of the variance in educational attainment explained by genetic factors [74, 89].

For putamen volume, we estimate the SNP heritability h2 = 0.11, in reasonable agreement with an earlier estimate of 0.1 for the same overall data set [4, 47]. For LDL and HDL, we estimate h2 = 0.06 and h2 = 0.07 respectively, in good agreement with the LDSC estimates h2 = 0.08 and h2 = 0.07 [77], and the M2 model of [14]—see S1 Appendix (p. S13).

For height (N = 251,747 [46]) we find that its model polygenicity is π1 = 5.66 × 10−4, a quarter that of intelligence, while its discoverability is five times that of intelligence, leading to a SNP heritability of 17%. The number of causal SNPs (out of a total of approximately 11 million) is approximately 6k; although this is about one twentieth the estimate reported in [90], it remains large and allows for height to be interpreted as “omnigenic”. For the 2010 GWAS (N = 133,735 [69]) and 2018 GWAS (N = 707,868 [49]), we estimate SNP heritability of 17% and 19% respectively (see Table 2, and S1 Appendix (p. S17)). These heritabilities are in considerable disagreement with the SNP heritability estimate of ≃50% [46] (average of estimates from five cohorts ranging in size from N = 1,145 to N = 5,668, with ≃1 million SNPs). For the 2010 GWAS, the M2 model [14] gives h2 = 0.30 (see S1 Appendix (p. S13)); the upward deviation of the model QQ plot in [14] suggests that this value might be inflated. For the 2014 GWAS, the M3 model estimate is h2 = 33% [14]; the Regression with Summary Statistics (RSS) model estimate is h2 = 52% (with ≃11, 000 causal SNPs) [91], which, not taking any inflation into account, is definitely a model overestimate; and in [77] the LDSC estimate is reported as h2 = 20% while the SumHer estimate is h2 = 46% (in general across traits, the SumHer heritability estimates tend to be two-to-five times larger than the LDSC estimates). The M2, M3, and RSS models use a reference panel of ≃1 million common SNPs, in contrast with the ≃11 million SNPs used in our analysis. Also, it should be noted that the M2, M3, and RSS model estimates did not take the possibility of inflation into account. For the 2014 height GWAS, that inflation is reported as the LDSC intercept is 2.09 in [77], indicating considerable inflation; for the 2018 dataset we find σ02 = 2.5, while the LD score regression intercept is 2.1116 (se 0.0458). Given the various estimates of inflation and the controversy over population structure in the height data [50, 51], it is not clear what results are definitely incorrect.

Our power analysis for height (2014) shows that 37% of the narrow-sense heritability arising from common SNPs is explained by genome-wide significant SNPs (p ≤ 5 × 10−8), i.e., 6.3% of total phenotypic variance, which is substantially less than the 16% direct estimate from significant SNPs [46]. It is not clear why these large discrepancies exist. One relevant factor, however, is that we estimate a considerable confounding (σ02 = 1.66) in the height 2014 dataset. Our h2 estimates are adjusted for the potential confounding measured by σ02, and thus they represent what is likely a lower bound of the actual SNP-heritability, leading to a more conservative estimate than what has previously been reported. We note that after adjustment, our h2 estimates are consistent across all three datasets (height 2010, 2014 and 2018), which otherwise would range by more than 2.5-fold. Another factor might be the relative dearth of typed SNPs with low heterozygosity and low total LD (see top left segment in S1 Appendix (p. S29), n = 780): there might be many causal variants with weak effect that are only weakly tagged. Nevertheless, given the discrepancies noted above, caution is warranted in interpreting our model results for height.

Conclusion

The common-SNP causal effects model we have presented is based on GWAS summary statistics and detailed LD structure of an underlying reference panel, and assumes a Gaussian distribution of effect sizes at a fraction of SNPs randomly distributed across the autosomal genome. While not incorporating the effects of rare SNPs, we have shown that it captures the broad genetic architecture of diverse complex traits, where polygenicities and the variance of the effect sizes range over orders of magnitude.

The current model (essentially Eq 4) and its implementation (essentially Eq 29) are basic elements for building a more refined model of SNP effects using summary statistics. Higher accuracy in characterizing causal alleles in turn will enable greater power for SNP discovery and phenotypic prediction.

Supporting information

S1 Appendix. Appendix.

Additional text, tables, and figures.

(PDF)

S1 DataSourceList. Data source list.

(XLSX)

Acknowledgments

We thank the consortia for making available their GWAS summary statistics, and the many people who provided DNA samples.

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. The American Journal of Human Genetics. 2012;90(1):7–24. 10.1016/j.ajhg.2011.11.029 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Stahl EA, Wegmann D, Trynka G, Gutierrez-Achury J, Do R, Voight BF, et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature genetics. 2012;44(5):483–489. 10.1038/ng.2232 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Yang J, Bakshi A, Zhu Z, Hemani G, Vinkhuyzen AA, Lee SH, et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature genetics. 2015;. 10.1038/ng.3390 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. So HC, Li M, Sham PC. Uncovering the total heritability explained by all true susceptibility variants in a genome-wide association study. Genetic epidemiology. 2011;35(6):447–456. 10.1002/gepi.20593 [DOI] [PubMed] [Google Scholar]
  • 5. Speed D, Hemani G, Johnson MR, Balding DJ. Improved heritability estimation from genome-wide SNPs. The American Journal of Human Genetics. 2012;91(6):1011–1021. 10.1016/j.ajhg.2012.10.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating missing heritability for disease from genome-wide association studies. The American Journal of Human Genetics. 2011;88(3):294–305. 10.1016/j.ajhg.2011.02.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics. 2011;88(1):76–82. 10.1016/j.ajhg.2010.11.011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Kumar SK, Feldman MW, Rehkopf DH, Tuljapurkar S. Limitations of GCTA as a solution to the missing heritability problem. Proceedings of the National Academy of Sciences. 2016;113(1):E61–E70. 10.1073/pnas.1520109113 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Palla L, Dudbridge F. A Fast Method that Uses Polygenic Scores to Estimate the Variance Explained by Genome-wide Marker Panels and the Proportion of Variants Affecting a Trait. The American Journal of Human Genetics. 2015;97(2):250–259. 10.1016/j.ajhg.2015.06.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics. 2010;11(7):459–463. 10.1038/nrg2813 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Yang J, Weedon MN, Purcell S, Lettre G, Estrada K, Willer CJ, et al. Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics. 2011;19(7):807–812. 10.1038/ejhg.2011.39 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, Yang J, Patterson N, et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature genetics. 2015;47(3):291–295. 10.1038/ng.3211 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Kang HM, Sul JH, Zaitlen NA, Kong Sy, Freimer NB, Sabatti C, et al. Variance component model to account for sample structure in genome-wide association studies. Nature genetics. 2010;42(4):348–354. 10.1038/ng.548 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Zhang Y, Qi G, Park JH, Chatterjee N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nature genetics. 2018;50(9):1318 10.1038/s41588-018-0193-x [DOI] [PubMed] [Google Scholar]
  • 15. Zeng J, Vlaming R, Wu Y, Robinson MR, Lloyd-Jones LR, Yengo L, et al. Signatures of negative selection in the genetic architecture of human complex traits. Nature genetics. 2018;50(5):746 10.1038/s41588-018-0101-4 [DOI] [PubMed] [Google Scholar]
  • 16. Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics. 2016;. 10.1038/nrg.2016.142 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Witte JS, Visscher PM, Wray NR. The contribution of genetic variants to disease depends on the ruler. Nature Reviews Genetics. 2014;15(11):765–776. 10.1038/nrg3786 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Holland D, Wang Y, Thompson WK, Schork A, Chen CH, Lo MT, et al. Estimating Effect Sizes and Expected Replication Probabilities from GWAS Summary Statistics. Front Genet. 2016;7:15 10.3389/fgene.2016.00015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999. December;55(4):997–1004. 10.1111/j.0006-341x.1999.00997.x [DOI] [PubMed] [Google Scholar]
  • 20. Holland D, Fan CC, Frei O, Shadrin AA, Smeland OB, Sundar VS, et al. Estimating degree of polygenicity, causal effect size variance, and confounding bias in GWAS summary statistics. bioRxiv. 2017; Available from: https://www.biorxiv.org/content/early/2017/05/24/133132. [Google Scholar]
  • 21. Thompson WK, Wang Y, Schork A, Zuber V, Andreassen OA, Dale AM, et al. An empirical Bayes method for estimating the distribution of effects in genome-wide association studies. PLoS Genetics. 2015;[in press]. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Yang J, Manolio TA, Pasquale LR, Boerwinkle E, Caporaso N, Cunningham JM, et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature genetics. 2011;43(6):519–525. 10.1038/ng.823 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Gelman A, Stern HS, Carlin JB, Dunson DB, Vehtari A, Rubin DB. Bayesian data analysis. Chapman and Hall/CRC; 2013. [Google Scholar]
  • 24. Laird NM, Lange C. The fundamentals of modern statistical genetics. Springer Science & Business Media; 2010. [Google Scholar]
  • 25. Wu C, DeWan A, Hoh J, Wang Z. A comparison of association methods correcting for population stratification in case–control studies. Annals of human genetics. 2011;75(3):418–427. 10.1111/j.1469-1809.2010.00639.x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying causal variants at loci with multiple signals of association. Genetics. 2014;198(2):497–508. 10.1534/genetics.114.167908 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Holland D. GWAS-Causal-Effects-Model; 2019. https://github.com/dominicholland/GWAS-Causal-Effects-Model.
  • 28. Consortium GP, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. 10.1038/nature15393 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Consortium GP, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65. 10.1038/nature11632 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Sveinbjornsson G, Albrechtsen A, Zink F, Gudjonsson SA, Oddson A, Másson G, et al. Weighting sequence variants based on their annotation increases power of whole-genome association studies. Nature genetics. 2016;. 10.1038/ng.3507 [DOI] [PubMed] [Google Scholar]
  • 31. Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 2003;165(4):2213–2233. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Spencer CC, Su Z, Donnelly P, Marchini J. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 2009;5(5):e1000477 10.1371/journal.pgen.1000477 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011;27(16):2304–2305. 10.1093/bioinformatics/btr341 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Wray NR, Ripke S, Mattheisen M, Trzaskowski M, Byrne EM, Abdellaoui A, et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nature genetics. 2018;50(5):668 10.1038/s41588-018-0090-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Stahl EA, Breen G, Forstner AJ, McQuillin A, Ripke S, Trubetskoy V, et al. Genome-wide association study identifies 30 Loci Associated with Bipolar Disorder. Nature genetics. 2019;p. 1 10.1038/s41588-019-0397-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014. July;511(7510):421–427. 10.1038/nature13595 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Nikpay M, Goel A, Won HH, Hall LM, Willenborg C, Kanoni S, et al. A comprehensive 1000 Genomes–based genome-wide association meta-analysis of coronary artery disease. Nature genetics. 2015;47(10):1121 10.1038/ng.3396 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. de Lange KM, Moutsianas L, Lee JC, Lamb CA, Luo Y, Kennedy NA, et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nature genetics. 2017;49(2):256 10.1038/ng.3760 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Lambert JC, Ibrahim-Verbaas CA, Harold D, Naj AC, Sims R, Bellenguez C, et al. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nature genetics. 2013;45(12):1452–1458. 10.1038/ng.2802 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Jansen I, Savage J, Watanabe K, Bryois J, Williams D, Steinberg S, et al. Genetic meta-analysis identifies 10 novel loci and functional pathways for Alzheimer’s disease risk. bioRxiv. 2018;p. 258533. [Google Scholar]
  • 41. Van Rheenen W, Shatunov A, Dekker AM, McLaughlin RL, Diekstra FP, Pulit SL, et al. Genome-wide association analyses identify new risk variants and the genetic architecture of amyotrophic lateral sclerosis. Nature genetics. 2016;48(9):1043 10.1038/ng.3622 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, Rietveld CA, et al. Genome-wide association study identifies 74 loci associated with educational attainment. Nature. 2016;533(7604):539–542. 10.1038/nature17671 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Sniekers S, Stringer S, Watanabe K, Jansen PR, Coleman JR, Krapohl E, et al. Genome-wide association meta-analysis of 78,308 individuals identifies new loci and genes influencing human intelligence. Nature genetics. 2017;49(7):1107 10.1038/ng.3869 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Savage JE, Jansen PR, Stringer S, Watanabe K, Bryois J, De Leeuw CA, et al. Genome-wide association meta-analysis in 269,867 individuals identifies new genetic and functional links to intelligence. Nature genetics. 2018;50(7):912 10.1038/s41588-018-0152-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015;518(7538):197 10.1038/nature14177 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nature genetics. 2014;46(11):1173–1186. 10.1038/ng.3097 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Hibar DP, Stein JL, Renteria ME, Arias-Vasquez A, Desrivières S, Jahanshad N, et al. Common genetic variants influence human subcortical brain structures. Nature. 2015;. 10.1038/nature14101 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Willer CJ, Schmidt EM, Sengupta S, Peloso GM, Gustafsson S, Kanoni S, et al. Discovery and refinement of loci associated with lipid levels. Nature genetics. 2013;45(11):1274 10.1038/ng.2797 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Yengo L, Sidorenko J, Kemper KE, Zheng Z, Wood AR, Weedon MN, et al. Meta-analysis of genome-wide association studies for height and body mass index in 700,000 individuals of European ancestry. bioRxiv. 2018;p. 274654. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Sohail M, Maier RM, Ganna A, Bloemendal A, Martin AR, Turchin MC, et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife. 2019;8:e39702 10.7554/eLife.39702 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Berg JJ, Harpak A, Sinnott-Armstrong N, Joergensen AM, Mostafavi H, Field Y, et al. Reduced signal for polygenic adaptation of height in UK Biobank. eLife. 2019;8:e39725 10.7554/eLife.39725 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Pe’er I, Yelensky R, Altshuler D, Daly MJ. Estimation of the multiple testing burden for genomewide association studies of nearly all common variants. Genetic epidemiology. 2008;32(4):381–385. 10.1002/gepi.20303 [DOI] [PubMed] [Google Scholar]
  • 53. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews genetics. 2008;9(5):356–369. 10.1038/nrg2344 [DOI] [PubMed] [Google Scholar]
  • 54. Clopper CJ, Pearson ES. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika. 1934;26(4):404–413. 10.1093/biomet/26.4.404 [DOI] [Google Scholar]
  • 55. Falconer DS. The inheritance of liability to certain diseases, estimated from the incidence among relatives. Annals of human genetics. 1965;29(1):51–76. 10.1111/j.1469-1809.1965.tb00500.x [DOI] [Google Scholar]
  • 56. Dempster ER, Lerner IM. Heritability of threshold characters. Genetics. 1950;35(2):212 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.NIMH. Prevalence of Major Depressive Episode Among Adults; 2016. (accessed December 27, 2018). Available from: https://www.nimh.nih.gov/health/statistics/major-depression.shtml.
  • 58. Merikangas KR, Jin R, He JP, Kessler RC, Lee S, Sampson NA, et al. Prevalence and correlates of bipolar spectrum disorder in the world mental health survey initiative. Archives of general psychiatry. 2011;68(3):241–251. 10.1001/archgenpsychiatry.2011.12 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Speed D, Cai N, Johnson MR, Nejentsev S, Balding DJ, Consortium U, et al. Reevaluation of SNP heritability in complex human traits. Nature genetics. 2017;49(7):986 10.1038/ng.3865 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Sanchis-Gomar F, Perez-Quilis C, Leischik R, Lucia A. Epidemiology of coronary heart disease and acute coronary syndrome. Annals of translational medicine. 2016;4(13). 10.21037/atm.2016.06.33 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Burisch J, Jess T, Martinato M, Lakatos PL, ECCO-EpiCom. The burden of inflammatory bowel disease in Europe. Journal of Crohn’s and Colitis. 2013;7(4):322–337. 10.1016/j.crohns.2013.01.010 [DOI] [PubMed] [Google Scholar]
  • 62. Plassman BL, Langa KM, Fisher GG, Heeringa SG, Weir DR, Ofstedal MB, et al. Prevalence of dementia in the United States: the aging, demographics, and memory study. Neuroepidemiology. 2007;29(1-2):125–132. 10.1159/000109998 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Alzheimer’s Association. 2018 Alzheimer’s disease facts and figures. Alzheimer’s & Dementia. 2018;14(3):367–429. [Google Scholar]
  • 64. Mehta P, Kaye W, Raymond Jea. Prevalence of Amyotrophic Lateral Sclerosis 2014 United States. MMWR Morb Mortal Wkly Rep. 2018;67:216–218. 10.15585/mmwr.mm6707a3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. Jansen IE, Savage JE, Watanabe K, Bryois J, Williams DM, Steinberg S, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nature genetics. 2019;p. 1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, Sullivan PF, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460(7256):748–752. 10.1038/nature08185 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Whiteford HA, Degenhardt L, Rehm J, Baxter AJ, Ferrari AJ, Erskine HE, et al. Global burden of disease attributable to mental and substance use disorders: findings from the Global Burden of Disease Study 2010. The Lancet. 2013;382(9904):1575–1586. 10.1016/S0140-6736(13)61611-6 [DOI] [PubMed] [Google Scholar]
  • 68. Kinney DK, Teixeira P, Hsu D, Napoleon SC, Crowley DJ, Miller A, et al. Relation of schizophrenia prevalence to latitude, climate, fish consumption, infant mortality, and skin color: a role for prenatal vitamin d deficiency and infections? Schizophrenia bulletin. 2009;p. sbp023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010. July;42(7):565–569. 10.1038/ng.608 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature genetics. 2018;50(8):1112 10.1038/s41588-018-0147-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Balendra R, Isaacs AM. C9orf72-mediated ALS and FTD: multiple pathways to disease. Nature Reviews Neurology. 2018;p. 1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Fomin V, Richard P, Hoque M, Li C, Gu Z, Fissore-O’Leary M, et al. The C9ORF72 gene, implicated in amyotrophic lateral sclerosis and frontotemporal dementia, encodes a protein that functions in control of endothelin and glutamate signaling. Molecular and cellular biology. 2018;38(22):e00155–18. 10.1128/MCB.00155-18 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Loh PR, Bhatia G, Gusev A, Finucane HK, Bulik-Sullivan BK, Pollack SJ, et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nature genetics. 2015;. 10.1038/ng.3431 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Rietveld CA, Medland SE, Derringer J, Yang J, Esko T, Martin NW, et al. GWAS of 126,559 individuals identifies genetic variants associated with educational attainment. science. 2013;340(6139):1467–1471. 10.1126/science.1235488 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Cesarini D, Visscher PM. Genetics and educational attainment. npj Science of Learning. 2017;2(1):4 10.1038/s41539-017-0005-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76. Plomin R, von Stumm S. The new genetics of intelligence. Nature Reviews Genetics. 2018;. 10.1038/nrg.2017.104 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Speed D, Balding DJ. SumHer better estimates the SNP heritability of complex traits from summary statistics. Nature genetics. 2019;51(2):277 10.1038/s41588-018-0279-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78. Ridge PG, Hoyt KB, Boehme K, Mukherjee S, Crane PK, Haines JL, et al. Assessment of the genetic variance of late-onset Alzheimer’s disease. Neurobiology of aging. 2016;41:200–e13. 10.1016/j.neurobiolaging.2016.02.024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79. Ridge PG, Mukherjee S, Crane PK, Kauwe JS, et al. Alzheimer’s disease: analyzing the missing heritability. PloS One. 2013;8(11):e79771 10.1371/journal.pone.0079771 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80. Gatz M, Reynolds CA, Fratiglioni L, Johansson B, Mortimer JA, Berg S, et al. Role of genes and environments for explaining Alzheimer disease. Archives of general psychiatry. 2006;63(2):168–174. 10.1001/archpsyc.63.2.168 [DOI] [PubMed] [Google Scholar]
  • 81. Evans LM, Tahmasbi R, Vrieze SI, Abecasis GR, Das S, Gazal S, et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nature genetics. 2018;50(5):737 10.1038/s41588-018-0108-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82. Yang J, Zeng J, Goddard ME, Wray NR, Visscher PM. Concepts, estimation and interpretation of SNP-based heritability. Nature genetics. 2017;49(9):1304 10.1038/ng.3941 [DOI] [PubMed] [Google Scholar]
  • 83. Zhou X, Carbonetto P, Stephens M. Polygenic modeling with bayesian sparse linear mixed models. PLoS genetics. 2013;9(2):e1003264 10.1371/journal.pgen.1003264 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84. Zheng J, Erzurumluoglu AM, Elsworth BL, Kemp JP, Howe L, Haycock PC, et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics. 2017;33(2):272–279. 10.1093/bioinformatics/btw613 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85. Holland D, Desikan RS, Dale AM, McEvoy LK, Initiative ADN, et al. Rates of decline in Alzheimer disease decrease with age. PloS one. 2012;7(8):e42325 10.1371/journal.pone.0042325 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86. Desikan RS, Fan CC, Wang Y, Schork AJ, Cabral HJ, Cupples LA, et al. Genetic assessment of age-associated Alzheimer disease risk: Development and validation of a polygenic hazard score. PLoS medicine. 2017;14(3):e1002258 10.1371/journal.pmed.1002258 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87. Golan D, Lander ES, Rosset S. Measuring missing heritability: inferring the contribution of common variants. Proceedings of the National Academy of Sciences. 2014;111(49):E5272–E5281. 10.1073/pnas.1419064111 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88. Lee SH, DeCandia TR, Ripke S, Yang J, Sullivan PF, Goddard ME, et al. Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nature genetics. 2012;44(3):247–250. 10.1038/ng.1108 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89. Branigan AR, McCallum KJ, Freese J. Variation in the heritability of educational attainment: An international meta-analysis. Social Forces. 2013;p. 109–140. 10.1093/sf/sot076 [DOI] [Google Scholar]
  • 90. Boyle EA, Li YI, Pritchard JK. An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell. 2017;169(7):1177–1186. 10.1016/j.cell.2017.05.038 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91. Zhu X, Stephens M. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. The annals of applied statistics. 2017;11(3):1561 10.1214/17-aoas1046 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, Koseki M, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature. 2010;466(7307):707 10.1038/nature09270 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Scott M Williams, Xiaofeng Zhu

5 Aug 2019

Dear Dr Holland,

Thank you very much for submitting your Research Article entitled 'Beyond SNP Heritability: Polygenicity and Discoverability of Phenotypes Estimated with a Univariate Gaussian Mixture Model' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

In particular, please provide the software codes and the detailed simulations, as all three reviewers have a question about the availability of the software.

  

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Xiaofeng Zhu

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Natural Variation

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this report, the authors proposed a general computitional method that can infer the genetic architecture underlying complex traits. an idea similar to LD score regression, using a mixture Gaussian distribution GWAS summary statistics have been combined with a reference population (1KG); a powerful method, as demonstrated by the authors, various interesting parameters and questions, such as the proportion of causal variants and their variance can be addressed. In real GWAS summary data analysis, a range of GWAS complex traits have been tested and consistent to previous reported results in other studies.

Technical comments

1 Similar ideas but using individual-level data had been proposed in Bayesian approaches [Nat Genet, Zeng, 2018]. In particular, their model assume the effect size is relavent to MAF. It is likely to be the case that the MAF and effect has certain correlation, negative or positive upon the scheme of the selection pattern. It is probably one of the reasons that some z-score distribution was not fitted as well as others in the manuscript.

2 The kernel of the method is the introduction of the mixture model that is composed of the causal and null components. In the multinomial pdf function Eq 29 in Supplementary notes, H_i is taken as the mean 2pq for the n_i snps in the bin. However, as k_i is a small number, say k_i =1, the 2pq of this specific causal locus will be very different from the mean 2pq of n_i snps in this bin. It is not really clear H_i in Eq 29 has been updated for k_i or always taken as H_i as approximation? In gut feeling, it may effect the estimation. However, the authors emphized that the real calculation was based on convolution,

3 Eq 51. H_bar is taken as 0.2165, similar critics as in the last comment.

4 The LD score is estimated from the 1KG, which is presumed to be “health controls”. For case-control traits, the LD in the case-control sample will be different from “health controls”, in particular the effect size is large or the trait in under selection. Any consequence due to this leap?

5 The estimation procedure should be better demonstrated in the main text, which is the kernel of the paper. As well as its step-by-step algorithm, and demo code should be provided for others to follow up.

Minors

1 The main text and the supplementary notes are very similar to each other.

2 References are inconsistent between the main and the supplementary notes.

3 Many typos. The ms has not been lined well for the reviewer to tell where those typos. In the supplementary materials it is also referred to “Supplementary Materials”, page 16 second column.

4 Probably the author is from physics field, I guess. The authors used some jargons such as Dirac delta function. However, according the sup Notes, Eq 6-8, the Dirac delta function seems a standard normal distribution instead.

Reviewer #2: In this paper the authors present a method for estimating the genetic architecture of a complex trait from the set of GWAS summary statistics and an estimate of the LD matrix. From a methodological perspective, the major contribution of the paper seems to be the methods, presented in the supplement, for fitting the model using summary statistics and the alone. In the model, the marginal effect sizes for each individiaul SNP are assumed to be generated due to tagging effects of multiple linked SNPs, where the causal effects of the linked SNPs follow a mixture distribution that is a combination of a point mass at zero, and a normal distribution with some variance. Such mixture distributions are fairly common in the field, but one contribution here seems to be a conceptual stand on principle that a significant number of variants should have no effect on a given trait at all, and the methods here fit such a model (whereas most mixture distribution approaches assume that all sites makes at least some small contribution).

In terms of the presentation, I think it would be helpful if the authors could present a bit more of the details of the model could be presented. E.g. I see no reason why equations 5 and 16 from the supplement (along with perhaps whatever others the authors feel necessary to provide appropriate context; see the following) couldn't appear in the main text. More broadly, I think the paper could do more to help place the present work in the context of existing methods of a similar flavor (of which there are many). What specifically distinguishes the approaches laid out here from those which can be found elsewhere in the literature. The reader shouldn't have to take a detour to read 10 other paper in order to figure it out.

I must confess I am a bit concerned about identifiability issues here. E.g. the authors show in the supplement that if you simulate data under diffiferent assumptions, you can somewhat seriously misestimate the number of causal SNPs. In that specific case (where effects come from a particular mixture of Gaussians), the show that their model does not fit the QQ plot well, so we might take the fact that the the model does fit the QQ plots for real traits as a reason to believe in the estimates, but this is weak evidence at best. In the one case where the authors have real data for the same trait at multiple different samples sizes (height), the estimated number of causal SNPs is correlated with GWAS sample size. It is of course hard to know exactly what to make of this, given the known structure confounding issues in the GIANT consortium data, but it is not super reassuring. If the true effect size distribution exhibits the sort of long flat tails the emerge from scale mixtures of Gaussians (i.e. the sort that is explored in the supplement), I would not be surprised to see sample size dependence of parameter estimates, particularly the polygenicity, if the effect sizes truly follow this distribution.

In general, I don't like to suggest too much extra work in review, but it strikes me that it would be reassuring to see something like the UK biobank height data subsetted to different sample sizes, as a demonstration that you get consistent estimates of the number of causal sites. I.e. if you run 10 GWAS on resampled subsets of 50k UK Biobank participants, and then apply your method, do you get roughly the same average estimates for the parameters as you do if you run 10 GWAS on different subsets of 200k individuals?

I also wanted to respond briefly to note that, while it is a very minor point in the context of this paper, I think the authors have a mistaken view of what the omnigenic model is. The omnigenic model, as articulated by Boyle et al, and more recently by Liu et al, is not a claim about the total number of casual SNPs that exist for complex phenotypes. Rather, it is a conceptual model meant to explain multiple features that are commonly observed in GWAS data. One of these features is that the total number of causal SNPs is high, but I think those authors would argue that ~10k is still "high", as far as they are concerned. But another important feature is the fact that much heritability seems to be attributable to variants within genes that are not directly related to the the trait in question, but simply expressed in the relevant cell type. In fact, the "omni" in "omnigenic" is best interpretted in this context. It means roughly "all genes expressed in the relevant cell types affect the trait" not "all variants affect all traits", which is what I read the authors to be arguing against when they say in the main text that height is "very far from omnigenic".

Is software to fit the model made available? I could not find any mention of it in the text.

Reviewer #3: This manuscript describes a new method for estimating heritability and the number of causal SNPs for GWAS traits. The proposed method relies on a spike-slab model and perform a grid search based on MAF and LD pattern for inference. The authors applied the model to analyze 16 diverse phenotypes including eight binary disease traits. Overall, the manuscript is well written, and the analysis is well carried out. The propose algorithm makes sense and appears to work well. With that said, however, I do have a few main concerns:

1. The proposed spike-slab model is identical to that used in ref #14 and one of the proposed inference algorithms (the one based on the multinomial expansion) is also very similar to the one proposed in ref #14. Therefore, it would be important to provide a detailed description of the method in ref #14 and explain how the present method differs from that of ref #14. In addition, it would be important to perform comprehensive comparisons with the ref #14 method in all simulations and real data applications, to demonstrate any benefits of the proposed method.

2. The authors did not provide any software. Some of the proposed algorithms are complicated. Without providing an open-source and user-friendly software along with a software manual, it would be almost impossible for anyone to use the proposed method.

3. The authors did not provide the analysis code used in the paper. I am thus unable to reproduce any of their simulation and real data analysis results. It would be important to provide all analysis code along with all analyzed gwas summary statistics.

4. It has been shown previously that sparse models, like the spike-slab model used in the present paper, are not suitable for heritability estimation under polygenic settings. Indeed, sparse models often under-estimate heritability, at least in studies with small samples (e.g. as illustrated in ref #71). I am thus surprised to see the h2 estimates are fine in the simulations. I was wondering if this is due to the unrealistic simulations employed in the present paper: all simulations were based on simulated SNPs but not based on SNPs from real data. Therefore, it would be important to perform realistic simulations either using real genotype data or following the simulation approach used in ref #14.

5. Half of the analyzed traits in the present study are binary disease traits. It is well known that likelihood-based approaches are not suitable for h2 estimation for case control studies, as likelihood-based method would bias the estimation of h2 due to case control ascertainment (e.g. ref #75). I am thus concerned whether the proposed method can be used for analyzing binary traits. It would be important to provide mathematical details on how the proposed model can accommodate case control studies. For example, do you need to incorporate the spike-slab prior to the liability threshold model? Do you need to develop moment matching type of algorithms for inference? It would also be important to perform comprehensive simulations under ascertained case control settings, like those have been done in ref #75.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: No: demo code is not provided

Reviewer #2: Yes

Reviewer #3: No:

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Decision Letter 1

Scott M Williams, Xiaofeng Zhu

5 Nov 2019

Dear Dr Holland,

Thank you very much for submitting your Research Article entitled 'Beyond SNP Heritability: Polygenicity and Discoverability of Phenotypes Estimated with a Univariate Gaussian Mixture Model' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Xiaofeng Zhu

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Natural Variation

PLOS Genetics

I suggested that the authors carefully address the reviewer 3's comments, which are also the previous comments 1 and 5 by the reviewer 3.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Congratulations!

It is very helpful to provide the code for the proposed analysis.

Reviewer #3: My previous main comments #1 and #5 were not well addressed. In the previous comment #1, I suggested the authors to perform a comprehensive comparison with previous methods developed in ref #14 (M2 and M3 methods) in all simulations and real data applications, due to the apparent similarity between the proposed method and the previous M2/M3 methods. The authors only criticized the previous M2/M3 methods being incorrect and applied M2/M3 models to real data. In the real data applications, the authors showed that the proposed method tends to produce a higher estimated number of causal SNPs than the previous M2/M3 methods and tends to estimate the heritability lower than the previous M2/M3 methods. Without simulations, however, it is unclear whether such difference observed in the real data is due to the inaccurate estimates from the proposed method or the inaccurate estimates from the previous M2/M3 methods.

In the previous comment #5, I suggested the authors to perform simulations based on the liability threshold model to examine any potential downward bias in heritability estimation for binary traits. The authors explained that the estimates from the proposed method did produce downward bias (in the case of schizophrenia, about 15% downward bias). Because the estimates from M2/M3 models for all binary traits are all higher than that from the proposed method, I was wondering if the previous M2/M3 methods were not susceptible to case control ascertainment and thus could produce unbiased heritability estimates. It would thus be important to perform simulations based on the liability threshold model, to better understand whether the biased heritability estimation is a unique drawback for the proposed method, or whether it also applies to M2/M3.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

Decision Letter 2

Scott M Williams, Xiaofeng Zhu

15 Jan 2020

Dear Dr Holland,

We are pleased to inform you that your manuscript entitled "Beyond SNP Heritability: Polygenicity and Discoverability of Phenotypes Estimated with a Univariate Gaussian Mixture Model" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional accept, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about one way to make your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Xiaofeng Zhu

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Natural Variation

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-19-00865R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Scott M Williams, Xiaofeng Zhu

27 Apr 2020

PGENETICS-D-19-00865R2

Beyond SNP Heritability: Polygenicity and Discoverability of Phenotypes Estimated with a Univariate Gaussian Mixture Model

Dear Dr Holland,

We are pleased to inform you that your manuscript entitled "Beyond SNP Heritability: Polygenicity and Discoverability of Phenotypes Estimated with a Univariate Gaussian Mixture Model" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Kaitlin Butler

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Appendix.

    Additional text, tables, and figures.

    (PDF)

    S1 DataSourceList. Data source list.

    (XLSX)

    Attachment

    Submitted filename: Reviewer_Response_HollandEtAl_PLOS_GENETICS_21Sep2019.docx

    Attachment

    Submitted filename: hollandEtAl_coverLetter_PlosGen12Dec2019.doc

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLoS Genetics are provided here courtesy of PLOS

    RESOURCES