American Journal of Human Genetics. 2015 Aug 6;97(2):250–259. doi: 10.1016/j.ajhg.2015.06.005

A Fast Method that Uses Polygenic Scores to Estimate the Variance Explained by Genome-wide Marker Panels and the Proportion of Variants Affecting a Trait

Luigi Palla, Frank Dudbridge
PMCID: PMC4573448  PMID: 26189816

Abstract

Several methods have been proposed to estimate the variance in disease liability explained by large sets of genetic markers. However, current methods do not scale up well to large sample sizes. Linear mixed models require solving high-dimensional matrix equations, and methods that use polygenic scores are very computationally intensive. Here we propose a fast analytic method that uses polygenic scores, based on the formula for the non-centrality parameter of the association test of the score. We estimate model parameters from the results of multiple polygenic score tests based on markers with p values in different intervals. We estimate parameters by maximum likelihood and use profile likelihood to compute confidence intervals. We compare various options for constructing polygenic scores, based on nested or disjoint intervals of p values, weighted or unweighted effect sizes, and different numbers of intervals, in estimating the variance explained by a set of markers, the proportion of markers with effects, and the genetic covariance between a pair of traits. Our method provides nearly unbiased estimates and confidence intervals with good coverage, although estimation of the variance is less reliable when jointly estimated with the covariance. We find that disjoint p value intervals perform better than nested intervals, but the weighting did not affect our results. A particular advantage of our method is that it can be applied to summary statistics from single markers, and so can be quickly applied to large consortium datasets. Our method, named AVENGEME (Additive Variance Explained and Number of Genetic Effects Method of Estimation), is implemented in R software.

Introduction

Genome-wide association studies have been successful in identifying many variants linked to complex diseases. To date more than 6,000 have been found in more than 500 quantitative traits and common diseases in humans.1 However, when considering the variance explained by the markers associated with any specific disease, there remains a large gap to match the heritability estimates obtained from family studies.2 This observation has spurred the development of theories and investigations to explain the missing heritability, including copy-number variation,3 rare variants,4 epigenetics,5 and genetic interactions.6

It has become increasingly clear that a large portion of the missing heritability is represented on current genotyping products, but the associated markers are not statistically significant. Several approaches have been developed to estimate the heritability explained by a set of genetic markers that might not be individually associated. In the linear mixed model approach, the genetic value of each individual is treated as a random effect whose sample covariance matrix is derived from the relatedness matrix, which is estimated from the genotype data.7 Solving this model gives an estimate of the additive genetic variance explained by the available genotypes, often called the “chip heritability.” Variations of this approach include multiple classes of variant with different effect size distributions,8,9 regression of pair-wise phenotypic correlation on genetic correlation,10 and multivariate models to estimate genetic correlation between traits.11

Another approach uses polygenic scores to estimate chip heritability. Here, effect sizes for all markers are estimated in one sample of data, called the training sample. These effects are then used to construct a score for each subject in a second sample, called the target sample, as the weighted sum of genotypes across a set of markers. Originally, association of the score in the target sample was used to demonstrate the presence of missing heritability among an ensemble of markers.12 More recently, the strength of this association has been used to infer the chip heritability.13,14

A further approach uses empirical Bayes methods to estimate the chip heritability from the distribution of Z scores for individual markers.15 This has the advantage of requiring only summary statistics from standard association analysis. Finally, a very recent method of “LD scoring”16 estimates the chip heritability from the correlation between the marginal effect size of a marker and a measure of its linkage disequilibrium (LD) with other markers, also using only summary statistics.

In general, the methods using linear mixed models are computationally expensive and require individual-level data to calculate the genetic relatedness matrix. Furthermore, many of these methods estimate only the chip heritability, but it is often of interest also to estimate the proportion of markers that affect a trait. This bears on the design of association studies, because it indicates the number and effect sizes of the associated markers remaining to be found. It is also relevant for the debate on the nature of evolution,17,18 because if a large number of variants affect a trait, mechanisms of selection by polygenic adaptation are possible, acting on standing variation without requiring new mutations.19 Methods for the estimation of the number of genes affecting a trait have been proposed since the early 20th century, including complex segregation analysis comparing single- and multi-locus models with or without polygenic background,20 but only with the recent availability of dense genome-wide data has it become possible to assess the polygenic background itself.

Linear mixed models have been extended to allow for a proportion of variants with effects,8 but this remains computationally demanding. Polygenic scoring has also been used to estimate this proportion, but again with a computationally demanding procedure that uses repeated genome-wide simulations within a Bayesian sampling scheme.13 On the other hand, an analytic method for polygenic scores14 estimates only one parameter among several defined in its model; therefore, it can estimate the proportion of variants with effects if the chip heritability is assumed to be known, or vice versa. Empirical Bayes methods are also available to estimate the proportion of markers with effects21 but have not been adapted to jointly estimate this proportion with the chip heritability.

Here we extend the analytic approach of Dudbridge14 to develop a fast analytic method based on polygenic scores for the joint estimation of chip heritability and the proportion of variants affecting the trait, and we further estimate the genetic covariance between two related traits. A particular advantage is that our method can be applied when only summary data is available for individual markers, and this allows our approach to be readily applied to the increasingly large datasets that are now being made available by study consortia.

Methods

Parameter Estimation: AVENGEME

We consider the model presented by Dudbridge14 in which a pair of standardized traits Y = (Y1,Y2)′ is expressed as a linear combination of m genetic effects and an error term E = (E1,E2)′:

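$$Y = \beta^{\prime} G + E = \left( \sum_{i=1}^{m} \beta_{i1} G_i + E_1,\ \sum_{i=1}^{m} \beta_{i2} G_i + E_2 \right)^{\prime} \qquad \text{(Equation 1)}$$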

where $G$ is an $m$-vector of coded genetic markers and $\beta$ an $m \times 2$ matrix of coefficients, with $E$ independent of $G$. Assuming that in two independent samples the estimates of the genetic effects are given, respectively, by $\hat{\beta}_{i1}$ and $\hat{\beta}_{i2}$, where $i = 1, \ldots, m$, either set of estimates can then be used to create polygenic scores $\hat{S}_1 = \sum_{i=1}^{m} \hat{\beta}_{i2} G_i$ and $\hat{S}_2 = \sum_{i=1}^{m} \hat{\beta}_{i1} G_i$ to be tested for association with $Y_1$ and $Y_2$, respectively. Focusing without loss of generality on $\hat{S}_2$, the statistical properties of the test of association have been described.14 In particular, the coefficient of determination between $\hat{S}_2$ and $Y_2$, i.e., the variance explained by the polygenic score in the regression of $Y_2$ on $\hat{S}_2$, is given by

$$R^2_{\hat{S}_2, Y_2} = \frac{m \, \mathrm{cov}(\hat{\beta}_{i1}, \beta_{i2})^2}{\mathrm{var}(\hat{\beta}_{i1}) \, \mathrm{var}(Y_2)},$$

where the terms on the right-hand side are expressed analytically in terms of the following parameters. For study design: sample sizes of the two samples, $(n_1, n_2)$; number of variants in the marker panel, $m$, assumed to be uncorrelated; p value thresholds for selecting a marker into the score from the training sample, $(p_L, p_U)$; and for binary traits, population prevalences $(K_1, K_2)$ and case sampling fractions $(P_1, P_2)$. For genetic model: additive genetic variance in the training sample, $\sigma_1^2$; genetic covariance between training and testing samples, $\sigma_{12}$; and proportion of null markers with no effect on the trait in the training sample, $\pi_{01}$. The variance and covariance are marginal over all markers, so include the null markers with $\beta_{i1} = 0$ or $\beta_{i2} = 0$.

The asymptotic non-centrality parameter of the $\chi_1^2$ test of association between $Y_2$ and $\hat{S}_2$ is given by $\lambda = n_2 R^2_{\hat{S}_2, Y_2} / (1 - R^2_{\hat{S}_2, Y_2})$; equivalently, the expectation of the Z (or t) test is $\mu = \sqrt{n_2 R^2_{\hat{S}_2, Y_2} / (1 - R^2_{\hat{S}_2, Y_2})}$, with the sign taken from the correlation between $Y_2$ and $\hat{S}_2$.
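As a small numerical illustration of these two expressions, the sketch below computes the non-centrality parameter and the expected Z statistic from a given coefficient of determination. The function names and the example value of 1% variance explained are illustrative assumptions, not part of AVENGEME.

```python
import math

def noncentrality(r2: float, n2: int) -> float:
    """Asymptotic non-centrality parameter: lambda = n2 * R^2 / (1 - R^2)."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    return n2 * r2 / (1.0 - r2)

def expected_z(r2: float, n2: int, sign: int = 1) -> float:
    """Expected Z statistic: square root of lambda, carrying the sign of the correlation."""
    return math.copysign(math.sqrt(noncentrality(r2, n2)), sign)

# Illustrative values only: a score explaining 1% of variance in 5,120 target subjects.
lam = noncentrality(0.01, 5120)
mu = expected_z(0.01, 5120)
```

The Z expectation is simply the signed square root of the chi-square non-centrality, so either quantity can be used to fit the model.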

Binary traits are assumed to arise from a liability threshold model, in which each subject has an unobserved trait, called the liability, that is normally distributed in the population. Subjects with liability greater than a fixed threshold have the trait. The same theory then holds when either $Y_1$ or $Y_2$ is binary as for when it is quantitative, assuming linear transformations between effects on the liability scale to effects on the observed (0/1) scale, and accounting for ascertainment in case/control studies. Specifically, each effect $\beta_{ij}$ on the liability scale corresponds to an effect $\beta_{ij} \, \phi(\tau_j) \, \bigl( P_j (1 - P_j) / K_j (1 - K_j) \bigr)$ on the observed binary scale,14 where $\tau_j = \Phi^{-1}(1 - K_j)$ with $\phi$ and $\Phi$ the standard normal density and cumulative distribution functions, respectively.
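The liability threshold and the normal density at that threshold can be computed directly from the prevalence. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and it deliberately covers only the threshold and density terms, not the full scale conversion.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1

def liability_threshold(prevalence: float) -> float:
    """tau = Phi^{-1}(1 - K): liability above which a subject has the trait."""
    return _STD_NORMAL.inv_cdf(1.0 - prevalence)

def density_at_threshold(prevalence: float) -> float:
    """phi(tau): standard normal density evaluated at the threshold."""
    return _STD_NORMAL.pdf(liability_threshold(prevalence))

# Prevalence of 1%, the value used for several diseases in this paper.
tau = liability_threshold(0.01)
phi_tau = density_at_threshold(0.01)
```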

We aim to estimate the genetic model parameters $\sigma_1^2$, $\sigma_{12}$, and $\pi_{01}$ from the association test between $\hat{S}_2$ and $Y_2$. Previously it was shown14 that one parameter could be estimated by solving for the value at which $\lambda$ equals the observed $\chi^2$ statistic. To estimate multiple parameters, we now propose using association tests of $Y_2$ with multiple polygenic scores constructed by selecting markers with different p value thresholds in the training data. We then fit parameters to the observed association tests by using maximum likelihood.

Specifically, let $d_1, \ldots, d_k$ denote a set of $k$ intervals within the unit interval, where $k$ is equal to or greater than the number of parameters to be estimated. For each $i = 1, \ldots, k$, we select markers with p values falling in $d_i$, construct the corresponding polygenic score, and obtain its (signed) Z score ($Z_i$) for association with $Y_2$. The log-likelihood for $\sigma_1^2$, $\sigma_{12}$, and $\pi_{01}$ is then

$$\ell(\sigma_1^2, \sigma_{12}, \pi_{01}) = \sum_{i=1}^{k} \log \phi\bigl( Z_i - \mu(\sigma_1^2, \sigma_{12}, \pi_{01}; d_i) \bigr),$$

where $\mu(\sigma_1^2, \sigma_{12}, \pi_{01}; d_i)$ is the expectation of the Z test as described above, expressed explicitly as a function of the model parameters given selection interval $d_i$. Maximization of this log-likelihood yields estimates of the model parameters. Note that any of $\sigma_1^2$, $\sigma_{12}$, and $\pi_{01}$ could be held fixed while the other parameter(s) are estimated.
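The structure of this log-likelihood can be sketched as a sum of standard normal log-densities of the residuals between observed and expected Z scores. In the sketch below, `toy_expected_z` is a hypothetical placeholder and is not the analytic expectation from reference 14; a real application would substitute that formula. The crude grid search likewise stands in for a proper numerical optimizer.

```python
import math
from typing import Callable, Sequence, Tuple

Interval = Tuple[float, float]
Params = Tuple[float, ...]

def log_phi(x: float) -> float:
    """Log of the standard normal density."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def quasi_loglik(z_scores: Sequence[float], intervals: Sequence[Interval],
                 expected_z: Callable[[Params, Interval], float],
                 params: Params) -> float:
    """Sum over intervals of log phi(Z_i - mu(params; d_i))."""
    return sum(log_phi(z - expected_z(params, d)) for z, d in zip(z_scores, intervals))

def toy_expected_z(params: Params, interval: Interval) -> float:
    """Placeholder only (assumption): NOT the analytic expectation of the Z test."""
    (scale,) = params
    lower, upper = interval
    return scale * (upper - lower)

# Toy data generated exactly at scale = 2, then recovered by a grid search.
intervals = [(0.0, 0.01), (0.01, 0.1), (0.1, 1.0)]
z_scores = [toy_expected_z((2.0,), d) for d in intervals]
grid = [0.5 * k for k in range(9)]  # 0.0, 0.5, ..., 4.0
best = max(grid, key=lambda s: quasi_loglik(z_scores, intervals, toy_expected_z, (s,)))
```

Passing the expectation as a callable keeps the likelihood code independent of the genetic model, so parameters can be fixed or freed by changing only that function.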

An equivalent procedure estimates (using obvious notation) $\sigma_2^2$, $\sigma_{12}$, and $\pi_{02}$ by reversing the roles of the training and target samples. Furthermore, a bidirectional procedure can be used to simultaneously estimate up to five parameters ($\sigma_1^2$, $\sigma_2^2$, $\sigma_{12}$, $\pi_{01}$, and $\pi_{02}$) by fitting to the Z scores for association of both $\hat{S}_2$ with $Y_2$ and $\hat{S}_1$ with $Y_1$.

The number of estimated parameters can be reduced by assuming that the genetic architectures are identical in the training and testing samples. This would occur if two samples are drawn from the same population with the same trait definitions, or if one sample is randomly split into training and target subsets. Then we can assume $\sigma_1^2 = \sigma_2^2 = \sigma_{12}$ and $\pi_{01} = \pi_{02}$, estimating just two parameters in either unidirectional or bidirectional analysis.

Ours is not a proper likelihood because the Z scores Zi corresponding to the marker selection intervals are not independent. The presence of a marker in one interval determines its presence or absence in all other intervals, creating dependence between the corresponding scores, but this is not reflected in our likelihood. Furthermore, the bidirectional likelihood does not account for dependence between the scores calculated in each direction. We are therefore using a quasi-likelihood and will later use simulations to investigate its sensitivity to the assumption of independent likelihood contributions.

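Maximization of the log-likelihood is complicated by constraints on the range of $\sigma_{12}$. Because the absolute correlation between $\beta_{i1}$ and $\beta_{i2}$ must be no greater than 1, $|\sigma_{12}| \le \sigma_1 \sigma_2$. In the unidirectional estimation, $\sigma_2^2$ is not identified and we need only respect that $\sigma_2^2 \le 1$, giving the constraint $|\sigma_{12}| \le \sigma_1$. In the bidirectional estimation, we must also consider that the absolute correlation is no greater than 1 for the markers that have non-null effects in both training and target samples. Denoting this correlation as $\rho^{*}$, the correlation over all markers as $\rho$, and the proportion of markers with non-null effects in both samples as $\gamma \le 1 - \max(\pi_{01}, \pi_{02})$, we have

$$\rho^{*} = \frac{\sigma_{12} \gamma^{-1}}{\sqrt{\sigma_1^2 (1 - \pi_{01})^{-1} \, \sigma_2^2 (1 - \pi_{02})^{-1}}} = \frac{\rho \sqrt{(1 - \pi_{01})(1 - \pi_{02})}}{\gamma}$$

$$\Rightarrow \ \sigma_{12} = \rho \, \sigma_1 \sigma_2 = \frac{\rho^{*} \gamma \, \sigma_1 \sigma_2}{\sqrt{(1 - \pi_{01})(1 - \pi_{02})}}$$

$$\Rightarrow \ |\sigma_{12}| \le \frac{\bigl(1 - \max(\pi_{01}, \pi_{02})\bigr) \, \sigma_1 \sigma_2}{\sqrt{(1 - \pi_{01})(1 - \pi_{02})}}.$$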
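The bidirectional bound on the absolute covariance, namely one minus the larger null proportion, times the product of the genetic standard deviations, divided by the square root of the product of the non-null proportions, is easy to evaluate numerically. The sketch below is illustrative; the function name is an assumption, and the example values are those of the bivariate simulation described later in the Methods.

```python
import math

def covariance_bound(var1: float, var2: float, pi01: float, pi02: float) -> float:
    """Upper bound on |sigma_12| implied by |correlation| <= 1 among jointly non-null markers.

    bound = (1 - max(pi01, pi02)) * sigma_1 * sigma_2 / sqrt((1 - pi01) * (1 - pi02))
    """
    shared = 1.0 - max(pi01, pi02)          # largest possible proportion non-null in both
    sd_product = math.sqrt(var1 * var2)     # sigma_1 * sigma_2
    return shared * sd_product / math.sqrt((1.0 - pi01) * (1.0 - pi02))

# Bivariate simulation values: variances 0.3 and 0.45, null proportions 0.95 and 0.94.
bound = covariance_bound(0.3, 0.45, 0.95, 0.94)
is_feasible = abs(0.294) <= bound  # the simulated covariance of 0.294 respects the bound
```

When the two null proportions are equal, the bound reduces to the product of the standard deviations, which is the usual correlation constraint.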

We maximize the likelihood numerically by nesting the maximization for $\sigma_{12}$ within that for the other parameters: for each proposed value of $\sigma_1^2$, $\sigma_2^2$, $\pi_{01}$, and $\pi_{02}$, we perform a univariate maximization for $\sigma_{12}$ subject to the constraint imposed by the proposed values.

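To obtain analytic confidence intervals, we use profile likelihood.22 For a general scalar parameter $\theta$, its profile log-likelihood function is $\ell_P(\theta) = \ell(\theta, \hat{\vartheta}(\theta))$, where $\hat{\vartheta}(\theta)$ is the maximum likelihood estimate of the remaining parameters in the model given $\theta$. Because for a regular model $2\bigl( \ell(\hat{\theta}, \hat{\vartheta}(\hat{\theta})) - \ell(\theta, \hat{\vartheta}(\theta)) \bigr) \xrightarrow{D} \chi_1^2$, for the estimated value $\hat{\theta}$ we obtain a $(1 - \alpha)$ confidence interval as the set $\{ \theta : \ell_P(\theta) \ge \ell_P(\hat{\theta}) - (1/2) \chi_1^2(1 - \alpha) \}$, where $\chi_1^2(1 - \alpha)$ is the $1 - \alpha$ quantile point of the $\chi_1^2$ distribution. This procedure is used to obtain confidence intervals for each of $\sigma_1^2$, $\sigma_2^2$, $\sigma_{12}$, $\pi_{01}$, and $\pi_{02}$.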
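A profile likelihood interval of this kind can be sketched on a grid: evaluate the profile log-likelihood, subtract half the chi-square quantile from its maximum, and keep the parameter values above that cutoff. The example below is an assumption-laden illustration; it uses a toy quadratic profile for a single parameter rather than the AVENGEME likelihood, and it assumes the accepted set is one contiguous interval.

```python
from statistics import NormalDist
from typing import Callable, Sequence, Tuple

def chi2_1_quantile(level: float) -> float:
    """Quantile of the chi-square(1) distribution, as the square of a normal quantile."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return z * z

def profile_interval(profile_loglik: Callable[[float], float],
                     grid: Sequence[float], level: float = 0.95) -> Tuple[float, float]:
    """Grid points whose profile log-likelihood is within chi2/2 of the maximum."""
    values = [profile_loglik(t) for t in grid]
    cutoff = max(values) - 0.5 * chi2_1_quantile(level)
    inside = [t for t, v in zip(grid, values) if v >= cutoff]
    return min(inside), max(inside)  # assumes the accepted set is contiguous

# Toy profile (assumption): estimate 0.3 with standard error 0.05.
def toy_profile(theta: float) -> float:
    return -0.5 * ((theta - 0.3) / 0.05) ** 2

grid = [i / 1000 for i in range(1001)]
lower, upper = profile_interval(toy_profile, grid)
```

For this quadratic toy profile the interval agrees with the familiar estimate plus or minus 1.96 standard errors, up to the grid resolution.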

Often it is the genetic correlation rather than the covariance between two traits that is of interest. Because the unidirectional estimation does not identify $\sigma_2^2$, the correlation cannot be estimated unless a value is assumed for $\sigma_2^2$. In the bidirectional estimation, the correlation and its confidence interval can be obtained via previously derived formulas.23

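Association tests of polygenic scores can be calculated from summary data alone, as shown in the gtx package for R (see Web Resources). The regression of $Y_2$ on $\hat{S}_2$ has coefficient

$$\frac{\mathrm{cov}(Y_2, \hat{S}_2)}{\mathrm{var}(\hat{S}_2)} = \frac{\mathrm{cov}\bigl(Y_2, \sum_j \hat{\beta}_{1j} G_j\bigr)}{\mathrm{var}\bigl(\sum_j \hat{\beta}_{1j} G_j\bigr)} = \frac{\sum_j \hat{\beta}_{1j} \hat{\beta}_{2j} \mathrm{var}(G_j)}{\sum_j \hat{\beta}_{1j}^2 \mathrm{var}(G_j)} \approx \frac{\sum_j \hat{\beta}_{1j} \hat{\beta}_{2j} s_{2j}^{-2}}{\sum_j \hat{\beta}_{1j}^2 s_{2j}^{-2}}$$

where $s_{2j}^2$ is the sampling variance of $\hat{\beta}_{2j}$, assuming markers are uncorrelated. This is the inverse-variance weighted mean of $\hat{\beta}_{2j} / \hat{\beta}_{1j}$ and hence has sampling variance $1 / \sum_j \hat{\beta}_{1j}^2 s_{2j}^{-2}$. The Wald statistic,

$$\frac{\sum_j \hat{\beta}_{1j} \hat{\beta}_{2j} s_{2j}^{-2}}{\sqrt{\sum_j \hat{\beta}_{1j}^2 s_{2j}^{-2}}}, \qquad \text{(Equation 2)}$$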

is then calculated from summary effect sizes and standard errors for the individual markers. These data are frequently available from research consortia even when access to individual-level data is impractical.24,25
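The summary-statistic calculation can be sketched in a few lines: the regression coefficient is an inverse-variance weighted mean of per-marker ratios, and the Wald statistic divides it by its standard error. The function name and the tiny two-marker input below are illustrative assumptions; this is not the gtx implementation.

```python
import math
from typing import Sequence, Tuple

def score_test_from_summary(beta1: Sequence[float], beta2: Sequence[float],
                            se2: Sequence[float]) -> Tuple[float, float, float]:
    """Return (coefficient, standard error, Wald Z) for a score weighted by beta1.

    beta1: training-sample effect estimates (score weights)
    beta2: target-sample effect estimates for the same markers
    se2:   standard errors of beta2; markers are assumed uncorrelated
    """
    num = sum(b1 * b2 / (s * s) for b1, b2, s in zip(beta1, beta2, se2))
    den = sum(b1 * b1 / (s * s) for b1, s in zip(beta1, se2))
    coefficient = num / den
    std_error = 1.0 / math.sqrt(den)
    wald_z = num / math.sqrt(den)  # equals coefficient / std_error
    return coefficient, std_error, wald_z

# Made-up inputs for two markers, for illustration only.
coef, se, z = score_test_from_summary([0.1, -0.2], [0.05, -0.1], [0.05, 0.05])
```

Only per-marker effect sizes and standard errors enter the calculation, which is why no individual-level genotype data are needed.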

Our methods, named AVENGEME (Additive Variance Explained and Number of Genetic Effects Method of Estimation), are implemented in R software available from the authors.

Method Evaluation

To study the statistical and operating characteristics of AVENGEME, we simulated genome-wide marker data under various genetic models. We based our simulations on four complex diseases studied by Stahl et al.,13 allowing direct comparisons with their ABPA method, which is conceptually similar to ours. We also performed simulations based on three successively larger studies of schizophrenia.26 The study design parameters and the genetic models used for our simulations are given in Table 1.

Table 1.

Parameter Values for Studies of Four Diseases13 and Three Studies of Schizophrenia26

Parameter       RA        CD        MI        T2D       SCZ ISC   SCZ PGC1  SCZ PGC2
$n_1$           16,016    5,309     6,042     14,919    5,953     19,548    77,195
$n_2$           12,078    6,785     4,861     4,862     5,120     5,120     5,120
$m$             82,390    91,388    89,808    75,912    84,882    93,093    103,125
$\sigma_1^2$    0.18      0.44      0.48      0.49      0.30 (all three SCZ studies)
$\pi_{01}$      0.973     0.972     0.980     0.962     0.95 (all three SCZ studies)
$P_1$           0.248     0.394     0.491     0.416     0.423     0.477     0.425
$P_2$           0.126     0.273     0.396     0.396     0.515     0.515     0.515
$K_1$           0.01      0.01      0.06      0.08      0.01      0.01      0.01
$K_2$           0.01      0.01      0.06      0.08      0.01      0.01      0.01

Abbreviations are as follows: RA, rheumatoid arthritis; CD, celiac disease; MI, myocardial infarction; T2D, type II diabetes; SCZ, schizophrenia; ISC, International Schizophrenia Consortium; PGC, Psychiatric Genomics Consortium. Values of $\sigma_1^2$ and $\pi_{01}$ for RA, CD, MI, and T2D were estimated by Stahl et al.13 and subsequently used in our simulations. Those for SCZ are an approximation based on estimates from several studies and methods (Table 5).

For each genetic model, we simulated estimated effect sizes $\hat{\beta}_{1j}$ and $\hat{\beta}_{2j}$ independently for each marker, by drawing the true effects from the bivariate normal distribution in Equation 1 and adding independent sampling error to each effect. We then selected markers according to their p values in the training sample and used the summary statistic formula in Equation 2 to obtain tests of association for each polygenic score. We verified this approach for sample sizes up to 10K by explicitly simulating genotypes in case and control subjects as previously described.14 In brief, independent biallelic markers were defined with population minor allele frequencies uniformly distributed on (0.01, 0.5). Their effect sizes were drawn from the bivariate normal distribution such that the desired variances and covariances were attained. Allele frequencies were then derived for case and control subjects and genotypes simulated in each. Allelic odds ratios were then computed from the genotype counts. The results from the genotype simulations were indistinguishable from those from summary statistics, so we adopted the summary statistic method, which is much faster and easily scales up to very large sample sizes. Note that in our simulations, markers were assumed to be independent, i.e., in linkage equilibrium, as assumed by AVENGEME. We will later consider the effect of LD on our method.
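The summary-statistic simulation can be illustrated with a heavily simplified sketch. It makes several assumptions that are not taken from the paper: a quantitative trait, standardized markers, sampling variance of 1/n for each effect estimate, and identical true effects in the two samples (so that the covariance equals the variance). All function names are hypothetical.

```python
import math
import random

def simulate_summary_stats(m, n1, n2, var_g, pi0, rng):
    """Return a list of (beta1_hat, beta2_hat) pairs for m independent markers."""
    sd_effect = math.sqrt(var_g / (m * (1.0 - pi0)))  # effect SD among non-null markers
    stats = []
    for _ in range(m):
        # A marker is null with probability pi0; otherwise its true effect is normal.
        beta = 0.0 if rng.random() < pi0 else rng.gauss(0.0, sd_effect)
        # Assumption: sampling error variance 1/n for a standardized marker and trait.
        b1 = beta + rng.gauss(0.0, 1.0 / math.sqrt(n1))
        b2 = beta + rng.gauss(0.0, 1.0 / math.sqrt(n2))
        stats.append((b1, b2))
    return stats

def two_sided_p(beta_hat, n):
    """Two-sided normal p value for an estimate with standard error 1/sqrt(n)."""
    return math.erfc(abs(beta_hat) * math.sqrt(n) / math.sqrt(2.0))

def score_z(stats, n1, n2, p_lower, p_upper):
    """Z statistic of the score built from markers with training p value in (p_lower, p_upper]."""
    chosen = [(b1, b2) for b1, b2 in stats if p_lower < two_sided_p(b1, n1) <= p_upper]
    weight = float(n2)  # inverse sampling variance of beta2_hat under the assumption above
    num = sum(b1 * b2 for b1, b2 in chosen) * weight
    den = sum(b1 * b1 for b1, _ in chosen) * weight
    return num / math.sqrt(den) if den > 0 else 0.0

rng = random.Random(2015)  # fixed seed so the sketch is reproducible
stats = simulate_summary_stats(m=2000, n1=5000, n2=5000, var_g=0.3, pi0=0.95, rng=rng)
z_all = score_z(stats, 5000, 5000, 0.0, 1.0)
```

Because no genotypes are generated, the cost grows with the number of markers rather than with the number of subjects, which is the property that makes this style of simulation scale to large sample sizes.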

For the models in Table 1, we simulated 1,000 sets of polygenic score results and estimated the genetic model parameters via the unidirectional AVENGEME. This was done both when assuming $\sigma_{12} = \sigma_1^2$ (which reflects the assumption that the two samples have the same genetic model), in which case AVENGEME estimates the two free parameters $\sigma_1^2$ and $\pi_{01}$, and when allowing $\sigma_{12} \ne \sigma_1^2$, in which case AVENGEME estimates three free parameters. We evaluated the accuracy from the mean and SD of the parameter estimates and the coverage of the 95% confidence intervals.

We then considered different options for constructing polygenic scores, simulating under the design of the largest schizophrenia study (rightmost column of Table 1) (hereafter termed “SCZ simulation”). We fixed ten thresholds (Table S1, right half) and compared the use of disjoint to nested p value intervals with those thresholds, with the nested intervals each having a lower limit of 0. We compared weighted scores to unweighted scores in which all markers were given an equal weight in the direction of disease risk. We performed 1,000 simulations and evaluated bias, precision, and coverage as before.
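The difference between disjoint and nested selection intervals can be made concrete with a short sketch. The thresholds below are hypothetical (the ten thresholds actually used are in Table S1, which is not reproduced here), and treating each interval as open below and closed above is also an assumption.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]

def disjoint_intervals(thresholds: Sequence[float]) -> List[Interval]:
    """Consecutive intervals (0, t1], (t1, t2], ... that do not overlap."""
    ts = sorted(thresholds)
    lowers = [0.0] + ts[:-1]
    return list(zip(lowers, ts))

def nested_intervals(thresholds: Sequence[float]) -> List[Interval]:
    """Intervals (0, t1], (0, t2], ... that all share a lower limit of 0."""
    return [(0.0, t) for t in sorted(thresholds)]

def select_markers(p_values: Sequence[float], interval: Interval) -> List[int]:
    """Indices of markers whose training p value falls in (lower, upper]."""
    lower, upper = interval
    return [i for i, p in enumerate(p_values) if lower < p <= upper]

thresholds = [0.001, 0.05, 0.5, 1.0]      # hypothetical thresholds
p_values = [0.0004, 0.03, 0.2, 0.7, 0.04]  # made-up training p values
disjoint = disjoint_intervals(thresholds)
nested = nested_intervals(thresholds)
```

With disjoint intervals each marker contributes to exactly one score, whereas with nested intervals a marker selected at a strict threshold is also present in every wider interval, which increases the dependence between scores.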

We considered the effect of increasing the number of selection intervals and the sample size. Here we simulated different heritabilities in the two samples: $\sigma_1^2 = 0.3$, $\sigma_2^2 = 0.45$, $\sigma_{12} = 0.294$ (giving genetic correlation of 0.8), and different proportions of null markers: $\pi_{01} = 0.95$ and $\pi_{02} = 0.94$ (hereafter termed “bivariate simulation”). We compared the use of 3, 5, 10, 20, and 40 selection intervals in sample sizes of 10K, 20K, 40K, and 80K subjects with case sampling fractions $P_1 = 0.425$ and $P_2 = 0.515$ and disease prevalences $K_1 = K_2 = 0.01$. This reflected the SCZ PGC2 study design, although because that was a meta-analysis of case/control studies, the overall sampling fraction should be adjusted to reflect the different fractions in each study. We did not do this here but have found that such adjustments have very little effect on the estimated model.
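The stated genetic correlation follows from dividing the covariance by the product of the two genetic standard deviations. A one-function check, with an illustrative function name:

```python
import math

def genetic_correlation(cov12: float, var1: float, var2: float) -> float:
    """Genetic correlation: sigma_12 / (sigma_1 * sigma_2)."""
    return cov12 / math.sqrt(var1 * var2)

# Bivariate simulation values: covariance 0.294, variances 0.3 and 0.45.
rho = genetic_correlation(0.294, 0.3, 0.45)  # approximately 0.80
```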

We evaluated the bidirectional AVENGEME for the simultaneous estimation of all five parameters. We then returned to the SCZ simulation and applied bidirectional AVENGEME under the constraints $\sigma_1^2 = \sigma_2^2 = \sigma_{12}$ and $\pi_{01} = \pi_{02}$ to compare the precision of the bidirectional and unidirectional AVENGEME when estimating only two free parameters.

Finally, we compared AVENGEME to the genomic restricted maximum likelihood (GREML) solution of the linear mixed model, as implemented in the popular GCTA program.27 We performed the bivariate simulation with a total sample size of 10K. GREML was applied on the entire sample, whereas for AVENGEME it was split into training and testing samples each of 5K subjects. We also compared AVENGEME to the method of So et al.,15 which also uses summary statistics for estimation of $\sigma_1^2$ only, under the SCZ simulation for a total sample size of 10K.

Linkage Disequilibrium

The theory underlying AVENGEME assumes that markers are uncorrelated.14 This is approximately ensured in practice by pre-filtering markers with “LD-pruning” algorithms that select markers with limited pairwise correlation. Although this practice is common for many methods that estimate chip heritability, it might lead to under-estimation of the true chip heritability because the selected markers might not fully tag the causal variation. Conversely, in our approach, the residual LD among the pruned markers might lead to over-estimation of the explained variance and under-estimation of the proportion of null markers, because marker effects will be biased by LD with other markers.28

We therefore performed simulations on real genotype data to assess the effect of LD pruning. We combined genotype data from all seven case and both control samples in phase 1 of the Wellcome Trust Case-Control Consortium (WTCCC),29 giving genotypes for 384,845 markers on 15,769 subjects after basic quality control (Table S2). We allocated a chip heritability of $\sigma_1^2 = \sigma_2^2 = \sigma_{12} = 0.3$ among a random 5% of the markers ($\pi_{01} = \pi_{02} = 0.95$). We simulated a normally distributed quantitative trait under this model, split the sample into equally sized training and target samples, and estimated the model with AVENGEME on a reduced marker set. We considered both a “pruning” algorithm, which does not take association results into account (“indep-pairwise” option in PLINK,30 window size 100, step 10), and a “clumping” algorithm that greedily retains the most associated markers in the reduced set (“clump” option in PLINK with index and clumped p value thresholds of 1 and 100 marker radius). Both algorithms were applied with $r^2$ thresholds of 0.1 and 0.2, giving reduced sets of approximately 77,000 and 102,000 markers, respectively, on average. The simulation was repeated 1,000 times.

Results

Bias and Precision

We simulated data based on the estimates for additive genetic variance and proportion of null markers obtained by Stahl et al.13 for four common diseases (Table 1). We compared the performance of AVENGEME for these four models with the same p value intervals as those authors (Table S1). Results are shown in Table 2. For the estimation of two parameters only, assuming the same genetic model in the training and target samples, our method yielded nearly unbiased results for both $\sigma_1^2$ and $\pi_{01}$ with small variance, suggesting that it is expected to work very well in practice. However, the coverage was lower than 95%, suggesting that the analytic confidence intervals are too narrow. This might result from our assumption that the selection intervals make independent contributions to the likelihood. To confirm this, we directly simulated $\chi^2$ statistics from the analytic non-central distributions, independently for each selection interval, and repeated the estimation. The confidence intervals then indeed had appropriate coverage (Table S3), confirming that the assumption of independent contributions from each selection interval leads to confidence intervals that are too narrow. Nevertheless, this effect appears to be fairly small.

Table 2.

Application of AVENGEME to Simulated Data for Four Genetic Models Shown in Table 1

                              Estimation of $\sigma_1^2$, $\pi_{01}$      Estimation of $\sigma_1^2$, $\pi_{01}$, $\sigma_{12}$
                              RA      CD      MI      T2D                 RA      CD      MI      T2D
True $\sigma_1^2$             0.180   0.440   0.480   0.490               0.180   0.440   0.480   0.490
Mean $\hat{\sigma}_1^2$       0.180   0.438   0.486   0.483               0.270   0.467   0.522   0.581
SD $\hat{\sigma}_1^2$         0.019   0.035   0.050   0.034               0.312   0.325   0.335   0.332
Coverage                      0.95    0.89    0.91    0.93                0.97    0.95    0.99    0.99

True $\pi_{01}$               0.973   0.972   0.980   0.962               0.958   0.972   0.979   0.961
Mean $\hat{\pi}_{01}$         0.972   0.972   0.979   0.961               0.968   0.972   0.979   0.957
SD $\hat{\pi}_{01}$           0.0054  0.0046  0.0040  0.0052              0.028   0.016   0.011   0.018
Coverage                      0.94    0.85    0.88    0.90                0.98    0.95    0.98    0.98

True $\sigma_{12}$            -       -       -       -                   0.180   0.440   0.480   0.490
Mean $\hat{\sigma}_{12}$      -       -       -       -                   0.190   0.442   0.491   0.509
SD $\hat{\sigma}_{12}$        -       -       -       -                   0.034   0.048   0.061   0.072
Coverage                      -       -       -       -                   0.98    0.93    0.94    0.993

Mean and standard deviation of parameter estimates and coverage of 95% confidence interval are shown over 1,000 simulations. Monte Carlo error for the mean is SD/√1000 and for coverage of 0.95 is 0.007.
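The Monte Carlo errors quoted in the table note follow from standard formulas: the standard error of a mean over independent simulations, and the binomial standard error of a coverage proportion. A minimal sketch, with illustrative function names:

```python
import math

def mean_mc_error(sd: float, n_sim: int) -> float:
    """Monte Carlo standard error of a mean estimate: SD / sqrt(number of simulations)."""
    return sd / math.sqrt(n_sim)

def coverage_mc_error(coverage: float, n_sim: int) -> float:
    """Binomial standard error of an estimated coverage proportion."""
    return math.sqrt(coverage * (1.0 - coverage) / n_sim)

cov_error = coverage_mc_error(0.95, 1000)   # rounds to 0.007
mean_error = mean_mc_error(0.019, 1000)     # for the RA estimate of the variance
```

Observed coverages more than about two such standard errors below 0.95, that is, below roughly 0.936, are therefore unlikely to reflect simulation noise alone.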

In the estimation of three parameters, the estimate of $\sigma_1^2$ had some upward bias and much larger variance; $\pi_{01}$ had greater variance compared to the two-parameter estimation, but coverage close to 95%. Inspection of individual simulations revealed that the estimated $\sigma_1^2$ is often close to 0 or to 1, pulling the mean estimate toward 0.5. Generally, this suggests that the variability is too large to allow reliable estimation of $\sigma_1^2$ when estimating $\pi_{01}$ and $\sigma_{12}$ as well, at least at these sample sizes. The estimates of $\sigma_{12}$, however, were nearly unbiased with small variance, suggesting that our method is reliable for estimating the genetic covariance when it is not assumed to equal the variance. Coverage was slightly less accurate in the estimation of three parameters, but generally close to the nominal level.

We conclude that for the estimation of $\sigma_1^2$ and $\pi_{01}$, it is preferable for the training and target samples to be from the same trait population and to apply AVENGEME under the constraint $\sigma_{12} = \sigma_1^2$, whereas if the interest lies in the estimation of the genetic covariance between traits, then the unconstrained version of AVENGEME is more appropriate.

Nested Intervals and Unweighted Scores

We wondered whether the sample sizes could be a reason for the poorer performance of the three-parameter estimation; in addition, we considered the effect of the score weighting versus an unweighted score and whether the p value selection intervals were disjoint or nested. We therefore simulated under a scenario with parameters derived from a large meta-analysis of schizophrenia26 (Methods; Table 1 rightmost column). The results are shown in Table 3. For the two-parameter estimation, disjoint intervals had the least bias and most accurate coverage, although their variance was slightly greater than for nested intervals. The reduced coverage of the confidence intervals for nested intervals can be ascribed to the dependence between intervals, which is greater for nested intervals. The bias is possibly due to the imbalance in the sample size between training and test set (reversing the direction of estimation led to a reduction in bias, for example for disjoint intervals, weighted score, mean $\hat{\sigma}_1^2 = 0.291$). Similar patterns were observed when estimating three parameters, with the disjoint intervals generally showing less bias and more accurate coverage than the nested intervals, but with slightly increased variance. The choice of weights seems to be generally neutral, although a slight increase in variance was observed for unweighted scores. Taken together, these results suggest that the weighted score with disjoint selection intervals is the most reliable and accurate approach for use with AVENGEME.

Table 3.

Comparison of AVENGEME Performance with Weighted and Unweighted Score with Nested or Disjoint Intervals

                              Estimation of $\sigma_1^2$, $\pi_{01}$      Estimation of $\sigma_1^2$, $\pi_{01}$, $\sigma_{12}$
                              Disjoint        Nested                      Disjoint        Nested
                              W       U       W       U                   W       U       W       U
Mean $\hat{\sigma}_1^2$       0.274   0.274   0.254   0.258               0.299   0.298   0.422   0.471
SD $\hat{\sigma}_1^2$         0.011   0.011   0.008   0.009               0.105   0.106   0.045   0.081
Coverage                      0.36    0.37    0       0                   0.94    0.93    0.01    0.20

Mean $\hat{\pi}_{01}$         0.950   0.950   0.951   0.950               0.946   0.946   0.941   0.933
SD $\hat{\pi}_{01}$           0.004   0.004   0.003   0.003               0.016   0.017   0.006   0.008
Coverage                      0.93    0.93    0.80    0.78                0.95    0.94    0.37    0.14

Mean $\hat{\sigma}_{12}$      -       -       -       -                   0.281   0.280   0.289   0.309
SD $\hat{\sigma}_{12}$        -       -       -       -                   0.042   0.043   0.013   0.021
Coverage                      -       -       -       -                   0.91    0.91    0.69    0.85

The SCZ simulation model with $\sigma_1^2 = \sigma_{12} = 0.3$, $\pi_{01} = 0.95$ was used (see main text for full details). Mean and standard deviation of parameter estimates and coverage of 95% confidence interval are shown over 1,000 simulations. Monte Carlo error for the mean is SD/√1000 and for coverage of 0.95 is 0.007. Abbreviations are as follows: W, weighted; U, unweighted.

Sample Size and Number of Selection Intervals

We then performed bivariate simulations (see Methods) to consider the effect of varying the sample size and the number of selection intervals. In Tables S4–S7, we show the performance of AVENGEME in each direction. The results confirm the poor ability to estimate $\sigma_1^2$ or $\sigma_2^2$, with mean values mostly around 0.5 and high variance reflecting the frequent estimates of 0 or 1. This applies across all numbers of selection intervals, but there is a reduction in variance as the number of intervals increases and a substantial reduction in bias and variance as the sample size increases from 10K to 80K, whereas more bias persists for the lower genetic variance (mean $\hat{\sigma}_1^2 = 0.362$ and $\hat{\sigma}_2^2 = 0.444$ with 40 intervals and 80K total sample size). A similar pattern was observed for $\pi_{01}$ and $\pi_{02}$, although there was much less bias in general.

For the covariance $\sigma_{12}$, the estimation again worked well, being nearly unbiased and with low variance regardless of sample size and number of selection intervals. We again observed a general trend of improved bias and precision with more selection intervals and greater sample size.

Bidirectional Estimation

We applied the bidirectional method to the same bivariate simulation data for total sample size of 80K. The results (Table S8) showed consistently lower variance for each parameter compared to the unidirectional estimators, but with a similar level of bias resulting in lower coverage of the confidence intervals. The information gain from analyzing the bidirectional data together is offset to some degree by the increased number of parameters in the model. Furthermore, this analysis was considerably more time consuming than the unidirectional analyses.

Similarly, when applying the bidirectional estimation to data simulated under the SCZ model (Table 1, rightmost column) and constraining $\sigma_1^2 = \sigma_2^2 = \sigma_{12}$ and $\pi_{01} = \pi_{02}$ in the estimation, we obtained lower bias for $\sigma_1^2$ (mean $\hat{\sigma}_1^2 = 0.286$, $\hat{\pi}_{01} = 0.95$), similar variance (SD($\hat{\sigma}_1^2$) = 0.011, SD($\hat{\pi}_{01}$) = 0.004), and greater coverage for $\sigma_1^2$ and less for $\pi_{01}$ (coverage 0.498 for $\sigma_1^2$ and 0.760 for $\pi_{01}$) compared to the unidirectional analyses (first column of Table 3), although the differences were very small.

We performed a sensitivity analysis to compare the performance of the bidirectional estimation with different initial parameter values for the numerical optimization and the results were virtually unchanged, with just a slight change in bias, variance, and coverage. A similar analysis conducted for the complex diseases in Table 1 also revealed that the estimate of covariance was robust to the choice of initial parameter values.

Linkage Disequilibrium

We simulated a normally distributed trait on 15,769 subjects in the WTCCC (see Methods). Using reduced marker sets with pairwise $r^2$ constrained to <0.1 and <0.2, we estimated $\sigma_1^2 = \sigma_2^2 = \sigma_{12}$ and $\pi_{01} = \pi_{02}$ when (1) the markers were pruned without regard to their association and (2) the markers were clumped by greedily retaining the most strongly associated markers. Table 4 shows that for $r^2 < 0.1$, AVENGEME is unbiased in estimating $\sigma_1^2$ when clumping is used but has a small downward bias in $\pi_{01}$. Pruning, however, incurs a strong downward bias in both $\sigma_1^2$ and $\pi_{01}$. For $r^2 < 0.2$, clumping over-estimates $\sigma_1^2$ and under-estimates $\pi_{01}$ owing to the residual LD. Pruning reduces, but does not eliminate, these biases. These results suggest in practice using a clumping algorithm with pairwise $r^2 < 0.1$ as the least-biased approach with AVENGEME.

Table 4.

Application of AVENGEME to Normally Distributed Traits Simulated on Real Genotypes

                Pruned            Clumped           Independent
r²              0.1      0.2      0.1      0.2      0.1      0.2
Mean σ̂₁²        0.173    0.281    0.297    0.389    0.297    0.300
SD σ̂₁²          0.041    0.053    0.042    0.05     0.039    0.046
Mean π̂₀₁        0.559    0.579    0.900    0.879    0.949    0.931
SD π̂₀₁          0.428    0.400    0.066    0.080    0.02     0.096

Terms are as follows: pruned, markers are randomly retained in the reduced set; clumped, most strongly associated markers are greedily retained in the reduced set; r², threshold on residual pairwise LD within the reduced set; independent, results for simulated markers with no LD between any pair. True σ₁² = 0.3, π₀₁ = 0.95.

Comparison with Related Methods

We analyzed our bivariate simulations for total sample size 10K using the bivariate GREML implemented in GCTA.27 The mean σˆ12 was 0.265 with standard deviation 0.032, which compared to the results in Table S4 shows that in this case the GREML estimate has greater bias but less variance than AVENGEME.

We also applied the method by So et al.15 to the SCZ simulation (Table 1, rightmost column). Although their method appeared unbiased in the simulation they performed, in which π₀₁ = π₀₂ = 0.995, in our setting it yielded seriously biased results for σ₁², with a mean estimate of 0.189 compared to the true value of 0.3.

Having established the good operating characteristics of AVENGEME, we applied our method to some published association results for polygenic scores. For the four diseases from Stahl et al.,13 our estimates were systematically lower than the ones obtained by their ABPA method (Table 5), and for σ₁² our confidence intervals excluded their estimates. These results were surprising because the two methods are conceptually similar, and our simulations had shown that under the models inferred by ABPA, AVENGEME achieved nearly unbiased estimation. LD is unlikely to affect these results because the markers were clumped to r² < 0.1. We speculate that the differences might arise from ABPA’s use of prior distributions, and we return to this point in the Discussion. Compared to results from GREML, our estimates for σ₁² were lower, with non-overlapping confidence intervals, for rheumatoid arthritis and type 2 diabetes, whereas the results were similar for celiac disease and myocardial infarction.

Table 5.

Genetic Model Parameters Estimated by AVENGEME, ABPA,13,31 and GREML13,32

                 RA                 CD                 MI                 T2D                SCZ ISC            SCZ PGC1           SCZ PGC2
AVENGEME σ̂₁²     .13 (.09–.17)      .28 (.21–.35)      .34 (.24–.45)      .30 (.23–.37)      .31 (.28–.34)      .31 (.29–.33)      .24 (.24–.25)
ABPA σ̂₁²         .18 (.11–.25)      .44 (.34–.54)      .48 (.32–.64)      .49 (.39–.59)      -                  .50 (.45–.54)a     -
GREML σ̂₁²        .32 (.25–.39)      .33 (.25–.41)      .41 (.28–.54)      .51 (.38–.64)      .33 (.27–.39)      .23 (.21–.25)      -
AVENGEME π̂₀₁     .946 (.887–.975)   .969 (.950–.982)   .965 (.933–.982)   .954 (.929–.971)   .953 (.940–.963)   .867 (.841–.887)   .852 (.835–.867)
ABPA π̂₀₁         .973 (.953–.993)   .972 (.954–.990)   .980 (.965–.995)   .962 (.941–.983)   -                  .936 (.922–.952)a  -

95% confidence intervals given in parentheses, those for ABPA converted from the reported 50% credible intervals by assuming normally distributed posteriors and those for GREML from the reported standard error by assuming normally distributed estimators.

a Includes an additional Swedish case/control study.
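The interval conversions described in the note to Table 5 can be sketched as follows. This is an illustrative reconstruction under the stated normality assumptions, not the code used to produce the table; the function names and the example inputs are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def ci95_from_se(estimate, se):
    """95% confidence interval from a standard error (normal estimator)."""
    z = Z.inv_cdf(0.975)  # about 1.96
    return estimate - z * se, estimate + z * se

def ci95_from_ci50(lower50, upper50):
    """Rescale a central 50% interval to a 95% interval, assuming a normal
    posterior centred at the midpoint of the 50% interval."""
    centre = (lower50 + upper50) / 2
    # The half-width of a central 50% normal interval is about 0.674 sd.
    sd = (upper50 - lower50) / 2 / Z.inv_cdf(0.75)
    return ci95_from_se(centre, sd)

# Hypothetical inputs, for illustration only.
se_interval = ci95_from_se(0.30, 0.05)
rescaled_interval = ci95_from_ci50(0.40, 0.48)
print(se_interval)
print(rescaled_interval)
```

Under these assumptions a 95% interval is roughly 2.9 times as wide as the corresponding 50% interval, which is the ratio of the two normal quantiles used above.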

We applied AVENGEME to three waves of SCZ meta-analyses (Table 5). The genetic variance σ₁² was similar in the ISC and PGC1 data, but decreased in the PGC2 data. The proportion of null markers decreased in PGC1 and PGC2 compared to ISC. This might reflect increased heterogeneity: as more studies contribute to the meta-analyses, increased genetic heterogeneity could decrease the proportion of null markers, whereas increased environmental heterogeneity could decrease the genetic variance, which on the liability scale is expressed relative to the total variance. GREML has been applied to the ISC and PGC1 data;32 for the former, the estimate is similar to ours, whereas it is significantly lower in the latter. ABPA has been applied to an expanded PGC1 analysis,31 yielding a significantly higher estimate of σ₁² than ours.

We finally applied AVENGEME to estimate genetic covariance between psychiatric traits by using published summary data.33 These data included five pairs from four disorders: schizophrenia, bipolar disorder, major depressive disorder, and autistic spectrum disorder (other combinations, for which only two selection intervals were reported, were excluded because our method requires at least three). The method of Dudbridge14 has previously been shown to agree well with GREML for these data,34 but in estimating the genetic covariance it assumes that σ₁² and π₀₁ are known exactly. Here we estimated all three parameters simultaneously. The results are presented in Table 6 and show that the estimates from AVENGEME are of similar magnitude to those from GREML but are consistently larger and have narrower confidence intervals. This difference might arise from LD, because here the markers were clumped to r² < 0.25, which according to Table 4 might create an upward bias in AVENGEME.

Table 6.

Genetic Covariance Estimates for Five Pairs of Four Psychiatric Traits

Trait pair   AVENGEME σ̂₁₂           GREML σ̂₁₂
BPD-SCZ      0.199 (0.186–0.209)    0.151 (0.131–0.171)
MDD-BPD      0.134 (0.120–0.148)    0.102 (0.077–0.127)
SCZ-MDD      0.165 (0.153–0.177)    0.087 (0.065–0.110)
SCZ-ASD      0.050 (0.038–0.059)    0.03 (0.008–0.052)
ASD-BPD      0.042 (0.030–0.055)    0.008 (−0.017–0.033)

Abbreviations are as follows: BPD, bipolar disorder; SCZ, schizophrenia; MDD, major depressive disorder; ASD, autistic spectrum disorder. AVENGEME estimates are from bidirectional analysis. GREML confidence intervals derived from published standard errors35 assuming normally distributed estimators.

Discussion

The method we have proposed allows simultaneous estimation of the additive variance explained by a set of genetic markers, the proportion of markers affecting the trait of interest, and the genetic covariance between two traits. It does so by solving analytic expressions to obtain maximum likelihood estimates and profile likelihood confidence intervals and is consequently very fast. Furthermore, the polygenic score tests required by our method can be rapidly calculated from summary statistics for individual markers, allowing application to very large datasets and results from published literature. Our simulations show that our method enjoys good bias and coverage properties in spite of its assumption that the tests from different selection intervals are independent. Although we presented results only for case/control designs here, they represent the most challenging scenarios for polygenic modeling and we have observed results of comparable or greater accuracy for quantitative traits (data not shown).

AVENGEME has a number of advantages compared to currently available methods. In comparison with GREML it can deal with very large sample sizes and obtain estimates much more rapidly, and it additionally estimates the proportion of null markers. Compared to ABPA, it requires neither Monte Carlo sampling nor simulation of genome-wide data and is therefore much faster; AVENGEME also extends to estimate the covariance between related traits. Compared to the method of So et al.15 and other empirical Bayes methods, it appears to be less biased and can simultaneously estimate up to five model parameters. Compared to the LD-scoring approach, it can estimate the proportion of null markers and does not require calculation of LD between pairs of markers.

One limitation of our approach is the need for two independent datasets, which are often not available when common controls are used; in contrast, GREML can estimate a bivariate model from a single sample and LD scoring is robust to overlapping samples. We assume that population structure has been entirely adjusted for in the target sample, and might over-estimate chip heritability if this is not the case, whereas GREML and LD scoring adjust for structure explicitly in their calculations. Our method also assumes that markers are uncorrelated. In practice this is approximately ensured by an LD-pruning step that is also commonly conducted for other methods. We have shown that if the residual LD between pruned markers is not too high, say r² < 0.1, then AVENGEME retains its unbiased properties if a “clumping” algorithm is used, but can otherwise overestimate the genetic variance. In contrast, LD scoring explicitly uses LD to estimate the variance explained. The similarity of estimates obtained by that approach to those of ours and other current methods suggests that this problem is currently not too severe, but as marker densities increase toward whole-genome coverage, it will become more important to include all markers and account for LD. Our methods can be extended to allow correlation between markers, and this will be pursued in a subsequent paper.

A further limitation is that, unless very large sample sizes are used, estimation of the chip heritability in the training sample is unstable if it is jointly estimated with the covariance with the testing sample. Therefore, if the variance is of particular interest, we recommend analyzing the same trait in both samples, either by splitting a single sample in two or by drawing two samples from the same trait population. Then good performance in estimating the variance can be achieved by constraining it to equal the covariance.

The unidirectional estimation provides good estimates in all situations we considered. The bidirectional estimation can also be applied, providing a less variable estimate than the unidirectional estimators, with a similar degree of bias. However, the bidirectional analysis is more time consuming than the unidirectional, and because its reduction in variance is rather small, we do not find a compelling reason to prefer it to the unidirectional.

We recommend using disjoint selection intervals, whereas the influence of the weighting seemed limited in the situations we considered. However, the use of nested intervals still provides good estimates if the number of intervals is sufficiently large (say ten) and appears to work well for the covariance across sample sizes, number, and type of intervals. Nested intervals seem more appealing for obtaining significant tests of association between polygenic scores and a trait of interest, and to date have been reported more often than disjoint intervals. However, for the estimation of the underlying genetic model, we suggest that results for disjoint intervals should also be made available. The current fashion for using around ten intervals appears to be sufficient for obtaining accurate estimates; although precision increases as more intervals are used, the gains diminish rapidly beyond that number.

Our method was generally found to produce under-coverage of confidence intervals. This is due both to some bias in the estimation, though this was generally small, and to the assumption of independent tests from each interval. We have observed that our profile likelihood intervals closely match the empirical distribution of parameter estimates in our simulations. The under-coverage is therefore more likely to arise from the slight bias in our estimator than from the calculation of its variance. Our experience is that, in this application, an approximately valid confidence interval is generally sufficient for practitioners.
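To see how a modest bias alone can produce under-coverage, consider a simplified setting, assumed here purely for illustration and not equivalent to the profile likelihood intervals used by AVENGEME: an estimator that is normally distributed with bias b and standard deviation s, and a nominal 95% interval of the form estimate ± 1.96 s. Such an interval covers the true value with probability Φ(1.96 − b/s) − Φ(−1.96 − b/s). The function name and the example values below are hypothetical.

```python
from statistics import NormalDist

def coverage_with_bias(bias, sd, level=0.95):
    """Coverage of the interval estimate +/- z*sd when the estimator is
    normally distributed with the given bias and standard deviation."""
    phi = NormalDist()
    z = phi.inv_cdf(0.5 + level / 2)
    shift = bias / sd
    return phi.cdf(z - shift) - phi.cdf(-z - shift)

# Coverage for increasing bias, expressed in units of the standard deviation.
for ratio in (0.0, 0.5, 1.0, 2.0):
    cov = coverage_with_bias(ratio, 1.0)
    print(f"bias/sd = {ratio:.1f}: coverage = {cov:.3f}")
```

With no bias the coverage equals the nominal level, and it falls as the bias grows relative to the standard deviation. This is consistent with the bidirectional results, in which lower variance combined with a similar level of bias gave lower coverage.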

AVENGEME requires numerical optimization to estimate parameters, and this can be sensitive to the algorithm used and the initial estimates provided. We have used the default settings of the optim() function in R (Nelder-Mead non-linear optimization) and in the simulations provided the true parameter values as the initial estimates. This was to obtain, as far as possible, the ideal results from truly maximizing the likelihood. We found that slight variations can result from different starting values (our default values are 0.5 for all parameters), but the conclusions remain the same. In practice we suggest using a range of plausible starting values to identify the solution with the maximum likelihood.
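The suggestion to try a range of starting values can be sketched as a multi-start wrapper around a local optimizer. The example below is a toy: the objective is a hypothetical function with two local minima standing in for a negative log-likelihood, and the local optimizer is a simple compass (pattern) search written to keep the example self-contained. It is neither the Nelder-Mead algorithm of R's optim() nor the AVENGEME likelihood.

```python
def neg_log_lik(theta):
    """Hypothetical objective with a local and a global minimum in x."""
    x, y = theta
    return (x * x - 1.0) ** 2 + 0.3 * x + (y - 0.5) ** 2

def compass_search(f, start, step=0.25, tol=1e-8, max_iter=10_000):
    """Derivative-free local minimizer: try +/- step along each coordinate,
    move to any improvement, and halve the step when none is found."""
    best = list(start)
    best_val = f(best)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                val = f(trial)
                if val < best_val:
                    best, best_val, improved = trial, val, True
        if not improved:
            step /= 2.0
    return best, best_val

def multi_start(f, starts):
    """Run the local search from each start and keep the best solution."""
    fits = [compass_search(f, s) for s in starts]
    return min(fits, key=lambda fit: fit[1])

# Hypothetical starting values spanning the plausible parameter range.
starts = [(0.5, 0.5), (-0.5, 0.5), (0.9, 0.1), (-0.9, 0.9)]
single = compass_search(neg_log_lik, starts[0])
best = multi_start(neg_log_lik, starts)
print("from first start only:", single)
print("best over all starts: ", best)
```

Started from (0.5, 0.5) alone, the search settles in the inferior local minimum; comparing the objective across all starts identifies the better solution.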

AVENGEME is conceptually similar to ABPA,13 both methods seeking the genetic model that best fits the observed results of polygenic score tests using multiple selection intervals. The main difference is that AVENGEME uses analytic formulae to construct an explicit likelihood, whereas ABPA uses approximate Bayesian computation with Monte Carlo sampling. In the application to the complex diseases in Table 1, we obtained lower estimates for all parameters and the reason for this might be the effect of the prior distributions used by ABPA. Their prior for π₀₁ is uniform on the log scale and therefore heavily favors values of π₀₁ close to 1. On the other hand, their prior for σ₁² is beta distributed on a relative scale and does not have a natural correspondence to maximum likelihood. Furthermore, if the true distribution of effects departs from the assumed model (for example, as a mixture of normal distributions8,9), then the two methods might diverge further. Our approach might benefit from imposing prior distributions on the parameters and performing Bayesian estimation, particularly for improving the precision of estimating the variance σ₁² jointly with the covariance σ₁₂. This is a promising subject for future work.

Our approach provides a fast and accurate method for estimating the genetic model parameters underlying large-scale association studies. It is particularly applicable to summary statistics for individual markers, often made freely available online by research consortia. Therefore, it will greatly facilitate the estimation of genetic covariance, especially between traits that have been studied by different consortia and for which combined analysis of individual-level data is logistically challenging. The rapid estimation of genetic models at arbitrarily large sample sizes suggests that our approach will prove useful as the sizes of consortium and biobank studies begin to approach millions of subjects.

Acknowledgments

We thank Eli Stahl, Stephan Ripke, and Dominika Sieradzka for discussions. This work was funded by the MRC (K006215).

Published: July 16, 2015

Footnotes

Supplemental Data include eight tables and can be found with this article online at http://dx.doi.org/10.1016/j.ajhg.2015.06.005.

Web Resources

The URLs for data presented herein are as follows:

Supplemental Data

Document S1. Tables S1–S8
mmc1.pdf (648.4KB, pdf)
Document S2. Article plus Supplemental Data
mmc2.pdf (881KB, pdf)

References

1. Robinson M.R., Wray N.R., Visscher P.M. Explaining additional genetic variation in complex traits. Trends Genet. 2014;30:124–132. doi: 10.1016/j.tig.2014.02.003.
2. Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456:18–21. doi: 10.1038/456018a.
3. Gamazon E.R., Cox N.J., Davis L.K. Structural architecture of SNP effects on complex traits. Am. J. Hum. Genet. 2014;95:477–489. doi: 10.1016/j.ajhg.2014.09.009.
4. Zuk O., Schaffner S.F., Samocha K., Do R., Hechter E., Kathiresan S., Daly M.J., Neale B.M., Sunyaev S.R., Lander E.S. Searching for missing heritability: designing rare variant association studies. Proc. Natl. Acad. Sci. USA. 2014;111:E455–E464. doi: 10.1073/pnas.1322563111.
5. Furrow R.E., Christiansen F.B., Feldman M.W. Environment-sensitive epigenetics and the heritability of complex diseases. Genetics. 2011;189:1377–1387. doi: 10.1534/genetics.111.131912.
6. Zuk O., Hechter E., Sunyaev S.R., Lander E.S. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc. Natl. Acad. Sci. USA. 2012;109:1193–1198. doi: 10.1073/pnas.1119675109.
7. Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.A., Heath A.C., Martin N.G., Montgomery G.W. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608.
8. Zhou X., Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 2012;44:821–824. doi: 10.1038/ng.2310.
9. Speed D., Balding D.J. MultiBLUP: improved SNP-based prediction for complex traits. Genome Res. 2014;24:1550–1557. doi: 10.1101/gr.169375.113.
10. Golan D., Lander E.S., Rosset S. Measuring missing heritability: inferring the contribution of common variants. Proc. Natl. Acad. Sci. USA. 2014;111:E5272–E5281. doi: 10.1073/pnas.1419064111.
11. Maier R., Moser G., Chen G.B., Ripke S., Coryell W., Potash J.B., Scheftner W.A., Shi J., Weissman M.M., Hultman C.M., Cross-Disorder Working Group of the Psychiatric Genomics Consortium. Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am. J. Hum. Genet. 2015;96:283–294. doi: 10.1016/j.ajhg.2014.12.006.
12. Purcell S.M., Wray N.R., Stone J.L., Visscher P.M., O’Donovan M.C., Sullivan P.F., Sklar P., International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460:748–752. doi: 10.1038/nature08185.
13. Stahl E.A., Wegmann D., Trynka G., Gutierrez-Achury J., Do R., Voight B.F., Kraft P., Chen R., Kallberg H.J., Kurreeman F.A., Diabetes Genetics Replication and Meta-analysis Consortium. Myocardial Infarction Genetics Consortium. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat. Genet. 2012;44:483–489. doi: 10.1038/ng.2232.
14. Dudbridge F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 2013;9:e1003348. doi: 10.1371/journal.pgen.1003348.
15. So H.C., Li M., Sham P.C. Uncovering the total heritability explained by all true susceptibility variants in a genome-wide association study. Genet. Epidemiol. 2011;35:447–456. doi: 10.1002/gepi.20593.
16. Bulik-Sullivan B., Finucane H., Anttila V., Gusev A., Day F.R., ReproGen Consortium, Psychiatric Genomics Consortium, Genetic Consortium for Anorexia Nervosa. Perry J.R.B., Patterson N. An atlas of genetic correlations across human diseases and traits. bioRxiv. 2015. doi: 10.1038/ng.3406.
17. Orr H.A. The genetic theory of adaptation: a brief history. Nat. Rev. Genet. 2005;6:119–127. doi: 10.1038/nrg1523.
18. Pritchard J.K., Di Rienzo A. Adaptation - not by sweeps alone. Nat. Rev. Genet. 2010;11:665–667. doi: 10.1038/nrg2880.
19. Pritchard J.K., Pickrell J.K., Coop G. The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Curr. Biol. 2010;20:R208–R215. doi: 10.1016/j.cub.2009.11.055.
20. Lynch M., Walsh B. Sinauer Associates; 1998. Genetics and Analysis of Quantitative Traits.
21. Efron B., Tibshirani R., Storey J.D., Tusher V. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 2001;96:1151–1160.
22. Davison A.R. Cambridge University Press; Cambridge: 2003. Statistical Models.
23. Visscher P.M., Hemani G., Vinkhuyzen A.A., Chen G.B., Lee S.H., Wray N.R., Goddard M.E., Yang J. Statistical power to detect genetic (co)variance of complex traits using SNP data in unrelated samples. PLoS Genet. 2014;10:e1004269. doi: 10.1371/journal.pgen.1004269.
24. Ehret G.B., Munroe P.B., Rice K.M., Bochud M., Johnson A.D., Chasman D.I., Smith A.V., Tobin M.D., Verwoert G.C., Hwang S.J., International Consortium for Blood Pressure Genome-Wide Association Studies. CARDIoGRAM consortium. CKDGen Consortium. KidneyGen Consortium. EchoGen consortium. CHARGE-HF consortium. Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature. 2011;478:103–109. doi: 10.1038/nature10405.
25. Dastani Z., Hivert M.F., Timpson N., Perry J.R., Yuan X., Scott R.A., Henneman P., Heid I.M., Kizer J.R., Lyytikäinen L.P., DIAGRAM+ Consortium. MAGIC Consortium. GLGC Investigators. MuTHER Consortium. DIAGRAM Consortium. GIANT Consortium. Global B Pgen Consortium. Procardis Consortium. MAGIC investigators. GLGC Consortium. Novel loci for adiponectin levels and their influence on type 2 diabetes and metabolic traits: a multi-ethnic meta-analysis of 45,891 individuals. PLoS Genet. 2012;8:e1002607. doi: 10.1371/journal.pgen.1002607.
26. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511:421–427. doi: 10.1038/nature13595.
27. Lee S.H., Yang J., Goddard M.E., Visscher P.M., Wray N.R. Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics. 2012;28:2540–2542. doi: 10.1093/bioinformatics/bts474.
28. Bulik-Sullivan B.K., Loh P.R., Finucane H.K., Ripke S., Yang J., Patterson N., Daly M.J., Price A.L., Neale B.M., Schizophrenia Working Group of the Psychiatric Genomics Consortium. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211.
29. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
30. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., Sham P.C. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
31. Ripke S., O’Dushlaine C., Chambert K., Moran J.L., Kähler A.K., Akterin S., Bergen S.E., Collins A.L., Crowley J.J., Fromer M., Multicenter Genetic Studies of Schizophrenia Consortium. Psychosis Endophenotypes International Consortium. Wellcome Trust Case Control Consortium 2. Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nat. Genet. 2013;45:1150–1159. doi: 10.1038/ng.2742.
32. Lee S.H., DeCandia T.R., Ripke S., Yang J., Sullivan P.F., Goddard M.E., Keller M.C., Visscher P.M., Wray N.R., Schizophrenia Psychiatric Genome-Wide Association Study Consortium (PGC-SCZ). International Schizophrenia Consortium (ISC). Molecular Genetics of Schizophrenia Collaboration (MGS). Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nat. Genet. 2012;44:247–250. doi: 10.1038/ng.1108.
33. Cross-Disorder Group of the Psychiatric Genomics Consortium. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet. 2013;381:1371–1379. doi: 10.1016/S0140-6736(12)62129-1.
34. Wray N.R., Lee S.H., Mehta D., Vinkhuyzen A.A., Dudbridge F., Middeldorp C.M. Research review: Polygenic methods and their application to psychiatric traits. J. Child Psychol. Psychiatry. 2014;55:1068–1087. doi: 10.1111/jcpp.12295.
35. Lee S.H., Ripke S., Neale B.M., Faraone S.V., Purcell S.M., Perlis R.H., Mowry B.J., Thapar A., Goddard M.E., Witte J.S., Cross-Disorder Group of the Psychiatric Genomics Consortium. International Inflammatory Bowel Disease Genetics Consortium (IIBDGC). Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nat. Genet. 2013;45:984–994. doi: 10.1038/ng.2711.
