PLOS ONE. 2008 Oct 14;3(10):e3395. doi: 10.1371/journal.pone.0003395

Accuracy of Predicting the Genetic Risk of Disease Using a Genome-Wide Approach

Hans D Daetwyler 1,2,*, Beatriz Villanueva 3, John A Woolliams 1
Editor: Michael Nicholas Weedon 4
PMCID: PMC2561058  PMID: 18852893

Abstract

Background

The prediction of the genetic disease risk of an individual is a powerful public health tool. While predicting risk has been successful in diseases which follow simple Mendelian inheritance, it has proven challenging in complex diseases for which a large number of loci contribute to the genetic variance. The large numbers of single nucleotide polymorphisms now available provide new opportunities for predicting genetic risk of complex diseases with high accuracy.

Methodology/Principal Findings

We have derived simple deterministic formulae to predict the accuracy of predicted genetic risk from population or case control studies using a genome-wide approach and assuming a dichotomous disease phenotype with an underlying continuous liability. We show that the prediction equations are special cases of the more general problem of predicting the accuracy of estimates of genetic values of a continuous phenotype. Our predictive equations are responsive to all parameters that affect accuracy and they are independent of allele frequency and effect distributions. Deterministic prediction errors when tested by simulation were generally small. The common link among the expressions for accuracy is that they are best summarized as the product of the ratio of number of phenotypic records per number of risk loci and the observed heritability.

Conclusions/Significance

This study advances the understanding of the relative power of case control and population studies of disease. The predictions represent an upper bound of accuracy which may be achievable with improved effect estimation methods. The formulae derived will help researchers determine an appropriate sample size to attain a certain accuracy when predicting genetic risk.

Introduction

Genetic risk of disease is an important component of overall risk of disease in addition to environmental, socio-economic, and behavioral risk factors. Therefore, predicting the genetic risk of disease for an individual is a powerful tool in taking preventative measures against the onset of the disease. Such predictions from genetic testing are relatively straightforward when a disease is caused by one or few genes. However, when a disease is of complex inheritance, the genetic risk of the disease may be associated with many loci, each explaining only a small portion of the genetic variance [1], [2]. In this case, the prediction of genetic risk of disease of a particular individual becomes more challenging. Currently, prediction of risk for complex diseases is based mainly on pedigree analysis but this approach yields predictions of risk that are of low precision; for example predictions would be identical for full siblings without offspring, yet the genetic variation among them accounts for half or more of the genetic variance [3], [4].

The identification of very large numbers of single nucleotide polymorphisms (SNP) has enabled the use of genome-wide association studies (GWA) to detect alleles that are associated with risk for complex diseases [5], such as Type II Diabetes and Crohn's disease [6]. In tandem with this substantive increase of SNP data, several methods for quantifying and/or predicting genetic risk of disease from multiple genes have been put forward [7], [8]. Wray et al. [9] extended these methods by using a GWA approach to estimate the individual genetic risk of disease. Unlike the risk estimates obtained using only pedigree, the estimates resulting from such a GWA approach are more precise because they allow differentiation among full siblings. In addition, no pedigree or family history is needed either for estimating risk in one genotyped sample from the population or for predicting risk in a fresh sample. Similar genome-wide methodology has been proposed in animal and plant breeding to estimate additive genetic values for quantitative traits [10], [11]. One critical difference between the two genome-wide approaches is that Wray et al. [9] set a significance threshold for the loci selected for disease prediction, whereas Meuwissen et al. [10] use all loci regardless of whether or not they affect the trait considered. The approach of Meuwissen et al. [10] therefore attempts to achieve the maximum precision of the estimate of the complete genetic value for a given dataset by including loci whose effects may be too small to achieve statistical significance, and thus reduces the overestimation of allele effects [12].

Wray et al. [9] computed the precision of the individual genetic risk estimates by simulation. While simulation studies are useful in getting initial results on the number of phenotypic records needed to achieve a desired level of accuracy, they are computer intensive and time consuming with large numbers of markers. Most importantly, they do not provide a deep insight on how all variables that affect accuracy interact. Therefore, it is desirable to develop deterministic equations that are responsive to all variables that influence accuracy.

Here we present simple expressions for the genome-wide accuracy of prediction of genetic disease risk. We derive general expressions for continuous traits and the necessary extensions for dichotomous disease traits with data obtained either from population studies or case control studies. The predictions are tested by computer simulation under a variety of parameters influencing accuracy, such as disease prevalence, heritability and distributions of allele effects and frequencies.

Materials and Methods

Derivation of Equations

The predicted accuracy that is derived below represents the upper bound that can be achieved when estimating effects in one population sample and then predicting individual genetic risk in another sample from the same population. Throughout this article the accuracy of predicted genetic risk (r) is defined as the correlation between true and predicted genetic values. One advantage of using r is that the factors influencing it can be clearly derived using the principles of population genetics, as we show below. We will first derive equations that are predictive of r for a genome-wide approach with a continuous phenotype, such as height, assuming a population study where individuals are sampled at random. These will then be adapted to predict disease risk for a dichotomous phenotype (‘affected’ or ‘unaffected’) with an underlying continuous liability. The equations are then further adapted to the situation of case control data.

Continuous phenotype

We will assume that there are $n_G$ potential loci affecting a trait, which are independent, biallelic and act additively, where $n_G$ may be large. These loci may be candidate genes or genetic markers, of which a significant proportion may have zero effect. For locus j, j = 1…$n_G$, let a randomly chosen reference allele for that locus have frequency $p_j$ and true allelic substitution effect $\beta_j$. We shall assume without loss of generality that the distribution of allele frequencies $p_j$ is symmetric about p = ½, and likewise that allelic effects $\beta_j$ are symmetric about β = 0. No further distributional assumptions will be made here on $p_j$ and $\beta_j$, so, for example, many of the segregating alleles may have negligible or zero effect. No assumptions are made concerning the covariance between $p_j$ and $\beta_j$ in the populations sampled. We intend to derive the accuracy of the prediction of the additive genetic value (r) of an individual that can be achieved after the measurement of $n_P$ phenotypes.

An estimate of the effect of each allele may be obtained by regression of the phenotypic records on the genotypes one locus at a time because the loci are independently segregating. Assume the population variance of the phenotypes is 1. The estimated allele substitution effect will be $\hat{\beta}_j$ with expectation $\beta_j$, and is obtained by regressing the phenotypes on the observed number of reference alleles in the genotype, denoted $x_{ij}$ for individual i and locus j (i.e. $x_{ij}$ = 0, 1, or 2). The sampling variance of the allele estimate is $\sigma_e^2 / S_{xx,j}$, where $\sigma_e^2$ is the residual variance after regression on $x_{ij}$ and $S_{xx,j} = n_P \operatorname{var}(x_{ij})$ is the adjusted sum of squares for $x_{ij}$. Although not assumed here, when the population is in Hardy-Weinberg equilibrium $S_{xx,j}$ is given by $2 n_P p_j (1 - p_j)$. For the present, we shall conservatively take $\sigma_e^2 = 1$, which underestimates the accuracy of the prediction.

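Our aim is to predict the accuracy of a new population sample, so we apply the original estimates to a new sample of the same population. Values referring to the second sample will be 'dashed', hence individual i from the second sample has $x'_{ij}$ alleles at locus j. The additive genetic value of i is given by $g_i = \sum_j x'_{ij} \beta_j$ with estimate $\hat{g}_i = \sum_j x'_{ij} \hat{\beta}_j$. Then $r = \operatorname{cov}(g_i, \hat{g}_i) / \sqrt{\operatorname{var}(g_i)\operatorname{var}(\hat{g}_i)}$. Noting that $\hat{g}_i$ can be re-written as $\hat{g}_i = g_i + \sum_j x'_{ij} \varepsilon_j$ with $\varepsilon_j = \hat{\beta}_j - \beta_j$, it is seen that $\operatorname{cov}(g_i, \hat{g}_i) = \operatorname{var}(g_i)$ and that $r = \sqrt{\operatorname{var}(g_i)/\operatorname{var}(\hat{g}_i)}$. Of these remaining terms, $\operatorname{var}(g_i) = h_o^2$, where $h_o^2$ is the observed heritability for the trait, assuming the phenotypic variance is 1. Again using the decomposition $\hat{g}_i = g_i + \sum_j x'_{ij} \varepsilon_j$, it can be shown that $\operatorname{var}(\hat{g}_i) = \operatorname{var}(g_i) + \sum_j \operatorname{var}(x'_{ij}) / (n_P \operatorname{var}(x_{ij}))$, following from (i) the independence of the loci and (ii) the sampling variance of $\hat{\beta}_j$ derived earlier. Finally $\operatorname{var}(x'_{ij}) = \operatorname{var}(x_{ij})$, since the second sample comes from the same population, so $\operatorname{var}(\hat{g}_i) = h_o^2 + n_G/n_P$, and substituting $\lambda = n_P/n_G$ gives

$$r = \sqrt{\frac{\lambda h_o^2}{\lambda h_o^2 + 1}} \qquad (1)$$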
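As a quick numerical check, the sketch below evaluates the accuracy formula in the form $r = \sqrt{\lambda h_o^2/(\lambda h_o^2 + 1)}$, which is the form that reproduces the predicted values for the continuous phenotype in Table 1 to within rounding. The function name `predicted_accuracy` is illustrative and does not come from the paper.

```python
import math


def predicted_accuracy(lam: float, h2_obs: float) -> float:
    """Predicted accuracy r for lam = n_P / n_G and observed heritability h2_obs.

    Assumes the form r = sqrt(lam * h2 / (lam * h2 + 1)).
    """
    x = lam * h2_obs
    return math.sqrt(x / (x + 1.0))


if __name__ == "__main__":
    # Same grid as the continuous-phenotype rows of Table 1.
    for h2 in (0.1, 0.5, 0.9):
        row = [predicted_accuracy(lam, h2) for lam in (0.02, 0.5, 1.0, 5.0)]
        print(h2, ["%.3f" % r for r in row])
```

Because r depends on λ and the heritability only through their product, halving the heritability while doubling λ leaves the prediction unchanged.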

Therefore accuracy is seen to be a function of the product of the observed heritability $h_o^2$ and the ratio of the number of phenotypes recorded to the number of loci involved, λ. A second order correction to relax the assumption $\sigma_e^2 = 1$ is given in Appendix S1, where it is shown to result in an upward correction to r of fractional magnitude Inline graphic.

Dichotomous disease phenotype

We shall now derive the accuracy of predicting individual genetic risk to disease (r) in a random population sample by considering disease prevalence in a liability model [9]. For a disease with prevalence q, phenotypes are defined as $s_i = 0$ for unaffected and $s_i = 1$ for affected, so $E[s_i] = q$ and $\operatorname{var}(s_i) = q(1-q)$. Individuals with the highest liability are affected by the disease. Let liability be $y_i$, scaled so $E[y_i] = 0$ and $\operatorname{var}(y_i) = 1$, and let $\beta_j$ be the regression of liability on the number of reference alleles at locus j. The linear predictor of $s_i$ on $y_i$ is given by $s_i = q + q i_q y_i$ [13], where $i_q$ equals the mean liability of affected individuals, which we will term the selection intensity [3] corresponding to the prevalence of the disease in the population. Let the slope of the regression of $s_i$ on $x_{ij}$ be $b_j$, then $b_j = q i_q \beta_j$, with sampling variance, estimated conservatively using the phenotypic variance q(1−q),

$$\operatorname{var}(\hat{b}_j) = \frac{q(1-q)}{S_{xx,j}} \qquad (2)$$

The coefficients $\hat{b}_j$ may be rescaled to give estimates $\hat{\beta}_j = \hat{b}_j/(q i_q)$, with sampling variance

$$\operatorname{var}(\hat{\beta}_j) = \frac{1-q}{q\, i_q^2\, S_{xx,j}} \qquad (3)$$

Repeating the argument outlined above for a continuous phenotype with $\operatorname{var}(g_i) = h_L^2$ and $\operatorname{var}(\hat{g}_i) = h_L^2 + (1-q)/(\lambda q i_q^2)$, where $h_L^2$ is the heritability on the liability scale, and simplifying terms results in:

$$r = \sqrt{\frac{\lambda h_L^2 i_q^2 q (1-q)^{-1}}{\lambda h_L^2 i_q^2 q (1-q)^{-1} + 1}} \qquad (4)$$

Robertson and Lerner [14] show that the relationship between additive heritability on the observed scale, $h_o^2$, and the heritability on the liability scale, $h_L^2$, satisfies

$$h_o^2 = \frac{h_L^2\, i_q^2\, q}{1-q} \qquad (5)$$

Substitution into Equation (4) then results in the form of Equation (1), with the heritability being that on the observed (0,1) scale:

$$r = \sqrt{\frac{\lambda h_o^2}{\lambda h_o^2 + 1}} \qquad (6)$$

Therefore the dichotomous phenotype study of disease results in an identical formula for r as the continuous phenotype provided the heritability used is that for the observed dichotomous scale.
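The conversion between scales and the resulting accuracy can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes the liability-to-observed relationship $h_o^2 = h_L^2 z^2/[q(1-q)]$, where z is the standard normal density at the truncation point (equivalently $h_L^2 i_q^2 q/(1-q)$ with $i_q = z/q$), and the function names are hypothetical. With a liability-scale heritability of 0.5 and λ = 1 it reproduces the population-study (DP) predictions of Table 4.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def observed_h2(h2_liability: float, q: float) -> float:
    """Heritability on the observed (0,1) scale for prevalence q."""
    x = _STD_NORMAL.inv_cdf(1.0 - q)   # truncation point for upper tail q
    z = _STD_NORMAL.pdf(x)             # density at the truncation point
    return h2_liability * z * z / (q * (1.0 - q))


def accuracy_population(lam: float, h2_liability: float, q: float) -> float:
    """Predicted accuracy for a random population sample of a dichotomous trait."""
    lh = lam * observed_h2(h2_liability, q)
    return math.sqrt(lh / (lh + 1.0))


if __name__ == "__main__":
    for q in (0.01, 0.03, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
        print(q, "%.3f" % accuracy_population(1.0, 0.5, q))
```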

Case Control Disease Study

The formulae will now be extended to derive the accuracy r of a genetic risk prediction when applying a case control design to a dichotomous phenotype. The need for modification of the equations for a case control design comes from the selection of individuals from within the population to achieve a prevalence within the sample of cases and controls of w, where typically w = 1/2 with equal numbers of cases and controls. Parameter values post-selection will be 'starred'. It is assumed in the following without loss of generality that cases are less common than controls in the population so q ≤ w ≤ 1/2. Two parameters in particular need to be re-estimated because of the selection practiced: (i) $S^*_{xx,j}$; and (ii) the regression of $s_i$ on $x_{ij}$, $b^*_j$. Both these corrections can be made as shown in detail in Appendix S2.

Briefly, assuming no covariance between $p_j$ and $\beta_j$, $\operatorname{var}(g_i) = n_G E[\operatorname{var}(x_{ij})] E[\beta_j^2]$. $S^*_{xx,j}$ is $n_P \operatorname{var}^*(x_{ij})$ and so, since $n_G$ and $E[\beta_j^2]$ over loci are unaffected by the sampling of cases and controls, $E[\operatorname{var}^*(x_{ij})] = E[\operatorname{var}(x_{ij})] \operatorname{var}^*(g_i)/\operatorname{var}(g_i)$. Appendix S2 shows that using Normal theory $\operatorname{var}^*(g_i) = h_L^2 [1 - h_L^2\, \bar{i}(\bar{i} - x)]$, where x is the truncation point of a Normal distribution for upper-tail probability q, and $\bar{i} = w i_q - (1-w) i_{(1-q)}$. Further, $b^*_j = w(1-w)\, i_q (1-q)^{-1} \beta_j \operatorname{var}(x_{ij})/\operatorname{var}^*(x_{ij})$.

Approximating $\operatorname{var}(\hat{b}^*_j)$ by $w(1-w)/S^*_{xx,j}$, the variance of a binomial trait with probability w (w = ½ for equal numbers of cases and controls), gives $\operatorname{var}(\hat{\beta}_j) = (1-q)^2 [1 - h_L^2\, \bar{i}(\bar{i} - x)] / [w(1-w)\, i_q^2\, S_{xx,j}]$, and substituting λ results in

$$r = \sqrt{\frac{\lambda h_L^2}{\lambda h_L^2 + \dfrac{(1-q)^2 [1 - h_L^2\, \bar{i}(\bar{i} - x)]}{w(1-w)\, i_q^2}}} \qquad (7)$$

Changing the heritability from the liability scale for a population sample to the observed scale for a population sample using Equation (5) produces

$$r = \sqrt{\frac{\lambda h_o^2}{\lambda h_o^2 + \dfrac{q(1-q) [1 - h_L^2\, \bar{i}(\bar{i} - x)]}{w(1-w)}}} \qquad (8)$$

Finally, substituting $c = q(1-q) [1 - h_L^2\, \bar{i}(\bar{i} - x)] / [w(1-w)]$, gives

$$r = \sqrt{\frac{\lambda h_o^2}{\lambda h_o^2 + c}} \qquad (9)$$

Thus the form of r for a case control study shows equivalence to the r of continuous and dichotomous phenotypes provided heritability is on the observed scale and the appropriate changes are made in c to account for the selection of cases and controls. The value of c is 1 in population studies (Equation (6)), where w = q (and, hence, ī = 0). When q<w<1/2, c<1 and there is an increase in r compared to a population study with the same λ.
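The case control prediction can be explored numerically with the sketch below. It is a reconstruction under stated assumptions, not the authors' implementation: it takes $c = q(1-q)[1 - h_L^2\,\bar{i}(\bar{i}-x)]/[w(1-w)]$ with $\bar{i} = w i_q - (1-w) i_{(1-q)}$, $i_q = z/q$ and $i_{(1-q)} = z/(1-q)$, which gives c = 1 when w = q and agrees with the case control (DC) predictions in Table 4 to within about 0.001. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def case_control_accuracy(lam: float, h2_liability: float, q: float,
                          w: float = 0.5) -> float:
    """Predicted accuracy for a case control sample with case fraction w.

    Assumed form: r = sqrt(lam*h2_o / (lam*h2_o + c)), c as in the lead-in.
    """
    nd = NormalDist()
    x = nd.inv_cdf(1.0 - q)          # truncation point for prevalence q
    z = nd.pdf(x)                    # normal density at the truncation point
    i_q = z / q                      # mean liability of cases
    i_1mq = z / (1.0 - q)            # minus the mean liability of controls
    i_bar = w * i_q - (1.0 - w) * i_1mq   # mean liability in the selected sample
    h2_obs = h2_liability * i_q ** 2 * q / (1.0 - q)
    c = (q * (1.0 - q) * (1.0 - h2_liability * i_bar * (i_bar - x))
         / (w * (1.0 - w)))
    lh = lam * h2_obs
    return math.sqrt(lh / (lh + c))


if __name__ == "__main__":
    for q in (0.01, 0.05, 0.10, 0.20, 0.50):
        print(q, "%.3f" % case_control_accuracy(1.0, 0.5, q))
```

Setting `w=q` recovers the population-study prediction, which is a convenient sanity check on the implementation.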

Simulations

Stochastic computer simulations were used to test the deterministic predictions of r for a number of parameters affecting the continuous and dichotomous phenotypes. We describe the full simulation method for the continuous trait and then state additional steps that were needed for the dichotomous phenotypes (random population sample and case control). In all scenarios (i) individuals were unrelated; (ii) loci were independent; (iii) all genetic action was additive; (iv) for simplicity, loci were assumed to be in Hardy-Weinberg equilibrium; and (v) each scenario was replicated 100 times, except for case control scenarios with λ = 0.02 where 500 replicates were run. Furthermore for initial simulations (vi) allele frequencies were sampled from a uniform distribution corresponding to a common-disease-common-variant hypothesis (CDCV) [15]; and (vii) allele effects were drawn from a reflected exponential distribution which was made symmetric about x = 0. Items (vi) and (vii) were modified as described below.

For the continuous phenotypes, the phenotypic variance was 1. True additive genetic values for the nP individuals were calculated by assigning values of $(1-p_j)\beta_j$ and $-p_j\beta_j$ to the minor and major alleles, respectively, at each of the nG simulated loci, and summing over loci. The value of nG used in most scenarios was 1000 and nP varied accordingly, depending on λ. Two exceptions were λ = 0.02, where nG = 20,000, and the scenarios in which λ was kept constant with nG = 100. The scale factor of the exponential distribution was chosen to obtain the required additive heritability $h_o^2$. Phenotypic records were simulated by adding to the true genetic values independent environmental terms drawn from a Normal distribution with mean zero and variance $1 - h_o^2$. Allele substitution effects $\hat{\beta}_j$ were estimated by regression of the nP phenotypic records on genotypes one locus at a time. A second sample of individuals was then simulated with genotypes based on the same allele frequencies and effects as the original population. The estimated additive genetic values were then computed according to the following model: $\hat{g}_i = \sum_j x'_{ij} \hat{\beta}_j$, as described above. Finally, r was calculated as the correlation between true and estimated additive genetic values. Bias was also assessed by the slope of the regression of $g_i$ on $\hat{g}_i$.
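The simulation steps for the continuous phenotype can be sketched as follows. This is a scaled-down illustration, not the authors' program: it uses far fewer loci and individuals than the paper, a single replicate, allele frequencies drawn uniformly on (0.05, 0.95) rather than the full unit interval (an assumption made here to avoid monomorphic loci), and hypothetical function names.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def simulate_accuracy(n_loci=100, n_train=500, n_test=500, h2=0.5, seed=1):
    rng = random.Random(seed)
    p = [rng.uniform(0.05, 0.95) for _ in range(n_loci)]
    # Reflected exponential effects: exponential magnitude with a random sign.
    beta = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n_loci)]
    # Rescale so the additive variance sum(2p(1-p)beta^2) equals h2 under HWE.
    var_g = sum(2 * pj * (1 - pj) * bj * bj for pj, bj in zip(p, beta))
    beta = [bj * math.sqrt(h2 / var_g) for bj in beta]

    def sample(n):
        # Genotype = number of reference alleles (0, 1 or 2) at each locus.
        geno = [[(rng.random() < pj) + (rng.random() < pj) for pj in p]
                for _ in range(n)]
        g = [sum((x - 2 * pj) * bj for x, pj, bj in zip(row, p, beta))
             for row in geno]
        return geno, g

    geno, g = sample(n_train)
    y = [gi + rng.gauss(0.0, math.sqrt(1.0 - h2)) for gi in g]
    y_mean = sum(y) / n_train

    # Single-locus least squares estimates of the allele substitution effects.
    beta_hat = []
    for j in range(n_loci):
        xs = [row[j] for row in geno]
        x_mean = sum(xs) / n_train
        sxx = sum((x - x_mean) ** 2 for x in xs)
        sxy = sum((x - x_mean) * (yi - y_mean) for x, yi in zip(xs, y))
        beta_hat.append(sxy / sxx if sxx > 0 else 0.0)

    # Predict in an independent sample and correlate with true genetic values.
    geno2, g2 = sample(n_test)
    g2_hat = [sum(x * bh for x, bh in zip(row, beta_hat)) for row in geno2]
    return pearson(g2, g2_hat)


if __name__ == "__main__":
    lam, h2 = 500 / 100, 0.5
    print("simulated r:", simulate_accuracy())
    print("predicted r:", math.sqrt(lam * h2 / (lam * h2 + 1)))
```

With so few loci and one replicate, the simulated value should be expected to scatter around the deterministic prediction rather than match it closely; averaging over replicates, as done in the paper, reduces that noise.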

The continuous phenotype case was tested for robustness to different distributions of allele frequency and effects, and their correlation. The allele frequencies were also drawn from a beta (U-shape) distribution, consistent with a neutral allele model [16], with parameters alpha = 0.3 and theta = 0.3. Allele effects were also sampled from a normal distribution with mean zero. The effect of having a percentage of loci with zero effects was investigated by setting a proportion of the effects to zero while keeping the overall genetic variance constant. In all cases, the scale factor for the distribution of allele effects was modified to maintain the desired $h_o^2$.

Further testing of the predictions was done by introducing a correlation between the heterozygosity at a locus and the squared magnitude of the allele substitution effect at a locus. This was done for a uniform distribution of allele frequencies and the reflected exponential distribution of allele effects. This was achieved empirically: if the randomly drawn frequency had heterozygosity greater than the median (i.e. 2p(1−p)>0.375) then the magnitude of the allele effect was drawn to be less than the median of the distribution of the magnitudes.

The simulation of a random population sample for the dichotomous disease phenotype followed the same structure as above but contained the additional step of treating the underlying continuous phenotype distribution as a liability for the disease with heritability $h_L^2$ on the liability scale [14]. Thus, with prevalence q, the fraction q of the population with the greatest liability were considered to be affected. Allele effects were then estimated from the dichotomous phenotype and the accuracy, r, was calculated as the correlation between the true and estimated genetic liability for the disease in an independent population sample.

Case control studies were simulated with an equal number of cases and controls (i.e. w = 1/2). A dichotomous disease phenotype with sample size nP was simulated by including an additional selection step which expanded the population size to $n_P[2q]^{-1}$. The liabilities were constructed as for the population study of a dichotomous disease, the nP/2 individuals with the greatest phenotypic liability were considered to be affected cases, and a further nP/2 were randomly chosen from those remaining as control phenotypes. Allele effects were estimated as for the population studies, and the accuracy was estimated from a randomly-drawn independent population sample of size nP.

Results

Population-wide studies of continuous phenotypes

When allele effects were drawn from an exponential distribution and frequencies were from the uniform, the deterministic formula for r was found to predict the simulated data reliably across the wide range of parameters used (Table 1). The prediction errors across all parameters studied were in the range of −1.3 to 4.0% (Table 1).

Table 1. Predicted accuracy and percentage prediction error assessed by simulation with disease prevalence = 0.1 (SE range 0.0004–0.0065).

| Phenotype | h² ^b | λ^a = 0.02: P^c | % error^d | λ = 0.50: P | % error | λ = 1.00: P | % error | λ = 5.00: P | % error |
|---|---|---|---|---|---|---|---|---|---|
| C^e | 0.1 | 0.045 | 4.0 | 0.218 | 3.6 | 0.301 | 2.2 | 0.577 | 0.4 |
| | 0.5 | 0.100 | 2.1 | 0.447 | −0.5 | 0.577 | −0.2 | 0.845 | −0.1 |
| | 0.9 | 0.133 | −1.3 | 0.557 | 0.2 | 0.688 | −0.2 | 0.905 | −0.1 |
| DP^f | 0.1 | 0.026 | −14.1 | 0.130 | −6.6 | 0.182 | −2.2 | 0.382 | −1.6 |
| | 0.5 | 0.058 | −1.1 | 0.281 | 0.6 | 0.382 | −1.1 | 0.679 | 0.2 |
| | 0.9 | 0.078 | −9.8 | 0.365 | 1.6 | 0.485 | 0.8 | 0.779 | 0.2 |
| DC^g | 0.1 | 0.043 | −0.6 | 0.209 | 2.4 | 0.290 | 3.5 | 0.560 | −1.9 |
| | 0.5 | 0.089 | −4.3 | 0.407 | 3.0 | 0.533 | 0.8 | 0.816 | −2.9 |
| | 0.9 | 0.112 | −20.0 | 0.490 | −0.4 | 0.622 | −0.4 | 0.872 | −3.3 |

a λ = number of phenotypes per number of loci.
b h² = heritability (observed scale for C and DP, liability scale for DC).
c P = predicted accuracy of estimated additive genetic value.
d % error = percentage prediction error = 100(P − accuracy from simulation)/P.
e C = continuous phenotype.
f DP = dichotomous phenotype, population study.
g DC = dichotomous phenotype, case control study.

The close agreement between the predicted and achieved accuracies is also seen in Table 2 and was maintained when: (i) allele frequencies were drawn from a beta-distribution (% error −0.9 to 0.7); (ii) allele effects were drawn from a normal distribution (% error −0.8 to 5.0); (iii) exponential allele effects were mixed with varying proportions of alleles with no effects, ranging from 0 to 95% (% error 0.1 to 26.6, Table 3); (iv) λ's ranging from 0.02 to 5 were investigated (% error −20.0 to 4.0, Table 1); and (v) the genetic architecture was varied by keeping λ constant and changing nG (nG = 100, % error 0.1 to 7.6; and nG = 1000, % error −0.5 to 0.0). It should be noted that the large percentage errors seen when λ = 0.02 are due to low r, where the absolute difference between the expected and simulated r was still less than 0.02. The introduced correlation between heterozygosity and squared substitution effect was tested with λ = 1 and nG = 1000 using the empirical procedure described in the Materials and Methods. With an achieved correlation of −0.36 and an observed Inline graphic, the predicted accuracy from Equation (1) was 0.53, with an error of 1.1% when compared to simulation. In conclusion, it is clear that the deterministic r is robust to wide distributional assumptions on the joint distribution of frequency and effect of allele substitution, as predicted from the derivation.

Table 2. The effects of different distributions of allele frequency and effects on accuracy in a continuous phenotype with observed heritability = 0.5 (SE range 0.0004–0.0057).

| λ^a | Predicted | Simulated: Beta^b/Nrm^c | Simulated: Beta/Exp^d | Simulated: Uni^e/Nrm | Simulated: Uni/Exp |
|---|---|---|---|---|---|
| 0.02 | 0.100 | 0.095 | 0.093 | 0.100 | 0.097 |
| 0.50 | 0.447 | 0.442 | 0.436 | 0.451 | 0.450 |
| 1.00 | 0.577 | 0.577 | 0.579 | 0.576 | 0.578 |
| 2.00 | 0.707 | 0.709 | 0.714 | 0.704 | 0.709 |
| 5.00 | 0.845 | 0.849 | 0.848 | 0.846 | 0.846 |
| 10.00 | 0.913 | 0.914 | 0.914 | 0.913 | 0.912 |

a λ = number of phenotypes per number of loci.
b Beta = beta distribution (alpha = 0.3, theta = 0.3) of allele frequencies.
c Nrm = normal distribution of allele effects.
d Exp = exponential distribution of allele effects.
e Uni = uniform distribution of allele frequencies.

Table 3. Accuracy for a continuous phenotype when setting the effects of 0.95 of the nG^a loci to zero (λ = 0.02, with nP^b = 400 and nG = 20,000; SE range 0.0042–0.0057).

| $h_o^2$ ^c | 0.95 of nG zero | 0.0 of nG zero | Predicted |
|---|---|---|---|
| 0.1 | 0.057 | 0.043 | 0.045 |
| 0.5 | 0.101 | 0.097 | 0.100 |
| 0.9 | 0.129 | 0.135 | 0.133 |

a nG = number of loci.
b nP = number of phenotypes.
c $h_o^2$ = observed heritability.

Therefore the predictions of genome-wide accuracy shown in Figure 1 based on Equation (1) for different values of observed $h^2$ and λ have wide applicability. For all λ, the accuracy was most sensitive to $h^2$ when $h^2$ was low and this sensitivity was potentiated by higher numbers of phenotypes per genotype tested. The accuracies are functions of $\lambda h^2$, so the required λ to achieve a given accuracy is proportional to $1/h^2$. Thus, the numbers of phenotypes per genotype need to be twice as high for half the heritability. To obtain accuracies of 0.71, corresponding to predicting half the genetic variance, $\lambda = 1/h^2$, and therefore λ must be ≥ 1 because $h^2 \le 1$.
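Inverting the accuracy formula gives the sample size needed for a target accuracy. The sketch below assumes the form $r^2 = \lambda h^2/(\lambda h^2 + 1)$, so that $\lambda = r^2/[(1 - r^2) h^2]$; the function names and the example numbers of loci are illustrative only.

```python
import math


def required_lambda(r_target: float, h2_obs: float) -> float:
    """Phenotypes per locus needed to reach accuracy r_target."""
    r2 = r_target ** 2
    return r2 / ((1.0 - r2) * h2_obs)


def required_phenotypes(r_target: float, h2_obs: float, n_loci: int) -> int:
    """Number of phenotypic records, rounded up, for n_loci candidate loci."""
    return math.ceil(required_lambda(r_target, h2_obs) * n_loci)


if __name__ == "__main__":
    # Halving the heritability doubles the required lambda.
    print(required_lambda(0.9, 0.5), required_lambda(0.9, 0.25))
    print(required_phenotypes(0.9, 0.5, 1000))
```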

Figure 1. Predicted accuracy of estimated genetic values of a continuous phenotype.

Predicted accuracy of estimated additive genetic values of a continuous phenotype as a function of observed heritability and number of phenotypes per genotype tested, λ = 0.02, 0.1, 0.5, 1, 2, 5, 10 and 20 from minimum to maximum accuracy respectively.

Population-wide studies on dichotomous disease phenotypes

The form of the predicted accuracy (r) is very similar to that for a quantitative trait. Again the prediction of r was very good (% error −14.1 to 1.6; see Table 1). The validity of the prediction resulting from Equation (6) was robust to varying disease prevalence over the range of 0.01 to 0.5 (% error −1.9 to 1.4, Table 4). The prediction in Equation (6) is a function of λ and the observed additive heritability on a (0,1) scale, but a given observed heritability can be achieved with varied combinations of disease prevalence and underlying heritability of liability. This is shown in Table 5, which also demonstrates that, as predicted from Equation (6), r is a function of only $\lambda h_o^2$, as accuracy remains constant with varied disease prevalence and $h_L^2$.

Table 4. Accuracy for a dichotomous disease trait as prevalence varies ($h_L^2$ ^a = 0.5, λ^b = 1, SE range 0.0026–0.0048).

| Prevalence | DP^c: P^e | DP: % error^f | DC^d: P | DC: % error |
|---|---|---|---|---|
| 0.01 | 0.186 | −0.8 | 0.593 | −11.1 |
| 0.03 | 0.271 | −1.9 | 0.568 | −6.8 |
| 0.05 | 0.317 | 0.3 | 0.554 | −3.5 |
| 0.10 | 0.382 | −0.6 | 0.533 | 0.6 |
| 0.20 | 0.444 | 1.4 | 0.511 | −2.5 |
| 0.30 | 0.473 | 1.2 | 0.499 | −0.2 |
| 0.40 | 0.487 | −0.6 | 0.493 | 1.2 |
| 0.50 | 0.491 | 0.0 | 0.491 | 1.4 |

a $h_L^2$ = heritability on liability scale.
b λ = number of phenotypes per number of loci.
c DP = population study of dichotomous phenotypes.
d DC = case control study of dichotomous phenotypes.
e P = predicted accuracy of additive genetic values.
f % error = percentage prediction error = 100(P − accuracy from simulation)/P.

Table 5. Simulated accuracy of a population study for a dichotomous phenotype as prevalence and $h_L^2$ ^a vary and $h_o^2$ ^b stays constant (λ^c = 10, $h_o^2$ = 0.2, predicted accuracy = 0.816, Equation (4), SE range 0.0025–0.0038).

| Prevalence | $h_L^2$ | Accuracy |
|---|---|---|
| 0.05 | 0.893 | 0.810 |
| 0.10 | 0.584 | 0.814 |
| 0.20 | 0.408 | 0.814 |
| 0.30 | 0.347 | 0.813 |
| 0.40 | 0.322 | 0.813 |
| 0.50 | 0.314 | 0.813 |

a $h_L^2$ = heritability on liability scale.
b $h_o^2$ = heritability on observed scale.
c λ = number of phenotypes per number of loci.
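The liability-scale heritabilities listed in Table 5 can be recovered by inverting the scale conversion. The sketch below assumes an observed-scale heritability of 0.2, a value inferred here from the stated predicted accuracy of 0.816 at λ = 10 rather than read from the original caption, and the relationship $h_o^2 = h_L^2 z^2/[q(1-q)]$; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def liability_h2_for_observed(h2_obs: float, q: float) -> float:
    """Liability-scale heritability that yields h2_obs at prevalence q."""
    nd = NormalDist()
    z = nd.pdf(nd.inv_cdf(1.0 - q))   # normal density at the truncation point
    return h2_obs * q * (1.0 - q) / (z * z)


if __name__ == "__main__":
    lam, h2_obs = 10.0, 0.2   # h2_obs = 0.2 is an inferred value (see lead-in)
    r = math.sqrt(lam * h2_obs / (lam * h2_obs + 1.0))
    for q in (0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
        print(q, "%.3f" % liability_h2_for_observed(h2_obs, q), "%.3f" % r)
```

The predicted accuracy is the same on every row because only the observed-scale heritability enters the formula.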

The predicted r of population studies of continuous phenotypes and dichotomous disease phenotypes with an underlying continuous liability follow the same functional form, as seen in Equation (6). Therefore, Figure 1 can be used to derive predicted r for dichotomous phenotypes as well as continuous phenotypes. However, note that in the liability model, even if liability were fully determined genetically, the additive heritability on the observed scale will never exceed 0.64 (i.e. $4\theta(0)^2$, where θ(x) is the standardized normal density function), with the remaining genetic variation appearing non-additive. The corresponding maximum r achievable will be reduced and this will be most serious for low λ. Even with the most favorable circumstances of q = 1/2 and liability $h_L^2 = 1$, the accuracy will never exceed 0.71 if λ < 1.56, and it should be expected that λ needs to be much greater than this to explain half the genetic variance. This circumstance should not be expected to change when using disease models other than the liability model, since the loss of r arises from the loss of quantitative information when moving from a continuous genetic value (however defined) to the categorical observation of affected or not.

Case control studies of dichotomous disease phenotypes

The prediction formula for accuracy of case control studies (r) is not a simple function of λ and the observed $h_o^2$, but also depends on both the heritability on the liability scale and the disease prevalence, as seen from Equation (8). Therefore, comparisons require consideration of how c in Equation (9) varies. The simulations assumed w = 1/2, with equal numbers of cases and controls. As seen in Table 1, the predictions are generally good (% error −20.0 to 3.5), with the large deviations again due to low λ, but there is a trend towards the underestimation of r as prevalence becomes low (Table 4).

The value of r for case control studies is best illustrated by comparison with population studies of dichotomous disease traits. Figure 2 integrates this information and shows the relationship of prevalence and observed heritability in population and case control studies. Values of r below the narrowly dashed line derived from Equation (5) are not possible under the liability model; for example, an observed additive heritability of 0.5 and a prevalence of 0.1 could not exist in the same dataset. Each contour represents a level of constant r, where the dashed lines represent a population study and the solid lines denote a case control design with w = 1/2. As described above, the contours are vertical for population studies because, given $h_o^2$, the accuracy is independent of q, but for case control studies they move towards lower $h_o^2$ as prevalence decreases. Several clear conclusions on case control studies can be drawn: (i) the overall trend of r increasing with more phenotypes per number of genotypes holds true for case control studies (Table 1); (ii) population studies and case control studies are equivalent when the prevalence is 0.5 (Figure 2); (iii) a case control study is always more accurate than a population study with the same number of individuals genotyped (Figure 2); (iv) for a constant $h_L^2$, r increases as the disease prevalence increases in population studies, since this increases $h_o^2$, but in case control studies r increases as the disease prevalence decreases because of the more intense selection induced by the less prevalent disease (Table 4).

Figure 2. Predicted accuracy of estimated genetic risk from population and case control designs of a dichotomous phenotype.

Contour plot of predicted accuracy for varied prevalence and additive heritability on the observed scale, in population studies (dashed vertical line) and case control studies (solid line) of dichotomous phenotypes. Each contour represents a line of constant accuracy, starting from the right 0.9, 0.8, 0.7, and 0.6. The narrowly dashed line is derived from Equation (5) with $h_L^2 = 1$, so values below this line are not possible under the liability model.

Discussion

We have derived simple deterministic predictions of r in continuous and dichotomous phenotypes using either a population or a case control study and we have shown them to be appropriately responsive to changes in disease prevalence, heritability, and the number of phenotypic records per number of risk loci to be estimated. In addition, the equations have proven robust to changes in allele effect distributions, including different fractions of loci with zero effect and differing allele frequency distributions. Population studies are also robust to covariances between the magnitude of allele effects and heterozygosity, although, in principle, this robustness does not hold for case control studies. This advance in understanding has been used to summarize the influence of critical parameters such as heritability and numbers of phenotypes and risk loci on accuracy of prediction, and also to show the degree to which case control designs can add power to studies.

The approach taken here has been to assume the potential loci affecting the trait are known, and this has an impact that is double edged. First, it allows for a clear quantification of the limitations imposed on r by the number of phenotypes obtained, irrespective of marker densities. The information gained by doing so is of equal importance to knowing the number of markers needed for a certain r but seems to have received less attention recently. Second, it implies that the predicted r are upper bounds for the data obtained, since some loss of r will occur through the use of markers which are potentially in imperfect linkage disequilibrium (LD) with loci with effect [17], and the inclusion of candidate loci that may have no effect within the population.

The impact of including these loci with no true effect may be explained by two applications of our formulae. The first application assumes the loci affecting a disease trait are known, and thus r demonstrates an upper bound on the accuracy; for example, consider nG = 1000 loci with effects greater than 0, nP = 10,000 phenotypes and h² = 0.1, then the predicted accuracy is obtained with λ = 10, and will be 0.71. Now consider if those 1000 loci are contained within a set of nG = 100,000 marker loci, with 99% having zero effect, so that the accuracy is now obtained with λ = 0.1; our predictive equations remain valid and predict an accuracy of 0.10. From these applications of our formulae it is clear that the approach of estimating loci effects one at a time will inevitably result in low accuracies and, further, that adding more marker loci with zero effects while using the same approach will reduce the expected accuracy. The low accuracies predicted accord with the empirical findings from large scale studies of human data that have recently been reported [18]. It is clear that alternative approaches to prediction will be needed to bridge the gap and raise accuracies towards the upper bound set by the phenotype collection.
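The two applications above can be checked numerically. The sketch below assumes that the accuracy takes the form r = sqrt(λh² / (λh² + 1)) with λ = nP/nG, which is how we read Equation (1); that equation is not restated in this section. It also assumes a heritability of 0.1, the value that reproduces the accuracies quoted in the text. The function name is hypothetical.

```python
import math


def predicted_accuracy(n_phenotypes, n_loci, heritability):
    """Assumed form of the accuracy: r = sqrt(lam*h2 / (lam*h2 + 1)), lam = nP / nG."""
    lam = n_phenotypes / n_loci          # phenotypic records per locus
    x = lam * heritability               # the product that drives accuracy
    return math.sqrt(x / (x + 1.0))


H2 = 0.1  # assumed heritability; chosen because it reproduces 0.71 and 0.10

# Application 1: only the 1000 loci with true effects are fitted (lambda = 10).
r_known = predicted_accuracy(10_000, 1_000, H2)

# Application 2: the same loci hidden among 100,000 markers (lambda = 0.1).
r_all = predicted_accuracy(10_000, 100_000, H2)

print(f"known loci only: r = {r_known:.2f}")   # 0.71
print(f"all marker loci: r = {r_all:.2f}")     # 0.10
```

With the number of phenotypes fixed, the hundredfold increase in fitted loci lowers λ from 10 to 0.1, which is what drives the drop in predicted accuracy.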

Nevertheless, alternative approaches are available and evidence already exists that they may significantly increase predictive accuracy. One option is model selection: improvements in r can be achieved by implementing least squares model selection procedures to identify a subset of SNP from which to predict effects [10], [19], or by using more complex procedures to identify a subset to set to zero [20]. Some of these studies [10], [19], [20] also incorporate the use of prior information within Bayesian procedures and demonstrate significant increases in accuracy over least squares. Increasing the number of markers when using priors can increase accuracy because the size of the marker subset chosen stays the same due to the prior but the portion of the genetic variance captured by the marker subset increases [21]. However, the use of Bayesian approaches will demand reliable distributions for incorporation into models. Literature estimates informing priors on nG and the distributions of the effects will become more widely available as GWA studies become more powerful [1], [22]. Full genome-wide methods [10], [11], where genetic risk or additive genetic values are estimated in one step using all loci simultaneously, particularly if they are correlated, might be expected to approach the upper bound of r faster than methods which impose significance thresholds and, thus, do not capture all the genetic variation. From the results presented here it may be argued that priors on the numbers of loci positively contributing to the genetic variance will be more critical than those describing the distribution of gene effects.

In this paper we have used a liability model for disease instead of the commonly used log genetic risk model, and the impact of doing so is likely to be small for large datasets. For a given heritability and q, an underlying log-risk can be approximated well by a liability [9], [23]; the distribution of effects on the log-risk scale will be transformed to a distribution on the liability scale, and the predictions developed here are not dependent on the distribution of effects. However, there is evidence that distinctions may be larger when q is very close to zero or one [24].

A critical assumption of the genetic models studied was that the loci acted independently. In humans, most LD stretches for 10 to 30 kb, while some LD blocks may be >100 kb [25]. The human genome contains 3.1 billion bases [26] and, assuming 2000 known loci contribute to the additive genetic variance, each genomic segment between them would be 1550 kb on average. This indicates that the model is viable in humans. One could apply our formulae by interpreting nG as the number of independent chromosome segments (i.e. haplotype blocks). The length and, thus, the number of these segments would depend on the amount of LD present in the genome. The number of such segments has been estimated directly from pair-wise LD between markers [27], and closely related measures, such as the number of independent tests on the genome, have been estimated using principal component analysis [28] and have been derived analytically for specific experimental designs [29]. When LD exists, either between markers and risk loci or between risk loci, the predictive efficiency of our equations will be reduced. Modeling the pattern of LD by extension of our formulae would thus be important when many loci are used, as with dense SNP marker maps, or when predicting additive genetic values in other species, such as some livestock populations where the extent of LD is large compared with humans [30], [31].
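The spacing argument above is simple arithmetic and can be written out as follows. This is a sketch that assumes the 2000 loci are spread evenly along the genome, which the text implies but does not state; the variable names are hypothetical.

```python
GENOME_BASES = 3.1e9        # human genome size quoted in the text
N_LOCI = 2000               # loci assumed to contribute to additive genetic variance
TYPICAL_LD_KB = (10, 30)    # extent of most LD in humans, in kb
LONG_LD_KB = 100            # some LD blocks exceed this, in kb

# Mean segment length per locus in kb, assuming even spacing (an assumption).
segment_kb = GENOME_BASES / N_LOCI / 1_000

# How many times longer the mean spacing is than the LD ranges quoted above.
ratio_to_typical = segment_kb / TYPICAL_LD_KB[1]
ratio_to_long = segment_kb / LONG_LD_KB

print(f"mean spacing between loci: {segment_kb:.0f} kb")
print(f"spacing / upper typical LD extent: {ratio_to_typical:.1f}")
print(f"spacing / long LD block: {ratio_to_long:.1f}")
```

Under even spacing the mean distance between loci is well over ten times the length of the longer LD blocks, which is the sense in which independence is a workable approximation. Clustered loci would weaken this.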

An attraction of molecular predictors of genetic risk compared to pedigree predictors is the potential to apply the predictions more widely within populations and across populations. Obtaining sufficient accuracy within populations can be achieved by the quality and size of sampling, but there are additional factors in play when transfer across populations is being considered. For example, one benefit of genome-wide prediction is that individual allele effects are estimated with a precision that is related to the molecular variation observed at the locus, var(xij), which determines the contribution of genetic variance when combined with the squared magnitude of effect. This benefit may break down when predictions are transferred across populations. As an illustration, consider a rare allele of large effect which will be relatively imprecisely estimated in the estimation sample, but because the contribution of the locus to total variance is small there is only a small impact upon the accuracy of further predictions within the same population. In a different population, such an allele may have a greater frequency and contribute a greater part of the genetic variance, and, consequently, the predictive accuracy will suffer. Specifically, the ability to transfer predictions will depend on var(xij) in each of the two populations used for estimation and application, and this in turn depends on both the allele frequency (pj) and the degree of admixture present in the population. Furthermore, an additional risk of transferability across populations is the presence of epistasis which may differentially influence βj.
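The rare-allele illustration can be made concrete with a small calculation. The sketch assumes a biallelic locus in Hardy-Weinberg equilibrium with genotypes coded as allele counts, so that var(xij) = 2pj(1 − pj); this is a standard result but is an assumption here, since the coding is not restated in this section. The allele frequencies and the effect size are hypothetical values chosen only for illustration.

```python
def locus_variance(p, beta):
    """Genetic variance contributed by one locus: var(x) * beta^2 = 2p(1-p) * beta^2.

    Assumes Hardy-Weinberg equilibrium and genotypes coded 0/1/2.
    """
    return 2.0 * p * (1.0 - p) * beta ** 2


BETA = 1.0            # hypothetical large allele effect
P_ESTIMATION = 0.01   # hypothetical: allele is rare where effects are estimated
P_APPLICATION = 0.30  # hypothetical: allele is common where predictions are applied

v_est = locus_variance(P_ESTIMATION, BETA)
v_app = locus_variance(P_APPLICATION, BETA)
fold = v_app / v_est

print(f"variance in estimation population:  {v_est:.4f}")
print(f"variance in application population: {v_app:.4f}")
print(f"fold increase: {fold:.1f}")
```

The same poorly estimated effect accounts for roughly twenty times more genetic variance in the second population, so its estimation error weighs far more heavily on accuracy there.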

Any directional selection present in the population is likely to introduce a covariance between the magnitude of allelic effect and heterozygosity, since selection promotes the movement of alleles of large effect quickly through intermediate frequencies, where they create large genetic variance, towards extreme frequencies. The predictions of r developed make no assumption about this covariance, and hence are robust to such selection in the population prior to estimation in population studies. In contrast, the derivation for the case control study does assume independence of heterozygosity and magnitude (as described in Appendix S2). However, in the limited simulations carried out with such covariances in case control studies, the impact of breaking this assumption appeared small (results not shown).

Our derivations show that r can be reduced to very similar forms for population and case-control studies of continuous and dichotomous phenotypes (cf. Equations (1), (6) and (9)). The common element affecting r for all three equations is the product of λ, the number of phenotypic records per locus associated with the trait, and the observed heritability. Increasing either of these improves r, but the study shows that what matters in the trade-off between these two factors is their product. For a population study this product is completely sufficient to determine accuracy, independent of prevalence (q) and the heritability of liability for a dichotomous trait, but for a case control study both q and the heritability of liability retain some influence on r over and above their impact upon the product. This is because, in a case control study, the term c in Equation (9) is adjusting for the selection of the cases and controls; the strength of selection will depend upon q, and its impact on genetic variance will depend on the heritability of liability.
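The trade-off described above can be illustrated for the population-study case. The sketch again assumes the form r = sqrt(x / (x + 1)), where x is the product of λ and the observed heritability. It does not model the case control term c from Equation (9), which is not restated in this section. The three designs are hypothetical.

```python
import math


def accuracy_from_product(lam, h2_observed):
    """Population-study accuracy as a function of the product lam * h2 only (assumed form)."""
    x = lam * h2_observed
    return math.sqrt(x / (x + 1.0))


# Hypothetical designs: (records per locus, observed heritability), all with product 1.
designs = [(10.0, 0.1), (4.0, 0.25), (2.0, 0.5)]
accuracies = [accuracy_from_product(lam, h2) for lam, h2 in designs]

for (lam, h2), r in zip(designs, accuracies):
    print(f"lambda = {lam:>4}, h2 = {h2:<4} -> r = {r:.3f}")
```

All three designs give the same accuracy, so under this form a lower observed heritability can be offset by proportionally more records per locus.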

The predictive equations are a good fit to the simulated values and we have demonstrated, by theory and simulation, that they are independent of allele frequency and effect distributions. The formulae have increased the understanding of the relative differences between predicting r in a random sample of a population and in case control studies. The expressions for r derived will help researchers design experiments of appropriate size to estimate genetic risk to disease.
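For study design, the expression can be inverted to give the number of phenotypes needed for a target accuracy. The sketch below assumes the population-study form r² = λh² / (λh² + 1), which rearranges to nP = nG · r² / ((1 − r²) · h²). The function name and the example inputs are hypothetical, and the result inherits the upper-bound nature of the predictions.

```python
import math


def required_phenotypes(target_r, n_loci, heritability):
    """Smallest nP giving at least target_r under the assumed population-study form."""
    if not 0.0 < target_r < 1.0:
        raise ValueError("target_r must lie strictly between 0 and 1")
    if heritability <= 0.0:
        raise ValueError("heritability must be positive")
    # Required value of the product lam * h2 for the target accuracy.
    x = target_r ** 2 / (1.0 - target_r ** 2)
    return math.ceil(n_loci * x / heritability)


# Hypothetical design question: 1000 loci, heritability 0.1, target accuracy 0.9.
n_needed = required_phenotypes(0.9, 1_000, 0.1)
print(n_needed)  # 42632
```

Because r² / (1 − r²) grows steeply as r approaches 1, each further gain in accuracy requires a disproportionately larger sample.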

Supporting Information

Appendix S1

(0.56 MB DOC)

Appendix S2

(0.17 MB DOC)

Acknowledgments

We are grateful to Bill Hill, Piter Bijma and two anonymous reviewers for their helpful and constructive comments.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: HDD is supported by the SABRETRAIN Project, which is funded by the Marie Curie Host Fellowships for Early Stage Research Training, as part of the 6th Framework Program of the European Commission. BV receives support from the Scottish Executive Environment and Rural Affairs Department (SEERAD), and JAW receives funding from the Biotechnology and Biological Sciences Research Council (BBSRC).

References

1. Hayes BJ, Goddard ME. The distribution of the effects of genes affecting quantitative traits in livestock. Genetics Selection Evolution. 2001;33:209–229. doi: 10.1186/1297-9686-33-3-209.
2. Valdar W, Solberg LC, Gauguier D, Burnett S, Klenerman P, et al. Genome-wide genetic association of complex traits in heterogeneous stock mice. Nature Genetics. 2006;38:879–887. doi: 10.1038/ng1840.
3. Falconer DS, Mackay TFC. Introduction to Quantitative Genetics. Harlow, UK: Longman; 1996.
4. Bijma P, Woolliams JA. Prediction of genetic contributions and generation intervals in populations with overlapping generations under selection. Genetics. 1999;151:1197–1210. doi: 10.1093/genetics/151.3.1197.
5. Hirschhorn JN, Daly MJ. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics. 2005;6:95–108. doi: 10.1038/nrg1521.
6. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
7. Janssens ACJW, Aulchenko YS, Elefante S, Borsboom GJJM, Steyerberg EW, et al. Predictive testing for complex diseases using multiple genes: Fact or fiction? Genetics in Medicine. 2006;8:395–400. doi: 10.1097/01.gim.0000229689.18263.f4.
8. Pharoah PDP, Antoniou A, Bobrow M, Zimmern RL, Easton DF, et al. Polygenic susceptibility to breast cancer and implications for prevention. Nature Genetics. 2002;31:33–36. doi: 10.1038/ng853.
9. Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007;17:1520–1528. doi: 10.1101/gr.6665407.
10. Meuwissen TH, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819–1829. doi: 10.1093/genetics/157.4.1819.
11. Xu SZ. Estimating polygenic effects using markers of the entire genome. Genetics. 2003;163:789–801. doi: 10.1093/genetics/163.2.789.
12. Goring HHH, Terwilliger JD, Blangero J. Large upward bias in estimation of locus-specific effects from genomewide scans. American Journal of Human Genetics. 2001;69:1357–1369. doi: 10.1086/324471.
13. Robertson A. Inbreeding in Artificial Selection Programmes. Genetical Research. 1961;2:189–&. doi: 10.1017/S0016672308009452.
14. Robertson A, Lerner IM. The Heritability of All-Or-None Traits - Viability of Poultry. Genetics. 1949;34:395–411. doi: 10.1093/genetics/34.4.395.
15. Reich DE, Lander ES. On the allelic spectrum of human disease. Trends in Genetics. 2001;17:502–510. doi: 10.1016/s0168-9525(01)02410-6.
16. Pritchard JK. Are rare variants responsible for susceptibility to complex diseases? American Journal of Human Genetics. 2001;69:124–137. doi: 10.1086/321272.
17. Dekkers JC. Commercial application of marker- and gene-assisted selection in livestock: strategies and lessons. J Anim Sci. 2004;82(E-Suppl):E313–E328. doi: 10.2527/2004.8213_supplE313x.
18. Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nature Genetics. 2008;40:575–583. doi: 10.1038/ng.121.
19. Habier D, Fernando RL, Dekkers JCM. The impact of genetic relationship information on genome-assisted breeding values. Genetics. 2007;177:2389–2397. doi: 10.1534/genetics.107.081190.
20. Yi NJ, Xu SH. Bayesian LASSO for quantitative trait loci mapping. Genetics. 2008;179:1045–1055. doi: 10.1534/genetics.107.085589.
21. Solberg TR, Sonesson AK, Woolliams JA, Meuwissen THE. Genomic selection using different marker types and densities. Journal of Animal Science. 2008. doi: 10.2527/jas.2007-0010. (in press)
22. Chamberlain AJ, McPartlan HC, Goddard ME. The number of loci that affect milk production traits in dairy cattle. Genetics. 2007;177:1117–1123. doi: 10.1534/genetics.107.077784.
23. Lynch M, Walsh B. Genetics and the analysis of quantitative traits. Sunderland, MA: Sinauer Associates Inc; 1998.
24. Cox DR. Analysis of Binary Data. London: Methuen & Co Ltd; 1970.
25. Ardlie KG, Kruglyak L, Seielstad M. Patterns of linkage disequilibrium in the human genome. Nature Reviews Genetics. 2002;3:299–309. doi: 10.1038/nrg777.
26. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. The sequence of the human genome. Science. 2001;291:1304–+. doi: 10.1126/science.1058040.
27. Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21:263–265. doi: 10.1093/bioinformatics/bth457.
28. Shriner D, Baye TM, Padilla MA, Zhang S, Vaughan LK, et al. Commonality of functional annotation: a method for prioritization of candidate genes from genome-wide linkage studies. Nucleic Acids Research. 2008;36. doi: 10.1093/nar/gkn007.
29. Risch N. A Note on Multiple Testing Procedures in Linkage Analysis. American Journal of Human Genetics. 1991;48:1058–1064.
30. Mcrae AF, Mcewan JC, Dodds KG, Wilson T, Crawford AM, et al. Linkage disequilibrium in domestic sheep. Genetics. 2002;160:1113–1122. doi: 10.1093/genetics/160.3.1113.
31. Sargolzaei M, Schenkel FS, Jansen GB, Schaeffer LR. Extent of linkage disequilibrium in Holstein cattle in North America. J Dairy Sci. 2008;5:2106–2117. doi: 10.3168/jds.2007-0553.

Articles from PLoS ONE are provided here courtesy of PLOS
