Skip to main content
Genetics, Selection, Evolution : GSE logoLink to Genetics, Selection, Evolution : GSE
. 2017 Dec 27;49:95. doi: 10.1186/s12711-017-0366-6

An analytical framework to derive the expected precision of genomic selection

Jean-Michel Elsen 1,
PMCID: PMC5745666  PMID: 29281960

Abstract

Background

Formulae to predict the precision or accuracy of genomic estimated breeding values (GEBV) are important when modelling selection schemes. Simple versions of such formulae have been proposed in the past, based on a number of simplifying hypotheses, including absence of linkage disequilibrium and linkage between loci, a population made up of unrelated individuals, and that all genetic variability of the trait is explained by the genotyped loci. These formulae were based on approximations that were not always clear. The objective of this paper is to offer a unique framework to derive equations that predict the precision of GEBV from the size of the reference population and the heritability of and number of QTL controlling the quantitative trait.

Results

The exact formulation of the precision of GEBV involves the expectation of the inverse of a linear function of the genomic matrix, which cannot be calculated from simple algebra but can be approximated using a Taylor polynomial expansion. First order approximations performed better than the initial prediction equations published in the literature. Second order approximations produced almost perfect estimates of precision when compared to results obtained when simulating situations that agreed with the assumptions that were required to derive the precision equations. Using this proposed framework, we present several generalizations, including multi-trait genomic evaluation.

Conclusions

Although further improvements are needed to account for the complexity of practical situations, the equations proposed here can be used to derive the precision of GEBV when comparing breeding schemes a priori.

Electronic supplementary material

The online version of this article (10.1186/s12711-017-0366-6) contains supplementary material, which is available to authorized users.

Background

After the seminal work of Meuwissen et al. [1], who provided statistical methods to exploit linkage disequilibrium (LD) between genotyped marker loci and quantitative trait loci (QTL) in animal and plant breeding, as previously proposed by Lande and Thompson [2], genomic selection was launched, which has since revolutionised both research in quantitative and applied genetics and practical breeding plans. The benefits of this technology are considerable in dairy cattle (e.g. [38]) and dairy cattle breeders very rapidly changed their schemes in order to adopt genomic selection methods. Thus, it became possible to improve the reliability of estimated breeding values (EBV) at a young age, avoid costly and lengthy progeny tests, and limit the detrimental evolution of inbreeding. However, the application of genomic selection was not so clear in other breeding sectors, for various reasons: the high relative costs of genotyping (compared to the value of reproducers), the limited size of the (reference) populations required to calibrate the effects of single nucleotide polymorphisms (SNPs), and the fact that basic schemes were already organised with short generation intervals (e.g. [913]).

Mathematical models to describe and evaluate breeding plans can be useful to decide whether a breeding scheme based on genomic evaluations should be implemented or not e.g. [14]. These models are often based on stochastic simulations, in which the characteristics of single individuals (their genotypes at a number of SNPs, including QTL located across the genome, and their phenotypes for traits influenced by QTL) are generated, in order to produce data files that can be used as in “real life” (e.g. [15, 16]). Alternatively, models that describe populations at a higher level (generations, cohorts, classes of reproducers defined by their role in the scheme) offer a more rapid and flexible alternative to evaluate alternate breeding programs. In such approaches, deterministic equations link population characteristics such as heritability, mean LD, replacement rates, and the number of genotyped individuals to expected genetic progress by unit of time. Some of the most important equations in these models are the formulae that predict the precision of genomic EBV (GEBV). Analyses of simulations and real data have clearly demonstrated that the precision of GEBV depends on the structure of the reference population and the characteristics of the marker set used. The size of this reference population, its diversity, the genetic distance between the reference population and the group of selection candidates genotyped, the number of markers, and the degree or strength of LD are the main factors that influence this precision [1729].

A very simple formula to obtain the precision of GEBV was given by Daetwyler et al. [17], based on a number of simplifying hypotheses that included: absence of LD and linkage between loci, a population made up of unrelated individuals, and all genetic variation of the trait is explained by the genotyped loci. Under this approach, the regression of phenotypes on SNP genotypes was performed one locus at a time. This equation has been widely used and cited more than 100 times in the literature. Adjustments have since been proposed to deal with the distribution of marker allele frequencies [20], include dependence between marker loci through the definition of an effective number of independent loci [30], include the proportion of genetic variance explained by markers [22], and account for a smaller error variance when multiple marker loci are considered simultaneously [8]. Brard and Ricard [31] reviewed and challenged these formulae, using the results reported in 13 publications based on real data. They showed that the size of the reference population and the number of independent segments had a considerable impact on precision, and that the different formulae produced very different results. Other situations were explored by Hayes et al. [21] by considering dependence between the reference and candidate populations, by Wientjes et al. [32], who studied multi-population scenarios, and by Elsen [33], who suggested opportunities for the more systematic exploration of dependence between SNPs and between individuals.

In the present paper, using the simple situation that was initially studied by Daetwyler et al. [17], we propose a framework to derive equations that predict the precision of GEBV based on the size of the reference population, and the heritability of and number of QTL controlling the quantitative trait. We are interested in the expectation of the precision of GEBV, before implementing possible genotyping and selection schemes, as a tool for optimizing resources. With this prior approach, the variability summarized when computing the expectation of GEBV precision comes from marker locus polymorphisms as well as from QTL and environmental random effects. After demonstrating the performance of the solutions obtained, we explore extensions to more complex situations. Ten equations are successively proposed: (1) a general formulation of the expectation of the precision of GEBV; (2) the Daetwyler et al. [17] equation that assumes that the error variance is not modified after correction for SNP effects; (3) the Daetwyler et al. [17] equation that accounts for the corresponding reduction error variance; (4) an approximation of Eq. (1) based on a Taylor series expansion; (5) and (9) applications of Eq. (4) to the first order, assuming that all SNPs contribute equally to the genetic variance (Eq. 5) or that their effects share the same prior variance (Eq. 9); (6) an extension of Eq. (5) to the second order; (7) and (8) applications of Eq. (6) when assuming that distribution of allele frequencies is uniform (Eq. 7) or U-shaped (Eq. 8); (10) an extension to the multivariate situation.

Methods

Proposed framework

Notations and hypotheses

A list of abbreviations is in Table 1. Genomic predictions are based on a set of M biallelic SNPs, with alleles Ak and B k at locus k, the frequency of allele B k being f k. All SNPs are assumed to be in linkage equilibrium. The reference population, which is considered to be a random subset of a larger population, is made up of N unrelated individuals, which are genotyped and phenotyped. We are interested in the precision of the GEBV of a selection candidate that is not related to individuals in the reference population, but belongs to a selection population that is another subset of the larger population. The GEBV is derived as a SNP best linear unbiased prediction (BLUP) based on SNP genotypes.

Table 1.

List of abbreviations used in alphabetical order

Abbreviation Full meaning
a Vector of economic weights in γ
α Lower bound of the distribution of minor allele frequencies
Ak and Bk Alleles at SNP k
βk Effect of SNP k
β^k Prediction of the effect of SNP k
βj Vector of SNP effects for trait j
β Vector of SNP effects
B Covariance matrix vβ
β~ Estimates of fixed effects
γ Selection objective
γ^ BLUP of γ
D Expectation of XX+Λ
δk kth diagonal term of D
E Deviation of XX+Λ from D
e Vector of residuals
fmin Minimum minor allele frequency
fk Frequency of alleleBk
F Matrix of SNP genotypes
g Genetic value of the candidate
g^ GEBV of the candidate
g Vector of genetic values
g^ BLUP of g
h02 Heritability
Λ Diagonal matrix with elements λk
λ Ratio σe2/σgk2
λk Ratio σe2/σβk2
Me Effective number of loci
M Number of SNPs
Ne Effective population size
N Size of the reference population
P Working matrix (D-1ED-1E)
R Covariance matrix ve
r2~ Estimate of the precision of GEBV based on [17]
r^2 Approximation of r2 proposed here
r2X,w Expected precision of GEBV, given X and w
r2 Marginal expected precision of GEBV
σβk2 Variance of the effects βk
σk2 Variance of the number of Bk alleles
σg2 Genetic variance
σe2 Environmental variance
σY2 Phenotypic variance
w Vector of SNP genotypes
X Genotype matrix
y Vector of the phenotypes of the reference population

The random elements of the prediction model y=Xβ+e are as follows:

y is a vector of phenotypes recorded in the reference population, assumed to be centred at zero. β is a vector of SNP effects and is randomly distributed with a mean of 0 and covariance matrix vβ=σβ1200σβM2=B. Note that based on this matrix, the β k effects are supposed to be uncorrelated.

X is the genotype matrix defined by Xik=nik-2fk, where n ik ∊ {0, 1, 2} is the number of B k alleles carried by individual i at locus k. We assume that allele frequencies f k are known. The expectation of X ik is null, and its variance is σk2=2fk1-fk. Under linkage equilibrium between SNPs, the expectation of matrix XX is Nσ1200σM2=NF.

All genetic variability is assumed to be explained by the SNPs.

e is a vector of residuals with a mean of 0 and covariance matrix ve=σe200σe2=σe2IN=R.

g=wβ is the true genomic breeding value of the candidate to be predicted, with w the vector of SNP genotypes, defined as the rows in X. The variance of w is vw=F. Assuming all genetic variability is explained by the SNPs, we have vg=EwBw=EXBXiii.

g^=wβ^ is the GEBV of the candidate, where β^=(XR-1X+B-1)-1XR-1y is the BLUP of the SNP effects.

For a given set of genotypes X, variance v(β^|X)=B-XR-1X+B-1-1. Defining matrix Λ=λ100λM, with λk=σe2/σβk2, this variance is also v(β^X)=B-σe2XX+Λ-1.

Expected precision of GEBV

Four sources of variation underlie the correlation between genomic breeding values (g) and their prediction (g^): the SNP genotypes (X and w), their effects (β), and the environmental effects (e). Quite often, we are interested in the precision of GEBV, given the population genotypes (X and w), and the randomness arising from the variability of β and e, i.e. r2X,w=vg^|X,wvg|X,w, which is a function of matrices X and w. A priori, before genotyping, for instance when different SNP chip densities or reference population sizes are compared, the criterion of interest is r2=vg^vg. This is the situation explored in this paper.

The denominator in the previous equation for the precision of the GEBV is the genetic variance in the selection population: vg=EwwBw=trvwB=trFB=kσk2σβk2. The variances of SNP effects are not known but must be estimated (e.g. [1]). In our a priori estimation of the precision of the GEBV, simplifying assumptions are needed. Following VanRaden [19], all variances of SNP effects are assumed to be equal to σβ2 and, thus, σg2=σβ2kσk2. Alternatively, following Wientjes et al. [32], all SNPs contribute equally to σe2=σg2, i.e. σk2σβk2=σg2/M. This is the situation considered in the present paper. The ratio σ 2e/σ 2g will be denoted by λ.

The numerator of the equation for precision is vg^=EX,wwvβ^|X,ww since: (1) vg^=EX,wvg^|X,w+vX,wEg^|X,w and (2) Eg^|X,w=wEβ^|X=0.

Since β^ and X are independent from wandEw=0:vg^=EwwEXvβ^|Xw=trvwEXvβ^|X.

Finally r2=trFB-σe2trFEXXX+Λ-1σg2,

i.e.r2=1-λtrFEXXX+Λ-1. 1

Approximation of the precision of GEBV proposed by Daetwyler et al. [17]

In their derivation of the precision of GEBV, Daetwyler et al. [17] considered marker effects as both random and fixed effects. With our notations, they used β~=XR-1X-1XR-1y=XX-1Xy as a fixed effect estimator of β. In this context, varβ~-β=varβ~=XX-1σe2. However, when considering β as a random effect, they used covβ,β~-β=0, giving covg~,g=vg, and found r2=vgvg~, i.e. the inverse of the classical r2=vg^vg. Assuming that SNPs are in linkage equilibrium, have uncorrelated effects, and independence between SNP effects and genotypes (covwj,βj=0), the variance vg~=vwβ+β~-β is jvarwjvβj+jvarwjvβ~j-βj. When the reference population size is sufficiently large, then XX-1EXX-1, giving vg~=σg2+jvarwjσe2Nvarxij.

Initially, Daetwyler et al. [17] assumed inconsistently that both σp2=1 (“assuming the phenotypic variance is 1“) giving σg2=h01 and σe2=1 (“for the present, we shall conservatively take σe2=1”). Since the candidate and reference individuals belong to the same population, varwj=varxij and vg~=h02+M/N, which gives:

r2~1=Nh02Nh02+M. 2

A correction was proposed to relax the approximation σe2=1, which resulted in an upward correction of r2~. The idea was to replace σe2=1 by σe2=1-h02+h02(1-r2), giving a quadratic equation in r2 and

r2~2=M+Nh02±M+Nh022-4NMh042Mh02. 3

An alternative derivation of Eq. (2) was proposed by Wientjes et al. [32]. The main idea was that, assuming all SNPs are independent, their effects can be estimated in single random effect models, with y=Xkβk+ek for locus k, giving β^k=XkXk+σek2σβk2-1Xky. They assumed that (1) the reference population was large (XkXkNσk2), (2) the SNPs contributed equally to the genetic variance (σk2σβk2=σg2/M), and (3) the individual contribution of each SNP was very small (σek2=σY2-σg2/MσY2). Applying these assumptions, the BLUP of β k is β^k=Xkyσk2N+MσY2/σg2, with variance vβ^k=Nσβk2N+M/h02, and the precision of GEBV is r2=vg^vg=wvβ^wwvβw=NN+M/h02kwk2σβk2kwk2σβk2=Nh02Nh02+M.

Another approach to calculate the precision of GEBV

In the following, we do not assume that σ 2Y = σ 2e or σY2=σek2, and a unique multi QTL random model (y=Xβ+e) is used to describe relationships between phenotype and genotype. As in Wientjes et al. [32], we assume that σk2σβk2=σg2/M. In Eq. (1), the expectation of the inverse of matrix XX+Λ appears. This matrix can be broken down into diagonal D=EXX+Λ and non-diagonal elements E=XX-EXX. As in Goddard et al. [22] and Elsen [33], a Taylor series expansion for matrix E is used to find approximations:

XX+Λ-1=E+D-1=D-1I+ED-1-1=D-1I-ED-1+ED-1ED-1.

Because D-1 is not random and EXE=0, the second order approximation is EXXX+Λ-1=D-1+EXD-1ED-1ED-1 and the precision of GEBV can be approximated by:

r^221-λtrFD-1+trFEXD-1ED-1ED-1. 4

First order approximation

Matrix D-1 is diagonal with terms 1δk=1EXXkk+λk. In the first order, r^121-σe2σg2kσk2δk. Assuming as above that δk=Nσk2+σe2σβk2=σk2N+Mλ, we find:

r^121-λMN+Mλ,r^12=NN+Mλ=Nh02Nh02+M1-h02. 5

This equation differs from formula (2) of Daetwyler et al. [17] by a factor (1 − h 20).

Second order approximation

When the size of the reference population is limited, elements of matrix XX differ from their expectations: non-zero non-diagonal terms are present even if the SNPs are in linkage equilibrium and diagonal elements diverge from the genetic variances. The second order approximation of Eq. (4) partly captures these deviations. In Eq. (4), matrices F and D-1 are both diagonal. Thus, we only need the diagonal of matrix P=D-1ED-1E when computing the traces. Additional file 1 shows that terms E X[P kk] simplify to Nσk2δk1-2σk2δk+tσt2δt. A second order approximation of the precision of GEBV is thus r^221-λkσk2δk1+Nσk2δk1-2σk2δk+tσt2δt, which, using δ k = σ 2k(N + ),  simplifies to:

r^22NN+Mλ-λNMN+Mλ3M-2+1Mk1σk2. 6

This expression includes genetic variances σk2, which in practice can be estimated from the genomic data available. A priori, when these variances are not available, we can approximate the last term by using E1σk2=E12fk1-fk. A general situation is a uniform distribution of the frequencies between f min, the minimum minor allele frequency of genotyped SNPs (MAF), and 1 − f min (i.e. a probability density function ffk=11-2fmin), which results in E1σk2=fmin1-fmin11-2fmin121fk+11-fkdfk=121-2fminlog1-fminfmin-logfmin1-fmin=log1-fminfmin1-2fmin. A practical approximation of the expected precision of GEBV is thus:

r^22NN+Mλ-λNMN+Mλ3M-2+log1-fminfmin1-2fmin. 7

Numerical comparison of estimates of the precision of GEBV

Equations (1), (2), (3), (5) and (7) were evaluated by simulating data corresponding to the hypotheses that underlie their development: unrelated reference and candidate individuals, SNPs in linkage equilibrium, GEBV from a SNP-based BLUP model, and all causal SNPs are included in the SNP panel. Heritability ranged from 0.1 to 0.7, genotypes were available for 500 or 1000 SNPs and the size of the reference population ranged from 1000 to 10,000. The minimum MAF (f min) was 0.025, 0.05, 0.075 or 0.10. One hundred replicates were simulated for each scenario, with allele frequencies generated for each. Computations were made in FORTRAN with the help of the Nag library [34].

The results obtained when M = 500, f min = 0.05 and h 20 = 0.3 are in Fig. 1. Table 2 shows the effects of heritability and f min on the results. They showed consistently that (i) the prediction proposed by Daetwyler et al. [17] is closer to the “true” precision given by Eq. (1) when the size of the reference population is limited, (ii) the first order approximation proposed in the present paper (Eq. (5)) improves when the size of this population increases, i.e. when the assumptions that underlie these equations are more realistic, and (iii) the second order approximations of Eqs. (3) and (7) were nearly perfect in all cases, with a slight advantage for those of Daetwyler et al. [17] in most cases.

Fig. 1.

Fig. 1

Expected precision of GEBV as predicted by Eqs. (1), (2), (3), (5) and (7), when 500 SNPs are genotyped, minimum minor allele frequency is 0.05, and heritability is 0.3

Table 2.

Expected precision of GEBV as predicted by Eq. 1 (true precision), Daetwyler’s formula 2 and 3, Eqs. 5 and 7 when 500 SNPs are genotyped in a reference population of size 5000, depending on minimum minor allele frequency (MAF) and heritability

MAF h02 True Eq. (1) Daetwyler formula (2) Daetwyler formula (3) Elsen Eq. (5) Elsen Eq. (7)
0.05 0.1 0.514 0.500 0.513 0.526 0.513
0.05 0.3 0.798 0.750 0.798 0.811 0.798
0.05 0.5 0.901 0.833 0.901 0.909 0.902
0.05 0.7 0.955 0.875 0.955 0.959 0.955
0.025 0.1 0.513 0.500 0.513 0.526 0.513
0.025 0.3 0.797 0.750 0.798 0.811 0.798
0.025 0.5 0.901 0.833 0.901 0.909 0.902
0.025 0.7 0.955 0.875 0.955 0.959 0.955

Extensions

The framework that we propose is flexible to accommodate alternative situations without major problems, as illustrated in the following.

Distribution of allelic frequencies

In our a priori approach leading to Eq. (7), a uniform distribution of frequencies ffk=11-2fmin was assumed, corresponding to E1σk2=log1-fminfmin1-2fmin. Following Hayes et al. [21], a U-shaped distribution could be assumed, with ffk=C/2fk1-fk. The C constant must be estimated from the constraint α1-αffkdfk=1, where α and 1 − α are the bounds of the f k domain. Hayes et al. [21] argued that α = 1/2N e, with N e being the effective size of the reference population, leading to C = 1/ log (2N e). Alternatively, we could set α = f min and E1σk2=1+1-2fmin2fmin1-fminlog1-fminfmin. Figure 2 shows the values of these expectations for different minimum MAF. When the minimum MAF is higher than 5%, this correction factor E1σk2 is almost 2 and, in most cases, the expected precision of GEBV is close to:

r^22NN+Mλ-λMN2NN+Mλ3. 8
Fig. 2.

Fig. 2

Expectation of the inverse of variance of allele frequencies as a function of minimum allele frequencies (MAF), assuming a uniform or U-shaped distribution of allele frequencies

Goddard [20] also derived the precision of GEBV in this case of a U-shaped distribution of allele frequencies. In his formulation (formula (8) of his paper), the expectation of the precision of GEBV depends on the ratio λ=1-h2h2Melog2Ne, where Me is the effective number of SNPs genotyped. Replacing log (2N e) by -logfmin, as above, and assuming all SNPs are unlinked, resulting in M e = M, we compared the two approaches by using simulated data, as explained in the previous section. Results in Table 3 suggest that, in most cases, Goddard’s formula underestimates the precision, while the two other formulations are close to the expected value obtained from simulation.

Table 3.

Expected precision of GEBV when the distribution of allele frequencies is U-shaped, as predicted by Eqs. 1 and 8, by Daetwyler formula (3) and according to Goddard [20], when 500 SNPs are genotyped in a reference population of size 5000, depending on minimum minor allele frequency (MAF) and heritability

MAF h02 True Eq. (1) Elsen Eq. (8) Daetwyler formula (3) Goddard [20]
0.05 0.1 0.513 0.513 0.513 0.491
0.05 0.3 0.798 0.798 0.798 0.759
0.05 0.5 0.901 0.902 0.901 0.867
0.05 0.7 0.955 0.955 0.955 0.930
0.025 0.1 0.513 0.513 0.513 0.537
0.025 0.3 0.798 0.798 0.798 0.791
0.025 0.5 0.901 0.901 0.901 0.886
0.025 0.7 0.955 0.955 0.955 0.941

Distribution of SNP effects

In previous developments, it was assumed that each SNP had its own distribution of effects with a variance of σβk2. This was the condition assumed by Meuwissen et al. [1] when defining BayesA and BayesB Markov chain Monte Carlo approaches to genomic evaluation. This was justified in practice because the authors did not work at a single locus level but considered haplotypes of markers around each tested position, while theoretical justifications were given in the Bayesian LASSO context by Park and Cassela [35]. Alternatively, Meuwissen et al. [1] considered a unique variance σβ2 under the GBLUP approach, which is also the case for the fitted SNPs in model BayesC π [36]. The assumption of an equal contribution of each SNP to the genetic variance is no longer valid and variance σ 2β is linked to genetic variance by σg2=kσk2σβ2.

In this case, approximations of Eq. (1) for the expected precision of GEBV can be obtained using the same approach as before, using a matrix Taylor series expansion. As shown in Additional file 2, the first order approximation is given by:

r^92=1-λMN1+11-2fminλ^βN1+2λ^βNlog2fmin-1+1+2λ^βN1-2fmin+1+2λ^βN, 9

where λ^β=M1-2fmin1-6fmin2+2fmin33λ. An illustration of the quality of this approximation is in Fig. 3. The quality of the first order approximation of Eq. (1), which always overestimates precision, increased with population size and heritability and appeared to be satisfactory when N ≥ 5000.

Fig. 3.

Fig. 3

Expected precision of GEBV as predicted by Eqs. (1), (2) and (9) when 500 SNPs are genotyped and minimum minor allele frequency is 0.05. Case of a single prior variance of SNP effects and three levels of heritability (h 2 = 0.3, 0.5 and 0.7)

Multivariate prediction

A simple generalisation of the expected precision of GEBV can be obtained when retaining the previous assumption. A total of n c traits are recorded in the reference population and this information is used to predict the global genetic value of the candidate: γ=jajgj=ag, where a is a vector of n c economic weights and g the column vector of n c genetic values, i.e. gj=wβj. Vector βj is a vector of the M SNP effects on trait j (βj=βj1,,βjM). We assume that the vector of genotypes (w) is the same for all traits. All previous assumptions are retained: all SNPs have an effect on all traits, all SNPs have an equal contribution to genetic variance for each trait, and individuals are unrelated. SNP effects are distributed with specific prior variances of σβjk2, with zero correlations between SNPs. It is also assumed that the effects of SNP k on traits j and j are correlated, with a covariance of σβjjk2.

The objective is to predict precision r2=covγ,γ^2vγvγ^, where γ^=ag^, with g^ being the vector of GEBV. Thus, we need the variance vg^ of these GEBV, a n c × n c matrix. As detailed in Additional file 3, this variance is estimated at the first order using:

v(g^)^=Nv(g)(Nv(g)+Mv(e))-1v(g), 10

which is an obvious generalisation of the equivalent equation in the single-trait situation, which led to Eq. (5).

Discussion

Using classical statistical theory, the expected precision of GEBV based on marker-based BLUP was derived simply. Numerical approximations, based on a matrix Taylor series expansion, were produced for simple situations. From simulations that were consistent with the assumptions corresponding to these situations, these approximations performed similarly to and often better than the formulae for precision of GEBV that were previously published. However, the framework developed here is simpler and enables direct generalisations.

The first order approximation proposed here (Eq. 5) differs from formula (2) of Daetwyler et al. [17] by a (1 − h 20) term. Those approximations differ by the way the error term variance is defined when a single SNP effect is estimated. In Eq. (2), it was assumed that this error term variance is the total phenotypic variance, because when estimating a unique SNP effect, all other SNP effects participate to the error term. Too much error is assumed with this approximation and the precision is under evaluated. Equation (5) behaves as if all other SNP effects were perfectly estimated, limiting the error term to the only non-genetic part. This gives an overestimation of the GEBV precision. The second order approximations try to correct for these under- or overestimations: Eq. (3) replaces the 1 − h 20 term of Eq. (5) by 1 − r 2 h 20, which corrects for the non-perfect estimation of other SNPs effects. Equation (7) accounts for the lack of orthogonality between the SNPs.

Asymptotic behaviours of first order approximations are not the same: when all the observed variability has an (additive) genetic origin, i.e. when h 20 = 1, formula (2) simplifies to NN+M, while our Eq. (5) predicts a perfect precision of GEBV. This discrepancy disappears when correcting Eq. (2) for non-perfect estimation of other SNPS (Eq. 3). With this correction Eq. 3 predicts a perfect precision of GEBV when h 20 = 1. In spite of being algebraically very different, the second order approximations underlying the Eqs. (3) and (7) worked very similarly and produced results that were very close to those observed from simulations.

The hypotheses that underlie the equations derived here are strong and efforts should be made to overcome these constraints. First, it was assumed that all genetic variability is explained by the SNPs included in the evaluation. Although this is increasingly true as the size of SNP chips grows towards a full knowledge of genomes by resequencing and imputation, other polymorphisms, including copy number variations (CNV), may play a role and the genotype information obtained is still far from sufficient to fully explain genetic variability. It has been suggested [8, 22] that the proportion b of the genetic variance explained by the markers should be taken into account through a reduction in the heritability (from h 2 to bh 2) in the equations used and, using path coefficient theory, through a regression of precision of GEBV by b (from r 2 to br 2). This is easily implemented in the equations provided in this paper.

A second central hypothesis was independence between SNPs. With the current sizes of SNP chips, which will be even larger in the future, close SNPs are in LD and cannot be considered to be independent. This dependence means that non diagonal terms of EXX are non-null, with EXXkl=2NΔkl, where Δkl is the LD between SNPs and l. Equations can be derived for this situation, based on principles similar to the theory given here, but they are cumbersome, e.g. [33]. The concept of effective independent chromosomal segments has been discussed [17] and formalised [20] as an alternative to the true number of markers. The idea is that the precision of the genomic prediction model “depends on the variation in the realised relationship between pairs of animals” [20]. Then, the effective number of loci is defined as the “number of independent loci that gives the same variance of realised relationships as obtained in the more realistic situation” [20]. Solutions have also been proposed to estimate the effective number of loci, from population genetics considerations [20, 21, 37] or from real data (e.g. [32]).

A third assumption was the absence of relationships between individuals in the reference population and between candidates in the reference population. A formalisation of situations where individuals are related was proposed [33] but only for the first order approximations of precision. Although relationships between reference individuals and candidates were accounted for by using this first order approximation, this was not the case for the structure of the reference population itself. Therefore, further efforts are needed, which is particularly important since it is clear that (1) genomic predictions of breeding values arise only partly from historical LD and increase in precision when individuals in the reference population and candidates are more closely related [26, 3840], and (2) the structure of the reference population is a key factor in the precision on GEBV, e.g. [41].

The predicted variances of SNP effects calculated by Eq. (10) in the multivariate situation were obtained under strong assumptions. First, it was assumed that GEBV are computed using a multivariate approach that considers correlations between the effects of SNPs on different traits. However, in practice, GEBV are often computed using single-trait algorithms. In our formulation, this is equivalent to omitting the off-diagonal terms in matrices Bkk and E when estimating the SNP effects β_^. In this case, the variance of those effects, and vg^, do not simplify to the equations derived in the case studied. A second important assumption was that all SNPs contributed equally to genetic covariances, as a direct extension of the single trait situation studied. The alternative assumption of unique (regardless of the SNPs) variances and covariances (e.g. σβjk2=σβj2k) is also possible, as described in the previous section. Both these assumptions are, however, questionable, in particular because genetic correlations lower than 1 suggest that only a limited proportion of the SNPs (and underlying QTL) affect all traits. Extra prior information about the underlying genetic architecture of these correlations would be useful in this regard.

A few other assumptions are used in the current paper, including the additivity and i.i.d. of QTL effects, and the use of GBLUP. As long as the objective is to model and optimise breeding plans, then only relative values will be of interest and we assume that these assumptions are not critical for those comparisons.

Conclusions

The objective of this paper was to provide a clear framework to derive predictive equations to estimate the precision of GEBV. Such equations can generate results in a second and thus enable the optimisation of a breeding program design through intensive numerical exploration. Not the entire complexity of practical breeding programs was included in the simple formulae derived here and in previously published papers. The purpose was to support the a priori comparison of breeding schemes, rather than to evaluate actual breeding schemes. The exact formulation of precision involves the expectation of the inverse of a linear function of the genomic relationship matrix, which cannot be calculated from simple algebra but can be approximated by a Taylor series development, as was already suggested by Goddard [20]. Second order approximations produced nearly perfect estimates of this precision, when compared to the results obtained by simulating data that are in agreement with the assumptions required to obtain the equations to estimate the precision of GEBV. We proposed several generalisations for the estimates of precision for this initial case, including multi-trait evaluation. Other situations can also be derived within the framework presented here.

Additional files

12711_2017_366_MOESM1_ESM.docx (21.7KB, docx)

Additional file 1. Diagonal elements of EXD-1ED-1E. This file provides details about the algebraic derivation of the expectation given in the title.

12711_2017_366_MOESM2_ESM.docx (21.4KB, docx)

Additional file 2. Algebra when the SNP effect variance σβ2 is unique. This file shows the derivation in the alternative hypothesis for the SNP effect variance.

12711_2017_366_MOESM3_ESM.docx (28.8KB, docx)

Additional file 3. The multivariate case. This file provides details about the algebraic derivation in the multivariate case.

Acknowledgements

Many thanks to Yvonne CJ Wientjes for the discussions we had on how precision of GEBV should be calculated. I am very grateful to the referee who gave excellent suggestions for the discussion part.

Competing interests

The author declares that he has no competing interests.

Ethics approval and consent to participate

Not applicable.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Footnotes

Electronic supplementary material

The online version of this article (10.1186/s12711-017-0366-6) contains supplementary material, which is available to authorized users.

References

  • 1.Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819–1829. doi: 10.1093/genetics/157.4.1819. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Lande R, Thompson R. Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics. 1990;124:743–756. doi: 10.1093/genetics/124.3.743. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Schaeffer LR. Strategy for applying genome-wide selection in dairy cattle. J Anim Breed Genet. 2006;123:218–223. doi: 10.1111/j.1439-0388.2006.00595.x. [DOI] [PubMed] [Google Scholar]
  • 4.König S, Simianer H, Willam A. Economic evaluation of genomic breeding programs. J Dairy Sci. 2009;92:382–391. doi: 10.3168/jds.2008-1310. [DOI] [PubMed] [Google Scholar]
  • 5.McHugh N, Meuwissen THE, Cromie AR, Sonesson AK. Use of female information in dairy cattle genomic breeding programs. J Dairy Sci. 2011;94:4109–4118. doi: 10.3168/jds.2010-4016. [DOI] [PubMed] [Google Scholar]
  • 6.de Roos AP, Schrooten C, Veerkamp RF, van Arendonk JAM. Effects of genomic selection on genetic improvement, inbreeding, and merit of young versus proven bulls. J Dairy Sci. 2011;94:1559–1567. doi: 10.3168/jds.2010-3354. [DOI] [PubMed] [Google Scholar]
  • 7.Pryce JE, Goddard ME, Raadsma HW, Hayes BJ. Deterministic models of breeding scheme designs that incorporate genomic selection. J Dairy Sci. 2010;93:5455–5466. doi: 10.3168/jds.2010-3256. [DOI] [PubMed] [Google Scholar]
  • 8.Meuwissen THE, Hayes BJ, Goddard ME. Accelerating improvement of livestock with genomic selection. Annu Rev Anim Biosci. 2013;1:221–237. doi: 10.1146/annurev-animal-031412-103705. [DOI] [PubMed] [Google Scholar]
  • 9.Sonesson AK, Meuwissen THE. Testing strategies for genomic selection in aquaculture breeding programs. Genet Sel Evol. 2009;41:37. doi: 10.1186/1297-9686-41-37. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Ibánẽz-Escriche N, Fernando RL, Toosi A, Dekkers JCM. Genomic selection of purebreds for crossbred performance. Genet Sel Evol. 2009;41:12. doi: 10.1186/1297-9686-41-12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Wolc A, Arango J, Settar P, Fulton JE, O’Sullivan NP, Preisinger R, et al. Persistence of accuracy of genomic estimated breeding values over generations in layer chickens. Genet Sel Evol. 2011;43:23. doi: 10.1186/1297-9686-43-23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Tribout T, Larzul C, Phocas F. Efficiency of genomic selection in a purebred pig male line. J Anim Sci. 2012;45:4164–4176. doi: 10.2527/jas.2012-5107. [DOI] [PubMed] [Google Scholar]
  • 13.Shumbusho F, Raoul J, Astruc JM, Palhiere I, Elsen JM. Potential benefits of genomic selection on genetic gain of small ruminant breeding programs. J Anim Sci. 2013;91:3644–3657. doi: 10.2527/jas.2012-6205. [DOI] [PubMed] [Google Scholar]
  • 14.Weller JI. Economic aspects of animal breeding. London: Chapman & Hall; 1994. [Google Scholar]
  • 15.Tribout T, Larzul C, Phocas F. Efficiency of genomic selection in a purebred pig male line. J Anim Sci. 2012;90:4164–4176. doi: 10.2527/jas.2012-5107. [DOI] [PubMed] [Google Scholar]
  • 16.Bastiaansen JW, Coster A, Calus MP, van Arendonk JA, Bovenhuis H. Long-term response to genomic selection: effects of estimation method and reference population structure for different genetic architectures. Genet Sel Evol. 2012;44:3. doi: 10.1186/1297-9686-44-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Daetwyler HD, Villanueva B, Woolliams JA. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One. 2008;3:e3395. doi: 10.1371/journal.pone.0003395. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Legarra A, Robert-Granié C, Manfredi E, Elsen JM. Performance of genomic selection in mice. Genetics. 2008;180:611–618. doi: 10.1534/genetics.108.088575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91:4414–4423. doi: 10.3168/jds.2007-0980. [DOI] [PubMed] [Google Scholar]
  • 20.Goddard ME. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009;136:245–257. doi: 10.1007/s10709-008-9308-0. [DOI] [PubMed] [Google Scholar]
  • 21.Hayes BJ, Visscher PM, Goddard ME. Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res (Camb) 2009;91:47–60. doi: 10.1017/S0016672308009981. [DOI] [PubMed] [Google Scholar]
  • 22.Goddard ME, Hayes BJ, Meuwissen THE. Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet. 2011;128:409–421. doi: 10.1111/j.1439-0388.2011.00964.x. [DOI] [PubMed] [Google Scholar]
  • 23.Buch LH, Kargo M, Berg P, Lassen J, Sørensen AC. The value of cows in reference populations for genomic selection of new functional traits. Animal. 2011;6:880–886. doi: 10.1017/S1751731111002205. [DOI] [PubMed] [Google Scholar]
  • 24.Clark SA, Hickey JM, Daetwyler HD, van der Werf JHJ. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet Sel Evol. 2012;44:4. doi: 10.1186/1297-9686-44-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Wientjes YCJ, Veerkamp RF, Calus MPL. The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics. 2013;193:621–631. doi: 10.1534/genetics.112.146290. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Habier D, Fernando RL, Garrick DJ. Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics. 2013;194:597–607. doi: 10.1534/genetics.113.152207. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Erbe M, Gredler B, Seefried FR, Bapst B, Simianer H. A function accounting for training set size and marker density to model the average accuracy of genomic prediction. PLoS One. 2013;8:e81046. doi: 10.1371/journal.pone.0081046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Habier D, Fernando RL, Dekkers JCM. The impact of genetic relationship information on genome-assisted breeding values. Genetics. 2007;177:2389–2397. doi: 10.1534/genetics.107.081190. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Pszczola M, Strabel T, Mulder HA, Calus MPL. Reliability of direct genomic values for animals with different relationships within and to the reference population. J Dairy Sci. 2012;95:389–400. doi: 10.3168/jds.2011-4338. [DOI] [PubMed] [Google Scholar]
  • 30.Daetwyler HD, Pong-Wong R, Villanueva B, Woolliams A. The impact of genetic architecture on genome-wide evaluation methods. Genetics. 2010;185:1021–1031. doi: 10.1534/genetics.110.116855. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Brard S, Ricard A. Is the use of formulae a reliable way to predict the accuracy of genomic selection? J Anim Breed Genet. 2015;132:207–217. doi: 10.1111/jbg.12123. [DOI] [PubMed] [Google Scholar]
  • 32.Wientjes YCJ, Bijma P, Veerkamp RF, Calus MPL. An equation to predict the accuracy of genomic values by combining data from multiple traits, breeds, lines or environments. Genetics. 2016;202:799–823. doi: 10.1534/genetics.115.183269. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Elsen JM. Approximated prediction of genomic selection accuracy when reference and candidate populations are related. Genet Sel Evol. 2016;48:18. doi: 10.1186/s12711-016-0183-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Numerical Algorithms Group. https://www.nag.co.uk/.
  • 35.Park T, Cassela G. The Bayesian LASSO. J Am Stat Assoc. 2008;103:681–686. doi: 10.1198/016214508000000337. [DOI] [Google Scholar]
  • 36.Habier D, Fernando RL, Kizilkaya K, Garrick DJ. Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics. 2011;12:186. doi: 10.1186/1471-2105-12-186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Visscher PM, Medland SE, Ferreira MAR, Morley KI, Zhu G, Cornes BK, et al. Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2006;2:e41. doi: 10.1371/journal.pgen.0020041. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Schopp P, Müller D, Technow F, Melchinger AE. Accuracy of genomic prediction in synthetic populations depending on the number of parents, relatedness, and ancestral linkage disequilibrium. Genetics. 2017;205:441–454. doi: 10.1534/genetics.116.193243. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Sun X, Fernando R, Dekkers J. Contributions of linkage disequilibrium and co-segregation information to the accuracy of genomic prediction. Genet Sel Evol. 2016;48:77. doi: 10.1186/s12711-016-0255-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Scutari M, Mackay I, Balding D. Using genetic distance to infer the accuracy of genomic prediction. PLoS Genet. 2016;12:e1006288. doi: 10.1371/journal.pgen.1006288. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Rincent R, Laloë D, Nicolas S, Altmann T, Brunel D, Revilla P, et al. Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.) Genetics. 2012;192:715–728. doi: 10.1534/genetics.112.141473. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

12711_2017_366_MOESM1_ESM.docx (21.7KB, docx)

Additional file 1. Diagonal elements of EXD-1ED-1E. This file provides details about the algebraic derivation of the expectation given in the title.

12711_2017_366_MOESM2_ESM.docx (21.4KB, docx)

Additional file 2. Algebra when the SNP effect variance σβ2 is unique. This file shows the derivation in the alternative hypothesis for the SNP effect variance.

12711_2017_366_MOESM3_ESM.docx (28.8KB, docx)

Additional file 3. The multivariate case. This file provides details about the algebraic derivation in the multivariate case.


Articles from Genetics, Selection, Evolution : GSE are provided here courtesy of BMC

RESOURCES