Genomic Prediction Within and Across Biparental Families: Means and Variances of Prediction Accuracy and Usefulness of Deterministic Equations

Pascal Schopp; Dominik Müller; Yvonne C J Wientjes; Albrecht E Melchinger

doi:10.1534/g3.117.300076

2017 Sep 15;7(11):3571–3586.


PMCID: PMC5677162 PMID: 28916649

Abstract

A major application of genomic prediction (GP) in plant breeding is the identification of superior inbred lines within families derived from biparental crosses. When models for various traits were trained within related or unrelated biparental families (BPFs), experimental studies found substantial variation in prediction accuracy (PA), but little is known about the underlying factors. We used SNP marker genotypes of inbred lines from either elite germplasm or landraces of maize (Zea mays L.) as parents to generate in silico 300 BPFs of doubled-haploid lines. We analyzed PA within each BPF for 50 simulated polygenic traits, using genomic best linear unbiased prediction (GBLUP) models trained with individuals from either full-sib (FSF), half-sib (HSF), or unrelated families (URF) for various sizes ($N_{train}$) of the training set and different heritabilities ($h^{2}$). In addition, we modified two deterministic equations for forecasting PA to account for inbreeding and genetic variance unexplained by the training set. Averaged across traits, PA was high within FSF (0.41–0.97) with large variation only for $N_{train} < 50$ and $h^{2} < 0.6$. For HSF and URF, PA was on average ∼40–60% lower and varied substantially among different combinations of BPFs used for model training and prediction as well as different traits. As exemplified by HSF results, PA of across-family GP can be very low if causal variants not segregating in the training set account for a sizeable proportion of the genetic variance among predicted individuals. Deterministic equations accurately forecast the PA expected over many traits, yet cannot capture trait-specific deviations. We conclude that model training within BPFs generally yields stable PA, whereas a high level of uncertainty is encountered in across-family GP. Our study shows the extent of variation in PA that must be at least reckoned with in practice and offers a starting point for the design of training sets composed of multiple BPFs.

Keywords: genomic prediction, biparental families, plant breeding, GBLUP, deterministic accuracy, linkage disequilibrium, GenPred, Shared Data Resources, Genomic Selection

With the advent of low-cost genome-wide SNP markers, genomic prediction (GP, see Supplemental Material, Table S1 in File S1 for full list of abbreviations) proposed by Meuwissen et al. (2001) has become a powerful tool in animal and plant breeding. The basic idea of GP is to combine the phenotypic and genotypic data of training individuals in a model for predicting the genetic merit of selection candidates that have only been genotyped. Complementing or even replacing phenotyping can result in considerable cost savings and shortening of breeding cycles, thereby giving GP a big advantage over traditional selection methods (Bernardo and Yu 2007; Goddard and Hayes 2007; Lin et al. 2014). Particular challenges of GP in plant breeding arise from (i) the specific population structures, mostly characterized by multiple related or unrelated segregating biparental families (BPFs) derived from crosses between inbred parents, and (ii) small sample sizes available for model training (Jannink et al. 2010).

In commercial breeding of line and hybrid cultivars, up to several hundred BPFs are newly generated every year. Depending on the species and size of the breeding program, each family can comprise a variable number (usually <250) of lines, developed either by recurrent selfing or the doubled-haploid (DH) technology (Albrecht et al. 2011). Since expected differences among BPFs can be reliably predicted based on the mean performance of their parents (Melchinger 1987), GP applied to populations comprising multiple BPFs aims primarily at the identification of superior lines within these families (Riedelsheimer et al. 2013). Prediction models such as genomic best linear unbiased prediction (GBLUP) allow capturing Mendelian sampling—responsible for variation in the breeding values of siblings within BPFs—through cosegregation of SNP markers with quantitative trait loci (QTL) (Habier et al. 2013). While several studies have investigated the accuracy of GP within and across BPFs, more attention is needed to assess the mean and variation of PA for training sets taken from full-sib (FSF), half-sib (HSF) or unrelated families (URF). Experimental results available so far are confined by the number and size of BPFs (Riedelsheimer et al. 2013; Lehermeier et al. 2014) and low marker density (Jacobson et al. 2014; Lian et al. 2014).

Model training with individual BPFs has been studied intensively, and PA has been generally more promising for “within-family GP” than “across-family GP” (Riedelsheimer et al. 2013). Various authors argued that for a given size of the training set, within-family GP would provide the highest possible PA owing to strong linkage disequilibrium (LD) between SNPs and QTL due to cosegregation and the same set of loci being polymorphic in the prediction and training set (Crossa et al. 2014; Lehermeier et al. 2014). Nevertheless, Lian et al. (2014) reported for within-family GP substantial variation in PA among 969 BPFs and various traits, in line with the results of other studies on BPFs (Riedelsheimer et al. 2013; Jacobson et al. 2014; Lehermeier et al. 2014). However, a systematic investigation on the extent and factors determining the mean and variation in PA among BPFs and traits is, to the best of our knowledge, not available to date.

Since PA increases with closer pedigree relationships between training and predicted individuals (Habier et al. 2010; Clark et al. 2012), one obvious strategy is to use HSFs with one common parent between the training family (BPF_train) and the predicted family (BPF_pred) in across-family GP. Compared to within-family GP, PA for this strategy was generally much lower with the same sample size, but can reach similar levels if the sample size is strongly extended (Lehermeier et al. 2014). By comparison, model training with only unrelated BPFs produced from the same ancestral population yields often poor or even negative PA (Riedelsheimer et al. 2013; Jacobson et al. 2014; Schopp et al. 2017). Optimizing training set designs in GP with BPFs therefore requires better insights into how the pedigree relationship between BPFs, the sample size, and the heritability affect the mean and the variation in PA. Herein, we address these factors for the simple case of GP across individual pairs of BPFs, thereby providing a starting point for further investigations on the design of multi-family training sets in plant breeding.

Forecasting PA based on existing molecular and phenotypic data could assist breeders in (i) choosing the most suitable BPFs for model training for prediction of existing or planned BPFs, and (ii) allocating resources to the training and prediction sets. Daetwyler et al. (2008, 2010) derived a deterministic equation for forecasting PA, which requires only population parameters (sample size $N_{train}$, heritability $h^{2}$, and the effective number of chromosome segments $M_{e}$). When averaged over several traits, empirical and deterministic accuracy agreed well within BPFs (Lorenz 2013; Riedelsheimer et al. 2013; Lian et al. 2014). There is little consensus, however, regarding the calculation of $M_{e}$ in general (Goddard 2009; Meuwissen and Goddard 2010; Goddard et al. 2011; Wientjes et al. 2013), and, specifically, for BPFs (Lorenz 2013; Riedelsheimer and Melchinger 2013; Lian et al. 2014). Recently, Daetwyler’s equation was applied to both GP within and across cattle breeds (Wientjes et al. 2013, 2015). The authors extended Goddard et al.’s (2011) approach for calculating $M_{e}$ from the variance of genomic relationship coefficients to multiple populations. Overestimation of PA was attributed to a violation of Daetwyler’s assumption that the genetic variance in the prediction set is fully explained by marker effects estimated in the training set. An aggravation of this problem is expected for across-family GP with BPFs due to a high fraction of QTL and markers that are not consistently polymorphic across BPFs. Herein, we propose to extend Daetwyler’s equation to cope with this problem and make the equation applicable to across-family GP in plant breeding.

Alternatively, PA can be forecasted based on the estimated reliability of genomic-estimated breeding values (GEBVs) derived from selection index theory (VanRaden 2008). However, this approach has rarely been applied in plant breeding (Akdemir et al. 2015; He et al. 2016), and, to the best of our knowledge, not to GP of individual BPFs, despite promising results for GP within and across breeds of cattle (Hayes et al. 2009; Wientjes et al. 2013, 2015). One problem is that the approach was developed for outbred populations, and needs modifications when applied to inbred genotypes. Moreover, several strict assumptions regarding the properties of the genomic relationship matrix must be satisfied to obtain meaningful results, which will be elaborated in this paper for the case of BPFs in plant breeding.

The objectives of our study were to (i) investigate the mean and variation of empirical PA within and across BPFs of inbred lines, (ii) examine how the variation in PA is affected by differences in polymorphism at causal loci of polygenic traits between the training and prediction set, as well as by other factors (e.g., level of ancestral LD, pedigree relationship between BPFs, sample size, heritability), and (iii) adapt equations for deterministic forecasting of PA in BPFs of inbred genotypes and demonstrate their usefulness in simulated data sets. To simulate realistic scenarios, we used SNP data of inbred lines taken either from a public maize breeding program or a DH library of a European maize landrace and generated in silico numerous BPFs of DH lines. Besides flexibility in the choice of sample sizes, and exclusion of nuisance factors uncontrollable in experimental studies, this allowed us to simulate traits with known genetic architecture for a profound analysis of the causal factors affecting PA of GP within and across BPFs.

Materials and Methods

Ancestral populations

We considered two ancestral populations as source germplasm of parental genotypes for generating BPFs. Ancestral population Elite consisted of 72 elite inbred lines with medium long-range LD (Figure S1A in File S1) representative for the Flint heterotic group of the maize breeding program of the University of Hohenheim. Ancestral population Landrace consisted of 40 DH lines derived without any intentional selection from the German maize landrace “Gelber Badischer” with a rapid decay of LD to a low level (Melchinger et al. 2017). All lines were genotyped with the Illumina chip MaizeSNP50, containing 57,841 SNPs, and were expected to be fully homozygous. Markers monomorphic in the ancestral population or heterozygous in at least one individual were removed for further analysis. Physical map positions were converted into genetic map positions required for simulating meioses as described by Schopp et al. (2017). In total, we retained 19,204 and 16,171 SNPs for Elite and Landrace, respectively, distributed over the 10 maize chromosomes ranging in length from 137 to 276 cM (1913 cM in total). Individuals in the ancestral population were regarded as unrelated for defining pedigree relationships between subsequently generated BPFs.

Simulation of BPFs

For generating BPFs, we first sampled at random $N_{P} = 25$ parent lines from each ancestral population, and intermated them according to a half-diallel design to generate all $\binom{N_{P}}{2} = 300$ possible crosses. Subsequently, 1500 DH lines were derived from each F₁ cross to obtain the BPFs used for further analyses. According to the half-diallel, each predicted family (BPF_pred $= A$) was associated with several possible training families (BPF_train $= B$) with different pedigree relationships to $A$. These were: (i) one FSF, corresponding to $A = B$; (ii) $2 (N_{P} - 2) = 46$ HSF $B$, sharing one common parent with $A$; and (iii) $\binom{N_{P}}{2} - 2 (N_{P} - 2) - 1 = 253$ URF $B$, sharing no common parent with $A$. Meioses for in silico production of DH lines were simulated with the R package Meiosis (Müller and Broman 2017).
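The family counts stated above can be checked by enumerating the half-diallel directly. The following is a minimal sketch in Python (the study itself used R); the function name `classify_families` and the integer parent labels are illustrative assumptions, not part of the original workflow.

```python
from itertools import combinations


def classify_families(n_parents=25):
    """Enumerate all half-diallel crosses and classify every cross relative
    to one predicted family by the number of shared parents."""
    crosses = list(combinations(range(n_parents), 2))
    predicted = set(crosses[0])  # any cross can serve as BPF_pred
    counts = {"FSF": 0, "HSF": 0, "URF": 0}
    for cross in crosses:
        shared = len(predicted & set(cross))
        if shared == 2:
            counts["FSF"] += 1  # same cross (A = B)
        elif shared == 1:
            counts["HSF"] += 1  # one common parent
        else:
            counts["URF"] += 1  # no common parent
    return len(crosses), counts


n_crosses, counts = classify_families(25)
print(n_crosses, counts)
```

Because of the symmetry of the design, the counts do not depend on which cross is taken as the predicted family.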

Description of factors analyzed

For systematic assessment of the factors influencing the distribution of the empirical PA, we defined various fixed and random factors (Table 1). As fixed factors, we considered (i) the ancestral population (Elite or Landrace), (ii) the pedigree relationship (FSF, HSF, or URF) between individuals in BPF_pred and BPF_train, (iii) the type of data (SNP marker genotypes or QTL genotypes) used to calculate the genomic relationship matrix $G$ for GBLUP, (iv) the sample size $N_{train} = 25, 100, 250$, and (v) the heritability of the trait $h^{2} = 0.3, 0.6, 1.0$. The idealistic scenario $h^{2} = 1$ was included to demonstrate how the variation in PA behaves when phenotypic accuracy is not a limiting factor. Random factors were the trait $T$, the BPF_pred $A$, the BPF_train $B$, as well as the actual sample of training individuals $R$ taken from $B$.

Table 1. Overview of factors with their corresponding levels analyzed in this study.

| Type | Factor | Model Parameter | Number of Factor Levels | Factor Levels |
|---|---|---|---|---|
| Fixed factors | Ancestral population | — | 2 | **Elite**, Landrace |
| | Pedigree relationship between training and predicted family | — | 3 | FSF, HSF, URF |
| | Data used to calculate the relationship matrix | — | 2 | QTL, **SNPs** |
| | Sample size ($N_{train}$) | — | 3 | 25, **100**, 250 |
| | Heritability ($h^{2}$) | — | 3 | 0.3, **0.6**, 1 |
| Random factors | Trait | $T$ | 50 | — |
| | Predicted family (BPF_pred) | $A$ | 50 | — |
| | Training family (BPF_train) | $B$ | 1 (FSF), 25 (HSF/URF) | — |
| | Training set sample | $R$ | 3 | — |


Default values for the standard scenario are indicated in boldface.

We simulated 50 truly polygenic traits $T = 1, \dots, 50$, each governed by 1000 QTL. First, we sampled at random a subset of 5000 SNP markers from all SNPs available in the ancestral population, corresponding to a marker density of 2.61 SNPs cM⁻¹. This fixed set of markers was used for GP of all traits, because resampling of SNP marker positions had a negligible influence on the results. Second, for each of the 50 traits we sampled at random the map positions of 1000 QTL from the remaining 14,204 and 12,171 SNPs in Elite and Landrace, respectively. Following Meuwissen et al. (2001), effects of each QTL were drawn from a Gamma distribution $Γ (0.4, 1.66)$ with equal probability of effect signs. Importantly, all traits were affected by the same number of loci, but differed in the position and effects of QTL. Thus, the realized number of polymorphic QTL could vary depending on the trait and the BPF_pred and BPF_train.

Phenotypes $y$ of training individuals were simulated according to the model $y = g + e$ (cf. Goddard et al. 2011), where $g$ is the vector of true breeding values (TBVs) calculated as $g = Wa$, $W$ is the matrix of genotypic scores at QTL coded as 2 or 0, depending on whether a DH line was homozygous for the 1 or 0 allele, respectively, and $a$ is the vector of QTL effects. Vector $e$ contains independent normally distributed environmental noise variables, where variance $σ_{e}^{2}$ was assumed to be constant across BPFs derived from one ancestral population, implying independent environmental influence on the phenotypes. We calculated $σ_{e}^{2} = (\bar{σ}_{g}^{2} - h^{2} \bar{σ}_{g}^{2}) / h^{2}$, so that $\bar{σ}_{g}^{2} / (\bar{σ}_{g}^{2} + σ_{e}^{2}) = h^{2}$, where $h^{2}$ is the a priori specified heritability (cf. Table 1) and $\bar{σ}_{g}^{2}$ is the genetic variance within a BPF, averaged across all 300 BPFs and 50 traits simulated.
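The phenotype model $y = g + e$ with $g = Wa$ can be sketched as follows. This is an illustrative Python outline rather than the original R code. The function names are hypothetical, and reading $Γ(0.4, 1.66)$ as shape 0.4 and scale 1.66 is an assumption about the parametrization. The residual variance is chosen so that the mean genetic variance divided by the total variance equals the target heritability.

```python
import random


def residual_variance(mean_genetic_variance, h2):
    """Residual variance giving heritability h2 for a given mean genetic variance."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must lie in (0, 1]")
    return mean_genetic_variance * (1.0 - h2) / h2


def draw_qtl_effects(n_qtl, rng, shape=0.4, scale=1.66):
    """Gamma-distributed effect sizes with random sign (parametrization assumed)."""
    return [rng.gammavariate(shape, scale) * rng.choice((-1.0, 1.0))
            for _ in range(n_qtl)]


def simulate_phenotypes(W, a, sigma_e2, rng):
    """Return true breeding values g = W a and phenotypes y = g + e."""
    g = [sum(w * effect for w, effect in zip(row, a)) for row in W]
    sd = sigma_e2 ** 0.5
    y = [g_i + rng.gauss(0.0, sd) for g_i in g]
    return g, y


rng = random.Random(1)
a = draw_qtl_effects(5, rng)
W = [[2, 0, 2, 0, 0], [0, 2, 2, 2, 0]]  # toy QTL scores of two DH lines
g, y = simulate_phenotypes(W, a, residual_variance(3.0, 0.6), rng)
```

With $h^{2} = 1$ the residual variance is zero and phenotypes coincide with the TBVs, which corresponds to the idealistic scenario in Table 1.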

Finally, we sampled at random 50 out of the 300 BPFs, and considered them individually as the predicted family BPF_pred $A$. From the 1500 DH lines in each BPF_pred, we estimated GEBVs for the first 500 lines. For within-family GP, training individuals were sampled from the remaining 1000 lines to predict individuals within the same family ($A = B$, FSF). For across-family GP ($A \neq B$, HSF or URF), 25 BPF_train serving individually for model training were sampled from the 46 available HSFs and the 253 available URFs, respectively. For given BPF_pred and BPF_train, we sampled from BPF_train three disjoint samples $R$ of individuals of size $N_{train}$ (according to the fixed factor “sample size,” Table 1) with which the prediction model was trained. To minimize variation in PA attributable to sampling individuals from the BPF_pred, we chose $N_{pred} = 500$. By contrast, the numbers $N_{train}$ were of realistic magnitude, and analyzing repeated samples allowed us to quantify the variation in PA due to finite sampling in BPF_train.

Genomic prediction model

The GBLUP model can be written as $y = 1 μ + Zu + ε$, where $μ$ is the general mean, $Z$ is an incidence matrix linking phenotypes with breeding values, and $u$ is the vector of random breeding values with mean zero and variance-covariance matrix $var (u) = G \circ Σ$, where $G$ is the genomic relationship matrix and $Σ = \begin{bmatrix} J_{AA} σ_{u, A}^{2} & J_{AB} σ_{u, A} σ_{u, B} r_{AB} \\ J_{BA} σ_{u, A} σ_{u, B} r_{AB} & J_{BB} σ_{u, B}^{2} \end{bmatrix}$. Here, $σ_{u, A}^{2}$ and $σ_{u, B}^{2}$ are the additive variances in the noninbred reference population of BPF_pred and BPF_train, respectively, which correspond to their (outbred) F₂ generation; $J_{AA}$, $J_{BB}$, and $J_{AB} = J_{BA}$ are matrices of 1’s; $r_{AB}$ is the genetic correlation between populations $A$ and $B$, which was assumed to be equal to 1 for reasons detailed in the discussion; and $\circ$ symbolizes the Hadamard product. Vector $ε$ contains random residuals with mean zero and $var (ε) = I σ_{ε}^{2}$, where $I$ is an identity matrix and $σ_{ε}^{2}$ is the residual error variance. We used $G = \begin{bmatrix} G_{AA} & G_{AB} \\ G_{BA} & G_{BB} \end{bmatrix}$, representing a modified version of the block-structured genomic relationship matrix devised by Chen et al. (2013), where the across-population blocks $G_{AB} = G_{BA}^{T}$ had elements

$$G_{A_{i}, B_{j}} = \frac{\sum_{k} (x_{A_{i}, k} - 2 p_{A, k}) (x_{B_{j}, k} - 2 p_{B, k})}{\sqrt{2 \sum_{k} p_{A, k} (1 - p_{A, k})} \sqrt{2 \sum_{k} p_{B, k} (1 - p_{B, k})}}, \tag{1}$$

and $x_{A_{i}, k}$ and $x_{B_{j}, k}$ are the genotypic scores of DH lines $i$ and $j$ in population $A$ and $B$ at locus $k$, respectively, coded as 2 and 0, and $p_{A, k}$ and $p_{B, k}$ are the allele frequencies at locus $k$ in $A$ and $B$, respectively, where $k = 1, \dots, 1000$ or $k = 1, \dots, 5000$ depending on whether QTL or SNPs were used to calculate $G$ (according to the fixed factor “data,” Table 1). Submatrices $G_{AA}$ and $G_{BB}$ are calculated accordingly, but here the denominator simplifies to $2 \sum_{k} p_{A, k} (1 - p_{A, k})$ and $2 \sum_{k} p_{B, k} (1 - p_{B, k})$, respectively, corresponding to the standard $G$ matrix without subpopulation structure (Habier et al. 2007; VanRaden 2008). Importantly, the denominator for matrix $G_{AB}$ in Equation 1 is different from that in Chen et al. (2013), who used $2 \sum_{k} \sqrt{p_{A, k} (1 - p_{A, k}) p_{B, k} (1 - p_{B, k})}$. Their approach effectively removes all loci that are monomorphic in $A$ and/or $B$, whereas our denominator retains these loci in the scaling of $G$, yielding a better approximation of the true relationship matrix, as discussed below.

In any BPF derived from fully homozygous parents, the expected allele frequency of a locus is known to be either 0, 0.5, or 1, depending on the genotypes of the parents. These expected frequencies were used in the computation of genomic relationships. Since, in our study, only population $B$ had phenotypes, we used a single-group GBLUP model. Although we allowed for heterogeneous genetic variances among BPFs in the general model (Equation 1) and the derivation of reliability described below (see Appendix B), $σ_{u, A}^{2}$ enters the computation of GEBVs in $A$ as a constant factor (see Equation B4) and, hence, does not affect the empirical PA. Estimates $\hat{σ_{u, B}^{2}}$ and $\hat{σ_{ε, B}^{2}}$ for BPF_train were obtained by restricted maximum likelihood from the $N_{t r a i n}$ individuals in the training set using the mixed.solve function from R-package rrBLUP (Endelman 2011). The empirical PA was calculated as the correlation between GEBVs $\hat{u}$ and the TBVs $u$ for the 500 predicted individuals in BPF_pred.
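As an illustration of Equation 1 with expected allele frequencies, the sketch below computes a single genomic relationship coefficient in plain Python. The helper names and toy genotypes are hypothetical. Genotypes are coded 0/2, and the expected frequency in a family is taken as the mean parental score divided by two, which yields 0, 0.5, or 1.

```python
import math


def expected_freqs(parent1, parent2):
    """Expected allele frequencies (0, 0.5 or 1) in a BPF from homozygous parents."""
    return [(x1 + x2) / 4.0 for x1, x2 in zip(parent1, parent2)]


def scaling(p):
    """Family-specific scaling term 2 * sum_k p_k (1 - p_k)."""
    return 2.0 * sum(pk * (1.0 - pk) for pk in p)


def relationship(x_i, x_j, p_a, p_b):
    """Element of G for line i from family A and line j from family B (Equation 1).
    With p_a == p_b this reduces to the usual within-family coefficient."""
    numerator = sum((xi - 2.0 * pa) * (xj - 2.0 * pb)
                    for xi, xj, pa, pb in zip(x_i, x_j, p_a, p_b))
    return numerator / (math.sqrt(scaling(p_a)) * math.sqrt(scaling(p_b)))


# Toy half-sib example with four loci; family A = P1 x P2, family B = P2 x P3.
p1, p2, p3 = [2, 2, 0, 0], [0, 2, 2, 0], [0, 0, 0, 2]
p_a, p_b = expected_freqs(p1, p2), expected_freqs(p2, p3)
line_a = [2, 2, 2, 0]  # a DH line from A
line_b = [0, 2, 0, 2]  # a DH line from B
g_aa = relationship(line_a, line_a, p_a, p_a)
g_ab = relationship(line_a, line_b, p_a, p_b)
```

In this toy case the diagonal element equals 2, in line with $1 + F = 2$ for fully inbred lines, and only the locus segregating in both families contributes to the numerator of the across-family coefficient.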

Analysis of variance of empirical prediction accuracies

For each possible combination of fixed factors (cf. Table 1), we partitioned $σ_{total}^{2}$, the total variance of the empirically observed PA $ρ_{ABRT}$, into variance components caused by each random factor, where we assumed a hierarchical structure for BPF_pred $A$, BPF_train $B$, and the training set sample $R$, as well as cross-classification with factor trait $T$. Estimates of the variance components were obtained from the following random-effects model using function lmer of R package lme4 (Bates et al. 2015):

$$ρ_{ABRT} = μ + A + A : B + A : B : R + T + A \times T + (A : B) \times T + (A : B : R) \times T, \tag{2}$$

where $μ$ is the overall mean of PA for each of the three pedigree relationships (FSF, HSF, and URF) between individuals in $A$ and $B$ analyzed; $A$ is the effect of the BPF_pred; $A : B$ is the effect of the BPF_train $B$ nested within $A$; $A : B : R$ is the effect of the $r$th sample of training individuals from $B$ nested within $A$; $T$ is the effect of the trait; $A \times T$ is the interaction effect of BPF_pred $A$ with trait $T$; $(A : B) \times T$ is the interaction effect of BPF_train $B$ nested within $A$ with trait $T$; and $(A : B : R) \times T$ is the interaction effect of the training set sample nested within $A : B$ with trait $T$, which corresponds to the residual error of the model. In the case of FSF ($A = B$), all random factors involving $B$ were dropped. The degrees of freedom for each factor are shown in Table S2 in File S1.

Deterministic equations for forecasting prediction accuracy (PA)

We followed the theoretical framework of Wientjes et al. (2015) for forecasting PA within and across populations using two deterministic equations. Both equations assume that actual relationships regarding QTL are known, and were originally developed for outbred individuals. Hence, modifications are required to apply the equations to inbred individuals. As mentioned above, the outbred reference population corresponding to a BPF of fully inbred (DH) lines with an inbreeding coefficient of $F = 1$ is the F₂ generation. The level of inbreeding in BPFs of DH lines is reflected in the diagonal elements of $G$ calculated according to Equation 1, yielding $G_{i i} = 1 + F_{i} = 2$ in the special case of BPFs derived from homozygous parents.

The first approach is based on the reliability of GEBVs of each individual in $A$ (VanRaden 2008; Wientjes et al. 2013, 2015). Using the formula for the reliability of a selection index given by Mrode (2005, p. 15) and replacing the genetic covariance matrices by the genomic relationship matrices [multiplied by the corresponding genetic (co)variance components] yields the following formula that accounts for inbreeding in the predicted individual (see Appendix B):

$$r_{A_{i}}^{2} = r_{AB}^{2} \frac{G_{A_{i}, B}^{T} \left[ G_{BB} + R \frac{\hat{σ}_{ε, B}^{2}}{\hat{σ}_{u, B}^{2}} \right]^{-1} G_{A_{i}, B}}{G_{A_{i}, A_{i}}}, \tag{3}$$

where $r_{AB}^{2}$ is the squared genetic correlation between $A$ and $B$ (here $r_{AB}^{2} = 1$), $G_{A_{i}, B}$ is the vector of genomic relationships of individual $i$ in $A$ with all training individuals of $B$, $R = I$ is an identity matrix when assuming independent residual error variances, and $G_{A_{i}, A_{i}}$ is the relationship of individual $i \in A$ with itself, providing an estimate of $1 + F_{i}$. Dividing by $G_{A_{i}, A_{i}}$ ensures that reliabilities are correctly scaled, given that variance components and inbreeding refer to an outbred reference population, as is the case when calculating $G$ according to Equation 1 (see Appendix B). The deterministic PA in population $A$ was subsequently obtained by averaging over all individuals in $A$ as $ρ^{W} = \sqrt{\frac{1}{N_{A}} \sum_{i = 1}^{N_{A}} r_{A_{i}}^{2}}$, where in our case $N_{A} = 500$.
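A small numerical sketch of Equation 3 and of the averaging step for $ρ^{W}$ is given below. It is written in plain Python with a hand-rolled linear solver to stay self-contained. The function names are hypothetical, `lam` stands for the ratio of estimated residual to additive variance, and $R = I$ is assumed as in the text.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def reliability(g_ib, g_bb, g_ii, lam, r_ab=1.0):
    """Reliability of the GEBV of one predicted individual (Equation 3, R = I)."""
    n = len(g_ib)
    lhs = [[g_bb[r][c] + (lam if r == c else 0.0) for c in range(n)]
           for r in range(n)]
    weights = solve(lhs, g_ib)
    quad_form = sum(a * b for a, b in zip(g_ib, weights))
    return r_ab ** 2 * quad_form / g_ii


def deterministic_accuracy(reliabilities):
    """Square root of the mean reliability across predicted individuals."""
    return math.sqrt(sum(reliabilities) / len(reliabilities))


# Toy case: one training line with G_BB = 2 and a relationship of 1 to the candidate.
rel = reliability([1.0], [[2.0]], 2.0, lam=2.0)
```

For realistic training set sizes a numerical library would be the natural choice; the explicit solver here only serves to make the individual steps visible.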

The second equation was proposed by Daetwyler et al. (2008, 2010) and is based solely on population parameters. We modified it to account for unexplained variance in $A$ arising from different markers segregating in $A$ and $B$ (in cases where $A \neq B$):

$$ρ^{D} = \sqrt{θ_{AB} \frac{N_{train} h^{2}}{N_{train} h^{2} + M_{e}}}, \tag{4}$$

with $θ_{AB} = | L_{A \cap B} | / | L_{A} |$, where $| L_{A \cap B} |$ is the number of markers that segregate in both $A$ and $B$, and $| L_{A} |$ is the number of markers that segregate in $A$; $N_{train}$ is the sample size; $h^{2} = (1 + F_{B}) \hat{σ}_{u, B}^{2} / [(1 + F_{B}) \hat{σ}_{u, B}^{2} + \hat{σ}_{ε, B}^{2}]$, where $F_{B} = \overline{diag (G_{BB})} - 1$ is the average inbreeding coefficient of the individuals in $B$ and $\hat{σ}_{u, B}^{2}$ refers to the estimated additive variance in the (outbred) F₂ generation of $B$; and $M_{e}$ is the effective number of chromosome segments. Wientjes et al. (2015) proposed an estimator for $M_{e}$ across outbred populations, which is calculated as

$$M_{e} = \frac{1}{var [G_{A_{i}, B_{j}} - E (G_{A_{i}, B_{j}})]}, \tag{5}$$

where $G_{A_{i}, B_{j}}$ contains all genomic relationships between individuals $i$ from $A$ and training individuals $j$ from $B$. Given a uniform pedigree relationship between individuals in $A$ and $B$ (e.g., FSF, HSF, or URF), the denominator simplifies to $var (G_{A_{i}, B_{j}})$, because $E (G_{A_{i}, B_{j}})$ is constant. If the individuals $i$ from $A$ and $j$ from $B$ have inbreeding coefficients $F_{A}$ and $F_{B}$, respectively, we propose to use (see Appendix C):

$$M_{e} = \frac{(1 + F_{A}) (1 + F_{B})}{var (G_{A_{i}, B_{j}})}. \tag{6}$$

For DH lines from BPFs, $1 + F_{A} = \overline{diag (G_{AA})} = 2$ and $1 + F_{B} = \overline{diag (G_{BB})} = 2$, so that $M_{e} = 4 / var (G_{A_{i}, B_{j}})$, which was herein used as estimator for $M_{e}$.
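Equations 4 and 6 can be combined in a few lines. The sketch below is a Python illustration with hypothetical function names and toy inputs. Using the sample variance of the relationship coefficients is an assumption, since the text does not state which variance estimator was used.

```python
import math
import statistics


def effective_segments(g_ab_values, f_a=1.0, f_b=1.0):
    """Estimate M_e from across-family relationships (Equation 6).
    For DH lines F_A = F_B = 1, so the numerator equals 4."""
    return (1.0 + f_a) * (1.0 + f_b) / statistics.variance(g_ab_values)


def daetwyler_accuracy(n_train, h2, m_e, theta_ab=1.0):
    """Modified Daetwyler forecast of PA (Equation 4)."""
    return math.sqrt(theta_ab * n_train * h2 / (n_train * h2 + m_e))


# Toy relationship coefficients between predicted and training lines (hypothetical).
m_e = effective_segments([0.1, -0.1, 0.3, -0.3])
rho_within = daetwyler_accuracy(100, 0.6, m_e, theta_ab=1.0)
rho_across = daetwyler_accuracy(100, 0.6, m_e, theta_ab=0.5)
```

Within a family $θ_{AB} = 1$, so the forecast reduces to the original Daetwyler form; for $θ_{AB} < 1$ the forecast shrinks by the factor $\sqrt{θ_{AB}}$.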

Comparison of empirical and deterministic prediction accuracies

For all analyses except the ANOVA of $ρ_{ABRT}$, we considered only one sample $R$ of training individuals and dropped index $R$ altogether. This simplifies the presentation of our results and corresponds to the realistic case of having only one specific sample of training individuals available. For comparison of PA between levels of fixed factors (e.g., between sample sizes, heritabilities, or ancestral populations), as well as for evaluating the overall agreement of empirical and deterministic PAs, we calculated the general mean of PA across all random factors $A$, $B$, and $T$, subsequently denoted as $\bar{ρ}$, $\bar{ρ}^{W}$, and $\bar{ρ}^{D}$ for the empirical PA and the two deterministic PAs, respectively.

Causal analysis of the variation in PA among traits in GP across BPFs

Preliminary analyses showed that PA varied substantially among traits in across-family GP for HSFs and URFs, although we assumed the same polygenic architecture for all 50 simulated traits. Therefore, we devised additional simulations to investigate the underlying cause(s), using assumptions warranting almost ideal conditions for GP to largely eliminate the influence of nuisance factors on PA. We restricted these simulations to HSFs to demonstrate the key points in a simple fashion. First, we (i) chose at random a pair of HSFs, BPF_pred $A$ and BPF_train $B$, produced from ancestral population Elite, and (ii) repeatedly sampled 1000 QTL positions from the entire set of 19,204 SNPs until we found a sample with $θ_{AB} \approx 0.4$, corresponding to the average value of $θ_{AB}$ for HSF in our study (Table 2). Second, given $A$ and $B$ and the 1000 QTL positions, we sampled 1000 sets of different QTL effects $a_{k}$ as described above. This resulted in 1000 traits with $θ_{AB} \approx 0.4$ and identical QTL positions, but different QTL effects. Finally, assuming $N_{train} = 250$, $h^{2} = 1$, and known QTL genotypes, we used RR-BLUP, which yields GEBVs equivalent to those of GBLUP (Habier et al. 2007), to identify among the 1000 traits the two with lowest and highest PA and retrieved the corresponding QTL effect estimates.

Table 2. Mean ($\pm$ SD) of the estimated number of effective chromosome segments ($M_{e}$) and the proportion of polymorphic loci in the predicted family $A$ that also segregate in the training family $B$ ($θ_{AB}$) with different pedigree relationships (FSF, HSF, and URF) between $A$ and $B$, derived either from ancestral populations Elite or Landrace.

| Ancestral Population | Pedigree Relationship | $M_{e}$ $\pm$ SD | $θ_{AB}$ $\pm$ SD |
|---|---|---|---|
| Elite | FSF | 21.00 ± 2.27 | 1.00 ± 0.00 |
| | HSF | 66.26 ± 27.03 | 0.50 ± 0.10 |
| | URF | 148.16 ± 77.87 | 0.40 ± 0.08 |
| Landrace | FSF | 22.24 ± 2.05 | 1.00 ± 0.00 |
| | HSF | 72.48 ± 24.83 | 0.50 ± 0.08 |
| | URF | 172.33 ± 77.03 | 0.40 ± 0.06 |


We surmised that variation in PA among traits arises from structural differences in the large chromosome segments containing cosegregating QTL alleles that DH lines inherit from their respective parents. To investigate this hypothesis, we analyzed the contribution of each chromosome segment along the entire genome to PA. The length of the chromosome segments within $A$ and $B$ was taken as the expected genetic map distance at which the LD between two QTL in BPFs falls below $r^{2} = 0.2$ (cf. Giraud et al. 2014), which amounted to 41 cM (cf. File S3 in Schopp et al. 2017). Using a sliding window approach, chromosome segments of this length were moved in steps of 5 cM along each chromosome, separately for each trait. Similar to Kemper et al. (2015), we subsequently calculated for each window $W$ the “local” TBV for all DH lines $i \in A$ in the BPF_pred as

$$TBV_{i, W} = \sum_{k \in W} x_{i, k} a_{k}, \tag{7}$$

where $x_{i, k}$ is the genotypic score coded (2,0) for DH line $i$ at QTL $k \in W$, and $a_{k}$ is the corresponding QTL effect. Analogously, we calculated the local GEBV in the BPF_pred as

$$GEBV_{i, W} = \sum_{k \in W} x_{i, k} \hat{a}_{k}, \tag{8}$$

where ${\hat{a}}_{k}$ is the estimate of $a_{k}$ obtained from RR-BLUP in BPF_train $B,$ provided $k$ segregated in $B,$ and otherwise ${\hat{a}}_{k} = 0.$ Subsequently, we calculated for each window $W$ the correlation between local TBVs and local GEBVs among all 500 DH lines in $A .$
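The sliding-window calculation of local TBVs, local GEBVs, and their correlation can be outlined as follows. This is a Python sketch with hypothetical function names and toy data. How windows were placed at chromosome ends is not described in the text, so starting at 0 cM and stopping when the window would exceed the chromosome length is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns None if either vector has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def window_starts(chrom_length, window=41.0, step=5.0):
    """Start positions (cM) of sliding windows along one chromosome."""
    starts, start = [], 0.0
    while start + window <= chrom_length:
        starts.append(start)
        start += step
    return starts or [0.0]  # chromosome shorter than one window


def local_values(genotypes, effects, positions, start, window=41.0):
    """Sum of x_ik * effect_k over QTL inside the window, for every DH line."""
    inside = [k for k, pos in enumerate(positions) if start <= pos < start + window]
    return [sum(line[k] * effects[k] for k in inside) for line in genotypes]


# Toy data: four DH lines, three QTL on one chromosome (positions in cM).
genotypes = [[2, 0, 2], [0, 2, 2], [2, 2, 0], [0, 0, 0]]
positions = [1.0, 10.0, 60.0]
true_effects = [1.0, -0.5, 2.0]
est_effects = [0.5, -0.25, 0.0]  # third QTL not segregating in BPF_train
r_first = pearson(local_values(genotypes, true_effects, positions, 0.0),
                  local_values(genotypes, est_effects, positions, 0.0))
r_last = pearson(local_values(genotypes, true_effects, positions, 30.0),
                 local_values(genotypes, est_effects, positions, 30.0))
```

In the toy data the last window contains only a QTL whose effect estimate is zero, so the local GEBVs are constant and no correlation can be computed for that window.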

Further, we defined chromosome segment substitution effects ($CSSE$) for the parental chromosome segments of $A$ as the sum of allele substitution effects across all QTL $k \in W$:

$$CSSE_{A, W} = \sum_{k \in W} δ_{A, k} a_{k} = \frac{1}{2} (TBV_{P_{A1}, W} - TBV_{P_{A2}, W}), \tag{9}$$

where $δ_{A, k} = (x_{P_{A1}, k} - x_{P_{A2}, k}) / 2$, $P_{A1}$ and $P_{A2}$ are the parents of $A$ with $P_{A2}$ being the common parent of $A$ and $B$. Thus, $δ_{A, k} = \pm 1$ if $P_{A1}$ and $P_{A2}$ carry different alleles at QTL $k$, and $δ_{A, k} = 0$ otherwise. Values $CSSE_{B, W}$ were calculated analogously with respect to parents $P_{B1}$ and $P_{B2} = P_{A2}$ of $B$. Note that $δ_{A, k} = δ_{B, k}$ if QTL $k$ segregates in both $A$ and $B$, i.e., $P_{A1}$ and $P_{B1}$ carry the same allele that is different from the allele in $P_{B2} = P_{A2}$. In contrast, $δ_{A, k} \neq δ_{B, k}$ implies that QTL $k$ segregates in exactly one of the two HSFs $A$ or $B$. Thus, $CSSE_{A, W} - CSSE_{B, W} \neq 0$ only if $δ_{A, k} \neq δ_{B, k}$ at one or more QTL $k \in W$, and the magnitude of this difference depends on (i) the subset of QTL $k \in W$ with $δ_{A, k} \neq δ_{B, k}$, (ii) the relative size of $a_{k}$ for each QTL in $W$ compared with the effects of other QTL in the genome, and (iii) whether these effects have identical sign or not, which is important, especially for QTL that are closely linked. Altogether, the magnitude of $CSSE_{A, W}$ and its difference to $CSSE_{B, W}$ for each trait along the genome were expected to strongly influence the PA of GEBVs in BPF_pred $A$, estimated on the basis of BPF_train $B$.
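To make the role of $δ_{A, k}$ and $δ_{B, k}$ concrete, the following Python sketch computes the chromosome segment substitution effects for a toy pair of half-sib families. The function names, parental genotypes, and QTL effects are hypothetical, and $δ$ is taken as the score of the non-common parent minus that of the common parent, divided by two, as defined in the text.

```python
def substitution_indicator(parent1, parent2):
    """delta_k = (x_parent1,k - x_parent2,k) / 2, i.e. +1, -1 or 0 per QTL."""
    return [(x1 - x2) / 2 for x1, x2 in zip(parent1, parent2)]


def csse(parent1, parent2, effects, window_loci):
    """Chromosome segment substitution effect for the QTL inside one window."""
    delta = substitution_indicator(parent1, parent2)
    return sum(delta[k] * effects[k] for k in window_loci)


# Half-sib pair: A = P_A1 x P_common, B = P_B1 x P_common (scores coded 2/0).
p_a1 = [2, 0, 2, 0]
p_b1 = [2, 2, 0, 0]
p_common = [0, 0, 0, 0]
effects = [1.0, -0.5, 2.0, 0.3]
window = [0, 1, 2, 3]

csse_a = csse(p_a1, p_common, effects, window)
csse_b = csse(p_b1, p_common, effects, window)
difference = csse_a - csse_b  # driven by the QTL segregating in only one family
```

Here the first QTL segregates in both families and cancels in the difference, whereas the second and third QTL each segregate in only one family and account for the entire difference between the two segment effects.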

All computations were carried out in the R statistical environment (R Core Team 2017).

Data availability

Genotypic data of the ancestral populations are available in File S2. All R packages used for simulating the data are publicly available. All simulation steps and equations are fully described within the manuscript.

Results

Means and variation of empirical PA

Figure 1A shows the distributions of empirical PA $ρ_{A B T} .$ For the standard scenario (ancestral population Elite, $N_{t r a i n} = 100, h^{2} = 0.6,$ and $G$ calculated from SNP markers, Table 1), the mean PA ( $\bar{ρ}$ ) across all pairs of BPF_pred and BPF_train and traits was highest for FSF (0.79, Table S3 in File S1), and decreased by 43% for HSF (0.45) and by 60% for URF (0.32). A reverse trend was observed for the SD of $ρ_{A B T},$ which amounted to 0.09 for FSF and more than doubled for HSF (0.20) and URF (0.22). The 5 and 95% quantiles of $ρ_{A B T}$ ranged from 0.61 to 0.89 for FSF, but from $0.07$ to $0.73$ for HSF and from $- 0.09$ to $0.64$ for URF.

Figure 1. (A) Boxplots of empirical prediction accuracies $ρ_{A B T}$ in BPFs of DH lines, and (B) variance components of different factors influencing the variation of $ρ_{A B R T}$. Parents of BPFs were sampled from ancestral population *Elite*, and SNP markers were used to calculate the genomic relationship matrix $G$. Results are shown for different pedigree relationships (FSF, HSF, and URF) between the predicted family (BPF_pred) $A$ and training family (BPF_train) $B,$ as well as for different sample sizes $N_{t r a i n}$ and heritabilities $h^{2}$.

For $h^{2} = 0.6,$ reducing $N_{t r a i n}$ from 100 to 25 resulted in 28–31% lower $\bar{ρ}$ and increasing $N_{t r a i n}$ to 250 resulted in 12–18% higher $\bar{ρ}$ for all pedigree relationships (Figure 1A). The SD increased for $N_{t r a i n} = 25$ by 84% for FSF, but only by 11 and 4% for HSF and URF, respectively, because it was already large under $N_{t r a i n} = 100.$ For $N_{t r a i n} = 250,$ the SD decreased by 42% for FSF, yet only by 6% for HSF and 4% for URF. Altering $h^{2}$ for $N_{t r a i n} = 100$ affected the PA in a similar way to altering $N_{t r a i n}$ under fixed $h^{2} .$ In comparison with $h^{2} = 0.6,$ $\bar{ρ}$ was reduced by 18–20% for $h^{2} = 0.3$ and increased by 20–32% for $h^{2} = 1.0,$ depending on the pedigree relationship. The corresponding SDs changed considerably for FSF (+57 and −68%), but only marginally for HSF (+8 and −11%) and URF (+4 and −7%).

Deriving BPFs from ancestral population Landrace instead of Elite generally reduced $\bar{ρ}$ by <0.05, whereas the SD remained nearly identical (Figure 2A and Table S3 in File S1). By comparison, calculating the $G$ matrix from QTL instead of SNP data increased $\bar{ρ}$ by only 0.02, 0.03, and 0.05 for FSF, HSF, and URF, respectively, but hardly affected the SD, regardless of the pedigree relationship and the ancestral population.

Analysis of variance of random factors affecting the empirical PA

Estimates of $σ_{t o t a l}^{2}$ for $ρ_{A B R T}$ were of similar magnitude for HSF and URF, but generally much smaller for FSF (Figure 1B). For the standard scenario, $σ_{t o t a l}^{2}$ was small for FSF (0.01) and primarily attributable to $σ_{A \times T}^{2} .$ By comparison, $σ_{t o t a l}^{2}$ was 5.3 and 6.6 times larger for HSF and URF, respectively, with >50% contributed by $σ_{(A : B) \times T}^{2},$ followed by the residual variance $σ_{(A : B : R) \times T}^{2}$ (26 and 19%, respectively). All variance components not involving factor $T$ were substantially smaller, with $σ_{A : B}^{2}$ contributing most for HSF (9%) and URF (6%).

Decreasing $N_{t r a i n}$ to 25 or $h^{2}$ to 0.3 affected the relative importance and overall magnitude of the variance components similarly for the three pedigree relationships (Figure 1B). The residual variances $σ_{(A : R) \times T}^{2}$ (FSF) and $σ_{(A : B : R) \times T}^{2}$ (HSF, URF) increased substantially, accompanied by a moderate increase in $σ_{A \times T}^{2}$ for FSF and decrease in $σ_{(A : B) \times T}^{2}$ for HSF and URF. Conversely, increasing $N_{t r a i n}$ to 250 or $h^{2}$ to 1.0 strongly reduced the residual variances and nearly eliminated $σ_{t o t a l}^{2}$ for FSF, whereas, for HSF and URF, $σ_{t o t a l}^{2}$ remained large owing to a high $σ_{(A : B) \times T}^{2},$ even under these favorable conditions.

Deriving BPFs from ancestral population Landrace instead of Elite had almost no effect on $σ_{t o t a l}^{2}$ and its components (Figure 2B). Calculating the $G$ matrix from QTL instead of from SNP genotypes moderately reduced $σ_{t o t a l}^{2}$ by 5% for HSF and 10% for URF, mainly due to decreasing $σ_{(A : B) \times T}^{2} .$ For FSF, $σ_{t o t a l}^{2}$ was already minor when using SNP genotypes, which left little room for improvement from QTL genotypes; accordingly, the absolute changes in the variance components were larger for HSF and URF than for FSF.

Comparison of empirical and deterministic prediction accuracies

Figure 3 shows scatter plots for empirical versus deterministic prediction accuracies for the standard scenario. In general, empirical and deterministic accuracies for single traits agreed relatively well for FSF ( $r_{ρ, ρ^{W}} = 0.65$ and $r_{ρ, ρ^{D}} = 0.61$ ), but rather weakly for HSF (0.43 and 0.42, respectively) and URF (0.33 and 0.32, respectively). By comparison, the correlations between the means of empirical and deterministic accuracies across the 50 traits increased for FSF ( $r_{ρ, ρ^{W}} = 0.81$ and $r_{ρ, ρ^{D}} = 0.65$ ), but even more so for HSF (0.94 and 0.92, respectively) and URF (0.89 and 0.88, respectively), indicating that trait-specific deviations from the mean empirical accuracy hamper the agreement with deterministic accuracies, particularly for HSF and URF.
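
The contrast between single-trait and trait-averaged correlations can be reproduced qualitatively with a toy simulation. The sketch below is not based on the study's data: it assumes that each family pair has an expected accuracy that a deterministic forecast recovers exactly, and that each trait adds an independent deviation. The numbers of pairs and traits, the uniform range, and the SD of 0.2 are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(42)
n_pairs, n_traits = 200, 50  # hypothetical numbers of family pairs and traits

# Assumed expected accuracy per family pair (what a forecast would target).
expected = rng.uniform(0.1, 0.7, size=n_pairs)
# Assumed trait-specific deviations around that expectation.
deviation = rng.normal(0.0, 0.2, size=(n_pairs, n_traits))
empirical = np.clip(expected[:, None] + deviation, -1.0, 1.0)

# Single-trait level: every (pair, trait) observation against the pair forecast.
forecast = np.repeat(expected[:, None], n_traits, axis=1)
r_single = float(np.corrcoef(empirical.ravel(), forecast.ravel())[0, 1])

# Trait-averaged level: mean empirical accuracy per pair against the forecast.
r_means = float(np.corrcoef(empirical.mean(axis=1), expected)[0, 1])

print(f"single traits: {r_single:.2f}, means over traits: {r_means:.2f}")
```

Averaging over traits removes most of the trait-specific deviation, so the correlation at the level of means is much closer to one than the single-trait correlation, even though the forecast itself is unchanged.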

Figure 3. Empirical prediction accuracy $ρ$ in BPFs of DH lines plotted against deterministic prediction accuracies $ρ^{W}$ and $ρ^{D}$. The top row refers to observations for single traits ( $ρ_{A T}$ for FSF and $ρ_{A B T}$ otherwise), and the bottom row to means over traits ( $\bar{ρ}_{A}$ for FSF and $\bar{ρ}_{A B}$ otherwise). Parents of BPFs were sampled from ancestral population *Elite* and genotypes at SNP markers were used to calculate the genomic relationship matrix $G$. Results are shown for a random sample of 10,000 data points, $N_{t r a i n} = 100$ and $h^{2} = 0.6.$

For the general mean of empirical and deterministic PA across $A, B,$ and $T,$ $\bar{ρ^{W}}$ matched very well with $\bar{ρ}$ for all pedigree relationships and values of $N_{t r a i n}$ and $h^{2}$ (Figure S2 in File S1). By comparison, $\bar{ρ^{D}}$ generally underestimated $\bar{ρ}$ with increasing bias for HSF and URF as compared with $\bar{ρ^{W}}$ (Figure S3 in File S1), and particularly for smaller values of $N_{t r a i n}$ and $h^{2}$ (Figure S2 in File S1). Calculating the $G$ matrix from QTL instead of from SNP genotypes hardly influenced the bias of deterministic accuracies (Figure S4 in File S1) and the correlations with empirical accuracies.

Causal analysis of the variation in PA among traits

Figure 4 compares two traits T1 and T2 with divergent PA for one representative pair of HSFs. Both traits had identical QTL positions and QTL genotypes in the BPF_pred $A$ and BPF_train $B$ but different QTL effects; 376 QTL segregated in $A,$ 286 in $B,$ and 151 of them jointly in $A$ and $B .$ For trait T1 with high $ρ_{A B T} = 0.92,$ the differences between chromosome segment substitution effects (CSSE) in $A$ and $B$ were generally small across the entire genome, in particular on chromosomes 2, 3, and 9, with sizeable CSSEs (Figure 4A). Conversely, for trait T2 with low $ρ_{A B T} = - 0.04,$ the CSSEs in $A$ and $B$ differed substantially over large parts of the genome, and showed even opposite signs on several chromosomes.

Figure 4. (A) Chromosome segment substitution effects ( $CSSE_{A, W}$ in red and $CSSE_{B, W}$ in blue) and correlation between local TBVs and local GEBVs in the predicted family $A$ (green) averaged in sliding windows $W$ (see *Materials and Methods* for definition). GEBVs were calculated from QTL effects estimated by RR-BLUP in training set (HSF) $B$. Results are shown for $N_{t r a i n} = 250$ and two traits T1 and T2 with $h^{2} = 1$ and large differences in prediction accuracy $ρ$. Both traits were generated from the same set of 1000 QTL with $θ_{A B} \approx 0.40,$ but different QTL effects. (B) Correlation between local TBVs and local GEBVs (green lines) shown together with true QTL effects (diamonds) and estimated QTL effects (circles) for T1 and T2 in $B$ on chromosome 5. Colors indicate QTL segregating in both $A$ and $B$ (orange) or only in $A$ (purple); grey bars in the background reflect the windows $W$.

The correlations between local TBVs and local GEBVs of the DH lines $i \in A$ were closely associated with the differences between the CSSEs for $A$ and $B$ in the corresponding windows $W$ (Figure 4A). If the difference in the CSSE for a segment was small, the correlation was generally high, particularly if both CSSEs in $A$ and $B$ had large magnitude and identical sign (see chromosomes 2, 3, and 9 for trait T1). Conversely, if the CSSEs for a window $W$ differed and had opposite sign in $A$ and $B,$ the correlation between local TBV and local GEBV dropped substantially, and frequently became negative (see chromosomes 2, 5, and 8 for trait T2). Overall, the proportion of the genome showing low or even negative correlations was much smaller for trait T1 with high PA than for trait T2 with low PA.

Zooming into chromosome 5—which had a large impact on the differences between the two traits—revealed that for trait T1, all large-effect QTL that segregated in $A$ also segregated in $B$ (Figure 4B). However, for trait T2, there was a large-effect QTL that segregated only in $A$ in windows $W$ with low correlation between local TBVs and local GEBVs. Neighboring windows not harboring this QTL showed higher correlations. The trends for this exemplary chromosome were consistent with other chromosomes and other HSF pairs $A$ and $B,$ as well as other traits with high and low PA (results not shown).

Discussion

Experimental studies showed that PA can be highly variable for GP within, but even more so across BPFs. Moreover, PA was found to vary substantially among different target traits for distinct pairs of training and predicted families. Investigating the causes for this variability is hardly possible based on experimental data due to the limited number and sample size of available BPFs, and the generally unknown genetic architecture of agronomically important traits. Here, we used computer simulations to analyze in detail why PA varies among different combinations of training sets, prediction sets, and polygenic traits. Moreover, we demonstrate that modification of available deterministic equations enables accurate estimates of PA averaged across many polygenic traits for both within-family GP and across-family GP.

Variation in PA within and across biparental families

The average PA decreased under small $N_{t r a i n}$ and low $h^{2}$ (Figure 1A) for all pedigree relationships, as expected from theory (Daetwyler et al. 2008). This was accompanied by an increase in the variation of PA (Figure 1A), which was large for FSF but only modest for HSF and URF, and which was mainly caused by inflated residual errors [ $σ_{(A : R) \times T}^{2}$ for FSF, $σ_{(A : B : R) \times T}^{2}$ for HSF and URF, Figure 1B]. These errors capture the variation in PA that arises due to the random sampling of (i) individuals (genotypes) from the BPF_train, and (ii) their corresponding phenotypes for a specific trait. The larger residual errors in across-family GP are presumably due to incongruent sets of QTL segregating in pairs of HSFs and URFs, which can vary substantially across traits, as reflected by the SD of $θ_{A B}$ (Table 2). The fact that predictions became much more robust under $N_{t r a i n} \geq 100$ and $h^{2} \geq 0.6$ illustrates that large sample sizes and heritabilities are mandatory to alleviate the trait-specific sampling variance in PA. Together with the generally optimal conditions in within-family GP (Crossa et al. 2014), this nearly eliminated all variation in PA for FSF (Figure 1).

The predicted family BPF_pred accounted only for a marginal proportion of variation in PA, irrespective of the pedigree relationship with BPF_train (Figure 1B, $σ_{A}^{2}$ ). For within-family GP (where BPF_train = BPF_pred), this implies that the genetic distance between the parents of a BPF has at best marginal influence on the average PA across traits, in agreement with previous studies (Lehermeier et al. 2014; Marulanda et al. 2015). This conclusion is further supported by the similar variation in PA among predicted families derived from the two ancestral populations ( $σ_{A}^{2},$ Figure 2B, FSF), despite the much weaker latent pedigree structure in Landrace compared with Elite (Figure S1B in File S1). By comparison, the generally substantial influence of $σ_{A \times T}^{2}$ in FSF (Figure 1B and Figure 2B) suggests that PA strongly depends on $h^{2}$ in the training set (Figure S5 in File S1), which can be highly variable among BPF × trait combinations (Figure S6 in File S1). This is in harmony with previous studies that attributed variation in PA partially to differences in the phenotypic variance of the training set (Lehermeier et al. 2014; Marulanda et al. 2015).

For across-family GP, the expected PA depends largely on the pedigree relationship (Habier et al. 2007; Riedelsheimer et al. 2013) and on the variation in across-family genomic relationships. Since genomic relationships across families have a zero mean (if calculated according to Equation 1), their variance is equal to the mean squared genomic relationship between training and predicted individuals (Wientjes et al. 2013). Generally, PA is expected to increase proportionally with these squared relationships. In the case of BPFs, genomic relationships between families are heavily influenced by the proportion of polymorphic markers in the BPF_pred ( $θ_{A B}$ ) segregating also in the BPF_train (Figure S7 in File S1). Therefore, PA for across-family GP depends primarily on the magnitude of $θ_{A B},$ because larger $θ_{A B}$ implies that a greater proportion of the genetic variance in the BPF_pred can be explained by the QTL in BPF_train. Accordingly, the variation in $θ_{A B}$ among combinations of different HSFs or URFs (Figure S1D in File S1) was largely responsible for the notable contribution of $σ_{A : B}^{2}$ to the total variation in PA (Figure 1B). Altogether, the much larger $σ_{t o t a l}^{2}$ for across-family GP, compared to within-family GP, was mainly due to the overriding influence of $σ_{(A : B) \times T}^{2}$ besides the considerable contribution of $σ_{A : B}^{2}$ to $σ_{t o t a l}^{2}$ (Figure 1B, FSF vs. HSF or URF). Unraveling the genetic causes for this complex interaction required additional analyses, which are discussed in depth in the next section.
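
Because $θ_{A B}$ depends only on which loci are polymorphic between the parents of each family, it can be computed from parental genotypes before any progeny are genotyped. The following Python sketch follows the verbal definition above (the proportion of loci polymorphic in the BPF_pred that also segregate in the BPF_train). The function name, the 0/2 coding of homozygous parents, and the six-locus toy genotypes are hypothetical.

```python
import numpy as np

def theta_ab(p_a1, p_a2, p_b1, p_b2):
    """Share of loci polymorphic in family A that also segregate in family B."""
    seg_a = p_a1 != p_a2          # loci segregating in the predicted family
    seg_b = p_b1 != p_b2          # loci segregating in the training family
    n_seg_a = int(seg_a.sum())
    if n_seg_a == 0:
        return float("nan")       # undefined if nothing segregates in A
    return float((seg_a & seg_b).sum() / n_seg_a)

# Hypothetical half-sib example: both families share the parent `common`.
common = np.array([0, 0, 0, 0, 0, 0])
p_a1 = np.array([2, 2, 2, 2, 0, 0])   # four loci segregate in A
p_b1 = np.array([2, 2, 0, 0, 2, 0])   # two of those four also segregate in B

theta_hsf = theta_ab(p_a1, common, p_b1, common)   # across-family case
theta_fsf = theta_ab(p_a1, common, p_a1, common)   # same family for both roles
```

Applied to the QTL counts reported in the Results for the HSF pair of Figure 4 (376 QTL segregating in $A$, 151 of them also in $B$), the same ratio gives 151/376 ≈ 0.40, which matches the $θ_{A B}$ stated in that figure's caption.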

Sampling of training individuals from a given BPF_train barely contributed to the variation in PA, for both within- and across-family GP (Figure 1B, $σ_{A : R}^{2}$ and $σ_{A : B : R}^{2}$ ). Thus, compared with structured populations or diversity panels, there is little room for improvement by applying optimization algorithms accounting for genomic relationships in the sampling of training individuals within BPFs (Rincent et al. 2012; Akdemir et al. 2015; Bustos-Korts et al. 2016), confirming previous findings (Lorenz and Smith 2015; Marulanda et al. 2015). This is because already modest sample sizes (e.g., $N_{t r a i n} = 50$ ) enable the Mendelian sampling term in the BPF_train to be sufficiently captured. Nevertheless, we recommend $N_{t r a i n} \geq 50$ to achieve a high mean and small variance of PA (Table S3 in File S1) arising from sampling of genotypes from a given BPF_train (Figure 1B).

Previous experimental studies found generally higher levels of variation in PA, particularly for within-family GP (Riedelsheimer et al. 2013; Lehermeier et al. 2014; Lian et al. 2014). This is most likely attributable to miscellaneous additional factors present in these studies, which were not accounted for in our simulations. These factors include (i) small prediction set size, (ii) analysis of different types of progeny (F₂ or backcross generations and DH lines derived from them), (iii) variation in QTL-SNP LD within BPFs due to low marker density, (iv) nonadditive gene action due to epistasis, and (v) estimation error in $h^{2},$ which affects calculation of PA from predictive ability. Further, the various agronomic traits investigated in the experimental studies likely differed in their genetic architecture, which further increases the total variation in PA compared with the polygenic traits simulated in our study ( $σ_{T}^{2},$ Figure 1B). Consequently, our results should be regarded as a lower bound for the variation in PA that must be expected in practice for a given $N_{t r a i n}$ and $h^{2} .$

Unraveling the variation among traits in across-family GP

We adopted the concept of local breeding values (cf. Kemper et al. 2015) to investigate the relationship between the strong variation in PA among traits and the large chromosome segments that DH lines of BPF inherit from their parents. The latter entails strong LD between QTL alleles and consequently small $M_{e}$ (Table 2), which is very different from the situation found in diverse populations such as cattle breeds ( $M_{e} \approx 1000$ ) (Daetwyler et al. 2010; Wientjes et al. 2013). Thus, only a small number of local TBVs contribute to the “global” TBV of predicted individuals. Similarly, the PA can be thought of as the average accuracy of local GEBVs estimated from the training data, weighted by their relative contribution to the global TBV in the BPF_pred. As a consequence of the small $M_{e}$ in BPFs, the accuracy of local GEBVs is prone to much larger sampling variance than would be the case in more diverse populations. To illustrate this point, we examined two traits with contrasting PA for a given pair of HSFs (Figure 4).

Of all QTL, only those that segregated in the BPF_pred (376/1000, Figure 4) contributed to the variance in local TBVs, which were estimated by local GEBVs from the training set. In our example, trait $T 1$ with $ρ_{A B T} = 0.92$ showed, on average, much higher correlations between local TBVs and local GEBVs in the BPF_pred along the entire genome than trait $T 2$ with $ρ_{A B T} = - 0.04$ (Figure 4A). For the trait with low PA, we found a larger proportion of local GEBVs that provided a false prediction signal, in the sense that negative effects were estimated for favorable parental chromosome segments and vice versa. These discrepancies between local TBVs and local GEBVs trace back to different chromosome segment substitution effects (CSSE, Equation 9) between the BPF_pred and BPF_train (Figure 4A), which, in the case of HSFs, occur if their noncommon parents carry different alleles at one or more QTL on the segment. If this is the case, one of the two BPFs will be monomorphic for the respective QTL. The effect of such a QTL compared with other QTL on a chromosome segment that may be polymorphic in both the BPF_pred and BPF_train determines the difference in CSSE between two families. For instance, if the variance in local TBVs among predicted individuals is dominated by a large-effect QTL that is monomorphic in the training set, the ranking of local GEBVs based on the other polymorphic QTL located on this segment might deviate substantially from the ranking of local TBVs, resulting in low local PA (Figure 4B, $T 2$ ). The frequency of inaccurate local GEBVs along the whole genome together with the variance explained by the corresponding local TBVs will finally determine the PA of across-family GP. Hence, two traits with the same number and positions of QTL might have very different PA, depending on the effects of QTL that are poly- or monomorphic across the training and prediction set.
This also explains why $θ_{A B},$ and thereby across-family genomic relationships, were closely associated with the average PA across many traits for different pairs of HSF and URF (Figure S7 in File S1), but poorly associated with PA $ρ_{A B T}$ for individual traits (Figure 3). Additional simulations showed further that reducing (i) the number of chromosomes on which QTL were located, or (ii) the total number of QTL, results in increased variation in PA (Figure S8 in File S1). Both these alterations reduce the number of local TBVs discernible for a trait, which underlines the relevance of small $M_{e}$ (i.e., a low number of segments carrying QTL) for the variation in PA.
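
The mechanism described above, in which a large-effect QTL that is monomorphic in the training family can reverse the local prediction signal, can be made concrete with a deliberately simplified sketch. It is not the simulation used in the study. It assumes a single window with three QTL and no recombination (each DH line inherits the whole segment from one parent), effects in the training family estimated without error, and arbitrary effect sizes. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# +1 / -1: which parental segment each of 500 DH lines inherited
# (assumption: no recombination within the window).
origin = rng.choice([-1.0, 1.0], size=500)

def local_accuracy(delta, a_true, a_used):
    """Correlation of local TBVs and local GEBVs for one non-recombining window."""
    x = np.outer(origin, delta)            # genotype scores of the lines in the window
    tbv, gebv = x @ a_true, x @ a_used
    return float(np.corrcoef(tbv, gebv)[0, 1])

delta_a = np.array([1.0, 1.0, -1.0])       # all three QTL segregate in the predicted family
shared = np.array([True, True, False])     # the third QTL is monomorphic in the training family

results = {}
for third_effect in (0.1, 1.5):            # small vs. large effect of the unshared QTL
    a_true = np.array([0.2, 0.3, third_effect])
    a_used = np.where(shared, a_true, 0.0) # unshared QTL gets an estimated effect of zero
    results[third_effect] = local_accuracy(delta_a, a_true, a_used)
    print(third_effect, round(results[third_effect], 3))
```

Under these assumptions the local accuracy is +1 when the unshared QTL has a small effect and switches to -1 when its effect outweighs the two shared QTL in the opposite direction. With recombination and estimation error the correlations would be less extreme, but the sign reversal follows the same logic.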

In conclusion, the large variation in PA among traits observed for across-family GP is caused by the strong LD among linked QTL within BPFs, and the resulting small effective number of chromosome segments contributing to polygenic traits, in combination with different QTL segregating across BPFs. Our analyses exemplify that BPFs represent a special case regarding the possibly strong fluctuations in PA, which is—to this extent—not expected for genetically more diverse populations.

Influence of LD in the ancestral population on the expected accuracy of GP across BPFs

Differences in the extent of LD in ancestral populations Elite and Landrace (Figure S1A in File S1) translated into sizable differences in QTL-SNP linkage phase similarity among URFs derived from these populations (Figure S1C in File S1). Surprisingly, this barely affected $\bar{ρ}$ across URFs (Figure 2A and Table S3 in File S1). The low relevance of linkage phase similarity across URFs was confirmed by the similar PAs when substituting the SNP- with a QTL-derived $G$ matrix (Figure 2A), which eliminates the influence of this factor. This reflects most likely the overriding influence of $θ_{A B}$ on PA across URFs, because the mean $θ_{A B}$ was similar for URFs derived from the two ancestral populations (Figure S1D in File S1). Thus, the higher mean in PA for HSFs compared with URFs seems to be attributable to higher $θ_{A B}$ values (Table 2) rather than to the fact that QTL-SNP linkage phases are always consistent across HSF (Lehermeier et al. 2014), but not necessarily across URF. This corrects a conjecture of Riedelsheimer et al. (2013), who suspected that low PA obtained from certain URFs was due to low linkage phase similarity with the respective BPF_pred.

Deterministic equations for forecasting PA within and across BPFs

Forecasting PA based on estimated reliabilities of GEBVs requires that unrelated individuals have an expected genomic relationship of zero (Goddard et al. 2011; Wientjes et al. 2015). This can be achieved by a block-structured $G$ matrix based on population-specific allele frequencies (e.g., Chen et al. 2013). Preliminary analyses showed that in the calculation of $G$ (Equation A5), correct treatment of SNPs polymorphic only in either BPF_train or in BPF_pred is very important. Different from empirical PAs, which remain unaffected by $θ_{A B} < 1$ (see Appendix A), deterministic PAs across BPFs can be grossly inflated by ignoring $θ_{A B} < 1$ in the calculation of $G$ (results not shown). While $θ_{A B}$ is generally high across diverse populations such as breeds of cattle (Matukumalli et al. 2009), it can fall to <0.4 across different BPFs produced from inbred parents in plant breeding (Figure S1D in File S1 and Table 2). Calculating $G$ according to our improved method (Equation 1) largely eliminated the bias in deterministic accuracies $ρ^{W}$ attributable to $θ_{A B} < 1$ and is therefore a prerequisite for applying Equation 3 to GP across BPFs.

Accounting for inbreeding in the original reliability equation (see Appendix B for derivation), together with the modifications of the $G$ matrix, resulted in excellent agreement between empirical and deterministic accuracies $ρ^{W}$ averaged across traits, which is consistent with the findings of Wientjes et al. (2015) for cattle populations. However, the trait-dependent variation in empirical PA observed for GP across BPFs cannot be accounted for by $ρ^{W} .$ This is because, for a given set of training and predicted individuals and two traits with the same $h^{2}$ but different QTL effects, the deterministic accuracy would be identical, yet the empirical accuracy can differ substantially, as illustrated in Figure 3 and Figure 4.

Forecasting PA within FSF by Daetwyler et al.’s (2008, 2010) equation based on population parameters has been widely used in plant breeding (Lorenz 2013; Riedelsheimer et al. 2013; Lian et al. 2014). However, estimates of $M_{e}$ can differ substantially (Riedelsheimer and Melchinger 2013; Wientjes et al. 2013) between the various proposed formulas to estimate $M_{e}$ from the effective population size $N_{e}$ and genome length (Goddard 2009; Meuwissen and Goddard 2010; Goddard et al. 2011). Moreover, estimation of $N_{e}$ itself is problematic, because it assumes a base population of unrelated founders, which is often impossible to define in practice (cf. Figure S1B in File S1, Elite). Following Goddard et al. (2011), we calculated $M_{e}$ directly from the variance of genomic relationships, with extensions devised by Wientjes et al. (2015, 2016) for GP across populations (Equation 5). This has the advantage that $M_{e}$ is computed from the actual genotypes for which the PA is to be forecasted. The calculation of $M_{e}$ required in Equation 4 must account for inbreeding (Equation 6), because the variance in genomic relationships increases with the inbreeding coefficient $F$ (see Appendix C). Ignoring inbreeding would result in underestimation of $M_{e},$ and strong overestimation of the deterministic accuracy $ρ^{D} .$

An important assumption of the equation of Daetwyler et al. is that the entire genetic variance in the prediction set is explained by QTL segregating in the training set (cf. $r_{Effect}$ in Wientjes et al. 2016). This holds true for FSF ( $θ_{A B} = 1$ ), but is violated for GP across BPFs ( $θ_{A B} ≪ 1,$ Table 2). As a solution for this problem, we propose multiplication with $θ_{A B}$ in calculating $ρ^{D}$ (Equation 4), which efficiently reduced the strong upward-bias observed otherwise (results not shown). With these modifications, empirical and deterministic accuracies $ρ^{D}$ agreed reasonably well when averaged across traits, but forecasting was problematic for individual traits for the same reasons as discussed above for $ρ^{W}$ (Figure 3). Compared with previous experimental studies (Riedelsheimer et al. 2013; Lian et al. 2014), we found overall better agreement of $ρ$ and $ρ^{D}$ for single traits in within-family GP (Figure 3). We suppose that, in addition to the lower variation in empirical PA (Figure 1), this is likely attributable to smaller deviations between estimated and true $M_{e}$ (Lian et al. 2014) when dealing with real traits of diverse genetic architecture.
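
To make the roles of $M_{e}$ and $θ_{A B}$ tangible, the sketch below evaluates the widely cited form of the equation of Daetwyler et al., $\sqrt{N h^{2} / (N h^{2} + M_{e})}$, with an optional scaling factor. Two points are assumptions rather than statements about Equation 4. First, the text only says that $ρ^{D}$ involves multiplication with $θ_{A B}$, so placing the factor inside the square root is an illustrative choice. Second, the $M_{e}$ values of 20 and 40 are arbitrary and are not taken from Table 2. The inbreeding adjustment of $M_{e}$ (Equation 6) is not reproduced here.

```python
import math

def daetwyler_accuracy(n_train, h2, m_e, theta=1.0):
    """Forecast accuracy sqrt(theta * N h^2 / (N h^2 + M_e)).

    theta = 1.0 gives the plain Daetwyler-type form; placing theta inside
    the square root is an assumption made for this illustration only.
    """
    return math.sqrt(theta * n_train * h2 / (n_train * h2 + m_e))

# Hypothetical inputs: N_train = 100, h2 = 0.6, and two arbitrary M_e values.
rho_me40 = daetwyler_accuracy(100, 0.6, m_e=40)        # reference forecast
rho_me20 = daetwyler_accuracy(100, 0.6, m_e=20)        # M_e underestimated by half
rho_theta = daetwyler_accuracy(100, 0.6, m_e=40, theta=0.4)  # across-family scaling

print(f"{rho_me40:.3f} {rho_me20:.3f} {rho_theta:.3f}")
```

The comparison of the first two values shows the direction of the bias discussed in the preceding paragraph: an underestimated $M_{e}$ raises the forecast. The third value shows how a $θ_{A B}$ well below one pulls the forecast down for across-family GP.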

An upward bias in deterministic PA must generally be expected if SNPs are not a good approximation of QTL due to incomplete QTL-SNP LD (cf. $r_{Effect}$ vs. $r_{LD}$ in Wientjes et al. 2016), leading to “missing heritability” in genomic studies (Yang et al. 2010). This is because empirical PA decreases as less variance at QTL is explained by SNPs under incomplete LD, whereas deterministic PA is hardly affected (Figure S9 in File S1). However, our results show that this is barely relevant in BPFs (Figure 3 vs. Figure S4 in File S1), if large chromosome segments are covered sufficiently by markers. Thus, a sizable reduction in empirical PA and overestimation of deterministic PA is to be expected only under very low marker density (<100 SNPs) as in the study of Lian et al. (2014). Although these authors argued that 100 SNPs were likely sufficient for within-family GP in maize, our results indicate that at least 1000 and 2500 SNPs should be used for within- and across-family GP, respectively, to obtain acceptable empirical PA and minimize the bias in deterministic PA (Figure S9 in File S1). If such numbers are not available, deterministic equations must additionally account for incomplete LD (Wientjes et al. 2016), using, for example, multiplication with the average LD ( $r^{2}$ ) between adjacent markers as proxy for the QTL-SNP LD (Lian et al. 2014).

Besides low marker density, incomplete QTL-SNP LD can result from differences in the allele frequency distribution at QTL and SNPs (Goddard et al. 2011), inter alia due to ascertainment bias of SNP chips. These differences are in reality unknown, and, as treated herein, commonly not accounted for in simulation studies (Daetwyler et al. 2013). For GP across BPFs, differences in allele frequencies at QTL and SNPs in the ancestral population (cf. Figure S1E in File S1) would translate into different $θ_{A B}$ values at SNPs and QTL across BPFs, because the smaller the minor allele frequency, the larger the chances of a locus being monomorphic in a BPF. Thus, calculation of $ρ^{D}$ might be inflated by an upward-bias in $θ_{A B}$ (Equation 5), in addition to the possible overestimation of across-family genomic relationships affecting both $ρ^{D}$ and $ρ^{W}$ (Equations 3 and 4). Further research is needed to show how strongly overestimation of $θ_{A B}$ can affect application of deterministic equations in practice, for example, by comparing the equations under chip-based and sequencing-based genotyping (Pérez-Enciso et al. 2015).

We assumed in our derivations that the genetic correlation among BPFs = 1 (see Appendix B), which is expected to hold under a purely additive-genetic model, as applies in the absence of epistasis to (i) testcross performance for a given tester, and (ii) to per se performance of completely homozygous lines (Melchinger 1987). By comparison, in cattle breeds or diverse germplasm in plant breeding, genetic correlations between populations are typically < 1 (Karoui et al. 2012; Lehermeier et al. 2015). Accounting for genetic correlations is possible with multi-group models, but these require sufficient phenotypic data for the predicted population as well as estimating these correlations, which seems impractical in the case of GP of a single BPF.

Despite generally promising results for both deterministic equations, we recommend using $ρ^{W}$ (Equation 3), because it depended less on the relatedness between BPFs, $N_{t r a i n},$ and $h^{2}$ (Figures S2 and S3 in File S1), rendering it more robust across a wide range of scenarios. Since $ρ^{W}$ and $ρ^{D}$ (as implemented here) require genotypic data of both the training and predicted individuals, they can be applied only after obtaining genotypic data of the individuals to be predicted. Alternatively, for newly planned crosses we propose to use computer simulations to generate in silico virtual genotypic data of the corresponding BPFs using known genotypes of the parents and genetic map information of the markers, as conducted in this study (cf. Mohammadi et al. 2015). This would make both equations accessible prior to generating new crosses for use in optimizing training set designs and allocation of resources.

Conclusions and extensions to multi-family training sets

We demonstrated that the empirical PA in BPFs of inbred lines is prone to various sources of variation, which differ strongly in their relevance for GP within and across BPFs. It should be stressed that the conclusions drawn from our study do not only apply to DH lines, but also to inbreds developed by recurrent selfing and most likely also to partly inbred generations. Overall, our results corroborate within-family GP as a valuable and robust tool for the implementation of GP in plant breeding, provided the training set meets minimum standards for $N_{t r a i n}$ ( $\geq 50$ ) and $h^{2}$ ( $>$ 0.3). However, the need for phenotypes from the predicted family represents the main drawback of within-family GP, because this increases both the costs and the time needed until selection can be applied.

Our simulations on across-family GP were restricted to the simple strategy of using only a single HSF or URF for model training. This provided a manageable framework for analyzing the underlying causes affecting variation in PA. For a given BPF_pred, we showed: (i) the PA in across-family GP expected across many traits differs systematically between different BPF_train, even if they have the same pedigree relationship with the BPF_pred, (ii) deterministic equations enable accurate forecasts of the PA across traits for given pairs of BPF_pred and BPF_train, and (iii) large variation in the PA among traits hampers the forecasting. Therefore, it is very unlikely to find a single BPF_train that performs uniformly best across all target traits. This means that caution must be exercised when applying rules of thumb or deterministic equations for choosing the BPF_train in GP of a specific trait given BPF_pred. This issue can be even more severe if (i) traits deviate from the polygenic architecture assumed in our simulations, or (ii) $M_{e}$ in the BPFs is smaller than $M_{e}$ in maize due to fewer chromosomes and/or smaller genome size (Figure S8 in File S1). Thus, identification of useful, trait-specific BPF_train might only be possible by directly evaluating the empirical PA for a small sample $(N \sim 30)$ of individuals from the BPF_pred. However, this would largely eliminate the time- and cost-related advantages of genomic selection based on previously available data from BPFs.

In practice, breeders generally do not rely on single-family training sets in GP across BPFs, but rather use multi-family training set designs for the sake of increasing sample size (Heffner et al. 2011; Riedelsheimer et al. 2013; Hickey et al. 2014; Jacobson et al. 2014; Lehermeier et al. 2014). Another important advantage of multi-family over single-family training sets in across-family GP most likely stems from the increased proportion of causal loci segregating in both the BPF_pred and the training set, which we identified as the core problem leading to the large variation of PA in GP across single BPFs. One critical question in this context is whether a single BPF_train that is poorly predictive of a given BPF_pred (e.g., an HSF that yields PA close to zero, Figure 4) is detrimental or harmless for PA if combined with other predictive BPFs to extend the training set. The problem might be exacerbated if URF are included in multi-family training sets (cf. Albrecht et al. 2011), which might come at the expense of reduced linkage phase similarity (cf. Figure S1C in File S1) between a multi-family training set and the BPF_pred (Lorenz and Smith 2015). Further research is warranted to investigate whether the current design of training sets can be improved by identifying and excluding adverse families to avoid disappointing outcomes of GP in BPFs.

Supplementary Material

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.300076/-/DC1.

Acknowledgments

We thank Chris-Carolin Schön, Matthias Westhues, Tobias Schrag, and Willem Molenaar for valuable suggestions to improve the content of this manuscript. P.S. acknowledges Syngenta for partially funding this research by a Ph.D. fellowship, and A.E.M. acknowledges the financial contribution of the International Maize and Wheat Improvement Center/Gesellschaft für Internationale Zusammenarbeit (CIMMYT/GIZ) through the Climate Resilient Maize for Asia (CRMA) Project 15.78600.8-001-00.

Appendix A

Genomic Relationships Between DH Lines from BPFs $A$ and $B$ Calculated with Different Methods

Suppose $i \in A$ and $j \in B$ are two DH lines from BPFs $A$ and $B$, respectively. Let $L = L_{A \cup B}$, $L_{A}$, $L_{B}$, and $L_{A \cap B}$ denote the sets of loci (SNPs or QTL, depending on the context) that are polymorphic in $A$ or in $B$, polymorphic in $A$, polymorphic in $B$, and polymorphic in both $A$ and $B$, respectively. Since $A$ and $B$ are BPFs, we have, under Mendelian inheritance,

$$
p_{A,k} = \begin{cases} \frac{1}{2} & \text{for } k \in L_{A} \\ 0 \text{ or } 1 & \text{elsewhere} \end{cases} \quad \text{and} \quad p_{B,k} = \begin{cases} \frac{1}{2} & \text{for } k \in L_{B} \\ 0 \text{ or } 1 & \text{elsewhere} \end{cases} . \tag{A1}
$$

Thus,

$$
\sum_{k \in L} 2 p_{A,k} (1 - p_{A,k}) = \frac{1}{2} |L_{A}| \quad \text{and} \quad \sum_{k \in L} 2 p_{B,k} (1 - p_{B,k}) = \frac{1}{2} |L_{B}|, \tag{A2}
$$

where $| L_{A} |$ and $| L_{B} |$ denote the number of elements in set $L_{A}$ and $L_{B},$ respectively. Defining $w_{A_{i}, k} = (x_{A_{i}, k} - 2 p_{A, k}),$ $w_{B_{j}, k} = (x_{B_{j}, k} - 2 p_{B, k})$ and $z_{A_{i} B_{j}, k} = w_{A_{i}, k} w_{B_{j}, k},$ we get, with Equation 1, for completely homozygous lines

$$
z_{A_{i} B_{j}, k} = \begin{cases} +1 & \text{if } k \in L_{A \cap B} \text{ and } x_{A_{i},k} = x_{B_{j},k} \text{ (i.e., identical alleles in } i \text{ and } j \text{ at locus } k \text{)} \\ -1 & \text{if } k \in L_{A \cap B} \text{ and } x_{A_{i},k} \neq x_{B_{j},k} \text{ (i.e., different alleles in } i \text{ and } j \text{ at locus } k \text{)} \\ 0 & \text{elsewhere} \end{cases} . \tag{A3}
$$

For calculating the elements of the genomic relationship matrix $G_{A B} = (G_{A_{i} B_{j}})$ according to the modification proposed in Equation 1, we obtain

$$
G_{A_{i} B_{j}} = \frac{\sum_{k \in L} z_{A_{i} B_{j}, k}}{\sqrt{\frac{1}{2} |L_{A}|} \sqrt{\frac{1}{2} |L_{B}|}} = \frac{2 |L_{A \cap B}| \left[ 2 SM_{i,j}(L_{A \cap B}) - 1 \right]}{\sqrt{|L_{A}| |L_{B}|}}, \tag{A4}
$$

where $S M_{i, j} (L_{A \cap B})$ refers to the simple matching coefficient (Sneath and Sokal 1973), also known as the IBS (identity by state) coefficient (Astle and Balding 2009), between $i$ and $j$ with respect to the loci set $L_{A \cap B} .$ Using the original formula of Chen et al. (2013), which extends Method 1 of VanRaden (2008) to the case of two populations, we obtain the genomic relationship matrix $G_{A B}^{V R 1} = (G_{A_{i} B_{j}}^{V R 1}),$ with elements

$$
G_{A_{i} B_{j}}^{VR1} = \frac{\sum_{k \in L} z_{A_{i} B_{j}, k}}{2 \sum_{k \in L} \sqrt{p_{A,k} (1 - p_{A,k}) \, p_{B,k} (1 - p_{B,k})}} = \frac{|L_{A \cap B}| \left[ 2 SM_{i,j}(L_{A \cap B}) - 1 \right]}{2 |L_{A \cap B}| \sqrt{\frac{1}{16}}} = 2 \left[ 2 SM_{i,j}(L_{A \cap B}) - 1 \right] . \tag{A5}
$$

Extending Method 2 of VanRaden (2008) to the case of two populations, we obtain the genomic relationship matrix $G_{A B}^{V R 2} = (G_{A_{i} B_{j}}^{V R 2}),$ as follows

$$
G_{A_{i} B_{j}}^{VR2} = \frac{1}{|L_{A \cap B}|} \sum_{k \in L_{A \cap B}} \frac{z_{A_{i} B_{j}, k}}{2 \sqrt{p_{A,k} (1 - p_{A,k}) \, p_{B,k} (1 - p_{B,k})}} = \frac{2}{|L_{A \cap B}|} \sum_{k \in L_{A \cap B}} z_{A_{i} B_{j}, k} = 2 \left[ 2 SM_{i,j}(L_{A \cap B}) - 1 \right], \tag{A6}
$$

where summation is possible only for $k \in L_{A \cap B}$, because for $k \in L_{A^{c} \cup B^{c}}$ the denominator is zero; here $L_{A^{c} \cup B^{c}}$ denotes the subset of $L$ polymorphic in $B$ but not in $A$, or polymorphic in $A$ but not in $B$. Thus, we obtain

$$
G_{A_{i} B_{j}}^{VR1} = G_{A_{i} B_{j}}^{VR2} = 2 \left[ 2 SM_{i,j}(L_{A \cap B}) - 1 \right] = γ \, G_{A_{i} B_{j}}, \tag{A7}
$$

in BPFs with allele frequencies equal to 0.5 at segregating loci, with $γ = \sqrt{| L_{A} | | L_{B} |} / | L_{A \cap B} | .$ Consequently, $γ = 1$ if and only if $L_{A} = L_{B}$ (e.g., if $A = B$ ), but otherwise $γ > 1.$ Note that the empirical PA of GBLUP is invariant to $γ$ (cf. Strandén and Christensen 2011), but $γ$ affects uniformly the scaling of GEBVs and reliabilities thereof. Note also that $S M_{i, j}$ calculated with regard to all loci (set $L$ ) can deviate from $S M_{i, j} (L_{A \cap B}),$ because

$$
SM_{i,j} = \frac{|L_{A \cap B}|}{|L|} SM_{i,j}(L_{A \cap B}) + \frac{|L_{A \cap B^{c}}|}{|L|} SM_{i,j}(L_{A \cap B^{c}}) + \frac{|L_{A^{c} \cap B}|}{|L|} SM_{i,j}(L_{A^{c} \cap B}), \tag{A8}
$$

and $S M_{i, j} (L_{A \cap B^{c}})$ as well as $S M_{i, j} (L_{A^{c} \cap B})$ can vary between pairs of $i$ and $j$, where $L_{A \cap B^{c}}$ and $L_{A^{c} \cap B}$ denote the subsets of $L$ polymorphic in $A$ but not in $B$, and polymorphic in $B$ but not in $A$, respectively.
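The relationships in Equations A4 to A7 can be checked numerically for a single pair of DH lines. The following Python sketch is an illustration only; it assumes 0/2 genotype coding, the expected allele frequencies of Equation A1 (0.5 at segregating loci), and hypothetical toy genotypes and function names.

```python
import math


def centered(x, polymorphic):
    """w_k = x_k - 2 p_k for a homozygous line coded 0/2, with p_k from Equation A1."""
    # Segregating loci have p = 1/2, so w is -1 or +1; fixed loci have x = 2p, so w = 0.
    return [(xk - 1) if seg else 0 for xk, seg in zip(x, polymorphic)]


def z_sum(x_i, x_j, poly_a, poly_b):
    """Sum over loci of z = w_A * w_B (Equation A3)."""
    return sum(a * b for a, b in zip(centered(x_i, poly_a), centered(x_j, poly_b)))


def g_modified(x_i, x_j, poly_a, poly_b):
    """Genomic relationship in the left-hand form of Equation A4."""
    denom = math.sqrt(0.5 * sum(poly_a)) * math.sqrt(0.5 * sum(poly_b))
    return z_sum(x_i, x_j, poly_a, poly_b) / denom


def g_vr1(x_i, x_j, poly_a, poly_b):
    """Genomic relationship from the definition in Equation A5."""
    # p(1 - p) equals 1/4 at segregating loci and 0 at fixed loci.
    denom = 2.0 * sum(math.sqrt(0.25 * a * 0.25 * b) for a, b in zip(poly_a, poly_b))
    return z_sum(x_i, x_j, poly_a, poly_b) / denom


def simple_matching(x_i, x_j, loci):
    """Proportion of the given loci at which both lines carry the same allele."""
    return sum(x_i[k] == x_j[k] for k in loci) / len(loci)


# Toy data: eight loci, six segregating in each family, four segregating in both.
poly_a = [True, True, True, True, True, True, False, False]
poly_b = [False, False, True, True, True, True, True, True]
x_i = [2, 0, 2, 2, 0, 0, 0, 0]
x_j = [2, 2, 2, 0, 0, 0, 2, 0]

shared = [k for k in range(len(x_i)) if poly_a[k] and poly_b[k]]
sm_shared = simple_matching(x_i, x_j, shared)
gamma = math.sqrt(sum(poly_a) * sum(poly_b)) / len(shared)

g_a4 = g_modified(x_i, x_j, poly_a, poly_b)
g_a5 = g_vr1(x_i, x_j, poly_a, poly_b)
```

In this toy case `g_a5` coincides with both `2 * (2 * sm_shared - 1)` and `gamma * g_a4`, as stated in Equation A7.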

Appendix B

Calculation of the Reliability of GP across Populations for Inbred Individuals

Consider across-population GP with two populations A (= prediction set) and B (= training set), which are not necessarily BPFs. Using well-known results about selection indices (Mrode 2005), the breeding value $g_{A_{i}}$ for individual $i \in A,$ which may be inbred, is predicted with information from its genotype and the phenotypic and genotypic information from the training set as:

$$
\hat{g}_{A_{i}} = \operatorname{cov}(g_{A_{i}}, y_{B}) \left[ \operatorname{var}(y_{B}) \right]^{-1} y_{B}, \tag{B1}
$$

in which ${\hat{g}}_{A_{i}}$ is the predicted breeding value, $g_{A_{i}}$ is the true breeding value of individual $i$ in population $A,$ and $y_{B}$ is a vector with phenotypes of individuals from population $B,$ corrected for fixed effects.

The covariance between the true breeding value of an individual from population $A$ and the phenotypes of individuals from population $B$ is:

$$
\operatorname{cov}(g_{A_{i}}, y_{B}) = \operatorname{cov}(g_{A_{i}}, g_{B} + e_{B}) = \operatorname{cov}(g_{A_{i}}, g_{B}) + \operatorname{cov}(g_{A_{i}}, e_{B}) = G_{A_{i},B}^{T} \, r_{AB} \, σ_{u_{A}} σ_{u_{B}}, \tag{B2}
$$

where $r_{A B}$ is the genetic correlation between $A$ and $B$ (which represents the correlation between the breeding value in population $A$ and the breeding value in population $B$ for the individuals in $A$), $G_{A_{i}, B}$ is the vector of genomic relationships between individual $i$ and the training individuals from $B$, which can be estimated by Equation 1 in the main text, and $σ_{u_{A}}$ and $σ_{u_{B}}$ are the square roots of the additive variances in $A$ and $B,$ respectively. Finally,

$$
\operatorname{var}(y_{B}) = \operatorname{cov}(y_{B}, y_{B}) = \operatorname{cov}(g_{B} + e_{B}, g_{B} + e_{B}) = \operatorname{cov}(g_{B}, g_{B}) + \operatorname{cov}(e_{B}, e_{B}) = G_{BB} σ_{u_{B}}^{2} + R_{B} σ_{e_{B}}^{2}, \tag{B3}
$$

where $G_{B B}$ is the genomic relationship matrix among training individuals in $B,$ $R_{B} σ_{e_{B}}^{2}$ is the covariance matrix of the errors $e_{B}$ in the observation vector $y_{B},$ and the breeding value $u_{A_{i}}$ is predicted as:

$$
\hat{g}_{A_{i}} = G_{A_{i},B}^{T} \, r_{AB} \, σ_{u_{A}} σ_{u_{B}} \left[ G_{BB} σ_{u_{B}}^{2} + R_{B} σ_{e_{B}}^{2} \right]^{-1} y_{B} = r_{AB} \, G_{A_{i},B}^{T} \frac{σ_{u_{A}}}{σ_{u_{B}}} \left[ G_{BB} + R_{B} \frac{σ_{e_{B}}^{2}}{σ_{u_{B}}^{2}} \right]^{-1} y_{B} . \tag{B4}
$$

We are interested in the reliability

$$
r_{A_{i}}^{2} = \frac{\operatorname{Cov}^{2}(\hat{g}_{A_{i}}, g_{A_{i}})}{σ_{\hat{g}_{A}}^{2} \, σ_{g_{A}}^{2}} = \frac{\operatorname{Cov}(\hat{g}_{A_{i}}, g_{A_{i}})}{σ_{g_{A}}^{2}} \tag{B5}
$$

for the estimated breeding value ${\hat{g}}_{A_{i}}$ of individual $i \in A,$ $g_{A_{i}}$ being defined with respect to population $A .$ Since $var (g_{A_{i}}) = σ_{g_{A}}^{2} = G_{A_{i}, A_{i}} σ_{u_{A}}^{2},$ together with Equation B2, we obtain

$$
\begin{aligned}
r_{A_{i}}^{2} &= \frac{\operatorname{Cov}(\hat{g}_{A_{i}}, g_{A_{i}})}{σ_{g_{A}}^{2}} = \frac{\operatorname{Cov}(\hat{g}_{A_{i}}, g_{A_{i}})}{G_{A_{i},A_{i}} σ_{u_{A}}^{2}} \\
&= \frac{\operatorname{cov}\left( r_{AB} \, G_{A_{i},B}^{T} \frac{σ_{u_{A}}}{σ_{u_{B}}} \left[ G_{BB} + R_{B} \frac{σ_{e_{B}}^{2}}{σ_{u_{B}}^{2}} \right]^{-1} y_{B}, \; u_{A_{i}} \right)}{G_{A_{i},A_{i}} σ_{u_{A}}^{2}} \\
&= \frac{r_{AB} \, G_{A_{i},B}^{T} \frac{σ_{u_{A}}}{σ_{u_{B}}} \left[ G_{BB} + R_{B} \frac{σ_{e_{B}}^{2}}{σ_{u_{B}}^{2}} \right]^{-1}}{G_{A_{i},A_{i}} σ_{u_{A}}^{2}} \operatorname{cov}(y_{B}, u_{A_{i}}) \\
&= \frac{r_{AB} \, G_{A_{i},B}^{T} \frac{σ_{u_{A}}}{σ_{u_{B}}} \left[ G_{BB} + R_{B} \frac{σ_{e_{B}}^{2}}{σ_{u_{B}}^{2}} \right]^{-1} r_{AB} \, G_{A_{i},B} \, σ_{u_{A}} σ_{u_{B}}}{G_{A_{i},A_{i}} σ_{u_{A}}^{2}}
\end{aligned} \tag{B6}
$$

so, the reliability is:

$$
r_{A_{i}}^{2} = r_{AB}^{2} \, \frac{G_{A_{i},B}^{T} \left[ G_{BB} + R_{B} \frac{σ_{e_{B}}^{2}}{σ_{u_{B}}^{2}} \right]^{-1} G_{A_{i},B}}{G_{A_{i},A_{i}}} .
$$
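The final reliability expression can be evaluated directly once the genomic relationships are available. The sketch below is a minimal Python illustration, assuming $R_{B}$ is an identity matrix and that the variance ratio $σ_{e_{B}}^{2} / σ_{u_{B}}^{2}$ is supplied as the hypothetical argument `lam`; the small linear solver and the toy numbers are likewise illustrative and not part of the original analysis.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def reliability(g_ib, g_bb, g_ii, lam, r_ab=1.0):
    """r^2 = r_AB^2 * g' (G_BB + lam * I)^-1 g / G_ii, assuming R_B = I.

    g_ib: genomic relationships between predicted individual i and the training set B.
    g_bb: genomic relationship matrix among training individuals (list of lists).
    g_ii: genomic self-relationship of individual i.
    lam:  ratio of residual to additive variance in B.
    """
    n = len(g_ib)
    v = [[g_bb[r][c] + (lam if r == c else 0.0) for c in range(n)] for r in range(n)]
    weights = solve(v, g_ib)
    quad = sum(g * w for g, w in zip(g_ib, weights))
    return r_ab ** 2 * quad / g_ii


# Toy example with two training individuals.
g_bb = [[2.0, 0.5], [0.5, 2.0]]
g_ib = [1.0, 0.5]
g_ii = 2.0
rel = reliability(g_ib, g_bb, g_ii, lam=1.0)
```

Because $r_{AB}$ enters only as a squared multiplier, a genetic correlation below 1 scales the reliability of every predicted individual by the same factor.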

Appendix C

Calculation of $M_{e}$ and the Variance of Genomic Relationships of Inbred Populations

Consider two populations $A$ (= prediction set) and $B$ (= training set) that are not necessarily BPFs. Based on the theory of Goddard et al. (2011), Wientjes et al. (2015) suggested calculating the effective number of chromosome segments $M_{e_{A, B}}$ shared between the two populations as (see their Equation 20)

$$
M_{e_{A,B}} = \frac{1}{\operatorname{var}\left( G_{A_{i} B_{j}} - E(G_{A_{i} B_{j}}) \right)} \tag{C1}
$$

If all individuals in $A$ have the same pedigree relationships with the individuals in $B,$ which holds true for pairs of BPFs, we have $E (G_{A_{i} B_{j}}) = \text{constant} \;\; \forall i \in A \text{ and } \forall j \in B$ so that $var (G_{A_{i} B_{j}} - E (G_{A_{i} B_{j}})) = var (G_{A_{i} B_{j}}) .$ If the genotypes of loci pairs are stochastically independent, as follows from the assumption of independent segregation in the definition of $M_{e}$ (Daetwyler et al. 2010), and applies in reality to all loci pairs located on different chromosomes (i.e., a fraction of at least $(C - 1) / C$ of all loci pairs, assuming $C$ chromosomes with equal length and number of loci), we have

$$
\operatorname{var}(G_{A_{i} B_{j}}) = \frac{\operatorname{var}\left( \sum_{k \in L} w_{A_{i},k} w_{B_{j},k} \right)}{\sum_{k \in L} 2 p_{A,k} (1 - p_{A,k}) \sum_{k \in L} 2 p_{B,k} (1 - p_{B,k})} = \frac{\sum_{k \in L} \operatorname{var}(w_{A_{i},k} w_{B_{j},k})}{\sum_{k \in L} 2 p_{A,k} (1 - p_{A,k}) \sum_{k \in L} 2 p_{B,k} (1 - p_{B,k})}, \tag{C2}
$$

where $w_{A_{i}, k}$ and $w_{B_{j}, k}$ are defined as in Appendix A, with $E (w_{A_{i}, k}) = 0$ and $E (w_{B_{j}, k}) = 0.$ Moreover, $w_{A_{i}, k}$ and $w_{B_{j}, k}$ are stochastically independent because $i$ and $j$ are random samples from $A$ and $B,$ respectively. Thus, using results about the product of two stochastically independent random variables (Mood 1974), we obtain

$$
\operatorname{var}(w_{A_{i},k} w_{B_{j},k}) = \operatorname{var}(w_{A_{i},k}) \operatorname{var}(w_{B_{j},k}) . \tag{C3}
$$

Under inbreeding with inbreeding coefficient $F_{A}$ in population $A,$ applying well-known results on the effects of inbreeding on the additive genetic variance (Falconer and Mackay 1996), we obtain

$$
\operatorname{var}(w_{A_{i},k} \mid i \text{ with } F_{A}) = (1 + F_{A}) \operatorname{var}(w_{A_{s},k} \mid s \text{ with } F_{s} = 0) \tag{C4}
$$

and a similar result for population $B .$ Combining Equations C2, C3, and C4 yields

$$
\operatorname{var}(G_{A_{i},B_{j}} \mid i \text{ with } F_{A} \land j \text{ with } F_{B}) = (1 + F_{A}) (1 + F_{B}) \operatorname{var}(G_{A_{s},B_{t}} \mid s \text{ with } F_{s} = 0 \land t \text{ with } F_{t} = 0) \tag{C5}
$$

While, for inbred generations derived by recurrent selfing, this equation may hold only approximately true due to the assumption of stochastic independence among all loci pairs, a proof can be given that Equation C5 holds strictly true without this requirement for DH lines and F₂ individuals, i.e.,

$$
\operatorname{var}(G_{A_{i},B_{j}} \mid i = \text{DH line} \land j = \text{DH line}) = 4 \operatorname{var}(G_{A_{s},B_{t}} \mid s = F_{2} \text{ ind.} \land t = F_{2} \text{ ind.}) \tag{C6}
$$

In the original publications (Goddard et al. 2011; Wientjes et al. 2015) connecting $M_{e}$ with the variance in genomic relationships among individuals, it was assumed that the individuals were noninbred. However, if they are DH lines or from another inbred generation, this is expected to affect $M_{e},$ so that for the case of fixed pedigree relationships between $A$ and $B,$ the estimator of $M_{e}$ becomes

$$
M_{e_{A,B}} = \frac{(1 + F_{A}) (1 + F_{B})}{\operatorname{var}(G_{A_{i} B_{j}} \mid i \text{ with } F_{A} \land j \text{ with } F_{B})} . \tag{C7}
$$
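Equation C7 translates into a short computation on the block of genomic relationships between the two populations. The following Python sketch is illustrative only: it assumes the relationships are supplied as a list of lists (rows for individuals in $A$, columns for individuals in $B$), uses the population variance from the standard library (a sample variance would be an equally plausible choice), and defaults to $F_{A} = F_{B} = 1$ as for DH lines. The function name and toy values are hypothetical.

```python
import statistics


def effective_segments(g_ab, f_a=1.0, f_b=1.0):
    """M_e = (1 + F_A)(1 + F_B) / var(G_AiBj), assuming a fixed pedigree
    relationship between populations A and B (Equation C7)."""
    values = [g for row in g_ab for g in row]  # flatten the A x B block
    var_g = statistics.pvariance(values)
    return (1.0 + f_a) * (1.0 + f_b) / var_g


# Toy block of genomic relationships between two lines in A and two lines in B.
g_ab = [[0.2, -0.2], [-0.2, 0.2]]

me_dh = effective_segments(g_ab)                           # DH lines, F = 1
me_noninbred = effective_segments(g_ab, f_a=0.0, f_b=0.0)  # original noninbred formula
```

For the same observed variance, the inbreeding correction raises the estimate fourfold for DH lines relative to the noninbred formula of Equation C1.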

Footnotes

Communicating editor: J.-L. Jannink

Literature Cited

Akdemir D., Sanchez J. I., Jannink J.-L., 2015. Optimization of genomic selection training populations with a genetic algorithm. Genet. Sel. Evol. 47: 38.
Albrecht T., Wimmer V., Auinger H., Erbe M., Knaak C., et al., 2011. Genome-based prediction of testcross values in maize. Theor. Appl. Genet. 123: 339–350.
Astle W., Balding D., 2009. Population structure and cryptic relatedness in genetic association studies. Stat. Sci. 24: 451–471.
Bates D., Mächler M., Bolker B., Walker S., 2015. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67: 1–48.
Bernardo R., Yu J., 2007. Prospects for genomewide selection for quantitative traits in maize. Crop Sci. 47: 1082–1090.
Bustos-Korts D., Malosetti M., Chapman S., Biddulph B., van Eeuwijk F., 2016. Improvement of predictive ability by uniform coverage of the target genetic space. G3 6: 3733–3747.
Chen L., Schenkel F., Vinsky M., Crews D. H. Jr., Li C., 2013. Accuracy of predicting genomic breeding values for residual feed intake in Angus and Charolais beef cattle. Anim. Genet. 91: 4669–4678.
Clark S. A., Hickey J. M., Daetwyler H. D., van der Werf J. H. J., 2012. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44: 4.
Crossa J., Pérez P., Hickey J., Burgueño J., Ornella L., et al., 2014. Genomic prediction in CIMMYT maize and wheat breeding programs. Heredity (Edinb) 112: 48–60.
Daetwyler H. D., Villanueva B., Woolliams J. A., 2008. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One 3: e3395.
Daetwyler H. D., Pong-Wong R., Villanueva B., Woolliams J. A., 2010. The impact of genetic architecture on genome-wide evaluation methods. Genetics 185: 1021–1031.
Daetwyler H. D., Calus M. P. L., Pong-Wong R., de Los Campos G., Hickey J. M., 2013. Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics 193: 347–365.
Endelman J. B., 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4: 250–255.
Falconer D. S., Mackay T. F. C., 1996. Introduction to Quantitative Genetics. Longman, Pearson, Essex.
Giraud H., Lehermeier C., Bauer E., Falque M., Segura V., et al., 2014. Linkage disequilibrium with linkage analysis of multiline crosses reveals different multiallelic QTL for hybrid performance in the flint and dent heterotic groups of maize. Genetics 198: 1717–1734.
Goddard M., 2009. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136: 245–257.
Goddard M. E., Hayes B. J., 2007. Genomic selection. J. Anim. Breed. Genet. 124: 323–330.
Goddard M. E., Hayes B. J., Meuwissen T. H. E., 2011. Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 128: 409–421.
Habier D., Fernando R. L., Dekkers J. C. M., 2007. The impact of genetic relationship information on genome-assisted breeding values. Genetics 177: 2389–2397.
Habier D., Tetens J., Seefried F., Lichtner P., Thaller G., 2010. The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genet. Sel. Evol. 42: 5.
Habier D., Fernando R. L., Garrick D. J., 2013. Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194: 597–607.
Hayes B. J., Bowman P. J., Chamberlain A. C., Verbyla K., Goddard M. E., 2009. Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet. Sel. Evol. 41: 51.
He S., Schulthess A. W., Mirdita V., Zhao Y., Korzun V., et al., 2016. Genomic selection in a commercial winter wheat population. Theor. Appl. Genet. 129: 641–651.
Heffner E. L., Jannink J., Sorrells M. E., 2011. Genomic selection accuracy using multifamily prediction models in a wheat breeding program. Plant Genome 4: 65–75.
Hickey J. M., Dreisigacker S., Crossa J., Hearne S., Babu R., et al., 2014. Evaluation of genomic selection training population designs and genotyping strategies in plant breeding programs using simulation. Crop Sci. 54: 1476–1488.
Jacobson A., Lian L., Zhong S., Bernardo R., 2014. General combining ability model for genomewide selection in a biparental cross. Crop Sci. 54: 895–905.
Jannink J.-L., Lorenz A. J., Iwata H., 2010. Genomic selection in plant breeding: from theory to practice. Brief. Funct. Genomics 9: 166–177.
Karoui S., Carabaño M. J., Díaz C., Legarra A., 2012. Joint genomic evaluation of French dairy cattle breeds using multiple-trait models. Genet. Sel. Evol. 44: 39.
Kemper K. E., Reich C. M., Bowman P. J., Vander Jagt C. J., Chamberlain A. J., et al., 2015. Improved precision of QTL mapping using a nonlinear Bayesian method in a multi-breed population leads to greater accuracy of across-breed genomic predictions. Genet. Sel. Evol. 47: 29.
Lehermeier C., Krämer N., Bauer E., Bauland C., Camisan C., et al., 2014. Usefulness of multi-parental populations of maize (Zea mays L.) for genome-based prediction. Genetics 198: 3–16.
Lehermeier C., Schön C.-C., de los Campos G., 2015. Assessment of genetic heterogeneity in structured plant populations using multivariate whole-genome regression models. Genetics 201: 323–337.
Lian L., Jacobson A., Zhong S., Bernardo R., 2014. Genomewide prediction accuracy within 969 maize biparental populations. Crop Sci. 54: 1514–1522.
Lin Z., Hayes B. J., Daetwyler H. D., 2014. Genomic selection in crops, trees and forages: a review. Crop Pasture Sci. 65: 1177–1191.
Lorenz A. J., 2013. Resource allocation for maximizing prediction accuracy and genetic gain of genomic selection in plant breeding: a simulation experiment. G3 3: 481–491.
Lorenz A. J., Smith K. P., 2015. Adding genetically distant individuals to training populations reduces genomic prediction accuracy in barley. Crop Sci. 55: 2657–2667.
Marulanda J. J., Melchinger A. E., Würschum T., 2015. Genomic selection in biparental populations: assessment of parameters for optimum estimation set design. Plant Breed. 134: 623–630.
Matukumalli L. K., Lawley C. T., Schnabel R. D., Taylor J. F., Allan M. F., et al., 2009. Development and characterization of a high density SNP genotyping assay for cattle. PLoS One 4: e5350.
Melchinger A. E., 1987. Expectation of means and variances of testcrosses produced from F2 and backcross individuals and their selfed progenies. Heredity (Edinb) 59: 105–115.
Melchinger A. E., Schopp P., Müller D., Schrag T. A., Bauer E., et al., 2017. Safeguarding our genetic resources with libraries of doubled-haploid lines. Genetics 206: 1611–1619.
Meuwissen T., Goddard M., 2010. Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics 185: 623–631.
Meuwissen T. H. E., Hayes B. J., Goddard M. E., 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
Mohammadi M., Tiede T., Smith K. P., 2015. PopVar: a genome-wide procedure for predicting genetic variance and correlated response in biparental breeding populations. Crop Sci. 55: 2068–2077.
Mood A., 1974. Introduction to the Theory of Statistics. McGraw-Hill Education, Europe.
Mrode R. A., 2005. Linear Models for the Prediction of Animal Breeding Values. CABI, Wallingford, Oxfordshire, UK.
Müller D., Broman K. W., 2017. Meiosis: simulation of meiosis in plant breeding research. R package version 1.0.0. Available at: https://github.com/DominikMueller64/Meiosis.
Pérez-Enciso M., Rincón J. C., Legarra A., 2015. Sequence- vs. chip-assisted genomic selection: accurate biological information is advised. Genet. Sel. Evol. 47: 43.
R Core Team, 2017. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: https://www.r-project.org/.
Riedelsheimer C., Melchinger A. E., 2013. Optimizing the allocation of resources for genomic selection in one breeding cycle. Theor. Appl. Genet. 126: 2835–2848.
Riedelsheimer C., Endelman J. B., Stange M., Sorrells M. E., Jannink J. L., et al., 2013. Genomic predictability of interconnected biparental maize populations. Genetics 194: 493–503.
Rincent R., Laloë D., Nicolas S., Altmann T., Brunel D., et al., 2012. Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192: 715–728.
Schopp P., Müller D., Technow F., Melchinger A. E., 2017. Accuracy of genomic prediction in synthetic populations depending on the number of parents, relatedness and ancestral linkage disequilibrium. Genetics 205: 1–14.
Sneath P., Sokal R., 1973. Numerical Taxonomy: The Principles and Practice of Numerical Classification. Freeman, San Francisco, CA.
Strandén I., Christensen O. F., 2011. Allele coding in genomic evaluation. Genet. Sel. Evol. 43: 25.
VanRaden P. M., 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.
Wientjes Y. C. J., Veerkamp R. F., Calus M. P. L., 2013. The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics 193: 621–631.
Wientjes Y. C. J., Veerkamp R. F., Bijma P., Bovenhuis H., Schrooten C., et al., 2015. Empirical and deterministic accuracies of across-population genomic prediction. Genet. Sel. Evol. 47: 5.
Wientjes Y. C. J., Bijma P., Veerkamp R. F., Calus M. P. L., 2016. An equation to predict the accuracy of genomic values by combining data from multiple traits, populations, or environments. Genetics 202: 799–823.
Yang J., Benyamin B., McEvoy B. P., Gordon S., Henders A. K., et al., 2010. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42: 565–569.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Click here for additional data file.^{(1.6MB, pdf)}

Click here for additional data file.^{(1,021.7KB, zip)}

Data Availability Statement

[bib1] Akdemir D., Sanchez J. I., Jannink J.-L., 2015. Optimization of genomic selection training populations with a genetic algorithm. Genet. Sel. Evol. 47: 38. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] Albrecht T., Wimmer V., Auinger H., Erbe M., Knaak C., et al. , 2011. Genome-based prediction of testcross values in maize. Theor. Appl. Genet. 123: 339–350. [DOI] [PubMed] [Google Scholar]

[bib3] Astle W., Balding D., 2009. Population structure and cryptic relatedness in genetic association studies. Stat. Sci. 24: 451–471. [Google Scholar]

[bib4] Bates D., Mächler M., Bolker B., Walker S., 2015. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67: 1–48. [Google Scholar]

[bib5] Bernardo R., Yu J., 2007. Prospects for genomewide selection for quantitative traits in maize. Crop Sci. 47: 1082–1090. [Google Scholar]

[bib6] Bustos-Korts D., Malosetti M., Chapman S., Biddulph B., van Eeuwijk F., 2016. Improvement of predictive ability by uniform coverage of the target genetic space. G3 6: 3733–3747. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] Chen L., Schenkel F., Vinsky M., Jr D. H. C., Li C., 2013. Accuracy of predicting genomic breeding values for residual feed intake in angus and charolais beef cattle. Anim. Genet. 91: 4669–4678. [DOI] [PubMed] [Google Scholar]

[bib8] Clark S. A., Hickey J. M., Daetwyler H. D., van der Werf J. H. J., 2012. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44: 4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib9] Crossa J., Pérez P., Hickey J., Burgueño J., Ornella L., et al. , 2014. Genomic prediction in CIMMYT maize and wheat breeding programs. Heredity (Edinb) 112: 48–60. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] Daetwyler H. D., Villanueva B., Woolliams J. A., 2008. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One 3: e3395. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] Daetwyler H. D., Pong-Wong R., Villanueva B., Woolliams J. A., 2010. The impact of genetic architecture on genome-wide evaluation methods. Genetics 185: 1021–1031. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] Daetwyler H. D., Calus M. P. L., Pong-Wong R., de Los Campos G., Hickey J. M., 2013. Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics 193: 347–365. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib14] Endelman J. B., 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4: 250–255. [Google Scholar]

[bib15] Falconer D. F., Mackay T. S. C., 1996. Introduction to Quantitative Genetics. Longman, Pearson, Essex. [Google Scholar]

[bib16] Giraud H., Lehermeier C., Bauer E., Falque M., Segura V., et al. , 2014. Linkage disequilibrium with linkage analysis of multiline crosses reveals different multiallelic QTL for hybrid performance in the flint and dent heterotic groups of maize. Genetics 198: 1717–1734. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib17] Goddard M., 2009. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136: 245–257. [DOI] [PubMed] [Google Scholar]

[bib18] Goddard M. E., Hayes B. J., 2007. Genomic selection. J. Anim. Breed. Genet. 124: 323–330. [DOI] [PubMed] [Google Scholar]

[bib19] Goddard M. E., Hayes B. J., Meuwissen T. H. E., 2011. Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 128: 409–421. [DOI] [PubMed] [Google Scholar]

[bib20] Habier D., Fernando R. L., Dekkers J. C. M., 2007. The impact of genetic relationship information on genome-assisted breeding values. Genetics 177: 2389–2397. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib21] Habier D., Tetens J., Seefried F., Lichtner P., Thaller G., 2010. The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genet. Sel. Evol. 42: 5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib22] Habier D., Fernando R. L., Garrick D. J., 2013. Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194: 597–607. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib23] Hayes B. J., Bowman P. J., Chamberlain A. C., Verbyla K., Goddard M. E., 2009. Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet. Sel. Evol. 41: 51. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] He S., Schulthess A. W., Mirdita V., Zhao Y., Korzun V., et al., 2016. Genomic selection in a commercial winter wheat population. Theor. Appl. Genet. 129: 641–651.

[bib25] Heffner E. L., Jannink J.-L., Sorrells M. E., 2011. Genomic selection accuracy using multifamily prediction models in a wheat breeding program. Plant Genome 4: 65–75.

[bib26] Hickey J. M., Dreisigacker S., Crossa J., Hearne S., Babu R., et al., 2014. Evaluation of genomic selection training population designs and genotyping strategies in plant breeding programs using simulation. Crop Sci. 54: 1476–1488.

[bib27] Jacobson A., Lian L., Zhong S., Bernardo R., 2014. General combining ability model for genomewide selection in a biparental cross. Crop Sci. 54: 895–905.

[bib28] Jannink J.-L., Lorenz A. J., Iwata H., 2010. Genomic selection in plant breeding: from theory to practice. Brief. Funct. Genomics 9: 166–177.

[bib29] Karoui S., Carabaño M. J., Díaz C., Legarra A., 2012. Joint genomic evaluation of French dairy cattle breeds using multiple-trait models. Genet. Sel. Evol. 44: 39.

[bib30] Kemper K. E., Reich C. M., Bowman P. J., Vander Jagt C. J., Chamberlain A. J., et al., 2015. Improved precision of QTL mapping using a nonlinear Bayesian method in a multi-breed population leads to greater accuracy of across-breed genomic predictions. Genet. Sel. Evol. 47: 29.

[bib31] Lehermeier C., Krämer N., Bauer E., Bauland C., Camisan C., et al., 2014. Usefulness of multi-parental populations of maize (Zea mays L.) for genome-based prediction. Genetics 198: 3–16.

[bib32] Lehermeier C., Schön C.-C., de los Campos G., 2015. Assessment of genetic heterogeneity in structured plant populations using multivariate whole-genome regression models. Genetics 201: 323–337.

[bib33] Lian L., Jacobson A., Zhong S., Bernardo R., 2014. Genomewide prediction accuracy within 969 maize biparental populations. Crop Sci. 54: 1514–1522.

[bib34] Lin Z., Hayes B. J., Daetwyler H. D., 2014. Genomic selection in crops, trees and forages: a review. Crop Pasture Sci. 65: 1177–1191.

[bib35] Lorenz A. J., 2013. Resource allocation for maximizing prediction accuracy and genetic gain of genomic selection in plant breeding: a simulation experiment. G3 3: 481–491.

[bib36] Lorenz A. J., Smith K. P., 2015. Adding genetically distant individuals to training populations reduces genomic prediction accuracy in barley. Crop Sci. 55: 2657–2667.

[bib37] Marulanda J. J., Melchinger A. E., Würschum T., 2015. Genomic selection in biparental populations: assessment of parameters for optimum estimation set design. Plant Breed. 134: 623–630.

[bib38] Matukumalli L. K., Lawley C. T., Schnabel R. D., Taylor J. F., Allan M. F., et al., 2009. Development and characterization of a high density SNP genotyping assay for cattle. PLoS One 4: e5350.

[bib39] Melchinger A. E., 1987. Expectation of means and variances of testcrosses produced from F2 and backcross individuals and their selfed progenies. Heredity (Edinb) 59: 105–115.

[bib40] Melchinger A. E., Schopp P., Müller D., Schrag T. A., Bauer E., et al., 2017. Safeguarding our genetic resources with libraries of doubled-haploid lines. Genetics 206: 1611–1619.

[bib41] Meuwissen T., Goddard M., 2010. Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics 185: 623–631.

[bib42] Meuwissen T. H. E., Hayes B. J., Goddard M. E., 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.

[bib43] Mohammadi M., Tiede T., Smith K. P., 2015. PopVar: a genome-wide procedure for predicting genetic variance and correlated response in biparental breeding populations. Crop Sci. 55: 2068–2077.

[bib44] Mood A., 1974. Introduction to the Theory of Statistics. McGraw-Hill Education, Europe.

[bib45] Mrode R. A., 2005. Linear Models for the Prediction of Animal Breeding Values. CABI, Wallingford, Oxfordshire, UK.

[bib46] Müller D., Broman K. W., 2017. Meiosis: simulation of meiosis in plant breeding research. R package version 1.0.0. Available at: https://github.com/DominikMueller64/Meiosis.

[bib47] Pérez-Enciso M., Rincón J. C., Legarra A., 2015. Sequence- vs. chip-assisted genomic selection: accurate biological information is advised. Genet. Sel. Evol. 47: 43.

[bib48] R Core Team, 2017. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: https://www.r-project.org/.

[bib49] Riedelsheimer C., Melchinger A. E., 2013. Optimizing the allocation of resources for genomic selection in one breeding cycle. Theor. Appl. Genet. 126: 2835–2848.

[bib50] Riedelsheimer C., Endelman J. B., Stange M., Sorrells M. E., Jannink J.-L., et al., 2013. Genomic predictability of interconnected biparental maize populations. Genetics 194: 493–503.

[bib51] Rincent R., Laloë D., Nicolas S., Altmann T., Brunel D., et al., 2012. Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192: 715–728.

[bib52] Schopp P., Müller D., Technow F., Melchinger A. E., 2017. Accuracy of genomic prediction in synthetic populations depending on the number of parents, relatedness and ancestral linkage disequilibrium. Genetics 205: 1–14.

[bib53] Sneath P., Sokal R., 1973. Numerical Taxonomy: The Principles and Practice of Numerical Classification. Freeman, San Francisco, CA.

[bib54] Strandén I., Christensen O. F., 2011. Allele coding in genomic evaluation. Genet. Sel. Evol. 43: 25.

[bib55] VanRaden P. M., 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.

[bib56] Wientjes Y. C. J., Veerkamp R. F., Calus M. P. L., 2013. The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics 193: 621–631.

[bib57] Wientjes Y. C. J., Veerkamp R. F., Bijma P., Bovenhuis H., Schrooten C., et al., 2015. Empirical and deterministic accuracies of across-population genomic prediction. Genet. Sel. Evol. 47: 5.

[bib58] Wientjes Y. C. J., Bijma P., Veerkamp R. F., Calus M. P. L., 2016. An equation to predict the accuracy of genomic values by combining data from multiple traits, populations, or environments. Genetics 202: 799–823.

[bib59] Yang J., Benyamin B., McEvoy B. P., Gordon S., Henders A. K., et al., 2010. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42: 565–569.
