Abstract
A major application of genomic prediction (GP) in plant breeding is the identification of superior inbred lines within families derived from biparental crosses. When models for various traits were trained within related or unrelated biparental families (BPFs), experimental studies found substantial variation in prediction accuracy (PA), but little is known about the underlying factors. We used SNP marker genotypes of inbred lines from either elite germplasm or landraces of maize (Zea mays L.) as parents to generate in silico 300 BPFs of doubled-haploid lines. We analyzed PA within each BPF for 50 simulated polygenic traits, using genomic best linear unbiased prediction (GBLUP) models trained with individuals from either full-sib (FSF), half-sib (HSF), or unrelated families (URF) for various sizes of the training set and different heritabilities. In addition, we modified two deterministic equations for forecasting PA to account for inbreeding and genetic variance unexplained by the training set. Averaged across traits, PA was high within FSF (0.41–0.97), with large variation only for small training sets and low heritabilities. For HSF and URF, PA was on average ∼40–60% lower and varied substantially among different combinations of BPFs used for model training and prediction as well as among traits. As exemplified by HSF results, PA of across-family GP can be very low if causal variants not segregating in the training set account for a sizeable proportion of the genetic variance among predicted individuals. Deterministic equations accurately forecast the PA expected over many traits, yet cannot capture trait-specific deviations. We conclude that model training within BPFs generally yields stable PA, whereas a high level of uncertainty is encountered in across-family GP. Our study shows the extent of variation in PA that must at least be reckoned with in practice and offers a starting point for the design of training sets composed of multiple BPFs.
Keywords: genomic prediction, biparental families, plant breeding, GBLUP, deterministic accuracy, linkage disequilibrium, GenPred, Shared Data Resources, Genomic Selection
With the advent of low-cost genome-wide SNP markers, genomic prediction (GP; see Supplemental Material, Table S1 in File S1 for a full list of abbreviations) proposed by Meuwissen et al. (2001) has become a powerful tool in animal and plant breeding. The basic idea of GP is to combine the phenotypic and genotypic data of training individuals in a model for predicting the genetic merit of selection candidates that have only been genotyped. Complementing, or even replacing, phenotyping in this way can result in considerable cost savings and shorter breeding cycles, giving GP a considerable advantage over traditional selection methods (Bernardo and Yu 2007; Goddard and Hayes 2007; Lin et al. 2014). Particular challenges of GP in plant breeding arise from (i) the specific population structures, mostly characterized by multiple related or unrelated segregating biparental families (BPFs) derived from crosses between inbred parents, and (ii) the small sample sizes available for model training (Jannink et al. 2010).
In commercial breeding of line and hybrid cultivars, up to several hundred BPFs are newly generated every year. Depending on the species and size of the breeding program, each family can comprise a variable number (usually <250) of lines, developed either by recurrent selfing or the doubled-haploid (DH) technology (Albrecht et al. 2011). Since expected differences among BPFs can be reliably predicted based on the mean performance of their parents (Melchinger 1987), GP applied to populations comprising multiple BPFs aims primarily at the identification of superior lines within these families (Riedelsheimer et al. 2013). Prediction models such as genomic best linear unbiased prediction (GBLUP) allow capturing Mendelian sampling—responsible for variation in the breeding values of siblings within BPFs—through cosegregation of SNP markers with quantitative trait loci (QTL) (Habier et al. 2013). While several studies have investigated the accuracy of GP within and across BPFs, more attention is needed to assess the mean and variation of PA for training sets taken from full-sib (FSF), half-sib (HSF) or unrelated families (URF). Experimental results available so far are confined by the number and size of BPFs (Riedelsheimer et al. 2013; Lehermeier et al. 2014) and low marker density (Jacobson et al. 2014; Lian et al. 2014).
Model training with individual BPFs has been studied intensively, and PA has been generally more promising for “within-family GP” than “across-family GP” (Riedelsheimer et al. 2013). Various authors argued that for a given size of the training set, within-family GP would provide the highest possible PA owing to strong linkage disequilibrium (LD) between SNPs and QTL due to cosegregation and the same set of loci being polymorphic in the prediction and training set (Crossa et al. 2014; Lehermeier et al. 2014). Nevertheless, Lian et al. (2014) reported for within-family GP substantial variation in PA among 969 BPFs and various traits, in line with the results of other studies on BPFs (Riedelsheimer et al. 2013; Jacobson et al. 2014; Lehermeier et al. 2014). However, a systematic investigation on the extent and factors determining the mean and variation in PA among BPFs and traits is, to the best of our knowledge, not available to date.
Since PA increases with closer pedigree relationships between training and predicted individuals (Habier et al. 2010; Clark et al. 2012), one obvious strategy is to use HSFs with one common parent between the training family (BPFtrain) and the predicted family (BPFpred) in across-family GP. Compared to within-family GP, PA for this strategy was generally much lower with the same sample size, but can reach similar levels if the sample size is strongly extended (Lehermeier et al. 2014). By comparison, model training with only unrelated BPFs produced from the same ancestral population yields often poor or even negative PA (Riedelsheimer et al. 2013; Jacobson et al. 2014; Schopp et al. 2017). Optimizing training set designs in GP with BPFs therefore requires better insights into how the pedigree relationship between BPFs, the sample size, and the heritability affect the mean and the variation in PA. Herein, we address these factors for the simple case of GP across individual pairs of BPFs, thereby providing a starting point for further investigations on the design of multi-family training sets in plant breeding.
Forecasting PA based on existing molecular and phenotypic data could assist breeders in (i) choosing the most suitable BPFs for model training for prediction of existing or planned BPFs, and (ii) allocating resources to the training and prediction sets. Daetwyler et al. (2008, 2010) derived a deterministic equation for forecasting PA, which requires only population parameters (sample size $N$, heritability $h^2$, and the effective number of chromosome segments $M_e$). When averaged over several traits, empirical and deterministic accuracy agreed well within BPFs (Lorenz 2013; Riedelsheimer et al. 2013; Lian et al. 2014). There is little consensus, however, regarding the calculation of $M_e$ in general (Goddard 2009; Meuwissen and Goddard 2010; Goddard et al. 2011; Wientjes et al. 2013), and, specifically, for BPFs (Lorenz 2013; Riedelsheimer and Melchinger 2013; Lian et al. 2014). Recently, Daetwyler’s equation was applied to GP both within and across cattle breeds (Wientjes et al. 2013, 2015). The authors extended Goddard et al.’s (2011) approach for calculating $M_e$ from the variance of genomic relationship coefficients to multiple populations. Overestimation of PA was attributed to a violation of Daetwyler’s assumption that the genetic variance in the prediction set is fully explained by marker effects estimated in the training set. An aggravation of this problem is expected for across-family GP with BPFs due to a high fraction of QTL and markers that are not consistently polymorphic across BPFs. Herein, we propose to extend Daetwyler’s equation to cope with this problem and make the equation applicable to across-family GP in plant breeding.
Alternatively, PA can be forecasted based on the estimated reliability of genomic-estimated breeding values (GEBVs) derived from selection index theory (VanRaden 2008). However, this approach has rarely been applied in plant breeding (Akdemir et al. 2015; He et al. 2016), and, to the best of our knowledge, not to GP of individual BPFs, despite promising results for GP within and across breeds of cattle (Hayes et al. 2009; Wientjes et al. 2013, 2015). One problem is that the approach was developed for outbred populations, and needs modifications when applied to inbred genotypes. Moreover, several strict assumptions regarding the properties of the genomic relationship matrix must be satisfied to obtain meaningful results, which will be elaborated in this paper for the case of BPFs in plant breeding.
The objectives of our study were to (i) investigate the mean and variation of empirical PA within and across BPFs of inbred lines, (ii) examine how the variation in PA is affected by differences in polymorphism at causal loci of polygenic traits between the training and prediction set, as well as by other factors (e.g., level of ancestral LD, pedigree relationship between BPFs, sample size, heritability), and (iii) adapt equations for deterministic forecasting of PA in BPFs of inbred genotypes and demonstrate their usefulness in simulated data sets. To simulate realistic scenarios, we used SNP data of inbred lines taken either from a public maize breeding program or a DH library of a European maize landrace and generated in silico numerous BPFs of DH lines. Besides flexibility in the choice of sample sizes, and exclusion of nuisance factors uncontrollable in experimental studies, this allowed us to simulate traits with known genetic architecture for a profound analysis of the causal factors affecting PA of GP within and across BPFs.
Materials and Methods
Ancestral populations
We considered two ancestral populations as source germplasm of parental genotypes for generating BPFs. Ancestral population Elite consisted of 72 elite inbred lines with medium long-range LD (Figure S1A in File S1) representative for the Flint heterotic group of the maize breeding program of the University of Hohenheim. Ancestral population Landrace consisted of 40 DH lines derived without any intentional selection from the German maize landrace “Gelber Badischer” with a rapid decay of LD to a low level (Melchinger et al. 2017). All lines were genotyped with the Illumina chip MaizeSNP50, containing 57,841 SNPs, and were expected to be fully homozygous. Markers monomorphic in the ancestral population or heterozygous in at least one individual were removed for further analysis. Physical map positions were converted into genetic map positions required for simulating meioses as described by Schopp et al. (2017). In total, we retained 19,204 and 16,171 SNPs for Elite and Landrace, respectively, distributed over the 10 maize chromosomes ranging in length from 137 to 276 cM (1913 cM in total). Individuals in the ancestral population were regarded as unrelated for defining pedigree relationships between subsequently generated BPFs.
Simulation of BPFs
For generating BPFs, we first sampled at random 25 parent lines from each ancestral population and intermated them according to a half-diallel design to generate all 300 possible crosses. Subsequently, 1500 DH lines were derived from each F1 cross to obtain the BPFs used for further analyses. According to the half-diallel, each predicted family (BPFpred) was associated with several possible training families (BPFtrain) with different pedigree relationships to BPFpred. These were: (i) one FSF, corresponding to BPFpred itself; (ii) 46 HSFs sharing one common parent with BPFpred; and (iii) 253 URFs sharing no common parent with BPFpred. Meioses for in silico production of DH lines were simulated with the R package Meiosis (Müller and Broman 2017).
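The family counts implied by the half-diallel can be checked with a few lines of code. The following is a minimal Python sketch, not the authors' R code; the function names `half_diallel` and `relationship` are hypothetical and introduced only for illustration.

```python
from itertools import combinations


def half_diallel(n_parents):
    """All unordered crosses among n_parents lines (no selfs, no reciprocals)."""
    return list(combinations(range(n_parents), 2))


def relationship(family_a, family_b):
    """Classify two biparental families by the number of parents they share."""
    shared = len(set(family_a) & set(family_b))
    return {2: "FSF", 1: "HSF", 0: "URF"}[shared]


families = half_diallel(25)   # 25 parents, as in the simulation design
pred = families[0]            # any family can serve as the predicted family
counts = {"FSF": 0, "HSF": 0, "URF": 0}
for fam in families:
    counts[relationship(pred, fam)] += 1

print(len(families), counts)
```

With 25 parents this gives 300 families, and each predicted family has itself as the only FSF, 2 × 23 = 46 HSFs, and 253 URFs.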
Description of factors analyzed
For systematic assessment of the factors influencing the distribution of the empirical PA, we defined various fixed and random factors (Table 1). As fixed factors, we considered (i) the ancestral population (Elite or Landrace), (ii) the pedigree relationship (FSF, HSF, or URF) between individuals in BPFpred and BPFtrain, (iii) the type of data (SNP marker genotypes or QTL genotypes) used to calculate the genomic relationship matrix for GBLUP, (iv) the sample size $N$ of the training set, and (v) the heritability $h^2$ of the trait. The idealistic scenario $h^2 = 1$ was included to demonstrate how the variation in PA behaves when phenotypic accuracy is not a limiting factor. Random factors were the trait, the BPFpred, the BPFtrain, and the actual sample of training individuals taken from BPFtrain.
Table 1. Overview of factors with their corresponding levels analyzed in this study.
| Type | Factor | Model Parameter | Number of Factor Levels | Factor Levels |
|---|---|---|---|---|
| Fixed factors | Ancestral population | — | 2 | **Elite**, Landrace |
| Fixed factors | Pedigree relationship between training and predicted family | — | 3 | FSF, HSF, URF |
| Fixed factors | Data used to calculate the relationship matrix | — | 2 | QTL, **SNPs** |
| Fixed factors | Sample size ($N$) | — | 3 | 25, **100**, 250 |
| Fixed factors | Heritability ($h^2$) | — | 3 | 0.3, **0.6**, 1 |
| Random factors | Trait | | 50 | — |
| Random factors | Predicted family (BPFpred) | | 50 | — |
| Random factors | Training family (BPFtrain) | | 1 (FSF), 25 (HSF/URF) | — |
| Random factors | Training set sample | | 3 | — |
Default values for the standard scenario are indicated in boldface.
We simulated 50 truly polygenic traits, each governed by 1000 QTL. First, we sampled at random a subset of 5000 SNP markers from all SNPs available in the ancestral population, corresponding to a marker density of 2.61 SNPs cM−1. This fixed set of markers was used for GP of all traits, because resampling of SNP marker positions had a negligible influence on the results. Second, for each of the 50 traits we sampled at random the map positions of 1000 QTL from the remaining 14,204 and 12,171 SNPs in Elite and Landrace, respectively. Following Meuwissen et al. (2001), effects of each QTL were drawn from a Gamma distribution with equal probability of effect signs. Importantly, all traits were affected by the same number of loci, but differed in the positions and effects of the QTL. Thus, the realized number of polymorphic QTL could vary depending on the trait, the BPFpred, and the BPFtrain.
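A minimal Python sketch of the effect-sampling step is given below. The text states only that effects follow a Gamma distribution with random signs; the shape and scale defaults used here (0.4 and 1.66) are assumptions, often quoted for Meuwissen et al. (2001) but not given in this paper, and should be treated as placeholders. The function name `sample_qtl_effects` is hypothetical.

```python
import random


def sample_qtl_effects(n_qtl, shape=0.4, scale=1.66, seed=None):
    """Draw QTL effects: Gamma-distributed magnitudes with random signs.

    shape and scale are ASSUMED defaults, not values stated in the text.
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(n_qtl):
        magnitude = rng.gammavariate(shape, scale)   # alpha = shape, beta = scale
        sign = 1.0 if rng.random() < 0.5 else -1.0   # equal probability of either sign
        effects.append(sign * magnitude)
    return effects


effects = sample_qtl_effects(1000, seed=1)
n_positive = sum(1 for a in effects if a > 0)
print(len(effects), n_positive)
```

A Gamma shape below 1 produces many small effects and a few large ones, so traits with the same number of QTL can still differ markedly in how the genetic variance is distributed along the genome.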
Phenotypes of training individuals were simulated according to the model $\mathbf{y} = \mathbf{g} + \mathbf{e}$ (cf. Goddard et al. 2011), where $\mathbf{g}$ is the vector of true breeding values (TBVs) calculated as $\mathbf{g} = \mathbf{W}\mathbf{a}$, $\mathbf{W}$ is the matrix of genotypic scores at QTL coded as 2 or 0, depending on whether a DH line was homozygous for the 1 or 0 allele, respectively, and $\mathbf{a}$ is the vector of QTL effects. Vector $\mathbf{e}$ contains independent normally distributed environmental noise variables, $e_i \sim N(0, \sigma_e^2)$, where the variance $\sigma_e^2$ was assumed to be constant across BPFs derived from one ancestral population, implying independent environmental influence on the phenotypes. We calculated $\sigma_e^2 = \bar{\sigma}_g^2 (1 - h^2)/h^2$, where $h^2$ is the a priori specified heritability (cf. Table 1) and $\bar{\sigma}_g^2$ is the genetic variance within a BPF, averaged across all 300 BPFs and 50 traits simulated.
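The phenotype simulation can be sketched as follows. This is a Python illustration under the assumption that the error variance follows from the usual definition of heritability, i.e., error variance = mean genetic variance × (1 − h²)/h²; all function names are hypothetical, and the toy genotypes are not from the study.

```python
import random


def true_breeding_values(genotypes, effects):
    """TBV of each line: sum of genotypic scores (coded 0/2) times QTL effects."""
    return [sum(x * a for x, a in zip(line, effects)) for line in genotypes]


def error_variance(mean_genetic_variance, h2):
    """Error variance implied by h2 = var_g / (var_g + var_e)."""
    return mean_genetic_variance * (1.0 - h2) / h2


def simulate_phenotypes(tbv, sigma2_e, rng):
    """Add independent normal noise with variance sigma2_e to each TBV."""
    sd = sigma2_e ** 0.5
    return [g + rng.gauss(0.0, sd) for g in tbv]


rng = random.Random(7)
genotypes = [[2, 0, 2], [0, 0, 2]]   # two DH lines, three QTL (toy data)
effects = [0.5, -1.0, 0.25]
tbv = true_breeding_values(genotypes, effects)
noiseless = simulate_phenotypes(tbv, error_variance(10.0, 1.0), rng)
noisy = simulate_phenotypes(tbv, error_variance(10.0, 0.3), rng)
print(tbv, noiseless, noisy)
```

For a heritability of 1 the error variance is zero and phenotypes equal the TBVs, which is the idealistic scenario included among the factor levels.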
Finally, we sampled at random 50 out of the 300 BPFs and considered them individually as the predicted family BPFpred. From the 1500 DH lines in each BPFpred, we estimated GEBVs for the first 500 lines. For within-family GP, training individuals were sampled from the remaining 1000 lines to predict individuals within the same family (FSF). For across-family GP (HSF or URF), 25 BPFtrain serving individually for model training were sampled from the 46 available HSFs and the 253 available URFs, respectively. For given BPFpred and BPFtrain, we sampled from BPFtrain three disjoint samples of training individuals, each of the size given by the fixed factor “sample size” (Table 1), with which the prediction model was trained. To minimize variation in PA attributable to sampling individuals from the BPFpred, we chose a large prediction set of 500 lines. By contrast, the training set sizes were of realistic magnitude, and analyzing repeated samples allowed us to quantify the variation in PA due to finite sampling in BPFtrain.
Genomic prediction model
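The GBLUP model can be written as $\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{u} + \mathbf{e}$, where $\mu$ is the general mean, $\mathbf{Z}$ is an incidence matrix linking phenotypes with breeding values, and $\mathbf{u}$ is the vector of random breeding values with mean zero and variance-covariance matrix

$$\mathrm{Var}(\mathbf{u}) = \mathbf{G} \circ \begin{bmatrix} \mathbf{J}\,\sigma^2_{\mathrm{pred}} & \mathbf{J}\, r_g\, \sigma_{\mathrm{pred}}\sigma_{\mathrm{train}} \\ \mathbf{J}\, r_g\, \sigma_{\mathrm{pred}}\sigma_{\mathrm{train}} & \mathbf{J}\,\sigma^2_{\mathrm{train}} \end{bmatrix},$$

where $\mathbf{G}$ is the genomic relationship matrix, and $\sigma^2_{\mathrm{pred}}$ and $\sigma^2_{\mathrm{train}}$ are the additive variances in the noninbred reference population of BPFpred and BPFtrain, respectively, which correspond to their (outbred) F2 generation. The blocks $\mathbf{J}$ are matrices of 1’s of conformable dimensions, $r_g$ is the genetic correlation between BPFpred and BPFtrain, which was assumed to be equal to 1 for reasons detailed in the discussion, and ∘ symbolizes the Hadamard product. Vector $\mathbf{e}$ contains random residuals with mean zero and $\mathrm{Var}(\mathbf{e}) = \mathbf{I}\sigma_e^2$, where $\mathbf{I}$ is an identity matrix and $\sigma_e^2$ is the residual error variance. We used

$$\mathbf{G} = \begin{bmatrix} \mathbf{G}_{\mathrm{pred,pred}} & \mathbf{G}_{\mathrm{pred,train}} \\ \mathbf{G}_{\mathrm{train,pred}} & \mathbf{G}_{\mathrm{train,train}} \end{bmatrix},$$

representing a modified version of the block-structured genomic relationship matrix devised by Chen et al. (2013), where the across-population blocks had elements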
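$$G_{ij}^{\mathrm{pred,train}} = \frac{\sum_{l=1}^{L} (x_{il} - 2p_l)(x'_{jl} - 2p'_l)}{\sqrt{2\sum_{l=1}^{L} p_l(1-p_l)}\;\sqrt{2\sum_{l=1}^{L} p'_l(1-p'_l)}} \tag{1}$$

where $x_{il}$ and $x'_{jl}$ are the genotypic scores of DH line $i$ in BPFpred and DH line $j$ in BPFtrain at locus $l$, respectively, coded as 2 and 0, and $p_l$ and $p'_l$ are the allele frequencies at locus $l$ in BPFpred and BPFtrain, respectively. The $L$ loci are either QTL or SNPs, depending on which were used to calculate $\mathbf{G}$ (according to the fixed factor “data,” Table 1). The within-population submatrices $\mathbf{G}_{\mathrm{pred,pred}}$ and $\mathbf{G}_{\mathrm{train,train}}$ are calculated accordingly, but here the denominator simplifies to $2\sum_l p_l(1-p_l)$ and $2\sum_l p'_l(1-p'_l)$, respectively, corresponding to the standard $\mathbf{G}$ matrix without subpopulation structure (Habier et al. 2007; VanRaden 2008). Importantly, the denominator for the across-population block in Equation 1 is different from that in Chen et al. (2013), who used $2\sum_l \sqrt{p_l(1-p_l)\,p'_l(1-p'_l)}$. Their approach effectively removes all loci that are monomorphic in BPFpred and/or BPFtrain, whereas our denominator retains these loci in the scaling of $\mathbf{G}$, yielding a better approximation of the true relationship matrix, as discussed below.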
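The scaling of the block-structured relationship matrix can be illustrated with a small Python sketch. It assumes the across-family denominator is the product of the two within-family scaling terms, consistent with the statement that it simplifies to the standard denominator within a family, and it uses the expected allele frequencies (0, 0.5, or 1) implied by homozygous parents. For brevity, loci are sampled independently, which ignores the linkage simulated in the study; all function names are hypothetical.

```python
import math
import random


def expected_freqs(parent_a, parent_b):
    """Expected allele frequency per locus (0, 0.5 or 1) from parents coded 0/2."""
    return [(a + b) / 4.0 for a, b in zip(parent_a, parent_b)]


def scaling(freqs):
    """Square root of 2 * sum of p(1 - p); monomorphic loci add nothing."""
    return math.sqrt(2.0 * sum(p * (1.0 - p) for p in freqs))


def g_block(x1, f1, x2, f2):
    """Relationships between lines x1 (frequencies f1) and x2 (frequencies f2)."""
    denom = scaling(f1) * scaling(f2)
    return [[sum((a - 2 * pa) * (b - 2 * pb)
                 for a, pa, b, pb in zip(r1, f1, r2, f2)) / denom
             for r2 in x2] for r1 in x1]


def dh_line(parent_a, parent_b, rng):
    """One DH line; SIMPLIFICATION: loci inherited independently (no linkage)."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


rng = random.Random(42)
n_loci = 200
par1, par2, par3 = ([rng.choice((0, 2)) for _ in range(n_loci)] for _ in range(3))
pred = [dh_line(par1, par2, rng) for _ in range(5)]    # family par1 x par2
train = [dh_line(par1, par3, rng) for _ in range(4)]   # half-sib family par1 x par3
f_pred, f_train = expected_freqs(par1, par2), expected_freqs(par1, par3)

g_pp = g_block(pred, f_pred, pred, f_pred)     # within-family block
g_pt = g_block(pred, f_pred, train, f_train)   # across-family block
print([round(g_pp[i][i], 6) for i in range(5)])
```

Within a family of DH lines, every segregating locus contributes exactly 1 to the numerator of a diagonal element and 0.5 to the denominator, so the diagonal equals 2, that is, one plus an inbreeding coefficient of 1.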
In any BPF derived from fully homozygous parents, the expected allele frequency of a locus is known to be either 0, 0.5, or 1, depending on the genotypes of the parents. These expected frequencies were used in the computation of genomic relationships. Since, in our study, only BPFtrain had phenotypes, we used a single-group GBLUP model. Although we allowed for heterogeneous genetic variances among BPFs in the general model (Equation 1) and the derivation of reliability described below (see Appendix B), this heterogeneity enters the computation of GEBVs in BPFpred only as a constant factor (see Equation B4) and, hence, does not affect the empirical PA. Estimates of the additive and residual variances for BPFtrain were obtained by restricted maximum likelihood from the individuals in the training set using the mixed.solve function from the R package rrBLUP (Endelman 2011). The empirical PA was calculated as the correlation between GEBVs and the TBVs for the 500 predicted individuals in BPFpred.
Analysis of variance of empirical prediction accuracies
For each possible combination of fixed factors (cf. Table 1), we partitioned the total variance of the empirically observed PA into variance components caused by each random factor, where we assumed a hierarchical structure for BPFpred, BPFtrain, and the training set sample, as well as cross-classification with the factor trait. Estimates of the variance components were obtained from the following random-effects model using the function lmer of the R package lme4 (Bates et al. 2015):

$$r_{ijkl} = \mu + p_i + q_{ij} + s_{ijk} + T_l + (pT)_{il} + (qT)_{ijl} + (sT)_{ijkl} \tag{2}$$

where $r_{ijkl}$ is the empirical PA; $\mu$ is the overall mean of PA for each of the three pedigree relationships (FSF, HSF, and URF) between individuals in BPFpred and BPFtrain analyzed; $p_i$ is the effect of the $i$th BPFpred; $q_{ij}$ is the effect of the $j$th BPFtrain nested within BPFpred $i$; $s_{ijk}$ is the effect of the $k$th sample of training individuals from BPFtrain $j$ nested within BPFpred $i$; $T_l$ is the effect of trait $l$; $(pT)_{il}$ is the interaction effect of BPFpred $i$ with trait $l$; $(qT)_{ijl}$ is the interaction effect of BPFtrain $j$ nested within BPFpred $i$ with trait $l$; and $(sT)_{ijkl}$ is the interaction effect of the training set sample nested within BPFtrain and BPFpred with trait $l$, which corresponds to the residual error of the model. In the case of FSF (BPFtrain identical to BPFpred), all random factors involving BPFtrain were dropped. The degrees of freedom for each factor are shown in Table S2 in File S1.
Deterministic equations for forecasting prediction accuracy (PA)
We followed the theoretical framework of Wientjes et al. (2015) for forecasting PA within and across populations using two deterministic equations. Both equations assume that actual relationships regarding QTL are known, and were originally developed for outbred individuals. Hence, modifications are required to apply the equations to inbred individuals. As mentioned above, the outbred reference population corresponding to a BPF of fully inbred (DH) lines with an inbreeding coefficient of $F = 1$ is the F2 generation. The level of inbreeding in BPFs of DH lines is reflected in the diagonal elements of $\mathbf{G}$ calculated according to Equation 1, yielding $G_{ii} = 2$ in the special case of BPFs derived from homozygous parents.
The first approach is based on the reliability of GEBVs of each individual in (VanRaden 2008; Wientjes et al. 2013, 2015). Using the formula for the reliability of a selection index given by Mrode (2005, p. 15) and replacing the genetic covariance matrices by the genomic relationship matrices [multiplied by the corresponding genetic (co)variance components] yields the following formula that accounts for inbreeding in the predicted individual (see Appendix B):
| (3) |
where the terms are, in order of appearance: the squared genetic correlation between BPFpred and BPFtrain (here equal to 1, as assumed above); the vector of genomic relationships of an individual in BPFpred with all training individuals of BPFtrain; an identity matrix, which applies when residual errors are assumed to be independent; and the genomic relationship of the predicted individual with itself, which provides an estimate of one plus its inbreeding coefficient. Dividing by this self-relationship ensures that reliabilities are correctly scaled, given that variance components and inbreeding refer to an outbred reference population, as is the case when calculating the genomic relationship matrix according to Equation 1 (see Appendix B). The deterministic PA in BPFpred was subsequently obtained by averaging over all individuals in BPFpred, which in our case numbered 500.
The second equation was proposed by Daetwyler et al. (2008, 2010) and is based solely on population parameters, which was modified to account for unexplained variance in by accounting for different markers segregating in and (in cases where ):
| (4) |
with where is the number of markers that segregate in both and in and is the number of markers that segregate in is the sample size, where is the average inbreeding coefficient of the individuals in refers to the estimated additive variance in the (outbred) F2 generation of and is the effective number of chromosome segments. Wientjes et al. (2015) proposed an estimator for across outbred populations, which is calculated as
| (5) |
where contains all genomic relationships between individuals from and training individuals from Given a uniform pedigree relationship between individuals in and (e.g., FSF, HSF, and URF), the denominator simplifies to because If the individuals from and from have inbreeding coefficients and respectively, we propose to use (see Appendix C):
| (6) |
For DH lines from BPFs, and so that which was herein used as estimator for
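To make the second approach concrete, the following Python sketch implements the widely used base form of Daetwyler's equation, accuracy = sqrt(N h² / (N h² + M_e)). The optional `shared_fraction` argument is a hypothetical stand-in for the modification described above (the proportion of markers segregating in the predicted family that also segregate in the training family); multiplying it under the square root is an assumption made for illustration and is not taken from Equation 4. The heritability is passed in as given, so any adjustment for inbreeding must be applied beforehand.

```python
import math


def daetwyler_accuracy(n_train, h2, m_e, shared_fraction=1.0):
    """Deterministic accuracy: sqrt(shared_fraction * N*h2 / (N*h2 + M_e)).

    shared_fraction = 1.0 gives the unmodified base equation.
    The placement of shared_fraction is an ASSUMPTION for illustration.
    """
    base = n_train * h2 / (n_train * h2 + m_e)
    return math.sqrt(shared_fraction * base)


# Illustrative inputs only: N = 100, h2 = 0.6, and mean values from Table 2 (Elite).
for label, m_e, share in (("FSF", 21.00, 1.00), ("HSF", 66.26, 0.50), ("URF", 148.16, 0.40)):
    print(label, round(daetwyler_accuracy(100, 0.6, m_e, share), 3))
```

The sketch shows the two levers that lower the forecast for across-family prediction: a larger effective number of chromosome segments and a smaller fraction of shared segregating loci.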
Comparison of empirical and deterministic prediction accuracies
For all analyses except the ANOVA of the empirical PA, we considered only one sample of training individuals and dropped the sample index altogether. This simplifies the presentation of our results and corresponds to the realistic case of having only one specific sample of training individuals available. For comparison of PA between fixed factors (e.g., between sample sizes, heritabilities, or ancestral populations), as well as for evaluating the overall agreement of empirical and deterministic PAs, we calculated the general mean of PA across all random factors, separately for the empirical PA and each of the two deterministic PAs.
Causal analysis of the variation in PA among traits in GP across BPFs
Preliminary analyses showed that PA varied substantially among traits in across-family GP for HSFs and URFs, although we assumed the same polygenic architecture for all 50 simulated traits. Therefore, we devised additional simulations to investigate the underlying cause(s), using assumptions warranting almost ideal conditions for GP to largely eliminate the influence of nuisance factors on PA. We restricted these simulations to HSFs to demonstrate the key points in a simple fashion. First, we (i) chose at random a pair of HSFs, BPFpred and BPFtrain, produced from ancestral population Elite, and (ii) repeatedly sampled 1000 QTL positions from the entire set of 19,204 SNPs until we found a sample in which the proportion of QTL polymorphic in BPFpred that also segregate in BPFtrain corresponded to the average value for HSF in our study (Table 2). Second, given BPFpred, BPFtrain, and the 1000 QTL positions, we sampled 1000 sets of different QTL effects as described above. This resulted in 1000 traits with identical QTL positions, and hence an identical proportion of shared polymorphic QTL, but different QTL effects. Finally, under these near-ideal assumptions and with known QTL genotypes, we used RR-BLUP, which yields GEBVs equivalent to those of GBLUP (Habier et al. 2007), to identify among the 1000 traits the two with the lowest and highest PA and retrieved the corresponding QTL effect estimates.
Table 2. Mean ± SD of the estimated number of effective chromosome segments ($M_e$) and of the proportion of polymorphic loci in the predicted family (BPFpred) that also segregate in the training family (BPFtrain), for different pedigree relationships (FSF, HSF, and URF) between BPFpred and BPFtrain derived from either ancestral population Elite or Landrace.
| Ancestral Population | Pedigree Relationship | $M_e$ (mean ± SD) | Proportion of shared polymorphic loci (mean ± SD) |
|---|---|---|---|
| Elite | FSF | 21.00 ± 2.27 | 1.00 ± 0.00 |
| Elite | HSF | 66.26 ± 27.03 | 0.50 ± 0.10 |
| Elite | URF | 148.16 ± 77.87 | 0.40 ± 0.08 |
| Landrace | FSF | 22.24 ± 2.05 | 1.00 ± 0.00 |
| Landrace | HSF | 72.48 ± 24.83 | 0.50 ± 0.08 |
| Landrace | URF | 172.33 ± 77.03 | 0.40 ± 0.06 |
We surmised that variation in PA among traits arises from structural differences in the large chromosome segments containing cosegregating QTL alleles that DH lines inherit from their respective parents. To investigate this hypothesis, we analyzed the contribution of each chromosome segment along the entire genome to PA. The length of the chromosome segments within and was taken as the expected genetic map distance at which the LD between two QTL in BPFs falls below (cf. Giraud et al. 2014), which amounted to cM (cf. File S3 in Schopp et al. 2017). Using a sliding window approach, chromosome segments of this length moved in steps of 5 cM along each chromosome separately for each trait. Similar to Kemper et al. (2015), we subsequently calculated for each window the “local” TBV for all DH lines in the BPFpred as
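$$\mathrm{TBV}_{iw} = \sum_{l \in w} x_{il}\, a_l \tag{7}$$

where $x_{il}$ is the genotypic score coded (2,0) for DH line $i$ at QTL $l$ located in window $w$, and $a_l$ is the corresponding QTL effect. Analogously, we calculated the local GEBV in the BPFpred as

$$\mathrm{GEBV}_{iw} = \sum_{l \in w} x_{il}\, \hat{a}_l \tag{8}$$

where $\hat{a}_l$ is the estimate of $a_l$ obtained from RR-BLUP in BPFtrain provided QTL $l$ segregated in BPFtrain, and $\hat{a}_l = 0$ otherwise. Subsequently, we calculated for each window the correlation between local TBVs and local GEBVs among all 500 DH lines in BPFpred.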
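The sliding-window analysis can be sketched in Python as follows. The window width is passed as a parameter because its value is not recoverable from the text here; the 10 cM used in the toy example is arbitrary, whereas the 5 cM step follows the description above. Function names and the toy data are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation; NaN if either vector has no variation."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    if suu == 0 or svv == 0:
        return float("nan")
    return suv / math.sqrt(suu * svv)


def local_values(genotypes, effects, positions, start, end):
    """Sum of score * effect over QTL with start <= position < end, per line."""
    idx = [l for l, pos in enumerate(positions) if start <= pos < end]
    return [sum(line[l] * effects[l] for l in idx) for line in genotypes]


def sliding_window_accuracy(genotypes, true_eff, est_eff, positions,
                            chrom_length, width, step=5.0):
    """Correlation of local TBVs and local GEBVs for each window on one chromosome."""
    out, start = [], 0.0
    while start + width <= chrom_length:
        tbv = local_values(genotypes, true_eff, positions, start, start + width)
        gebv = local_values(genotypes, est_eff, positions, start, start + width)
        out.append((start, pearson(tbv, gebv)))
        start += step
    return out


# Toy data: 4 DH lines, 3 QTL at 1, 6 and 12 cM on a 15 cM chromosome.
genotypes = [[0, 0, 0], [2, 0, 2], [0, 2, 0], [2, 2, 2]]
true_eff = [1.0, 0.5, -1.0]
est_eff = [1.0, 0.5, 0.0]   # third QTL not segregating in training: estimate set to 0
result = sliding_window_accuracy(genotypes, true_eff, est_eff, [1.0, 6.0, 12.0], 15.0, 10.0)
print(result)
```

In the toy example the first window contains only QTL with correctly estimated effects, whereas the second contains a QTL whose effect estimate is zero, which lowers the local correlation.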
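Further, we defined chromosome segment substitution effects (CSSE) for the parental chromosome segments of BPFpred in window $w$ as the sum of allele substitution effects across all QTL in the window:

$$\mathrm{CSSE}^{\mathrm{pred}}_{w} = \sum_{l \in w} d^{\mathrm{pred}}_{l}\, a_l, \qquad d^{\mathrm{pred}}_{l} = \tfrac{1}{2}\left(x_{P_1 l} - x_{P_2 l}\right) \tag{9}$$

where $P_1$ and $P_2$ are the parents of BPFpred, with $P_1$ being the common parent of BPFpred and BPFtrain. Thus, $d^{\mathrm{pred}}_{l} = \pm 1$ if $P_1$ and $P_2$ carry different alleles at QTL $l$, and $d^{\mathrm{pred}}_{l} = 0$ otherwise. Values $\mathrm{CSSE}^{\mathrm{train}}_{w}$ were calculated analogously with respect to parents $P_1$ and $P_3$ of BPFtrain. Note that $d^{\mathrm{pred}}_{l} = d^{\mathrm{train}}_{l}$ if QTL $l$ segregates in both BPFpred and BPFtrain, i.e., $P_2$ and $P_3$ carry the same allele that is different from the allele in $P_1$. In contrast, $d^{\mathrm{pred}}_{l} \neq d^{\mathrm{train}}_{l}$ implies that QTL $l$ segregates in exactly one of the two HSFs BPFpred or BPFtrain. Thus, $\mathrm{CSSE}^{\mathrm{pred}}_{w} \neq \mathrm{CSSE}^{\mathrm{train}}_{w}$ only if $d^{\mathrm{pred}}_{l} \neq d^{\mathrm{train}}_{l}$ at one or more QTL in the window, and the magnitude of this difference depends on (i) the subset of QTL with $d^{\mathrm{pred}}_{l} \neq d^{\mathrm{train}}_{l}$, (ii) the relative size of $a_l$ for each QTL in the window compared with the effects of other QTL in the genome, and (iii) whether these effects have identical sign or not, which is important, especially for QTL that are closely linked. Altogether, the magnitude of $\mathrm{CSSE}^{\mathrm{pred}}_{w}$ and its difference to $\mathrm{CSSE}^{\mathrm{train}}_{w}$ for each trait along the genome were expected to strongly influence the PA of GEBVs in BPFpred estimated on the basis of BPFtrain.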
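A small Python sketch of the chromosome segment substitution effect follows. It assumes parental genotypes coded 0/2 and defines the per-QTL indicator as half the difference between the common parent and the other parent, so that it takes the values −1, 0, or +1; this sign and scaling convention is an assumption chosen for illustration. Function names and the toy parents are hypothetical.

```python
def substitution_indicator(common_parent, other_parent):
    """Per-QTL indicator: +1 or -1 if the parents differ, 0 if they share the allele."""
    return [(c - o) / 2 for c, o in zip(common_parent, other_parent)]


def csse(common_parent, other_parent, effects, qtl_in_window):
    """Sum of indicator * QTL effect over the QTL indices in one window."""
    d = substitution_indicator(common_parent, other_parent)
    return sum(d[l] * effects[l] for l in qtl_in_window)


# Toy half-sib pair: P1 is the common parent; P2 and P3 are the other parents.
p1 = [2, 0, 2, 0]
p2 = [0, 0, 0, 2]
p3 = [0, 2, 2, 2]
effects = [1.0, 0.5, -0.75, 0.25]
window = [0, 1, 2, 3]

csse_pred = csse(p1, p2, effects, window)    # family P1 x P2
csse_train = csse(p1, p3, effects, window)   # family P1 x P3
print(csse_pred, csse_train)
```

Here the first and fourth QTL segregate in both families and contribute equally to both sums, whereas the second and third QTL each segregate in only one family and account for the whole difference between the two values.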
All computations were carried out in the R statistical environment (R Core Team 2017).
Data availability
Genotypic data of the ancestral populations is available in File S2. All R packages used for simulating the data are publicly available. All simulation steps and equations are fully described within the manuscript.
Results
Means and variation of empirical PA
Figure 1A shows the distributions of empirical PA. For the standard scenario (ancestral population Elite, $N = 100$, $h^2 = 0.6$, and the genomic relationship matrix calculated from SNP markers; Table 1), the mean PA across all pairs of BPFpred and BPFtrain and traits was highest for FSF (0.79, Table S3 in File S1), and decreased by 43% for HSF (0.45) and by 60% for URF (0.32). A reverse trend was observed for the SD of PA, which amounted to 0.09 for FSF and more than doubled for HSF (0.20) and URF (0.22). The 5 and 95% quantiles of PA ranged from 0.61 to 0.89 for FSF, whereas the corresponding intervals were wider for HSF and URF.
Figure 1.
(A) Boxplots of empirical prediction accuracies in BPFs of DH lines, and (B) variance components of different factors influencing the variation of empirical PA. Parents of BPFs were sampled from ancestral population Elite, and SNP markers were used to calculate the genomic relationship matrix. Results are shown for different pedigree relationships (FSF, HSF, and URF) between the predicted family (BPFpred) and training family (BPFtrain), as well as for different sample sizes and heritabilities.
For reducing from to 25 resulted in – lower and increasing to 250 resulted in 12–18% higher for all pedigree relationships (Figure 1A). The SD increased for by 84% for FSF, but only by and for HSF and URF, respectively, because it was already large under For the SD reduced by for FSF, yet only by 6% for HSF and for URF. Altering for affected the PA similarly as altering under fixed In comparison with was reduced by– for and increased by – for depending on the pedigree relationship. The corresponding SDs changed considerably for FSF (+57 and −68%), but only marginally for HSF (8 and −11%) and URF (4 and −7%).
Deriving BPFs from ancestral population Landrace instead of Elite generally reduced the mean PA by <0.05, whereas the SD remained nearly identical (Figure 2A and Table S3 in File S1). By comparison, calculating the genomic relationship matrix from QTL instead of SNP data increased the mean PA by only 0.02, 0.03, and 0.05 for FSF, HSF, and URF, respectively, but hardly affected the SD, regardless of the pedigree relationship and the ancestral population.
Figure 2.
(A) Boxplots of empirical prediction accuracies in BPFs of DH lines and (B) variance components of different factors influencing the variation of empirical PA. Parents of BPFs were sampled from ancestral population Elite (left) or Landrace (right), and either genotypes at SNP markers or at QTL were used to calculate the genomic relationship matrix. Results are shown for different pedigree relationships (FSF, HSF, and URF) between the predicted family (BPFpred) and training family (BPFtrain) and refer to the sample size and heritability of the standard scenario (Table 1).
Analysis of variance of random factors affecting the empirical PA
Estimates of for were of similar magnitude for HSF and URF, but generally much smaller for FSF (Figure 1B). For the standard scenario, was small for FSF (0.01) and primarily attributable to By comparison, was 5.3 and 6.6 times larger for HSF and URF, respectively, with >50% contributed by followed by the residual variance (26 and 19%, respectively). All variance components not involving factor were substantially smaller, with contributing most for HSF (9%) and URF (6%).
Decreasing to 25 or to 0.3 affected the relative importance and overall magnitude of the variance components similarly for the three pedigree relationships (Figure 1B). The residual variances (FSF) and (HSF, URF) increased substantially, accompanied by a moderate increase in for FSF and decrease in for HSF and URF. Conversely, increasing to 250 or to 1.0 strongly reduced the residual variances and nearly eliminated for FSF, whereas, for HSF and URF, remained large owing to a high even under these favorable conditions.
Deriving BPFs from ancestral population Landrace instead of Elite had almost no effect on and its components (Figure 2B). Calculating the matrix from QTL instead of from SNP genotypes moderately reduced by 5% for HSF and 10% for URF, mainly due to decreasing In contrast to HSF and URF, for FSF was already minor when using SNP genotypes, leaving less room for improvement when using QTL instead of SNP genotypes than for HSF and URF, which both showed bigger changes in the absolute magnitude of the variance components than FSF.
Comparison of empirical and deterministic prediction accuracies
Figure 3 shows scatter plots for empirical versus deterministic prediction accuracies for the standard scenario. In general, empirical and deterministic accuracies for single traits agreed relatively well for FSF ( and ), but rather weakly for HSF ( and , respectively) and URF ( and , respectively). By comparison, the correlations between the means of empirical and deterministic accuracies across the 50 traits increased for FSF ( and ), but even more so for HSF (0.94 and 0.92, respectively) and URF (0.89 and 0.88, respectively), indicating that trait-specific deviations from the mean empirical accuracy hamper the agreement with deterministic accuracies, particularly for HSF and URF.
Figure 3.
Empirical prediction accuracy in BPFs of DH lines plotted against deterministic prediction accuracies and The top two graphs refer to observations for single traits ( for FSF and otherwise), and the bottom row to means over traits ( for FSF and otherwise). Parents of BPFs were sampled from ancestral population Elite and genotypes at SNP markers were used to calculate the genomic relationship matrix Results are shown for a random sample of 10,000 data points, and
For the general mean of empirical and deterministic PA across and matched very well with for all pedigree relationships and values of and (Figure S2 in File S1). By comparison, generally underestimated with increasing bias for HSF and URF as compared with (Figure S3 in File S1), and particularly for smaller values of and (Figure S2 in File S1). Calculating the matrix from QTL instead of from SNP genotypes hardly influenced the bias of deterministic accuracies (Figure S4 in File S1) and the correlations with empirical accuracies.
Causal analysis of the variation in PA among traits
Figure 4 compares two traits T1 and T2 with divergent PA for one representative pair of HSFs. Both traits had identical QTL positions and QTL genotypes in the BPFpred and BPFtrain, but different QTL effects: 376 QTL segregated in the BPFpred, 286 in the BPFtrain, and 151 of them jointly in both families. For trait T1 with high PA, the differences between chromosome segment substitution effects (CSSE) in the BPFpred and BPFtrain were generally small across the entire genome, in particular on chromosomes 2, 3, and 9, with sizeable CSSEs (Figure 4A). Conversely, for trait T2 with low PA, the CSSEs in the two families differed substantially over large parts of the genome, and even showed opposite signs on several chromosomes.
Figure 4.
(A) Chromosome segment substitution effects (CSSEA,W in red and CSSEB,W in blue) and correlation between local TBVs and local GEBVs in the predicted family (green) averaged in sliding windows (see Materials and Methods for definition). GEBVs were calculated from QTL effects estimated by RR-BLUP in training set (HSF) Results are shown for and two traits T1 and T2 with and large differences in prediction accuracy Both traits were generated from the same set of 1000 QTL with but different QTL effects. (B) Correlation between local TBVs and local GEBVs (green lines) shown together with true QTL effects (diamonds) and estimated QTL effects (circles) for T1 and T2 in on chromosome 5. Colors indicate QTL segregating in both and (orange) or only in (purple); grey bars in the background reflect the windows
The correlations between local TBVs and local GEBVs of the DH lines were closely associated with the differences between the CSSEs for the BPFpred and BPFtrain in the corresponding windows (Figure 4A). If the difference in the CSSE for a segment was small, the correlation was generally high, particularly if both CSSEs had large magnitude and identical sign (see chromosomes 2, 3 and 9 for trait T1). Conversely, if the CSSEs for a window differed and had opposite sign in the two families, the correlation between local TBV and local GEBV dropped substantially, and frequently became negative (see chromosomes 2, 5, and 8 for trait T2). Overall, the proportion of the genome showing low or even negative correlations was much smaller for trait T1 with high PA than for trait T2 with low PA.
Zooming into chromosome 5, which had a large impact on the differences between the two traits, revealed that for trait T1, all large-effect QTL that segregated in the BPFpred also segregated in the BPFtrain (Figure 4B). However, for trait T2, there was a large-effect QTL that segregated only in the BPFpred in windows with low correlation between local TBVs and local GEBVs. Neighboring windows not harboring this QTL showed higher correlations. The trends for this exemplary chromosome were consistent with other chromosomes and other HSF pairs, as well as other traits with high and low PA (results not shown).
Discussion
Experimental studies showed that PA can be highly variable for GP within, but even more so across BPFs. Moreover, PA was found to vary substantially among different target traits for distinct pairs of training and predicted families. Investigating the causes for this variability is hardly possible based on experimental data due to the limited number and sample size of available BPFs, and the generally unknown genetic architecture of agronomically important traits. Here, we used computer simulations to analyze in detail why PA varies among different combinations of training sets, prediction sets, and polygenic traits. Moreover, we demonstrate that modification of available deterministic equations enables accurate estimates of PA averaged across many polygenic traits for both within-family GP and across-family GP.
Variation in PA within and across biparental families
The average PA decreased under small and low (Figure 1A) for all pedigree relationships, as expected from theory (Daetwyler et al. 2008). This was always accompanied by a large increase in the variation of PA (Figure 1A), which was mainly caused by inflated residual errors [ for FSF, for HSF and URF, Figure 1B]. These errors capture the variation in PA that arises due to the random sampling of (i) individuals (genotypes) from the BPFtrain, and (ii) their corresponding phenotypes for a specific trait. The larger residual errors in across-family GP are presumably due to incongruent sets of QTL segregating in pairs of HSFs and URFs, which can vary substantially across traits, as reflected by the SD of (Table 2). The fact that predictions became much more robust under 100 and illustrates that large sample sizes and heritabilities are mandatory to alleviate the trait-specific sampling variance in PA. Together with the generally optimal conditions in within-family GP (Crossa et al. 2014), this nearly eliminated all variation in PA for FSF (Figure 1).
The predicted family BPFpred accounted only for a marginal proportion of variation in PA, irrespective of the pedigree relationship with BPFtrain (Figure 1B, ). For within-family GP (where BPFtrain = BPFpred), this implies that the genetic distance between the parents of a BPF has at best marginal influence on the average PA across traits, in agreement with previous studies (Lehermeier et al. 2014; Marulanda et al. 2015). This conclusion is further supported by the similar variation in PA among predicted families derived from the two ancestral populations ( Figure 2B, FSF), despite the much weaker latent pedigree structure in Landrace compared with Elite (Figure S1B in File S1). By comparison, the generally substantial influence of in FSF (Figure 1B and Figure 2B) suggests that PA strongly depends on in the training set (Figure S5 in File S1), which can be highly variable among BPF × trait combinations (Figure S6 in File S1). This is in harmony with previous studies that attributed variation in PA partially to differences in the phenotypic variance of the training set (Lehermeier et al. 2014; Marulanda et al. 2015).
For across-family GP, the expected PA depends largely on the pedigree relationship (Habier et al. 2007; Riedelsheimer et al. 2013) and on the variation in across-family genomic relationships. Since genomic relationships across families have a zero mean (if calculated according to Equation 1), their variation is equal to the mean squared genomic relationship between training and predicted individuals (Wientjes et al. 2013). Generally, PA is expected to increase proportionally with these squared relationships. In the case of BPFs, genomic relationships between families are heavily influenced by the proportion of polymorphic markers in the BPFpred () segregating also in the BPFtrain (Figure S7 in File S1). Therefore, PA for across-family GP depends primarily on the magnitude of because larger implies that a greater proportion of the genetic variance in the BPFpred can be explained by the QTL in BPFtrain. Accordingly, the variation in among combinations of different HSFs or URFs (Figure S1D in File S1) was largely responsible for the notable contribution of to the total variation in PA (Figure 1B). Altogether, the much larger for across-family GP, compared to within-family GP, was mainly due to the overriding influence of besides the considerable contribution of to (Figure 1B, FSF vs. HSF or URF). Unraveling the genetic causes for this complex interaction required additional analyses, which are discussed in depth in the next section.
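The proportion of loci polymorphic in the BPFpred that also segregate in the BPFtrain can be obtained directly from the parental genotypes, because a locus segregates in a BPF of inbred parents only if the two parents carry different alleles. The following is a minimal sketch under that assumption; the function names and the 0/1-coded parental genotypes are hypothetical and serve for illustration only.

```python
import numpy as np

def segregating(parent1, parent2):
    """Boolean mask of loci at which two homozygous parents differ."""
    return np.asarray(parent1) != np.asarray(parent2)

def shared_polymorphic_fraction(pred_parents, train_parents):
    """Fraction of loci polymorphic in BPFpred that also segregate in BPFtrain."""
    poly_pred = segregating(*pred_parents)
    poly_train = segregating(*train_parents)
    n_pred = poly_pred.sum()
    if n_pred == 0:
        raise ValueError("no locus segregates in the predicted family")
    return float((poly_pred & poly_train).sum() / n_pred)

# Hypothetical homozygous parental genotypes coded 0/1 at eight loci.
p0 = np.array([0, 0, 1, 1, 0, 1, 0, 1])  # common parent of both half-sib families
pa = np.array([1, 0, 0, 1, 1, 1, 1, 0])  # second parent of BPFpred
pb = np.array([1, 1, 0, 1, 0, 1, 0, 0])  # second parent of BPFtrain

fraction_hsf = shared_polymorphic_fraction((p0, pa), (p0, pb))  # half-sib training family
fraction_fsf = shared_polymorphic_fraction((p0, pa), (p0, pa))  # training within the same family
```

In this toy example, five loci segregate in the BPFpred and three of them also segregate in the BPFtrain, whereas the fraction is one by construction when training and prediction use the same family.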
Sampling of training individuals from a given BPFtrain barely contributed to the variation in PA, for both within- and across-family GP (Figure 1B, and ). Thus, compared with structured populations or diversity panels, there is little room for improvement by applying optimization algorithms accounting for genomic relationships in the sampling of training individuals within BPFs (Rincent et al. 2012; Akdemir et al. 2015; Bustos-Korts et al. 2016), confirming previous findings (Lorenz and Smith 2015; Marulanda et al. 2015). This is because already modest sample sizes (e.g., ) enable the Mendelian sampling term in the BPFtrain to be sufficiently captured. Nevertheless, we recommend to achieve a high mean and small variance of PA (Table S3 in File S1) arising from sampling of genotypes from a given BPFtrain (Figure 1B).
Previous experimental studies found generally higher levels of variation in PA, particularly for within-family GP (Riedelsheimer et al. 2013; Lehermeier et al. 2014; Lian et al. 2014). This is most likely attributable to miscellaneous additional factors present in these studies, which were not accounted for in our simulations. These factors include (i) small prediction set size, (ii) analysis of different types of progeny (F2 or backcross generations and DH lines derived from them), (iii) variation in QTL-SNP LD within BPFs due to low marker density, (iv) nonadditive gene action due to epistasis, and (v) estimation error in which affects calculation of PA from predictive ability. Further, the various agronomic traits investigated in the experimental studies likely differed in their genetic architecture, which further increases the total variation in PA compared with the polygenic traits simulated in our study (Figure 1B). Consequently, our results should be regarded as a lower bound for the variation in PA that must be expected in practice for a given and
Unraveling the variation among traits in across-family GP
We adopted the concept of local breeding values (cf. Kemper et al. 2015) to investigate the relationship between the strong variation in PA among traits and the large chromosome segments that DH lines of a BPF inherit from their parents. The latter entails strong LD between QTL alleles and consequently small (Table 2), which is very different from the situation found in diverse populations such as cattle breeds () (Daetwyler et al. 2010; Wientjes et al. 2013). Thus, only a small number of local TBVs contribute to the “global” TBV of predicted individuals. Similarly, the PA can be thought of as the average accuracy of local GEBVs estimated from the training data, weighted by their relative contribution to the global TBV in the BPFpred. As a consequence of the small in BPFs, the accuracy of local GEBVs is prone to much larger sample variance than would be the case in more diverse populations. To illustrate this point, we examined two traits with contrasting PA for a given pair of HSFs as an example (Figure 4).
Of all QTL, only those that segregated in the BPFpred (376/1000, Figure 4) contributed to the variance in local TBVs, which were estimated by local GEBVs from the training set. In our example, trait T1 with high PA showed, on average, much higher correlations between local TBVs and local GEBVs in the BPFpred along the entire genome than trait T2 with low PA (Figure 4A). For the trait with low PA, we found a larger proportion of local GEBVs that provided a false prediction signal, in the sense that negative effects were estimated for favorable parental chromosome segments and vice versa. These discrepancies between local TBVs and local GEBVs trace back to different chromosome segment substitution effects (CSSE, Equation 9) between the BPFpred and BPFtrain (Figure 4A), which, in the case of HSFs, occur if their noncommon parent carries different alleles at one or more QTL on the segment. If this is the case, one of the two BPFs will be monomorphic for the respective QTL. The effect of such a QTL compared with other QTL on a chromosome segment that may be polymorphic in both the BPFpred and BPFtrain determines the difference in CSSE between two families. For instance, if the variance in local TBVs among predicted individuals is dominated by a large-effect QTL, which is monomorphic in the training set, the ranking of local GEBVs based on the other polymorphic QTL located on this segment might deviate substantially from the ranking of local TBVs, resulting in low local PA (Figure 4B, ). The frequency of inaccurate local GEBVs along the whole genome together with the variance explained by the corresponding local TBVs will finally determine the PA of across-family GP. Hence, two traits with the same number and positions of QTL might have very different PA, depending on the effects of QTL that are poly- or monomorphic across the training and prediction set.
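The role of a QTL that is monomorphic in the training family can be illustrated with a toy calculation for a single chromosome segment. The sketch below assumes that the substitution effect of a segment is the sum of QTL effects weighted by the allelic difference between the noncommon and the common parent; this is an illustrative simplification and not necessarily identical to Equation 9. All genotypes, effects, and names are hypothetical.

```python
import numpy as np

def csse(effects, donor_parent, common_parent):
    """Summed QTL effect of replacing the common parent's segment by the donor's."""
    allele_diff = np.asarray(donor_parent) - np.asarray(common_parent)
    return float(np.dot(np.asarray(effects, dtype=float), allele_diff))

# Hypothetical 0/1 QTL genotypes of homozygous parents at four QTL on one segment.
common = np.array([0, 0, 0, 0])        # parent shared by both half-sib families
pred_parent = np.array([1, 1, 1, 0])   # noncommon parent of BPFpred
train_parent = np.array([1, 1, 0, 0])  # noncommon parent of BPFtrain (third QTL monomorphic)

effects_t1 = [1.0, 0.5, 0.1, 0.3]   # trait with a small effect at the third QTL
effects_t2 = [0.5, 0.3, -2.0, 0.3]  # trait with a large effect at the third QTL

# (CSSE in BPFpred, CSSE in BPFtrain) for each trait
csse_t1 = (csse(effects_t1, pred_parent, common), csse(effects_t1, train_parent, common))
csse_t2 = (csse(effects_t2, pred_parent, common), csse(effects_t2, train_parent, common))
```

For the first trait, the two substitution effects are nearly identical (1.6 vs. 1.5), so effects estimated in the training family rank the segments of the predicted family correctly. For the second trait, the large-effect QTL that segregates only in the BPFpred reverses the sign (-1.2 vs. 0.8), which corresponds to the false prediction signal described above.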
This explains also why and thereby across-family genomic relationships, were closely associated with the average PA across many traits for different pairs of HSF and URF (Figure S7 in File S1), but poorly associated with PA for individual traits (Figure 3). Additional simulations showed further that reducing (i) the number of chromosomes on which QTL were located, or (ii) the total number of QTL, results in increased variation in PA (Figure S8 in File S1). Both these alterations reduce the number of local TBVs discernible for a trait, which underlines the relevance of small (i.e., a low number of segments carrying QTL) for the variation in PA.
In conclusion, the large variation in PA among traits observed for across-family GP is caused by the strong LD among linked QTL within BPFs, and the resulting small effective number of chromosome segments contributing to polygenic traits, in combination with different QTL segregating across BPFs. Our analyses exemplify that BPFs represent a special case regarding the possibly strong fluctuations in PA, which is—to this extent—not expected for genetically more diverse populations.
Influence of LD in the ancestral population on the expected accuracy of GP across BPFs
Differences in the extent of LD in ancestral populations Elite and Landrace (Figure S1A in File S1) translated into sizable differences in QTL-SNP linkage phase similarity among URFs derived from these populations (Figure S1C in File S1). Surprisingly, this barely affected across URFs (Figure 2A and Table S3 in File S1). The low relevance of linkage phase similarity across URFs was confirmed by the similar PAs when substituting the SNP- with a QTL-derived matrix (Figure 2A), which eliminates the influence of this factor. This reflects most likely the overriding influence of on PA across URFs, because the mean was similar for URFs derived from the two ancestral populations (Figure S1D in File S1). Thus, the higher mean in PA for HSFs compared with URFs seems to be attributable to higher values (Table 2) rather than to the fact that QTL-SNP linkage phases are always consistent across HSF (Lehermeier et al. 2014), but not necessarily across URF. This corrects a conjecture of Riedelsheimer et al. (2013), who suspected that low PA obtained from certain URFs was due to low linkage phase similarity with the respective BPFpred.
Deterministic equations for forecasting PA within and across BPFs
Forecasting PA based on estimated reliabilities of GEBVs requires that unrelated individuals have an expected genomic relationship of zero (Goddard et al. 2011; Wientjes et al. 2015). This can be achieved by a block-structured matrix based on population-specific allele frequencies (e.g., Chen et al. 2013). Preliminary analyses showed that in the calculation of (Equation A5), correct treatment of SNPs polymorphic only in either BPFtrain or in BPFpred is very important. Different from empirical PAs, which remain unaffected by (see Appendix A), deterministic PAs across BPFs can be grossly inflated by ignoring in the calculation of (results not shown). While is generally high across diverse populations such as breeds of cattle (Matukumalli et al. 2009), it can fall to <0.4 across different BPFs produced from inbred parents in plant breeding (Figure S1D in File S1 and Table 2). Calculating according to our improved method (Equation 1) largely eliminated the bias in deterministic accuracies attributable to and is therefore a prerequisite for applying Equation 3 to GP across BPFs.
Accounting for inbreeding in the original reliability equation (see Appendix B for derivation), together with the modifications to the genomic relationship matrix, resulted in excellent agreement between empirical and deterministic accuracies averaged across traits, which is consistent with the findings of Wientjes et al. (2015) for cattle populations. However, the trait-dependent variation in empirical PA observed for GP across BPFs cannot be accounted for by this equation. This is because, for a given set of training and predicted individuals and two traits with the same heritability but different QTL effects, the deterministic accuracy would be identical, yet the empirical accuracy can differ substantially, as illustrated in Figure 3 and Figure 4.
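Why a reliability-based forecast cannot distinguish between traits becomes apparent when it is written as a function. The following sketch uses the generic selection-index form of the reliability, assuming a genetic correlation of one between families, phenotypic covariance proportional to the genomic relationship matrix plus a scaled identity matrix, and a variance of the true breeding value proportional to the diagonal element of the predicted individual. It is not claimed to reproduce Equation 3 exactly, and all names are hypothetical.

```python
import numpy as np

def deterministic_accuracy(g_train, g_cross, g_self, h2):
    """Selection-index accuracy of the GBLUP prediction for one individual.

    g_train: (n, n) genomic relationships among training individuals
    g_cross: (n,) relationships between the predicted and the training individuals
    g_self:  diagonal element of the predicted individual (1 + inbreeding coefficient)
    h2:      heritability, used here as sigma_g^2 / (sigma_g^2 + sigma_e^2)
    """
    g_train = np.asarray(g_train, dtype=float)
    g_cross = np.asarray(g_cross, dtype=float)
    lam = (1.0 - h2) / h2                      # ratio of error to genetic variance
    v = g_train + lam * np.eye(g_train.shape[0])
    reliability = g_cross @ np.linalg.solve(v, g_cross) / g_self
    return float(np.sqrt(reliability))

# Toy case: one fully inbred training line that is identical to the predicted line.
acc_perfect = deterministic_accuracy([[2.0]], [2.0], 2.0, h2=1.0)
acc_half = deterministic_accuracy([[2.0]], [2.0], 2.0, h2=0.5)
acc_unrelated = deterministic_accuracy([[2.0]], [0.0], 2.0, h2=0.5)
```

The function takes only genomic relationships and the heritability as input. QTL effects do not enter, so two traits with the same heritability necessarily receive the same forecast for a given pair of training and predicted individuals.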
Forecasting PA within FSF by Daetwyler et al.’s (2008, 2010) equation based on population parameters has been widely used in plant breeding (Lorenz 2013; Riedelsheimer et al. 2013; Lian et al. 2014). However, estimates of can differ substantially (Riedelsheimer and Melchinger 2013; Wientjes et al. 2013) between the various proposed formulas to estimate from the effective population size and genome length (Goddard 2009; Meuwissen and Goddard 2010; Goddard et al. 2011). Moreover, estimation of itself is problematic, because it assumes a base population of unrelated founders, which is often impossible to define in practice (cf. Figure S1B in File S1, Elite). Following Goddard et al. (2011), we calculated directly from the variance of genomic relationships, with extensions devised by Wientjes et al. (2015, 2016) for GP across populations (Equation 5). This has the advantage that is computed from the actual genotypes for which the PA is to be forecasted. The calculation of required in Equation 4 must account for inbreeding (Equation 6), because the variance in genomic relationships increases with the inbreeding coefficient (see Appendix C). Ignoring inbreeding would result in underestimation of and strong overestimation of the deterministic accuracy
An important assumption of the equation of Daetwyler et al. is that the entire genetic variance in the prediction set is explained by QTL segregating in the training set (cf. in Wientjes et al. 2016). This holds true for FSF (), but is violated for GP across BPFs (Table 2). As a solution for this problem, we propose multiplication with in calculating (Equation 4), which efficiently reduced the strong upward-bias observed otherwise (results not shown). With these modifications, empirical and deterministic accuracies agreed reasonably well when averaged across traits, but forecasting was problematic for individual traits for the same reasons as discussed above for (Figure 3). Compared with previous experimental studies (Riedelsheimer et al. 2013; Lian et al. 2014), we found overall better agreement of and for single traits in within-family GP (Figure 3). We suppose that, in addition to the lower variation in empirical PA (Figure 1), this is likely attributable to smaller deviations between estimated and true (Lian et al. 2014) when dealing with real traits of diverse genetic architecture.
An upward bias in deterministic PA must generally be expected if SNPs are not a good approximation of QTL due to incomplete QTL-SNP LD, (cf. vs. in Wientjes et al. 2016), leading to “missing heritability” in genomic studies (Yang et al. 2010). This is because empirical PA decreases as less variance at QTL is explained by SNPs under incomplete LD, whereas deterministic PA is hardly affected (Figure S9 in File S1). However, our results show that this is barely relevant in BPFs (Figure 3 vs. Figure S4 in File S1), if large chromosome segments are covered sufficiently by markers. Thus, a sizable reduction in empirical PA and overestimation of deterministic PA must only be expected under very low marker density (<100 SNPs) as in the study of Lian et al. (2014). Although these authors argued that 100 SNPs were likely sufficient for within-family GP in maize, our results indicate that at least 1000 and 2500 SNPs should be used for within- and across-family GP, respectively, to obtain acceptable empirical PA and minimize the bias in deterministic PA (Figure S9 in File S1). If such numbers are not available, deterministic equations must additionally account for incomplete LD (Wientjes et al. 2016), using, for example, multiplication with the average LD () between adjacent markers as proxy for the QTL-SNP LD (Lian et al. 2014).
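The average LD between adjacent markers mentioned above can be computed from a genotype matrix as the mean squared correlation of neighboring marker columns. The sketch below assumes that all markers lie on one chromosome and are sorted in map order; marker pairs that include a monomorphic marker are skipped. The function name and the toy genotypes are hypothetical.

```python
import numpy as np

def mean_adjacent_r2(genotypes):
    """Mean squared correlation between adjacent markers.

    genotypes: (n_lines, n_markers) array, markers in map order on one chromosome.
    """
    x = np.asarray(genotypes, dtype=float)
    left = x[:, :-1] - x[:, :-1].mean(axis=0)   # centered marker j
    right = x[:, 1:] - x[:, 1:].mean(axis=0)    # centered marker j + 1
    cross = (left * right).sum(axis=0)
    denom = (left ** 2).sum(axis=0) * (right ** 2).sum(axis=0)
    keep = denom > 0                             # drop pairs with a monomorphic marker
    if not keep.any():
        raise ValueError("no pair of adjacent polymorphic markers")
    return float(np.mean(cross[keep] ** 2 / denom[keep]))

# Four hypothetical DH lines scored at three markers (0/1 coding).
toy = np.array([[0, 0, 1],
                [0, 0, 0],
                [1, 1, 0],
                [1, 1, 1]])
toy_r2 = mean_adjacent_r2(toy)
```

In the toy data, the first two markers are in complete LD and the last two are uncorrelated, giving a mean of 0.5. With several chromosomes, the function would be applied per chromosome so that the last marker of one chromosome is not paired with the first marker of the next.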
Besides low marker density, incomplete QTL-SNP LD can result from differences in the allele frequency distribution at QTL and SNPs (Goddard et al. 2011), inter alia due to ascertainment bias of SNP chips. These differences are in reality unknown, and, as treated herein, commonly not accounted for in simulation studies (Daetwyler et al. 2013). For GP across BPFs, differences in allele frequencies at QTL and SNPs in the ancestral population (cf. Figure S1E in File S1) would translate into different values at SNPs and QTL across BPFs, because the smaller the minor allele frequency, the larger the chances of a locus being monomorphic in a BPF. Thus, calculation of might be inflated by an upward-bias in (Equation 5), in addition to the possible overestimation of across-family genomic relationships affecting both and (Equations 3 and 4). Further research is needed to show how strongly overestimation of can affect application of deterministic equations in practice, for example, by comparing the equations under chip-based and sequencing-based genotyping (Pérez-Enciso et al. 2015).
We assumed in our derivations that the genetic correlation among BPFs = 1 (see Appendix B), which is expected to hold under a purely additive-genetic model, as applies in the absence of epistasis to (i) testcross performance for a given tester, and (ii) to per se performance of completely homozygous lines (Melchinger 1987). By comparison, in cattle breeds or diverse germplasm in plant breeding, genetic correlations between populations are typically < 1 (Karoui et al. 2012; Lehermeier et al. 2015). Accounting for genetic correlations is possible with multi-group models, but these require sufficient phenotypic data for the predicted population as well as estimating these correlations, which seems impractical in the case of GP of a single BPF.
Despite generally promising results for both deterministic equations, we recommend using (Equation 3), because it depended less on the relatedness between BPFs, and (Figures S2 and S3 in File S1), rendering it more robust across a wide range of scenarios. Since and (as implemented here) require genotypic data of both the training and predicted individuals, they can be applied only after obtaining genotypic data of the individuals to be predicted. Alternatively, for newly planned crosses we propose to use computer simulations to generate in silico virtual genotypic data of the corresponding BPFs using known genotypes of the parents and genetic map information of the markers, as conducted in this study (cf. Mohammadi et al. 2015). This would make both equations accessible prior to generating new crosses for use in optimizing training set designs and allocation of resources.
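Generating virtual genotypes of a planned cross from parental genotypes and a genetic map can be sketched in a few lines. The example below simulates DH lines from the F1 of two inbred parents for one chromosome, assuming crossovers without interference (Haldane mapping function). It is a simplified illustration rather than the simulation pipeline used in this study, and all names and values are hypothetical.

```python
import numpy as np

def simulate_dh_lines(parent1, parent2, map_pos_morgan, n_lines, rng):
    """Simulate DH lines derived from the F1 of two inbred parents (one chromosome)."""
    p1 = np.asarray(parent1)
    p2 = np.asarray(parent2)
    dist = np.diff(np.asarray(map_pos_morgan, dtype=float))
    rec = 0.5 * (1.0 - np.exp(-2.0 * dist))           # Haldane recombination fractions
    start = rng.integers(0, 2, size=(n_lines, 1))     # parental origin at the first marker
    switch = rng.random((n_lines, dist.size)) < rec   # crossover between adjacent markers
    # Parental origin along the chromosome: flips at every crossover.
    origin = np.concatenate([start, switch], axis=1).cumsum(axis=1) % 2
    return np.where(origin == 0, p1, p2)

# Hypothetical parents (0/1 coding) and marker positions in Morgan.
p1 = np.array([0, 0, 0, 0, 1])
p2 = np.array([1, 1, 1, 1, 1])
positions = [0.0, 0.1, 0.5, 1.0, 1.5]

dh = simulate_dh_lines(p1, p2, positions, n_lines=50, rng=np.random.default_rng(1))
# Markers at identical positions never recombine, so each line copies one parent.
linked = simulate_dh_lines(p1, p2, [0.0] * 5, n_lines=50, rng=np.random.default_rng(2))
```

Loci at which the parents carry the same allele, such as the last marker in this example, are monomorphic in the simulated family. Genomic relationships between such virtual families can then be computed before any cross is made.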
Conclusions and extensions to multi-family training sets
We demonstrated that the empirical PA in BPFs of inbred lines is prone to various sources of variation, which differ strongly in their relevance for GP within and across BPFs. It should be stressed that the conclusions drawn from our study do not only apply to DH lines, but also to inbreds developed by recurrent selfing and most likely also to partly inbred generations. Overall, our results corroborate within-family GP as a valuable and robust tool for the implementation of GP in plant breeding, provided the training set meets minimum standards for () and (0.3). However, the need for phenotypes from the predicted family represents the main drawback of within-family GP, because this increases both the costs and the time needed until selection can be applied.
Our simulations on across-family GP were restricted to the simple strategy of using only a single HSF or URF for model training. This provided a manageable framework for analyzing the underlying causes affecting variation in PA. For a given BPFpred, we showed: (i) the PA in across-family GP expected across many traits differs systematically between different BPFtrain, even if they have the same pedigree relationship with the BPFpred, (ii) deterministic equations enable accurate forecasts of the PA across traits for given pairs of BPFpred and BPFtrain, and (iii) large variation in the PA among traits hampers the forecasting. Therefore, it is very unlikely to find a single BPFtrain that performs uniformly best across all target traits. This means that caution must be exercised when applying rules of thumb or deterministic equations for choosing the BPFtrain in GP of a specific trait given BPFpred. This issue can be even more severe if (i) traits deviate from the polygenic architecture assumed in our simulations, or (ii) in the BPFs is smaller than in maize due to fewer chromosomes and/or smaller genome size (Figure S8 in File S1). Thus, identification of useful, trait-specific BPFtrain might only be possible by directly evaluating the empirical PA for a small sample of individuals from the BPFpred. However, this would largely eliminate the time- and cost-related advantages of genomic selection based on previously available data from BPFs.
In practice, breeders generally do not rely on single-family training sets in GP across BPFs, but rather use multi-family training set designs for the sake of increasing sample size (Heffner et al. 2011; Riedelsheimer et al. 2013; Hickey et al. 2014; Jacobson et al. 2014; Lehermeier et al. 2014). Another important advantage of multi-family over single-family training sets in across-family GP most likely stems from the increased proportion of causal loci segregating in both the BPFpred and the training set, which we identified as the core problem leading to the large variation of PA in GP across single BPFs. One critical question in this context is whether a single BPFtrain that is poorly predictive of a given BPFpred (e.g., an HSF that yields PA close to zero, Figure 4) is detrimental or harmless for PA if combined with other predictive BPFs for extending the training set. The problem might be exacerbated if URF are included in multi-family training sets (cf. Albrecht et al. 2011), which might come at the expense of reduced linkage phase similarity (cf. Figure S1C in File S1) between a multi-family training set and the BPFpred (Lorenz and Smith 2015). Further research is warranted to investigate whether the current design of training sets can be improved by identifying and excluding adverse families to avoid disappointing outcomes of GP in BPFs.
Supplementary Material
Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.300076/-/DC1.
Acknowledgments
We thank Chris-Carolin Schön, Matthias Westhues, Tobias Schrag, and Willem Molenaar for valuable suggestions to improve the content of this manuscript. P.S. acknowledges Syngenta for partially funding this research by a Ph.D. fellowship, and A.E.M. acknowledges the financial contribution of the International Maize and Wheat Improvement Center/Gesellschaft für Internationale Zusammenarbeit (CIMMYT/GIZ) through the Climate Resilient Maize for Asia (CRMA) Project 15.78600.8-001-00.
Appendix A
Genomic Relationships Between DH Lines from BPFs and Calculated with Different Methods
Suppose and are two DH lines from BPFs and respectively. Let and denote the set of loci (SNPs or QTL, depending on the context) that are polymorphic in or in polymorphic in polymorphic in and polymorphic in and respectively. Since and are BPFs, we have, under Mendelian inheritance,
| (A1) |
Thus,
| (A2) |
where and denote the number of elements in set and respectively. Defining and we get, with Equation 1, for completely homozygous lines
| (A3) |
For calculating the elements of the genomic relationship matrix according to the modification proposed in Equation 1, we obtain
| (A4) |
where refers to the simple matching coefficient (Sneath and Sokal 1973), also known as the IBS (identity by state) coefficient (Astle and Balding 2009), between and with respect to the loci set Using the original formula of Chen et al. (2013), which extends Method 1 of VanRaden (2008) to the case of two populations, we obtain the genomic relationship matrix with elements
| (A5) |
Extending Method 2 of VanRaden (2008) to the case of two populations, we obtain the genomic relationship matrix as follows
| (A6) |
where summation is only possible for because for the denominator is zero, where denotes subset of polymorphic in but not in or polymorphic in but not in Thus, we obtain
| (A7) |
in BPFs with allele frequencies equal to 0.5 at segregating loci, with Consequently, if and only if (e.g., if ), but otherwise Note that the empirical PA of GBLUP is invariant to (cf. Strandén and Christensen 2011), but affects uniformly the scaling of GEBVs and reliabilities thereof. Note also that calculated with regard to all loci (set ) can deviate from because
| (A8) |
and as well as can vary between pairs of and where and denote subsets of polymorphic in but not in and polymorphic in but not in respectively.
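The simple matching (IBS) coefficient used in this appendix is the proportion of loci at which two lines carry the same allele. As a small illustration, the sketch below computes it for two completely homozygous lines, optionally restricted to a subset of loci; the function name, the coding of genotypes, and the example data are hypothetical.

```python
import numpy as np

def simple_matching(line_i, line_j, loci=None):
    """Proportion of loci at which two homozygous lines carry the same allele.

    loci: optional boolean mask or index array selecting the loci set.
    """
    x = np.asarray(line_i)
    y = np.asarray(line_j)
    if loci is not None:
        x, y = x[loci], y[loci]
    if x.size == 0:
        raise ValueError("empty loci set")
    return float(np.mean(x == y))

# Two hypothetical DH lines scored at five loci (0/1 coding).
line_i = np.array([0, 1, 1, 0, 1])
line_j = np.array([0, 0, 1, 1, 1])

sm_all = simple_matching(line_i, line_j)
# Restriction to a loci subset, e.g., loci polymorphic in both families.
subset = np.array([True, False, True, False, False])
sm_subset = simple_matching(line_i, line_j, loci=subset)
```

The two lines match at three of five loci overall, but at both loci of the subset, which shows how strongly the coefficient depends on the loci set it refers to.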
Appendix B
Calculation of the Reliability of GP across Populations for Inbred Individuals
Consider across-population GP for two populations A (= prediction set) and B (= training set), which are not necessarily BPFs. Using well-known results about selection indices (Mrode 2005), the breeding value for individual which may be inbred, is predicted with information from its genotype and the phenotypic and genotypic information from the training set as:
| (B1) |
in which is the predicted breeding value, is the true breeding value of individual in population and is a vector with phenotypes of individuals from population corrected for fixed effects.
The covariance between the true breeding value of an individual from population A and the phenotypes of individuals from population B is:
| (B2) |
where is the genetic correlation between and (which represents the correlation between the breeding value in population and the breeding value in population for the individuals in ), is the vector of genomic relationships between individual and the training individuals from that can be estimated by Equation 1 in the main text, and and are the square root of the additive variances in and respectively. Finally,
| (B3) |
where is the genomic relationship matrix among training individuals in is the covariance matrix of the errors in the observation vector and the breeding value is predicted as:
| (B4) |
We are interested in the reliability
| (B5) |
for the estimated breeding value of individual being defined with respect to population Since together with Equation B2, we obtain
| (B6) |
so the reliability is:
| (B7) |
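The selection-index steps of this appendix reduce to a few matrix operations. The following is a minimal sketch under stated assumptions, not the exact expression derived above: the covariance vector is taken as the genetic correlation times the product of the additive standard deviations times the relationship vector, the phenotypic covariance matrix as the training relationship matrix times the additive variance in B plus the error covariance matrix, and the variance of the true breeding value as the self-relationship of the individual (1 + F for an inbreeding coefficient F) times the additive variance in A. All function and argument names are hypothetical.

```python
import numpy as np

def selection_index_reliability(g, G_B, R, var_a_A, var_a_B, r_G, G_ii):
    """Reliability of a selection-index prediction for one target individual.

    g       : relationships between the target and the n training individuals
    G_B     : n x n genomic relationship matrix of the training set
    R       : n x n covariance matrix of the errors
    var_a_A : additive variance in the prediction population
    var_a_B : additive variance in the training population
    r_G     : genetic correlation between the two populations
    G_ii    : self-relationship of the target (assumed 1 + F)
    """
    c = r_G * np.sqrt(var_a_A * var_a_B) * g   # Cov(true breeding value, phenotypes)
    P = G_B * var_a_B + R                      # Var(phenotypes)
    b = np.linalg.solve(P, c)                  # index weights
    cov_true_pred = c @ b                      # equals the variance of the prediction
    return float(cov_true_pred / (G_ii * var_a_A))

# Limiting case: one training line identical to the target DH line, no error.
rel_clone = selection_index_reliability(
    g=np.array([2.0]), G_B=np.array([[2.0]]), R=np.array([[0.0]]),
    var_a_A=1.0, var_a_B=1.0, r_G=1.0, G_ii=2.0)

# Same setting with a genetic correlation of 0.5 between populations.
rel_half = selection_index_reliability(
    g=np.array([2.0]), G_B=np.array([[2.0]]), R=np.array([[0.0]]),
    var_a_A=1.0, var_a_B=1.0, r_G=0.5, G_ii=2.0)

print(rel_clone, rel_half)
```

Under these assumptions the reliability scales with the square of the genetic correlation, so halving the correlation reduces the reliability to one quarter.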
Appendix C
Calculation of $M_e$ and the Variance of Genomic Relationships of Inbred Populations
Consider two populations, A (= prediction set) and B (= training set), that are not necessarily BPFs. Based on the theory of Goddard et al. (2011), Wientjes et al. (2015) suggested calculating the effective number of chromosome segments shared between the two populations as (see their Equation 20)
| (C1) |
If all individuals in have the same pedigree relationships with the individuals in which holds true for pairs of BPFs, we have so that If the genotypes of loci pairs are stochastically independent, as follows from the assumption of independent segregation in the definition of (Daetwyler et al. 2010), and applies in reality to all loci pairs located on different chromosomes (i.e., a fraction of at least of all loci pairs, assuming chromosomes with equal length and number of loci), we have
| (C2) |
where and are defined as in Appendix A, with and and and are stochastically independent because and are random samples from and respectively. Thus, using results about the product of two stochastically independent random variables (Mood 1974), we obtain
| (C3) |
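The general result on the product of two stochastically independent random variables can be checked numerically. The sketch below uses the standard identity Var(XY) = Var(X)Var(Y) + Var(X)E(Y)^2 + Var(Y)E(X)^2 and compares it with a Monte Carlo estimate; the normal distributions, their parameters, and the function name are arbitrary assumptions chosen for illustration and are not taken from the derivation above.

```python
import numpy as np

def var_product_independent(mean_x, var_x, mean_y, var_y):
    """Variance of X * Y for stochastically independent X and Y."""
    return var_x * var_y + var_x * mean_y ** 2 + var_y * mean_x ** 2

# Monte Carlo check with arbitrary (hypothetical) distributions.
rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(loc=0.5, scale=1.0, size=n)    # E(X) = 0.5, Var(X) = 1
y = rng.normal(loc=-2.0, scale=0.5, size=n)   # E(Y) = -2,  Var(Y) = 0.25

analytic = var_product_independent(0.5, 1.0, -2.0, 0.25)
empirical = float(np.var(x * y))
print(analytic, empirical)
```

When both variables have mean zero, as for centered genotype scores, the identity reduces to the product of the two variances.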
Under inbreeding with inbreeding coefficient in population applying well-known results on the effects of inbreeding on the additive genetic variance (Falconer and Mackay 1996), we obtain
| (C4) |
and a similar result for population Combining Equations C2, C3, and C4 yields
| (C5) |
For inbred generations derived by recurrent selfing, this equation may hold only approximately because it assumes stochastic independence among all loci pairs. However, it can be proven that Equation C5 holds strictly, without this requirement, for DH lines and F2 individuals, i.e.,
| (C6) |
In the original publications (Goddard et al. 2011; Wientjes et al. 2015) connecting with the variance in genomic relationships among individuals, it was assumed that the individuals were noninbred. However, if they are DH lines or from another inbred generation, this is expected to affect so that for the case of fixed pedigree relationships between and the estimator of becomes
| (C7) |
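The effect of inbreeding on the variance of genomic relationships can be illustrated with a small simulation. The following is a minimal sketch under strong simplifying assumptions that are not those of the study: two unrelated populations, stochastically independent loci with an allele frequency of exactly 0.5, and a Method 1 type scaling by 2 Σ p(1 − p). The function names are hypothetical. With m independent loci, the variance of the relationships between F2 individuals of the two populations is 1/m in this setting, whereas for DH lines (F = 1 in both populations) it is (1 + F)^2/m = 4/m.

```python
import numpy as np

def simulate_genotypes(n, m, inbred, rng):
    """Genotype scores at m independent loci with allele frequency 0.5.

    inbred=True  -> DH lines (scores 0 or 2)
    inbred=False -> F2 individuals (scores 0, 1, or 2)
    """
    if inbred:
        return 2 * rng.integers(0, 2, size=(n, m))
    return rng.binomial(2, 0.5, size=(n, m))

def cross_relationship_variance(n, m, inbred, seed=0):
    """Variance of genomic relationships between two unrelated populations."""
    rng = np.random.default_rng(seed)
    p = 0.5
    Z_A = simulate_genotypes(n, m, inbred, rng) - 2 * p   # prediction set
    Z_B = simulate_genotypes(n, m, inbred, rng) - 2 * p   # training set
    G_AB = Z_A @ Z_B.T / (2 * m * p * (1 - p))
    return float(np.var(G_AB))

m = 200
var_f2 = cross_relationship_variance(n=300, m=m, inbred=False)
var_dh = cross_relationship_variance(n=300, m=m, inbred=True)
print(var_f2 * m, var_dh * m, var_dh / var_f2)
```

In this toy setting, taking the reciprocal of the variance without correcting for inbreeding would recover about m/4 instead of m for DH lines, which is why the inbreeding coefficients have to enter the estimator.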
Footnotes
Communicating editor: J.-L. Jannink
Literature Cited
- Akdemir D., Sanchez J. I., Jannink J.-L., 2015. Optimization of genomic selection training populations with a genetic algorithm. Genet. Sel. Evol. 47: 38.
- Albrecht T., Wimmer V., Auinger H., Erbe M., Knaak C., et al., 2011. Genome-based prediction of testcross values in maize. Theor. Appl. Genet. 123: 339–350.
- Astle W., Balding D., 2009. Population structure and cryptic relatedness in genetic association studies. Stat. Sci. 24: 451–471.
- Bates D., Mächler M., Bolker B., Walker S., 2015. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67: 1–48.
- Bernardo R., Yu J., 2007. Prospects for genomewide selection for quantitative traits in maize. Crop Sci. 47: 1082–1090.
- Bustos-Korts D., Malosetti M., Chapman S., Biddulph B., van Eeuwijk F., 2016. Improvement of predictive ability by uniform coverage of the target genetic space. G3 6: 3733–3747.
- Chen L., Schenkel F., Vinsky M., Crews D. H. Jr., Li C., 2013. Accuracy of predicting genomic breeding values for residual feed intake in Angus and Charolais beef cattle. J. Anim. Sci. 91: 4669–4678.
- Clark S. A., Hickey J. M., Daetwyler H. D., van der Werf J. H. J., 2012. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44: 4.
- Crossa J., Pérez P., Hickey J., Burgueño J., Ornella L., et al., 2014. Genomic prediction in CIMMYT maize and wheat breeding programs. Heredity (Edinb) 112: 48–60.
- Daetwyler H. D., Villanueva B., Woolliams J. A., 2008. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One 3: e3395.
- Daetwyler H. D., Pong-Wong R., Villanueva B., Woolliams J. A., 2010. The impact of genetic architecture on genome-wide evaluation methods. Genetics 185: 1021–1031.
- Daetwyler H. D., Calus M. P. L., Pong-Wong R., de Los Campos G., Hickey J. M., 2013. Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics 193: 347–365.
- Endelman J. B., 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4: 250–255.
- Falconer D. S., Mackay T. F. C., 1996. Introduction to Quantitative Genetics. Longman, Pearson, Essex.
- Giraud H., Lehermeier C., Bauer E., Falque M., Segura V., et al., 2014. Linkage disequilibrium with linkage analysis of multiline crosses reveals different multiallelic QTL for hybrid performance in the flint and dent heterotic groups of maize. Genetics 198: 1717–1734.
- Goddard M., 2009. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136: 245–257.
- Goddard M. E., Hayes B. J., 2007. Genomic selection. J. Anim. Breed. Genet. 124: 323–330.
- Goddard M. E., Hayes B. J., Meuwissen T. H. E., 2011. Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 128: 409–421.
- Habier D., Fernando R. L., Dekkers J. C. M., 2007. The impact of genetic relationship information on genome-assisted breeding values. Genetics 177: 2389–2397.
- Habier D., Tetens J., Seefried F., Lichtner P., Thaller G., 2010. The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genet. Sel. Evol. 42: 5.
- Habier D., Fernando R. L., Garrick D. J., 2013. Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194: 597–607.
- Hayes B. J., Bowman P. J., Chamberlain A. C., Verbyla K., Goddard M. E., 2009. Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet. Sel. Evol. 41: 51.
- He S., Schulthess A. W., Mirdita V., Zhao Y., Korzun V., et al., 2016. Genomic selection in a commercial winter wheat population. Theor. Appl. Genet. 129: 641–651.
- Heffner E. L., Jannink J., Sorrells M. E., 2011. Genomic selection accuracy using multifamily prediction models in a wheat breeding program. Plant Genome 4: 65–75.
- Hickey J. M., Dreisigacker S., Crossa J., Hearne S., Babu R., et al., 2014. Evaluation of genomic selection training population designs and genotyping strategies in plant breeding programs using simulation. Crop Sci. 54: 1476–1488.
- Jacobson A., Lian L., Zhong S., Bernardo R., 2014. General combining ability model for genomewide selection in a biparental cross. Crop Sci. 54: 895–905.
- Jannink J.-L., Lorenz A. J., Iwata H., 2010. Genomic selection in plant breeding: from theory to practice. Brief. Funct. Genomics 9: 166–177.
- Karoui S., Carabaño M. J., Díaz C., Legarra A., 2012. Joint genomic evaluation of French dairy cattle breeds using multiple-trait models. Genet. Sel. Evol. 44: 39.
- Kemper K. E., Reich C. M., Bowman P. J., Vander Jagt C. J., Chamberlain A. J., et al., 2015. Improved precision of QTL mapping using a nonlinear Bayesian method in a multi-breed population leads to greater accuracy of across-breed genomic predictions. Genet. Sel. Evol. 47: 29.
- Lehermeier C., Krämer N., Bauer E., Bauland C., Camisan C., et al., 2014. Usefulness of multi-parental populations of maize (Zea mays L.) for genome-based prediction. Genetics 198: 3–16.
- Lehermeier C., Schön C.-C., de los Campos G., 2015. Assessment of genetic heterogeneity in structured plant populations using multivariate whole-genome regression models. Genetics 201: 323–337.
- Lian L., Jacobson A., Zhong S., Bernardo R., 2014. Genomewide prediction accuracy within 969 maize biparental populations. Crop Sci. 54: 1514–1522.
- Lin Z., Hayes B. J., Daetwyler H. D., 2014. Genomic selection in crops, trees and forages: a review. Crop Pasture Sci. 65: 1177–1191.
- Lorenz A. J., 2013. Resource allocation for maximizing prediction accuracy and genetic gain of genomic selection in plant breeding: a simulation experiment. G3 3: 481–491.
- Lorenz A. J., Smith K. P., 2015. Adding genetically distant individuals to training populations reduces genomic prediction accuracy in barley. Crop Sci. 55: 2657–2667.
- Marulanda J. J., Melchinger A. E., Würschum T., 2015. Genomic selection in biparental populations: assessment of parameters for optimum estimation set design. Plant Breed. 134: 623–630.
- Matukumalli L. K., Lawley C. T., Schnabel R. D., Taylor J. F., Allan M. F., et al., 2009. Development and characterization of a high density SNP genotyping assay for cattle. PLoS One 4: e5350.
- Melchinger A. E., 1987. Expectation of means and variances of testcrosses produced from F2 and backcross individuals and their selfed progenies. Heredity (Edinb) 59: 105–115.
- Melchinger A. E., Schopp P., Müller D., Schrag T. A., Bauer E., et al., 2017. Safeguarding our genetic resources with libraries of doubled-haploid lines. Genetics 206: 1611–1619.
- Meuwissen T., Goddard M., 2010. Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics 185: 623–631.
- Meuwissen T. H. E., Hayes B. J., Goddard M. E., 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
- Mohammadi M., Tiede T., Smith K. P., 2015. PopVar: a genome-wide procedure for predicting genetic variance and correlated response in biparental breeding populations. Crop Sci. 55: 2068–2077.
- Mood A., 1974. Introduction to the Theory of Statistics. McGraw-Hill Education, Europe.
- Mrode R. A., 2005. Linear Models for the Prediction of Animal Breeding Values. CABI, Wallingford, Oxfordshire, UK.
- Müller D., Broman K. W., 2017. Meiosis: simulation of meiosis in plant breeding research. R package version 1.0.0. Available at: https://github.com/DominikMueller64/Meiosis.
- Pérez-Enciso M., Rincón J. C., Legarra A., 2015. Sequence- vs. chip-assisted genomic selection: accurate biological information is advised. Genet. Sel. Evol. 47: 43.
- R Core Team, 2017. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: https://www.r-project.org/.
- Riedelsheimer C., Melchinger A. E., 2013. Optimizing the allocation of resources for genomic selection in one breeding cycle. Theor. Appl. Genet. 126: 2835–2848.
- Riedelsheimer C., Endelman J. B., Stange M., Sorrells M. E., Jannink J. L., et al., 2013. Genomic predictability of interconnected biparental maize populations. Genetics 194: 493–503.
- Rincent R., Laloë D., Nicolas S., Altmann T., Brunel D., et al., 2012. Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192: 715–728.
- Schopp P., Müller D., Technow F., Melchinger A. E., 2017. Accuracy of genomic prediction in synthetic populations depending on the number of parents, relatedness and ancestral linkage disequilibrium. Genetics 205: 1–14.
- Sneath P., Sokal R., 1973. Numerical Taxonomy: The Principles and Practice of Numerical Classification. Freeman, San Francisco, CA.
- Strandén I., Christensen O. F., 2011. Allele coding in genomic evaluation. Genet. Sel. Evol. 43: 25.
- VanRaden P. M., 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.
- Wientjes Y. C. J., Veerkamp R. F., Calus M. P. L., 2013. The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics 193: 621–631.
- Wientjes Y. C. J., Veerkamp R. F., Bijma P., Bovenhuis H., Schrooten C., et al., 2015. Empirical and deterministic accuracies of across-population genomic prediction. Genet. Sel. Evol. 47: 5.
- Wientjes Y. C. J., Bijma P., Veerkamp R. F., Calus M. P. L., 2016. An equation to predict the accuracy of genomic values by combining data from multiple traits, populations, or environments. Genetics 202: 799–823.
- Yang J., Benyamin B., McEvoy B. P., Gordon S., Henders A. K., et al., 2010. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42: 565–569.
Data Availability Statement
Genotypic data of the ancestral populations are available in File S2. All R packages used for simulating the data are publicly available. All simulation steps and equations are fully described within the manuscript.




