Abstract
Synthetics play an important role in quantitative genetic research and plant breeding, but few studies have investigated the application of genomic prediction (GP) to these populations. Synthetics are generated by intermating a small number of parents ($N_P$) and thereby possess unique genetic properties, which make them especially suited for systematic investigations of factors contributing to the accuracy of GP. We generated synthetics in silico from 2 to 32 maize (Zea mays L.) lines taken from an ancestral population with either short- or long-range linkage disequilibrium (LD). In eight scenarios differing in relatedness of the training and prediction sets and in the types of data used to calculate the relationship matrix (QTL, SNPs, tag markers, and pedigree), we investigated the prediction accuracy (PA) of genomic best linear unbiased prediction (GBLUP) and analyzed contributions from pedigree relationships captured by SNP markers, as well as from cosegregation and ancestral LD between QTL and SNPs. The effects of training set size ($N_{TS}$) and marker density were also studied. Sampling few parents generates substantial sample LD that carries over into synthetics through cosegregation of alleles at linked loci. For fixed $N_{TS}$, $N_P$ influences PA most strongly. If the training and prediction set are related, using few parents yields high PA regardless of ancestral LD because SNPs capture pedigree relationships and Mendelian sampling through cosegregation. As $N_P$ increases, ancestral LD contributes more information, while other factors contribute less due to lower frequencies of closely related individuals. For unrelated prediction sets, only ancestral LD contributes information, and accuracies were poor and highly variable for small $N_P$ due to large sample LD. For large $N_P$, achieving moderate accuracy requires large $N_{TS}$, long-range ancestral LD, and high marker density.
Our approach for analyzing PA in synthetics provides new insights into the prospects of GP for many types of source populations encountered in plant breeding.
Keywords: genomic prediction, synthetic populations, GBLUP, genetic relationships, linkage disequilibrium, GenPred, Shared data resource, genomic selection
Synthetic populations, known as synthetics, have played an important role in quantitative genetic research on gene action in complex heterotic traits and comparison of selection methods (cf. Hallauer et al. 2010). In many crops, synthetics also serve as cultivars in agricultural production or as source populations for recurrent selection programs (cf. Bradshaw 2016). Synthetics are usually created by crossing a small number of parents ($N_P$) and subsequently cross-pollinating the F1 individuals for one or several generations (Falconer and Mackay 1996). A prominent example is the “Iowa Stiff Stalk Synthetic” (BSSS) generated from 16 parents of maize, from which numerous successful elite inbred lines such as B73 have been derived (Hagdorn et al. 2003). Further examples of synthetics include composite crosses (Suneson 1956) and multi-parental advanced generation intercross (MAGIC, see Supplemental Material, Table S1 for list of abbreviations) populations (Cavanagh et al. 2008) advocated for breeding purposes in crops (Bandillo et al. 2013). Importantly, two-way and four-way crosses, widely employed as source material in recycling breeding (Mikel and Dudley 2006), can be viewed as special cases of synthetics when $N_P = 2$ and $N_P = 4$, respectively.
Genomic prediction (GP), proposed by Meuwissen et al. (2001), led to a paradigm shift in animal breeding during the past decade (Hayes et al. 2009a; de Koning 2016), and has also been widely adopted in plant breeding (Lin et al. 2014). In cattle breeding, GP is predominantly applied within closed breeds and training sets (TS) commonly encompass thousands of individuals. By comparison, in plant breeding the TS sizes are much smaller (e.g., hundreds of individuals or fewer) and populations are usually structured into multiple segregating families or subpopulations. Numerous studies addressed the implementation of GP in structured plant breeding populations (cf. Lorenzana and Bernardo 2009; Albrecht et al. 2011; Lehermeier et al. 2014; Technow and Totir 2015), but systematic investigations on the prospects of GP in synthetics are lacking so far, although they were proposed as particularly suitable source material for recurrent genomic selection (Windhausen et al. 2012; Gorjanc et al. 2016).
Genomic best linear unbiased prediction (GBLUP), a modification of the traditional pedigree BLUP devised by Henderson (1984), is a widely used method to implement GP in animal and plant breeding (Mackay et al. 2015). Here, the pedigree relationship matrix is replaced by a marker-derived genomic relationship matrix to estimate actual relationships at QTL (Hayes et al. 2009c). The success of this approach depends on three sources of information, namely (i) pedigree relationships captured by markers, (ii) cosegregation of QTL and markers, and (iii) population-wide linkage disequilibrium between QTL and markers (Habier et al. 2007, 2013; Wientjes et al. 2013).
In classical quantitative genetics, pedigree relationships between individuals are calculated as twice the probability of identity-by-descent (IBD) of alleles at a locus, conditional on their pedigree (Wright 1922; Falconer and Mackay 1996). However, actual IBD relationships at QTL deviate from pedigree relationships—which correspond to expected IBD relationships—due to Mendelian sampling (Hill and Weir 2011). In GP, pedigree relationships are captured best with a large number of stochastically independent markers (Habier et al. 2007), whereas capturing the Mendelian sampling term requires cosegregation of QTL and markers (Hayes et al. 2009c; Habier et al. 2013).
In pedigree analysis, the founders of the pedigree are, by definition, assumed to be unrelated (i.e., IBD = 0), but in reality, there usually exist latent similarities at QTL contributing to variation in identity-by-state (IBS) relationships among these individuals. Markers enable the capture of these IBS relationships if they are in population-wide linkage disequilibrium (LD) with the QTL in an ancestral population of founders. Thus, ancestral LD between QTL and markers also provides information between individuals that are unrelated by pedigree to the TS (Habier et al. 2013; Wientjes et al. 2013). Ancestral LD generally results from various population-historic processes like mutation, drift, and selection (Flint-Garcia et al. 2003) and varies within species primarily due to different bottlenecks imposed by artificial selection or population admixture (Hill 1981; Hartl and Clark 2007). The influence of different levels of ancestral LD on prediction accuracy (PA) in synthetics and related types of populations has so far received little attention.
The contributions of the three sources of information to PA were demonstrated in theory and simulations by Habier et al. (2013) using half-sib families in cattle breeding and multiple biparental (full-sib) families (BF) in maize breeding, where both of these examples consisted of numerous families derived from a large number of parents. However, it is unclear whether these results generalize to other breeding situations, particularly those involving only few parents. In such situations, diverse relationship patterns are generated, new statistical associations between loci arise due to sampling, and ancestral LD might be only partially present in the progeny. These factors are expected to profoundly affect the contributions of the three sources of information to PA and, thus, affect the application of GP on related and unrelated genotypes. Synthetics represent an ideal framework for examining the influence of these factors on PA, because the number of parents used for producing them can be varied over a wide range. Here, we simulated two ancestral populations differing substantially in their LD and analyzed synthetics generated from different numbers of parents under eight scenarios that enabled dissection of the factors contributing to PA.
The objectives of this study were to: (i) examine how PA in synthetics depends on the number of parents and LD in the ancestral population, (ii) assess the importance of the three sources of information for PA and how they are influenced by training set size and marker density, and (iii) analyze the relationship of LD between QTL and markers among the ancestral population, parents, and the synthetics generated from them. Finally, we discuss how our approach provides a general framework for analyzing the factors influencing PA and we draw inferences on the prospects of GP in other scenarios encountered in breeding.
Methods
Genome properties and genetic map
We used maize (Zea mays L.) as a model species in our study. Physical map positions of the 56K Illumina maize SNP BeadChip were used to account for the markedly reduced recombination rate and lower marker density in the centromere regions (McMullen et al. 2009). These positions were converted into genetic map positions required for simulating meiosis events (File S1). In total, we obtained 37,286 SNPs distributed over the 10 chromosomes of length 276, 200, 193, 188, 221, 171, 203, 173, 151, and 137 cM (1913 cM in total), corresponding to an average marker density of 19.5 SNPs cM−1. All subsequent meiosis events were simulated using the count-location model without crossover interference, where the number of chiasmata was drawn from a Poisson distribution with parameter equal to the chromosome length in Morgans, and where crossover positions were sampled from a uniform distribution across the chromosome.
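The count-location model described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulation code used in the study: the function names (`sample_poisson`, `crossover_positions`, `make_gamete`), the example marker positions, and the random choice of the starting strand are all hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def crossover_positions(length_cm, rng):
    """Count-location model: Poisson number of crossovers, uniform positions."""
    n_crossovers = sample_poisson(length_cm / 100.0, rng)  # length in Morgans
    return sorted(rng.uniform(0.0, length_cm) for _ in range(n_crossovers))


def make_gamete(hap_a, hap_b, map_cm, length_cm, rng):
    """Recombine two haplotypes of one chromosome; map_cm must be ascending."""
    breaks = crossover_positions(length_cm, rng)
    strand = rng.randrange(2)  # assumption: starting strand chosen at random
    gamete, next_break = [], 0
    for allele_a, allele_b, position in zip(hap_a, hap_b, map_cm):
        # Switch strands once for every crossover passed since the last locus.
        while next_break < len(breaks) and breaks[next_break] < position:
            strand = 1 - strand
            next_break += 1
        gamete.append(allele_b if strand else allele_a)
    return gamete


# Hypothetical example: six loci on a 276 cM chromosome.
rng = random.Random(2017)
print(make_gamete([0] * 6, [1] * 6, [5, 60, 120, 180, 240, 270], 276, rng))
```

Because crossover positions are drawn independently of each other, the sketch has no crossover interference, which matches the model named in the text.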
Simulation of ancestral populations
Two ancestral populations that differed substantially in their level and decay of LD (LDA, Figure 1) were simulated with the software QMSim (Sargolzaei and Schenkel 2009). Ancestral population LR displayed extensive long-range LDA, whereas SR displayed only short-range LDA. The simulation of LR was carried out by closely following Habier et al. (2013) and involved the following steps (Figure S1). First, we generated an initial population of 1500 diploid individuals by sampling alleles at each (biallelic) locus independently from a Bernoulli distribution with probability 0.5. Second, 5000 loci were randomly sampled from all SNPs and henceforth interpreted as QTL; all remaining loci were considered as SNP markers. Third, these individuals were randomly mated for 3000 generations using a constant population size of 1500 and a mutation rate of 2.5 × 10−5. Fourth, a severe bottleneck was introduced by reducing population size to 30 randomly chosen individuals, followed by 15 more generations of random mating to generate extensive long-range LDA. Fifth, we conducted three more generations of random mating with a population size of 10,000 individuals to eliminate close pedigree relationships in the ancestral population LR. To produce SR, we randomly mated LR for 100 more generations at a population size of 10,000 individuals to remove long-range LDA. Thus, LR and SR strongly differed in their LDA structure, but only marginally in their allele frequencies (Table S2). In the last generation of both SR and LR, a single gamete was randomly sampled per individual and treated as a completely homozygous doubled haploid line. These 10,000 lines represented the final ancestral population used for production of the synthetics. All lines were considered unrelated when calculating pedigree relationships among their progeny.
Figure 1.
Linkage disequilibrium (LDA) between pairs of loci plotted against their genetic map distance Δ in cM, for the two ancestral populations SR (short-range LD) and LR (long-range LD). The two vertical lines represent the average distance between QTL and its closest nearby SNP for the two marker densities investigated in our study.
Simulation of synthetics
We generated synthetics differing in $N_P$ by sampling $N_P$ parent lines from the same ancestral population. From these parents, we produced all possible combinations of single crosses (Syn-1 generation, Figure S1), where the number of Syn-1 progenies per cross was chosen to obtain at least 1000 individuals in total. For production of the Syn-2 generation, the Syn-1 individuals were intermated at random, allowing also for selfing. Finally, a single doubled haploid line was derived from each of the 1000 individuals of the Syn-2 generation to obtain the genotypes of the final synthetic. This approach was chosen to avoid additional full-sib relationships among doubled haploid lines that arise when deriving them from the same Syn-2 individual.
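The size of the crossing design follows directly from the number of parents: all possible single crosses among a set of parents number n(n − 1)/2. The sketch below illustrates this and the random intermating step. It assumes equal family sizes rounded up to reach at least 1000 Syn-1 individuals, which the text does not specify; the function names and the list of parent numbers in the loop are hypothetical.

```python
import random
from itertools import combinations
from math import ceil


def syn1_design(n_parents, min_total=1000):
    """All single crosses among the parents and the progeny needed per cross."""
    crosses = list(combinations(range(n_parents), 2))
    per_cross = ceil(min_total / len(crosses))  # assumption: equal family sizes
    return crosses, per_cross


def syn2_matings(n_syn1, n_offspring, rng):
    """Random mating among Syn-1 individuals; equal indices mean selfing."""
    return [(rng.randrange(n_syn1), rng.randrange(n_syn1))
            for _ in range(n_offspring)]


# Illustrative parent numbers within the 2 to 32 range used in the study.
for n_p in (2, 4, 8, 16, 32):
    crosses, per_cross = syn1_design(n_p)
    print(n_p, len(crosses), per_cross, len(crosses) * per_cross)
```

Drawing both mates with replacement is one simple way to allow selfing, as the mating scheme in the text requires.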
Genetic model
For simulating the polygenic target trait, we sampled a subset of 1000 of the 5000 QTL in each simulation replicate. Following Meuwissen et al. (2001), the corresponding QTL effects were drawn from a gamma distribution with scale and shape parameter 0.4 and 1.66, respectively. Signs of QTL effects were sampled from a Bernoulli distribution with probability parameter 0.5.
The vector of true breeding values for all individuals in the synthetic was calculated as $\mathbf{u} = \mathbf{W}\mathbf{b}$, where $\mathbf{W}$ is the matrix of genotypic scores at QTL, coded {2,0} depending on whether an individual was homozygous for the 1 or 0 allele, respectively, and adjusted for twice the frequency of the 1 allele in the ancestral population (cf. Figure S1), and $\mathbf{b}$ is the vector of QTL effects. The corresponding vector of phenotypes was obtained as $\mathbf{y} = \mathbf{u} + \mathbf{e}$ (Goddard et al. 2011; de los Campos et al. 2013; Habier et al. 2013), i.e., assuming a null mean and adding a vector of independent normally distributed environmental noise variables $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_e)$, where the variance $\sigma^2_e$ was chosen to be identical for the two ancestral populations and all choices of $N_P$, assuming that environmental effects affect phenotypes independently of additive genetic variance in the synthetic. The value of $\sigma^2_e$ was therefore set equal to the additive genetic variance in ancestral population LR averaged across 1000 simulation replicates. The heritability of the target trait was then on average equal to 0.5 for LR and SR due to nearly identical allele frequencies at QTL, but lower in the synthetics, because the additive genetic variance in the synthetics was smaller than in the ancestral populations (Table S2). We restricted our simulations to a single level of heritability, because preliminary analyses showed that changing $h^2$ resulted in a relatively constant shift of PA.
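The construction of breeding values and phenotypes can be illustrated with a small sketch. It assumes genotype scores coded 0/2 and centered by twice the ancestral allele frequency, as described in the text; the function names and the toy numbers are hypothetical and serve only to show the arithmetic.

```python
import math
import random


def breeding_values(qtl_scores, anc_freqs, effects):
    """Sum of centered QTL scores (coded 0/2, minus 2p) times QTL effects."""
    return [sum((w - 2.0 * p) * b for w, p, b in zip(row, anc_freqs, effects))
            for row in qtl_scores]


def phenotypes(true_bv, sigma2_e, rng):
    """Add independent normal noise with variance sigma2_e to each value."""
    sd_e = math.sqrt(sigma2_e)
    return [u + rng.gauss(0.0, sd_e) for u in true_bv]


def heritability(sigma2_a, sigma2_e):
    """Narrow-sense heritability for additive and residual variances."""
    return sigma2_a / (sigma2_a + sigma2_e)


# Toy example: two doubled haploid lines, two QTL (values are illustrative).
bv = breeding_values([[2, 0], [0, 2]], [0.5, 0.5], [1.0, -0.5])
y = phenotypes(bv, sigma2_e=1.0, rng=random.Random(1))
print(bv, y)
print(heritability(1.0, 1.0), heritability(0.6, 1.0))
```

With the residual variance held fixed, any drop in additive genetic variance lowers the heritability, which is the effect the text describes for the synthetics.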
Analysis of the sources of information exploited in genomic prediction
We conceived eight scenarios to evaluate to what extent the three sources of information contribute to PA in synthetics (Figure 2), when actual relationships at QTL are estimated by marker-derived genomic relationships. The scenarios can be differentiated by three factors.
Figure 2.
Flowchart of the eight scenarios analyzed in this study. Training set and prediction set were either related (“Re”-scenarios) or unrelated (“Un”-scenarios). The arrows represent the changes made between scenarios, e.g., removal of ancestral linkage disequilibrium (LD) between QTL and SNPs (LDA → LEA) or replacing the relationship matrix (G → Q). The background texture indicates whether identity-by-state (IBS) or identity-by-descent (IBD) information was used. The green circles show the sources of information that contributed to prediction accuracy for the SNP-based scenarios (cf. Habier et al. 2013) where, in addition to LDA, RS refers to pedigree relationships at QTL captured by SNPs and CS refers to cosegregation of QTL and SNPs.
First, individuals in the TS and prediction set (PS) were either related (“Re”-scenarios) or unrelated (“Un”-scenarios), depending on whether the parents of the TS ($P_{TS}$) and of the PS ($P_{PS}$) were identical (i.e., $P_{TS} = P_{PS}$) or disjoint (i.e., $P_{TS} \cap P_{PS} = \emptyset$). For the “Re”-scenarios, we sampled individuals for the TS and PS from the same synthetic, whereas for the “Un”-scenarios, individuals were sampled from two different synthetics produced from disjoint sets $P_{TS}$ and $P_{PS}$, each of size $N_P$. Both sets of parents always originated from the same ancestral population.
Second, pairs of QTL and SNPs were either in LD (“LDA”-scenarios) as found in the ancestral population, or in linkage equilibrium (“LEA”-scenarios). To achieve the latter, we permuted complete QTL haplotypes among the parents (for “Un”-scenarios separately within each of the two parent sets), while keeping their SNP haplotypes unchanged (i.e., conserving their LDA). This procedure eliminates any systematic association between QTL and SNP alleles originating from the ancestral population, but maintains allele frequencies and polymorphic states at QTL, as well as LDA between them. In contrast to previous approaches (cf. Habier et al. 2013), this approach avoids influencing PA by altering actual relationships at QTL. Importantly, after removal of LDA, there is still LD between QTL and SNPs in the parents, but this LD is purely due to the limited sample size and, thus, subsequently referred to as sample LD.
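The permutation step can be sketched as follows. This is an illustration under assumptions, not the authors' code: haplotypes are represented as lists of 0/1 alleles, one list per parent, and the function names are hypothetical.

```python
import random


def permute_qtl_haplotypes(qtl_haps, snp_haps, rng):
    """Reassign whole QTL haplotypes to random parents; SNP haplotypes stay put."""
    order = list(range(len(qtl_haps)))
    rng.shuffle(order)
    return [qtl_haps[i] for i in order], snp_haps


def allele_counts(haps):
    """Count of the 1 allele at each locus across all parents."""
    return [sum(column) for column in zip(*haps)]


# Toy example with four parents, three QTL and two SNPs (values illustrative).
qtl = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]]
snp = [[1, 1], [0, 1], [1, 0], [0, 0]]
new_qtl, new_snp = permute_qtl_haplotypes(qtl, snp, random.Random(7))
print(allele_counts(qtl) == allele_counts(new_qtl))
```

Because entire QTL haplotypes are moved as units, allele frequencies at QTL and the associations among QTL are unchanged, while any systematic QTL-SNP association is broken.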
Third, four different types of data were used to calculate the relationship matrix used in BLUP: (i) For the “SNP”-scenarios, we used SNP genotypes to calculate the marker-derived genomic relationship matrix $\mathbf{G}$ with elements $g_{ij} = \sum_k (x_{ik} - 2p_k)(x_{jk} - 2p_k) / [2\sum_k p_k(1 - p_k)]$ (Habier et al. 2007; VanRaden 2008), where $x_{ik}$ is the genotype of the $i$-th individual at the $k$-th locus coded {2,0} depending on whether this individual was homozygous for the 1 or 0 allele, respectively, and $p_k$ is the frequency of the 1 allele at the $k$-th SNP marker in the ancestral population. (ii) For the “QTL”-scenarios, the QTL genotypes were used to calculate the actual relationship matrix $\mathbf{Q}$ using the same formula. (iii) For the “Ped”-scenario, pedigree records were used to calculate the pedigree relationship matrix $\mathbf{A}$ with elements being equal to expected IBD relationships (i.e., twice the coefficient of coancestry). (iv) For the “Tag”-scenario, tag markers labeling the origin of QTL alleles at each locus from the parents were used to calculate the actual IBD relationship matrix $\mathbf{T}$ with elements being equal to twice the proportion of identical tag marker alleles between each pair of individuals. Tag markers label each QTL allele, regardless of its state, uniquely with a number in the parents and thus, they allow tracking of the segregation process during intermating and identification of the parental origin of each QTL allele in the synthetic.
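Two of the relationship measures described above lend themselves to a short sketch. The first function assumes the first genomic relationship estimator of VanRaden (2008), with genotypes coded 0/2 and allele frequencies taken from a reference population; whether the study used exactly this scaling is an assumption here. The second function follows the verbal definition of tag-marker relationships. Function names and toy data are hypothetical.

```python
def genomic_relationship(genotypes, freqs):
    """Cross-products of centered genotypes scaled by 2 * sum(p * (1 - p))."""
    centered = [[x - 2.0 * p for x, p in zip(row, freqs)] for row in genotypes]
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    return [[sum(a * b for a, b in zip(row_i, row_j)) / scale
             for row_j in centered] for row_i in centered]


def tag_relationship(tags_i, tags_j):
    """Twice the proportion of loci at which two lines carry the same tag."""
    same = sum(1 for a, b in zip(tags_i, tags_j) if a == b)
    return 2.0 * same / len(tags_i)


# Two doubled haploid lines with opposite genotypes at two loci.
print(genomic_relationship([[2, 0], [0, 2]], [0.5, 0.5]))
# Two lines sharing the parental origin at two of four QTL.
print(tag_relationship([1, 1, 2, 2], [1, 2, 2, 3]))
```

The same `genomic_relationship` function applies to SNP or QTL genotypes; only the input data differ, in line with the statement that both matrices use the same formula.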
Scenario Re-LDA-SNP reflects the situation mostly encountered in practical applications of GP and used information from pedigree relationships among individuals in the TS and PS captured by SNPs, as well as from deviations from pedigree relationships due to (i) Mendelian sampling at QTL captured by cosegregation of QTL and SNPs and (ii) ancestral LD between QTL and SNPs. Scenario Re-LDA-Ped used only pedigree relationships, but ignored deviations due to Mendelian sampling, whereas Re-LDA-Tag accounted for both pedigree relationships and Mendelian sampling. Both scenarios ignored actual relationships among parents by assuming unrelated founders and, thus, did not account for alleles that are IBS but not IBD in the synthetic. Scenario Re-LEA-SNP was artificial, with the goal of determining the influence of ancestral LD on PA in scenario Re-LDA-SNP. Scenario Re-LDA-QTL was employed to determine for the “Re”-scenarios the maximum PA achievable with GBLUP (cf. de los Campos et al. 2013), when assuming that each QTL explains an equal proportion of the additive genetic variance. The purpose was, thus, to quantify the reduction in PA for all other “Re”-scenarios when using a different data type to estimate actual relationships.
Scenarios Un-LDA-SNP and Un-LDA-QTL (“Un”-scenarios) represent the conceptual counterparts to Re-LDA-SNP and Re-LDA-QTL (Figure 2). Un-LDA-SNP reflects the practical situation of predicting the genetic merit of individuals unrelated to the TS, whereas Un-LDA-QTL provides the corresponding upper bound of PA. For both scenarios, alleles in the TS and PS had IBD probability equal to zero and, thus, the only remaining source of information contributing to PA in Un-LDA-SNP was ancestral LD between QTL and SNPs to track actual relationships among parents. Scenario Un-LEA-SNP was employed as a negative control scenario to validate the simulation designs. As expected, PA for Un-LEA-SNP fluctuated around zero for all investigated settings (results not shown), confirming that there are only three sources of information contributing to PA when using $\mathbf{G}$.
Analysis of linkage disequilibrium and linkage phase similarity
We calculated LD as the squared correlation coefficient ($r^2$, Hill and Robertson 1968) between all pairs of QTL and SNPs in (i) each ancestral population (LDA), (ii) the set of parents sampled from the ancestral population, and (iii) the synthetic generated from the parents. Furthermore, we computed the linkage phase similarity of QTL-SNP pairs in the TS and PS. Here, we adopted a similar approach to that of de Roos et al. (2008), but replaced the correlation by the cosine similarity
$$\cos(\theta) = \frac{\sum_{i=1}^{n} r_i^{TS}\, r_i^{PS}}{\sqrt{\sum_{i=1}^{n} \left(r_i^{TS}\right)^2}\,\sqrt{\sum_{i=1}^{n} \left(r_i^{PS}\right)^2}}, \tag{1}$$
where $r_i^{TS}$ and $r_i^{PS}$ are the $r$ statistics of a QTL-SNP pair in the TS and PS, respectively, $i$ refers to the index of the QTL-SNP pair, and $n$ is the number of pairs for which linkage phase similarity is calculated. The reason was to account not only for the ranking but also for the absolute size of the $r$ statistics in the two data sets (see File S2 for details). Linkage phase similarity was calculated for all QTL-SNP pairs falling into consecutive bins of 0.5 cM width. LD was first averaged within each bin and, subsequently, both LD and linkage phase similarity statistics were averaged across chromosomes and simulation replicates.
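The difference between a correlation and a cosine similarity can be made concrete with a small sketch. It assumes that the statistics being compared are signed LD correlations per QTL-SNP pair in the two data sets; the function names and the toy values are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Uncentered similarity: dot product divided by the two vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, i.e., cosine similarity after centering."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])


# Toy statistics: same ranking in both sets, but opposite linkage phase.
r_ts = [0.9, 0.8, 0.7]
r_ps = [-0.1, -0.2, -0.3]
print(pearson(r_ts, r_ps), cosine_similarity(r_ts, r_ps))
```

In this toy case the correlation is close to one because the rankings agree, whereas the cosine similarity is negative because the signs differ, which illustrates why an uncentered measure is sensitive to the absolute size and sign of the statistics.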
Genomic prediction
The statistical model used for predicting breeding values can be written as
$$\mathbf{y} = \mathbf{Z}\mathbf{u} + \mathbf{e}, \tag{2}$$
where $\mathbf{Z}$ is the incidence matrix linking phenotypes with breeding values, $\mathbf{u}$ is the vector of random breeding values with mean zero and variance-covariance matrix $\mathrm{var}(\mathbf{u}) = \mathbf{K}\sigma^2_a$, where $\mathbf{K}$ is a relationship matrix, calculated from different data types as described above, and $\sigma^2_a$ is the additive genetic variance in the synthetic. Residuals $\mathbf{e}$ are random with mean zero and $\mathrm{var}(\mathbf{e}) = \mathbf{I}\sigma^2_e$, where $\mathbf{I}$ is an identity matrix and $\sigma^2_e$ is the residual variance. Estimates of variance components $\sigma^2_a$ and $\sigma^2_e$ were obtained by restricted maximum likelihood and estimated breeding values $\hat{\mathbf{u}}$ were predicted using the mixed.solve function from R-package rrBLUP (Endelman 2011). PA was always calculated as the correlation between $\mathbf{u}$ and $\hat{\mathbf{u}}$ for the PS in each simulation replicate.
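For orientation, the standard BLUP solution for a model of this form can be written out explicitly. The notation below is an assumption chosen for this illustration: $\mathbf{y}$ is the vector of phenotypes of the training individuals, $\mathbf{u}$ the breeding values of all individuals in the TS and PS, $\mathbf{Z}$ the incidence matrix, $\mathbf{K}$ the relationship matrix, and $\sigma^2_a$ and $\sigma^2_e$ the additive genetic and residual variances. It is the textbook expression for a zero-mean model with known variance components, not a description of the internals of any particular software.

```latex
% Covariance between breeding values and phenotypes, and phenotypic variance
\mathrm{cov}(\mathbf{u}, \mathbf{y}) = \sigma^2_a \mathbf{K}\mathbf{Z}^{\top},
\qquad
\mathrm{var}(\mathbf{y}) = \sigma^2_a \mathbf{Z}\mathbf{K}\mathbf{Z}^{\top} + \sigma^2_e \mathbf{I}

% Best linear unbiased prediction of all breeding values (TS and PS)
\hat{\mathbf{u}} = \sigma^2_a \mathbf{K}\mathbf{Z}^{\top}
\left( \sigma^2_a \mathbf{Z}\mathbf{K}\mathbf{Z}^{\top} + \sigma^2_e \mathbf{I} \right)^{-1} \mathbf{y}
```

Individuals in the PS have no phenotype, so their predictions depend entirely on the off-diagonal blocks of $\mathbf{K}$ that connect them to the TS; this is why the choice of relationship matrix determines PA. If an intercept is fitted, $\mathbf{y}$ is replaced by $\mathbf{y} - \mathbf{1}\hat{\mu}$ in the last expression.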
Following previous studies (Goddard et al. 2011; de los Campos et al. 2013), we also investigated how well estimated relationships (i.e., SNP-derived, pedigree, and tag-marker relationships) between individuals $i$ and $j$ in the TS and PS reflect the corresponding actual relationships at QTL. Therefore, we calculated the coefficient of determination of the regression of actual on estimated relationships in each simulation replicate and all scenarios (except for Re-LDA-QTL and Un-LDA-QTL, where the actual relationships at QTL were used directly).
To assess the effect of TS size ($N_{TS}$) on PA, we sampled 125, 250, 500, or 750 individuals from the 1000 lines of the synthetic, where 250 was used as default when another factor (e.g., marker density) was varied. For the PS, we always sampled $N_{PS}$ = 100 individuals from (i) the remaining individuals that were not part of the TS in the “Re”-scenarios or (ii) the second synthetic in the “Un”-scenarios. For all “SNP”-scenarios, the effect of marker density on PA was evaluated for two values of 5 and 0.25 SNPs cM−1, the former being used as default. The number of randomly sampled SNPs per chromosome in each simulation replicate was proportional to the respective chromosome length. The two marker densities of 5 and 0.25 SNPs cM−1 resulted in an average genetic map distance between each QTL and its closest nearby SNP of 0.18 and 2.02 cM, respectively (Figure 1).
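The proportional allocation of SNPs to chromosomes can be sketched as below. The chromosome lengths are those given in the Methods; rounding to whole SNPs and the helper names are assumptions made for this illustration, as is the number of available SNPs in the usage line.

```python
import random

# Chromosome lengths in cM as listed in the Methods.
CHROM_CM = [276, 200, 193, 188, 221, 171, 203, 173, 151, 137]


def snps_per_chromosome(density_per_cm):
    """Number of SNPs per chromosome, proportional to chromosome length."""
    # Assumption: fractional counts are rounded to the nearest whole SNP.
    return [round(density_per_cm * length) for length in CHROM_CM]


def sample_snps(n_available, n_wanted, rng):
    """Indices of randomly chosen SNPs on one chromosome, without replacement."""
    return sorted(rng.sample(range(n_available), n_wanted))


for density in (5, 0.25):
    counts = snps_per_chromosome(density)
    print(density, counts, sum(counts))

# Hypothetical usage: 3000 available SNPs on the first chromosome.
chosen = sample_snps(3000, snps_per_chromosome(0.25)[0], random.Random(0))
print(len(chosen))
```

Sampling a fresh random subset in every replicate, as the text describes, averages out the effect of any particular marker placement.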
All reported results are arithmetic means over 1000 simulation replicates, which were stochastically independent conditional on the ancestral populations. A simulation replicate comprises (i) random sampling of 1000 QTL from the 5000 initial QTL and sampling of QTL effects, (ii) sampling of the parents from the ancestral population and, in the case of the “LEA”-scenarios, additionally permuting QTL haplotypes, (iii) creation of synthetics from each set of parents, (iv) sampling of the individuals for the TS and PS, (v) sampling of the noise variable and calculation of the breeding and phenotypic values, and (vi) training of the prediction equation and calculation of estimated breeding values as well as PA in the PS (Figure S1). All computations were carried out in the R statistical environment (R Core Team 2012).
Data availability
Figure S1 provides a detailed overview of the entire simulation scheme and the assumptions underlying all results presented herein.
Results
Linkage disequilibrium in the ancestral populations
For ancestral population LR, LDA showed a steep decline over short genetic map distances $\Delta$ and approached an asymptote of about 0.08 for larger $\Delta$ (Figure 1), reflecting the presence of long-range LDA. By comparison, LDA in ancestral population SR started at slightly smaller values for closely linked loci and showed a similar initial decline. It then leveled off, almost reaching its asymptotic value of zero due to absence of long-range LDA resulting from the 100 additional generations of random mating.
Linkage disequilibrium in the parents and the synthetics
Figure 3A shows the distribution of LD between QTL-SNP pairs in the parents, measured as $r^2$, as a function of $\Delta$. LD in the parents takes on only a limited number of values in the interval [0,1], because only a finite number of genotype configurations is possible for two biallelic loci, which depends exclusively on $N_P$. For $N_P = 2$, all LD values are equal to 1. For slightly larger $N_P$, only a few discrete LD values are possible, whereas for large $N_P$, more than 100 values are possible, resulting in a nearly continuous distribution of LD values in the parents. Under LEA (i.e., ancestral linkage equilibrium due to permutation of QTL haplotypes), the frequency of LD values in the parents was thus almost independent of $\Delta$, except for some small residual deviations due to similarity of ancestral allele frequencies at closely linked loci (see File S4 for details). Under LEA, the high frequencies of pairs of loci in high LD for small $N_P$ demonstrate the magnitude of sample LD (Figure 3A, left column). If, additionally, ancestral LD was present, large parental LD values occurred more frequently for tightly linked loci ($\Delta < 1$ cM) for both ancestral populations. Under short-range LDA in SR, the frequencies of high parental LD values were almost identical to those found under LEA for $\Delta > 1$ cM, regardless of $N_P$. Conversely, under long-range LDA in LR, the frequency of high parental LD values was also considerably elevated for $\Delta > 1$ cM. Altogether, the distribution of LD values in the parents was much more strongly influenced by $N_P$ than by ancestral LD. The proportion of QTL-SNP pairs in high LD diminished as $N_P$ increased, but grew when shifting from short- to long-range LDA (Figure 3A, SR vs. LR).
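The discreteness of LD values in a small set of parents can be verified by brute force. The sketch below enumerates every pair of allele configurations at two biallelic loci among fully homozygous parents and collects the resulting squared correlations as exact fractions. It assumes that both loci are polymorphic among the parents; the function name is hypothetical, and the enumeration is only practical for small numbers of parents.

```python
from fractions import Fraction
from itertools import product


def possible_r2(n_parents):
    """All squared correlations attainable between two polymorphic loci."""
    # Each homozygous parent carries allele 0 or 1; drop monomorphic loci.
    configs = [c for c in product((0, 1), repeat=n_parents)
               if 0 < sum(c) < n_parents]
    values = set()
    for locus_a in configs:
        for locus_b in configs:
            p_a = Fraction(sum(locus_a), n_parents)
            p_b = Fraction(sum(locus_b), n_parents)
            p_ab = Fraction(sum(x * y for x, y in zip(locus_a, locus_b)),
                            n_parents)
            d = p_ab - p_a * p_b  # disequilibrium coefficient
            values.add(d * d / (p_a * (1 - p_a) * p_b * (1 - p_b)))
    return sorted(values)


for n_p in (2, 3, 4):
    print(n_p, [str(v) for v in possible_r2(n_p)])
```

With two parents the only attainable value is 1, and the set of attainable values grows quickly as parents are added, which is consistent with the nearly continuous distribution described for larger numbers of parents.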
Figure 3.
(A) Frequency of QTL-SNP pairs falling into eight disjoint intervals of linkage disequilibrium (LD) in the parents of synthetics, plotted against their genetic map distance Δ, for three different numbers of parents NP. (B) Average LD between QTL-SNP pairs, plotted against their genetic map distance Δ, for synthetics generated from different NP. The mean LD in the respective ancestral population (LDA) is shown for comparison (red graphs). The left column in (A and B) refers to scenarios Re-LEA-SNP and Un-LEA-SNP (independent of the ancestral population), where ancestral LD between QTL and SNPs was eliminated, whereas the other two columns correspond to all other scenarios, for the ancestral populations SR (short-range LD) and LR (long-range LD), respectively.
Figure 3B shows the average LD between QTL-SNP pairs in synthetics as a function of $N_P$ and $\Delta$. The level of the LD curve dropped rapidly as $N_P$ increased from two to eight and approached the curve of ancestral LD. Under LEA, LD in synthetics was still substantial for small $N_P$ due to sample LD, yet successively approached zero as $N_P$ was increased further. For larger $N_P$, the presence of ancestral LD resulted in elevated LD in the synthetics, where the increment was large between tightly linked QTL-SNP pairs ($\Delta < 1$ cM) for both ancestral populations and moderate between loosely linked loci ($\Delta > 1$ cM) for LR.
Linkage phase similarity between training and prediction set
For scenario Re-LDA-SNP (related TS and PS), linkage phase similarity between TS and PS exceeded 0.8 over a range of genetic map distances $\Delta$, regardless of the ancestral population (Figure 4). By comparison, values were much lower for Un-LDA-SNP (unrelated TS and PS). Increasing $N_P$ reduced linkage phase similarity only marginally for Re-LDA-SNP even for $\Delta = 20$ cM, but resulted in a substantial increase for Un-LDA-SNP. The higher ancestral LD in LR resulted only in a minor increase in linkage phase similarity in Re-LDA-SNP, but in a large increase for Un-LDA-SNP, irrespective of $N_P$. Since permuting QTL haplotypes eliminated ancestral LD in scenario Re-LEA-SNP, linkage phase similarity was identical for SR and LR and showed similar results as Re-LDA-SNP for SR (results not shown).
Figure 4.
Linkage phase similarity of QTL-SNP pairs in the training set (TS) and prediction set (PS) for scenarios Re-LDA-SNP and Un-LDA-SNP, plotted against the number of parents NP used to generate synthetics, for the two ancestral populations SR (short-range LD) and LR (long-range LD), and for different genetic map distances Δ (0.5, 5 and 20 ± 0.5 cM) between QTL and SNPs.
Influence of ancestral linkage disequilibrium and number of parents on PA
PA declined for all “Re”-scenarios (except Re-LDA-Ped), but increased for all “Un”-scenarios with an increasing number of parents (Figure 5), where the strongest changes occurred at small $N_P$ for all scenarios. The highest PA was always achieved by scenario Re-LDA-QTL, closely followed by Re-LDA-SNP for small $N_P$, with an increasing difference for larger $N_P$. PA increased when shifting from low (SR) to high (LR) ancestral LD for scenario Re-LDA-SNP, but decreased for Re-LEA-SNP. For scenario Re-LDA-Tag, PA was always intermediate between Re-LDA-SNP and Re-LEA-SNP. For Re-LDA-Ped, PA increased concavely with $N_P$ up to its maximum value of 0.4, followed by a minor decrease. Re-LDA-Ped and Re-LEA-SNP approached identical PA for large $N_P$ under long-range LDA in LR, whereas Re-LEA-SNP retained superior PA under short-range LDA in SR. For Un-LDA-QTL, PA strongly increased for both ancestral populations, especially at small $N_P$, followed by a moderate increase. For Un-LDA-SNP, the overall level of PA was much lower, but showed an increasing curvature that was similar to Un-LDA-QTL for long-range LDA, whereas under short-range LDA, PA was almost consistently < 0.2 without sizeable increase for all values of $N_P$.
Figure 5.
Prediction accuracy for seven scenarios (scenario Un-LEA-SNP not shown), plotted against the number of parents NP used to generate synthetics, for the two ancestral populations SR (short-range LD) and LR (long-range LD). Results refer to a training set size of NTS = 250 doubled haploid lines and a marker density of 5 SNPs cM−1.
Influence of training set size and marker density on PA
Increasing TS size ($N_{TS}$) from 125 to 750 individuals was overall most beneficial for all “Re”-scenarios, except for Re-LDA-Ped (Figure S3). Conversely, Re-LDA-Ped, as well as Un-LDA-SNP under short-range LDA, showed only a minor increase in PA for larger $N_{TS}$. However, for Un-LDA-SNP under long-range LDA and for Un-LDA-QTL under both short- and long-range LDA, the increase in PA along with $N_{TS}$ was notable.
Reducing the marker density from 5 SNPs cM−1 to 0.25 SNPs cM−1 resulted in a substantial reduction of PA for all “SNP”-scenarios (Figure S4). This reduction was reinforced for scenarios utilizing ancestral LD (Re-LDA-SNP and Un-LDA-SNP), especially in the presence of long-range LDA and for large values of $N_P$.
Discussion
In plant breeding, GP has been applied to various types of populations such as single or multiple biparental families (BF) or diversity panels of inbred lines. These materials differ fundamentally in their pedigree structure, the number of founder individuals involved in their development, as well as the LD in the ancestral population from which they were taken. Synthetics are especially suited for systematically assessing the influence of these factors on PA, because the variable number of parents used for generating synthetics leads to (i) different pedigree relationships among individuals and (ii) a trade-off between ancestral LD and sample LD arising in the parents. Thus, our approach provides new insights into how these factors influence the ability of molecular markers to capture actual relationships at causal loci, which determines the accuracy in various applications of GP.
Influence of the number of parents and ancestral LD on actual relationships at causal loci and PA
The accuracy of GP relies on (i) the distribution of actual relationships at causal loci (QTL) between individuals in the TS and PS and (ii) the quality of the approximation of these actual relationships by marker-derived genomic relationships (Goddard et al. 2011; Habier et al. 2013). We first investigated PA using the actual relationship matrix, which provides an upper bound of PA given fixed values for , , and (de los Campos et al. 2013). Subsequently, we estimated the actual relationship matrix by the marker-derived genomic relationship matrix and inferred how the three sources of information contributed to PA.
Actual relationships qij between two individuals i and j can be factorized into three additive components:
$$q_{ij} = f_{ij} + (\text{deviation due to Mendelian sampling}) + (\text{deviation of IBS from IBD}) \tag{3}$$
where fij is their expected IBD relationship at QTL (the pedigree relationship), the second term is the deviation of the actual from the expected IBD relationship due to Mendelian sampling, and the third term is the deviation of the actual (IBS) relationship from the actual IBD relationship. Whereas the first two terms provide information solely with respect to the parents (i.e., the founders of the pedigree), the third term accounts also for actual relationships among the parents (Powell et al. 2010; Vela-Avitúa et al. 2015).
If TS and PS are related (“Re”-scenarios), the distribution of fij depends on NP (Figure 6A) and on the mating scheme employed for production of the synthetic (Figure S1). For small NP, this distribution is dominated by full-sib and half-sib relationships, whereas distantly related and unrelated individuals dominate for larger NP. The closer the pedigree relationship between two individuals, the longer the chromosome segments they inherit from common ancestors and the larger the conditional variance in actual IBD relationships caused by Mendelian sampling (Figure 6B, Goddard et al. 2011; Hill and Weir 2011). In other words, this variance is inversely proportional to the number of independently segregating chromosome segments and, hence, the length and number of chromosomes must be taken into account when transferring our results to other species. For example, in bread wheat (2n = 42), this variance, and consequently the PA attributable to the Mendelian sampling term, is expected to be smaller than in maize (2n = 20).
Figure 6.
(A) Frequency of the seven possible values fij of pedigree relationships for different numbers of unrelated inbred parents NP used to generate synthetics. (B) Conditional distributions of actual relationships qij conditional on their pedigree relationship fij between individuals i and j in the training set and prediction set, respectively, for the two ancestral populations SR (short-range LD) and LR (long-range LD).
The contribution of the deviation of IBS from IBD relationships to qij depends on the level of ancestral LD. Elevated LDA increases the variance of actual relationships in the ancestral population (Figure S2, LR vs. SR) and in turn increases the variation in similarity of haplotypes among parents sampled therefrom (Habier et al. 2013). Consequently, the variance of actual relationships in synthetics increases with ancestral LD (Figure 6B and Figure S2), on top of the variance caused by Mendelian sampling. Assuming known actual relationships and fixed TS size, PA therefore decreases if (i) NP increases and (ii) ancestral LD decreases (Figure 5). This is because both factors reduce the absolute frequency of close actual relationships among TS and PS (Figure S5). If actual relationships among the parents were not accounted for, the decline in PA was reinforced as NP increased (Figure 5, scenario Re-LDA-Tag). The reason for this follows from the factorization (Equation 3): the larger NP, the more frequent are pairs of individuals with small or zero pedigree relationship (Figure 6A) and the more important it is to account for actual relationships among parents. Conversely, PA was consistently higher for small NP due to strong pedigree relationships and Mendelian sampling, despite the accompanying negative effects of reduced heritability in the TS and reduced additive genetic variance in the PS on PA (Table S2).
Restricting predictive information to pedigree relationships (scenario Re-LDA-Ped) resulted in only moderate PA, except for NP = 2 (Figure 5). In this case, all individuals in the TS and PS were full-sibs (Figure 6A), which resulted in identical estimated breeding values by pedigree BLUP, so that PA could not be calculated (indicated as PA = 0 in Figure 5). For NP > 2, there was variation in pedigree relationships in synthetics (Figure 6A) and thus PA > 0. Further research is warranted on the importance of variation in pedigree relationships for GP in the presence of Mendelian sampling and ancestral LD, e.g., by considering mating schemes such as MAGIC, which reduce or even entirely avoid variation in pedigree relationships.
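Why identical pedigree relationships leave nothing to rank can be illustrated with a minimal sketch. It assumes a simplified pedigree BLUP in which the mean is estimated by the simple average and predictions are computed as the cross-relationship matrix times the solution of the training equations. The function names (`solve`, `pedigree_blup`), the relationship value `f = 0.5`, the variance ratio `lam`, and the phenotypes are hypothetical placeholders, not values from the study.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination (no pivoting;
    adequate here because the matrix is symmetric positive definite)."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(a[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (a[row][n] - known) / a[row][row]
    return x


def pedigree_blup(y_train, rel_train, rel_cross, lam):
    """Predict breeding values as rel_cross @ (rel_train + lam*I)^-1 @ (y - mean)."""
    n = len(y_train)
    mean = sum(y_train) / n  # simplification: mean taken as the plain average
    lhs = [[rel_train[i][j] + (lam if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    weights = solve(lhs, [y - mean for y in y_train])
    return [sum(r * w for r, w in zip(row, weights)) for row in rel_cross]


# Hypothetical full-sib family: every pair shares the same relationship f.
f = 0.5  # placeholder value, not taken from the study
y_train = [1.2, -0.4, 0.7, 2.1, -1.3]  # made-up phenotypes
n_train, n_pred = len(y_train), 3
rel_train = [[1.0 if i == j else f for j in range(n_train)] for i in range(n_train)]
rel_cross = [[f] * n_train for _ in range(n_pred)]  # identical rows

predictions = pedigree_blup(y_train, rel_train, rel_cross, lam=1.0)
spread = max(predictions) - min(predictions)
print(predictions, spread)
```

Because every row of the cross-relationship matrix is the same, every prediction candidate receives the same value, so a correlation between predicted and true breeding values is undefined.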
If the TS and PS are unrelated (“Un”-scenarios), only the deviation of IBS from IBD relationships contributes to variation in qij, because the pedigree relationship fij and the Mendelian sampling deviation are equal to zero. Moreover, if NP is small, QTL in the TS and PS can (i) be fixed for different alleles (Table S2) or (ii) differ in their LD structure due to sample LD. This limits the occurrence of close actual relationships between TS and PS (Figure S5, Un-LDA-QTL), and reduces the upper bounds of PA compared with the corresponding “Re”-scenarios (Figure 5, Un-LDA-QTL vs. Re-LDA-QTL). As NP increases, allele frequencies and LD between loci converge toward those in the ancestral population in both “Re”- and “Un”-scenarios (Table S2). In turn, the closest actual relationships between TS and PS converge as well (Figure S5), ultimately resulting in similar PA for Re-LDA-QTL and Un-LDA-QTL (Figure 5). In conclusion, the difference in predicting related and unrelated genotypes vanishes as NP increases for a given TS size, because it is then primarily ancestral information that drives the accuracy of GP.
Sample LD and cosegregation: crucial factors for PA in synthetics
LD in the parents represents a combination of LD carrying over from the ancestral population and LD generated anew due to limited NP. The latter LD, herein referred to as sample LD, results from a bottleneck in population size similar to that used in our simulations for generating long-range LD in the ancestral population (cf. Figure S1), but can be much stronger if NP is small (e.g., 4). Cosegregation is defined as the coinheritance of alleles at linked loci on the same gamete and, thus, describes the process that prevents parental LD between them from being rapidly eroded by recombination (Figure S6). Together, sample LD and cosegregation result in high LD in synthetics, which for small NP exceeds by far the level of ancestral LD (Figure 3B; see File S3 for details). However, the crucial property of sample LD is that it is specific to a set of parents and thus provides predictive information only for their descendants. Hence, using cosegregation as a “source of information” in GP relies on the presence of pedigree relationships (Habier et al. 2013). Conversely, the fraction of parental LD that stems from ancestral LD is a commonality among all descendants of the ancestral population, irrespective of pedigree relationships. The particularly small number of parents used in synthetics makes sample LD and cosegregation crucial factors contributing to PA, a situation that differs greatly from previously investigated scenarios (e.g., Habier et al. 2007, 2013; Wientjes et al. 2013). Hence, knowledge of how ancestral LD and sample LD contribute to parental LD, depending on NP, is essential for evaluating the applicability of training data to prediction of both related and unrelated genotypes.
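How slowly recombination erodes parental LD over a few intermating generations can be gauged with a textbook approximation. The sketch below assumes random mating in a large population, so that the disequilibrium coefficient D shrinks by a factor (1 − c) per generation, and it converts map distance to the recombination rate c with Haldane's mapping function. Both are assumptions made for illustration and are not necessarily the settings of the simulations; the function names are hypothetical.

```python
import math


def recombination_rate(distance_cm):
    """Haldane's mapping function: map distance in cM -> recombination rate c."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def ld_retained(distance_cm, generations):
    """Expected fraction of the initial disequilibrium D remaining after the
    given number of random-mating generations: (1 - c) ** generations."""
    return (1.0 - recombination_rate(distance_cm)) ** generations


# Fraction of parental D retained after 1 to 3 generations at several distances.
for d in (1, 10, 50):
    print(d, [round(ld_retained(d, t), 3) for t in (1, 2, 3)])
```

Under these assumptions, even unlinked loci (c = 0.5) lose only half of D per generation, which is consistent with LD created in a small set of parents persisting through the few meioses needed to produce a synthetic.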
The influence of sample LD on parental LD and PA in the “Re”-scenarios is illustrated best by considering different values of NP. For NP = 2, sample LD in the parents is maximized, because all pairs of polymorphic loci are in complete LD (r2 = 1.0), irrespective of ancestral LD, linkage, or genetic map distance. Cosegregation of linked QTL and SNPs during intermating largely conserves LD, even for loosely linked loci (Figure S6), so that LD in synthetics remained at high levels (Figure 3B). Therefore, replacing the actual relationship matrix with the marker-derived genomic relationship matrix resulted in merely a marginal reduction of PA (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP). Previous studies claimed that PA in biparental populations is the maximum obtainable for given TS size (Riedelsheimer et al. 2013; Lehermeier et al. 2014), despite absence of variation in pedigree relationships. Our results demonstrate that this is exclusively attributable to the efficient utilization of sample LD via cosegregation. For NP = 3 and NP = 4, LD can take two and four discrete values, respectively (see Results). Thus, sample LD still takes up a large share of parental LD (Figure 3A). However, the occurrence of different LD values (in contrast to NP = 2) introduces a dependency on ancestral LD: the frequency of loci with high parental LD increases in the presence of ancestral LD compared with ancestral linkage equilibrium. This difference carries over during intermating and resulted in increased LD in the synthetics, especially under long-range ancestral LD (Figure 3B, LR). However, the increment in PA was only marginal (Figure 5, Re-LDA-SNP vs. Re-LEA-SNP) owing to the overriding contribution of sample LD to parental LD for NP = 3 and NP = 4. Nevertheless, the reduction in sample LD for NP = 3 or NP = 4, compared with NP = 2, impaired cosegregation information and reinforced the decline in PA when relying on markers (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP). For NP ≥ 16, sample LD becomes negligible (Figure 3A) so that parental LD hardly differed from ancestral LD. This led to (i) a reinforced reduction in PA when using markers rather than known QTL genotypes (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP), especially for short-range ancestral LD, and (ii) convergence of PA of GBLUP and pedigree BLUP in the absence of ancestral LD (Figure 5, Re-LEA-SNP vs. Re-LDA-Ped). The reason for the latter is that, when cosegregation contributes only marginally, PA stems primarily from capturing pedigree relationships by SNPs.
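The discrete LD values that arise with very few parents can be checked by enumeration. The sketch below assumes fully inbred parents that each contribute one haplotype with equal weight and biallelic loci coded 0/1. It lists every pair of polymorphic allele patterns and collects the resulting r2 values using exact fractions. The function name `r2_values` is a hypothetical helper introduced for illustration.

```python
from fractions import Fraction
from itertools import product


def r2_values(n_parents):
    """Return the set of distinct r^2 values between pairs of polymorphic
    biallelic loci among n_parents inbred parents (one haplotype each)."""
    # A locus is described by which parents carry allele 1; skip fixed loci.
    patterns = [p for p in product((0, 1), repeat=n_parents)
                if 0 < sum(p) < n_parents]
    values = set()
    for a in patterns:
        for b in patterns:
            p_a = Fraction(sum(a), n_parents)
            p_b = Fraction(sum(b), n_parents)
            p_ab = Fraction(sum(x & y for x, y in zip(a, b)), n_parents)
            d = p_ab - p_a * p_b  # disequilibrium coefficient D
            values.add(d * d / (p_a * (1 - p_a) * p_b * (1 - p_b)))
    return values


for n in (2, 3, 4):
    print(n, sorted(r2_values(n)))
```

With two parents, every polymorphic pair of loci yields r2 = 1; with three parents, the possible values are 1/4 and 1; with four parents, they are 0, 1/9, 1/3, and 1. This agrees with complete LD for two parents and with the two and four discrete values mentioned in the text.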
For the “Un”-scenarios, sample LD is manifested independently in the parents of the TS and the parents of the PS, which results in different cosegregation “patterns” in TS and PS that cannot reliably be exploited in GP. Therefore, the ancestral LD that is common to both sets of parents, measured by linkage phase similarity in the synthetics (Figure 4), provides the only source of information connecting the TS and PS. This constraint resulted in a much larger drop in PA when replacing the actual relationship matrix with the marker-derived genomic relationship matrix in the “Un”-scenarios (Figure 5, Un-LDA-QTL vs. Un-LDA-SNP) compared with the corresponding “Re”-scenarios (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP), especially under short-range ancestral LD. This decline in PA when predicting the genetic merit of unrelated instead of related genotypes corroborates previous findings on GP across populations in both animal and plant breeding (de Roos et al. 2009; Hayes et al. 2009b; Riedelsheimer et al. 2013; Technow et al. 2013; Albrecht et al. 2014; Heslot and Jannink 2015).
Variation in linkage phase similarity between TS and PS caused by sample LD affects GP of unrelated genotypes in an unforeseeable manner: while identical and reversed QTL-SNP linkage phases manifested by sample LD cancel out on average, individual TS-PS combinations can show above- or below-average similarity in linkage phase and thus in cosegregation “patterns.” This translates into large variation of PA among different TS-PS combinations. Additional simulations using unequal NP to derive the TS and PS showed that variation in PA was even higher when using small NP to generate the PS than when using small NP to generate the TS (Figure S7). A possible explanation might be that, regardless of the TS composition, small NP for the PS drastically reduces the frequency of polymorphic loci (Table S2) and thereby increases the variation in linkage phase similarity with the TS for the remaining loci, which in turn increases the variability of prediction. Considering the practical relevance of such prediction scenarios, further research is needed to investigate this finding in greater detail.
Influence of LD on capturing pedigree relationships
The ability to capture pedigree relationships by SNPs increases with the effective number of independently segregating SNPs in the model (Habier et al. 2007). Higher LD between SNPs reduces this number and, thus, reduces the contribution of pedigree relationships captured by SNPs to PA. Scenario Re-LEA-SNP demonstrates this fact for large values of NP, where LD between QTL and SNPs in synthetics was small (Figure 3B) and, hence, PA mainly relied on capturing pedigree relationships. In line with this reasoning, PA decreased from SR to LR (Figure 5, Re-LEA-SNP), as well as when marker density was reduced from 5 to only 0.25 SNPs cM−1, because, similar to increasing LD, using low marker density reduced the number of independently segregating SNPs (Figure S4, Re-LEA-SNP).
In GBLUP, the consequences of an imprecise estimation of pedigree relationships by SNPs due to strong LD are limited, because the loss in PA compared with pedigree BLUP is mostly overcompensated for by capturing either cosegregation (Figure 5; small NP, Re-LEA-SNP vs. Re-LDA-Ped) or long-range ancestral LD (Figure 5; large NP, Re-LEA-SNP vs. Re-LDA-Ped). An exception is the combination of large NP and short-range ancestral LD, where the comparatively small contribution of ancestral LD to PA does not necessarily compensate for that loss, so that GBLUP might not provide the desired advantage over pedigree BLUP. Alternative models employing variable selection (e.g., BayesB), which capitalize more on LD rather than pedigree relationships (Habier et al. 2007; Zhong et al. 2009; Jannink et al. 2010), might help to improve the prospects of GP in such cases.
Influence of training set size on PA
In this study, we varied training set size for given values of NP, because resources devoted to the TS differ between breeding programs and do not necessarily depend on NP. Under fixed NP, the absolute frequency of individuals with close actual relationship among TS and PS increases with NTS (Figure S5), which led to similar benefits in PA for all “Re”-scenarios (Figure S3, except Re-LDA-Ped). However, the general decline of PA in these scenarios with increasing NP was only slightly attenuated even when using 750 instead of 125 individuals in the TS. This is because the need for larger TS size increases rapidly as pedigree relationships with the PS decrease (Habier et al. 2010), which in turn shifts the distribution of actual relationships toward lower values (Figure 6B and Figure S2). Thus, NTS must generally be increased along with NP to counteract the expected decline in PA as much as possible.
According to Habier et al. (2013), altering TS size affects the contributions of the three sources of information to PA, but this inference is based on the assumption that TS size was increased by adding new families to the TS (unrelated to the initially included families), which is comparable to increasing NP in our study. The estimation of actual relationships by SNPs was shown by de los Campos et al. (2013) to be sufficiently characterized by (Figure S8) and, thus, largely independent of NTS, apart from estimation error. In synthetics, the distribution of actual relationships is defined by NP and LDA (Figure 6 and Figure S2). Thus, increasing NTS increases the chances for each individual in the PS to have several individuals with close actual relationships in the TS, which was previously found to be crucial for achieving high PA (Jannink et al. 2010; Clark et al. 2012). Therefore, the contributions to PA from cosegregation and ancestral LD increase with NTS, because they are required to capture deviations from pedigree relationships. Conversely, using small NTS will tend to hamper the occurrence of high qij values and, hence, increase the reliance on pedigree relationships.
If TS and PS are unrelated, the absolute frequency of close actual relationships is low, even if NTS is large (Figure S5). Additionally, actual relationships are rather poorly estimated by SNPs when relying solely on ancestral LD (Figure S8, Un-LDA-SNP). Consequently, a huge NTS () and high marker density would be required to substantially elevate PA, especially if there is only short-range ancestral LD (cf. de los Campos et al. 2013).
Influence of marker density on PA
High marker density is especially important if LD between QTL and SNPs extends only to short map distances (Solberg et al. 2008; Zhong et al. 2009; Hickey et al. 2014). This applies in our study if either sample LD was negligible (Figure S4; large NP, Re-LDA-SNP vs. Re-LEA-SNP) or if TS and PS were unrelated (Figure S4, Un-LDA-SNP), so that PA relied heavily on ancestral LD. Our results also show that, in the latter case, using high marker density strongly improved PA for both ancestral populations, implying that capturing LD between tightly linked loci ( cM) is beneficial even if long-range ancestral LD prevails. With low marker density, capturing only the “long-range part” of ancestral LD (Figure 1, LR) still provided moderate PA (Figure S4, LR), but PA dropped below 0.1 for short-range ancestral LD (Figure S4, SR). This was likely because most SNPs were no longer in LD with QTL and thus contributed mostly noise to the prediction equation. These results are in agreement with former studies (de los Campos et al. 2013; Habier et al. 2013; Hickey et al. 2014; Lorenz and Smith 2015) reporting that, under insufficient marker density, adding individuals unrelated to the PS to the TS can even decrease PA.
In summary, the required marker density for large NP should be chosen in accordance with the extent of ancestral LD. While, in this case, high density is mandatory if TS and PS are unrelated, moderate PA can still be achieved under low marker density if TS and PS are related, because pedigree relationships contribute to PA. For small NP, extensive LD in synthetics (due to sample LD and cosegregation) lowers the requirements on marker density. Although cosegregation is captured optimally if SNPs and QTL are as tightly linked as possible, medium marker density (1 SNPs cM−1, depending on NP) is likely sufficient to reach PA near the optimum.
Expected impact of ancestral LD on GP in synthetics
In GP of genetic predisposition in humans or breeding values of bulls, the availability of several thousand training individuals, in conjunction with high marker densities, allows for efficient use of rather low levels of ancestral LD, as usually observed in these species (de Roos et al. 2008; Goddard and Hayes 2009; de los Campos et al. 2013). We showed that short-range ancestral LD is generally less valuable in plant breeding, where TS usually comprise only hundreds or fewer individuals. Ancestral LD can differ substantially among crops and different germplasm within crops (Flint-Garcia et al. 2003). Usually, low levels of ancestral LD are found in diversity panels that encompass lines from different breeding programs and/or geographic origin, as well as in materials largely unselected by breeders, such as landraces or gene bank accessions (Hyten et al. 2007; Delourme et al. 2013; Romay et al. 2013). Recently, Gorjanc et al. (2016) proposed GP for recurrent selection of synthetics generated from doubled haploid lines derived from landraces. In the light of our findings, such an approach generally requires large TS size and high marker density to outperform pedigree BLUP, unless one chooses small NP to ensure satisfactory PA due to cosegregation.
In contrast, extensive long-range ancestral LD is usually found in elite breeding germplasm of major crops such as maize (Windhausen et al. 2012; Unterseer et al. 2014), wheat (Maccaferri et al. 2005), barley (Zhong et al. 2009), soybean (Hyten et al. 2007), or sugar beet (Würschum et al. 2013). If synthetics were derived from such germplasm, ancestral LD is expected to contribute substantially to PA, as shown by our results. However, LD determined from biallelic SNPs might overestimate ancestral LD between QTL-SNP pairs, because their allele frequencies can deviate due to ascertainment bias in discarding SNPs with low minor allele frequencies for the construction of SNP arrays (Ganal et al. 2011; Goddard et al. 2011). Such an overestimation would impair the advantage of GP approaches over pedigree BLUP.
Implications for other scenarios relevant in plant breeding
Research on GP in plant breeding has so far focused primarily on the use of single (e.g., Lorenzana and Bernardo 2009; Riedelsheimer et al. 2013) and multiple segregating BF (e.g., Albrecht et al. 2011; Heffner et al. 2011; Schulz-Streeck et al. 2012; Habier et al. 2013; Lehermeier et al. 2014). For NP = 2, our scenarios Re-LDA-SNP and Un-LDA-SNP correspond exactly to GP within and between BF derived from unrelated parents. In practice, breeders mostly derive lines directly from F1 crosses (Mikel and Dudley 2006), whereas we applied a further generation of intermating (Figure S1). This additional meiosis slightly reduces LD in synthetics (see File S3 for details) and, in turn, PA (results not shown). While GP within BF generally works well, predicting an unrelated BF can be risky and unreliable (Riedelsheimer et al. 2013), as underlined by our results for scenario Un-LDA-SNP (Figure S7, ). Similar uncertainties might be encountered if new lines from an untested BF are predicted based on preexisting data from multiple BF (Heffner et al. 2011), diversity panels (Würschum et al. 2013), or populations of experimental hybrids (Massman et al. 2013) to obtain predicted breeding values prior to partially phenotyping the new cross (Figure S7, in TS and in PS). The risk of such approaches is likely attenuated in advanced breeding cycles, where putatively “unrelated” BF usually share more recent common ancestors than a TS comprising truly unrelated material, as would be the case in an “ideal” diversity panel. However, Hickey et al. (2014) showed that if two BF share only a grandparent as their most recent common ancestor, PA was not substantially higher than for unrelated BF. This underpins the need for close relatives in the TS (e.g., full-sibs or half-sibs) to warrant high and robust PA across different prediction targets. Accordingly, previous studies on GP in diversity panels concluded that the observed medium to high PAs were partially attributable to latent groups of related germplasm (e.g., Rincent et al. 2012; Schopp et al. 2015).
If a BF is too small for training the prediction equation, multiple BF can alternatively be pooled (Heffner et al. 2011; Technow and Totir 2015). Such a combined TS can be constructed by sampling lines from each BF to predict the remainder in each BF (“within”) or by using some BF to predict other BF (“across”) (cf. Albrecht et al. 2011). Our scenarios Re-LDA-SNP and Un-LDA-SNP are similar to these “within” and “across” situations for , but, besides the additional meiosis discussed above, show another important difference to F1-derived multiple BF: generating synthetics by random mating of the Syn-1 generation breaks up the clear pedigree structure in full-sib, half-sib, and unrelated families (Figure S9). This reduces both the mean and variance of pedigree relationships, which in turn reduces PA (results not shown). As discussed above, capturing pedigree relationships plays a major role in GP of both synthetics and multiple BF if TS and PS are related, especially if NP is large. This is because, in both situations, cosegregation is barely used to obtain “accuracy within families” (cf. Habier et al. 2013). In practical breeding programs using multiple BF, the situation might be slightly different if some parents are overrepresented compared with others and introduce a predominant linkage phase pattern that can be exploited in GP. Moreover, one has the opportunity to improve information from cosegregation by (i) clustering related BF into the TS to reflect the cosegregation pattern of the PS or (ii) explicit modeling of cosegregation (cf. Habier et al. 2013) or family-specific effects using hierarchical models (Technow and Totir 2015). However, neither of these strategies is easily accessible in synthetics, unless one replaces random mating by controlled mating to keep track of pedigree relationships. Since ancestral LD persists well over generations (Habier et al. 2007), its contribution to PA is expected to be only marginally affected by additional intermating generations. Thus, ancestral LD can generally be considered of great importance for GP of material related or unrelated to the TS, particularly if NP is large.
In the present study, we considered the two most extreme situations of relatedness or unrelatedness of the TS and PS, because their parents were either identical or entirely different. Further research is warranted for situations of partial overlap of parents among families, which occurs frequently in practice, e.g., when proven inbred lines contribute to multiple crosses in subsequent breeding cycles. Moreover, we focused here exclusively on PA, but the genetic gain from genomic selection, which is of ultimate interest to breeders, depends additionally on the genetic variance in the population. Since both parameters are influenced by the choice of NP, the potential of recurrent genomic selection in synthetics needs to be examined for different values of NP and different levels of ancestral LD, ideally across multiple selection cycles.
Acknowledgments
We thank Chris-Carolin Schön, Tobias Würschum, José Marulanda, Willem Molenaar, and three anonymous reviewers for valuable suggestions to improve the content of the manuscript. P.S. acknowledges Syngenta for partially funding this research by a Ph.D. fellowship and A.E.M. the financial contribution of the International Maize and Wheat Improvement Center/German Agency for International Development (CIMMYT/GIZ) through the Climate Resilient Maize for Asia project 15.78600.8-001-00.
Footnotes
Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.193243/-/DC1.
Communicating editor: A. Charcosset
1These authors contributed equally to this work.
Literature Cited
- Albrecht T., Wimmer V., Auinger H., Erbe M., Knaak C., et al., 2011. Genome-based prediction of testcross values in maize. Theor. Appl. Genet. 123: 339–350.
- Albrecht T., Auinger H.-J., Wimmer V., Ogutu J. O., Knaak C., et al., 2014. Genome-based prediction of maize hybrid performance across genetic groups, testers, locations, and years. Theor. Appl. Genet. 127: 1375–1386.
- Bandillo N., Raghavan C., Muyco P., 2013. Multi-parent advanced generation inter-cross (MAGIC) populations in rice: progress and potential for genetics research and breeding. Rice (N. Y.) 6: 1–15.
- Bradshaw J. E., 2016. Plant Breeding: Past, Present and Future. Springer International Publishing, Edinburgh, NY.
- Cavanagh C., Morell M., Mackay I., Powell W., 2008. From mutations to MAGIC: resources for gene discovery, validation and delivery in crop plants. Curr. Opin. Plant Biol. 11: 215–221.
- Clark S. A., Hickey J. M., Daetwyler H. D., van der Werf J. H., 2012. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44: 4.
- de Koning D.-J., 2016. Meuwissen et al. on genomic selection. Genetics 203: 5–7.
- de los Campos G., Vazquez A. I., Fernando R., Klimentidis Y. C., Sorensen D., 2013. Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS Genet. 9: 7.
- Delourme R., Falentin C., Fomeju B. F., Boillot M., Lassalle G., et al., 2013. High-density SNP-based genetic map development and linkage disequilibrium assessment in Brassica napus L. BMC Genomics 14: 120.
- de Roos A. P., Hayes B. J., Spelman R. J., Goddard M. E., 2008. Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics 179: 1503–1512.
- de Roos A. P., Hayes B. J., Goddard M. E., 2009. Reliability of genomic predictions across multiple populations. Genetics 183: 1545–1553.
- Endelman J. B., 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4: 250–255.
- Falconer D. S. F., Mackay T. F. C., 1996. Introduction to Quantitative Genetics, Ed. 4. Pearson, Essex.
- Flint-Garcia S. A., Thornsberry J. M., Buckler E. S., 2003. Structure of linkage disequilibrium in plants. Annu. Rev. Plant Biol. 54: 357–374.
- Ganal M. W., Durstewitz G., Polley A., Bérard A., Buckler E. S., et al., 2011. A large maize (Zea mays L.) SNP genotyping array: development and germplasm genotyping, and genetic mapping to compare with the B73 reference genome. PLoS One 6: e28334.
- Goddard M. E., Hayes B. J., 2009. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat. Rev. Genet. 10: 381–391.
- Goddard M. E., Hayes B. J., Meuwissen T. H. E., 2011. Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 128: 409–421.
- Gorjanc G., Jenko J., Hearne S. J., Hickey J. M., 2016. Initiating maize pre-breeding programs using genomic selection to harness polygenic variation from landrace populations. BMC Genomics 17: 30.
- Habier D., Fernando R. L., Dekkers J. C. M., 2007. The impact of genetic relationship information on genome-assisted breeding values. Genetics 177: 2389–2397.
- Habier D., Tetens J., Seefried F., Lichtner P., Thaller G., 2010. The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genet. Sel. Evol. 42: 5.
- Habier D., Fernando R. L., Garrick D. J., 2013. Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194: 597–607.
- Hagdorn S., Lamkey K. R., Frisch M., Guimaraes P. E. O., Melchinger A. E., 2003. Molecular genetic diversity among progenitors and derived elite lines of BSSS and BSCB1 maize populations. Crop Sci. 43: 474–482.
- Hallauer A. R., Carena M. J., Miranda Filho J. B., 2010. Quantitative Genetics in Maize Breeding. Springer, Ames, IA.
- Hartl D. L., Clark A. G., 2007. Principles of Population Genetics. Sinauer Associates, Sunderland, Mass.
- Hayes B. J., Bowman P. J., Chamberlain A. J., Goddard M. E., 2009a. Genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92: 433–443.
- Hayes B. J., Bowman P. J., Chamberlain A. C., Verbyla K., Goddard M. E., 2009b. Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet. Sel. Evol. 41: 51.
- Hayes B. J., Visscher P. M., Goddard M. E., 2009c. Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res. 91: 47–60.
- Heffner E. L., Jannink J., Sorrells M. E., 2011. Genomic selection accuracy using multifamily prediction models in a wheat breeding program. Plant Genome 4: 65–75.
- Henderson C., 1984. Applications of Linear Models in Animal Breeding. University of Guelph, Guelph, ON.
- Heslot N., Jannink J.-L., 2015. An alternative covariance estimator to investigate genetic heterogeneity in populations. Genet. Sel. Evol. 47: 93.
- Hickey J. M., Dreisigacker S., Crossa J., Hearne S., Babu R., et al., 2014. Evaluation of genomic selection training population designs and genotyping strategies in plant breeding programs using simulation. Crop Sci. 54: 1476–1488.
- Hill W. G., 1981. Estimation of effective population size from data on linkage disequilibrium. Genet. Res. 38: 209–216.
- Hill W. G., Robertson A., 1968. Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38: 226–231.
- Hill W. G., Weir B. S., 2011. Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet. Res. 93: 47–64.
- Hyten D. L., Choi I. Y., Song Q., Shoemaker R. C., Nelson R. L., et al., 2007. Highly variable patterns of linkage disequilibrium in multiple soybean populations. Genetics 175: 1937–1944.
- Jannink J. L., Lorenz A. J., Iwata H., 2010. Genomic selection in plant breeding: from theory to practice. Brief. Funct. Genomics 9: 166–177.
- Lehermeier C., Krämer N., Bauer E., Bauland C., Camisan C., et al., 2014. Usefulness of multi-parental populations of maize (Zea mays L.) for genome-based prediction. Genetics 198: 3–16.
- Lin Z., Hayes B. J., Daetwyler H. D., 2014. Genomic selection in crops, trees and forages: a review. Crop Pasture Sci. 65: 1177–1191.
- Lorenz A. J., Smith K. P., 2015. Adding genetically distant individuals to training populations reduces genomic prediction accuracy in Barley. Crop Sci. 55: 2657–2667.
- Lorenzana R. E., Bernardo R., 2009. Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor. Appl. Genet. 120: 151–161.
- Maccaferri M., Sanguineti M. C., Noli E., Tuberosa R., 2005. Population structure and long-range linkage disequilibrium in a durum wheat elite collection. Mol. Breed. 15: 271–289.
- Mackay I., Ober E., Hickey J., 2015. GplusE: beyond genomic selection. Food Energy Secur. 4: 25–35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Massman J. M., Gordillo A., Lorenzana R. E., Bernardo R., 2013. Genomewide predictions from maize single-cross data. Theor. Appl. Genet. 126: 13–22. [DOI] [PubMed] [Google Scholar]
- McMullen M. D., Kresovich S., Villeda H. S., Bradbury P., Li H., et al. , 2009. Genetic properties of the maize nested association mapping population. Science 325: 737–740. [DOI] [PubMed] [Google Scholar]
- Meuwissen T. H. E., Hayes B. J., Goddard M. E., 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mikel M. A., Dudley J. W., 2006. Evolution of North American dent corn from public to proprietary germplasm. Crop Sci. 46: 1193–1205. [Google Scholar]
- Powell J. E., Visscher P. M., Goddard M. E., 2010. Reconciling the analysis of IBD and IBS in complex trait studies. Nat. Rev. Genet. 11: 800–805. [DOI] [PubMed] [Google Scholar]
- R Core Team , 2012. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Google Scholar]
- Riedelsheimer C., Endelman J. B., Stange M., Sorrells M. E., Jannink J. L., et al. , 2013. Genomic predictability of interconnected biparental maize populations. Genetics 194: 493–503. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rincent R., Laloë D., Nicolas S., Altmann T., Brunel D., et al. , 2012. Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192: 715–728. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Romay M. C., Millard M. J., Glaubitz J. C., Peiffer J. A., Swarts K. L., et al. , 2013. Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol. 14: R55. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sargolzaei M., Schenkel F. S., 2009. QMSim: a large-scale genome simulator for livestock. Bioinformatics 25: 680–681. [DOI] [PubMed] [Google Scholar]
- Schopp P., Riedelsheimer C., Utz H. F., Schön C.-C., Melchinger A. E., 2015. Forecasting the accuracy of genomic prediction with different selection targets in the training and prediction set as well as truncation selection. Theor. Appl. Genet. 128: 2189–2201. [DOI] [PubMed] [Google Scholar]
- Schulz-Streeck T., Ogutu J. O., Karaman Z., Knaak C., Piepho H. P., 2012. Genomic selection using multiple populations. Crop Sci. 52: 2453–2461. [Google Scholar]
- Solberg T. R., Sonesson A. K., Woolliams J. A., Meuwissen T. H., 2008. Genomic selection using different marker types and densities. J. Anim. Sci. 86: 2447–2454. [DOI] [PubMed] [Google Scholar]
- Suneson C. A., 1956. An evolutionary plant breeding method. Agron. J. 6: 1–4. [Google Scholar]
- Technow F., Totir L. R., 2015. Using Bayesian multilevel whole genome regression models for partial pooling of training sets in genomic prediction. G3 (Bethesda) 5: 1603–1612. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Technow F., Bürger A., Melchinger A. E., 2013. Genomic prediction of northern corn leaf blight resistance in maize with combined or separated training sets for heterotic groups. G3 (Bethesda) 3: 197–203. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Unterseer S., Bauer E., Haberer G., Seidel M., Knaak C., et al. , 2014. A powerful tool for genome analysis in maize: development and evaluation of the high density 600 k SNP genotyping array. BMC Genomics 15: 823. [DOI] [PMC free article] [PubMed] [Google Scholar]
- VanRaden P. M., 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423. [DOI] [PubMed] [Google Scholar]
- Vela-Avitúa S., Meuwissen T. H., Luan T., Ødegård J., 2015. Accuracy of genomic selection for a sib-evaluated trait using identity-by-state and identity-by-descent relationships. Genet. Sel. Evol. 47: 9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wientjes Y. C. J., Veerkamp R. F., Calus M. P. L., 2013. The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics 193: 621–631. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Windhausen V. S., Atlin G. N., Hickey J. M., Crossa J., Jannink J.-L., et al. , 2012. Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3 (Bethesda) 2: 1427–1436. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wright S., 1922. Coefficients of inbreeding and relationship. Am. Nat. 56: 330–338. [Google Scholar]
- Würschum T., Reif J. C., Kraft T., Janssen G., Zhao Y., 2013. Genomic selection in sugar beet breeding populations. BMC Genet. 14: 85. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhong S., Dekkers J. C. M., Fernando R. L., Jannink J.-L., 2009. Factors affecting accuracy from genomic selection in populations derived from multiple inbred lines: a barley case study. Genetics 182: 355–364. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
Data Availability Statement
Figure S1 provides a detailed overview of the entire simulation scheme and the assumptions underlying all results presented herein.






