Abstract
Mate selection plays an important role in breeding programs. The Usefulness Criterion was proposed to improve mate selection, combining information on both the mean and standard deviation of the potential offspring of a cross, particularly in clonally propagated species where large family sizes are possible. Predicting the mean value of a cross is generally easier than predicting the standard deviation, especially in outbred species when the linkage of alleles is unknown and phasing is required. In this study, we developed a method for estimating phasing accuracy from unphased genotype data on possible parental lines and evaluated whether the accuracy was sufficient to predict family standard deviations of possible crosses. We used simulations spanning a wide range of genetic architectures and used genotypes from a real strawberry breeding population to evaluate the conditions when usefulness could be accurately predicted. We found that with highly accurate computational phasing, predicting family standard deviations and usefulness criteria for potential crosses yields benefit over simply selecting crosses based on predicted family means only at high selection intensity and high heritability and with small numbers of QTL. However, even then the gain from using the family usefulness is small.
Keywords: phasing errors, family variance, usefulness criterion, cross selection, Genomic Prediction, GenPred, Shared Data Resource
Introduction
Plant breeding programs aim to create genotypes with improved performance in one or more target traits. New genotypes are created by making crosses between existing genotypes, using recombination and Mendelian assortment to assemble sets of alleles that did not exist. These new genotypes can then be evaluated and released as commercial products or recycled to create new crosses. The success of a breeding program can be improved by evaluating genotypes more efficiently and making crosses that are more likely to generate better genotypes (Meuwissen et al. 2001; Heffner et al. 2009; Bernardo 2014; Forster et al. 2015). In recent years, genomic data have been leveraged in many breeding programs to make both steps more effective. Genomic data can help make the evaluation of genotypes based on phenotypic data more precise by modeling the correlations between phenotypes and genotypes. Genomic data can also be used to select parents to be crossed by predicting their breeding value (BV), or the expected average performance of their offspring (Meuwissen et al. 2001; Heffner et al. 2009; Jannink et al. 2010). Both uses of genomic data are now commonly implemented in plant breeding programs.
Predicting the breeding value of potential parents is widely used to choose crossing partners. However, when the number of offspring per cross is large, the best crosses are the ones that will create the best possible offspring, not the crosses that will make the best average offspring. This situation is common in many plant systems because large numbers of seeds can be generated from the same cross. This was recognized nearly 50 years ago by Schnell and Utz (1975), who proposed the usefulness criterion (UC), or usefulness, to design optimal crosses (Zhong and Jannink 2007). Usefulness is defined as
$$U_j = \mu_j + i\,\sigma_j$$
where $\mu_j$ is the average genetic value of offspring of cross j, $\sigma_j$ is the standard deviation (SD) of genetic values among these offspring, and i is the within-family selection intensity, the difference in mean phenotypic value of selected offspring from each family relative to the family’s mean, expressed in standard deviation units. A consequence of the usefulness criterion is that crosses between parents that individually have good breeding values are not necessarily the best to make. For example, the upper tail of the distribution of genetic values of offspring from a cross with high variance (high $\sigma_j$) may have higher values than the upper tail of the distribution of offspring from a family with higher mean but lower variance. Selection decisions based on usefulness have received little attention in animal breeding, possibly because the number of offspring per cross is very small, so the chance of creating offspring in the tails is low. However, several recent plant breeding studies have suggested that usefulness-based breeding might be successful, particularly in clonally propagated crops (Wolfe et al. 2021).
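As a small worked illustration of this definition, the sketch below assumes the usual form of the criterion (usefulness = family mean + selection intensity × family SD). The two crosses and their values are hypothetical and chosen only to show how a cross with a lower mean but a larger standard deviation can have the higher usefulness.

```python
def usefulness(mean, sd, intensity):
    """Usefulness of a cross: family mean plus selection intensity times family SD."""
    return mean + intensity * sd


# Hypothetical crosses (illustrative numbers, not taken from the study).
cross_a = {"mean": 10.0, "sd": 1.0}  # higher mean, lower variance
cross_b = {"mean": 9.5, "sd": 2.0}   # lower mean, higher variance

for i in (1.40, 2.67):  # moderate and strong within-family selection
    u_a = usefulness(cross_a["mean"], cross_a["sd"], i)
    u_b = usefulness(cross_b["mean"], cross_b["sd"], i)
    print(f"i={i}: U_A={u_a:.2f}, U_B={u_b:.2f}")
```

In this toy case cross B ranks above cross A at both intensities, and the gap widens as selection becomes more intense.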
A simple method to predict the mean breeding value of a family is to use the average of the breeding values of the two parents. This will be accurate as long as most QTL operate additively. However, even under additive gene action, predicting the variance within a family is more challenging because patterns of co-segregation of causal loci among genotypes also contribute to variance. For example, a cross between two genotypes with two coupling heterozygous loci would produce a greater variance in the next generation than one with two repulsion heterozygous loci (Lynch and Walsh 1998; Zhong and Jannink 2007; Wolfe et al. 2021). Predicting co-segregation of causal loci from a cross of two outbred parents requires information on the phasing of alleles between all pairs of heterozygous loci in each parent, and phased haplotypes are not directly measured by most genotyping platforms when individuals are heterozygous. Computational phasing using software like Beagle (Browning and Browning 2007; Browning et al. 2018) and FILLIN (Swarts et al. 2014) has been used in plant populations (Poland and Rife 2012; Scott et al. 2020). However, whether these methods are sufficiently accurate to predict family variance is unknown. Once phased haplotypes are known, QTL effect sizes need to be estimated in a training population, and then either family population variance (Zhong and Jannink 2007; Osthushenrich et al. 2017) or sample variance can be predicted (Zhong and Jannink 2007; Lehermeier et al. 2017; Osthushenrich et al. 2017; Wolfe et al. 2021; Bernardo 2022).
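The effect of linkage phase on variance can be made concrete with a two-locus sketch. It assumes a parent heterozygous at two additive loci with alternative-allele effects a1 and a2 and recombination fraction c between them. Each allele indicator in a gamete then has variance 1/4, and the covariance between the two indicators is +(1 − 2c)/4 in coupling and −(1 − 2c)/4 in repulsion. The function name and the example values are illustrative only.

```python
def gamete_variance(a1, a2, c, coupling=True):
    """Variance of gamete values from a double heterozygote at two additive loci.

    a1, a2: effects of the alternative alleles at the two loci.
    c: recombination fraction between the loci (0 to 0.5).
    coupling: True if both alternative alleles sit on the same haplotype.
    """
    sign = 1.0 if coupling else -1.0
    cov = sign * (1.0 - 2.0 * c) / 4.0  # covariance between allele indicators
    return (a1 ** 2 + a2 ** 2) / 4.0 + 2.0 * a1 * a2 * cov


# Two equal-effect loci that are tightly linked (c = 0.1).
print(gamete_variance(1.0, 1.0, 0.1, coupling=True))
print(gamete_variance(1.0, 1.0, 0.1, coupling=False))
```

With unlinked loci (c = 0.5) the covariance term vanishes and phase no longer matters, which is why phasing accuracy is most relevant for linked causal loci.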
The benefit of choosing crosses based on their usefulness will likely vary among crops and breeding programs due to differences in genetic architecture, breeding system, and training population size. Zhong and Jannink (2007) evaluated the benefit of the usefulness in inbred populations by analytical studies and simulations. They found a generally limited utility of usefulness in inbred populations. A recent study by Wolfe et al. (2021) evaluated the usefulness in a cassava breeding program with pre-phased parental individuals and empirical observations of cross-variance in 4 traits. They argued that both family mean and variance were important metrics for mate selection. In their results, the family mean and usefulness criterion were correlated and similar selections were made using either criterion. This was one of the first applications of usefulness in an outbred, clonally propagated species using whole-genome marker data and genomic prediction models of marker effects. Our work focuses on strawberry, another outbred, clonally propagated species, and on the University of California Davis (UCD) strawberry breeding program. This breeding program is currently investing in infrastructure to implement genomics data in breeding decisions (Pincot et al. 2022; Hardigan et al. 2023; Jiménez et al. 2023; Feldmann et al. 2024b; Knapp et al. 2024) and has detailed records of the pedigrees of current varieties (Pincot et al. 2021; Knapp et al. 2023; Feldmann et al. 2024a) which help evaluate the success of the statistical models necessary for predicting usefulness as detailed below.
Our study builds on the work of Zhong and Jannink (2007) and Wolfe et al. (2021) in three key ways. First, we develop and apply a method to measure the accuracy of computationally phased parental haplotypes which is necessary for accurately predicting usefulness. Second, our study uses a different reference population with a different pedigree topology and a different marker density. Zhong and Jannink (2007) studied the benefit of usefulness in inbred species, while our study focuses on an outcross species. Therefore, our results test whether their conclusions are robust in other systems. Finally, we use numerical simulations of genetic architectures so that we can compare computationally predicted usefulness values to the actual (simulated) values without measurement error. We repeat these simulations across a range of additive genetic architectures from oligogenic to polygenic and from low to high heritability so that our conclusions can generalize to many possible target traits. Overall, our results show that despite highly accurate computational phasing, the conditions under which selection decisions based on usefulness could be useful are relatively rare. This is consistent with earlier results from Bernardo (2022) who studied predicting genetic variance within biparental populations created from inbred parents and concluded that there was little benefit to modeling within family genetic variance.
Methods and materials
Genetic data and genotyping error estimation
We used genetic and pedigree data from a population of 1,007 strawberry genotypes with 37,441 SNPs. To assess genotyping error rates and pedigree accuracy, we identified 593 family trios from the pedigree, each composed of two parents and one of their progeny. These 593 progenies were not parents of any other genotypes in the pedigree, and their parents were among the remaining 414 genotypes. Within each family trio, we examined SNP markers for which both parents were homozygous. We compared the parental and progeny genotypes at those markers and counted the number of mismatches between parents and progeny. For example, parental genotypes AA and aa with a progeny genotype AA were considered a mismatch. The genotyping error rate was calculated as the number of mismatched markers divided by the total number of homozygous markers compared per trio. All of the family trios had a genotyping error rate of .
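The trio-based check described above can be sketched as follows. The sketch assumes genotypes are coded as counts of the alternative allele (0, 1, 2) with `None` for missing calls. The function name and the coding are illustrative assumptions, not the authors' implementation.

```python
def trio_error_rate(parent1, parent2, child):
    """Fraction of markers where the child is inconsistent with two homozygous parents.

    Each argument is a sequence of genotypes coded 0/1/2 (alternative-allele count)
    or None for a missing call. Only markers where both parents are homozygous
    and the child is called are compared.
    """
    compared = 0
    mismatched = 0
    for g1, g2, gc in zip(parent1, parent2, child):
        if g1 in (0, 2) and g2 in (0, 2) and gc is not None:
            compared += 1
            expected = (g1 + g2) // 2  # each homozygous parent transmits g/2 copies
            if gc != expected:
                mismatched += 1
    return mismatched / compared if compared else float("nan")


# The last marker mirrors the AA x aa -> AA mismatch example in the text.
p1 = [0, 2, 2, 1, 0]
p2 = [0, 0, 2, 2, 2]
kid = [0, 1, 2, 1, 2]
print(trio_error_rate(p1, p2, kid))
```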
Phasing
To predict the family breeding value variance, the linkage between heterozygous alleles within each parent must be known. Assigning heterozygous genotype calls to haplotypes is phasing. We used Beagle 5.0 (Browning and Browning 2007; Browning et al. 2018), version 05May22.33a and seed 0, to phase the SNP data for all 1,007 genotypes and to impute any missing genotype calls. We used a genetic map for the markers with 28 pairs of chromosomes averaging 223 centiMorgans (cM), measured as the average distance between the furthest markers on the chromosomes. We excluded 2,530 of the 37,411 markers due to their unknown map positions or low allele frequencies.
Simulation of offspring
We calculated family mean and standard deviation empirically by simulating crosses. We randomly selected 10,000 unique pairs of strawberry genotypes as parents for the crossing simulation. For each pair of parents, we used the Hypred R package (Technow 2011) to simulate crossover events and produce 400 gametes using the same genetic map as above, and the gametes were randomly paired to produce 200 offspring. Crossover events were simulated based on a count-location model. The number of crossover events (on a single chromosome) followed a Poisson distribution, $\mathrm{Poisson}(\lambda = L)$, where L was the length of the chromosome in Morgans. The locations of crossover events were sampled from a uniform distribution, $U(0, L)$.
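A count-location model of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions, not the Hypred implementation: the crossover count is drawn from a Poisson distribution with mean equal to the chromosome length in Morgans, positions are drawn uniformly along the chromosome, and a gamete is built by switching between the two parental haplotypes at each crossover. All function names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_crossovers(length_morgans, rng):
    """Sorted crossover positions (in Morgans) on one chromosome."""
    n = sample_poisson(length_morgans, rng)
    return sorted(rng.uniform(0.0, length_morgans) for _ in range(n))


def make_gamete(hap_a, hap_b, positions, length_morgans, rng):
    """Build one gamete from two phased haplotypes.

    hap_a, hap_b: allele lists for the two parental haplotypes.
    positions: marker positions in Morgans, in increasing order.
    """
    crossovers = simulate_crossovers(length_morgans, rng)
    # Start on a randomly chosen parental haplotype.
    current, other = (hap_a, hap_b) if rng.random() < 0.5 else (hap_b, hap_a)
    gamete, next_co = [], 0
    for idx, pos in enumerate(positions):
        # Switch haplotype once for every crossover located before this marker.
        while next_co < len(crossovers) and crossovers[next_co] < pos:
            current, other = other, current
            next_co += 1
        gamete.append(current[idx])
    return gamete


rng = random.Random(0)
print(make_gamete([0] * 5, [1] * 5, [0.1, 0.5, 1.0, 1.5, 2.0], 2.23, rng))
```

Pairing two such gametes, one from each parent, gives one simulated offspring genotype.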
Estimation of imputation and phasing error
To measure the accuracy of phasing and imputation, we re-ran Beagle using the 414 genotypes that were not the progenies of the 593 family trios, and left the 593 progenies unphased. We estimated the imputation error using the same method of estimating the genotyping error as above. We compared genotypes of progeny to their computationally phased parents at pairs of markers for which at least one parent was heterozygous at both markers (e.g. AaBb and AABB parents). We measured the percentage of progeny genotypes that were inconsistent with the predicted parental haplotypes for markers, binned by the cM distance between markers.
We compared the calculated rates of phasing inconsistencies to the expected rates of inconsistencies given our genetic map, by introducing phasing errors in the haplotypes of each parent and then comparing the “incorrect” parent haplotypes to the simulated offspring genotypes using the “correct” parent haplotypes. We estimated the rate of offspring genotype inconsistency with the “incorrect” parental haplotypes using the method above. We defined 6 rates of phasing error per cM distance as the probability of a phasing error between two SNPs 1 cM apart. Phasing errors were introduced using the same method of crossover events as described above. The number of phasing errors per chromosome was sampled from a Poisson distribution , where L is the chromosome length in Morgans. The locations of phasing errors were sampled from a uniform distribution .
Simulation of breeding value and phenotypes
We simulated breeding values assuming a sparse additive genetic architecture. We created 15 scenarios spanning all combinations of three levels of heritability (0.2, 0.5, 0.8) and five levels of QTL number (4, 16, 64, 256, 1,024), with QTL randomly selected from the SNP markers and effect sizes sampled from a standard normal distribution, $N(0, 1)$. We defined the BV of genotype k as $BV_k = \sum_j x_{kj} a_j$, where $x_{kj}$ was the number of copies of the alternative allele for genotype k at QTL j and $a_j$ was the effect size of the alternative allele at QTL j. To calibrate these architectures, we calculated the number of QTL segregating per pair of haplotypes, which is the average number of heterozygous QTL per individual. For the traits with 4 QTL, on average there were segregating QTL per genotype, while for the traits with 1,024 QTL, there were segregating QTL (Supplementary Fig. 3). Compared to a biparental population, this is equivalent to having 1.5 or 400 segregating loci. Focusing only on large-effect QTL explaining at least 20% of the variance in breeding values, haplotype pairs differed by large-effect QTL when there were 4 QTL in the whole population, when there were 16 QTL, and ≈ 7.58, 5.57, and 0.06 with 64–1,024 QTL (Supplementary Fig. 4).
We also simulated phenotypic values (Y) by adding random noise sampled from a normal distribution, $\varepsilon \sim N(0, \sigma_e^2)$, i.e. $Y = BV + \varepsilon$, where $\sigma_e^2$ was scaled based on the different levels of heritability $h^2$. For each combination of heritability and QTL number, we ran 20 independent simulations, sampling different SNPs as QTL and with different effect sizes and environmental errors in each simulation, and we used 500 out of the 10,000 families for each simulation to ensure independence among the 20 simulations. We calculated breeding values for both the real 1,007 parental genotypes and simulated offspring, but the phenotypes were only simulated for the real 1,007 genotypes.
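The simulation of breeding values and phenotypes can be sketched as below. This is a minimal illustration under assumptions that the text does not spell out: genotypes are coded 0/1/2, and the error variance is set to $\sigma_g^2 (1 - h^2)/h^2$ so that the ratio of genetic to phenotypic variance equals the target heritability. The function name and the toy genotype matrix are hypothetical.

```python
import random
import statistics


def simulate_trait(genotypes, n_qtl, h2, rng):
    """Simulate additive breeding values and phenotypes.

    genotypes: list of per-individual lists of 0/1/2 alternative-allele counts.
    n_qtl: number of markers sampled as QTL.
    h2: target heritability (assumed scaling: var_e = var_g * (1 - h2) / h2).
    """
    n_markers = len(genotypes[0])
    qtl = rng.sample(range(n_markers), n_qtl)          # QTL drawn from the markers
    effects = [rng.gauss(0.0, 1.0) for _ in qtl]       # effects from N(0, 1)
    bv = [sum(g[j] * a for j, a in zip(qtl, effects)) for g in genotypes]
    var_g = statistics.pvariance(bv)
    var_e = var_g * (1.0 - h2) / h2
    pheno = [b + rng.gauss(0.0, var_e ** 0.5) for b in bv]
    return qtl, effects, bv, pheno


rng = random.Random(42)
toy_genotypes = [[rng.choice((0, 1, 2)) for _ in range(50)] for _ in range(500)]
qtl, effects, bv, pheno = simulate_trait(toy_genotypes, n_qtl=16, h2=0.5, rng=rng)
print(statistics.pvariance(bv) / statistics.pvariance(pheno))
```

The printed ratio is the realized heritability in the toy sample, which should fall near the target but will vary with sampling noise.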
Prediction of family means and standard deviation of breeding value, and estimation of prediction accuracy
RR-BLUP and BayesC models were fitted to the simulated phenotypes of the 1,007 parental genotypes and their SNP marker data, excluding the markers that were selected as QTL (Endelman 2011; Pérez and de los Campos 2014; R Core Team 2020), i.e. $Y \approx \hat{f}(X)$, where $Y$ and $X$ were the phenotype vector and marker matrix of the 1,007 genotypes and $\hat{f}$ was the trained model. Predicted breeding values were calculated for each offspring by plugging offspring genotypes into the fitted models, i.e. $\widehat{BV}_{\mathrm{off}} = \hat{f}(X_{\mathrm{off}})$. We calculated the within-family breeding value mean and standard deviation based on both the true and predicted breeding values of offspring. For example, the true and predicted breeding value means ($\mu$ and $\hat{\mu}$) of a family of n offspring were calculated as $\mu = \frac{1}{n}\sum_{k=1}^{n} BV_k$ and $\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} \widehat{BV}_k$, where k represented the kth offspring in the family. We calculated the prediction accuracy r as the Pearson correlation between the predicted values and the true values of each statistic (e.g. $r_{\mu} = \mathrm{cor}(\mu, \hat{\mu})$).
Prediction of family mean and standard deviations of all possible crosses using simulation of gametes
We used the randomly simulated 10,000 families to assess the prediction accuracies for family mean, standard deviation, and usefulness. For mate selection, however, 10,000 crosses were only a small fraction of all 506,521 possible unique crosses that can be made among the 1,007 individuals, and the cross that yielded the highest mean or usefulness was likely not among them. We needed a method to exhaustively predict family means and standard deviations of all possible crosses without the simulation of offspring. For this method, we simulated 200 gametes from each of the 1,007 individuals using Hypred with the same count-location model.
We calculated the predicted breeding values of the gametes of each potential parent as $\widehat{BV}_{g} = \hat{f}(X_{g})$, where $\hat{f}$ was the trained BayesC model and $X_{g}$ the marker alleles of the gametes. We also calculated the predicted mean and variance of gametes from each parent, $\hat{\mu}_{g}$ and $\hat{\sigma}^2_{g}$. Since we only considered additive genetic architectures, we calculated predicted family means and variances of a cross as the sums of the predicted means and variances of gametes from the two individuals. For example, if a cross was made between individual j and individual l, the predicted family mean and variance would be calculated as $\hat{\mu}_{jl} = \hat{\mu}_{g,j} + \hat{\mu}_{g,l}$ and $\hat{\sigma}^2_{jl} = \hat{\sigma}^2_{g,j} + \hat{\sigma}^2_{g,l}$. Predicted family standard deviation was calculated as the square root of predicted family variance. We used this method to calculate the true and predicted family mean and standard deviations for all 506,521 crosses, using the simulated gametes. The estimates of family mean breeding values were highly correlated using both methods (>99%, Supplementary Fig. 7a). The correlations of standard deviations of breeding values within families were also high, but decreased as the number of QTL increased (from to , Supplementary Fig. 7b).
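The gamete-based shortcut can be sketched as follows: summarize each parent once by the mean and variance of its gametes' predicted values, then combine any two parents by addition. The sketch assumes the population variance is used (the choice between sample and population variance is not stated and matters little with 200 gametes). Function names and numbers are illustrative only.

```python
import itertools
import statistics


def gamete_summary(gamete_values):
    """Mean and variance of predicted genetic values across one parent's gametes."""
    return statistics.fmean(gamete_values), statistics.pvariance(gamete_values)


def predict_cross(summary_j, summary_l):
    """Predicted family mean and SD for a cross, under a purely additive model."""
    mean = summary_j[0] + summary_l[0]
    variance = summary_j[1] + summary_l[1]
    return mean, variance ** 0.5


def predict_all_crosses(summaries):
    """summaries: dict parent_id -> (gamete mean, gamete variance)."""
    return {
        (j, l): predict_cross(summaries[j], summaries[l])
        for j, l in itertools.combinations(sorted(summaries), 2)
    }


# Toy gamete values for three hypothetical parents.
summaries = {
    "P1": gamete_summary([1.0, 2.0, 3.0]),
    "P2": gamete_summary([0.0, 0.0, 2.0, 2.0]),
    "P3": gamete_summary([1.5, 1.5, 1.5]),
}
for cross, (mean, sd) in predict_all_crosses(summaries).items():
    print(cross, round(mean, 3), round(sd, 3))
```

Because each parent is summarized only once, the cost grows with the number of parents rather than the number of crosses: n parents yield n(n − 1)/2 unique crosses, which is 506,521 for n = 1,007.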
Calculation of selection intensity
The usefulness criterion is a function of the standardized selection differential, or selection intensity, i as defined in Caballero (2020). To compare true and predicted usefulness values, we selected two values of i representing moderate and strong selection: 1.40 and 2.67, corresponding to selecting the top 20% or 1% of offspring within each family, using the equation:
$$i = \frac{\phi(x_p)}{p}$$
Here, $\phi$ is the probability density function of the phenotypic distribution of the population, which we assume to be normal, $x_p$ is the truncation point above which a proportion p of that distribution lies, and i represents how many population standard deviations the selected group mean is away from the population mean. p is the proportion of genotypes selected.
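The two selection intensities can be checked numerically. The sketch below assumes truncation selection on a standard normal distribution, for which the intensity is the density at the truncation point divided by the selected proportion. It uses only the Python standard library.

```python
from statistics import NormalDist


def selection_intensity(p):
    """Selection intensity for keeping the top proportion p of a normal distribution."""
    std_normal = NormalDist()            # mean 0, SD 1
    x = std_normal.inv_cdf(1.0 - p)      # truncation point
    return std_normal.pdf(x) / p         # mean of the selected tail, in SD units


for p in (0.20, 0.01):
    print(f"p={p}: i={selection_intensity(p):.2f}")
```

Selecting the top 20% gives i ≈ 1.40 and selecting the top 1% gives i ≈ 2.67, matching the two values used in the text.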
Results and discussion
Computational phasing is highly accurate in this strawberry population
We compared genotypes of pairs of markers at different genetic distances between unphased offspring and phased parents. To account for recombination, we binned marker pairs by genetic distances in cM. Rates of inconsistencies increased with recombination distance (Fig. 1), as expected. We then compared the observed rates of genotype inconsistency with expected rates if parents were correctly phased, or if phasing errors were introduced randomly within each parent with different frequencies. For marker pairs >10 cM apart, our observed rate of genotype inconsistencies was close to the expected rate if parental phasing was perfect (Fig. 1, black and red). However, for closer marker pairs, simulated inconsistency rates converged towards zero even when parental phasing was imperfect, but our observed inconsistency rate never dropped below , perhaps due to a low amount of genotyping error or incorrect parent–offspring trios. These results are consistent with the computationally inferred phasing of markers being highly accurate in this population.
Fig. 1.
Rates of inconsistent parent–offspring genotypes at pairs of markers as a function of genetic distance. Pairs of marker genotypes, which both were heterozygous in at least one parent, were compared between parents and offspring across the 593 family trios. The black line shows the fraction of marker pairs in each genetic distance bin that are inconsistent between offspring and parents. Colored lines show simulated inconsistency rates for the same marker pairs after introducing haplotype switches (phasing errors) at the specified rates per cM in each parent.
Prediction accuracy of family breeding value mean and standard deviation varied as heritability and number of QTL changed
Given the highly accurate computational phasing of parental haplotypes in this population (Fig. 1), we used the haplotypes to simulate families of 200 offspring for 10,000 crosses between randomly paired genotypes and projected genotypes of causal alleles onto each simulated offspring. We then trained genomic prediction models (BayesC and RR-BLUP) on simulated parental phenotypes with varying levels of heritability ($h^2$) and used the trained models to predict the breeding values of the offspring (Endelman 2011; Pérez and de los Campos 2014). We calculated the mean and standard deviation of these predicted breeding values per family and measured the accuracies (r) of these predictions using Pearson’s correlation with the true values ($r_{\mu}$ and $r_{\sigma}$).
Using BayesC as a genomic prediction model, the average prediction accuracy of family means ranged between and (Fig. 2a). We achieved moderate accuracy in predicting family standard deviations, averaging between and across scenarios (Fig. 2b, blue “prediction from model”). Results using RR-BLUP as a prediction model were similar (Supplementary Fig. 1). The prediction accuracies of family breeding values mean and standard deviations both increased as the heritability increased, as expected. When the number of QTL increased, the prediction accuracy of the mean remained the same (Fig. 2a), but that of standard deviation decreased (Fig. 2b, blue “prediction from model”). This might be due to less accurate estimates of individual QTL effects when there are more QTL.
Fig. 2.
Prediction accuracies of the means a) and standard deviations b) of breeding values across simulated families as a function of heritability ($h^2$) and the number of causal loci. Panels show the means and standard error bars of Pearson’s correlations (prediction accuracies, y-axis) between the true and predicted family means of breeding values a) or the true and predicted family standard deviations of breeding values (b, blue “prediction from model”), with predictions based on a BayesC genomic prediction model trained on parental phenotypes simulated with different heritabilities ($h^2$) and different numbers of causal loci (x-axis). Twenty simulations were run at each combination of $h^2$ and number of causal loci. Each simulation utilized 500 out of 10,000 simulated families to ensure independence among simulations. As a comparison, we also calculated the correlation between the true standard deviation of breeding values in each family and the average number of heterozygous markers of the two parents of that family calculated across all non-causal SNP markers in each simulation (b, gold “prediction from parental heterozygosity”). Points and bars show the means and standard errors of accuracies across 20 simulations. Curves are created by connecting the points of means and show the trend of prediction accuracies across simulations.
Even with a low heritability and a high number of causal loci, prediction accuracies of within family standard deviations did not decline to zero. This is likely because parents heterozygous for more loci also tend to be heterozygous for more causal loci. Only heterozygous causal loci contribute to the variance of breeding values among offspring in our simulations because the genetic architecture is purely additive. We used the number of heterozygous markers as a measure of an individual’s heterozygosity, and calculated Pearson’s correlation between the average parental heterozygosity and the true family standard deviation. The average parental heterozygosity was more correlated with the standard deviation of offspring breeding values than either the BayesC-based or RR-BLUP-based predicted standard deviations in some of our simulations (Fig. 2b, gold “prediction from parental heterozygosity”; Supplementary Fig. 1b, gold “prediction from parental heterozygosity”), particularly when the number of causal loci was high.
Mean heterozygosity information may have been indirectly captured by the genomic prediction models to accurately predict standard deviation. With a small number of QTL, the predicted family standard deviations from our genomic prediction models were more accurate than predictions of standard deviation from parental heterozygosity. In contrast, with a large number of QTL, the prediction of standard deviation from parental heterozygosity was more accurate. Unfortunately, despite its high correlation with family standard deviations in some instances, it is not straightforward to use parental heterozygosity to directly predict family usefulness, because heterozygosity and family standard deviations have different units and scaling factors. In practice, real families with large enough numbers of genotypes would be needed to estimate the regressions between family standard deviations and parental heterozygosity before this statistic could be used to estimate usefulness.
Family usefulness can be accurately predicted with family mean
With the predicted family mean and standard deviation using the BayesC model, we were able to predict the usefulness of each family for a defined selection intensity i using the equation: $\hat{U} = \hat{\mu} + i\,\hat{\sigma}$. We calculated the accuracy of the estimated usefulness as Pearson’s correlation with the true usefulness ($\mathrm{cor}(U, \hat{U})$, Fig. 3, gold “predicted from family usefulness”) and compared this with the accuracy of predicting the usefulness using only the predicted family mean ($\mathrm{cor}(U, \hat{\mu})$, Fig. 3, blue “predicted from family mean”) of each family. As heritability increased, the accuracy of both predictions increased. At low heritability, the prediction accuracies were relatively insensitive to the number of QTL. However, at high heritability, the prediction accuracies increased as the number of QTL increased. Predicting usefulness with both predicted mean and predicted standard deviation (i.e. $\hat{\mu} + i\,\hat{\sigma}$) only achieved meaningfully higher accuracies than predicting usefulness with the predicted mean itself (i.e. $\hat{\mu}$) when heritability was high and when there were small numbers of QTL (Fig. 3, panels and , at 4 and 16 QTL), the same conditions under which we could most accurately predict the family standard deviation (Fig. 2b, blue “prediction from model”). The results using an RR-BLUP model were similar (Supplementary Fig. 2).
Fig. 3.
Prediction accuracies of the usefulness across the simulated families using a BayesC genomic prediction model. Panels show the means and standard error bars of Pearson’s correlations (prediction accuracies, y-axis) between the true family usefulness and predicted family mean (blue “predicted from family mean”) or predicted family usefulness (gold “predicted from family usefulness”) based on a BayesC genomic prediction model trained on parental phenotypes simulated with different heritabilities ($h^2$) and different numbers of causal loci (x-axis), across different selection intensities (i). Twenty simulations were run at each combination of $h^2$ and number of causal loci. Points and bars show the means and standard errors of accuracies across 20 simulations for each i. Curves are created by connecting the points of means and show the trend of prediction accuracies across simulations.
Given the moderate accuracy at predicting both the standard deviation and mean of each family, we were surprised that direct predictions of usefulness were not much better than predictions based on predicted family means alone. One explanation for this observation is that the true values of family mean and family usefulness are highly correlated (), especially when the number of causal loci is large (Fig. 4). The reason for this high correlation is that the variance among families in the quantity $i\sigma$ (selection intensity multiplied by standard deviation) is very small compared with the variance among families in mean ($\mu$), so the first term in the usefulness dominates. This mirrors observations from Zhong and Jannink (2007) in biparental recombinant inbred line populations that the variance of family means increases at a greater rate than the variance of family standard deviations as the number of QTL increases. They defined the parameter $t = \mathrm{Var}(\sigma)/\mathrm{Var}(\mu)$, the ratio between the variance of true family standard deviations and the variance of means, and showed that in biparental populations t accurately predicts the value of usefulness. We estimated t for each simulation (Supplementary Fig. 5, blue “simulated offspring”) and found that it also decreased dramatically as the number of causal loci increased. However, t was generally larger in our populations than expected based on Equation 8 of Zhong and Jannink (2007): , where L is the effective number of causal loci (Supplementary Fig. 5, gold “the number of causal loci”), which assumes that all loci are independent, have equal allele frequencies (), and equal effect sizes. The increase of t in our simulations is expected because there was LD among loci, unequal effect sizes, and unequal allele frequencies. This reduces the “effective” number of loci, as in Equation 9 of Zhong and Jannink (2007). However, that equation does not account for unequal allele frequencies, which increases t even more, as they predicted.
Nevertheless, t values are still small in our simulations when there were four causal loci, which explains the negligible value of usefulness.
Fig. 4.
Correlations between true family breeding usefulness and mean at two selection intensity levels. Pearson’s correlation (y-axis) between the true family means of breeding values and true family usefulness across different selection intensities (i) and the number of causal loci (x-axis). Twenty simulations were run at each number of causal loci. Points and bars show the means and standard errors of accuracies across 20 simulations at , blue, and at , gold. Curves are created by connecting the points of means and show the trend of correlations across simulations.
Predicting usefulness improves cross-selection accuracy in a two-stage approach, but results in little additional genetic gain
Zhong and Jannink (2007) observed that by restricting possible crosses to only those among lines first selected for breeding value, (i.e. elites), the variance in family means will be reduced, and thus the relative importance of standard deviation for usefulness will increase. We tested whether our prediction accuracy of usefulness in this context could improve crossing decisions. Across the range of genetic architectures, the true correlation between the family mean breeding value and usefulness was lower among crosses from the elite parents than among random crosses across all numbers of causal loci and selection intensities for these new families, reaching values as low as 0 when selection intensities were strong and there were few QTL, but remained moderately high when the number of QTL was >60 (Fig. 5, Supplementary Fig. 6).
Fig. 5.
True family breeding value means and true family usefulness are less correlated in the elite families. Pearson’s correlation (y-axis) between the true family means of breeding values and true family usefulness across different selection intensities (i) and the number of QTL (x-axis), from the families with the highest predicted breeding value individuals as parents (blue “elite parents”) and from the initial 10,000 families (gold “random parents”) at 0.5 heritability. Points and bars show the means and standard errors of accuracies across 20 simulations at $i = 1.40$ and at $i = 2.67$. Curves are created by connecting the points of means and show the trend of correlations across simulations.
Using our previously trained BayesC models, prediction accuracies of usefulness were moderately higher using predicted usefulness for each candidate family than using predicted family means in the high heritability/few QTL/high selection intensity scenarios; that is, the benefit increased when the selection intensity was high and the number of QTL was small (Fig. 6).
Fig. 6.
Prediction accuracies of the usefulness across the elite families using a BayesC genomic prediction model. Panels show means and standard error bars of Pearson’s correlations (prediction accuracy, y-axis) between the true family usefulness and predicted family mean (blue “predicted from family mean”) or predicted family usefulness (gold “predicted from family usefulness”) based on a BayesC genomic prediction model trained on parental phenotypes simulated with different heritabilities ($h^2$) and different numbers of causal loci (x-axis), across different selection intensities (i), from the families with the highest predicted breeding value individuals as parents. Twenty simulations were run at each combination of $h^2$ and number of causal loci. Points and bars show the means and standard errors of accuracies across simulations for each i. Curves are created by connecting the points of means and show the trend of prediction accuracies across simulations.
These results suggest that a two-stage crossing design strategy, in which parents are first selected by predicted breeding value and then crosses among these selected parents are designed in a second stage, might be an effective method for leveraging usefulness to increase genetic gain, at least when heritability and/or selection intensity is high and the genetic architecture has a small number of QTL. However, the prediction accuracies and selection intensities shown in Fig. 6 are not direct parameters of the traditional breeder’s equation, usually written as $\Delta G = i r \sigma_A$. The parameter i from the breeder’s equation represents the selection intensity relative to all possible offspring, while the i in the usefulness equation is the selection intensity within each family. The parameter r from the breeder’s equation represents the accuracy of predicted genetic values of each offspring, while the r from Fig. 6 represents the accuracy of predicted average genetic values of the selected offspring in each cross. Additionally, Fig. 6 does not show the population additive genetic variance $\sigma_A^2$, and the second stage of the two-stage crossing design will have less genetic variance because only crosses among the previously selected parents are considered. Therefore, Fig. 6 provides an incomplete picture of the potential benefit of predicting usefulness for increasing genetic gain.
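For reference, the two expressions contrasted in this paragraph can be written side by side. This is a sketch in standard notation; the subscripts are labels introduced here for illustration and may not match the notation used elsewhere in the article. The usefulness is written without a heritability term, which assumes that selection within a family acts on true breeding values.

```latex
\begin{align}
\Delta G &= i_{\mathrm{pop}} \, r_{\mathrm{ind}} \, \sigma_A
  && \text{(breeder's equation: selection among all offspring)} \\
U_c &= \mu_c + i_{\mathrm{fam}} \, \sigma_c
  && \text{(usefulness of cross } c \text{: selection within the family)}
\end{align}
```

Here $i_{\mathrm{pop}}$ and $r_{\mathrm{ind}}$ refer to intensity and accuracy at the level of individual offspring across the whole population, whereas $\mu_c$ and $\sigma_c$ are the mean and standard deviation of breeding values within cross $c$ and $i_{\mathrm{fam}}$ is the within-family selection intensity. The accuracy plotted in Fig. 6 is the correlation between predicted and true $U_c$ across crosses, which is a different quantity from $r_{\mathrm{ind}}$.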
To assess whether the increases in accuracy from predicted usefulness shown in Fig. 6 actually result in meaningful increases in genetic gain, we estimated the genetic gain expected for this two-stage approach and compared it with the expected genetic gain from the one-stage approach that selected crosses from all possible crosses using either predicted usefulness or predicted mean breeding values. The expected genetic gain is the mean genetic value of the top offspring from each of the selected crosses, using a selection intensity of i within each cross. For each approach, we created 10 crosses using 20 unique parents. For the two-stage approach, we selected the top 40 parents by predicted breeding values, predicted usefulness values for all 780 candidate crosses among them, and then chose 10 crosses using a greedy search algorithm: first choosing the cross with the highest predicted usefulness, then adding the next-best cross that did not use any previously selected parents, and so on until 10 crosses were selected. For the one-stage approach, we predicted family means and family usefulness for all 506,521 possible crosses, and then applied the same algorithm to select 10 crosses using 20 unique parents. Note that a two-stage approach using predicted family means as the cross-selection criterion in the second stage would choose the same crosses as the one-stage approach using predicted family means. Since simulating offspring for all 506,521 crosses would be prohibitively computationally expensive, we instead simulated populations of gametes from each parental genotype, and predicted family means as the sum of the mean predicted gamete genetic values of each parent, and predicted family variances as the sum of the variances of predicted gamete genetic values.
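The gamete-based prediction and the greedy search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the gamete values are random numbers rather than values predicted from phased genotypes, usefulness is assumed to take the form mean + i × standard deviation, and all function names and parameter values are hypothetical.

```python
from itertools import combinations

import numpy as np


def gamete_summaries(gamete_values):
    """Per-parent mean and variance of predicted gamete genetic values.

    gamete_values has shape (n_parents, n_gametes).
    """
    return gamete_values.mean(axis=1), gamete_values.var(axis=1, ddof=1)


def predict_crosses(gamete_mean, gamete_var, i):
    """Return (parent1, parent2, family_mean, usefulness) for every pair."""
    crosses = []
    for p1, p2 in combinations(range(len(gamete_mean)), 2):
        family_mean = gamete_mean[p1] + gamete_mean[p2]       # sum of means
        family_sd = np.sqrt(gamete_var[p1] + gamete_var[p2])  # sum of variances
        crosses.append((p1, p2, family_mean, family_mean + i * family_sd))
    return crosses


def greedy_select(crosses, score_index, n_crosses):
    """Take the best-scoring cross, then the next best with unused parents."""
    chosen, used = [], set()
    for cross in sorted(crosses, key=lambda c: c[score_index], reverse=True):
        p1, p2 = cross[0], cross[1]
        if p1 in used or p2 in used:
            continue  # each parent may appear in only one selected cross
        chosen.append(cross)
        used.update((p1, p2))
        if len(chosen) == n_crosses:
            break
    return chosen


# Hypothetical inputs: 40 parents, 200 simulated gametes per parent.
rng = np.random.default_rng(1)
n_parents, n_gametes = 40, 200
gametes = rng.normal(
    loc=rng.normal(size=(n_parents, 1)),
    scale=rng.uniform(0.2, 1.0, size=(n_parents, 1)),
    size=(n_parents, n_gametes),
)

gamete_mean, gamete_var = gamete_summaries(gametes)
crosses = predict_crosses(gamete_mean, gamete_var, i=2.06)  # i is illustrative
selected = greedy_select(crosses, score_index=3, n_crosses=10)

print(len(crosses), "candidate crosses")
print([(p1, p2) for p1, p2, _, _ in selected])
```

Passing `score_index=2` instead ranks crosses by predicted family mean with the same algorithm. Because the search is greedy, the selected set is not guaranteed to maximize the total score over all sets of 10 disjoint crosses.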
Figure 7 shows the standardized expected genetic gains for each selection strategy. Consistent with the results of Fig. 3, predicting usefulness can increase genetic gain, and never results in lower rates of gain on average. However, except in cases of extreme within-family selection intensity, high heritability, and few QTL, the gain is rarely large. In contrast to the impression given by Fig. 6 and to the hypothesis by Zhong and Jannink (2007), there was limited benefit of selecting crosses by predicted usefulness after selecting parents, even when the accuracy of predicted usefulness was high. The two-stage approach never had higher expected genetic gains than the one-stage approach using predicted usefulness.
Fig. 7.
Selection on predicted family usefulness from all possible crosses gave crosses with the highest breeding value usefulness. Panels show means and standard error bars of standardized family breeding value usefulness (y-axis) against the number of QTL (x-axis) from crosses based on four selection methods across different heritabilities ($h^2$) and selection intensities (i). The selection methods included selecting the top 10 crosses among all possible crosses with the highest predicted family usefulness (gold “selected based on usefulness”, circle) without replacement, selecting the top 10 crosses among all possible crosses with the highest predicted family mean (blue “selected based on mean”, circle) without replacement, selecting top 40 ranked parents based on BayesC model predicted breeding values and then selecting the top 10 crosses among them with the highest predicted family usefulness (gold “selected based on usefulness”, triangle) without replacement, and selecting top 40 ranked parents based on BayesC model predicted breeding values and then selecting the top 10 crosses among them with the highest predicted family mean (blue “selected based on mean”, triangle) without replacement.
Mate selection based on family usefulness provides little benefit in most cases and minimal benefit in some cases
Together, our results show that the true family mean is highly correlated with the true family usefulness under most scenarios in an outcrossing species like strawberry. Our results are consistent with the conclusion of Zhong and Jannink (2007) that usefulness offers limited benefit in inbred species. We show that the accuracy of computational phasing is unlikely to be the limiting factor in the utility of usefulness predictions for improving genetic gain. Even when phasing is successful, as it appears to be in our strawberry population, and when predicted usefulness values are relatively accurate, predicting usefulness values for potential crosses has meaningful benefit over predicting only family means only when the genetic architecture is dominated by few QTL, when the heritability is moderate or high, and when family sizes are so large that very high within-family selection intensities can be achieved (Fig. 7, panels with four QTL). One possible explanation for our high phasing accuracy is that our reference strawberry breeding program includes many parent–offspring pairs, which are highly informative for inferring phasing among markers. In other programs without such a strong pedigree structure, lower phasing accuracy may further limit the benefit of usefulness.
One limitation of our study is that we only simulated additive genetic architectures, which is justified if we are only interested in population improvement. Under purely additive architectures, the usefulness of a cross is simply a function of the average usefulness of each parent. In this case, usefulness can be used equally as a line or mate selection metric. However, for variety release in a clonally propagated crop like strawberry, the usefulness criterion can be extended to total genetic values, where dominance and epistasis effects will additionally be important (Wolfe et al. 2021). In this case, if the within-family variance in total genetic values becomes large enough that the variation among crosses in within-family standard deviation is of similar magnitude to the variation among crosses in family means, the importance of usefulness would increase. However, predicting dominance and epistatic effects from genome-wide marker data remains very challenging and generally requires much larger sample sizes than are currently possible in most breeding programs. Wolfe et al. (2021) found little benefit from predicting dominance variance. In future work, we could extend our simulations to include dominance and epistatic variance to confirm these intuitions.
Conclusion
In this study, we evaluated whether selecting crosses based on the predicted usefulness of each cross could improve the rate of gain in breeding programs. Predicting the usefulness requires predicting the standard deviations of breeding values among the offspring produced by each cross. Our results show that predicting usefulness improves outcomes relative to simply predicting the mean breeding value of each family only in very specific scenarios with high heritability, high selection intensity, and a small number of QTL (Figs. 3 and 4a). However, even in these scenarios, the relative gain in accuracy was small. We have also shown that the true usefulness of the top crosses selected with predicted family mean differs meaningfully from that of the top crosses selected with predicted usefulness only in those same scenarios (Fig. 7). Because prediction of usefulness is beneficial only in rare instances, usefulness is rarely useful.
Acknowledgments
We thank Lily Bhattacharjee for the initial data formatting.
Contributor Information
Fangyi Wang, Department of Plant Sciences, University of California Davis, Davis, CA 95616, USA.
Mitchell J Feldmann, Department of Plant Sciences, University of California Davis, Davis, CA 95616, USA.
Daniel E Runcie, Department of Plant Sciences, University of California Davis, Davis, CA 95616, USA.
Data availability
The data used in this study are available at: https://gsajournals.figshare.com/articles/dataset/Supplemental_Material_for_Wang_Feldmann_and_Runcie_2024/26957599. The Supplementary figures are provided. The code is available at https://github.com/Faye-Wong-stat/family_variance.git.
Supplemental material available at G3 online.
Funding
This study was supported by the United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA) Agriculture Genomes 2 Phenomes Initiative (AG2PI), grant number 026418H, and by the USDA NIFA AI Institute for Next Generation Food Systems (AIFS), award number 2020-67021-32855, in addition to funding from the University of California, Davis.
Literature cited
- Bernardo R. 2014. Essentials of Plant Breeding. Stemma Press.
- Bernardo R. 2022. Outliers and their distribution in breeding populations. Crop Sci. 62(3):1107–1114. doi:10.1002/csc2.v62.3
- Browning SR, Browning BL. 2007. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet. 81(5):1084–1097. doi:10.1086/521987
- Browning BL, Zhou Y, Browning SR. 2018. A one-penny imputed genome from next-generation reference panels. Am J Hum Genet. 103(3):338–348. doi:10.1016/j.ajhg.2018.07.015
- Caballero A. 2020. Quantitative Genetics. Cambridge University Press.
- Endelman JB. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome. 4(3):250–255. doi:10.3835/plantgenome2011.08.0024
- Feldmann MJ, Pincot DDA, Cole GS, Knapp SJ. 2024a. Genetic gains underpinning a little-known strawberry green revolution. Nat Commun. 15(1):2468. doi:10.1038/s41467-024-46421-6
- Feldmann MJ, Pincot DDA, Vachev MV, Famula RA, Cole GS, Knapp SJ. 2024b. Accelerating genetic gains for quantitative resistance to verticillium wilt through predictive breeding in strawberry. Plant Genome. 17(1):e20405. doi:10.1002/tpg2.20405
- Forster B, Till B, Ghanim A, Huynh H, Burstmayr H, Caligari PDS. 2015. Accelerated plant breeding. CAB Rev. 9(043):1–16. doi:10.1079/PAVSNNR20149043
- Hardigan MA, Feldmann MJ, Carling J, Zhu A, Kilian A, Famula RA, Cole GS, Knapp SJ. 2023. A medium-density genotyping platform for cultivated strawberry using DArTag technology. Plant Genome. 16(4):e20399. doi:10.1002/tpg2.20399
- Heffner EL, Sorrells ME, Jannink J. 2009. Genomic selection for crop improvement. Crop Sci. 49(1):1–12. doi:10.2135/cropsci2008.08.0512
- Jannink J-L, Lorenz AJ, Iwata H. 2010. Genomic selection in plant breeding: from theory to practice. Brief Funct Genomics. 9(2):166–177. doi:10.1093/bfgp/elq001
- Jiménez NP, Feldmann MJ, Famula RA, Pincot DDA, Bjornson M, Cole GS, Knapp SJ. 2023. Harnessing underutilized gene bank diversity and genomic prediction of cross usefulness to enhance resistance to Phytophthora cactorum in strawberry. Plant Genome. 16(1):e20275. doi:10.1002/tpg2.20275
- Knapp SJ, Cole GS, Pincot DDA, Dilla-Ermita CJ, Bjornson M, Famula RA, Gordon TR, Harshman JM, Henry PM, Feldmann MJ. 2024. Transgressive segregation, hopeful monsters, and phenotypic selection drove rapid genetic gains and breakthroughs in predictive breeding for quantitative resistance to Macrophomina in strawberry. Hortic Res. 11(2):uhad289. doi:10.1093/hr/uhad289
- Knapp SJ, Cole GS, Pincot DD, Lòpez CM, Gonzalez-Benitez OA, Famula RA. 2023. ‘UC Eclipse’, a summer plant-adapted photoperiod-insensitive strawberry cultivar. HortScience. 58(12):1568–1572. doi:10.21273/HORTSCI17363-23
- Lehermeier C, de Los Campos G, Wimmer V, Schön C-C. 2017. Genomic variance estimates: with or without disequilibrium covariances? J Anim Breed Genet. 134(3):232–241. doi:10.1111/jbg.2017.134.issue-3
- Lynch M, Walsh B. 1998. Genetics and Analysis of Quantitative Traits. Oxford, New York: Oxford University Press.
- Meuwissen TH, Hayes BJ, Goddard ME. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 157(4):1819–1829. doi:10.1093/genetics/157.4.1819
- Osthushenrich T, Frisch M, Herzog E. 2017. Genomic selection of crossing partners on basis of the expected mean and variance of their derived lines. PLoS One. 12(12):e0188839. doi:10.1371/journal.pone.0188839
- Pérez P, de los Campos G. 2014. Genome-wide regression and prediction with the BGLR statistical package. Genetics. 198:483–495. doi:10.1534/genetics.114.164442
- Pincot DDA, Feldmann MJ, Hardigan MA, Vachev MV, Henry PM, Gordon TR, Bjornson M, Rodriguez A, Cobo N, Famula RA, et al. 2022. Novel Fusarium wilt resistance genes uncovered in natural and cultivated strawberry populations are found on three non-homoeologous chromosomes. Theor Appl Genet. 135(6):2121–2145. doi:10.1007/s00122-022-04102-2
- Pincot DDA, Ledda M, Feldmann MJ, Hardigan MA, Poorten TJ, Runcie DE, Heffelfinger C, Dellaporta SL, Cole GS, Knapp SJ. 2021. Social network analysis of the genealogy of strawberry: retracing the wild roots of heirloom and modern cultivars. G3 (Bethesda). 11:jkab015. doi:10.1093/g3journal/jkab015
- Poland JA, Rife TW. 2012. Genotyping-by-sequencing for plant breeding and genetics. Plant Genome. 5(3):92–102. doi:10.3835/plantgenome2012.05.0005
- Schnell FW, Utz HF. 1975. F1-Leistung und Elternwahl in der Züchtung von Selbstbefruchtern. Bericht über die Arbeitstagung der Vereinigung Österreichischer Pflanzenzüchter. 25(27):243–248.
- Scott MF, Ladejobi O, Amer S, Bentley AR, Biernaskie J, Boden SA, Clark M, Dell’Acqua M, Dixon LE, Filippi CV, et al. 2020. Multi-parent populations in crops: a toolbox integrating genomics and genetic mapping with breeding. Heredity (Edinb). 125:396–416. doi:10.1038/s41437-020-0336-6
- Swarts K, Li H, Romero Navarro JA, An D, Romay MC, Hearne S, Acharya C, Glaubitz JC, Mitchell S, Elshire RJ, et al. 2014. Novel methods to optimize genotypic imputation for low-coverage, next-generation sequence data in crop plants. Plant Genome. 7(3):1–12. doi:10.3835/plantgenome2014.05.0023
- R Core Team. 2020. R: A Language and Environment for Statistical Computing. Technical Report. Vienna, Austria: R Foundation for Statistical Computing.
- Technow F. 2011. R package hypred: simulation of genomic data in applied genetics.
- Wolfe MD, Chan AW, Kulakow P, Rabbi I, Jannink J-L. 2021. Genomic mating in outbred species: predicting cross usefulness with additive and total genetic covariance matrices. Genetics. 219(3):iyab122. doi:10.1093/genetics/iyab122
- Zhong S, Jannink J-L. 2007. Using quantitative trait loci results to discriminate among crosses on the basis of their progeny mean and variance. Genetics. 177(1):567–576. doi:10.1534/genetics.107.075358