Abstract
Crop breeding involves two selection steps: choosing progenitors and selecting individuals within progenies. Genomic prediction, based on genome-wide marker estimation of genetic values, could facilitate these steps. However, its potential usefulness in grapevine (Vitis vinifera L.) has only been evaluated in non-breeding contexts mainly through cross-validation within a single population. We tested across-population genomic prediction in a more realistic breeding configuration, from a diversity panel to ten bi-parental crosses connected within a half-diallel mating design. Prediction quality was evaluated over 15 traits of interest (related to yield, berry composition, phenology and vigour), for both the average genetic value of each cross (cross mean) and the genetic values of individuals within each cross (individual values). Genomic prediction in these conditions was found useful: for cross mean, average per-trait predictive ability was 0.6, while per-cross predictive ability was halved on average, but reached a maximum of 0.7. Mean predictive ability for individual values within crosses was 0.26, about half the within-half-diallel value taken as a reference. For some traits and/or crosses, these across-population predictive ability values are promising for implementing genomic selection in grapevine breeding. This study also provided key insights on variables affecting predictive ability. Per-cross predictive ability was well predicted by genetic distance between parents and when this predictive ability was below 0.6, it was improved by training set optimization. For individual values, predictive ability mostly depended on trait-related variables (magnitude of the cross effect and heritability). These results will greatly help designing grapevine breeding programs assisted by genomic prediction.
Introduction
Breeding for perennial species is mostly based on phenotypic selection and is hindered by cumbersome field trials and the long generation time. Genomic prediction (GP), based on genome-wide prediction of genetic values [1], has been widely adopted in modern plant and animal breeding programs, for its superiority in terms of cost and time saved compared to traditional phenotypic selection, but also because it allows handling traits with complex genetic determinism. GP requires a model training step within a reference population, prior to model application to a target population of selection candidates [2]. In perennial crops, a population encompassing most of the species’ genetic diversity could be particularly interesting as a training population to reduce phenotyping effort, since breeding cycle and juvenile phase are long.
Breeding schemes typically involve first the choice of parents (individuals to be crossed) and then the selection of offspring within crosses. GP is adapted both for predicting cross mean and for ranking genotypes within a cross (Mendelian sampling). These steps correspond to the components of the predictive ability (PA) of GP. It is indeed defined as the sum of cross mean and Mendelian sampling terms, as detailed in Werner et al. [3].
Cross mean is the sum of the breeding values of parents if allelic effects are only additive, but in practice some deviation may result from dominance or epistasis [4]. So far, a few studies only have investigated cross mean PA in heterozygous crops [5–8], although none of them clearly investigated its influencing parameters.
In contrast, the prediction of genetic values within a cross (Mendelian sampling), has been widely studied, both with simulated and real data, and various parameters affecting PA have been pointed out, including the statistical method used [9], the composition and size of training and validation populations [10, 11], the trait genetic architecture and heritability [12, 13] and marker density [14]. Genetic relationship between the training and validation sets is known to strongly affect PA [15], with low or even sometimes negative accuracies for across-breed GP in animals [16]. This can be explained by the loss of linkage phase between the marker and QTL or by differences in linkage disequilibrium among populations [17]. Another explanation is the presence of specific allelic effects and allele frequencies, due to the genetic background [18]. All these effects are linked to genetic relationship. Some studies specifically derived deterministic equations to predict PA for across-population GP, based on genetic relationship and heritability (e.g. [19–21]).
In grapevine (Vitis vinifera subsp. vinifera), very few authors have assessed the potential interest of GP. Viana et al. [22] investigated GP within a bi-parental population from a cross between an interspecific hybrid and a seedless table grape. Later, Migicovsky et al. [23] used a panel of 580 V. vinifera accessions to perform both GP and genome-wide association study (GWAS) for 33 phenotypes. More recently, Brault et al. [24] investigated GP within a bi-parental population from a cross between Syrah and Grenache. In a related study, Fodor et al. [25] had simulated a structured and highly diverse grapevine panel and bi-parental populations with parents originating from the panel. They applied GP and found little difference between PA values estimated within the panel or across populations. Finally, Flutre et al. [26] studied 127 traits with GWAS and GP within a diversity panel; they also applied across-population GP, but with 23 test offspring and for one trait only. Before genomic selection can be deployed in grapevine, evaluating PA across populations is thus crucially needed. In particular, PA could be evaluated with a diversity panel and a bi-parental progeny as training and validation sets, respectively. This is a challenging configuration considering the low genetic relatedness but this configuration is much more likely to occur in actual breeding schemes than GP within the same population. As in grape, studies investigating across-population GP are also lacking in most clonally propagated crops.
The aim of this study was to assess across-population genomic PA and to provide a more thorough understanding of parameters affecting PA in a situation close to the one typically encountered in a breeding context, i.e. across populations, for a clonally propagated crop such as grapevine. Our study was based on phenotypic data for 15 traits, related to yield, berry composition, phenology and vigour, measured both in a diversity panel [27], and in a half-diallel with 10 bi-parental crosses. We assessed PA under three scenarios, first for cross mean, and then for Mendelian sampling term; the results provided keys to understand PA determinants in both cases. Finally, we implemented training population optimization to investigate under which conditions PA can be improved.
Results
Extent of genetic diversity within the half-diallel population
We first evaluated the genetic variability of half-diallel crosses with respect to the diversity panel, through their projection on the first plane of a PCA based on genotypic data at 32894 SNPs within the diversity panel. The half-diallel crosses were genetically close to the wine west (WW) subpopulation from the diversity panel (Figure 1a), which was expected, given that all half-diallel parents except Grenache are wine varieties from western Europe (Figure 1a, Figure S1). The half-diallel diversity covered the whole range of WW diversity, and progenies, all located exactly between their respective parents, were well separated from each other along the first two PCA axes (Figure 1a).
We then investigated broad-sense heritability values (H2) for 15 traits related to yield, berry composition, phenology and vigour. Overall H2 values were medium to high, ranging from 0.49 for mcwi (full names of traits are available in Material and Methods section) in the half-diallel to 0.91 for mbw in the panel (Figure 1b; Table S1). Correlation between half-diallel and diversity panel heritability values was 0.25. Per-cross H2 values for each trait varied among half-diallel crosses (Figure S2), which might result from the fairly small number of offspring per cross (from 64 to 70). Nevertheless, we observed a 0.62 correlation between overall and per-cross H2. Mean cluster width displayed extreme variation in H2 per cross (from 0.02 to 0.67). This might be due to the difficulty to phenotype this specific trait because of the presence of lateral wings in some individuals.
Within the half-diallel and for all traits, the cross effect was retained in the mixed model for genetic value estimation, but its magnitude with respect to the total genetic variance varied depending on the trait, ranging from less than 10% to ca. 50% (Figure 1b; Table S1). Depending on the trait or cross, the distribution of genotypic BLUPs varied widely (Figure 1c-e; Figure S3), some traits such as vigour being quite comparable among crosses, while others such as mbw or mcwi varied greatly. We also observed transgressive segregation within the half-diallel progenies (Figure 1c-e; Figure S3) for most traits and subpopulations. The 15 traits studied represented a large phenotypic diversity, structured among crosses (Figure S4).
Prediction of cross mean and Mendelian sampling within- and across-populations
Prediction of cross mean
We first implemented cross mean prediction, as if aiming to select parents for future crosses, selecting the method best adapted to genetic architecture between RR and LASSO (see Material and Methods). Predictive ability (PA) was assessed as Pearson’s correlation between the observed mean genotypic value per half-diallel cross and the one predicted based on parental average genotypes (Table S2). Three scenarios were tested (Material and Methods, Figure 2): allelic effects estimated within the whole half-diallel (scenario 1a), in families with one parent in common (scenario 1b), or within the whole diversity panel (scenario 2).
In scenario 2, per-trait and per-cross predictive ability was lower and more variable than in scenarios 1a and 1b (Figure 3). Average per-cross PA was 0.54, 0.63 and 0.36 in scenarios 1a, 1b and 2, respectively (Figure 3a). Average per-trait PA was close to 1 for most traits in scenarios 1a and 1b (Figure 3b), and still high (around 0.75) in scenario 2, when excluding nbclu and vigour (Table S3). Overall PA (over the 150 cross x trait combinations) was 0.45. There was upward or downward bias for some traits, scenarios or methods, and in scenario 1a, LASSO resulted in larger bias (Figure S5).
Prediction of Mendelian sampling
We then measured PA for individual offspring within each half-diallel cross, thus considering separately the Mendelian sampling component. For each cross and trait, we compared the observed and predicted genotypic values in the three scenarios (Figure 2; Figure S6, Figure S7). In Figure S7, results are displayed for each method.
In scenario 1a (Figure 4a), average PA per trait ranged from 0.18 for mcwi to 0.57 for mbw, with a 0.47 overall average (Figure S8a). The extent of PA variation among crosses depended on the trait and could be very large, as for samplday (from 0.202 to 0.624). Unlike for traits, no cross had constantly high or low PA (Figure S8b). RR method yielded the highest PA for all traits except shik.ripe.
In scenario 1b (Figure 4b), there were two PA values per cross, one for each parental training set (TS). The difference between these two values varied widely, depending on the cross and trait (up to about 0.5 for mcwi in GxS), with an overall average of 0.36. Most often, PA was lower in scenario 1b than in scenario 1a, likely because no full-sibs were included in the training set. However, there were several cases with PA values similar or higher in scenario 1b for one parental TS compared to scenario 1a. RR method produced the best PA in 80% of the 30 combinations (2 parents x 15 traits).
In scenario 2 (Figure 4b), overall average PA (0.23) was nearly halved compared to scenario 1a, with trait dependent differences in PA between both scenarios. Some traits such as vigour, clucomp and maltar.ripe displayed a particularly marked decrease. On the opposite, mcwi and vermatu reached equivalent PA values in both scenarios. RR provided the best PA for 11 traits.
When only WW genotypes from the diversity panel were used as TS, PA was lower by 0.05 compared to scenario 2 (Figure S9).
Exploring factors affecting predictive ability, and training set optimization
We sought those variables affecting the PA values observed above, both for prediction of cross mean and Mendelian sampling. We then implemented training set (TS) optimization in an attempt to increase PA.
Variables affecting the prediction of cross mean
In scenario 2, per-cross PA was highly correlated (0.85) with the cross parents’ additive genomic relationship (Figure 5a, Figure S10a). No such strong correlation was found for per-cross PA in scenarios 1a or 1b (Figure S10a). The proportion of non-segregating markers showed low correlation with per-cross PA in all scenarios (Figure S10a).
Since variation in per-cross PA for scenario 2 was extremely large, from −0.15 for GxCS to 0.73 for SxPN (Figure 3a), we implemented TS optimization for each cross, to try and increase low PA values. Optimization actually improved PA for crosses with PA initially below 0.6, for TS sizes between 50 and 150 (Figure 6). The largest improvement, from 0 to 0.61, was observed for GxPN cross.
The variable that most affected per-trait PA was the ratio (relative variance of cross effect). It was strongly correlated with PA in scenarios 1a and 1b (0.8) and to a lesser extent in scenario 2 (0.53, Figure S10b).
No other explanatory variable displayed any significant impact despite a fairly high correlation with per-trait or per-cross PA, which could be due to low sample sizes (15 and 10 for per-trait and per-cross PA, respectively).
Factors affecting Mendelian sampling prediction
To model Mendelian sampling PA for each scenario and method selected for each trait (RR, LASSO or best), we applied multiple linear regression on six to nine variables depending on the scenario, as detailed in Material and Methods. The highest coefficient of determination (37.8%) was obtained in scenario 1a with the best method (Figure 5b). Coefficients of determination were lower, lower and higher for LASSO compared to RR in scenarios 1a, 1b and 2, respectively. Three variables were found to impact PA in all scenarios: half-diallel overall H2, per-cross H2 and the proportion of non-segregating markers. Surprisingly, half-diallel overall H2 was not selected in scenario 2 with either RR or best method, while it had a strong effect in other modalities.
The selected variables were quite similar between scenarios 1a and 1b, with a high effect of half-diallel overall and per-cross H2, but differed in scenario 2 in which more variables were selected. Overall, most of the relative importance came from variables related to the trait and not to the genetic composition of TS or validation set (VS).
We also calculated individual PA with optimized TSs derived from the diversity panel (Figure S11). However, we did not observe any improvement compared to using the whole diversity panel. This is consistent with the fact that genetic relationship seemed not to impact PA (Figure 5b).
Discussion
Our study allowed us to thoroughly explore GP potential in grapevine breeding, by scanning a large range of potentially useful configurations: (i) with 15 weakly related traits with variable levels of H2 and phenotypic structure (subpopulation or cross effects on phenotypic data) (Figure S4), (ii) in across-population scenarios with TS ranging from half-sibs (scenario 1b) to a diversity panel (scenario 2), (iii) with 10 balanced VS crosses. Moreover, we decomposed PA into cross mean and Mendelian sampling components, each being useful in breeding to select parental genotypes and offspring within crosses, respectively. All these results allowed us to get insight into main factors affecting PA. We will focus our discussion on prediction with the diversity panel as TS, since this is the most sought-after configuration in perennial species breeding.
Range of PA values
For the prediction of cross mean, overall PA was 0.45 in scenario 2, average per-cross PA of 0.36, while the average per-trait PA was twice as high (0.6) (Figure 3). In studies concerning other plant crops, the average per-cross PA was not reported [5–8], probably because, in most cases, there were not enough traits to estimate it. Bernardo et al. [5] and Osthushenrich et al. [6] also reported a high-average per-trait PA, above 0.9, while Yamamoto et al. [8] reported PA values from 0.21 to 0.57 depending on the trait.
For the prediction of Mendelian sampling, overall average PA was lower than overall PA for cross mean in scenario 2 (0.23 and 0.45, respectively). Yet, Mendelian sampling PA was still quite high, considering that TS was essentially unrelated to VS for most genotypes and with no first-degree relationship with predicted progenies. When WW subpopulation was used as TS (Figure S9), PA was lower compared to scenario 2, despite a higher relatedness, probably due to the small TS size (around 92). The same diversity panel was previously used in Flutre et al. [26] for predicting individual genotypic values of 23 additional Syrah x Grenache offspring. The reported PA for mbw was 0.56, whereas in the present study, we obtained 0.35 in the Grenache x Syrah progeny (n = 59). We further investigated such discrepancy, and found it related to a sampling bias due to the small VS size in Flutre et al. [26] (data not shown).
The range of average per-trait Mendelian sampling PA observed in scenario 2 (from 0.11 to 0.39) was consistent with those described on fruit perennial species where individual prediction was performed with a TS not directly related to the VS (neither half-sib nor full-sib). In Coffea, Ferrao et al. [28] reported differences in per-trait PA, from slightly negative values up to ca. 0.60. But, in this study, overall PA was calculated for all crosses of the VS, thus encompassing both cross mean and Mendelian sampling predictions, making comparison with our Mendelian sampling results alone impossible. In contrast, some studies in apple yielded within cross individual PA values. For instance, Muranty et al. [29] reported average per-trait PA ranging from −0.14 to 0.37, and Roth et al. [30] found PA values from −0.29 to 0.72 for fruit texture, highly dependent on the cross for all traits. Conversely, our PA values were mainly stable over crosses and variable over traits, in the three scenarios (Figure S8). This difference might partly be due to the larger trait diversity we explored as compared to Roth et al. [30], as suggested by comparing our Figure S4 with their Figure 1A. A complementary explanation could be that progeny size varied from 15 to 80 in Roth et al. [30], while here progeny sizes were very close and thus less likely prone to sampling variability and to upward or downward bias.
Several factors may influence Mendelian sampling PA in our study compared to others. Among potential inflating factors, we can mention a slight over-representation of phenotyped individuals from the WW panel subpopulation, to which four out of the five parents of the half-diallel belong, leading to a higher genetic relationship between effective TS and VS. Factors potentially decreasing PA could be differences between TS and VS experimental designs since the diversity panel and the half-diallel were not phenotyped on the same years, had different plant management systems (overgrafting or simple grafting, respectively) and were planted a few kilometers apart. Nevertheless, for most studied traits, two years of phenotyping were used to compute genotypic BLUPs, which could at least compensate for differences between years, usually referred to as the millesime effect.
Variables affecting PA in across-population genomic prediction
We focused on PA obtained with the best method between RR and LASSO, to take into account the part of variability among traits associated with genetic architecture. Indeed, LASSO is supposed to be better adapted to traits underlined by few QTLs, while RR would yield better PA for highly polygenic traits [24]. However, we showed that for a given trait x cross combination, i.e. for a given genetic architecture, the best method selected changed depending on the scenario: LASSO was more often selected for scenario 2 than for scenario 1a, both for cross mean and individual prediction. This means that the best method choice also depends on the relationship between TS and VS. This was also suggested in cattle breeding by MacLeod et al. [31], who found that BayesRC method (comparable to LASSO) yielded better results than GBLUP (comparable to RR) for across-population GP.
Studying both per-cross and per-trait PA allowed us to characterize different variables affecting PA. Regarding the other factors affecting PA, for cross mean prediction in scenario 2, we found no other variable significantly affecting per-trait PA (Figure S10b). Conversely, per-cross PA was strongly affected by the genetic distance between parents (Figure 5a, Figure S10a). To our knowledge, such correlation has never been reported before. We could hypothesize that when one parent is farthest from WW -the most represented panel subpopulation in TS- (e.g. Grenache, Figure 1a, Figure S1), associated crosses might have different QTLs or allelic frequencies, compared to WW ones, thereby explaining the decrease in PA for crosses related to Grenache. Such differences underlying marker effects were already described in maize [32]. Simultaneously, some QTLs in these associated crosses might be less genetically linked to causal polymorphisms due to higher genetic distance. However, this cannot be the only explanation for the large correlation of per-cross PA with pairwise parent distance, because the correlation between PA and genetic distance between TS and VS was much lower (Figure S10a).
For the prediction of Mendelian sampling, the variables explaining individual PA in scenario 2 were quite different from those explaining cross mean PA. Trait-related variables had a large impact on individual PA: half-diallel overall and per-cross heritability, but also the relative variance of cross effect (Figure 5b). Surprisingly, genetic relationship between TS and VS had little to no impact on PA, although this factor has often been reported to affect PA [15, 17]. Most studies reported separately the effects of different variables on individual PA. Riedelsheimer et al. [33] also performed multiple linear regression of individual PA on several factors to study their impact. They found that TS composition (number of crosses and their relationship with VS) explained most of the variance (41.7%), followed by trait (27.6%) and VS composition (4.8%). The variance in genetic relationship between TS and VS may be smaller in our study.
Practical consequences on breeding programs
Across-population GP with model training in a diversity panel appeared to be promising in grapevine breeding for some traits and crosses, particularly for parent choice (Figure 3; Figure 4; Table S3; Figure S8).
The usefulness of GP for better selecting parents for future crosses can be at first assessed by the low overall correlation between mean parental genotypic values (BLUPs) and mean offspring BLUPs (0.27; see also Figure S12). This correlation was much lower than overall PA for cross mean in scenario 1b (0.68) and slightly lower than overall PA for cross mean in scenario 2 (0.45). In strawberry, Yamamoto et al. [8] also evidenced the interest of GP for predicting cross mean, with no additional benefit from including dominance effects into GP models, even if cross means were not equal to parental means. Moreover, in some cases, GP could provide other advantages over mean parental genetic values, for instance when parents are not phenotyped for some reasons, because too young or without representative phenotypes (e.g. using microvine [34], in a new environment, etc). This was actually the case, in our half-diallel trial, for the Terret Noir parent, which suffered from mortality probably due to rootstock incompatibility and consequently had no phenotypic record for most studied traits.
Even though PA was quite high for some traits and crosses in scenario 2, on average it remained moderate both for cross mean and individual prediction. Both PAs were much higher in scenario 1a, due to increased relationship between training and validation sets. Nevertheless, such an extreme configuration is rarely used in plant breeding programs, especially in perennial species, because it requires to partly phenotype the cross to be predicted. An intermediate configuration, scenario 1b, could be implemented in breeding programs when PA from scenario 2 is not sufficient and half-sib families are available, because in this scenario, cross mean PA was similar as in scenario 1a and individual PA intermediate between scenarios 1a and 2.
We found TS optimization useful mostly for cross mean prediction for crosses with low PA. The advantage of TS optimization was less clear for individual prediction. This was consistent with the fact that genetic parameters more strongly affected cross mean PA than individual PA. In contrast, Roth et al. [30] observed in apple a systematic increase of individual PA with an optimized TS in the same context (i.e. with a diversity panel as TS and bi-parental families as VS, and common optimization methods). To our knowledge, only a single study tested TS optimization for cross mean prediction, by Heslot and Feoktistov [35], who implemented optimization of parent selection for hybrid crossing in sunflower while selecting individuals to phenotype, but did not calculate cross mean PA.
Since our results show that prediction of cross mean can be quite accurate and useful in scenario 2, we decided to go one step further and implemented cross mean prediction for all 38 781 possible crosses between the 279 genotypes of the diversity panel, based on parental average genotypes (Table S2) and on marker effects estimated with RR in this population. As predicted cross mean were biased for some traits in the ten half-diallel crosses (Figure S5), we estimated the bias for each trait from these data to correct the predicted mean in the possible diversity panel crosses. Figure S13 shows the large potential diversity to be explored through crossing in grape, for all the traits considered in the present study, illustrating the finding of Myles et al. [36] that genetic diversity in grapevine was largely unexploited. Such an example opens many prospects for the use of GP to design future crosses. Indeed, we limited here our prediction to the 279 panel genotypes representing the V. vinifera diversity, but potentially any other (unphenotyped) genotype of interest with dense genotypic data could be used for this purpose as exemplified with the half-diallel, since its five parents were not part of the diversity panel.
Prospects
Based on our results, the following improvements could be tested: i) increase SNP density [25, 37] and include structural variants ii) implement non-additive effects in GP models such as dominance or epistatic effects and iii) add crosses from other panel subpopulations as VSs. Indeed, since all our half-diallel crosses had at least one parent belonging to the WW subpopulation, it would be beneficial to include crosses with parents from the WE and TE subpopulations too. Specific GP models that include genetic structure in marker effect estimation [38, 39] could also be tested.
Predicting cross variance could also prove useful to design the offspring selection step, more specifically for choosing the number of offspring to test or produce for a given cross. Depending on the available funds and breeding program, a breeder may want to select crosses with high genetic variance, in order to maximize the probability to generate top-ranking genotypes. Conversely, choosing a cross with low variance could limit the risk of breeding poor genotypes.
Conclusion
We implemented GP in grapevine in a breeding context, i.e. across populations, on 15 traits, in ten related crosses, and obtained moderate to high PA values for some crosses and traits, thus showing GP usefulness in grapevine. Never before had genomic prediction been implemented for so many traits and crosses simultaneously in this species. We showed that per-cross PA was strongly correlated with the genetic distance between parents, whereas Mendelian sampling PA was largely determined by trait-related variables, such as heritability and the magnitude of the cross effect.
Material and methods
Plant material
The half-diallel consists of 10 pseudo- bi-parental families obtained by crossing five V. vinifera cultivars: Cabernet-Sauvignon (CS), Pinot Noir (PN), Terret Noir (TN), Grenache (G) and Syrah (S) [40]. Each family comprised between 64 and 70 offspring, with a total of 676 individuals including parents.
The diversity panel consists of 279 cultivars selected as maximizing genetic diversity and minimizing kinship among cultivated grapevine. Grapevine genetic diversity is highly heterozygous with a weak to moderate structuration into three subpopulations: WW (Wine West), WE (Wine East) and TE (Table East) [27].
Field experiments
Field design
The half-diallel was created in 1998 at INRAE Montpellier, grafted on Richter 110, and planted in 2005, at the Institut Agro experimental vineyard “Le Chapitre” in Villeneuve-lès-Maguelone (Southern France). The progenies were planted in two randomized complete blocks, with plots of two consecutive plants per offspring per block.
The field design for the diversity panel was previously described in Flutre et al. [26]. Briefly, cultivars were overgrafted on 6-year-old Marselan in 2009, itself originally grafted on Fercal rootstock, a few kilometers away from the diversity panel. They were planted in five randomized complete blocks, with one plant per cultivar per block.
Phenotyping
We studied 15 traits in both trials: berry composition with malic (mal.ripe), tartaric (tar.ripe) and shikimic acid (shik.ripe) concentrations in measured at ripe stage (20° Brix) (according to Rienth et al. [41]), from which two ratios were derived, shikimic / tartaric acid (shiktar.ripe) and malic / tartaric acid (maltar.ripe); morphological traits with mean berry weight (mbw, in g) measured on 100 random berries, mean cluster weight (mcw, in g), mean cluster length (mcl, in cm) and mean cluster width (mcwi, in cm), measured on 3 clusters, number of clusters (nbclu) and cluster compactness (clucomp) measured on the OIV semi-quantitative scale; phenology traits with veraison date (onset of ripening; verday, in days since January 1st), maturity date corresponding to berries reaching 20° Brix (samplday, in days since January 1st) and the interval between veraison and maturity (vermatu, in days); vigour (vigour, in kg), derived as the ratio between pruning weight and the number of canes. Phenotypic data were collected between 2013 and 2017 for the half-diallel and in 2011–2012 for the diversity panel. There was a slight over-representation of phenotypes from the WW subpopulation because of fertility issues in WE and TE subpopulations. Distribution of raw phenotypic data for both populations are shown in Figure S14.
SNP genotyping
For the half-diallel, we used genotyping-by-sequencing (GBS) SNP markers derived by Tello et al. [40], 622 of the 676 individuals being successfully genotyped, as well as the five parents. Raw GBS data were processed separately for each cross, and then markers from all crosses were merged together (390 722 SNPs), thus generating many missing data (85% of missing data per marker on average), since all markers did not segregate in all progenies. Markers with more than 80% of missing data were removed and remaining markers were imputed with FImpute3 [42] (86 017 SNPs). Some parental cultivars were used either as female or male, depending on the cross, a configuration not allowed by FImpute3. We thus declared only a partial pedigree maximizing the number of crosses defined with both parents (Table S4). For the diversity panel, we used the same SNP markers as in Flutre et al. [26], except that we applied a filter on minor allelic frequency (5%) and no filter on linkage disequilibrium, which yielded 83 264 SNPs. In both populations, sequence data were mapped on PN40024 12X.v2 grapevine reference sequence before SNP calling [43].
Finally, we only retained the 32 894 SNPs common to both populations (marker density plotted in Figure S15).
Phenotypic data analyses
Half-diallel
Statistical modeling for estimating genotypic values
For each trait, we applied a log or square-root transformation if raw data distribution was skewed according to a QQ-test. Then, we fitted the following linear mixed full model by Maximum Likelihood:
with the phenotype of genotype i from cross j in block k and year l. Among the fixed terms, was the overall mean, and and the effects of block k and year l. Among the random terms, and were the effects of genotype i nested within cross j, and and the field coordinates. Interactions are indicated with “:”. was the random residual term, assumed to be normally distributed.
Sub-model selection was based on Fisher tests for fixed effects and log-likelihood ratio tests for random effects. It was performed with the step function from lmerTest R-package [44]. Variance components were estimated after re-fitting the selected model by Restricted Maximum Likelihood (Table S5), and diagnostic plots were drawn to visually check the acceptability of model hypotheses such as homoscedasticity or normality and detect outliers. Best Linear Unbiased Predictors (BLUPs) of cross (C) and genotype (G) values were computed. For genomic predictions, we used their sum (C + G) as total genotypic values for both training and validation data. Variance component estimates were used to compute the proportion of genetic variance due to differences between crosses as: .
Heritability estimation
We estimated overall (for the whole half-diallel) broad-sense heritability for genotype-entry means (modified from Schmidt et al. [45]) as:
with genotype (G) and cross (C) variances at the numerator. Random variance components involving year (Y) were divided by the mean number of years (). Other random variance components involving spatial effects or residuals were divided by the mean number of years times the mean number of replicates per year ().
We also estimated broad-sense heritability per cross (thereafter used to name half-diallel full-sib family). For that, we applied the same selected model, but removed all effects involving cross. Then, we estimated variance components within each cross, and heritability with the same formula, after removing variances involving cross.
All information on analyses of phenotypic data and heritability of the half-diallel is detailed in Table S1.
Diversity panel
We used the genotypic values previously estimated in Flutre et al. [26] with a similar statistical procedure to the one described above for the half-diallel. All phenotypic analysis information is provided in Table S3 of Flutre et al. [26].
For each of the two populations, genotypic BLUPs were scaled to a variance of 1, allowing comparison among traits.
Genomic prediction statistical methods
Marker effects were estimated using two methods to take into account varying genetic architecture among the traits studied. Ridge regression (RR) [46], best adapted to many minor QTLs, shrinks marker effects towards 0. Least Absolute Shrinkage and Selection Operator (LASSO) [47], best adapted to a few major QTLs, applies a L1 norm on allelic effects, thus forcing some to be exactly 0. Both methods were implemented with R/glmnet package [48] and the amount of shrinkage, controlled by parameter, was calibrated by five-fold inner cross-validation within each training set, using cv.glmnet function.
Genomic prediction scenarios
We assessed prediction within half-diallel crosses under three different training scenarios (Figure 2; Table S6):
Scenario 1a: whole half-diallel prediction. We applied random outer 10-fold cross-validation over the whole half-diallel population. In each fold, 90% of the phenotyped offspring were used as the training set (TS) and the remaining 10% as the validation set (VS). Cross-validation was replicated ten times.
Scenario 1b: half-sib prediction. For each half-diallel cross used as VS, we trained the model with the three half-sib crosses of each parent in turn, thus predicting each cross twice.
Scenario 2: across-population prediction. We used the whole diversity panel as TS and each half-diallel cross as VS. We also tested a variation of this scenario by using only genotypes from WW subpopulation as TS.
Predictive ability assessment
In order to account for the effect of genetic architecture, we applied both RR and LASSO methods for each trait and cross and kept the best PA, for both cross mean and within cross individual prediction.
Prediction of cross mean
Cross mean PA was assessed as Pearson’s correlation between the average value of observed total genotypic values (sum of genotype and cross BLUPs for each offspring) for each cross, and the mean predicted genotypic value per cross, calculated in two ways, as:
average predicted value over all offspring of the cross. In scenario 1a, each offspring was predicted 10 times, thus we also averaged the predicted value over the 10 replicates.
predicted value for the parental average genotype, defined at each locus and for each cross as the mean allelic dosage according to the expected segregation pattern based on parents’ genotypes (Table S2).
Genotypic values predicted with these two modalities were highly correlated (above 0.98) in the three scenarios and for the two methods (partly shown in Figure S16). Therefore, in subsequent analyses, we used only prediction with parental average genotypes.
Pearson’s correlation between observed and predicted values was calculated on all cross x trait combinations (overall PA), for each trait (per-trait PA) and for each cross (per-cross PA).
Within-cross individual prediction
We measured PA within each cross in each scenario as Pearson’s correlation between observed total genotypic values and predicted genotypic values.
Test of variables affecting predictive ability
We tested the effect of several variables on within-cross individual PA, in each scenario. We built a multiple linear regression model with PA per trait x cross combination as the response variable and as predictors, a set of variables common to all three scenarios plus specific variables for scenarios 1b and 2. Common variables were: the proportion of non-segregating markers in the cross (Table S7), overall and per-cross broad-sense heritability, the distance between the parents of the cross measured either as the additive relationship or as the distance on the first or first two axes of the panel PCA (Figure 1a) and the proportion of genetic variance due to differences between crosses ( ratio). A specific variable for scenarios 1b and 2 was the mean additive relationship between training and validation sets. In scenario 2, it was calculated for each trait only with phenotyped individuals. Specific variables for scenario 2 were: broad-sense heritability in the diversity panel (retrieved from Flutre et al. [26] and Table S1) and the percentage of trait variance explained by the subpopulation factor (see below). After fitting the overall model, we applied a forward-backward stepwise regression, with the AIC criterion to select the best explanatory model. Then, we estimated the relative importance of each variable selected in this model with the pmvd method [49], which allows to decompose the R2 of correlated regressors with the R-package relaimpo [50].
The percentage of trait variance within the diversity panel explained by subpopulation (WW, WE or TE) was evaluated by fitting for each trait the following linear model: , where is the genotypic (BLUP) value within the diversity panel, is a fixed subpopulation effect, and a random residual term. The percentage of variance due to differences between subpopulations was then estimated as the coefficient of determination (R2) of the model.
Training set optimization
We tested three methods for optimizing TS in scenario 2, for both cross mean and within-cross individual prediction. We used the STPGA R-package [51] to implement Prediction Error Variance (PEVmean) and CDmean (based on the coefficient of determination) [10]. Moreover, we computed the mean relationship criterion (MeanRel), as the mean additive relationship between each genotype in TS and all genotypes in VS. Each optimized TS was specific to a cross. The realized additive relationship based on marker data was estimated using the rrBLUP R-package [52] with the A.mat function implementing the formula from VanRaden et al. [53]. For each of these three optimization methods, we tested five TS sizes (50, 100, 150, 200, 250). PA values obtained with each optimized TS were compared with those obtained with a random sample of genotypes of the same size, repeated 10 times.
Acknowledgments
We thank Charles Romieu for his help on the acquisition of phenotypic data as well as Philippe Châtelet and Morgane Roth for their suggestions on the text. This work has been realized with the support of the SouthGreen platform and MESO@LR-Platform at the University of Montpellier. Partial funding of Charlotte Brault’s PhD was provided by ANRT (Association nationale de la recherche et de la technologie, grant number 2018/0577), IFV (Institut Français de la Vigne et du Vin) and Inter-Rhône (interprofessional grouping of the AOC vine and wine sector of the Rhône valley, France).
Author contribution statement
VS, AD, PT, TF, JPP and LLC conceived the idea of the study and contributed to funding acquisition; TP, PF, AD and JPP obtained the phenotypic data used in this work; CB performed and interpreted results; AD conceived the half-diallel population; PT is the PhD supervisor of CB; CB wrote the original draft, which was reviewed and edited by all authors. All authors read and approved the final manuscript.
Data availability
All analyses were conducted using free and open-source software, mostly R. Phenotypic and genotypic data, R scripts and result tables are available at https://doi.org/10.15454/PNQQUQ.
Ethics declarations
The authors declare that they have no conflict of interest. The authors declare that the experiments comply with the current laws of the country in which they were carried out.
Supplementary data
Supplementary data is available at Horticulture Research online.
Supplementary Material
Contributor Information
Charlotte Brault, UMT Geno-Vigne®, IFV-INRAE-Institut Agro, F-34398 Montpellier, France; UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro, F-34398 Montpellier, France; Institut Français de la Vigne et du Vin, F-34398 Montpellier, France.
Vincent Segura, UMT Geno-Vigne®, IFV-INRAE-Institut Agro, F-34398 Montpellier, France; UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro, F-34398 Montpellier, France.
Patrice This, UMT Geno-Vigne®, IFV-INRAE-Institut Agro, F-34398 Montpellier, France; UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro, F-34398 Montpellier, France.
Loïc Le Cunff, UMT Geno-Vigne®, IFV-INRAE-Institut Agro, F-34398 Montpellier, France; UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro, F-34398 Montpellier, France; Institut Français de la Vigne et du Vin, F-34398 Montpellier, France.
Timothée Flutre, Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE – Le Moulon, 91190, Gif-sur-Yvette, France.
Pierre François, UMT Geno-Vigne®, IFV-INRAE-Institut Agro, F-34398 Montpellier, France; UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro, F-34398 Montpellier, France.
Thierry Pons, UMT Geno-Vigne®, IFV-INRAE-Institut Agro, F-34398 Montpellier, France; UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro, F-34398 Montpellier, France.
Jean-Pierre Péros, UMT Geno-Vigne®, IFV-INRAE-Institut Agro, F-34398 Montpellier, France; UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro, F-34398 Montpellier, France.
Agnès Doligez, UMT Geno-Vigne®, IFV-INRAE-Institut Agro, F-34398 Montpellier, France; UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro, F-34398 Montpellier, France.
References
- 1. Meuwissen THE, Hayes BJ, Goddard ME. Prediction of Total genetic value using genome-wide dense marker maps. Genetics. 2001;11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Heffner EL, Sorrells ME, Jannink J-L. Genomic selection for crop improvement. Crop Sci. 2009;157:1819–29. [Google Scholar]
- 3. Werner CR, Gaynor RC, Gorjanc Get al. . How population structure impacts genomic selection accuracy in cross-validation: implications for practical breeding. Front Plant Sci. 2020;11:592977. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Falconer DS, Mackay TFC. Introduction to Quantitative Genetics. Pearson: Prentice Hall; 2009. [Google Scholar]
- 5. Bernardo R. Genomewide selection of parental Inbreds: classes of loci and virtual Biparental populations. Crop Sci. 2014;54:2586–95. [Google Scholar]
- 6. Osthushenrich T, Frisch M, Herzog E. Genomic selection of crossing partners on basis of the expected mean and variance of their derived lines. PLoS One. 2017;12:e0188839. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Imai A, Kuniga T, Yoshioka Tet al. . Predicting segregation of multiple fruit-quality traits by using accumulated phenotypic records in citrus breeding. PLoS One. 2018;13:e0202341. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Yamamoto E, Kataoka S, Shirasawa Ket al. . Genomic selection for F1 hybrid breeding in strawberry (Fragaria × ananassa). Front Plant Sci. 2021;12:308. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Heslot N, Yang H-P, Sorrells MEet al. . Genomic selection in plant breeding: a comparison of models. Crop Sci. 2012;52:146–60. [Google Scholar]
- 10. Rincent R, Laloë D, Nicolas Set al. . Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize Inbreds (Zea mays L.). Genetics. 2012;192:715–28. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Charmet G, Storlie E, Oury FXet al. . Genome-wide prediction of three important traits in bread wheat. Mol Breeding. 2014;34:1843–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Riedelsheimer C, Technow F, Melchinger AE. Comparison of whole-genome prediction models for traits with contrasting genetic architecture in a diversity panel of maize inbred lines. BMC Genomics. 2012;13:452. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Ornella L, Singh RP, Singh PKet al. . Genomic prediction of genetic values for resistance to wheat rusts. The Plant Genome. 2012;5:3. [Google Scholar]
- 14. Heffner EL, Jannink J-L, Sorrells ME. Genomic selection accuracy using multifamily prediction models in a wheat breeding program. The Plant Genome. 2011;4:65–75. [Google Scholar]
- 15. Lund MS, Su G, Janss Let al. . Genomic evaluation of cattle in a multi-breed context. Livest Sci. 2014;166:101–10. [Google Scholar]
- 16. Hayes BJ, Bowman PJ, Chamberlain ACet al. . Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet Sel Evol. 2009;41:51. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Roos APW, Hayes BJ, Goddard ME. Reliability of genomic predictions across multiple populations. Genetics. 2009;183:1545–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Rio S, Mary-Huard T, Moreau Let al. . Disentangling group specific QTL allele effects from genetic background epistasis using admixed individuals in GWAS: an application to maize flowering. PLoS Genet. 2020;16:e1008241. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Wientjes YC, Veerkamp RF, Bijma Pet al. . Empirical and deterministic accuracies of across-population genomic prediction. Genet Sel Evol. 2015;47:5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Schopp P, Müller D, Wientjes YCJet al. . Genomic prediction within and across Biparental families: means and variances of prediction accuracy and usefulness of deterministic equations. G3: Genes, Genomes, Genetics. 2017;7:3571–86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Raymond B, Wientjes YCJ, Bouwman ACet al. . A deterministic equation to predict the accuracy of multi-population genomic prediction with multiple genomic relationship matrices. Genet Sel Evol. 2020;52:21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Viana AP, Resende MDV, Riaz Set al. . Genome selection in fruit breeding: application to table grapes. Sci Agric. 2016;73:142–9. [Google Scholar]
- 23. Migicovsky Z, Sawler J, Gardner KMet al. . Patterns of genomic and phenomic diversity in wine and table grapes. Horticulture Research. 2017;4:17035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Brault C, Doligez A, Cunff Let al. . Harnessing multivariate, penalized regression methods for genomic prediction and QTL detection of drought-related traits in grapevine. G3 Genes|Genomes|Genetics. 2021;11:9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Fodor A, Segura V, Denis Met al. . Genome-wide prediction methods in highly diverse and heterozygous species: proof-of-concept through simulation in grapevine. PLoS One. 2014;9:e110436. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Flutre T, Le Cunff L, Fodor Aet al. . Genome-wide association and prediction studies using a grapevine diversity panel give insights into the genetic architecture of several traits of interest. bioRxiv 2020.09.10.290890. 2020. [Google Scholar]
- 27. Nicolas SD, Péros J-P, Lacombe Tet al. . Genetic diversity, linkage disequilibrium and power of a large grapevine (Vitis vinifera L) diversity panel newly designed for association studies. BMC Plant Biol. 2016;16:74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Ferrão LFV, Ferrão RG, Ferrão MAGet al. . Accurate genomic prediction of Coffea canephora in multiple environments using whole-genome statistical models. Heredity. 2019;122:261–75. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Muranty H, Troggio M, Sadok IBet al. . Accuracy and responses of genomic selection on key traits in apple breeding. Horticulture Research. 2015;2:15060. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Roth M, Muranty H, Di Guardo Met al. . Genomic prediction of fruit texture and training population optimization towards the application of genomic selection in apple. Horticulture Research. 2020;7:148. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. MacLeod IM, Bowman PJ, Vander Jagt CJet al. . Exploiting biological priors and sequence variants enhances QTL discovery and genomic prediction of complex traits. BMC Genomics. 2016;17:144. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Rio S, Mary-Huard T, Moreau Let al. . Genomic selection efficiency and a priori estimation of accuracy in a structured dent maize panel. Theor Appl Genet. 2019;132:81–96. [DOI] [PubMed] [Google Scholar]
- 33. Riedelsheimer C, Endelman JB, Stange Met al. . Genomic predictability of interconnected Biparental maize populations. Genetics. 2013;194:493–503. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Chaïb J, Torregrosa L, Mackenzie Det al. . The grape microvine – a model system for rapid forward and reverse genetics of grapevines. Plant J. 2010;62:1083–92. [DOI] [PubMed] [Google Scholar]
- 35. Heslot N, Feoktistov V. Optimization of selective phenotyping and population Design for Genomic Prediction. JABES. 2020;25:579–600. [Google Scholar]
- 36. Myles S, Boyko AR, Owens CLet al. . Genetic structure and domestication history of the grape. PNAS. 2011;108:3530–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37. Dai Z, Long N, Huang W. Influence of genetic interactions on polygenic prediction. G3 : Genes, Genomes, Genetics. 2020;10:109–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Heslot N, Jannink J-L. An alternative covariance estimator to investigate genetic heterogeneity in populations. Genet Sel Evol. 2015;47:93. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Ramstein GP, Casler MD. Extensions of BLUP models for genomic prediction in heterogeneous populations: application in a diverse switchgrass sample. G3: Genes, Genomes, Genetics. 2019;9:789–805. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Tello J, Roux C, Chouiki Het al. . A novel high-density grapevine (Vitis vinifera L.) integrated linkage map using GBS in a half-diallel population. Theor Appl Genet. 2019;132:2237–52. [DOI] [PubMed] [Google Scholar]
- 41. Rienth M, Torregrosa L, Sarah Get al. . Temperature desynchronizes sugar and organic acid metabolism in ripening grapevine fruits and remodels their transcriptome. BMC Plant Biol. 2016;16:164. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Sargolzaei M, Chesnais JP, Schenkel FS. A new approach for efficient genotype imputation using information from relatives. BMC Genomics. 2014;15:478. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Canaguier A, Grimplet J, Di Gaspero Get al. . A new version of the grapevine reference genome assembly (12X.v2) and of its annotation (VCost.v3). Genomics Data. 2017;14:56–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Bates D, Mächler M, Bolker Bet al. . Fitting linear mixed-effects models using lme4 arXiv:1406.5823 [stat]. 2014.
- 45. Schmidt P, Hartung J, Bennewitz Jet al. . Heritability in plant breeding on a genotype-difference basis. Genetics. 2019;212:991–1008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67. [Google Scholar]
- 47. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B Methodol. 1996;58:267–88. [Google Scholar]
- 48. Hastie T, Qian J. Glmnet Vignette. 2016. https://glmnet.stanford.edu/articles/glmnet.html
- 49. Feldman, B. E. Relative Importance and Value. Rochester, NY: Elsevier; https://papers.ssrn.com/abstract=2255827 (2005). [Google Scholar]
- 50. Groemping U. Relative importance for linear regression in R: the package relaimpo. J Stat Softw. 2006;17:1–27. [Google Scholar]
- 51. Akdemir, D. STPGA. (2018).
- 52. Endelman JB. Ridge regression and other kernels for genomic selection with R package rrBLUP. The Plant Genome. 2011;4:3. [Google Scholar]
- 53. VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91:4414–23. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All analyses were conducted using free and open-source software, mostly R. Phenotypic and genotypic data, R scripts and result tables are available at https://doi.org/10.15454/PNQQUQ.