Abstract
Key message
Approaches for constructing training sets in genomic selection are proposed to efficiently identify top-performing genotypes from a breeding population.
Abstract
Identifying superior genotypes from a candidate population is a key objective in plant breeding programs. This study evaluates various methods of training set optimization in genomic selection, with the goal of enhancing the efficiency of discovering top-performing genotypes from a breeding population. Additionally, two approaches, inspired by classical optimal design criteria, are proposed to expand the search for the best genotypes and are compared with methods focusing on maximizing accuracy in breeding value prediction. Evaluation metrics such as normalized discounted cumulative gain, Spearman’s rank correlation, and Pearson’s correlation are employed to assess performance in both simulation studies and real trait analyses. Overall, for candidate populations lacking a strong subpopulation structure, a ridge regression-based method is recommended. For candidate populations with a strong subpopulation structure, a heuristic-based version of the generalized coefficient of determination (CDmean(v2)) and a D-optimality-like method that maximizes overall genomic variation (GVoverall) are the preferred approaches for the primary objective of plant breeding. For populations with a large number of candidates, a proposed ranking method can first be used to down-scale the candidate population, after which a heuristic-based method is employed to identify the best genotypes. Notably, the proposed CDmean(v2) has been verified to be equivalent to the original version, known as CDmean, but its implementation is much more computationally efficient.
Supplementary Information
The online version contains supplementary material available at 10.1007/s00122-024-04766-y.
Introduction
Genomic selection (GS) is a powerful strategy for selecting quantitative traits in plant breeding; its primary concept is to capture quantitative trait loci using high-density molecular markers over an entire genome (Meuwissen et al. 2001). A GS-based approach predicts genomic estimated breeding values (GEBVs) by building a prediction model using the phenotype and genotype data of a training set. After model training, the GEBV of any individual is readily available by plugging its genotype data into the prediction model. In practice, breeders conduct breeding selection according to the resulting GEBVs. Model training plays a crucial role in GS, and its prediction accuracy is highly dependent upon the data quality of the training set. Genotyping costs have decreased considerably, whereas phenotyping has stayed relatively expensive. Moreover, space and time constraints are usually inevitable in agricultural experiments for collecting phenotype data. Therefore, a cost-effective optimal training set for selective phenotyping is highly desired to increase the probability of success in GS (Wu et al. 2023).
In the context of GS, the expected genetic gain per year for measuring breeding progress can be defined as ΔG = irσ_A/L, where ΔG is the response to selection, i is the intensity of selection, r is the Pearson’s correlation between true breeding values (TBVs) and GEBVs, σ_A is the square root of the additive genetic variance, and L is the breeding cycle time (Heffner et al. 2010). This could partially explain why Pearson’s correlation is usually used as an index to measure prediction accuracy in GS. The training set optimization for maximizing Pearson’s correlation has drawn much attention. Based on linear mixed effects models, Rincent et al. (2012) first promoted the generalized coefficient of determination (CD), initially proposed by Laloë (1993), to determine an optimal training set. Isidro et al. (2015) and Rincent et al. (2017) extended CD-based optimization for highly structured populations. Alternatively, based on ridge regression models, Akdemir et al. (2015) considered variation of prediction errors, and Ou and Liao (2019) proposed a criterion derived directly from Pearson’s correlation to optimize a training set. Akdemir and Isidro-Sánchez (2019) discussed classical optimal design criteria such as A- and D-optimality in training set optimization. Atanda et al. (2021a) proposed an algorithm that maximizes the relationship between training and testing sets. Most recently, Fernández-González et al. (2023) made a comprehensive comparison among these training set optimization methods.
Even though Pearson’s correlation between TBVs and GEBVs is essential in genetic gain, a breeding program often aims to select a few top individuals to serve as commercial varieties or parents for future breeding cycles. For this purpose, it would be sufficient to correctly rank candidates from the most favorable to the least favorable. Blondel et al. (2015) promoted the use of normalized discounted cumulative gain (NDCG) to measure the ability of a selection strategy and evaluated some ranking approaches based on real trait analyses. NDCG can quantify the efficiency of a selection strategy for identifying the top genotypes from a candidate population, and has been commonly used to measure the ability of search engines to retrieve highly relevant documents in the top search results (Järvelin and Kekäläinen 2000).
Tanaka and Iwata (2018) proposed a Bayesian optimization approach for identifying the best genotypes from a candidate population. Tsai et al. (2021) further developed improved versions of their approach. However, Bayesian optimization approaches must sequentially select individuals for phenotyping. In practice, such multi-stage approaches are infeasible in plant breeding, mainly due to the time constraint that breeders can usually collect phenotype data only after a whole growing season. A positive linear relationship exists between TBVs and GEBVs if their Pearson’s correlation coefficient r equals 1. That is, GEBV = a + b·TBV for some b > 0 if r = 1. Therefore, Pearson’s correlation can serve as a ranking measure across all individuals of interest, suggesting that training set optimization methods aimed at enhancing r can be utilized to identify the most promising genotypes within a breeding population. It would be highly beneficial to assess their effectiveness in achieving this practical objective in plant breeding.
To identify the best genotypes, this study assessed existing and newly developed training set optimization methods, together with two approaches inspired by the classical optimal design criteria of A- and D-optimality. We comprehensively compared the methods based on simulation studies using the genotype data of four genome datasets as templates and analyses of phenotype data of real traits in the genome datasets. NDCG, together with Spearman’s rank correlation (SRC) (Spearman 1904) and Pearson’s correlation, were used to measure the ability to discover the best genotypes from a candidate population. The four real genome datasets under study included one rice (Oryza sativa L.), one wheat (Triticum aestivum L.), one sorghum (Sorghum bicolor L.), and one soybean (Glycine max L.).
Materials and methods
Genome datasets
The phenotype and genotype data analyzed in this study are described as follows.
Tropical rice dataset
This dataset, presented by Spindel et al. (2015), contained 73,147 single nucleotide polymorphism (SNP) markers and 363 elite breeding lines belonging to an indica or indica-admixed group. Phenotypic observations were conducted eight times in 2009–2012, once in the dry season and once in the wet season each year, on grain yield (YLD), flowering time (FT), and plant height (PH), although PH data were not available for the wet season of 2009. Phenotypic values for 35 of the 363 individuals were missing. The adjusted least squares mean (ls_means) presented in Spindel et al. (2015) of the 328 individuals was used in this example. One SNP marker was randomly chosen per 0.1-cM interval over each chromosome. This resulted in 10,772 of the 73,147 SNP markers used in this example. The SNP at each locus was coded as − 1, 0, or 1 for the homozygote of the minor allele, the heterozygote, and the homozygote of the major allele, respectively. After SNP coding, any missing locus in an individual was imputed by 1 (the coding for the homozygote of the major allele).
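The SNP coding and imputation just described can be sketched in Python. This is only an illustrative sketch: it assumes the raw genotypes arrive as minor-allele dosages 0/1/2 with NaN for missing calls, which is an assumption about the input format rather than the dataset's actual file layout.

```python
import numpy as np

def code_snps(raw):
    """Code biallelic SNPs as -1/0/1 and impute missing loci.

    Assumes `raw` holds minor-allele dosages (0, 1, or 2) with NaN
    for missing calls. Coding: -1 = minor-allele homozygote,
    0 = heterozygote, 1 = major-allele homozygote; missing loci are
    imputed with 1, the major-allele homozygote, as in the text.
    """
    raw = np.asarray(raw, dtype=float)
    coded = 1.0 - raw               # dosage 2 -> -1, 1 -> 0, 0 -> 1
    coded[np.isnan(raw)] = 1.0      # impute missing with 1
    return coded

demo = code_snps([[0, 1, 2], [np.nan, 2, 0]])
```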
Wheat dataset
This dataset, presented by Kristensen et al. (2019), contained 13,006 SNP markers and 635 F6 winter wheat lines from two breeding cycles. The first breeding cycle had 321 individuals harvested in 2014, while the second had 314 in 2015. The ls_means were obtained on the four alveograph quality traits: flour yield (FYD), dough tenacity (DT), dough extensibility, and dough strength (DS). Only 313 wheat lines from the second breeding cycle, with no missing data on the four quality traits, were used in this study. The SNPs were filtered at a missing rate > 0.1 and a minor allele frequency (MAF) < 0.05, leaving 11,214 retained for further study. SNP coding was performed as described above for the tropical rice dataset.
Sorghum dataset
This dataset was used to compare methods for training set optimization in genomic selection by Fernández-González et al. (2023), originally presented in Fernandes et al. (2018). The filtered dataset contained 56,299 SNP markers and 451 diverse sorghum lines with the best linear unbiased prediction (BLUP) values of PH, moisture content (MC), and biomass yield (BYD). The BLUP values for each individual were calculated to account for variations resulting from spatial and year effects. This dataset featured a strong subpopulation structure, classifying individuals into four clusters.
Soybean dataset
This dataset, presented by Stewart-Brown et al. (2019), contained 2647 SNP markers and 483 recombinant inbred lines with the BLUP values of yield (YLD), protein content (PRC), and oil content (OC). The BLUP values for each genotype were calculated to account for environmental factors and maturity variations. Individuals were classified into four subpopulations and one admixed group, where the admixed group was composed of individuals in sets 9–11 and 12–14 (see Table 1 in Stewart-Brown et al. 2019). SNPs with a missing rate > 0.1 and a MAF < 0.05 were filtered out, leaving 2425 SNPs for 483 individuals retained for further analyses. SNP coding was the same as in the above tropical rice dataset.
Training set construction methods
We first introduce the training set optimization methods, and then propose the two approaches, inspired by the optimal design criteria, for searching for the top-ranked genotypes.
Training set optimization methods
The methods for training set optimization were derived mainly from two kinds of statistical models: whole genome regression (WGR) models and genomic BLUP (GBLUP) models.
Whole genome regression model-based methods
The additive effect WGR model used in this study can be described as follows:
y = μ1_n + Xg + ε  (1)
where y is the vector of phenotypic values of length n, μ is the constant term, 1_n is the vector of length n with all elements equal to 1, X is a marker-associated matrix of order n × p, g is the vector of marker-associated effects of length p, and ε is the vector of random errors. Here, n is the number of individuals, and p is the number of marker-associated components. Furthermore, let X_c and X_t denote the marker-associated matrices for the candidate population and the training set, respectively. X_c and X_t are of order n_c × p and n_t × p, where n_c and n_t are the respective numbers of individuals in the two sets. Two methods for predicting the entire candidate population were evaluated, both of them based on ridge regression.
r-score
From Ou and Liao (2019), the r-score, derived from Pearson’s correlation between phenotypic values and GEBVs of the whole candidate population, is given by:
| 2 |
where tr(·) denotes the trace of a square matrix, J is the square matrix with all elements equal to 1/n_c, and λ is a shrinkage parameter in the ridge regression.
Mean of squared prediction errors
Let y_c be the vector of phenotypic values in the candidate population, which is predicted by ŷ_c from the ridge regression model fitted with y_t, the vector of phenotypic values in the training set. Then, the mean of squared prediction errors (MSPE) between y_c and ŷ_c is given as follows:
MSPE = (1/n_c)(y_c − ŷ_c)′(y_c − ŷ_c)  (3)
An approximation proportional to the expectation of the MSPE, referred to as MSPERidge, is given as:
| 4 |
More details regarding the derivation of MSPERidge are presented in Appendix A.
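The empirical counterpart of Eq. (3) can be computed as below. This sketch uses a standard ridge predictor with the training-set mean standing in for the constant term (a sketch assumption, not necessarily the paper's exact estimator); it illustrates Eq. (3) itself, not the MSPERidge approximation of Eq. (4), whose derivation is in Appendix A.

```python
import numpy as np

def ridge_predict(X_train, y_train, X_cand, lam=1.0):
    """Predict candidate phenotypes with ridge regression (Eq. 1 setup).

    The shrinkage parameter is fixed at 1, as in the paper; centering
    by the training mean stands in for the constant term.
    """
    p = X_train.shape[1]
    mu = y_train.mean()
    beta = np.linalg.solve(X_train.T @ X_train + lam * np.eye(p),
                           X_train.T @ (y_train - mu))
    return mu + X_cand @ beta

def mspe(y_cand, y_hat):
    """Mean of squared prediction errors, Eq. (3)."""
    return float(np.mean((np.asarray(y_cand) - np.asarray(y_hat)) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=40)
train, cand = np.arange(25), np.arange(25, 40)
y_hat = ridge_predict(X[train], y[train], X[cand])
err = mspe(y[cand], y_hat)
```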
Let M_c be the original marker score matrix in the candidate population, and W_c be the standardized marker score matrix. That is, w_ij = (m_ij − m̄_j)/s_j, where w_ij and m_ij are the elements of W_c and M_c, and m̄_j and s_j are the sample mean and sample standard deviation for column j in M_c, for i = 1, …, n_c and j = 1, …, m, where m is the number of markers. Under the assumption that n_c < m, the spectral decomposition was performed on W_cW_c′, producing W_cW_c′ = Σ_i δ_i u_i u_i′, where δ_i is a nonzero eigenvalue in decreasing order (δ_1 ≥ δ_2 ≥ ⋯) and u_i is the corresponding eigenvector of length n_c. The principal component (PC) score matrix was then obtained with columns √δ_i u_i. The first PCs that could explain about 99% of the total variation were used for the marker-associated matrices. The shrinkage parameter λ was fixed at 1 in the two WGR model-based methods.
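The standardization and PC-score construction above can be sketched as follows; the 99% threshold and the eigendecomposition of W_cW_c′ follow the text, while the tolerance used to drop null eigenvalues is an implementation choice of this sketch.

```python
import numpy as np

def pc_scores(M, explained=0.99, tol=1e-10):
    """Standardize a marker score matrix and keep the leading PCs.

    Columns of M are standardized to W; W W' is eigendecomposed and
    the PCs explaining about `explained` of the total variation are
    retained, with columns sqrt(eigenvalue) * eigenvector.
    """
    W = (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)
    G = W @ W.T                        # n x n, cheaper than W'W when p >> n
    vals, vecs = np.linalg.eigh(G)     # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]
    keep = vals > tol                  # drop null eigenvalues
    vals, vecs = vals[keep], vecs[:, keep]
    k = int(np.searchsorted(np.cumsum(vals) / vals.sum(), explained)) + 1
    return vecs[:, :k] * np.sqrt(vals[:k])

rng = np.random.default_rng(1)
M = rng.integers(-1, 2, size=(30, 200)).astype(float)
S = pc_scores(M)
```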
Genomic best linear unbiased prediction model-based methods
An additive effect GBLUP model used in this study is given by:
y = Xβ + Zg + ε  (5)
where X is the design matrix for the fixed effects β and Z is the incidence matrix for the vector of random genotypic values g. It is assumed that g and ε are mutually independent and follow multivariate normal distributions, denoted by g ∼ MVN(0, σ_g²K) and ε ∼ MVN(0, σ_ε²I_n). Here, 0 is a zero vector, σ_g² is the additive genetic variance, K stands for a genomic relationship matrix, and σ_ε² is the random error variance. The genomic relationship matrix was calculated by K = W_cW_c′/m. Recall that W_c is the standardized marker score matrix and m is the number of markers. The variance–covariance matrix for ĝ − g, usually called the prediction error variance (PEV) matrix in the context of GS, is given by:
PEV = Var(ĝ − g) = σ_ε²(Z′MZ + λK⁻¹)⁻¹  (6)
where λ = σ_ε²/σ_g² and M = I_n − X(X′X)⁻X′ is a projection matrix orthogonal to the vector space spanned by the columns of X. Here, (X′X)⁻ is a generalized inverse of X′X. Furthermore, the variance–covariance matrix for ĝ is equal to the covariance matrix between ĝ and g, which is given as:
Var(ĝ) = Cov(ĝ, g) = σ_g²K − σ_ε²(Z′MZ + λK⁻¹)⁻¹  (7)
The CD with a coefficient vector c is then obtained as:
CD(c) = c′(K − λ(Z′MZ + λK⁻¹)⁻¹)c / (c′Kc)  (8)
Clearly, the CD in Eq. (8) is the squared correlation between c′ĝ and c′g.
A ranking approach
Ou and Liao (2019) specified the coefficient vectors in Eq. (8) as c_i = e_i,
where e_i is the vector of length n_c with all elements equal to 0, except the ith element, which is equal to 1, for i = 1, …, n_c. Furthermore, the situation where the model is trained by the entire candidate population is considered, so that n = n_c. In the current study, the vector of fixed effects consists only of the general mean μ, hence X = 1_{n_c}. Genotypes are then sorted according to the individually calculated CD(e_i), and the top-ranked candidates are selected to form the required training set. This ranking approach is called CDranking, which was also described in Atanda et al. (2021b).
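A simplified sketch of the ranking approach is given below. It approximates the individual CD of Eq. (8) by the reliability-type quantity [K(K + λI)⁻¹K]_ii / K_ii, ignoring the projection on the general mean for brevity; this simplification is an assumption of the sketch, not the paper's exact formula.

```python
import numpy as np

def cd_ranking(K, n_train, lam=1.0):
    """Rank candidates by an individual CD-type reliability and
    return the indices of the n_train top-ranked genotypes.

    Trained on the entire candidate population; the fixed-effect
    projection is omitted here (sketch assumption).
    """
    n = K.shape[0]
    Gamma = K @ np.linalg.solve(K + lam * np.eye(n), K)
    cd = np.diag(Gamma) / np.diag(K)
    return np.argsort(cd)[::-1][:n_train]

rng = np.random.default_rng(3)
Z = rng.standard_normal((20, 50))
K = Z @ Z.T / 50
idx = cd_ranking(K, 5)
```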
A heuristic-based approach
Based on the GBLUP model in Eq. (5), with the fixed effects consisting only of the general mean μ, the BLUP for the true genotypic values of a training set, denoted by ĝ_t, can be written as:
ĝ_t = K_t(K_t + λI_{n_t})⁻¹(y_t − μ̂1_{n_t})  (9)
where λ = σ_ε²/σ_g², K_t is the genomic relationship matrix for the training set, and y_t denotes the phenotypic values in the training set. From Henderson (1977), the BLUP for the true genotypic values in the candidate population, denoted by ĝ_c, is obtained as:
ĝ_c = K_{ct}(K_t + λI_{n_t})⁻¹(y_t − μ̂1_{n_t})  (10)
where K_{ct} with order n_c × n_t is the genomic relationship matrix between the candidate population and the training set. It can be verified that the variance–covariance matrix for ĝ_c is the same as the covariance matrix between ĝ_c and g_c, which is given as:
Var(ĝ_c) = Cov(ĝ_c, g_c) = σ_g²K_{ct}(K_t + λI_{n_t})⁻¹K_{ct}′  (11)
Let Λ_ii denote the ith diagonal element of the matrix in Eq. (11), and k_ii denote the corresponding diagonal element in K_c, for i = 1, …, n_c. A heuristic-based version of CD is given by:
CDmean(v2) = (1/n_c) Σ_{i=1}^{n_c} Λ_ii/(σ_g²k_ii)  (12)
Maximizing CDmean(v2) is equivalent to maximizing the sum (or mean) of squared correlations between the true and estimated genotypic values in the candidate population. More details regarding the derivation of CDmean(v2) are presented in Appendix B.
The shrinkage parameter λ was fixed at 1 when searching the optimal training sets for both CDranking and CDmean(v2). All the above optimization methods based on the WGR and GBLUP models are untargeted, as they do not consider the genomic information of a test set.
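Up to the factor σ_g², which cancels in the ratio, the heuristic CD of Eq. (12) can be sketched as below; the sketch assumes the covariance of Eq. (11) takes the form σ_g²·K_ct(K_t + λI)⁻¹K_tc.

```python
import numpy as np

def cdmean_v2(K, train_idx, lam=1.0):
    """Heuristic CD averaged over the candidate population (Eq. 12).

    Assumes the Eq. (11) covariance is
    sigma_g^2 * K_ct (K_t + lam*I)^{-1} K_tc; the sigma_g^2 factor
    cancels when dividing by the diagonal of K.
    """
    t = np.asarray(train_idx)
    Kt = K[np.ix_(t, t)]
    Kct = K[:, t]
    Gamma = Kct @ np.linalg.solve(Kt + lam * np.eye(len(t)), Kct.T)
    return float(np.mean(np.diag(Gamma) / np.diag(K)))

rng = np.random.default_rng(4)
Z = rng.standard_normal((25, 60))
K = Z @ Z.T / 60
v = cdmean_v2(K, np.arange(10))
```

A search algorithm (a GA in the paper, via TSDFGS) would then look for the training set maximizing this score.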
Two approaches inspired by classical optimal design criteria
The probability of correctly identifying the best genotypes from a candidate population can be enhanced if the training set explains the variation of genotypic values as much as possible (Tanaka and Iwata 2018). The genomic relationship matrix K in the GBLUP model, described in Eq. (5), serves as a metric for the genomic variation among genotypes, i.e., Var(g) = σ_g²K. Two approaches inspired by A- and D-optimality were employed on the genomic relationship matrix to construct training sets with maximal genomic variation. Unlike in the field of classical optimal designs, these criteria do not involve an information matrix for estimating parameters. For a fixed training set size n_t, the A- and D-optimality-like criteria were utilized to construct training sets that maximize the trace and the determinant, respectively, over all possible variance–covariance matrices corresponding to subsets with n_t genotypes chosen from the candidate population. Thus, they were used here to measure the average (GVaverage) and the overall (GVoverall) genomic variation of the training sets, respectively. They can be described as:
GVaverage: maximize tr(K_C) over C ∈ S  (13)
and
GVoverall: maximize det(K_C) over C ∈ S  (14)
where S denotes the set composed of all possible subsets C with n_t candidates, K_C is the submatrix of K corresponding to C, tr(K_C) is the trace of K_C, and det(K_C) is the determinant of K_C.
A genetic algorithm (GA) implemented in the R package TSDFGS (Ou 2022) was utilized to optimize training sets by individually maximizing the r-score, CDmean(v2), and GVoverall or minimizing MSPERidge. It is worth noting that both CDranking and GVaverage rank all candidates based on their individually calculated metrics and select the top-ranked ones, eliminating the need for computationally intensive algorithms.
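The two criteria can be sketched as follows. Maximizing the trace reduces to a plain ranking of diagonal elements, while the determinant criterion needs a search; a greedy log-determinant search is used here only as a simple stand-in for the paper's genetic algorithm.

```python
import numpy as np

def gv_average(K, n_train):
    """A-optimality-like criterion (Eq. 13): maximizing tr(K_t)
    amounts to taking the n_train largest diagonal elements of K."""
    return np.argsort(np.diag(K))[::-1][:n_train]

def gv_overall(K, n_train):
    """D-optimality-like criterion (Eq. 14): maximize det(K_t).
    Greedy log-det selection; the paper uses a GA (TSDFGS) instead."""
    chosen = [int(np.argmax(np.diag(K)))]
    while len(chosen) < n_train:
        best, best_val = None, -np.inf
        for j in range(K.shape[0]):
            if j in chosen:
                continue
            idx = chosen + [j]
            sign, logdet = np.linalg.slogdet(K[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best, best_val = j, logdet
        chosen.append(best)
    return np.array(chosen)

rng = np.random.default_rng(5)
Z = rng.standard_normal((30, 80))
K = Z @ Z.T / 80
a = gv_average(K, 5)
d = gv_overall(K, 5)
```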
Evaluation metrics
Three indices were applied to measure the ability of a training set to determine the best k genotypes with the most favorable TBVs from a candidate population.
NDCG
Let y_(1) ≥ y_(2) ≥ ⋯ ≥ y_(n_c) be the TBVs of all candidates arranged in descending order. Moreover, let ŷ_(1), ŷ_(2), …, ŷ_(n_c) be their corresponding GEBVs based on the training set. By reordering these GEBVs, it follows that ŷ_(σ(1)) ≥ ŷ_(σ(2)) ≥ ⋯ ≥ ŷ_(σ(n_c)), where σ(1), σ(2), …, σ(n_c) is a permutation of 1, 2, …, n_c. Note that σ(1), σ(2), …, σ(n_c) here denote the ranking of ŷ_(1), ŷ_(2), …, ŷ_(n_c). From Blondel et al. (2015), the discounted cumulative gain (DCG) score at position k of the predicted ranking using the training set was defined as follows:
DCG@k = Σ_{i=1}^{k} g(y_(σ(i))) d(i)  (15)
the corresponding DCG score at position k of the ideal ranking was also given as:
DCG*@k = Σ_{i=1}^{k} g(y_(i)) d(i)  (16)
where g(·) is a monotonically increasing gain function, and d(·) is a monotonically decreasing discount function. The linear gain function g(y) = y and the discount function d(i) = 1/log₂(1 + i) are used in Eqs. (15) and (16). Finally, the normalized DCG score at position k was defined as follows:
NDCG@k = DCG@k / DCG*@k  (17)
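With the linear gain and logarithmic discount above, NDCG@k can be computed as below; the sketch assumes the TBVs are positive, since a linear gain with negative values can distort the score.

```python
import numpy as np

def ndcg_at_k(tbv, gebv, k):
    """NDCG at position k (Eqs. 15-17) with linear gain g(y) = y
    and discount d(i) = 1 / log2(1 + i)."""
    tbv = np.asarray(tbv, dtype=float)
    order_pred = np.argsort(np.asarray(gebv, dtype=float))[::-1]
    order_ideal = np.argsort(tbv)[::-1]
    disc = 1.0 / np.log2(np.arange(2, k + 2))   # i = 1..k -> log2(1 + i)
    dcg = float(np.sum(tbv[order_pred[:k]] * disc))
    idcg = float(np.sum(tbv[order_ideal[:k]] * disc))
    return dcg / idcg

tbv = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
perfect = ndcg_at_k(tbv, tbv, 3)              # predicted ranking = ideal
reversed_pred = ndcg_at_k(tbv, tbv[::-1], 3)  # worst-case ranking
```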
Spearman’s rank correlation
SRC is a nonparametric rank correlation measure used to assess the monotonic relationship between TBVs and GEBVs among the best k genotypes with the highest TBVs. The SRC can be defined as follows:
SRC = cov(R_T, R_G) / (σ_{R_T} σ_{R_G})  (18)
where cov(R_T, R_G) is the covariance between the rankings R_T of TBVs and R_G of GEBVs among the best k genotypes, and σ²_{R_T} and σ²_{R_G} are the corresponding variances.
Pearson’s correlation
Pearson’s correlation was also used to measure the linear relationship between TBVs and GEBVs among the best k genotypes with the highest TBVs, which can be described as follows:
r = cov(TBV, GEBV) / (σ_TBV σ_GEBV)  (19)
where cov(TBV, GEBV) is the covariance between TBVs and GEBVs among the best k genotypes, and σ²_TBV and σ²_GEBV are the corresponding variances.
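Both top-k correlations can be computed in a few lines; SRC is obtained here as the Pearson correlation of the ranks, which agrees with Eq. (18) when there are no ties (an assumption of this sketch).

```python
import numpy as np

def top_k_correlations(tbv, gebv, k):
    """SRC (Eq. 18) and Pearson's correlation (Eq. 19) restricted to
    the k genotypes with the highest TBVs."""
    tbv = np.asarray(tbv, dtype=float)
    gebv = np.asarray(gebv, dtype=float)
    top = np.argsort(tbv)[::-1][:k]           # best k genotypes by TBV
    t, g = tbv[top], gebv[top]
    def rank(v):
        return np.argsort(np.argsort(v))      # 0-based ranks (no ties)
    src = float(np.corrcoef(rank(t), rank(g))[0, 1])
    r = float(np.corrcoef(t, g)[0, 1])
    return src, r

s1, p1 = top_k_correlations(np.arange(10.0), np.arange(10.0), 5)
s2, p2 = top_k_correlations(np.arange(10.0), -np.arange(10.0), 5)
```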
Simulation studies
The above proposed methods were employed for a given dataset to construct training sets of 50, 75, 100, 150, and 200 individuals. For the two datasets (sorghum and soybean) with a strong subpopulation structure, the number of genotypes selected from each cluster was proportional to the number of genotypes of the cluster in the candidate population.
Based on the GBLUP model in Eq. (5), phenotype data were simulated to evaluate the performance of the aforementioned methods. For a given genome dataset, the genomic relationship matrix K was first calculated from its genotype data. The model parameters were fixed at values giving genomic heritabilities of 0.5 and 0.8. Accordingly, 1000 datasets were generated for each setting of parameters. Simulated TBVs were obtained as the constant term plus the simulated g, and simulated phenotypic values were obtained as the simulated TBVs plus the simulated ε. To mitigate the potential vulnerability to stochastic processes in the heuristic algorithm of TSDFGS during training set construction, we evaluated 10 different optimal training sets using the r-score, MSPERidge, CDmean(v2), and GVoverall methods for each fixed size across all simulation scenarios. It is worth noting that both CDranking and GVaverage generated consistent simulation results, as they do not suffer from vulnerability to stochastic processes. The mean performances (calculated from 10 training sets) for the three metrics of NDCG, SRC, and Pearson's correlation with k = 20 over the 1000 simulated datasets were compared among the training sets generated from each method. Additionally, random sampling was performed as the baseline method. The Bayesian reproducing kernel Hilbert space method in the R package BGLR (Pérez and de los Campos 2014) was used to perform GEBV prediction.
Real trait analyses
The following approach was proposed to compare the methods based on real trait values.
Step 1: For a given dataset, 250 individuals were randomly selected as the candidate population.
Step 2: For the chosen candidate population, the required training sets with sizes equal to 50, 100, 150, and 200 were generated according to each method.
Step 3: For each trait in the dataset, phenotypic values were treated as TBVs, and NDCG, SRC, and Pearson's correlation with k = 20 were then calculated based on each training set generated from Step 2.
Step 4: Steps 1 to 3 were repeated 50 times to obtain the NDCG, SRC, and Pearson's correlation values for all of the training sets, and the averages of the three sets consisting of the 50 resulting values were reported to compare the methods.
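The four steps can be organized as a small resampling skeleton; `build` and `score` are hypothetical callables standing in for any construction method and any of the three metrics, and the candidate-population size of 250 follows Step 1.

```python
import numpy as np

def real_trait_evaluation(n_total, sizes, n_rep, build, score, rng=None):
    """Resampling scheme of Steps 1-4 (skeleton only).

    `build(cand, size)` returns a training set drawn from the
    candidate indices; `score(cand, train)` returns one evaluation
    value. Both are hypothetical placeholders.
    """
    if rng is None:
        rng = np.random.default_rng()
    results = {s: [] for s in sizes}
    for _ in range(n_rep):
        cand = rng.choice(n_total, size=250, replace=False)    # Step 1
        for s in sizes:                                        # Step 2
            train = build(cand, s)
            results[s].append(score(cand, train))              # Step 3
    return {s: float(np.mean(v)) for s, v in results.items()}  # Step 4

# Dummy callables, only to exercise the skeleton
build = lambda cand, s: cand[:s]
score = lambda cand, train: len(train) / len(cand)
res = real_trait_evaluation(400, (50, 100), 3, build, score,
                            np.random.default_rng(0))
```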
Results
Simulation studies
The averages and standard deviations of the mean performances for the three metrics over the 1000 simulation runs for the tropical rice dataset are displayed in Figs. 1, 2, 3. The corresponding results for the remaining three datasets are shown in Figures S1–S3, S4–S6, and S7–S9 of the Supplementary Materials.
Fig. 1.
Averages and standard deviations of NDCG values among the best 20 genotypes over the 1000 simulated datasets using the proposed methods in the tropical rice dataset
Fig. 2.
Averages and standard deviations of SRC values among the best 20 genotypes over the 1000 simulated datasets using the proposed methods in the tropical rice dataset
Fig. 3.
Averages and standard deviations of Pearson’s correlation values among the best 20 genotypes over the 1000 simulated datasets using the proposed methods in the tropical rice dataset
The summarized results are as follows. (i) All of the proposed methods for constructing training sets outperformed the baseline method of random sampling across various datasets and evaluation metrics. However, CDranking exhibited inferior performance, even worse than random sampling, in the two datasets (sorghum and soybean) characterized by a strong subpopulation structure; (ii) For any combination of evaluation metrics and genomic heritability, the performance of any construction method increased as the training set size became larger; (iii) For any combination of evaluation metrics and training set sizes, any construction method improved its performance with increasing genomic heritability; (iv) For a particular dataset, the results for SRC and those for Pearson's correlation appeared to have very similar patterns, e.g., see Figs. 2 and 3 for the tropical rice dataset; (v) For a specific dataset, the discrepancies among the construction methods in NDCG were relatively small compared to those in SRC and Pearson's correlation, e.g., see Figs. 1, 2, 3 for the tropical rice dataset; (vi) From Figs. 1, 2, 3 and S1–S3, the r-score, MSPERidge, and CDmean(v2) methods had slightly better performance in most cases for the two datasets (tropical rice and wheat) lacking a strong subpopulation structure; (vii) For the two datasets with a strong subpopulation structure (Figures S4–S6 for sorghum and S7–S9 for soybean), the CDmean(v2) and GVoverall methods were found to slightly outperform the others in most cases.
Real trait analyses
The averages of NDCG, SRC, and Pearson’s correlation values among the top 20 genotypes over the 50 sampled candidate populations for the real trait analysis are displayed in Fig. 4 for the tropical rice dataset, and Figures S10–S12 in the Supplementary Materials for the remaining three datasets.
Fig. 4.
Averages of NDCG, SRC, and Pearson’s correlation values among the top 20 genotypes over the 50 re-sampled candidate populations using the proposed methods for the real traits in the tropical rice dataset. YLD: grain yield; FT: flowering time; PH: plant height
The results are summarized as follows. (i) Construction methods exhibited varying performances across different trait–dataset combinations; (ii) For most trait–dataset combinations, the performance patterns among the proposed methods in the evaluation indices of SRC and Pearson's correlation were consistent; (iii) No dominant method was found for all the traits in a dataset. Overall, the r-score and showed relatively satisfactory performance in the tropical rice dataset (Fig. 4), while the and can be recommended for the wheat dataset (Figure S10). In the sorghum dataset (Figure S11), the , and had better performance at PH, MC, and BYD, respectively. In the soybean dataset (Figure S12), the r-score or demonstrated relatively superior performance at YLD, while outperformed the others at PRC and OC; (iv) The baseline method of random sampling sometimes outperformed a few construction methods in certain trait–dataset combinations. For instance, the random sampling method showed better performance than , , and at FT in the tropical rice dataset (Fig. 4).
Discussion
The two WGR model-based methods of r-score and MSPERidge perform equally well in most of the simulation scenarios. Theoretically, both of them consider the variance and bias of errors in GEBV prediction, but MSPERidge has a simpler form compared to the r-score. Specifically, the derivation of MSPERidge involves the approximation of a single quadratic form, as described in Appendix A, while that of the r-score requires the approximation of three different quadratic forms (Ou and Liao 2019). Therefore, MSPERidge can serve as an alternative to the r-score.
In regard to the two GBLUP model-based methods of CDranking and CDmean(v2), neither involves any approximation in its derivation, unlike the r-score and MSPERidge. Theoretically, an optimal training set constructed based on CDmean(v2) maximizes the mean of squared correlations between the true and predicted genotypic values in the candidate population. Fernández-González et al. (2023) showed that the originally heuristic-based CD method, known as CDmean, excelled in GEBV prediction, albeit with high computational intensity. A detailed illustration exemplifying CDmean can be found in Alemu et al. (2024). Most interestingly, CDmean, implementing the covariance matrix in Eq. (7), aims to maximize the same mean of squared correlations as the proposed CDmean(v2), implementing the covariance matrix in Eq. (11). A proof of the equivalence between CDmean(v2) and CDmean is presented in Appendix C. According to Eq. (7), CDmean involves two matrix inversions of dimensions n_c × n_c. Therefore, it usually requires extremely intensive computational time due to the cubic time complexity of matrix multiplication and inversion (Alemu et al. 2024). However, according to Eq. (11), CDmean(v2) involves only one matrix inversion of dimensions n_t × n_t. Since n_t (the training set size) is usually much smaller than n_c (the candidate population size), CDmean(v2) can be an alternative to CDmean due to its significant advantage in saving computational cost. Note that CDmean(v2) can also use the same contrast vectors as CDranking; however, this cannot have a large impact on training set optimization (Alemu et al. 2024). Additionally, CDmean(v2) can easily be modified into a targeted method by replacing the candidate population in Eqs. (10) and (11) with a target test set.
The CDranking method employs a ranking approach to select the genotypes with the highest individually calculated CD values, as described in Eq. (8), when trained on the entire candidate population. However, this method tends to favor individuals that can be better predicted rather than necessarily the best ones for the training set. Consequently, individuals with high reliability often share similarities, resulting in a lack of diversity in the training set. In an extreme scenario, if the best genotype is duplicated within the candidate population, CDranking will include both copies in the training set. This highlights a potential drawback, as the ranking approach may select highly redundant and inefficient candidates (Alemu et al. 2024). This observation could partially account for the notably poor performance of the CDranking method in the simulation study involving the sorghum and soybean datasets.
Regarding the A- and D-optimality-like methods of GVaverage and GVoverall, although the former had competitive performance in the tropical rice, wheat, and soybean datasets, it exhibited markedly inferior performance in SRC and Pearson's correlation in the sorghum dataset. From a theoretical standpoint, GVoverall appears to make more sense than GVaverage because it takes into account the genomic covariance among the candidates. GVaverage solely focuses on individuals with the highest diagonal elements in the genomic relationship matrix, thereby risking the selection of individuals with large covariances (strong relationships). Similar to the ranking approach of CDranking, GVaverage fails to maximize training set diversity due to the absence of a heuristic search. However, GVaverage outperformed CDranking in most simulation scenarios. Hence, in practical terms, GVaverage can be recommended over CDranking for identifying superior genotypes within candidate populations. Alternatively, GVoverall, by maximizing the determinant, aims to select individuals with small covariances (weak relationships). Interestingly, GVoverall aligns with the concept of the criterion in Fernández-González et al. (2023) that minimizes the average genomic relationship, which was identified there as the best untargeted method in GEBV prediction. Unlike GVoverall, which maximizes the determinant of the genomic relationship matrix of the training set, that criterion aims to minimize the average of all elements in it, and may thus cost less computational time than GVoverall. A performance comparison between GVoverall and this criterion in identifying the best genotypes would be beneficial.
The relative performance of the construction methods in the simulation results was not consistently observed in the real trait analysis. Several factors could account for this inconsistency. Firstly, the empirical phenotype data may not precisely align with the assumptions of the GBLUP model used to generate the simulated data. Secondly, the models used to derive the construction methods may not fully account for the sources of variability in the phenotype data. Additionally, the number of the re-sampled candidate populations in the real data analysis may be insufficient to corroborate the results demonstrated in the simulation study.
From the simulation results, the performance of any construction method generally improved with an increase in the training set size. This consistent behavior held in most cases during the real trait analysis, but there were still some exceptions, occurring more frequently with the two methods of GVaverage and GVoverall. For instance, GVaverage showed discrepancies between training set sizes at YLD in the tropical rice dataset (Fig. 4), while GVoverall exhibited discrepancies at PH in the sorghum dataset (Figure S11), as well as at YLD in the soybean dataset (Figure S12). The two methods inspired by the optimal design criteria, focusing solely on the genomic variation, may be more susceptible to negative effects from an increase in training set size during real trait analyses.
The sampling rule of selecting a proportion of candidates within every cluster was employed in the two datasets (sorghum and soybean) with a strong subpopulation structure. In a recent study by Fernández-González et al. (2023), it was found that the inclusion of the stratified sampling rule did not lead to the expected improvement in GEBV prediction. We reran the simulation study at a genomic heritability of 0.5 without considering the sampling rule in the two datasets to explore how including subpopulation structure information impacts the ability to identify the best genotypes. The original results, combined with these simulation outcomes without the sampling rule, are displayed in Tables 1 and 2. In both the sorghum and soybean datasets, implementing the sampling rule did not yield a significant difference when employing the methods of r-score, MSPERidge, CDmean(v2), and GVoverall. However, the two ranking approaches, CDranking and GVaverage, showed a noticeable advantage when the sampling rule was applied in the sorghum dataset (Table 1). Conversely, one method exhibited a disadvantage when the sampling rule was applied in all cases, except at a sample size of 50, in the soybean dataset (Table 2). Scatter plots of the genotypes according to the first two PCs for the two datasets are displayed in Fig. 5. There were different patterns between the two datasets. Further study is needed to investigate how subpopulation structures affect the identification of the best genotypes. An R package called TSODBG is available from GitHub (https://github.com/spcspin/TSODBG) for conducting the construction methods discussed in this study.
Table 1.
Averages of 1000 resulting NDCG, SRC, and Pearson’s correlation values using the construction methods with and without employment of the sampling rule in the sorghum dataset in the simulation study at = 0.5
| Evaluation metrics | Sample sizes | r-score | MSPERidge | CDranking | CDmean(v2) | GVaverage | GVoverall |
|---|---|---|---|---|---|---|---|
| NDCG@k = 20 | 50 | 0.558 | 0.559 | 0.514 | 0.561 | 0.529 | 0.495 |
| | | (0.561) | (0.563) | (0.445) | (0.557) | (0.437) | (0.495) |
| | 75 | 0.602 | 0.602 | 0.560 | 0.602 | 0.580 | 0.552 |
| | | (0.604) | (0.605) | (0.526) | (0.599) | (0.534) | (0.549) |
| | 100 | 0.635 | 0.634 | 0.603 | 0.634 | 0.623 | 0.594 |
| | | (0.635) | (0.639) | (0.584) | (0.624) | (0.599) | (0.591) |
| | 150 | 0.678 | 0.680 | 0.656 | 0.677 | 0.676 | 0.649 |
| | | (0.678) | (0.681) | (0.644) | (0.659) | (0.667) | (0.641) |
| | 200 | 0.710 | 0.710 | 0.698 | 0.704 | 0.708 | 0.685 |
| | | (0.709) | (0.711) | (0.683) | (0.688) | (0.700) | (0.679) |
| SRC@k = 20 | 50 | 0.102 | 0.101 | 0.057 | 0.097 | 0.092 | 0.103 |
| | | (0.100) | (0.100) | (0.029) | (0.095) | (0.059) | (0.102) |
| | 75 | 0.124 | 0.124 | 0.072 | 0.112 | 0.106 | 0.129 |
| | | (0.120) | (0.123) | (0.035) | (0.118) | (0.081) | (0.128) |
| | 100 | 0.147 | 0.147 | 0.087 | 0.141 | 0.129 | 0.147 |
| | | (0.138) | (0.144) | (0.038) | (0.134) | (0.109) | (0.146) |
| | 150 | 0.177 | 0.176 | 0.108 | 0.183 | 0.158 | 0.172 |
| | | (0.172) | (0.177) | (0.073) | (0.168) | (0.136) | (0.178) |
| | 200 | 0.205 | 0.205 | 0.150 | 0.211 | 0.189 | 0.205 |
| | | (0.203) | (0.205) | (0.106) | (0.202) | (0.161) | (0.209) |
| r@k = 20 | 50 | 0.110 | 0.109 | 0.055 | 0.107 | 0.103 | 0.123 |
| | | (0.108) | (0.110) | (0.013) | (0.102) | (0.074) | (0.130) |
| | 75 | 0.136 | 0.135 | 0.077 | 0.125 | 0.126 | 0.153 |
| | | (0.133) | (0.137) | (0.025) | (0.128) | (0.095) | (0.161) |
| | 100 | 0.167 | 0.167 | 0.091 | 0.160 | 0.152 | 0.176 |
| | | (0.154) | (0.159) | (0.032) | (0.149) | (0.126) | (0.176) |
| | 150 | 0.202 | 0.202 | 0.116 | 0.210 | 0.180 | 0.205 |
| | | (0.193) | (0.200) | (0.071) | (0.193) | (0.155) | (0.212) |
| | 200 | 0.230 | 0.231 | 0.161 | 0.242 | 0.211 | 0.237 |
| | | (0.227) | (0.231) | (0.107) | (0.229) | (0.179) | (0.242) |
The results for the situation without the sampling rule are displayed in parentheses
Table 2.
Averages of 1000 resulting NDCG, SRC, and Pearson’s correlation values using the construction methods with and without employment of the sampling rule in the soybean dataset in the simulation study at = 0.5
| Evaluation metrics | Sample sizes | r-score | MSPERidge | CDranking | CDmean(v2) | GVaverage | GVoverall |
|---|---|---|---|---|---|---|---|
| NDCG@k = 20 | 50 | 0.594 | 0.594 | 0.560 | 0.597 | 0.568 | 0.574 |
| | | (0.606) | (0.608) | (0.545) | (0.596) | (0.568) | (0.585) |
| | 75 | 0.644 | 0.646 | 0.608 | 0.648 | 0.627 | 0.627 |
| | | (0.655) | (0.656) | (0.620) | (0.644) | (0.626) | (0.642) |
| | 100 | 0.682 | 0.682 | 0.638 | 0.680 | 0.662 | 0.664 |
| | | (0.689) | (0.690) | (0.661) | (0.679) | (0.673) | (0.678) |
| | 150 | 0.731 | 0.731 | 0.693 | 0.727 | 0.716 | 0.718 |
| | | (0.737) | (0.742) | (0.713) | (0.722) | (0.734) | (0.727) |
| | 200 | 0.766 | 0.765 | 0.744 | 0.763 | 0.758 | 0.752 |
| | | (0.769) | (0.774) | (0.760) | (0.753) | (0.773) | (0.760) |
| SRC@k = 20 | 50 | 0.162 | 0.161 | 0.099 | 0.154 | 0.143 | 0.167 |
| | | (0.174) | (0.174) | (0.102) | (0.155) | (0.176) | (0.191) |
| | 75 | 0.186 | 0.188 | 0.118 | 0.186 | 0.178 | 0.199 |
| | | (0.206) | (0.207) | (0.159) | (0.188) | (0.206) | (0.225) |
| | 100 | 0.216 | 0.215 | 0.131 | 0.218 | 0.204 | 0.231 |
| | | (0.236) | (0.237) | (0.183) | (0.220) | (0.241) | (0.248) |
| | 150 | 0.265 | 0.264 | 0.187 | 0.268 | 0.255 | 0.280 |
| | | (0.277) | (0.288) | (0.224) | (0.268) | (0.293) | (0.290) |
| | 200 | 0.304 | 0.303 | 0.257 | 0.310 | 0.299 | 0.311 |
| | | (0.314) | (0.322) | (0.286) | (0.305) | (0.337) | (0.325) |
| r@k = 20 | 50 | 0.173 | 0.173 | 0.087 | 0.164 | 0.155 | 0.190 |
| | | (0.187) | (0.188) | (0.102) | (0.166) | (0.197) | (0.218) |
| | 75 | 0.202 | 0.205 | 0.113 | 0.202 | 0.198 | 0.224 |
| | | (0.226) | (0.229) | (0.171) | (0.206) | (0.233) | (0.253) |
| | 100 | 0.237 | 0.236 | 0.131 | 0.238 | 0.231 | 0.259 |
| | | (0.259) | (0.261) | (0.197) | (0.244) | (0.271) | (0.280) |
| | 150 | 0.294 | 0.294 | 0.191 | 0.296 | 0.283 | 0.313 |
| | | (0.306) | (0.319) | (0.243) | (0.295) | (0.324) | (0.327) |
| | 200 | 0.334 | 0.333 | 0.270 | 0.346 | 0.324 | 0.347 |
| | | (0.347) | (0.357) | (0.311) | (0.335) | (0.366) | (0.363) |
The results for the situation without the sampling rule are displayed in parentheses
Fig. 5.
Scatter plots of the genotypes according to the first two principal components for the two datasets (sorghum and soybean) with a strong subpopulation structure.
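The stratified sampling rule examined above, which selects a fixed proportion of candidates within every cluster, can be sketched as follows. How the cluster labels are produced (e.g., k-means on the leading principal components, as Fig. 5 suggests) is an assumption, not a prescription from the paper.

```python
import numpy as np

def stratified_candidates(cluster_labels, proportion, rng=None):
    """Select (approximately) the same proportion of candidates from
    every cluster, so each subpopulation is represented in the
    training set.  At least one candidate is taken per cluster."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(cluster_labels)
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)            # members of cluster c
        n_c = max(1, round(proportion * idx.size))   # per-cluster quota
        chosen.extend(rng.choice(idx, size=n_c, replace=False))
    return np.sort(np.array(chosen))
```

For three clusters of sizes 50, 30, and 20, a proportion of 0.2 yields 10, 6, and 4 training candidates, respectively.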
Appendix A: The derivation of
According to the identity (Searle 1982, p. 355),

$$\mathrm{E}\!\left[\mathbf{x}^{\top} A\, \mathbf{y}\right] = \mathrm{E}[\mathbf{x}]^{\top} A\, \mathrm{E}[\mathbf{y}] + \operatorname{tr}\!\left(A\, \mathrm{Cov}(\mathbf{y}, \mathbf{x})\right)$$

for two random vectors $\mathbf{x}$ and $\mathbf{y}$ of the same order, where $\mathrm{E}$ denotes the expectation operator and $\mathrm{Cov}(\mathbf{y}, \mathbf{x})$ denotes the covariance matrix between $\mathbf{y}$ and $\mathbf{x}$, the expectation of MSPE in Eq. (3) is given as:
(20)
Rewriting, we obtain:
(21)
The quadratic form simplifies when the off-diagonal elements of the matrix are assumed to be zero; under this assumption, the second term on the right-hand side of Eq. (20) is proportional to the same quantity once the off-diagonal elements are disregarded. Since the remaining term is likewise proportional to that quantity, the derivation of the criterion in Eq. (4) is completed.
Appendix B: The derivation of
Consider the GBLUP model in Eq. (5), in which the fixed-effect term consists solely of an intercept for a training set, described as:
(22)
The projection matrix orthogonal to the vector space spanned by the fixed-effect design is given by:
which satisfies the defining projection properties. Henderson's mixed model equations (Henderson 1975) then lead to the following; hence,
(23)
From Henderson (1977), we have that:
(24)
Then, the identity in Eq. (11) can be obtained as follows:
(25)
using the identities above. In addition,
(26)
Therefore, the expression in Eq. (25) equals that in Eq. (26), and the criterion in Eq. (12) follows directly from the definition.
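For concreteness, if the fixed-effect design in Eq. (22) consists only of an intercept column $\mathbf{1}_n$ (an assumption consistent with this appendix), the projection matrix orthogonal to its span takes the familiar centering form:

```latex
P \;=\; I_n - \mathbf{1}_n\left(\mathbf{1}_n^{\top}\mathbf{1}_n\right)^{-1}\mathbf{1}_n^{\top}
  \;=\; I_n - \tfrac{1}{n}\,\mathbf{1}_n\mathbf{1}_n^{\top},
\qquad
P\mathbf{1}_n = \mathbf{0}, \quad P = P^{\top} = P^{2}.
```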
Appendix C: A proof for the equivalence between CDmean and CDmean(v2)
The CDmean criteria select an optimal training set through the incidence matrix in Eq. (7). Without loss of generality, the first genotypes in the candidate population are assumed to form the training set. According to Alemu et al. (2024) (Supplementary file 1), this framework specifies the incidence matrix and the projection matrix in Eq. (7). Fixing the relevant terms and ignoring constants, Eq. (7) can be rewritten as:
(27)
From the Woodbury matrix identity (Searle 1982, p. 261),

$$\left(A + UCV\right)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1},$$

with the corresponding matrices substituted for $A$, $U$, $C$, and $V$, Eq. (27) can be equivalently written as:
(28)
Partitioning the matrix accordingly, Eq. (28) can be expressed as:
(29)
Based on the same framework, plugging the corresponding matrices into Eq. (11) for CDmean(v2) yields exactly the same expression as Eq. (29) for CDmean. This completes the proof.
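The Woodbury matrix identity invoked in this appendix is easy to check numerically; the dimensions and the diagonal choice of $A$ below are arbitrary illustrations, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 6, 3
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # nonsingular diagonal matrix
U = rng.normal(size=(n, k))
C = np.eye(k)
V = rng.normal(size=(k, n))

# Woodbury: (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + U @ C @ V)
rhs = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv

assert np.allclose(lhs, rhs)   # identity holds to machine precision
```

Trading one large inversion for a smaller one in this way is typically where such identities pay off computationally, which is consistent with the efficiency reported for CDmean(v2).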
Author Contribution statement
SPC contributed to data curation, investigation, preparation of tables and figures, and software. WHS contributed to data curation and investigation. CTL contributed to conceptualization, project administration, supervision, validation, writing the original draft, and review and editing.
Funding
This research was supported by the Ministry of Science and Technology, Taiwan (grant number MOST 112-2118-M-002-003-MY2).
Data availability statement
All phenotype and genotype datasets analyzed in this study are freely accessible and can be downloaded from Figshare: 10.6084/m9.figshare.24425581.
Declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- Akdemir D, Sanchez JI, Jannink JL (2015) Optimization of genomic selection training populations with a genetic algorithm. Genet Sel Evol 47:1–10 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Akdemir D, Isidro-Sánchez J (2019) Design of training population for selective phenotyping in genomic prediction. Sci Rep 9:1–15 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Alemu A, Åstrand J, Montesinos-López OA, Isidro y Sánchez J, Fernández-Gónzalez J, et al (2024) Genomic selection in plant breeding: key factors shaping two decades of progress. Mol Plant 17:552–578 [DOI] [PubMed] [Google Scholar]
- Atanda SA, Olsen M, Burgueno J, Crossa J, Dzidzienyo D et al (2021a) Maximizing efficiency of genomic selection in Cimmyt’s maize breeding program. Theor Appl Genet 134:279–294 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Atanda SA, Olsen M, Burgueno J, Crossa J, Burgueño J et al (2021b) Scalable sparse testing genomic selection strategy for early yield testing stage. Front Plant Sci 12:658978 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Blondel M, Onogi A, Iwata H, Ueda N (2015) A ranking approach to genomic selection. PLoS ONE 10:e0128570 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fernández-González J, Akdemir D, Isidro y Sánchez J, (2023) A comparison of methods for training population optimization in genomic selection. Theor Appl Genet 136:30 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fernandes SB, Dias KO, Ferreira DF, Brown PJ (2018) Efficiency of multi-trait, indirect, and trait-assisted genomic selection for improvement of biomass sorghum. Theor Appl Genet 131:747–755 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Heffner EL, Lorenz AJ, Jannink JL, Sorrells ME (2010) Plant breeding with genomic selection: gain per unit time and cost. Crop Sci 50:1681–1690 [Google Scholar]
- Henderson CR (1975) Best linear unbiased estimation and prediction under a selection model. Biometrics 31:423–447 [PubMed] [Google Scholar]
- Henderson CR (1977) Best linear unbiased prediction of breeding values not in the model for records. J Dairy Sci 60:783–787 [Google Scholar]
- Isidro J, Jannink JL, Akdemir D, Poland J, Heslot N, Sorrells ME (2015) Training set optimization under population structure in genomic selection. Theor Appl Genet 128:145–158 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Järvelin K, Kekäläinen J (2000) IR evaluation methods for retrieving highly relevant documents. In: Proceedings of the international ACM SIGIR conference on research and development in information retrieval, pp 41–48
- Kristensen PS, Jensen J, Andersen JR, Guzmán C, Orabi J, Jahoor A (2019) Genomic prediction and genome-wide association studies of flour yield and alveograph quality traits using advanced winter wheat breeding material. Genes 10:669 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Laloë D (1993) Precision and information in linear models of genetic evaluation. Genet Sel Evol 25:1–20 [Google Scholar]
- Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ou JH (2022) TSDFGS: Training set determination for genomic selection. R package version 2.0. available online at https://cran.r-project.org/package=TSDFGS
- Ou JH, Liao CT (2019) Training set determination for genomic selection. Theor Appl Genet 132:2781–2792 [DOI] [PubMed] [Google Scholar]
- Pérez P, de los Campos G (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics 198:483–495 [DOI] [PMC free article] [PubMed]
- Rincent R, Laloë D, Nicolas S, Altmann T, Brunel D et al (2012) Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192:715–728 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rincent R, Charcosset A, Moreau L (2017) Predicting genomic selection efficiency to optimize calibration set and to assess prediction accuracy in highly structured populations. Theor Appl Genet 130:2231–2247 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Searle SR (1982) Matrix algebra useful for statistics. Wiley, New York [Google Scholar]
- Spindel J, Begum H, Akdemir D, Virk P, Collard B et al (2015) Genomic selection and association mapping in rice (Oryza sativa): effect of trait genetic architecture, training population composition, marker number and statistical model on accuracy of rice genomic selection in elite, tropical rice breeding lines. PLoS Genet 11:e1004982 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15:72–101 [PubMed] [Google Scholar]
- Stewart-Brown BB, Song Q, Vaughn JN, Li Z (2019) Genomic selection for yield and seed composition traits within an applied soybean breeding program. G3: Genes, Genomes, Genetics 9:2253–2265 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tanaka R, Iwata H (2018) Bayesian optimization for genomic selection: a method for discovering the best genotype among a large number of candidates. Theor Appl Genet 131:93–105 [DOI] [PubMed] [Google Scholar]
- Tsai SF, Shen CC, Liao CT (2021) Bayesian approaches for identifying the best genotype from a candidate population. JABES 26:519–537 [Google Scholar]
- Wu PY, Ou JH, Liao CT (2023) Sample size determination for training set optimization in genomic prediction. Theor Appl Genet 136:57 [DOI] [PMC free article] [PubMed] [Google Scholar]