Abstract
Background
Use of genomic information has resulted in an undeniable improvement in prediction accuracies and an increase in genetic gain in animal and plant genetic selection programs in spite of oversimplified assumptions about the true biological processes. Even for complex traits, a large portion of markers do not segregate with or effectively track genomic regions contributing to trait variation; yet it is not clear how genomic prediction accuracies are impacted by such potentially nonrelevant markers. In this study, a simulation was carried out to evaluate genomic predictions in the presence of markers unlinked with trait-relevant QTL. Further, we compared the ability of the population statistic FST and absolute estimated marker effect as preselection statistics to discriminate between linked and unlinked markers and the corresponding impact on accuracy.
Results
We found that the accuracy of genomic predictions decreased as the proportion of unlinked markers used to calculate the genomic relationships increased. Using all, only linked, and only unlinked marker sets yielded prediction accuracies of 0.62, 0.89, and 0.22, respectively. Furthermore, it was found that prediction accuracies are severely impacted by unlinked markers with large spurious associations. FST-preselected marker sets of 10 k and larger yielded accuracies 8.97 to 17.91% higher than those achieved using preselection by absolute estimated marker effects, despite selecting 5.1 to 37.7% more unlinked markers and explaining 2.4 to 5.0% less of the genetic variance. This was attributed to false positives selected by absolute estimated marker effects having a larger spurious association with the trait of interest and more negative impact on predictions. The Pearson correlation between FST scores and absolute estimated marker effects was 0.77 and 0.27 among only linked and only unlinked markers, respectively. The sensitivity of FST scores to detect truly linked markers is comparable to absolute estimated marker effects but the consistency between the two statistics regarding false positives is weak.
Conclusion
Identification and exclusion of markers that have little to no relevance to the trait of interest may significantly increase genomic prediction accuracies. The population statistic FST presents an efficient and effective tool for preselection of trait-relevant markers.
Keywords: FST scores, Marker preselection, Genomic prediction, Accuracy
Background
Whole-genome marker information has been successfully utilized through genomic selection (GS) in many livestock and plant genetic improvement programs for the prediction of genomic merit and has led to a significant increase in the rate of genetic gain in these species [1]. This has been partly a result of increased prediction accuracy for selection candidates, particularly for individuals with no phenotypes or progeny of their own [2]. Such improvement in accuracy is due to a better modeling of the Mendelian sampling (MS) using genomic information compared to using only pedigree information.
Though millions of single nucleotide polymorphisms (SNPs) have been discovered in human [3], livestock [4], and plant [5] genomes, relatively high accuracies have been achieved using marker panels that utilize just a fraction of these markers [6, 7]. The falling costs of full genome sequencing and genotyping combined with more reference genomes and the availability of imputation algorithms have now allowed the regular use of high-density and sequence genotypes in genomic analyses.
It has been suggested that sequence data has the potential to significantly improve the accuracy of genomic predictions by increasing the linkage disequilibrium (LD) between quantitative trait loci (QTL) and SNPs or even making available the genotypes of causal loci [8–10]. Early simulation studies found optimistic potential for the use of sequence data in GS. Meuwissen and Goddard [9] estimated that accuracies could be improved by more than 40% when using sequence data compared to low-density SNP panels, but concluded that this was likely due to the weak relationship structure of the training population and did not expect the same results in real livestock populations due to the long-ranging LD and strong family structures. Druet et al. [10] found that accuracies could be increased by up to 28% using sequence data compared to the equivalent of a bovine 50 k SNP chip when the trait was controlled by rare QTL; however, these gains were largely lost when the sequence genotypes were imputed, likely as a result of lower imputation accuracy of rare markers that would be most effective in tracking causal loci with low minor allele frequencies.
Most results from real data have found little to no improvement in accuracy using high-density and sequence data for genomic prediction [11–14]. This lack of improvement has in some cases been attributed to the fact that low- and moderate-density panels are sufficient to capture realized additive relationships across the whole genome. Furthermore, a marginal decline in accuracy with the increase in SNP density was observed in some cases [12, 14], which results in part from overparameterization of the model [15]. This is not a surprising occurrence, as a disproportional increase in the number of unknown parameters in the association model relative to the number of observations available in the training set will lead to the well-known small n large p problem.
Models that intrinsically perform variable selection (e.g., BayesB, LASSO, and elastic-net) have been proposed as a way to reduce the dimensionality of genomic data and alleviate the issues associated with the small n large p problem. Daetwyler et al. [16] showed using a simulation scheme that BayesB [17] tends to have an advantage compared to GBLUP when the number of causal loci is less than the estimated number of independent chromosome segments.
In comparisons between GBLUP and BayesB using real data, the latter tends to yield superior results when the trait of interest is under the influence of at least one major gene, such as DGAT1 for fat and protein content in dairy cattle [18]. While BayesB tends to yield predictions that are at least as accurate as GBLUP in most practical analyses, it is computationally demanding, particularly as the number of predictors included in the model increases. Principal component analyses can dramatically reduce the dimensionality of the association model without a substantial loss in the portion of explained genetic variance; however, the estimated effects are linear combinations of the original predictors, thus complicating their interpretation. In general, the gains from using variable selection methods have been modest to nonexistent.
While the presence of causal variant genotypes in sequence information might be expected to give variable selection methods an advantage, this has not been supported by results from real data [12–14, 19], likely due to the high dimensionality of the models and high LD of the causal variants with large numbers of neighboring markers.
Preselection of variants prior to training of the model has been suggested both as an alternative and complement to variable selection methods. Heidaritabar et al. [13] preselected SNPs based on mutation type (e.g., synonymous, nonsynonymous, and non-coding) from a full set of approximately 4.6 million markers but found no appreciable gain in accuracy. Other studies have attempted to identify the most relevant variants through association statistics such as p-values, absolute estimated effects, or the relative contribution to the genetic variance. Investigating inbred lines of D. melanogaster, Ober et al. [11] selected the top 5% of SNPs ranked either by absolute estimated effect or the proportion of the genetic variance explained and found no significant improvement of accuracy using either preselection criteria. Veerkamp et al. [20] preselected variants based on p-values in Holstein data and found no improvement in accuracy, with the additional disadvantage of bias in the GEBVs and inflation of the variance component estimates. Frischknecht et al. [21] used p-values, annotation missense status, or LD-pruning to preselect variants; LD-pruning was the only strategy that did not reduce accuracies. Some studies have combined preselected SNPs with standard medium-density SNP chips to compromise between the potential benefits of each marker set, but few have found any benefit from this approach [22–25]. However, many of these studies performed SNP discovery and training of the prediction model using the same reference data set.
These results are not surprising and are in fact a consequence of the Beavis effect [26], a variation of the so-called “winner’s curse” phenomenon, where many of the selected SNP effects are overestimated, which will result in biased predictions and reduced accuracies in the validation set. Many studies that have investigated marker preselection based on association statistics criteria (e.g., p-values, absolute estimated effects) have used the data twice (in preselection and training), and this could be the primary explanation for their failure to improve accuracies. Splitting the data into three non-overlapping sets for discovery, training, and validation may alleviate this bias; however, this is a suboptimal use of an expensive resource and could result in an increase in the standard error of estimates and corresponding decrease in power to detect relevant markers. Additionally, splitting the data may not eliminate the population structure that arises from families or breeds, which can contribute to an erroneously inflated association of markers with the trait [27].
Toghiani et al. [28] introduced the population statistic FST, a measure of deviation in allele frequencies between populations, as a criterion for marker preselection in genomic evaluations of livestock. They showed that by using high- and low-phenotype individuals within a population to calculate FST scores, historical selection signals could be detected at markers that tag causal loci. Chang et al. [6] demonstrated that preselection of markers by FST scores could significantly improve genomic prediction accuracies, and even outperformed BayesB and BayesC as the dimensionality of the model increased. A subsequent study by Chang et al. [7] showed that genomic similarity between individuals will be maximized using a highly stringent subset of the top markers as ranked by FST scores, though accuracies will not be maximized using this subset. They proposed that the highest potential accuracy will be achieved when a balance between high genomic similarity and the proportion of genetic variance explained is achieved.
In this study, we expanded upon these results by investigating how the inclusion of markers in linkage equilibrium with causal loci impact the estimation of genomic relationships and affect prediction accuracies. Additionally, we compared the sensitivity of FST scores and estimated SNP effects as preselection criteria to discriminate between markers that are linked and unlinked with causal loci and the potential of each to increase accuracies.
Results
Accuracy of prediction was 0.37, 0.62, 0.89 and 0.22 using pedigree, all, HQ2, and LQ28 markers, respectively, to model the relationship matrix. As expected, the highest (0.89) and lowest (0.22) accuracies were obtained when the genomic relationship matrix was constructed using only linked (HQ2) or unlinked (LQ28) markers (Fig. 1a), respectively. Using the latter, accuracy was 39.6% lower than that achieved using expected relationships despite being based on genomic information. While use of all 777 k markers outperformed expected relationships by 70.3%, the accuracy was still approximately 30% lower than that obtained using only HQ2 SNPs.
Accuracies based on marker subsets preselected either randomly, by FST scores, or by estimated effects are shown in Table 1. When markers were preselected randomly, accuracy increased rapidly and plateaued when approximately 20 k markers were used. This is similar to the trend observed using commercial genotyping panels, where a subset of reasonably well-distributed markers yielded prediction accuracies similar to much higher density platforms. Although 50 to 60 k markers are typically necessary for many livestock species before reaching a plateau in accuracy, the smaller number of SNPs required in this study is likely due to the unconventional simulated genome structure and high LD between markers and QTL.
Table 1.
Selection methoda | Number of preselected SNPs (in thousands) | |||||
---|---|---|---|---|---|---|
1 | 10 | 20 | 30 | 40 | 50 | |
Random | 0.27 | 0.51 | 0.57 | 0.59 | 0.59 | 0.60 |
FST | 0.81 | 0.85 | 0.83 | 0.81 | 0.80 | 0.79 |
Effect | 0.84 | 0.78 | 0.72 | 0.69 | 0.68 | 0.67 |
a SNPs were preselected either randomly, based on their FST scores, or based on the absolute value of their estimated effect
Use of markers preselected based on FST scores resulted in a higher accuracy compared to the use of all markers. In fact, accuracy increased between 26.7 and 36.4% across all subsets. Accuracy peaked with the use of the top 10 k markers and remained fairly persistent; the decrease in accuracy was only 7.1% as the number of preselected markers increased to 50 k.
For preselection based on SNP effects, accuracy for 1 k markers was initially comparable to that achieved using 10 k FST-preselected markers (0.84 and 0.85, respectively); however, accuracies rapidly declined (by 20.2%) with larger subsets and the top 50 k markers yielded accuracies that exceeded use of all markers by only 8%.
Table 2 shows the percentage of preselected markers that are located on either of the two chromosomes harboring QTL. These percentages are measures of the sensitivity of the preselection criteria to detect markers that are truly linked with causal loci. The top 1 k FST-preselected markers were almost all (99.99%) SNPs in true linkage with QTL. The sensitivity steadily declined as the number of preselected markers increased and reached a minimum of only 28% linked when 50 k markers were preselected. Preselection by SNP effects followed a similar trend but had greater sensitivity to detect markers potentially linked with QTL for all subsets compared to FST.
Table 2.
Selection methoda | Number of preselected SNPs (in thousands) | |||||
---|---|---|---|---|---|---|
1 | 10 | 20 | 30 | 40 | 50 | |
Random | 6.75 | 6.65 | 6.63 | 6.64 | 6.63 | 6.69 |
FST | 99.99 | 67.04 | 47.19 | 37.84 | 32.26 | 28.45 |
Effect | 100.00 | 76.07 | 54.13 | 43.13 | 36.42 | 31.89 |
a SNPs were preselected either randomly, based on their FST scores, or based on the absolute value of their estimated effect
The proportion of genetic variance explained by preselected marker subsets is shown in Table 3. The genetic variance contributed by a particular QTL was considered explained by a marker subset if at least one marker had an r2 greater than 0.9 with the QTL. As expected, preselection using a random selection criterion explained the least amount of the genetic variance. Preselection by FST and absolute estimated effects resulted in significantly more genetic variance explained, as much as 40 and 41%, respectively. Yet for neither criteria did maximization of genetic variance explained coincide with maximization of prediction accuracy, likely as a consequence of an increasing proportion of unlinked markers present in larger subsets (Table 2).
Table 3.
Selection methodb | Number of preselected SNPs (in thousands) | |||||
---|---|---|---|---|---|---|
1 | 10 | 20 | 30 | 40 | 50 | |
Random | 0.0041 | 0.018 | 0.082 | 0.11 | 0.10 | 0.14 |
FST | 0.31 | 0.38 | 0.39 | 0.39 | 0.39 | 0.40 |
Effect | 0.33 | 0.40 | 0.41 | 0.41 | 0.41 | 0.41 |
aGV Genetic variance bSNPs were preselected either randomly, based on their FST scores, or based on the absolute value of their estimated effect
Genomic information increased accuracy compared to pedigree by improving modeling of the MS. The effectiveness of a set of markers to capture QTL similarity and MS between individuals could be evaluated by assessing the correlation between marker- and QTL-based G matrices. The non-centered G matrix reflects the total QTL similarity while the centered G matrix (Eq. 1) will reflect the MS component only.
Correlations between the marker- and QTL-based G matrices for all, HQ2, or LQ28 markers are listed in Table 4. As expected, the non-centered correlations followed the same trend as that observed for the accuracies, with the maximum (0.63) and minimum (0.28) correlation obtained using only HQ2 and LQ28 markers, respectively. When G was centered by the expected relationships, the correlation for LQ28 markers was effectively zero. In contrast, using only linked markers to construct G, the correlation decreased by just 8.4% after adjusting for expected relationships.
Table 4.
All | HQ2 | LQ28 | |
---|---|---|---|
Non-Centered | 0.345399 | 0.631371 | 0.284554 |
Centered | 0.159684 | 0.578165 | 0.0017988 |
Relative Decrease (%) | 0.537721 | 0.084473 | 0.993687 |
a All = all markers; HQ2 = markers on the two chromosomes harboring the QTL; LQ28 = markers on the 28 chromosomes lacking QTL
This independence between the variation of LQ28 markers and QTL is illustrated in Fig. 2a, which plots the density of Eq. 2 for all, HQ2, and LQ28 markers. For the LQ28 subset, the distribution of this directional MS component falls evenly around zero; the number of marker-estimated relationships that fail to capture the correct direction of the QTL MS and the number that capture it correctly are approximately equal (Fig. 2b). The distribution for HQ2 is shifted towards more positive values, showing that this group of markers estimates the correct direction of the QTL MS more often than not. Interestingly, HQ2 markers still fail to capture the correct direction of the MS of QTL approximately 30% of the time (Fig. 2b); this likely occurs primarily when the deviation of the QTL genomic relationship from the expectation is quite small.
Tables 5 and 6 show the non-centered and centered correlations of the QTL-based G with G based on FST- and effect-preselected subsets, respectively. For FST, the correlation followed a similar trend as that observed for the accuracies (Table 1), with the largest correlation for both non-centered and centered G matrices achieved using the top 10 k FST-preselected markers. The correlation for effect also peaked at the top 10 k markers, however, this does not coincide with where the accuracy is maximized. The relative decrease in the correlation with centering was smaller for SNP effects than for FST-score-based prioritization, indicating that marker effects have a slightly better ability to capture the direction of the MS of QTL (Fig. 3a). However, both preselection criteria for all subsets considered were more likely than not to identify the true direction of the MS, as presented in Fig. 3b and c.
Table 5.
Number of preselected SNPs (in thousands) | ||||||
---|---|---|---|---|---|---|
1 | 10 | 20 | 30 | 40 | 50 | |
Non-Centered | 0.339069 | 0.542457 | 0.54376 | 0.527988 | 0.511761 | 0.49678 |
Centered | 0.315198 | 0.477285 | 0.469059 | 0.451059 | 0.433872 | 0.417522 |
Relative Decrease (%) | 0.0728319 | 0.121576 | 0.138934 | 0.147398 | 0.153974 | 0.161264 |
Table 6.
Number of preselected SNPs (in thousands) | ||||||
---|---|---|---|---|---|---|
1 | 10 | 20 | 30 | 40 | 50 | |
Non-Centered | 0.378834 | 0.607054 | 0.602191 | 0.576295 | 0.550798 | 0.529322 |
Centered | 0.351288 | 0.550309 | 0.548288 | 0.528462 | 0.506049 | 0.48496 |
Relative Decrease (%) | 0.0739304 | 0.093814 | 0.0899005 | 0.08346 | 0.0818111 | 0.0844371 |
Figure 4 presents the distribution of the errors in estimating the MS of the QTL (Eq. 3) using subsets of markers preselected by FST and absolute estimated effects. For both preselection methods, the error was minimized when only 10 k markers were preselected (highest density near zero). This coincides with the subset that maximizes accuracy for FST, but not for preselection by estimated effects. Preselection based on the magnitude of the estimated effect maximized the accuracy using 1 k markers, which actually appears to yield the greatest error in MS estimation among the subsets considered.
When only 1 k SNPs were prioritized, the estimated effects preselection method seems to outperform the FST-score-based approach. However, beyond the top 1 k panel, FST preselection consistently yields significantly higher accuracies. This coincides with when the sensitivity of both preselection methods starts to decrease, and unlinked markers begin to form part of the preselected subsets. This suggests that the difference between the two approaches is a consequence of the unlinked markers selected. Figure 5a and b show the regression of FST on estimated effect for HQ2 and LQ28 markers, respectively. There is a more consistent trend between the two statistics for HQ2 than for LQ28 markers. The Pearson correlation between FST and estimated effect is 0.77 and 0.27 for HQ2 and LQ28 markers, respectively. Together these results suggest that the two statistics tend to have high agreement when a prioritized marker is linked with a QTL but less so when the marker is unlinked.
In Fig. 5b, the threshold for inclusion in the top 10 k marker subsets for FST and estimated effects are denoted by a yellow and blue lines, respectively. It is clear that more SNPs with a large spurious association are preselected when using estimated SNP effects rather than FST scores. Without an independent training dataset, these large spurious associations will be re-estimated and exacerbated when training the prediction model and negatively affect the prediction accuracy in the validation set. The higher and more persistent accuracy for larger subsets when using FST as a preselection tool could be explained by its tendency to select markers that on average have less pronounced spurious associations.
To investigate this further, the top or bottom 50 k LQ28 (unlinked) markers as ranked based on FST scores or absolute estimated effects were excluded from the full panel of 777 k SNP markers. The reduced panels of 725 k markers were then used for predictions and the resulting accuracies are presented in Table 7. Theoretically, given their lack of linkage with any QTL, it is expected that the excluded 50 k top or bottom markers should not influence the accuracy. However, that was not the case and exclusion of certain unlinked markers yielded an increase in accuracy, indicating that the analysis benefits from their absence.
Table 7.
Excluded markersb | |||
---|---|---|---|
Exclusion criteriaa | None | Top 50 k | Bottom 50 k |
Effects | 0.62 | 0.68 | 0.62 |
FST Scores | 0.62 | 0.65 | 0.63 |
a Markers were excluded from the LQ28 subset based either of their FST scores or effects; b All markers were included (None), top 50 k markers excluded (Top 50 k), and bottom 50 k markers excluded (Bottom 50 k)
Exclusion of the 50 k unlinked markers with the largest estimated effects resulted in the largest increase in accuracy (approximately 8.6%) compared to use of all markers without preselection. In contrast, exclusion of the 50 k unlinked markers with the smallest estimated effect led to no change in accuracy relative to use of all markers, as expected given that their estimated effects were close to zero. However, exclusion of the 50 k unlinked markers with the largest FST scores resulted in a smaller increase in accuracy (4.1%), showing the superiority of the FST method in avoiding the preselection of unlinked markers with pronounced spurious associations.
While the simulation design previously evaluated is convenient for evaluating the behavior of markers that are unlinked with QTL in a prediction model, it would be unreasonable to expect a complex trait in reality to be accurately modeled by such a design. To evaluate whether a similar trend could persist under a more reasonable distribution of QTL across the entire genome, the simulation was repeated with the 200 QTL distributed across all 30 chromosomes. Table 8 shows accuracy and percentage of genetic variance explained for FST- and effect-preselected subsets.
Table 8.
Number of preselected SNPs (in thousands) | |||||||
---|---|---|---|---|---|---|---|
1 | 10 | 20 | 30 | 40 | 50 | ||
Accuracy | FST | 0.66 | 0.73 | 0.73 | 0.72 | 0.71 | 0.71 |
Effect | 0.73 | 0.70 | 0.67 | 0.66 | 0.65 | 0.64 | |
GV Explained (%) | FST | 0.21 | 0.29 | 0.31 | 0.32 | 0.32 | 0.33 |
Effect | 0.21 | 0.34 | 0.35 | 0.35 | 0.35 | 0.36 |
With QTL distributed across all chromosomes, accuracy using all markers was 0.60. Both preselection methods achieve a maximum accuracy of 0.73, though FST requires a larger number of preselected markers to achieve this. As the panel size increases to 50 k, the accuracy for effect- and FST-preselection decrease by approximately 12.3 and 2.7%, respectively. Despite yielding a lower accuracy for panels of 10 k markers and larger, the effect-preselected subsets explain 9.1 to 17.2% more of the genetic variance than the equivalently-sized FST-preselected subsets. This demonstrates that the trend in prediction results for FST- and effect-preselected subsets is consistent even when all chromosomes harbor multiple causal loci.
Discussion
It was shown that the predictive ability of markers that are unlinked with QTL is inferior to even pedigree information, a result that agrees with previous studies [29–31]. However, despite their inferior predictive power, accuracies using only unlinked markers were always positive. Habier et al. [29] attributes this to unlinked markers modeling additive genetic relationships and shows that the accuracy will converge to that of pedigree BLUP as the number of independently segregating markers increases. Regardless of linkage, the distribution of QTL and marker additive relationships for a particular order of kinship will share a mean, the expected relationship. The advantage of using genomic information compared to pedigree is the better modeling of the MS of QTL. However, when markers and QTL segregate independently the covariance of marker and QTL MS is zero (Table 4) and the marker-based relationships are noisy estimates of the average additive relationships. While these markers will independently yield positive accuracies, they should not be expected to benefit the analysis when markers in LD with causal loci are available. HQ2 markers also capture the additive relationship with the additional benefit of accounting for some portion of the MS of QTL, as evidenced by the limited decrease in the correlation between the HQ2-marker- and QTL-based G matrices after centering with expected relationships (Table 4) and the shift of the HQ2 distribution in Fig. 2a to more positive values.
Ideally, the effect of unlinked markers on the estimation of the breeding values would be zero when more informative markers are present in the model. However, the inferior accuracy obtained using all markers compared to only HQ2 markers demonstrates that the effect of unlinked markers will not be null. The results of this study demonstrate that in terms of a GBLUP model, allowing unlinked markers to have a nonzero contribution to G adds noise to the estimation of genomic relationships that will not be reflective of true QTL similarity, resulting in lower accuracy relative to that achieved using only linked markers in the validation population. In terms of a SNP-BLUP model, which has been shown to be equivalent to GBLUP [29], nonzero estimates will be obtained for unlinked markers that have no association with QTL inheritance in validation individuals. Table 4 shows that the MS of QTL and unlinked markers vary around the same average relationship, which creates an association of the unlinked markers with the QTL. The model cannot discriminate spurious marker associations that are a result of this shared expectation and random sampling from associations due to true linkage with a causal locus, particularly when the unlinked markers are themselves used to inform the variance-covariance structure.
These results highlight the motivation and potential for preselection of markers to improve accuracies. Both FST scores and absolute estimated effect preselection-based methods were able to identify relevant markers with high sensitivity when preselecting a small number of markers and yielded high accuracies. However, the trend in accuracy differed substantially between the two approaches. As the number of preselected markers increased, their sensitivity to detect linked markers decayed, and unlinked markers were incorrectly selected. Preselection by FST increased accuracy from 1 k to 10 k markers while the accuracy for preselection by estimated effects decreased by approximately 7.1% over the same interval. This occurred despite FST preselection adding 903 more unlinked markers and explaining approximately 5% less of the genetic variance than estimated effects. The accuracy for FST preselection declined as the number of preselected markers increased beyond 10 k, but was more persistent than the accuracy for estimated effects despite consistently selecting more unlinked markers and explaining less of the genetic variance.
There are two important concepts that are illustrated by the behavior of these statistics. First, when the preselection criteria have imperfect sensitivity, accuracy will be maximized by a balance between increasing the genetic variance explained and minimizing deleterious contributions from poorly informative markers. FST added a large number of unlinked markers when the number of preselected SNPs increased from 1 k to 10 k, but the genetic variance explained was also significantly increased, resulting in an overall improvement in accuracy. As long as the beneficial contribution to the genetic variance explained by linked markers exceeds the negative effects of the association noise added by unlinked markers, the accuracy will increase. The decline in accuracy for FST when the number of preselected markers increased from 10 k to 20 k is explained by the fact that the genetic variance explained increased by only 2.6% while approximately 73% of added markers were unlinked with QTL; this likely contributed significant noise to estimation of genomic relationships. This is in concordance with Chang et al. [7], who concluded that a balance is needed between genomic similarity and the proportion of genetic variance explained by the preselected markers in order to maximize accuracies. While in the current study we make only a distinction between linked and unlinked markers, markers that are linked to but in low LD with a QTL will also contribute noise to the model and the negative impact of this noise may outweigh the benefit of any genetic variance they explain.
Second, the noise contributed by unlinked markers is not necessarily equal between both preselection methods. Estimated-effects-based approach consistently showed a greater sensitivity to detect linked markers than FST, yet yielded significantly lower accuracies, except in the case of the 1 k panel where it selected no unlinked markers. For panel sizes of 10 k and larger, the accuracy for the estimated-effects-based approach was lower than for FST scores largely because the unlinked markers selected by the approach have a greater detrimental effect.
When the 50 k most spuriously associated unlinked markers were excluded from the analysis (Table 7), accuracies improved significantly. These markers have a large spurious association with the trait and the analysis benefits from their exclusion. While the complications that such markers present are often considered in the context of marker preselection, this result shows that such markers will have an appreciable negative impact even in the absence of preselection. There is therefore an incentive to identify and filter spuriously associated markers if a reliable and efficient method for distinguishing them from true associations can be developed.
Excluding the 50 k LQ28 markers with the largest FST scores from the full panel also resulted in the accuracy increasing, but this increase was not as pronounced as when the LQ28 markers with largest estimated effect were excluded. This indicates that when the training data is also used to calculate FST for preselection, there will be some tendency to select irrelevant markers with a spurious association, but that the spurious associations will on average be less severe than when preselecting by the absolute estimated effects. This could explain why accuracies are more persistent for preselection by FST scores than estimated marker effects even when the FST preselection criteria selects more unlinked markers and explains less of the genetic variance.
Both FST and marker effects were estimated using some portion of the training data rather than an independent dataset. While partitioning of the training data into two subsets, one for estimation of preselection statistics and one for training of the prediction model, may alleviate some bias, it will decrease the size of the data available for training the model and therefore increase the standard error in estimation of the statistics anyway. Splitting of the training data will not be a feasible option for most analyses, and the literature shows that several analyses that consider preselection by association statistics in genetic improvement programs have chosen to reuse the SNP discovery data for training of the model.
In contrast to marker effect estimation, calculation of FST used just 10% of the training data (Fig. 1). Spurious associations present in the full training data may be less extreme in subsets of that data, which could explain why FST is less affected by the bias that results from using the same data for both preselection and model training. FST then has the potential to be a simple and efficient preselection tool that can reduce the bias associated with preselection by association statistics without requiring an inefficient partitioning of the training data or expensive collection of new independent data.
FST scores and association statistics could potentially be combined into an index to harness the benefits of both preselection statistics. The Pearson correlation between FST scores and estimated effects was 0.78 and 0.28 for HQ2 and LQ28 markers, respectively. This suggests that there is high agreement among the two statistics when markers are linked with QTL, but much less so among unlinked markers. Spuriously associated markers could possibly be identified and excluded when there is large disagreement between the two statistics.
An additional benefit of FST-based prioritization is that it is not affected by an increase in the number of markers included in the model due to the independence in calculating the score of each marker. As the number of markers in the association model increases, estimation variance for estimated effects of markers will increase without a corresponding increase in the size of the training data set. Furthermore, the estimated effect of each marker will be further regressed toward zero as QTL effects become distributed over correlated blocks of the predictors [32]. This will further complicate disentangling true from spurious associations as both take a similar magnitude of estimated effect. In contrast, FST scores will remain constant regardless of the number of markers, correlated or uncorrelated, that enter jointly into the analysis. This does carry the drawback that highly correlated markers will have similar FST scores and so selecting only by top FST score will select all correlated markers in a block, which could cause bias [21] and inflation of variance estimates [20] due to multicollinearity. While not evaluated in this study, these issues could be avoided through LD-pruning of FST-selected markers or similar filtering measures.
Variable selection models are a conceptually similar but fundamentally different approach to marker preselection for reduction of the parameter space. While we do not explore a comparison of FST and variable selection models in this study, Chang et al. [6] compared FST preselection implemented in a BayesA-like regression with BayesB and BayesC. They found that while FST preselection did not outperform the Bayesian variable selection models in all scenarios, it did tend to have an advantage as the density of the full panel increased. In general, they found that BayesB and BayesC accuracies decreased with increased density of available markers, while accuracy using FST preselection tended to improve. This seems to be a result of the decreased statistical power to identify relevant markers flanking QTL of low effect as the number of parameters in the model increases. The benefits of both approaches might be harnessed by using FST to preselect markers with a generous threshold followed by implementation in a variable selection model that includes the prioritized markers.
Conclusions
In this study, FST was shown to be an efficient criterion for preselection of trait-relevant markers that can improve modeling of QTL similarity between individuals, increase prediction accuracies, and maintain more stable prediction accuracies than comparative association statistics like absolute estimated marker effects. While association statistics are powerful tools for identifying loci associated with a particular trait, disentangling spurious associations from weak but true signals is often not possible within the constraints of the data available. We showed that the more persistent prediction accuracy using FST-score-prioritized markers was the result of the ability of the FST-score-based method to select unlinked markers with weaker spurious associations with the trait compared to preselection based on absolute estimated effects. While this study only explored FST scores as an independent preselection statistic, we showed that FST and absolute estimated effect approaches preselected highly correlated sets of markers linked with QTL but significantly less so among unlinked markers. This highlights the potential for the possibility of combining FST scores (or similar population statistics) and association statistics into a powerful preselection index to reduce inclusion of nonrelevant markers with a large spurious association.
Methods
Data simulation
The simulated genome consisted of 30 chromosomes (100 centiMorgans each in length) that harbored a total of 777 k evenly-distributed SNP markers and was generated using QMSim [33]. As one of the primary goals of this study was to investigate how markers that segregate independently from QTL affect prediction accuracies, 200 QTL were randomly distributed across 2 of the 30 chromosomes, as illustrated in Fig. 1a. While it is an extreme scenario for real distribution of QTL for a complex trait, this design allowed unambiguous classification of approximately 725 k markers as segregating independently from any QTL, with the approximately 52 k remaining markers potentially linked with at least one QTL. QTL were limited to 2 chromosomes to ensure a high LD of the majority of markers on these chromosomes with at least one QTL and to evaluate the behavior of markers that are unlinked with QTL in a prediction model. QTL effects were sampled from a Gamma distribution with a shape parameter of 0.4.
The LD structure was generated through 2070 generations of random mating in a historical population, with a bottleneck occurring at generation 2000. The number of individuals in the historical population varied between 600 and 4000. The resulting average LD (r2) was 0.32 and 0.084 between consecutive SNP and QTL, respectively.
A trait with a heritability equal to 0.4 was simulated. The genetic and residual variances were set equal to 0.4 and 0.6, respectively. The model used to simulate the phenotypes included the true breeding value (the cross product of true QTL effects and genotypes) and a normally distributed error term with mean zero and dispersion equal to the residual variance.
All individuals from the last generation of the historical population (500 males and 3500 females) were selected to be founders of a population under selection (SP). An additional seven generations (3500 progeny each) were generated. Pedigree-based estimates of breeding values (pEBVs) were used for selection. Sire and dam replacement rates were set equal to 0.5 and 0.2, respectively. The training population consisted of the first six SP generations. It included 21 k phenotyped individuals; a randomly selected half of these were also genotyped. The seventh generation of SP consisted of 3.5 k genotyped individuals, none of which were phenotyped, and was used as the validation population. Ten replicates of the simulated genome and phenotypic data were generated.
Methods
Predictions were made using a single-step GBLUP model (ssGBLUP) [34–36] as implemented in the BLUPF90 software [37]. The genomic relationship matrix (G) was constructed according to VanRaden [38],
Where Z is a matrix of SNP genotypes, pi is the minor allele frequency of the ith SNP, and n is the number of SNPs.
In order to evaluate the consequences that SNPs unlinked with QTL have on prediction accuracies, three analyses that differed in the linkage status of the SNPs used to build G were performed. The three SNP subsets considered were 1) only the 51,800 SNP situated on either of the two chromosomes harboring around 100 QTL each (HQ2), 2) only the 725,200 SNP on any of the 28 chromosomes that lack QTL (LQ28) and can be definitively classified as being unlinked with QTL, and 3) the union of the two previous subsets that includes all 777,000 SNP regardless of linkage status.
In the next stage of the study, FST was used as a criterion for preselection of SNP subsets and was compared to preselection according to absolute estimated marker effect or random subsets. Subsets of 1, 10, 20, 30, 40, and 50 k SNPs for each preselection criteria were used for the construction of G and corresponding prediction accuracies, defined as the Pearson correlation between true and estimated breeding values of validation individuals, compared.
Fst scores were calculated following Nei [39],
where HT = 2 ∙ pT ∙ qT, and HSi = 2 ∙ pSi ∙ qSi,
where pT and qT are major and minor allele frequencies, respectively, in the population; pSi and qSi are major and minor allele frequencies, respectively, in the ith subpopulation; and nSi is the number of individuals in the ith subpopulation. The genotyped and phenotyped individuals in the training population were ranked according to their phenotype and the bottom and top 5% of individuals used to create two subpopulations (S1 and S2), as illustrated in Fig. 1b. Using these subpopulations, an Fst score for each SNP was computed as indicated in formulae above. SNPs were then ranked based on their Fst scores and subsets of the top 1, 10, 20, 30, 40, and 50 k markers used to compute G.
The rationale for forming subpopulations from individuals of extreme phenotype in a single breeding population rather than using separate breeding populations with highly divergent phenotypes (e.g., milk production in Holstein-Friesian and Jersey) is to minimize the likelihood of preselection of adaptive SNP markers that are specific to a population due to natural or artificial selection. By obtaining extreme phenotypes from within a single breeding population, any potential divergence in allele frequency related to traits uncorrelated with the one of interest will be more effectively averaged over.
Estimated genomic breeding values that were obtained using all 777 k markers to compute G were used to derive SNP effects through the following relationship [40],
where û is the vector of SNP effects, â is the vector of estimated genomic breeding values for individuals in the training population with G modeled using all 777 k SNPs, Z is a known incidence matrix of SNP genotypes, and D is a diagonal matrix of weights. In this study, D was set equal to the identity matrix to convey equal weight to all SNPs. Estimation of genomic breeding values and the back-calculation of SNP effects were obtained using ssGBLUP [34–36] as implemented in BLUPF90 [37] and PreGSF90 [41], respectively.
Random SNP subsets were generated by random sampling from all available SNPs with no restrictions placed on the number of markers sampled from a particular chromosome or the proportion that were linked and unlinked with QTL. A generalized outline of the approach to the analysis for Fst and SNP effect preselection and predictions is summarized in Fig. 1b.
Analysis of marker and QTL similarity
Genomic information improves prediction accuracies compared to pedigree primarily through a better modeling of the MS of QTL. To dissect how the various marker-estimated genomic relationships capture the true QTL similarity between individuals and how they contribute to the maximization of prediction accuracy, several metrics were used to quantify the agreement between marker- and QTL-based G matrices.
First, a correlation was calculated between all elements of the full marker- and QTL-based G matrices, as suggested by VanRaden [38]; this correlation reflects the adequacy of the estimated genomic relationships to capture both the expected relatedness and MS. To further evaluate how well each set of markers models the MS component specifically, expected relationships were subtracted from each genomic relationship, and a correlation between the resulting centered G matrices was calculated,
where GM and GQTL are the marker and QTL relationship matrices, respectively, and A22 is the matrix of expected relationships based on pedigree information for genotyped individuals. It is possible that certain markers will capture variation in MS that is not consistent with the MS of QTL when LD between markers and QTL is low. If the marker-estimated and QTL relationships fall in opposite directions around the expected relationship, then the expected relationship will in fact be a better estimate than the marker-estimated genomic relationship. The ability of marker subsets to capture the correct direction of the MS of QTL was determined as,
where sd (GM) is the standard deviation over all relationships within GM, and the sign reflects whether the MS component for marker-estimated relationships falls in the same (positive) or opposite (negative) direction as the QTL MS relative to the expected relationship.
At a minimum, marker-estimated relationships should capture the correct direction of the MS of QTL in order to improve the modeling of relationships relative to the expectation. An ideal set of markers will additionally minimize the distance between marker- and QTL-based relationships. This distance can be approximated using the following formulae,
This puts the discrepancy between marker-estimated and QTL relationships in terms of an error (%) relative to the scale of the MS of QTL. The closer a value is to zero, the less the discrepancy between the marker-estimated and QTL relationship. A value less than one implies that the marker-estimated relationship is a closer approximation of the QTL relationship than the expected relationship, while a value greater than one implies either that the marker-estimated relationship captures the correct direction but overestimates the QTL MS, or that the marker-estimated relationship has opposite direction of MS than the QTL. Figures were generated using the tidyverse package in R [42].
Acknowledgements
This study was supported in part by resources and technical expertise from the Georgia Advanced Computing Resource Center, a partnership between the University of Georgia’s Office of the Vice President for Research and Office of the Vice President for Information Technology.
Abbreviations
- GS
Genomic selection
- MS
Mendelian sampling
- SNP
Single nucleotide polymorphism
- LD
Linkage disequilibrium
- QTL
Quantitative trait loci
- SP
Population under selection
- pEBV
Pedigree-based estimated breeding value
- ssGBLUP
Single-step GBLUP
- G
genomic relationship matrix
- S1 and S2
Subpopulations 1 and 2
- HQ2
Subset of markers linked with QTL
- LQ28
Subset of markers unlinked with QTL
Authors’ contributions
RR, EHH and SEA designed the study and proposed the main hypotheses. ASL and RR designed the different simulations scenarios. ASL carried out all aspects of data simulation and analysis and drafting. All authors contributed to the interpretation and discussion of the results. All authors contributed to the revision and editing of the manuscript. The author(s) read and approved the final manuscript.
Funding
AL was funded by the United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA) through the National Needs Grant, grant number 11754154 to RR, https://nifa.usda.gov/. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
1 The U.S. Department of Agriculture (USDA) prohibits discrimination in all its programs and activities on the basis of race, color, national origin, age, disability, and where applicable, sex, marital status, familial status, parental status, religion, sexual orientation, genetic information, political beliefs, reprisal, or because all or part of an individual's income is derived from any public assistance program. (Not all prohibited bases apply to all programs.) Persons with disabilities who require alternative means for communication of program information (Braille, large print, audiotape, etc.) should contact USDA's TARGET Center at +1 (202) 720-2600 (voice and TDD). To file a complaint of discrimination, write to USDA, Director, Office of Civil Rights, 1400 Independence Avenue, S.W., Washington, D.C. 20250-9410, or call +1 (800) 795-3272 (voice) or +1 (202) 720-6382 (TDD). USDA is an equal opportunity provider and employer.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Garcia-Ruiz A, Cole JB, VanRaden PM, Wiggans GR, Ruiz-Lopez FJ, Van Tassell CP. Changes in genetic selection differentials and generation intervals in US Holstein dairy cattle as a result of genomic selection. Proc Natl Acad Sci U S A. 2016;113(28):E3995–E4004. doi: 10.1073/pnas.1519061113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, et al. Invited review: reliability of genomic predictions for north American Holstein bulls. J Dairy Sci. 2009;92(1):16–24. doi: 10.3168/jds.2008-1514. [DOI] [PubMed] [Google Scholar]
- 3.Genomes Project C, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Daetwyler HD, Capitan A, Pausch H, Stothard P, van Binsbergen R, Brondum RF, et al. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat Genet. 2014;46(8):858–865. doi: 10.1038/ng.3034. [DOI] [PubMed] [Google Scholar]
- 5.Rimbert H, Darrier B, Navarro J, Kitt J, Choulet F, Leveugle M, Duarte J, Rivière N, Eversole K, on behalf of The International Wheat Genome Sequencing Consortium. le Gouis J, on behalf The BreedWheat Consortium. Davassi A, Balfourier F, le Paslier MC, Berard A, Brunel D, Feuillet C, Poncet C, Sourdille P, Paux E. High throughput SNP discovery and genotyping in hexaploid wheat. PLoS One. 2018;13(1):e0186329. doi: 10.1371/journal.pone.0186329. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Chang LY, Toghiani S, Ling A, Aggrey SE, Rekaya R. High density marker panels, SNPs prioritizing and accuracy of genomic selection. BMC Genet. 2018;19(1):4. doi: 10.1186/s12863-017-0595-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Chang LY, Toghiani S, Aggrey SE, Rekaya R. Increasing accuracy of genomic selection in presence of high density marker panels through the prioritization of relevant polymorphisms. BMC Genet. 2019;20(1):21. doi: 10.1186/s12863-019-0720-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Hayes BJ, MacLeod IM, Daetwyler HD, Bowman PJ, Chamberlian AJ, Vander Jagt CJ, et al., editors. Genomic prediction from whole genome sequence in livestock: the 1000 Bull Genomes Project. 10 World Congress of Genetics Applied to Livestock Production; 2014 2014-08-17; Vancouver, Canadahttps://hal.archives-ouvertes.fr/hal-01193911/documenthttps://hal.archives-ouvertes.fr/hal-01193911/file/2014_Hayes_WCGALP_1.pdf.
- 9.Meuwissen T, Goddard M. Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics. 2010;185(2):623–631. doi: 10.1534/genetics.110.116590. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Druet T, Macleod IM, Hayes BJ. Toward genomic prediction from whole-genome sequence data: impact of sequencing design on genotype imputation and accuracy of predictions. Heredity (Edinb) 2014;112(1):39–47. doi: 10.1038/hdy.2013.13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Ober U, Ayroles JF, Stone EA, Richards S, Zhu D, Gibbs RA, Stricker C, Gianola D, Schlather M, Mackay TFC, Simianer H. Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genet. 2012;8(5):e1002685. doi: 10.1371/journal.pgen.1002685. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.van Binsbergen R, Calus MP, Bink MC, van Eeuwijk FA, Schrooten C, Veerkamp RF. Genomic prediction using imputed whole-genome sequence data in Holstein Friesian cattle. Genet Sel Evol. 2015;47(1):71. doi: 10.1186/s12711-015-0149-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Heidaritabar M, Calus MP, Megens HJ, Vereijken A, Groenen MA, Bastiaansen JW. Accuracy of genomic prediction using imputed whole-genome sequence data in white layers. J Anim Breed Genet. 2016;133(3):167–179. doi: 10.1111/jbg.12199. [DOI] [PubMed] [Google Scholar]
- 14.Zhang C, Kemp RA, Stothard P, Wang Z, Boddicker N, Krivushin K, Dekkers J, Plastow G. Genomic evaluation of feed efficiency component traits in Duroc pigs using 80K, 650K and whole-genome sequence variants. Genet Sel Evol. 2018;50(1):14. doi: 10.1186/s12711-018-0387-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Hickey JM, Dreisigacker S, Crossa J, Hearne S, Babu R, Prasanna BM, Grondona M, Zambelli A, Windhausen VS, Mathews K, Gorjanc G. Evaluation of genomic selection training population designs and genotyping strategies in plant breeding programs using simulation. Crop Sci. 2014;54(4):1476–1488. doi: 10.2135/cropsci2013.03.0195. [DOI] [Google Scholar]
- 16.Daetwyler HD, Pong-Wong R, Villanueva B, Woolliams JA. The impact of genetic architecture on genome-wide evaluation methods. Genetics. 2010;185(3):1021–1031. doi: 10.1534/genetics.110.116855. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Meuwissen THE, Hayes BJ, Goddard ME. Prediction of Total genetic value using genome-wide dense marker maps. Genetics. 2001;157(4):1819–1829. doi: 10.1093/genetics/157.4.1819. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Croiseau P, Legarra A, Guillaume F, Fritz S, Baur A, Colombani C, et al. Fine tuning genomic evaluations in dairy cattle through SNP pre-selection with the elastic-net algorithm. Genet Res (Camb). 2011;93(6):409–417. doi: 10.1017/S0016672311000358. [DOI] [PubMed] [Google Scholar]
- 19.Calus MP, Bouwman AC, Schrooten C, Veerkamp RF. Efficient genomic prediction based on whole-genome sequence data using split-and-merge Bayesian variable selection. Genet Sel Evol. 2016;48(1):49. doi: 10.1186/s12711-016-0225-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Veerkamp RF, Bouwman AC, Schrooten C, Calus MP. Genomic prediction using preselected DNA variants from a GWAS with whole-genome sequence data in Holstein-Friesian cattle. Genet Sel Evol. 2016;48(1):95. doi: 10.1186/s12711-016-0274-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Frischknecht M, Meuwissen THE, Bapst B, Seefried FR, Flury C, Garrick D, Signer-Hasler H, Stricker C, Intergenomics Consortium. Bieber A, Fries R, Russ I, Sölkner J, Bagnato A, Gredler-Grandl B. Short communication: genomic prediction using imputed whole-genome sequence variants in Brown Swiss cattle. J Dairy Sci. 2018;101(2):1292–1296. doi: 10.3168/jds.2017-12890. [DOI] [PubMed] [Google Scholar]
- 22.Brondum RF, Su G, Janss L, Sahana G, Guldbrandtsen B, Boichard D, et al. Quantitative trait loci markers derived from whole genome sequence data increases the reliability of genomic prediction. J Dairy Sci. 2015;98(6):4107–4116. doi: 10.3168/jds.2014-9005. [DOI] [PubMed] [Google Scholar]
- 23.van den Berg I, Boichard D, Lund MS. Sequence variants selected from a multi-breed GWAS can improve the reliability of genomic predictions in dairy cattle. Genet Sel Evol. 2016;48(1):83. doi: 10.1186/s12711-016-0259-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.VanRaden PM, Tooker ME, O'Connell JR, Cole JB, Bickhart DM. Selecting sequence variants to improve genomic predictions for dairy cattle. Genet Sel Evol. 2017;49(1):32. doi: 10.1186/s12711-017-0307-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Wiggans GR, Cooper TA, VanRaden PM, Van Tassell CP, Bickhart DM, Sonstegard TS. Increasing the number of single nucleotide polymorphisms used in genomic evaluation of dairy cattle. J Dairy Sci. 2016;99(6):4504–4511. doi: 10.3168/jds.2015-10456. [DOI] [PubMed] [Google Scholar]
- 26.Xu S. Theoretical basis of the Beavis effect. Genetics. 2003;165(4):2259–2268. doi: 10.1093/genetics/165.4.2259. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 2010;11(7):459–463. doi: 10.1038/nrg2813. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Toghiani S, Chang L-Y, Ling A, Aggrey SE, Rekaya R. Genomic differentiation as a tool for single nucleotide polymorphism prioritization for genome wide association and phenotype prediction in livestock. Livest Sci. 2017;205:24–30. doi: 10.1016/j.livsci.2017.09.007. [DOI] [Google Scholar]
- 29.Habier D, Fernando RL, Dekkers JC. The impact of genetic relationship information on genome-assisted breeding values. Genetics. 2007;177(4):2389–2397. doi: 10.1534/genetics.107.081190. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Goddard M. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009;136(2):245–257. doi: 10.1007/s10709-008-9308-0. [DOI] [PubMed] [Google Scholar]
- 31.Habier D, Fernando RL, Garrick DJ. Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics. 2013;194(3):597–607. doi: 10.1534/genetics.113.152207. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.H. Ishwaran JSR. Generalized ridge regression: geometry and computational solutions when p is larger than n. In: Cleveland Clinic and University of Miami M, editor. 2010.
- 33.Sargolzaei M, Schenkel FS. QMSim: a large-scale genome simulator for livestock. Bioinformatics. 2009;25(5):680–681. doi: 10.1093/bioinformatics/btp045. [DOI] [PubMed] [Google Scholar]
- 34.Legarra A, Aguilar I, Misztal I. A relationship matrix including full pedigree and genomic information. J Dairy Sci. 2009;92(9):4656–4663. doi: 10.3168/jds.2009-2061. [DOI] [PubMed] [Google Scholar]
- 35.Aguilar I, Misztal I, Johnson DL, Legarra A, Tsuruta S, Lawlor TJ. Hot topic: a unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J Dairy Sci. 2010;93(2):743–752. doi: 10.3168/jds.2009-2730. [DOI] [PubMed] [Google Scholar]
- 36.Christensen OF, Lund MS. Genomic prediction when some animals are not genotyped. Genet Sel Evol. 2010;42(1):2. doi: 10.1186/1297-9686-42-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Aguilar I, Tsuruta S, Masuda Y, Lourenco D, Legarra A, Misztal I. BLUPF90 suite of programs for animal breeding with focus on genomics. Proceedings of the World Congress on Genetics Applied to Livestock Production 2018. p. 751.
- 38.VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91(11):4414–4423. doi: 10.3168/jds.2007-0980. [DOI] [PubMed] [Google Scholar]
- 39.Nei M. Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci. 1973;70(12):3321–3323. doi: 10.1073/pnas.70.12.3321. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Wang H, Misztal I, Aguilar I, Legarra A, Muir WM. Genome-wide association mapping including phenotypes from relatives without genotypes. Genet Res (Camb) 2012;94(2):73–83. doi: 10.1017/S0016672312000274. [DOI] [PubMed] [Google Scholar]
- 41.Aguilar I, Misztal I, Tsuruta S, Legarra A, Wang H. PREGSF90 – POSTGSsourcF90: Computational Tools for the Implementation of Single-step Genomic Selection and Genome-wide Association with Ungenotyped Individuals in BLUPF90 Programs 2014.
- 42.Wickham H, Averick M, Bryan J, Chang W, McGowan L, François R, et al. Welcome to the Tidyverse. J Open Source Softw. 2019;4(43):1686. 10.21105/joss.01686.
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.