Abstract
Genomic selection has the potential to increase genetic progress. Genotype imputation of high-density single-nucleotide polymorphism (SNP) genotypes can improve the cost efficiency of genomic breeding value (GEBV) prediction for pig breeding. Consequently, the objectives of this work were to: (1) estimate accuracy of genomic evaluation and GEBV for three traits in a Yorkshire population and (2) quantify the loss of accuracy of genomic evaluation and GEBV when genotypes were imputed under two scenarios: a high-cost, high-accuracy scenario in which only selection candidates were imputed from a low-density platform and a low-cost, low-accuracy scenario in which all animals were imputed using a small reference panel of haplotypes. Phenotypes and genotypes obtained with the PorcineSNP60 BeadChip were available for 983 Yorkshire boars. Genotypes of selection candidates were masked and imputed using tagSNP in the GeneSeek Genomic Profiler (10K). Imputation was performed with BEAGLE using 128 or 1800 haplotypes as reference panels. GEBV were obtained through an animal-centric ridge regression model using de-regressed breeding values as response variables. Accuracy of genomic evaluation was estimated as the correlation between estimated breeding values and GEBV in a 10-fold cross validation design. Accuracy of genomic evaluation using observed genotypes was high for all traits (0.65−0.68). Using genotypes imputed from a large reference panel (accuracy: R2 = 0.95) for genomic evaluation did not significantly decrease accuracy, whereas a scenario with genotypes imputed from a small reference panel (R2 = 0.88) did show a significant decrease in accuracy. Genomic evaluation based on imputed genotypes in selection candidates can be implemented at a fraction of the cost of a genomic evaluation using observed genotypes and still yield virtually the same accuracy. 
On the other hand, using a very small reference panel of haplotypes to impute training animals and candidates for selection results in lower accuracy of genomic evaluation.
Keywords: genomic selection, genotype imputation, swine, shared data resources, GenPred
Genetic improvement through breeding for lean growth, reproductive performance, meat quality, and health traits is an important tool in the pig-breeding industry to assure its continued competitiveness and success. Traditional estimated breeding values (EBVs) derived from pedigree information have resulted in continuous genetic improvement but have several limitations (Dekkers et al. 2010). Notably, some important phenotypes are difficult and expensive to observe, impairing estimation of accurate EBV.
The use of genomic breeding values (GEBVs), estimated using a large number of genetic markers across the genome, is expected to overcome a number of those limitations (Meuwissen et al. 2001; Dekkers et al. 2010) and allow for the selection of animals at a young age, thereby shortening generation intervals (Hayes et al. 2009a; VanRaden et al. 2009; Wiggans et al. 2011). Several papers have reported the progress and success of genomic selection in dairy cattle (Hayes et al. 2009a; VanRaden et al. 2009; Wiggans et al. 2011), and it is expected to be equally useful in pigs (Tribout et al. 2012). High-density genotypes in pigs can be obtained from the PorcineSNP60 BeadChip (Illumina, San Diego, CA) containing roughly 62K single-nucleotide polymorphisms (SNPs) (Ramos et al. 2009).
First implementations of genomic prediction in pigs included evaluations for total number of pigs born in a litter and percent stillborn (Cleveland et al. 2010). The results of this study indicated that GEBV in pigs can reach accuracies comparable with those observed in dairy cattle if the training population is large enough (Cleveland et al. 2010). In addition, several strategies to increase cost efficiency through the use of low-density genotypes have been explored, but the accuracy of GEBV was reasonable only for certain traits, likely due to differences in the genetic architecture of the traits (Cleveland et al. 2010). However, when genotypes were imputed with high accuracy, results for genomic evaluation were promising for several traits in a commercial pig population (Cleveland and Hickey 2013).
A question that was not investigated in those papers, and that we address in this study, is how different imputation scenarios (of varying cost and accuracy) translate into accuracy of genomic predictions. This question is important because the relatively high genotyping cost per animal currently limits the widespread commercial use of high-density genotypes for genomic selection in pigs. One strategy to improve the cost efficiency of genotyping schemes is the use of genotype imputation for a portion of the population. In the interest of cost efficiency, it is likely that selection candidates will not be genotyped using a high-density array such as the PorcineSNP60 but rather will be genotyped on a low-density array like the recently released GeneSeek Genomic Profiler for Porcine LD (GGP-Porcine: GeneSeek Inc., a Neogen Co., Lincoln, NE), a subset of the PorcineSNP60 containing roughly 10K SNP. We showed (Badke et al. 2013) that genotypes in pigs can be imputed from the GGP-Porcine to the PorcineSNP60 with an accuracy of R2 = 0.88 using linkage disequilibrium (LD)-based imputation algorithms with a small reference panel of haplotypes (N = 128 haplotypes). We also showed that imputation accuracy can be further improved by adding animals to the reference panel (Badke et al. 2013) or, in the case of a pedigreed population, by exploiting Mendelian segregation and population-wide LD (Huang et al. 2012; Gualdrón Duarte et al. 2013). In this paper, we use genotypes imputed based on population-wide LD, offering a strategy that can be applied universally in any population for which a suitable reference panel can be assembled.
Our objective was to estimate the accuracy of genomic evaluation using observed or imputed genotypes. Moreover, we consider two contrasting imputation scenarios: (a) a higher-cost and high-accuracy scenario in which high-density genotypes from training animals and from a reference panel are used to impute genotypes in candidates for selection and (b) a low-cost and lower-accuracy scenario in which a small reference panel of high-density haplotypes is used to impute genotypes in training animals and candidates for selection.
Materials and Methods
Materials
Animals and genotypes:
Data used in this study were collected from 983 Yorkshire sires. A pedigree of 4092 individuals spanning 22 generations and including all 983 sires and their registered ancestors was available from the National Swine Registry (NSR). Of 983 genotyped sires, 575 had their sire genotyped as well, 341 had a grand sire, and 597 animals had at least one half sib among the 983 animals. The number of full sibs was much lower, and only 110 sires had a full sib genotyped. Details on these quantities can be found in Supporting Information, Figure S1. High-density genotypes for these animals were obtained from samples provided by the NSR. Genotyping was performed at a commercial laboratory (GeneSeek) using the Illumina PorcineSNP60 BeadChip. The same dataset was previously used to assess the effect of genotype imputation (Badke et al. 2013) and is publicly available at: https://www.msu.edu/~steibelj/JP_files/imputation.html. Animal protocols were approved by the Michigan State University All University Committee on Animal Use and Care (AUF# 03/09-046-00). Genotyping rate of at least 90% of both animals and SNP and a minor allele frequency (MAF) of at least 5% were required for genotypes to be included in the analysis, leaving a total of 41,248 markers in 983 animals. SNPs that were not assigned to an autosomal position in map build 10.2 were excluded from the analysis. It was our goal to estimate the GEBV of male offspring of a sire and since sires will not pass an X chromosome to their male offspring, these SNP do not contribute to the sons’ GEBV (VanRaden et al. 2009). In addition to genotypes for 983 Yorkshire sires, a set of 128 Yorkshire haplotypes was available as a reference panel for genotype imputation from a previous study (Badke et al. 2012). These haplotypes are also freely available at https://www.msu.edu/~steibelj/JP_files/LD_estimate.html, and details on the design and phasing can be found in Badke et al. (2012).
Phenotypes:
For every animal and their parents, EBVs and accuracies were obtained for three traits from NSR through their traditional genetic evaluation. These traits were: backfat thickness (BF), number of days to 250 lb (D250), and loin muscle area (LEA). Descriptive statistics of EBV and accuracies are presented in Table 1. All code and data used in this paper have been assembled into an R package, accessible at: http://tinyurl.com/MSURGEBV.
Table 1. Descriptive statistics of EBVs.
EBV, estimated breeding values; BF, backfat thickness; D250, number of days to 250 lb; LEA, loin muscle area.
Average reliability of EBV.
Number of animals with usable EBV.
Methods
De-regression of breeding values:
De-regressed breeding values (dEBVs) were used as response variables throughout the analysis. We computed individual animal dEBVs and their weights (wi) with the parent average removed by following the procedure outlined by Garrick et al. (2009). We discarded records with a negative weight. The weight of an animal will only be below 0 if the unknown information content on this particular animal and its offspring is below 0, such that there is no individual information observed. This would be the case in a young animal, where all observed information came from ancestors and parents of this animal. To avoid double counting, these animals were eliminated from the analysis because they did not contribute individual information. After de-regression and filtering, a total of 965, 936, and 938 animals remained for the traits BF, D250, and LEA, respectively.
Estimation of genomic relationship matrix:
The genomic relationship matrix was estimated from observed or imputed high-density (~41K) SNP genotypes. Genotypes were expressed as allelic dosage, which is the number of copies of the minor allele, such that genotypes were entered into a marker matrix W as a decimal number in the interval [0, 2]. We obtained matrix Z by subtracting twice the frequency of the minor allele ($p_i$) from the corresponding columns of W (VanRaden 2008). The genomic relationship matrix was then calculated as:
$$G = \frac{ZZ'}{2\sum_i p_i(1 - p_i)} \qquad (1)$$
where $2\sum_i p_i(1 - p_i)$ is a normalizing constant (Wang et al. 2012) that sums expected variances across markers, scaling G toward the numerator relationship matrix (VanRaden 2008). The allele frequency $p_i$ was obtained using all available animals (N = 983). Average relatedness between animals was obtained from the row/column vectors of G. We quantified relatedness in this study as the average of the top 10 relationships observed within the G matrix (rel10). The choice of the top 10 as opposed to another number is arbitrary but driven by the fact that each animal had a very limited number of close and distant relatives in the training set (Figure S1). Moreover, other studies have used this measure and proposed its inclusion in future work on genomic selection to promote comparability (Daetwyler et al. 2013).
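As a minimal sketch of equation (1) and of the top 10 relatedness measure, the following Python code builds G from a small dosage matrix. The function names, the toy matrix, and the choice to exclude an animal's own diagonal element when averaging its largest relationships are assumptions made for illustration; they are not taken from the R package that accompanies this study.

```python
def genomic_relationship(W):
    """Build G = ZZ' / (2 * sum_i p_i (1 - p_i)) from a dosage matrix.

    W is a list of rows (one per animal) holding allelic dosages in [0, 2].
    """
    n = len(W)
    m = len(W[0])
    # Allele frequency of each marker, estimated from all animals in W.
    p = [sum(row[j] for row in W) / (2.0 * n) for j in range(m)]
    # Center each column by twice the allele frequency.
    Z = [[row[j] - 2.0 * p[j] for j in range(m)] for row in W]
    # Normalizing constant: sum of expected marker variances.
    k = 2.0 * sum(pj * (1.0 - pj) for pj in p)
    return [[sum(Z[a][j] * Z[b][j] for j in range(m)) / k
             for b in range(n)] for a in range(n)]


def top_relatedness(G, i, top=10):
    """Average of the `top` largest relationships of animal i.

    Assumption: the animal's own diagonal element is excluded.
    """
    others = sorted((G[i][j] for j in range(len(G)) if j != i), reverse=True)
    best = others[:top]
    return sum(best) / len(best)


# Toy example: 4 animals, 4 markers (animals 0 and 1 are identical).
W_toy = [[0, 1, 2, 1],
         [0, 1, 2, 1],
         [2, 1, 0, 1],
         [1, 2, 1, 0]]
G_toy = genomic_relationship(W_toy)
rel1_animal0 = top_relatedness(G_toy, 0, top=1)
```

Because the allele frequencies are taken from the same animals that enter W, every row of G sums to zero in this sketch, so a G built this way is singular.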
Implementation of prediction model:
Using the genomic relationship matrix from equation (1), an animal-centric model for genomic evaluations can be written as:
$$y = 1\mu + a + e \qquad (2)$$
where y is the vector of dEBV, μ is the overall mean, a is the vector of n animal effects, $a \sim N(0, G\sigma_a^2)$, and e is a vector of random residuals, $e \sim N(0, R\sigma_e^2)$. The variance of the dEBV is $\mathrm{Var}(y) = G\sigma_a^2 + R\sigma_e^2$, where R is a diagonal matrix with diagonal elements $1/w_i$, the inverse of the weights of the dEBV (VanRaden et al. 2011). Equivalently, the information in G can also be included in the incidence matrix of the animal effects a as follows (Vazquez et al. 2010):
$$y = 1\mu + Ca^* + e \qquad (3)$$
where C is the Cholesky decomposition of G, such that G = CC′, μ is the overall mean, a* is the vector of animal effects with $a^* \sim N(0, I\sigma_a^2)$, noticing that a = Ca*, and e is a vector of residual effects such that $e \sim N(0, R\sigma_e^2)$. The variance terms for models (2) and (3) are equal, such that the two models are in fact equivalent if variance components are assumed known. Likewise, when estimating the parameters under these two models, we found virtually identical results, but model (3) was computationally more efficient, resulting in a twofold reduction in compute time (results not shown). The BLR package (Pérez et al. 2010) in R (R Development Core Team 2011) was used to fit the mixed model equations. Model parameters $\sigma_a^2$ and $\sigma_e^2$ were sampled from their corresponding full conditional distributions using a Gibbs sampler. Prior distributions were elicited based on equations presented by Pérez et al. (2010). The prior distributions of $\sigma_a^2$ and $\sigma_e^2$ were inverse χ2 distributions with degrees of freedom df and scale S. To ensure proper priors with finite expectations, we set df = 3. The scale parameters were obtained as a function of the df and assuming values of the genetic variance (Va) and error variance (Ve) (Pérez et al. 2010):
where , is the average inbreeding coefficient, set equal to 1 in this case, assuming no inbreeding. Heritability was assumed to be h2 = 0.5, such that after the value for Ve was arbitrarily set to 0.4, Va was obtained as Va = h2 Ve/(1 - h2) = 0.4. The Gibbs sampler implemented in BLR (Pérez et al. 2010) was used to obtain a total of 100,000 samples, 10,000 of which were discarded as burn-in. The reported estimates of $\sigma_a^2$, $\sigma_e^2$, animal effects (a*), and GEBV were based on the posterior means of the remaining 90,000 iterations. We assessed convergence of the Markov chain Monte Carlo method as well as sensitivity to priors to ensure robustness of estimates (results not shown).
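The equivalence between models (2) and (3) rests on the factorization G = CC′ and the mapping a = Ca*. The following Python sketch illustrates both steps on a toy matrix. The helper names and the 3 × 3 matrix are hypothetical, and the sketch assumes a positive definite G; how a singular or near-singular G was handled in the original analysis is not stated here.

```python
import math


def cholesky_lower(G):
    """Return the lower-triangular C with C C' = G.

    Assumes G is symmetric positive definite; math.sqrt raises
    ValueError if a pivot becomes negative.
    """
    n = len(G)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(C[i][k] * C[j][k] for k in range(j))
            if i == j:
                C[i][j] = math.sqrt(G[i][i] - s)
            else:
                C[i][j] = (G[i][j] - s) / C[j][j]
    return C


def times_own_transpose(C):
    """Compute C C'."""
    n = len(C)
    return [[sum(C[i][k] * C[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matvec(M, v):
    """Compute the matrix-vector product M v."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in M]


# Toy positive definite relationship matrix (hypothetical values).
G_toy = [[1.00, 0.50, 0.25],
         [0.50, 1.00, 0.50],
         [0.25, 0.50, 1.00]]
C_toy = cholesky_lower(G_toy)
G_back = times_own_transpose(C_toy)   # should reproduce G_toy

# Effects on the a* scale map back to animal effects through a = C a*.
a_star = [1.0, -2.0, 0.5]
a = matvec(C_toy, a_star)
```

Since Var(Ca*) = C I C′ σ²a = G σ²a, effects estimated on the a* scale carry the same covariance structure as a in model (2).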
Genomic prediction under cross-validation
Accuracy of genomic evaluation was estimated in a 10-fold cross-validation design. Approximately 10% of the animals were randomly assigned to a validation panel (V) in which predictions would be made, whereas the remaining 90% were used as the training panel (T) to estimate the parameters necessary for prediction. A total of 10 separate datasets were created such that each animal would be used for validation once. Across cross-validation datasets we fit model (3) to the training animals; we refer to that subset by adding a subindex T:
$$y_T = 1\mu + C_T a^*_T + e_T \qquad (4)$$
to estimate the BLUP of (VanRaden et al. 2011):
(5) |
where the matrices G and C are partitioned into block structure such that
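$$G = \begin{bmatrix} G_T & G_{TV} \\ G_{VT} & G_V \end{bmatrix}, \qquad C = \begin{bmatrix} C_T & 0 \\ C_{VT} & C_V \end{bmatrix} \qquad (6)$$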
The relation between the BLUP for a based on model (2) and the BLUP for a* based on model (3) can be expressed as:
$$\hat{a}_T = C_T \hat{a}^*_T \qquad (7)$$
The GEBVs of training animals in model (2) were computed as:
(8) |
Subsequently, the GEBVs of the validation animals were estimated from using the following equation:
(8) |
where , , and are estimated using model (4), which is equivalent to applying model (3) to the training animals.
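The cross-validation design described above can be sketched as follows. This is an illustrative Python fragment, not the code used in the study: the function names and the seed are hypothetical, and animals are represented by integer indices.

```python
import random


def assign_folds(animal_ids, k=10, seed=2014):
    """Randomly split animal ids into k folds of near-equal size."""
    ids = list(animal_ids)
    random.Random(seed).shuffle(ids)
    # Taking every k-th id after shuffling gives sizes that differ by at most 1.
    return [ids[f::k] for f in range(k)]


def train_validation_splits(animal_ids, k=10, seed=2014):
    """Return k (training, validation) pairs; each animal is validated once."""
    folds = assign_folds(animal_ids, k, seed)
    splits = []
    for f in range(k):
        validation = folds[f]
        training = [a for g in range(k) if g != f for a in folds[g]]
        splits.append((training, validation))
    return splits


# 983 genotyped sires, 10 folds of 98 or 99 validation animals each.
splits = train_validation_splits(range(983), k=10)
fold_sizes = [len(validation) for _, validation in splits]
```

In each split, the model is fit to the training animals only, and GEBV are then predicted for the validation animals of that fold.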
Estimation of accuracy:
Accuracy of genomic evaluation is the correlation between the estimated GEBV and the unknown true breeding values (TBVs) (Hayes et al. 2009a). However, the TBVs are unknown. Consequently, the accuracy of genomic evaluation has to be approximated using the available information. Hayes et al. (2009a) proposed to express the correlation between GEBV and TBV as a function of the correlation between GEBV and EBV:
$$r_{GEBV,TBV} = \frac{r_{GEBV,EBV}}{\sqrt{r^2_{EBV}}} \qquad (9)$$
where $r^2_{EBV}$ is the estimated reliability of the EBV. VanRaden et al. (2009) replaced $r^2_{EBV}$ with the arithmetic mean of the reliability of the EBV. Daetwyler et al. (2013) proposed to report a simple Pearson correlation coefficient between GEBV and EBV to allow for comparability of results across studies. We estimate accuracy of genomic evaluation as the Pearson correlation coefficient between GEBV and EBV (r(GEBV, EBV)) and, to facilitate such comparison, as the Pearson correlation coefficient adjusted for the average accuracy of the EBV ($r_{GEBV,EBV}/\bar{r}_{EBV}$).
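The two accuracy measures can be illustrated with a short Python sketch. It assumes that the adjustment divides the Pearson correlation by the mean EBV accuracy of the validation animals, which is consistent with the values reported in Table 2; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def adjusted_accuracy(r_gebv_ebv, ebv_accuracies):
    """Divide r(GEBV, EBV) by the mean accuracy of the EBV (assumed form)."""
    mean_accuracy = sum(ebv_accuracies) / len(ebv_accuracies)
    return r_gebv_ebv / mean_accuracy


# Worked example with the BF values of scenario 1 in Table 2:
# r(GEBV, EBV) = 0.6810 and an average EBV accuracy of 0.8510.
bf_adjusted = adjusted_accuracy(0.6810, [0.8510])
```

For BF this gives a value of about 0.80, close to the 0.7998 reported in Table 2; the small difference may reflect rounding or averaging across cross-validation datasets.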
Accuracies of individual GEBV were obtained analogous to the accuracy of EBV in an animal model (Goddard et al. 2011) through inversion of the mixed model equations (Mrode 2005; VanRaden 2008; VanRaden et al. 2009; Strandén and Garrick 2009; Clark et al. 2012). The accuracy of of the model (2) can be expressed as (Mrode 2005; Strandén and Garrick 2009; Clark et al. 2012):
(10) |
This equation and its derivation can be found in Strandén and Garrick (2009) and VanRaden (2008) and was used to estimate the accuracy of individual GEBV for validation animals.
Genotype imputation:
LD-based genotype imputation was performed with BEAGLE version 3.3.1 (Browning and Browning 2009). We used the standard settings for BEAGLE: 10 iterations of the phasing algorithm, drawing four samples per iteration. Previous results from our group (Badke et al. 2013) and other studies (Hayes et al. 2012) showed negligible improvement in imputation accuracy as a result of an increase in iterations or samples per iteration. The SNP on the 10K SNP chip [6890 SNP after filtering for minor allele frequency (MAF) and missing rate] were used as tagSNP to impute the 60K SNP (41,248 after filtering).
We implemented two separate imputation experiments that differ in the size of the high-density reference panel used for imputation: (1) a reference panel of 128 Yorkshire haplotypes or (2) a reference panel combining the 128 Yorkshire haplotypes with the haplotypes of all animals that are part of the training panel (~1700 additional haplotypes) in the respective cross-validation dataset. To assess the effect of genotype imputation on genomic prediction we considered the following four scenarios: (1) the reference scenario in which genomic evaluation was based on observed genotypes in training and validation animals, (2) genomic evaluation based on observed genotypes in the training animals and genotypes imputed from a large reference panel (~1800 haplotypes) in the validation animals, (3) genomic evaluation based on observed genotypes in the training animals and genotypes imputed from a small reference panel (128 haplotypes) in the validation animals, and (4) genomic evaluation based on imputed genotypes in training and validation animals using a small (128 haplotypes) but representative reference panel for imputation. All genotype imputation and subsequent estimation of imputation accuracy were implemented using the R package impute.R (Badke et al. 2013). To compare average accuracy of genomic evaluation across these four scenarios, we fitted a linear model with the average accuracy of genomic evaluation as response variable and the genotype imputation scenario as independent variable, adding the cross-validation dataset in which accuracy of genomic evaluation was estimated as a random blocking factor.
Results
Accuracy of genomic evaluation and GEBV using observed genotypes
When genotypes were observed in both training and prediction animals, the accuracy of genomic evaluation, measured as the weighted mean of the Pearson correlation coefficient between EBV and predicted GEBV across 10 cross-validation datasets, was 0.68, 0.66, and 0.65 for BF, D250, and LEA, respectively (Table 2). When the measure of accuracy was adjusted for the average reliability of the EBV of the training animals, the observed accuracy of genomic evaluation was 0.80, 0.82, and 0.76 for BF, D250, and LEA, respectively (Table 2).
Table 2. Estimates of accuracy for genomic evaluation and individual GEBV across imputation scenarios.
Trait | Scenario^a | Imputation accuracy^b | rEBV, GEBV^c | rEBV^d | Adjusted rEBV, GEBV | rGEBV | 95% HPD^e |
---|---|---|---|---|---|---|---|
BF | 1 | (1, 1) | 0.6810^1 | 0.8510 | 0.7998 | 0.6852 | [0.5395, 0.8211] |
BF | 2 | (1, 0.95) | 0.6795^1 | | 0.7981 | 0.6861 | [0.5467, 0.8164] |
BF | 3 | (0.88, 0.88) | 0.6598^2 | | 0.7749 | 0.7014 | [0.5727, 0.8267] |
BF | 4^f | (1, 1) | 0.7210 | | 0.8405 | 0.8560 | [0.8174, 0.8768] |
D250 | 1 | (1, 1) | 0.6603^1 | 0.8020 | 0.8229 | 0.6575 | [0.5073, 0.7948] |
D250 | 2 | (1, 0.95) | 0.6555^1,2 | | 0.8170 | 0.6585 | [0.5187, 0.7962] |
D250 | 3 | (0.88, 0.88) | 0.6463^2 | | 0.8054 | 0.6750 | [0.5345, 0.7985] |
D250 | 4^f | (1, 1) | 0.5354 | | 0.6550 | 0.8438 | [0.8048, 0.8704] |
LEA | 1 | (1, 1) | 0.6516^1 | 0.8529 | 0.7639 | 0.6859 | [0.5386, 0.8325] |
LEA | 2 | (1, 0.95) | 0.6491^1 | | 0.7610 | 0.6868 | [0.5377, 0.8214] |
LEA | 3 | (0.88, 0.88) | 0.6364^2 | | 0.7461 | 0.7040 | [0.5667, 0.8330] |
LEA | 4^f | (1, 1) | 0.7165 | | 0.8201 | 0.8549 | [0.8223, 0.8787] |
GEBV, genomic breeding value; EBV, estimated breeding values; HPD, highest posterior density; BF, backfat thickness; D250, number of days to 250 lb; LEA, loin muscle area.
^a Scenarios 1: all observed genotypes, 2: genotypes in prediction animals imputed with large reference haplotype panel (~1800), 3: genotypes in prediction animals imputed with small haplotype reference panel (128), and 4: validation animals with at least one close relative in the reference panel.
^b Accuracy of genotype imputation R2 for training and validation animals: (training, validation).
^c Tukey honest significant difference post-hoc comparison of accuracy of genomic evaluation across imputation scenarios.
^d Average accuracy of EBV in the validation panel.
^e 95% HPD interval of GEBV accuracy across validation animals.
^f Scenario with young animals in the validation panel that almost all have at least one close relative in the training panel.
^1,2 Means with different superscripts differ significantly according to Tukey post-hoc tests with α = 0.05.
We observed a significant difference between the estimates of accuracy of genomic evaluation across 10 randomly assigned cross-validation datasets for three traits (Table 3). That variation across cross-validation datasets was partially explained by a significant effect of the average EBV accuracy of validation animals on accuracy of genomic evaluation (Table 3) in three traits and a significant effect of top 10 relatedness on accuracy of genomic evaluation in D250. In general, D250 had slightly lower average EBV accuracy due to an increased frequency of EBV with intermediate accuracy (rEBV close to 0.6, Figure S2). As expected, this resulted in slightly lower correlation of EBV and GEBV because the ‘true value’ (EBV) is subject to more uncertainty. Another source of difference of accuracy of genomic evaluation across cross-validation datasets could be the population structure. This would be revealed through differences in estimated variance components. We did not expect differences in variance components estimated from randomly assigned validation datasets. We confirmed this assumption by studying the distribution of estimated heritability and included the obtained results in Figure S3. We observed that the posterior distributions of heritabilities did not change across folds. Conversely, in the presence of population structure, the relationships of animals of different cross-validation datasets will change (depending on who else is in the training set), and we expect that to affect the estimate of heritability.
Table 3. Significance of variables affecting accuracy of genomic evaluation.
Trait | dataset^a: F^d | dataset^a: p | rel10^b: F^e | rel10^b: p | mean rEBV^c: F^e | mean rEBV^c: p |
---|---|---|---|---|---|---|
BF | 258 | < 0.001 | 2.83 | 0.1013 | 11.73 | 0.0016** |
D250 | 229 | < 0.001 | 5.18 | 0.0291* | 7.238 | 0.0109* |
LEA | 311 | < 0.001 | 2.06 | 0.1605 | 3.430 | 0.0725 |
EBV, estimated breeding values; BF, backfat thickness; D250, number of days to 250 lb; LEA, loin muscle area.
^a Accuracy of genomic evaluation was estimated for a total of 10 randomly assigned datasets of the cross-validation, such that we could assess whether accuracy of genomic evaluation was significantly different across these 10 datasets.
^b Accuracy of genomic evaluation by average of the top 10 genomic relationship estimates of animals in the validation set.
^c Accuracy of genomic evaluation by average accuracy of EBV of validation animals by cross-validation dataset.
^d df = c(9, 27).
^e df = c(1, 35).
*P < 0.05, **P < 0.01.
The average accuracy of the genomic evaluation and the assessment of the accuracy of individual GEBV using equation 10 are equally important in a practical implementation of genomic selection. Average accuracy of individual GEBV was 0.69, 0.66, and 0.69 for BF, D250, and LEA, respectively, with a 95% highest posterior density interval ranging from roughly 0.51 to 0.80 across all traits (Table 2).
As can be seen in Figure 1, the accuracy of GEBV (rGEBV) and accuracy of EBV (rEBV) are not linearly related. The accuracy of EBV was higher than the estimated accuracy of GEBV for most animals in three traits, especially when rEBV > 0.8. For a few animals with rEBV between 0.4 and 0.8, the accuracy of GEBV was higher than their respective EBV accuracy. Hypothetically, individual differences in rGEBV can be explained by the presence or absence of relatives of the predicted animal in the training set (Clark et al. 2012; Pérez-Cabal et al. 2012). We investigated this assertion in two ways: (1) by computing average rGEBV for animals with different numbers of relatives in the training panel and (2) by regressing rGEBV on the average top 10 relatedness in the genomic relationship matrix. Following Pérez-Cabal et al. (2012), we defined close relatives as sires and full sibs and distant relatives as maternal grand sires and half sibs. We found that increasing the number of close relatives from one to four in the training panel increased average rGEBV by about 0.1 (Figure 2) across the three traits in this study (from an average of rGEBV = 0.63 to rGEBV = 0.73 regardless of the trait considered). The presence of distant relatives in the training set also resulted in an increase of rGEBV of similar magnitude when comparing individuals without any distant relative to individuals with at least five distant relatives in the training set (Figure 2). A similar relationship was observed when comparing rGEBV with the average relationship to the 10 most-related individuals in the training set. We observed an almost linear increase in rGEBV as top 10 relatedness increased (Figure 2), which was statistically significant (P < 0.01). To further investigate the effect of relatedness between training and validation animals, we selected the youngest 87 animals (approximately 10% of the population), which included 82 animals with at least a sire or a grand-sire in the training panel.
We repeated genomic evaluation with this validation panel and estimated the accuracy of GEBV. As expected, average accuracy of GEBV for this validation panel was higher than the average observed across the cross-validation datasets with 0.72 for BF and LEA, and 0.54 for D250. However, when looking at the range of accuracies observed for all 10 cross-validation datasets these values do not exceed the maximum accuracy observed. One interesting finding was that estimates of individual accuracy, or accuracy of GEBV predicted through the genomic relationship matrix, were much larger than the observed accuracy of genomic evaluation in all three traits (Table 2). Goddard et al. (2011) proposed to use this measure of accuracy of individual GEBV when using them for selection but also to screen for animals whose GEBV could be expected to be highly accurate. Our results show that while it is true that individuals with close relatives in the training panel will have on average more accurate GEBV, the individual accuracies obtained from the G matrix would be overestimated.
Effect of genotype imputation on accuracy of genomic evaluation and GEBV
Accuracy of imputation (R2) for each animal was measured as the squared correlation between the observed and imputed allelic dosage across all SNP (Badke et al. 2013). Average accuracy of imputation was R2 = 0.88 for the scenario using a small (128) haplotype reference panel, and it increased to R2 = 0.95 when a larger reference panel (~1800 haplotypes) was used. In our previous study (Badke et al. 2013), we found that increasing the size of the reference panel led to improved imputation, especially of SNP that appear difficult to impute, such as SNP with low (<0.1) MAF and those located in the chromosomal extremes. These results were reproduced in this study (Figure S4). For BF we found that the average accuracy of genomic evaluation under scenario 2 (rGEBV, EBV = 0.68), where genotypes in the validation animals were imputed with high accuracy (R2 = 0.95), was not significantly different from the accuracy (rGEBV, EBV = 0.68) estimated in the reference scenario, where all genotypes were observed. However, average accuracy of genomic evaluation was significantly lower (rGEBV, EBV = 0.66) when genotypes were imputed in both training and validation with lower accuracy (R2 = 0.88) using a small reference panel of haplotypes (scenario 3). For D250, there was no significant difference in accuracy of genomic evaluation between the reference scenario (rGEBV, EBV = 0.66) and the scenario where genotypes were imputed in the validation animals (Table 2). However, when genotypes were imputed in both training and validation (scenario 3), the accuracy of genomic evaluation was significantly lower (rGEBV, EBV = 0.65). For LEA there was also no difference in accuracy of genomic evaluation between the reference scenario (rGEBV, EBV = 0.65) and scenario 2 (rGEBV, EBV = 0.65). There was a significant decrease in accuracy of genomic evaluation when genotypes were imputed with lower accuracy (R2 = 0.88) in scenario 3 (rGEBV, EBV = 0.63).
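The per-animal imputation accuracy defined at the start of this section can be written as a short Python sketch. The function name and the dosage vectors are hypothetical and serve only to illustrate the squared correlation between observed and imputed allelic dosage.

```python
import math


def imputation_r2(observed, imputed):
    """Squared Pearson correlation between observed and imputed dosages
    of one animal across SNP."""
    n = len(observed)
    mo = sum(observed) / n
    mi = sum(imputed) / n
    sxy = sum((o - mo) * (i - mi) for o, i in zip(observed, imputed))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((i - mi) ** 2 for i in imputed)
    if sxx == 0 or syy == 0:
        # The correlation is undefined when either vector is constant.
        raise ValueError("dosage vector has zero variance")
    r = sxy / math.sqrt(sxx * syy)
    return r * r


# Hypothetical animal: observed genotypes and imputed allelic dosages.
observed_dosage = [0, 1, 2, 1, 0, 2]
imputed_dosage = [0.1, 0.9, 1.8, 1.2, 0.0, 2.0]
r2_animal = imputation_r2(observed_dosage, imputed_dosage)
```

Averaging this quantity over animals gives a population-level summary comparable to the R2 values quoted above.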
To assess the effect of genotype imputation on the results of a genomic evaluation, we compared the top 5% sires (n = 46), ranked by their estimated GEBV across imputation scenarios. Again, scenario 1 was used as a reference scenario to compare how many of the top 5% ranked animals were also top ranked under the imputation scenarios. The proportion of top 5% ranked sires that were conserved when genotypes were imputed in validation animals with high accuracy (scenario 2) was 0.96 for BF and 0.98 for D250 and LEA. When genotypes were imputed with low accuracy in training and validation, the proportion of top 5% sires conserved in comparison with the reference design showed a small decrease compared with the design with only validation animals imputed for BF (0.88) and for D250 (0.89), and a more substantial decrease for LEA (0.81). Accuracy of individual GEBV is estimated using the genomic relatedness between training and validation animals. Using genotypes imputed with high accuracy (R2 = 0.95) the estimated rGEBV remained constant in all traits, compared with estimates obtained from observed genotypes. Accuracy of imputation was correlated with rGEBV (Figure S5). However, this does not imply that high imputation accuracy caused an increase in rGEBV. Another possibility is that genotypes from animals with relatives in the reference panel will be imputed with high accuracy and their GEBV will also be predicted more accurately. We believe that this was the case for our population because the correlation between GEBV and EBV did not differ significantly when imputation was used (Table 2, compare scenario 1 and 2). Moreover, when genotypes were imputed with less accuracy (R2 = 0.88), the observed accuracy of GEBV was increased even with respect to the reference scenario (Table 2, compare scenario 3 to 1 and 2). This result is counterintuitive, and we investigated the reason for this increase. 
Examining the estimation procedure for rGEBV, we found that the increase was due to smaller estimates of the diagonal elements of the genomic relationship matrix between the validation elements (GV) in the scenario with all imputed genotypes. This is the result of all animals imputed conditional on a small reference panel looking genetically more similar than they really are (because they are all imputed toward the haplotype frequencies in the small panel). Those diagonal elements of G were used to scale values of rGEBV (equation 10), and smaller values in the denominator resulted in the larger estimates of rGEBV we saw for animals in scenario 3. Comparing unscaled values of rGEBV, individual accuracy was higher in the reference scenario for all animals.
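The comparison of top-ranked sires across imputation scenarios described above can be sketched in Python as follows. The function name and the toy GEBV are hypothetical. The sketch assumes that larger GEBV are better; for traits where smaller values are desirable, the values would need to be negated before ranking.

```python
def proportion_top_conserved(reference_gebv, scenario_gebv, fraction=0.05):
    """Share of the top-ranked animals in the reference scenario that are
    also top ranked in another scenario.

    Both arguments map animal id -> GEBV and must contain the same ids.
    """
    n_top = max(1, int(len(reference_gebv) * fraction))

    def top_ids(gebv):
        ranked = sorted(gebv, key=lambda animal: gebv[animal], reverse=True)
        return set(ranked[:n_top])

    conserved = top_ids(reference_gebv) & top_ids(scenario_gebv)
    return len(conserved) / n_top


# Toy example: 100 animals, one top animal drops out after imputation.
reference = {animal: float(animal) for animal in range(100)}
imputed = dict(reference)
imputed[99] = -1.0   # best reference animal is ranked last in this scenario
share_conserved = proportion_top_conserved(reference, imputed)
```

In this toy case four of the five top animals are retained, so the conserved proportion is 0.8.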
Discussion
Accuracy of genomic evaluation and GEBV using observed genotypes
The size of the training population used to train the prediction equation in this study was small compared with previous genomic evaluations published in swine (Cleveland et al. 2010, 2012), and especially compared with studies applying genomic evaluation in European (Dassonneville et al. 2011) or US dairy cattle (Weigel et al. 2010; Wiggans et al. 2012). Observed accuracy of genomic evaluation in this study was in good agreement with previously published results for genomic evaluation in pigs, assessing five unspecified commercial traits with comparable heritability (Cleveland et al. 2012) and earlier results for two reproductive traits (Cleveland et al. 2010). Accuracy of genomic evaluation was high across three traits (BF: rGEBV = 0.6810; D250: rGEBV = 0.6603; LEA: rGEBV = 0.6516). In addition, we report accuracy adjusted for the fact that the Pearson correlation between EBV and GEBV will underestimate the true quantity of interest (Luan et al. 2009). Assessing the variation in accuracy of genomic evaluation across datasets of the cross-validation, we found that the of the validation animals and their relatedness to the training animals were significantly associated to the average accuracy of genomic evaluation. Higher accuracy of genomic evaluation of prediction animals with close relatives in the training population (Habier et al. 2010; Clark et al. 2012) and within closely related populations, with relatively small effective population size, has been previously reported (Daetwyler et al. 2013). Accuracy of genomic evaluation in this study was high despite the limited number of animals available for training and the inclusion of animals with relatively low EBV accuracy. Furthermore, we obtained accurate genomic predictions using an equivalent model fitting the genomic relationship matrix instead of a marker based matrix (Hayes et al. 2009b), thereby greatly reducing the computational load. 
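The equivalence between a marker-based ridge regression and an animal-centric model can be checked numerically. The sketch below is an unweighted toy version under stated assumptions: the genotypes, the response `y`, and the ridge parameter `lam` are simulated or arbitrary, and no weights for de-regressed breeding values are applied, so it is not the model fitted in this study.

```python
import numpy as np

rng = np.random.default_rng(1)
n_animals, n_markers, lam = 30, 200, 50.0

# Toy centered genotype matrix Z (animals x markers) and toy response y.
genotypes = rng.integers(0, 3, size=(n_animals, n_markers)).astype(float)
Z = genotypes - genotypes.mean(axis=0)
y = rng.normal(size=n_animals)

# Marker-centric ridge regression: solve an (n_markers x n_markers) system.
marker_effects = np.linalg.solve(Z.T @ Z + lam * np.eye(n_markers), Z.T @ y)
gebv_marker = Z @ marker_effects

# Animal-centric form: solve an (n_animals x n_animals) system instead.
K = Z @ Z.T
gebv_animal = K @ np.linalg.solve(K + lam * np.eye(n_animals), y)

# Largest absolute difference between the two sets of predictions.
max_abs_diff = float(np.max(np.abs(gebv_marker - gebv_animal)))
```

Both routes give the same predictions up to floating-point error because Z(Z'Z + λI)⁻¹Z' = ZZ'(ZZ' + λI)⁻¹. With many more markers than animals, the animal-centric system is far smaller, which is where the reduction in computational load comes from.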
We expect that the accuracy of genomic evaluation in this population, and in other US swine populations with comparable population structure and LD (Badke et al. 2012), will be sufficient for commercial implementation and could be further increased through the inclusion of additional training animals with highly accurate EBV.
Besides assessing the accuracy of genomic evaluation, we also reported accuracies for individual GEBV. The accuracy of GEBV is important because it can influence selection decisions. Moreover, as proposed by Goddard et al. (2011), rGEBV can also be approximated prior to the implementation of genomic evaluation and used to inform the design of genomic selection in a population. The main difference between rGEBV and r(GEBV, EBV) is that r(GEBV, EBV) is indicative of the average accuracy of GEBV in a population, whereas rGEBV gives a measure of accuracy of each individually estimated GEBV. As expected, we observed that accuracy of GEBV increased with increased relatedness between the animal and the training panel. An interesting finding was that under a low accuracy imputation scenario, rGEBV was overestimated compared with r(GEBV, EBV). We traced this back to the diagonal elements of the genomic relationship matrix and attributed it to an artifact of the imputation using a small reference panel.
Several previous studies in other populations and simulation experiments also showed the importance of relatedness for the prediction of accurate GEBV (Habier et al. 2010; Clark et al. 2012), especially when the training population was small (Wientjes et al. 2013) as was the case in our study. In addition, we observed that accuracy of GEBV was higher than accuracy of EBV for only a few animals that had mostly low accuracy of EBV. This finding is further supported by previous reports that implementation of genomic evaluation would be most beneficial for young animals with little information on their own and subsequently low accuracy of traditional EBV (VanRaden 2008).
Effect of genotype imputation on accuracy of genomic evaluation and GEBV
Genotype imputation is an efficient tool to decrease the cost of obtaining high-density genotypes for selection candidates. One of the goals of this study was to quantify the loss of accuracy of genomic evaluation if GEBV were estimated from imputed rather than observed genotypes in selection candidates. Comparing accuracy of genomic evaluation across three scenarios of genotype imputation, we found that for all three traits there was no significant loss of accuracy of genomic prediction if genotypes in validation animals were imputed with high accuracy (R2 = 0.95) instead of observed. However, accuracy of genomic evaluation decreased in comparison with the reference scenario when genotypes in training and prediction animals were imputed with lower overall accuracy (R2 = 0.88). Previously published results support that, although it is not feasible to implement genomic prediction based on low-density genotypes (Habier et al. 2009; Cleveland et al. 2010), the accuracy of genomic evaluation is still sufficient for practical implementation when genotypes in selection candidates are accurately imputed to high density (Weigel et al. 2010; Cleveland and Hickey 2013). In addition, several studies support that an increase in imputation accuracy will generate genomic evaluations with nearly identical or even higher accuracy compared with that obtained from observed genotypes (Dassonneville et al. 2011; Wiggans et al. 2012; Cleveland and Hickey 2013) because the cost efficiency of low-density genotypes allows a much larger proportion of the population to be included in the genomic evaluation procedure (Wiggans et al. 2012).
In conclusion, an implementation of genomic selection based on observed genotypes for training of the prediction equation and GEBV predictions obtained from genotypes imputed with high accuracy appears to be a promising approach to provide the swine breeding industry with a cost-efficient procedure to obtain GEBV for animals at a young age. A recent study assessing the accuracy of genomic evaluation using high-density genotypes and various imputation schemes in a commercial pig population further supports these findings (Cleveland and Hickey 2013).
We found that accuracy of individual GEBV was a linear function of the relatedness between a validation animal and the respective training set. As has been previously shown in the literature, animals that are highly related to the training population will have higher rGEBV (Habier et al. 2010; Clark et al. 2012). As shown in the last scenario, however, when all selection candidates had at least one close relative in the training population, rGEBV overestimates the accuracy observed for the genomic evaluation (r(EBV, GEBV)). Although this measure certainly has value to rank animals according to how trustworthy estimated GEBV are, it is likely overestimated for candidates with close relatives.
The other case in which we observed overestimated individual accuracy of GEBV (rGEBV) pertains to the last of the imputation scenarios where genotypes were imputed in training and prediction animals. Specifically, when genotypes were imputed in training and prediction animals with lower accuracy, the average rGEBV was larger than the accuracy of genomic evaluation, which we found was an artifact of lower estimates of the diagonal elements of the G matrix. This was caused by a decrease in the variance of the allelic dosage of imputed genotypes due to the relatively small number of reference haplotypes available. When the variance of imputed allelic dosages was decreased, the deviation from the expected value estimated from MAF (2p) also decreased, causing overall smaller estimates of Z and the resulting diagonal elements of the G matrix. This increase in the homogeneity of allelic dosages in the imputed genotypes causes the observed inflation in accuracy of estimated GEBV, such that in any case when GEBV are obtained from imputed genotypes the estimated accuracy of GEBV should be used with caution. The average GEBV accuracy notably exceeded the expected accuracy of genomic evaluation in that scenario.
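The mechanism described above can be illustrated with a small simulation. This is a sketch under explicit assumptions, not the procedure used in the study: G is taken to have the form ZZ'/(2Σp(1 − p)) with Z holding dosages minus 2p, and imputation with a small reference panel is mimicked by pulling every dosage toward its expected value 2p with a hypothetical `shrinkage` factor. No actual imputation is performed.

```python
import numpy as np

def genomic_relationship_diagonal(dosages, allele_freq):
    """Diagonal of G = ZZ' / (2 * sum(p * (1 - p))) with Z = dosages - 2p."""
    Z = dosages - 2.0 * allele_freq
    scale = 2.0 * np.sum(allele_freq * (1.0 - allele_freq))
    return np.sum(Z * Z, axis=1) / scale

rng = np.random.default_rng(7)
n_animals, n_markers = 100, 500

# Toy observed genotypes coded 0/1/2 and allele frequencies estimated from them.
true_freq = rng.uniform(0.1, 0.9, size=n_markers)
observed = rng.binomial(2, true_freq, size=(n_animals, n_markers)).astype(float)
allele_freq = observed.mean(axis=0) / 2.0

# Hypothetical stand-in for low-accuracy imputation: dosages keep their mean (2p)
# but their deviations from it are shrunk, so their variance is reduced.
shrinkage = 0.8
imputed = 2.0 * allele_freq + shrinkage * (observed - 2.0 * allele_freq)

diag_observed = genomic_relationship_diagonal(observed, allele_freq)
diag_imputed = genomic_relationship_diagonal(imputed, allele_freq)
```

Under this toy model every diagonal element is reduced by the factor `shrinkage ** 2`, so any accuracy measure that divides by those diagonal elements becomes larger even though no information was gained.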
In conclusion, we found that results for the accuracy of GEBV further support the notion that genomic evaluation using high-density genotypes imputed with high accuracy for selection candidates is a feasible method to implement a cost-efficient design for genomic selection in swine. When genotypes were imputed with lower accuracy in training and prediction animals, the accuracy of genomic evaluation was significantly decreased, and estimates of accuracy of GEBV were inflated. From our results, we can affirm that starting a genomic evaluation using low-density genotypes and a small panel of high-density haplotypes will result in reduced accuracy of evaluation. Conversely, once an evaluation is established with a large number of animals genotyped using a high-density platform, the addition of more animals genotyped at low density is promising. Further research is needed to study the effect of adding those imputed animals to the training population in further model retraining. As mentioned previously, all code and data used in this paper have been made available through an R package, accessible at: http://tinyurl.com/MSURGEBV.
Acknowledgments
This project was supported by Agriculture and Food Research Initiative Competitive Grant no. 2010-65205-20342 from the US Department of Agriculture National Institute of Food and Agriculture, US Department of Agriculture grant no. 2011-67015-30338, and by funding from the National Swine Registry. Estimated breeding values and accuracies for phenotypic traits, as well as tissue samples for DNA, were also provided by the National Swine Registry. Computer resources and programming advice were provided by the Michigan State University High Performance Computing Center.
Footnotes
Communicating editor: D.-J. De Koning
Literature Cited
- Badke Y. M., Bates R. O., Ernst C. W., Schwab C., Steibel J. P., 2012. Estimation of linkage disequilibrium in four US pig breeds. BMC Genomics 13: 24.
- Badke Y. M., Bates R. O., Ernst C. W., Schwab C., Fix J., et al., 2013. Methods of tagSNP selection and other variables affecting imputation accuracy in swine. BMC Genet. 14: 8.
- Browning B. L., Browning S. R., 2009. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84: 210–223.
- Clark S. A., Hickey J. M., Daetwyler H. D., van der Werf J. H. J., 2012. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44: 4.
- Cleveland M. A., Hickey J. M., 2013. Practical implementation of cost-effective genomic selection in commercial pig breeding using imputation. J. Anim. Sci. 91: 3583–3592.
- Cleveland M. A., Forni S., Garrick D. J., Deeb N., 2010. Prediction of genomic breeding values in a commercial pig population, pp. 0266 in 9th World Congress on Genetics Applied to Livestock Production, August 1−6, 2010, Leipzig, Germany.
- Cleveland M. A., Hickey J. M., Forni S., 2012. A common dataset for genomic analysis of livestock populations. G3 (Bethesda) 2: 429–435.
- Daetwyler H. D., Calus M. P. L., Pong-Wong R., de Los Campos G., Hickey J. M., 2013. Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics 193: 347–365.
- Dassonneville R., Brondum R. F., Druet T., Fritz S., Guillaume F., et al., 2011. Effect of imputing markers from a low-density chip on the reliability of genomic breeding values in Holstein populations. J. Dairy Sci. 94: 3679–3686.
- Dekkers J. C. M., Mathur P. K., Knol E. F., 2010. Genetic Improvement of the Pig, pp. 390–425 in The Genetics of the Pig, Ed. 2, edited by Rothschild M. F., Ruvinsky A. CABI, Cambridge, MA.
- Garrick D. J., Taylor J. F., Fernando R. L., 2009. Deregressing estimated breeding values and weighting information for genomic regression analyses. Genet. Sel. Evol. 41: 55.
- Goddard M. E., Hayes B. J., Meuwissen T. H. E., 2011. Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 128: 409–421.
- Gualdrón Duarte J. L., Bates R. O., Ernst C. W., Raney N. E., Cantet R. J., et al., 2013. Genotype imputation accuracy in a F2 pig population using high density and low density SNP panels. BMC Genet. 14: 38.
- Habier D., Fernando R. L., Dekkers J. C. M., 2009. Genomic selection using low-density marker panels. Genetics 182: 343–353.
- Habier D., Tetens J., Seefried F.-R., Lichtner P., Thaller G., 2010. The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genet. Sel. Evol. 42: 5.
- Hayes B. J., Bowman P. J., Chamberlain A. J., Goddard M. E., 2009a. Invited review: genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92: 433–443.
- Hayes B. J., Visscher P. M., Goddard M. E., 2009b. Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res. 91: 47–60.
- Hayes B. J., Bowman P. J., Daetwyler H. D., Kijas J. W., van der Werf J. H. J., 2012. Accuracy of genotype imputation in sheep breeds. Anim. Genet. 43: 72–80.
- Huang Y., Hickey J. M., Cleveland M. A., Maltecca C., 2012. Assessment of alternative genotyping strategies to maximize imputation accuracy at minimal cost. Genet. Sel. Evol. 44: 25.
- Luan T., Woolliams J. A., Lien S., Kent M., Svendsen M., et al., 2009. The accuracy of genomic selection in Norwegian red cattle assessed by cross-validation. Genetics 183: 1119–1126.
- Meuwissen T. H., Hayes B. J., Goddard M. E., 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
- Mrode R., 2005. Linear Models for the Prediction of Animal Breeding Values, Ed. 2. CABI Publishing, Oxfordshire, UK.
- Pérez P., de los Campos G., Crossa J., Gianola D., 2010. Genomic-enabled prediction based on molecular markers and pedigree using the Bayesian linear regression package in R. Plant Genome 3: 106–116.
- Pérez-Cabal M. A., Vazquez A. I., Gianola D., Rosa G. J. M., Weigel K. A., 2012. Accuracy of genome-enabled prediction in a dairy cattle population using different cross-validation layouts. Front. Genet. 3: 27.
- Ramos A. M., Crooijmans R. P. M. A., Affara N. A., Amaral A. J., Archibald A. L., et al., 2009. Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS One 4: e6524.
- R Development Core Team, 2011. R: A Language and Environment for Statistical Computing. The R Foundation for Statistical Computing, Vienna, Austria.
- Strandén I., Garrick D. J., 2009. Technical note: derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J. Dairy Sci. 92: 2971–2975.
- Tribout T., Larzul C., Phocas F., 2012. Efficiency of genomic selection in a purebred pig male line. J. Anim. Sci. 90: 4164–4176.
- VanRaden P. M., 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.
- VanRaden P. M., Tassell C. P. V., Wiggans G. R., Sonstegard T. S., Schnabel R. D., et al., 2009. Invited review: reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci. 92: 16–24.
- VanRaden P. M., O’Connell J. R., Wiggans G. R., Weigel K. A., 2011. Genomic evaluations with many more genotypes. Genet. Sel. Evol. 43: 10.
- Vazquez A. I., Bates D. M., Rosa G. J. M., Gianola D., Weigel K. A., 2010. Technical note: an R package for fitting generalized linear mixed models in animal breeding. J. Anim. Sci. 88: 497–504.
- Wang H., Misztal I., Aguilar I., Legarra A., Muir W. M., 2012. Genome-wide association mapping including phenotypes from relatives without genotypes. Genet. Res. 94: 73–83.
- Weigel K. A., de los Campos G., Vazquez A. I., Rosa G. J. M., Gianola D., et al., 2010. Accuracy of direct genomic values derived from imputed single nucleotide polymorphism genotypes in Jersey cattle. J. Dairy Sci. 93: 5423–5435.
- Wientjes Y. C. J., Veerkamp R. F., Calus M. P. L., 2013. The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. Genetics 193: 621–631.
- Wiggans G. R., Vanraden P. M., Cooper T. A., 2011. The genomic evaluation system in the United States: past, present, future. J. Dairy Sci. 94: 3202–3211.
- Wiggans G. R., Cooper T. A., VanRaden P. M., Olson K. M., Tooker M. E., 2012. Use of the Illumina Bovine3K BeadChip in dairy genomic evaluation. J. Dairy Sci. 95: 1552–1558.