Abstract
Combining breeds in a multibreed evaluation can have a negative impact on prediction accuracy, especially if single nucleotide polymorphism (SNP) effects differ among breeds. The aim of this study was to evaluate the use of a multibreed genomic relationship matrix (G), where SNP effects are considered to be unique to each breed, that is, nonshared. This multibreed G was created by treating SNP of different breeds as if they were on nonoverlapping positions on the chromosome, although, in reality, they were not. This simple setup may avoid spurious identity-by-state (IBS) relationships between breeds and automatically considers breed-specific allele frequencies. This scenario was contrasted to a regular multibreed evaluation where all SNPs were shared, that is, the same position, and to single-breed evaluations. Different SNP densities (9k and 45k) and different effective population sizes (Ne) were tested. Five breeds mimicking recent beef cattle populations that diverged from the same historical population were simulated using different selection criteria. It was assumed that quantitative trait locus (QTL) effects were the same over all breeds. For the recent population, generations 1–9 had approximately half of the animals genotyped, whereas all animals in generation 10 were genotyped. Generation 10 animals were set for validation; therefore, each breed had a validation group. Analyses were performed using single-step genomic best linear unbiased prediction. Prediction accuracy was calculated as the correlation between true breeding values (TBV) and genomic estimated breeding values (GEBV). Accuracies of GEBV were lower for the larger Ne and low SNP density. All three evaluation scenarios using 45k resulted in similar accuracies, suggesting that the marker density is high enough to account for relationships and linkage disequilibrium with QTL. A shared multibreed evaluation using 9k resulted in a decrease of accuracy of 0.08 for a smaller Ne and 0.12 for a larger Ne. This loss was mostly avoided when markers were treated as nonshared within the same G matrix. A G matrix with nonshared SNP enables multibreed evaluations without considerably changing accuracy, especially with limited information per breed.
Keywords: across-breed evaluation, genomic selection, marker effect model, single nucleotide polymorphism–best linear unbiased prediction
Introduction
Genomic evaluations have become common in animal breeding due to the possibility of improving the rate of genetic gain (Schaeffer, 2006). Traditionally, evaluations are done within pure breeds. There is an interest in multibreed evaluations in cattle and sheep where there are many breeds and crosses, as well as chickens and pigs with many lines. Combining breeds increases the training population, which can potentially enhance the accuracy of genomic predictions (Hayes and Goddard, 2008) and is reasonably simple. It also allows the sharing of resources such as funding, specialists, and infrastructure, which is especially attractive and practical for small breeds or countries. This simplicity may come at the expense of accuracy for some or all breeds. Many studies have struggled to find an advantage of using multibreed evaluations, sometimes obtaining a slightly increased accuracy but often unchanged or slightly decreased (Hayes et al., 2009; Erbe et al., 2012; Olson et al., 2012; Makgahlela et al., 2014; Calus et al., 2018).
The linkage disequilibrium (LD) between markers and quantitative trait loci (QTL) differs among breeds and does not persist across breeds (De Roos et al., 2009). Capturing LD contributes more to the accuracy of prediction than tracking relationships through markers (Habier et al., 2013), making it essential to have dense-enough markers to capture LD in diverse populations. This extra genotyping cost could offset the savings gained by sharing other resources. Avoiding a loss of accuracy with low-density markers would, therefore, be beneficial.
Whether LD persists across populations depends on independent chromosome segments (ICS). Genomic selection in the absence of QTL identification is based on the estimation of these segments (Goddard, 2009). Larger segments mean that more markers will be in LD with the QTL and, therefore, fewer single nucleotide polymorphisms (SNPs) are required. The number of ICS (Me) is expected to be 4NeL, where Ne is the effective population size and L is the length of the genome in Morgans (Stam, 1980). Populations with a smaller Ne will have fewer, larger ICS and, therefore, require fewer SNP markers to obtain the same accuracy as those with larger Ne (Pocrnic et al., 2016a). Combining different breeds or lines together in a single evaluation should increase genetic diversity and Ne, requiring more SNPs to trace all segments and avoid loss of accuracy. In fact, Pocrnic et al. (2019) showed that about 30% of the chromosome segments were independent between two different pig lines, which caused a reduction in prediction accuracy across the lines.
Even when markers and QTL are in LD, the QTL and allele substitution effects can differ among breeds (Thaller et al., 2003). Differences in QTL minor allele frequencies (MAF) also affect the accuracy of prediction (Wientjes et al., 2015) and differ in populations due to selection and genetic drift.
The objective of this study was to evaluate the accuracy of multibreed genomic prediction when using the same SNP effects for all breeds and when obtaining breed-specific SNP effects by treating the markers as nonshared in populations with different Ne and genotyping density.
Materials and Methods
Simulated Data
Five different breeds were simulated using QMSim (Sargolzaei and Schenkel, 2009) from a historical population that was randomly mated for 1,000 generations. The historic population started with 10,000 animals, decreased to 1,000 at generation 500 to create LD, and reached 9,000 animals at generation 1,000. Different numbers of founders were selected from this population, using different selection criteria, to create the breeds (i.e., recent populations) or distant lines. Each dam had only one progeny.
Within each breed, animals were randomly mated without artificial selection for 40 generations, after which half of the animals in the last generation were used to initiate the selected population. Selection based on high estimated breeding value (EBV) for a single trait was applied for 10 generations using different mating designs and proportions of replacement males and females. Some breeds differed in their number of initial animals. This led to slightly different breed sizes, which created different effective population sizes (Ne). The trait heritability was 0.30 and phenotypes were from a normal distribution with a mean of 0 and variance of 1. Two simulations were done where one had double the number of initially selected animals compared to the other to create breeds with a larger Ne. The summary of parameters used for each breed is presented in Table 1.
Table 1.
Summary of parameters used to simulate the five different breeds for the evaluation
Simulation | Breed A | Breed B | Breed C | Breed D | Breed E |
---|---|---|---|---|---|
Sire replacement | 0.50 | 0.50 | 0.60 | 0.60 | 0.50 |
Dam replacement | 0.20 | 0.30 | 0.30 | 0.20 | 0.20 |
Mating design | Random | Random | + Assortative | Random | − Assortative |
Smaller Ne | |||||
Ne | 98 | 118 | 117 | 98 | 117 |
Initial males | 25 | 30 | 30 | 25 | 30 |
Initial females | 1,200 | 1,500 | 1,200 | 1,500 | 1,200 |
Final data | 13,225 | 16,530 | 13,230 | 16,525 | 13,230 |
Genotyped | 6,600 | 6,900 | 6,600 | 6,900 | 6,600 |
Larger Ne | |||||
Ne | 196 | 236 | 234 | 197 | 234 |
Initial males | 50 | 60 | 60 | 50 | 60 |
Initial females | 2,400 | 3,000 | 2,400 | 3,000 | 2,400 |
Final data | 26,450 | 33,060 | 26,460 | 33,050 | 26,460 |
Genotyped | 7,800 | 8,400 | 7,800 | 8,400 | 7,800 |
The genome was simulated assuming 29 chromosomes of varying length, resulting in a total genome length of 23 Morgans, with 1,000 QTL evenly distributed among them. The QTL effects were sampled from a Gamma distribution with a shape parameter of 0.4. After quality control of genotypes, 45,000 segregating SNPs with MAF >0.05 were retained for analysis. There was a mutation rate for markers and QTLs of 2.5 × 10⁻⁵ per generation per locus. The QTL effects were assumed to be the same over all the breeds but the QTLs had different frequencies in each population. This caused a difference in variance explained for each breed. The largest QTL variance in generation 10 of each breed did not exceed 0.02. The average LD measured as the pooled square of the correlation between markers (r²) was between 0.14 and 0.18 within breed. Few QTL became fixed in the breeds using a MAF of <0.001 as the criterion.
The genotyped animals consisted of all animals in generation 10 and 600 randomly selected animals from each of the previous 9 generations. The difference in the size of the 10th generation led to slightly different numbers of genotyped animals per breed (Table 1). Two different SNP densities were used in this study: 45k and a subset of 9k. The 9k SNP panel was created by selecting every fifth marker from the full 45k panel.
The simulation was replicated four times and the entire process is visually explained in Figure 1, including the total number of animals (genotyped or not) in the 10 generations that had undergone selection. The different selection criteria for the breeds and the resulting number of animals and Ne in the data/pedigree and genotype file for both small and large Ne are presented in Table 1. A principal component analysis (PCA) based on 45k was done to visualize the separation between breeds. The Ne for each breed was calculated using the following formula by Wright (1931):
$$N_e = \frac{4 N_m N_f}{N_m + N_f}$$
where Nm and Nf are the number of breeding males and females in each generation. This method assumes no selection (Table 1). Various other methods exist to calculate Ne.
Figure 1.
Visual presentation of the simulated data. The historic population of 10,000 animals was mated randomly for 1,000 generations, undergoing a bottleneck in generation 500. Founder animals for five breeds were selected and mated randomly for 40 generations followed by 10 generations of selection, resulting in different breed sizes and selected number of genotyped animals.
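As a quick check, Wright's sex-ratio formula, Ne = 4NmNf/(Nm + Nf), can be evaluated for the founder numbers in Table 1. The following is a minimal sketch; the function name `wright_ne` and the dictionary layout are illustrative and not part of the original analysis.

```python
def wright_ne(n_males: int, n_females: int) -> float:
    """Effective population size from the numbers of breeding males and females."""
    return 4.0 * n_males * n_females / (n_males + n_females)


# Initial males and females for three breeds, taken from Table 1 (smaller Ne).
founders = {"A": (25, 1200), "B": (30, 1500), "C": (30, 1200)}

ne_by_breed = {b: wright_ne(nm, nf) for b, (nm, nf) in founders.items()}
for breed, ne in ne_by_breed.items():
    print(f"Breed {breed}: Ne = {ne:.1f}")
```

Rounded to whole animals, these values agree with the Ne row of Table 1 for the smaller Ne scenario.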
Model and Analyses
A single-trait animal model was fitted for traditional pedigree-based and genomic evaluations:
$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{u} + \mathbf{e}$$
where y is a vector of simulated phenotypes, μ is an overall mean, 1 is a vector of ones, u is a vector of additive genetic effects, Z is an incidence matrix relating y to the effects in u, and e is a vector of random residuals.
Single-step genomic best linear unbiased prediction (ssGBLUP) was used with BLUPF90 software (Misztal et al., 2014) for analyses of all breeds, both separately and together to obtain genomic estimated breeding values (GEBV). Single-step GBLUP is simple to apply, avoids double counting, accounts for preselection on Mendelian sampling, and allows the inclusion of genotyped and nongenotyped animals in the same evaluation. In ssGBLUP, the inverse of the pedigree relationship matrix (A−1), regularly used in BLUP, is replaced with the inverse of the realized relationship matrix (H−1) as demonstrated by Aguilar et al. (2010). This H−1 combines A−1 with the inverse of the genomic relationship matrix (G−1):
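$$\mathbf{H}^{-1} = \mathbf{A}^{-1} + \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{G}^{-1} - \mathbf{A}_{22}^{-1} \end{bmatrix}$$
In this case, G was obtained using the formula
$$\mathbf{G} = \frac{\mathbf{M}\mathbf{M}'}{2\sum_i p_i (1 - p_i)}$$
where M is a centered matrix of marker content adjusted for allele frequencies and pi is the allele frequency for SNP i (VanRaden, 2008). The pedigree-based relationship matrix between genotyped animals is referred to as A22. To reduce bias due to the different genetic levels of genotyped and nongenotyped animals, G is tuned using the constant α, where
$$\alpha = \frac{1}{n^2}\left(\sum_i \sum_j \mathbf{A}_{22,ij} - \sum_i \sum_j \mathbf{G}_{ij}\right)$$
(Vitezica et al., 2011) and n is the number of genotyped animals. To avoid singularity problems, G was multiplied by 0.95 and A22 by 0.05 before combining them.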
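The construction of G described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the BLUPF90 implementation: the genotypes are random toy data, A22 is replaced by an identity matrix (unrelated, non-inbred animals), and tuning is applied by adding α to every element of G before blending.

```python
import numpy as np

rng = np.random.default_rng(1)
n_animals, n_snp = 30, 500

# Toy genotypes coded 0/1/2; real data would come from the SNP file.
p_true = rng.uniform(0.1, 0.9, n_snp)
geno = rng.binomial(2, p_true, size=(n_animals, n_snp)).astype(float)

# VanRaden (2008): center by twice the allele frequency, scale by 2*sum(p*(1-p)).
p = geno.mean(axis=0) / 2.0
M = geno - 2.0 * p
G = M @ M.T / (2.0 * np.sum(p * (1.0 - p)))

# Placeholder A22 (assumption); a real A22 is computed from the pedigree.
A22 = np.eye(n_animals)

# Tuning constant: average difference between all elements of A22 and G.
alpha = (A22.sum() - G.sum()) / n_animals**2
G_tuned = G + alpha

# Blend with A22 to avoid singularity, then invert for use in H-inverse.
G_blend = 0.95 * G_tuned + 0.05 * A22
G_inv = np.linalg.inv(G_blend)
```

After tuning, the mean of all elements of G equals the mean of all elements of A22, which is the property the constant α is meant to enforce.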
For validation purposes, phenotypes of the validation animals (generation 10) were removed before estimating GEBV. The resulting GEBV of generation 10 were correlated to the true breeding value (TBV) to obtain the accuracy of prediction. A Pearson correlation was used and bias was estimated as the regression coefficient when regressing TBV over GEBV.
The GEBVs were estimated by within-breed and multibreed evaluations using 9k and 45k SNP markers. The multibreed evaluations were done based on two different assumptions: 1) animals from different breeds shared the same SNP effects (shared); 2) animals from different breeds did not share the same SNP effects (nonshared). To achieve the first assumption, genotypes for animals from all breeds overlap and are stacked together prior to the evaluation. In the second assumption, SNPs are considered nonoverlapping. The SNP file is manipulated to achieve this. The SNPs of the same breed are in the same columns in the file. The SNPs of different breeds appear in different columns in the SNP file to create a nonshared scenario. Breed A is treated as if it had SNP markers from position 1 to 45k (or 1 to 9k) and missing markers for the remaining 180k (or 36k) SNPs. Breed B is treated as if its markers start at position 45,001 to 90k (or 9,001 to 18k) with all other markers as missing. The same pattern continues for breeds C, D, and E. This multiplies the number of columns in the SNP file by the number of breeds, even though all animals have SNP in the same physical position on the chromosome. Therefore, in the nonshared scenario, the newly created SNP file will have n × m markers, where n is the number of markers evaluated and m is the number of breeds. The treatment of the SNP file is graphically explained in Figure 2. The shared scenario results in a genomic relationship matrix in which all animals are related to each other (i.e., a dense matrix). The nonshared scenario results in a genomic relationship matrix where all animals within a breed are related, but animals from different breeds are not. This assumes a correlation of 0 between breeds. All missing SNPs are ignored to calculate allele frequency, which allows each breed to be centered around its own frequencies. This is graphically explained in Figure 3.
Figure 2.
A graphic presentation of the SNP file for the shared and nonshared scenarios using three hypothetical breeds (X, Y, and Z) corresponding to the three primary colors (red, blue, and yellow), and only three SNP markers for all animals. When SNP effects were shared, the number of SNPs in the file is three and all animals have nonmissing markers that overlap completely. When SNPs were treated as nonshared, the total number of SNPs in the file is nine (three SNPs × three breeds) and animals from each breed have six missing SNPs. Although physically all animals have SNPs in the same position on the chromosome, the file treats them as if they are in different, nonoverlapping positions.
Figure 3.
A visual presentation of the genomic relationship matrix (G) in a shared and nonshared scenario using three hypothetical breeds (X, Y, and Z) corresponding to three primary colors (red, blue, and yellow). In the shared scenario, the genotypes of all breeds are scaled to a single allele-frequency base (assuming a correlation of 1 between breeds) and all values are based on the combined information from all breeds. In the nonshared scenarios, there are no SNPs in common and, therefore, each breed is based and centered according to its own allele frequencies and animals from different breeds are not genetically correlated to each other. The G matrix has mostly zero elements.
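The manipulation of the SNP file for the nonshared scenario can be illustrated with a small NumPy sketch. The function names `nonshared_genotypes` and `g_matrix` are hypothetical, the genotypes are random toy data, and two choices are assumptions rather than details given in the text: missing genotypes are set to zero after centering, and a single scaling denominator is summed over all n × m columns.

```python
import numpy as np


def nonshared_genotypes(geno, breed):
    """Spread an (animals x n) genotype matrix over n*m columns, one block per breed."""
    breed = np.asarray(breed)
    breeds = sorted(set(breed))
    n_animals, n_snp = geno.shape
    wide = np.full((n_animals, n_snp * len(breeds)), np.nan)
    for k, b in enumerate(breeds):
        rows = np.flatnonzero(breed == b)
        wide[rows, k * n_snp:(k + 1) * n_snp] = geno[rows]
    return wide


def g_matrix(geno):
    """VanRaden-type G; allele frequencies ignore missing genotypes."""
    p = np.nanmean(geno, axis=0) / 2.0
    # Assumption: a missing genotype contributes zero after centering.
    M = np.nan_to_num(geno - 2.0 * p, nan=0.0)
    return M @ M.T / (2.0 * np.sum(p * (1.0 - p)))


rng = np.random.default_rng(3)
breed = np.repeat(["X", "Y", "Z"], 4)              # three hypothetical breeds
geno = rng.integers(0, 3, size=(12, 50)).astype(float)

wide = nonshared_genotypes(geno, breed)            # 12 animals x 150 columns
G_ns = g_matrix(wide)
print(np.abs(G_ns[:4, 4:]).max())                  # between-breed block of breed X
```

Because each column of the widened file holds genotypes for one breed only, every allele frequency is breed specific and all between-breed elements of G are zero. With a single denominator, each within-breed block is a scaled copy of that breed's own G; how scaling and tuning are handled in practice depends on the software.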
An across-breed analysis was also performed to determine whether the breeds are genetically distant. If breeds are closely related, it would be expected that the direct genomic values (DGV) of one breed can be predicted based on SNP solutions obtained from another breed, with similar accuracy as in single-breed evaluations. For these analyses, SNP effects for one breed were computed based on GEBV using the POSTGSF90 software (Misztal et al., 2014) with the following formula:
$$\hat{\mathbf{a}} = \lambda \mathbf{D}\mathbf{M}'\mathbf{G}^{-1}\hat{\mathbf{u}}$$
where â is a vector of estimated SNP effects, λ is the ratio of SNP to additive genetic variance, D is a diagonal matrix of weights (standardized variances) for SNP, M is a matrix of centered genotypes for each animal (VanRaden, 2008), and û is the vector of GEBV of the genotyped animals. Based on SNP effects, DGV for validation animals were calculated by PREDF90 (Misztal et al., 2014) as the sum of SNP effects weighted by the genotype content.
In across-breed evaluations, the SNP effects estimated based on one breed were used to calculate DGV of its own validation population and that of the other breeds. For example, SNP effects estimated using only the training population of breed A were used to calculate the DGV of the validation population of breed A and then the validation populations of breeds B, C, D, and E separately.
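The back-solving of SNP effects from GEBV and the calculation of DGV can be sketched as follows. This is not the POSTGSF90 or PREDF90 code: it assumes unit SNP weights (D = I), takes λ = 1/(2Σpi(1 − pi)), uses random values as a stand-in for GEBV, and centers both populations with the same assumed allele frequencies.

```python
import numpy as np

rng = np.random.default_rng(11)
n_animals, n_snp = 20, 200

p = rng.uniform(0.1, 0.9, n_snp)             # assumed base allele frequencies
geno = rng.binomial(2, p, size=(n_animals, n_snp)).astype(float)
M = geno - 2.0 * p                           # centered genotypes

lam = 1.0 / (2.0 * np.sum(p * (1.0 - p)))    # SNP-to-additive variance ratio (assumption)
D = np.ones(n_snp)                           # diagonal of D, unit weights
G = (M * D) @ M.T * lam

u_hat = rng.normal(size=n_animals)           # stand-in for GEBV of genotyped animals

# a_hat = lambda * D * M' * G^{-1} * u_hat
a_hat = lam * D * (M.T @ np.linalg.solve(G, u_hat))

# DGV: sum of SNP effects weighted by genotype content.
dgv = M @ a_hat

# Applying the same SNP effects to animals from another (toy) population.
geno_other = rng.binomial(2, p, size=(5, n_snp)).astype(float)
dgv_other = (geno_other - 2.0 * p) @ a_hat
```

When G is built from the same M, D, and λ and is invertible, the DGV of the animals used to estimate the SNP effects reproduce their GEBV, which is a useful consistency check on the back-solving step.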
Accuracy was computed as the Pearson correlation between TBV and GEBV or DGV, whereas bias was calculated as the regression coefficient when regressing TBV on GEBV. A coefficient below 1 indicates overdispersed (inflated) GEBV, and a coefficient above 1 indicates underdispersed (deflated) GEBV.
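The two validation statistics can be computed as in the minimal sketch below. The function name `accuracy_and_bias` and the simulated values are illustrative only; the example predictions are deliberately constructed to be too widely spread.

```python
import numpy as np


def accuracy_and_bias(tbv, gebv):
    """Pearson correlation and slope of the regression of TBV on GEBV."""
    tbv = np.asarray(tbv, dtype=float)
    gebv = np.asarray(gebv, dtype=float)
    accuracy = np.corrcoef(tbv, gebv)[0, 1]
    # Regression slope: cov(TBV, GEBV) / var(GEBV).
    bias = np.cov(tbv, gebv)[0, 1] / np.var(gebv, ddof=1)
    return accuracy, bias


rng = np.random.default_rng(7)
tbv = rng.normal(size=2000)
# Hypothetical predictions with too much spread relative to TBV.
gebv = 1.25 * (0.8 * tbv + 0.6 * rng.normal(size=2000))

acc, b1 = accuracy_and_bias(tbv, gebv)
print(f"accuracy = {acc:.2f}, regression coefficient = {b1:.2f}")
```

In this toy example, the regression coefficient falls below 1 because the predictions vary more than the true values they track.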
Results and Discussion
The genetic distance across the breeds using 45k can be observed in the PCA plots. Figure 4 is a 3D plot showing the first three principal components. Breed A and E showed an overlap, although breed E was more variable. This could be because breed E underwent negative assortative mating but founders in generation 0 of these two breeds were selected using similar criteria. The variance explained by the first five principal components averaged over replicates for the smaller Ne were 3.65%, 3.21%, 2.80%, 2.40%, and 0.56%, respectively. The values for the larger Ne were 2.38%, 2.08%, 1.84%, 1.50%, and 0.39%, respectively. The number of eigenvalues explaining 98% (EIGEN98) of the genomic variation for each breed and all combined are presented in Table 2. Considerably more eigenvalues were needed to explain 98% of the genomic variation in the combined evaluation, reflecting a more diverse population. The value is smaller than the sum of the required EIGEN98 for each individual breed, meaning the breeds are not completely independent. This was expected since they originated from the same historical population that had the same QTL effects.
Figure 4.
A principal component analysis with first, second, and third principal components (PCs) with the five different simulated breeds in one replicate using 45k SNP markers.
Table 2.
The number of eigenvalues explaining 98% of the variation of each breed and in a full multibreed scenario using 45k SNP markers when effective population size (Ne) is smaller and larger
Ne | Breed A | Breed B | Breed C | Breed D | Breed E | ABCDE |
---|---|---|---|---|---|---|
Smaller Ne | 3,780 | 3,627 | 3,280 | 3,764 | 3,737 | 13,024 |
Larger Ne | 5,072 | 5,189 | 4,661 | 5,172 | 5,073 | 18,059 |
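EIGEN98 can be obtained from the eigenvalues of G as sketched below. The function name `eigen98` is illustrative, and the small diagonal matrix is a toy example with known eigenvalues rather than a genomic relationship matrix.

```python
import numpy as np


def eigen98(G, fraction=0.98):
    """Number of largest eigenvalues needed to explain `fraction` of the variance of G."""
    eig = np.linalg.eigvalsh(G)[::-1]        # eigenvalues in descending order
    eig = np.clip(eig, 0.0, None)            # guard against tiny negative values
    cumulative = np.cumsum(eig) / eig.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)


# Toy matrix with eigenvalues 50, 30, 15, 4, and 1 (cumulative 50%, 80%, 95%, 99%, 100%).
toy = np.diag([50.0, 30.0, 15.0, 4.0, 1.0])
print(eigen98(toy))
```

For the toy matrix, the first three eigenvalues explain 95% of the variance and the first four explain 99%, so four eigenvalues are needed to reach 98%.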
Very poor predictive ability was observed across breed. Table 3 presents the within-breed predictive ability in the diagonal and across breed in the off-diagonal when using 45k in the smaller Ne scenario. The other SNP densities or Ne showed the same trend. The DGV within breed had an average accuracy of 0.70 ± 0.02 when using the scenario with a smaller Ne and 45k and 0.66 ± 0.01 with a larger Ne, whereas the average accuracy across breed was 0.11 ± 0.03 with a smaller Ne with 45k and 0.07 ± 0.02 with a larger Ne. These simulation results correspond to other studies (Hayes et al., 2009; Kizilkaya et al., 2010; Pryce et al., 2011; Olson et al., 2012; Kachman et al., 2013; Zhou et al., 2014), including Raymond et al. (2018), who found poor predictive ability in dairy cattle across breed and across country, even when using whole-genome sequence (WGS) information.
Table 3.
The correlation between TBV and DGV of the validation populations when 45k SNP effects based on one breed are used to predict within breed (diagonal) or to predict across breed (off-diagonal)1
Breed effects used (rows) / breed predicted (columns) | Breed A | Breed B | Breed C | Breed D | Breed E |
---|---|---|---|---|---|
Breed A | 0.67 ± 0.01 | 0.15 ± 0.04 | 0.11 ± 0.04 | 0.10 ± 0.03 | 0.15 ± 0.03 |
Breed B | 0.12 ± 0.02 | 0.71 ± 0.01 | 0.12 ± 0.04 | 0.11 ± 0.03 | 0.13 ± 0.03 |
Breed C | 0.09 ± 0.01 | 0.10 ± 0.02 | 0.72 ± 0.03 | 0.07 ± 0.04 | 0.13 ± 0.03 |
Breed D | 0.10 ± 0.02 | 0.16 ± 0.03 | 0.12 ± 0.04 | 0.73 ± 0.01 | 0.07 ± 0.03 |
Breed E | 0.13 ± 0.02 | 0.10 ± 0.02 | 0.11 ± 0.02 | 0.13 ± 0.04 | 0.69 ± 0.02 |
1Results are for the smaller effective population scenario.
The lack of predictive ability across breed further showed that breeds were genetically different. Within more homogeneous breeds, larger ICS will be present and SNP estimates will capture these segments (Goddard et al., 2011). Across breed, animals will share shorter segments, which are more difficult to estimate accurately. Therefore, information from one breed is limited for another even when the true SNP effects are the same (Khansefid et al., 2014). Correlations between estimated SNP effects of different breeds in this study were all lower than 0.05, regardless of the Ne.
Table 4 shows accuracies for breeds A–E with a smaller Ne and 9k and 45k SNP information when each breed was considered separately, when all breeds shared SNP effects, and when the SNP information for each breed was treated as nonshared. With 9k SNPs, that is, 80% of the SNPs masked, the accuracies with analyses for each breed separately were on average 0.05 lower than with 45k SNPs. The 9k SNPs are not enough to fully account for the genomic information provided by larger SNP panels, although the difference was relatively small. In a study by Luan et al. (2009), masking 75% of SNPs reduced the realized accuracy in Norwegian Red Cattle by 0.02. They postulated that this could partly be due to SNP markers clustering together with high LD in some regions of the chromosome.
Table 4.
Accuracies obtained for breeds A–E with smaller Ne using 9k and 45k SNP markers1
SNP density | Breed | Single breed Acc | Single breed SE | Multibreed shared Acc | Multibreed shared SE | Multibreed nonshared Acc | Multibreed nonshared SE |
---|---|---|---|---|---|---|---|
9k | Breed A | 0.63 | 0.01 | 0.54 | 0.01 | 0.63 | 0.01 |
9k | Breed B | 0.63 | 0.01 | 0.54 | 0.03 | 0.60 | 0.00 |
9k | Breed C | 0.70 | 0.03 | 0.63 | 0.06 | 0.72 | 0.04 |
9k | Breed D | 0.74 | 0.02 | 0.67 | 0.04 | 0.73 | 0.01 |
9k | Breed E | 0.57 | 0.01 | 0.48 | 0.03 | 0.58 | 0.01 |
9k | Average | 0.65 | 0.01 | 0.57 | 0.04 | 0.65 | 0.02 |
45k | Breed A | 0.67 | 0.01 | 0.68 | 0.01 | 0.67 | 0.01 |
45k | Breed B | 0.71 | 0.01 | 0.71 | 0.01 | 0.69 | 0.01 |
45k | Breed C | 0.72 | 0.03 | 0.72 | 0.02 | 0.71 | 0.03 |
45k | Breed D | 0.73 | 0.01 | 0.74 | 0.01 | 0.73 | 0.01 |
45k | Breed E | 0.70 | 0.02 | 0.71 | 0.02 | 0.69 | 0.02 |
45k | Average | 0.70 | 0.02 | 0.71 | 0.01 | 0.70 | 0.02 |
1Single-breed evaluations were performed as well as multibreed. For multibreed, SNP effects were first assumed to be the same in a shared scenario and then SNP effects were treated as different in a nonshared scenario.
The accuracies with 45k SNPs remained stable no matter how the evaluation was set up. Sharing SNP effects using 9k decreased the accuracy compared to single-breed analyses by an average of 0.08 when the Ne is smaller. Thus, with a limited number of SNPs, sharing SNPs among several breeds is not ideal.
Table 5 shows the results obtained with a larger Ne. On average, accuracies were lower than with the smaller Ne, as expected from the increase of Me (Daetwyler et al., 2010; Pocrnic et al., 2016b). With 9k SNPs, the average drop in accuracy from single breed to shared SNPs was 0.12 in the larger Ne, instead of 0.08. Sharing SNPs comes with a larger accuracy penalty when the population is more diverse. Accuracies still remained stable when SNPs were treated as nonshared, even with a drastically lower SNP density or when Ne was larger.
Table 5.
Accuracies obtained for breeds A–E with larger Ne using 9k and 45k SNP markers1
SNP density | Breed | Single breed Acc | Single breed SE | Multibreed shared Acc | Multibreed shared SE | Multibreed nonshared Acc | Multibreed nonshared SE |
---|---|---|---|---|---|---|---|
9k | Breed A | 0.61 | 0.01 | 0.51 | 0.01 | 0.62 | 0.01 |
9k | Breed B | 0.63 | 0.01 | 0.52 | 0.03 | 0.61 | 0.00 |
9k | Breed C | 0.63 | 0.03 | 0.49 | 0.06 | 0.63 | 0.04 |
9k | Breed D | 0.70 | 0.02 | 0.60 | 0.04 | 0.69 | 0.01 |
9k | Breed E | 0.60 | 0.01 | 0.49 | 0.03 | 0.59 | 0.01 |
9k | Average | 0.64 | 0.01 | 0.52 | 0.04 | 0.63 | 0.02 |
45k | Breed A | 0.65 | 0.02 | 0.66 | 0.02 | 0.65 | 0.02 |
45k | Breed B | 0.66 | 0.00 | 0.66 | 0.01 | 0.65 | 0.00 |
45k | Breed C | 0.73 | 0.03 | 0.71 | 0.03 | 0.73 | 0.04 |
45k | Breed D | 0.76 | 0.01 | 0.75 | 0.01 | 0.74 | 0.01 |
45k | Breed E | 0.60 | 0.01 | 0.60 | 0.01 | 0.60 | 0.01 |
45k | Average | 0.68 | 0.01 | 0.68 | 0.02 | 0.67 | 0.02 |
1Single-breed evaluations were performed as well as multibreed. For multibreed, SNP effects were first assumed to be the same in a shared scenario and then SNP effects were treated as different in a nonshared scenario.
Even though accuracies were the same on average, very small differences were observed for some breeds in both 45k and 9k evaluations. Such small changes can be attributed to the scaling of G before being combined with A22. A simulation study by Vitezica et al. (2011) found that bias exists in estimation when no adjustments are made to account for the fact that the genetic level of genotyped animals is different from that of the whole population, especially with strong selection. To account for this difference, G must be combined with a constant α, which is equal to the average difference between all elements of A22 and G. In the nonshared SNP scenario, all animals appear in these matrices and, therefore, the constant used for scaling is overall, instead of breed specific. Breed-specific adjustments could be done in theory; however, in practice, any partial change done in specific portions of G may result in nonpositive definiteness. It is important to note that these differences in accuracy are often in the third decimal and, thus, negligibly small. They may be bigger in reality where data structures are more complex and incomplete. In practical studies, the effect of scaling was limited, indicating weak selection on any individual trait with multitrait selection, but the scaling had some effect on biases and inflation of GEBV (Chen et al., 2011). Table 6 shows the average bias in all scenarios. When using 9k SNPs, GEBV were clearly inflated when SNPs were shared, especially when Ne was larger (regression coefficients of 0.87 and 0.76 for the smaller and larger Ne, respectively). The bias was essentially unchanged when SNPs were nonshared. The 45k scenario led to negligibly small changes, regardless of the method used.
Table 6.
Bias measured as the regression coefficient when TBV is regressed over GEBV for single-breed evaluations and multibreed evaluations using shared or nonshared SNP effects and 9k and 45k SNP markers in breeds with a smaller or larger effective population size (Ne)
SNP density | Ne | Single | Multibreed (shared) | Multibreed (nonshared) |
---|---|---|---|---|
9k | Smaller Ne | 0.96 ± 0.03 | 0.87 ± 0.03 | 0.94 ± 0.03 |
9k | Larger Ne | 0.92 ± 0.01 | 0.76 ± 0.03 | 0.89 ± 0.02 |
45k | Smaller Ne | 0.99 ± 0.03 | 0.98 ± 0.03 | 0.97 ± 0.03 |
45k | Larger Ne | 0.98 ± 0.02 | 0.94 ± 0.02 | 0.95 ± 0.02 |
Olson et al. (2012) reported that sharing three breeds (Holstein, Jersey, and Brown Swiss) using about 44k SNPs reduced accuracy compared to single-breed analyses using U.S. data. This suggests that larger populations may benefit from more SNPs, whereas smaller populations do not require as many; however, a point is reached where denser markers are not useful. It has been shown that marker densities of more than 50k generally do not show a remarkable increase in accuracy of multibreed evaluations, even when using approximately 600k (Erbe et al., 2012) or 700k (Su et al., 2012; Hozé et al., 2014).
In this study, accuracy was depressed with SNP sharing across breeds only with 9k but not with 45k SNP information. When 9k SNPs were shared, there was not enough information to estimate the ICS in the multibreed population. Pocrnic et al. (2016a) showed that combining breeds increases Ne, which requires more SNPs to accurately trace all chromosome segments segregating in the population. The Me as proposed by Stam (1980) can be approximated as the number of eigenvalues explaining 98% of the variance of G (EIGEN98; Pocrnic et al., 2016b). The EIGEN98 in this study for each breed separately were approximately 3.6k and 5k for the smaller and larger Ne, respectively (Table 2). These EIGEN98 do not correspond to the prediction of Stam (1980). This indicates that the Ne is lower than estimated by the formula of Wright (1931). There are various methods of estimating Ne, each of which has particular justifications or merit. Although the simulated scenario with larger Ne is not double that of the smaller one, it is still larger. When the breeds were pooled together and SNPs were shared, EIGEN98 were 13k for smaller Ne and 18k for larger Ne, meaning 9k SNPs were not enough to estimate all the chromosome segments segregating in the combined population.
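The comparison between the expectation of Stam (1980) and the observed EIGEN98 can be made explicit with a short calculation. This sketch uses the genome length of 23 Morgans from the simulation and, as an example, the Ne (Table 1) and EIGEN98 (Table 2) of breed A; the function name `expected_me` is illustrative.

```python
GENOME_LENGTH_MORGANS = 23   # total genome length used in the simulation


def expected_me(ne, length_morgans=GENOME_LENGTH_MORGANS):
    """Expected number of independent chromosome segments, Me = 4 * Ne * L."""
    return 4 * ne * length_morgans


# Ne of breed A (Table 1) and its EIGEN98 (Table 2) in both scenarios.
scenarios = [("smaller Ne", 98, 3780), ("larger Ne", 196, 5072)]
for label, ne, eigen98_observed in scenarios:
    me = expected_me(ne)
    ratio = eigen98_observed / me
    print(f"{label}: expected Me = {me}, EIGEN98 = {eigen98_observed}, ratio = {ratio:.2f}")
```

In both scenarios, the observed EIGEN98 is well below the expected Me, which is the discrepancy discussed above.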
The observed drop of 0.08 for the smaller Ne and 0.12 for the larger reflects a loss in accuracy that is dependent on the proportion of available SNPs and segregating ICS in the population under genomic selection. When breeds are pooled together but SNPs are not shared, the genomic information for each breed behaves independently and the chromosome segments are assumed to be segregating only within breed. This reflects the reality of multibreed evaluations if crossbreds are not included. The inclusion of crossbred animals when using multibreed G with nonshared SNPs is discussed later.
A question arises whether sharing SNPs at 45k level will reduce accuracy for larger data sets. For example, the Irish national evaluation uses a shared SNP model for over 40 breeds (Mantysaari et al., 2017) with a 54k chip. A 14-breed evaluation that includes the Simmental breed also uses a shared SNP model with less than 3k SNPs (Golden et al., 2018). If each breed has Me close to 10,000, even with 50% common segments, the combined Me assuming unrelated breeds could be close to 200,000 and 70,000, respectively, much larger than 18,000 reported in this study. Therefore, the impact of SNP sharing can be larger in real populations than observed here.
Pocrnic et al. (2019) looked at accuracy of genomic prediction in GBLUP when only a fraction of the largest eigenvalues of G was retained. With less phenotypic information, they found that little accuracy was gained when more than 10% of the total number of EIGEN98 was considered. With more phenotypic information, the accuracy reached a plateau when about 50% of the eigenvalues were considered. As more SNPs were needed to estimate more eigenvalues, the study suggests that larger data sets benefit from more SNPs.
Multibreed evaluations are used in the livestock improvement industry and some studies have shown increases in accuracy, even with the assumption of shared SNP effects. Literature where benefits were found showed that they were mostly small and inconsistent over traits, breeds, and methods (Hayes et al., 2009; Karoui et al., 2012; Olson et al., 2012; Makgahlela et al., 2013a; Hozé et al., 2014; Jónás et al., 2017). Olson et al. (2012) also treated three dairy breeds as different traits in a multitrait evaluation, which slightly increased the accuracy and prevented the largest breed from dominating the smaller breeds. However, this slight improvement did not justify the increased computational demand. Similar findings were made by Makgahlela et al. (2013a). In another study, Makgahlela et al. (2013b) adjusted G for breed-specific allele frequencies of a multibreed evaluation of Nordic Red cattle and Lourenco et al. (2016) did the same for a pig population. They all found that breed-specific G did not improve the validation accuracy. Zhou et al. (2014) accounted for LD phasing and breed-specific SNP effects by using weights in externally created within- and between-breed G-matrix blocks. Accuracies were not improved further compared to a multitrait approach. Khansefid et al. (2014) created different G for breeds, or combinations of breeds, and combined them with off-diagonals between breeds set to zero. Depending on the method of measuring accuracy and the grouping of animals, accuracies changed for some breeds and traits.
In general, realistic simulation of multibreed data is difficult because a more comprehensive simulation would involve dominance and epistasis and, consequently, QTL with different substitution effects (Spelman et al., 2002). In this simulation, QTL substitution effects were set to be equal among breeds. This assumes a genetic correlation of 1 between breeds, which is not the case in practice (Wientjes et al., 2017). Even though this assumption is unrealistic, the SNP markers had different allele frequencies, and their effects still differed among the breeds.
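Why dominance leads to breed-specific substitution effects can be seen in the standard single-locus model, in which the average effect of allele substitution is α = a + d(q − p), with p the frequency of the increasing allele and q = 1 − p. The sketch below is only an illustration of that textbook formula. The genotypic values and allele frequencies are hypothetical and do not describe the simulation used in this study.

```python
def substitution_effect(a: float, d: float, p: float) -> float:
    """Average effect of allele substitution, alpha = a + d * (q - p), with q = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p
    return a + d * (q - p)


# Same QTL (a = 1.0, d = 0.5) in two hypothetical breeds with different allele frequencies.
alpha_breed_1 = substitution_effect(a=1.0, d=0.5, p=0.2)
alpha_breed_2 = substitution_effect(a=1.0, d=0.5, p=0.8)

# Without dominance (d = 0) the substitution effect no longer depends on p.
alpha_additive = substitution_effect(a=1.0, d=0.0, p=0.8)
```

With purely additive gene action, as assumed in this simulation, the substitution effect is the same in every breed regardless of allele frequency.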
Scaling G When Sharing or Not Sharing SNP Effects
Aside from explicit scaling of relationships as in Vitezica et al. (2011), alternative options include using base population allele frequencies for each breed (Strandén and Christensen, 2011) or, in single-step Bayesian regression, fitting a fixed effect for each breed (Hsu et al., 2017). Scaling by base population frequencies is less obvious when populations are heterogeneous, that is, when parents are missing across generations. With shared SNP effects, a recently developed option is fitting “metafounders” (a special form of unknown parent groups) to each combination of breed and generation (Legarra et al., 2015). The advantage of metafounders depends on the quality of the parameter estimates for such a model.
Nonshared SNP Effects and Crossbred Animals
When sharing SNPs, all breeds and crossbred animals are considered together naturally. When SNPs are separate across breeds, considering crossbreds is more difficult. One possibility is to consider only F1 animals and use phasing to ascertain which alleles came from the sire and which from the dam (Xiang et al., 2016). In this case, an F1 would have twice as many SNPs as its parents. A more complex possibility for any crossbred would be to use weighted genotypes based on breed proportions. The use of crossbreds would be justified if they increased the accuracy of the purebreds. In a study in pigs involving two parental lines and crossbreds, the use of shared SNPs resulted in the same accuracy, and the addition of crossbreds did not improve purebred accuracy (Pocrnic et al., 2019).
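The F1 case can be sketched as follows. This is a hypothetical coding, not the model of Xiang et al. (2016): it assumes that phased haplotypes are available as 0/1 allele codes, that each breed owns one SNP block, and that empty fields are marked with None. How such rows would then be centered and scaled is left open.

```python
from typing import List, Optional, Sequence


def f1_nonshared_row(
    paternal: Sequence[int],
    maternal: Sequence[int],
    sire_breed: str,
    dam_breed: str,
    breeds: Sequence[str],
) -> List[Optional[int]]:
    """Place phased F1 alleles (0/1 per SNP) into the blocks of the two parental breeds.

    Blocks of all other breeds stay empty (None).
    """
    if len(paternal) != len(maternal):
        raise ValueError("haplotypes must have the same length")
    if sire_breed == dam_breed:
        raise ValueError("an F1 needs two different parental breeds")
    n_snp = len(paternal)
    row: List[Optional[int]] = [None] * (n_snp * len(breeds))
    for haplotype, breed in ((paternal, sire_breed), (maternal, dam_breed)):
        offset = list(breeds).index(breed) * n_snp
        row[offset:offset + n_snp] = list(haplotype)
    return row


# Hypothetical F1 from a breed A sire and a breed C dam, three SNPs per block.
breeds = ["A", "B", "C"]
row = f1_nonshared_row([1, 0, 1], [0, 0, 1], sire_breed="A", dam_breed="C", breeds=breeds)
```

The F1 row fills two blocks, whereas a purebred row would fill one, which is the doubling of SNPs described above.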
Implications
The use of nonshared SNPs as in this study, that is, by creating an SNP file with a separate block for each breed, may lead to large files if many breeds are considered. Possible solutions include compressed files where empty fields are not explicitly stored or separate genotype files per breed and breed combination. Using preselected markers based on importance for the traits can also drastically reduce the genotype file while even improving the accuracy of prediction (Erbe et al., 2012; Van den Berg et al., 2016; Raymond et al., 2018). Preselected SNPs for each breed separately can be combined in a single evaluation using this nonshared SNP method. Additionally, if a fraction of SNPs has a similar effect size among breeds, a combined G can be constructed that considers shared and nonshared SNPs. Overall, the use of nonshared SNPs may make the most sense when the number of genotyped animals is unequal among breeds. In such cases, sharing SNPs favors the largest breeds, possibly at the cost of lower accuracy for the smaller breeds.
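The block layout can be illustrated with a minimal sketch. It rests on several assumptions that are not confirmed by the study: genotypes are coded as 0/1/2 allele counts, each breed block is centered with that breed's own allele frequencies, empty fields contribute zero after centering, and each block is scaled by its own sum of 2p(1 − p). The function names and toy genotypes are hypothetical, and the scaling applied by the software used in the study may differ.

```python
from typing import Dict, List, Sequence, Tuple


def allele_freqs(genotypes: Sequence[Sequence[int]]) -> List[float]:
    """Per-SNP allele frequency from 0/1/2 allele counts of one breed."""
    n_animals = len(genotypes)
    return [sum(col) / (2.0 * n_animals) for col in zip(*genotypes)]


def nonshared_g(by_breed: Dict[str, List[List[int]]]) -> Tuple[List[str], List[List[float]]]:
    """Build G with one SNP block per breed; empty fields contribute zero."""
    breeds = sorted(by_breed)
    n_snp = len(by_breed[breeds[0]][0])
    labels: List[str] = []
    rows: List[List[float]] = []
    for b, breed in enumerate(breeds):
        freqs = allele_freqs(by_breed[breed])
        # Assumption: each breed block is scaled by its own sum of 2p(1 - p).
        scale = sum(2.0 * p * (1.0 - p) for p in freqs) ** 0.5
        for genotype in by_breed[breed]:
            z = [0.0] * (n_snp * len(breeds))  # other breeds' positions stay at zero
            for j, count in enumerate(genotype):
                z[b * n_snp + j] = (count - 2.0 * freqs[j]) / scale
            rows.append(z)
            labels.append(breed)
    g = [[sum(x * y for x, y in zip(r1, r2)) for r2 in rows] for r1 in rows]
    return labels, g


# Two hypothetical breeds, three animals each, three SNPs per block.
toy = {
    "A": [[0, 1, 2], [2, 1, 0], [1, 1, 1]],
    "B": [[1, 1, 0], [1, 2, 2], [0, 2, 1]],
}
labels, g = nonshared_g(toy)
```

Under these assumptions, relationships between animals of different breeds are exactly zero, and each within-breed block equals the G that a single-breed analysis would produce. The dense rows are used here only for readability; storing just the populated block per animal would avoid the file growth noted above.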
Conclusion
Sharing SNP effects among breeds is an easy way to perform genomic evaluation of multiple breeds and crossbreds but can result in reduction of accuracy as well as biases if scaling is incorrect and the number of markers is not sufficient. Remedies include increasing the number of SNPs and using appropriate scaling. Use of nonshared SNPs per breed can avoid reduction of accuracy even when marker density is low; however, accommodating crossbreds more complex than F1 may be difficult. A decision on whether to share or not to share SNP effects can be made based on validation and on computing viability.
Literature Cited
- Aguilar I., Misztal I., Johnson D., Legarra A., Tsuruta S., and Lawlor T. 2010. Hot topic: a unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J. Dairy Sci. 93:743–752. doi: 10.3168/jds.2009-2730
- Calus M. P. L., Goddard M. E., Wientjes Y. C. J., Bowman P. J., and Hayes B. J. 2018. Multibreed genomic prediction using multitrait genomic residual maximum likelihood and multitask Bayesian variable selection. J. Dairy Sci. 101:4279–4294. doi: 10.3168/jds.2017-13366
- Chen C. Y., Misztal I., Aguilar I., Legarra A., and Muir W. M. 2011. Effect of different genomic relationship matrices on accuracy and scale. J. Anim. Sci. 89:2673–2679. doi: 10.2527/jas.2010-3555
- Daetwyler H. D., Pong-Wong R., Villanueva B., and Woolliams J. A. 2010. The impact of genetic architecture on genome-wide evaluation methods. Genetics 185:1021–1031. doi: 10.1534/genetics.110.116855
- De Roos A. P., Hayes B. J., and Goddard M. E. 2009. Reliability of genomic predictions across multiple populations. Genetics 183:1545–1553. doi: 10.1534/genetics.109.104935
- Erbe M., Hayes B. J., Matukumalli L. K., Goswami S., Bowman P. J., Reich C. M., Mason B. A., and Goddard M. E. 2012. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J. Dairy Sci. 95:4114–4129. doi: 10.3168/jds.2011-5019
- Goddard M. E. 2009. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136:245–257. doi: 10.1007/s10709-008-9308-0
- Goddard M. E., Hayes B. J., and Meuwissen T. H. 2011. Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 128:409–421. doi: 10.1111/j.1439-0388.2011.00964.x
- Golden B. L., Spangler M. L., Snelling W. M., and Garrick D. J. 2018. Current single-step National Beef Cattle Evaluation Models used by the American Hereford Association and International Genetic Solutions, Computational Aspects, and Implications of Marker Selection. In Proceedings of the 11th Genetic Prediction Workshop, Beef Improvement Federation, Kansas City (KS), December 5–6.
- Habier D., Fernando R. L., and Garrick D. J. 2013. Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194:597–607. doi: 10.1534/genetics.113.152207
- Hayes B. J., Bowman P. J., Chamberlain A. C., Verbyla K., and Goddard M. E. 2009. Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet. Sel. Evol. 41:51. doi: 10.1186/1297-9686-41-51
- Hayes B. J., and Goddard M. E. 2008. Technical note: prediction of breeding values using marker-derived relationship matrices. J. Anim. Sci. 86:2089–2092. doi: 10.2527/jas.2007-0733
- Hozé C., Fritz S., Phocas F., Boichard D., Ducrocq V., and Croiseau P. 2014. Efficiency of multi-breed genomic selection for dairy cattle breeds with different sizes of reference population. J. Dairy Sci. 97:3918–3929. doi: 10.3168/jds.2013-7761
- Hsu W.-L., Garrick D. J., and Fernando R. L. 2017. The accuracy and bias of single-step genomic prediction for populations under selection. G3 7:2685–2694. doi: 10.1534/g3.117.043596
- Jónás D., Ducrocq V., Fritz S., Baur A., Sanchez M.-P., and Croiseau P. 2017. Genomic evaluation of regional dairy cattle breeds in single-breed and multibreed contexts. J. Anim. Breed. Genet. 134:3–13. doi: 10.1111/jbg.12249
- Kachman S. D., Spangler M. L., Bennett G. L., Hanford K. J., Kuehn L. A., Snelling W. M., Thallman R. M., Saatchi M., Garrick D. J., Schnabel R. D., et al. 2013. Comparison of molecular breeding values based on within- and across-breed training in beef cattle. Genet. Sel. Evol. 45:30. doi: 10.1186/1297-9686-45-30
- Karoui S., Carabaño M. J., Díaz C., and Legarra A. 2012. Joint genomic evaluation of French dairy cattle breeds using multiple-trait models. Genet. Sel. Evol. 44:39. doi: 10.1186/1297-9686-44-39
- Khansefid M., Pryce J. E., Bolormaa S., Miller S. P., Wang Z., Li C., and Goddard M. E. 2014. Estimation of genomic breeding values for residual feed intake in a multibreed cattle population. J. Anim. Sci. 92:3270–3283. doi: 10.2527/jas.2014-7375
- Kizilkaya K., Fernando R. L., and Garrick D. J. 2010. Genomic prediction of simulated multibreed and purebred performance using observed fifty thousand single nucleotide polymorphism genotypes. J. Anim. Sci. 88:544–551. doi: 10.2527/jas.2009-2064
- Legarra A., Christensen O. F., Vitezica Z. G., Aguilar I., and Misztal I. 2015. Ancestral relationships using metafounders: finite ancestral populations and across population relationships. Genetics 200:455–468. doi: 10.1534/genetics.115.177014
- Lourenco D. A., Tsuruta S., Fragomeni B. O., Chen C. Y., Herring W. O., and Misztal I. 2016. Crossbreed evaluations in single-step genomic best linear unbiased predictor using adjusted realized relationship matrices. J. Anim. Sci. 94:909–919. doi: 10.2527/jas.2015-9748
- Luan T., Woolliams J. A., Lien S., Kent M., Svendsen M., and Meuwissen T. H. 2009. The accuracy of genomic selection in Norwegian red cattle assessed by cross-validation. Genetics 183:1119–1126. doi: 10.1534/genetics.109.107391
- Makgahlela M. L., Mäntysaari E. A., Strandén I., Koivula M., Nielsen U. S., Sillanpää M. J., and Juga J. 2013a. Across breed multi-trait random regression genomic predictions in the Nordic Red dairy cattle. J. Anim. Breed. Genet. 130:10–19. doi: 10.1111/j.1439-0388.2012.01017.x
- Makgahlela M. L., Strandén I., Nielsen U. S., Sillanpää M. J., and Mäntysaari E. A. 2013b. The estimation of genomic relationships using breedwise allele frequencies among animals in multibreed populations. J. Dairy Sci. 96:5364–5375. doi: 10.3168/jds.2012-6523
- Makgahlela M. L., Strandén I., Nielsen U. S., Sillanpää M. J., and Mäntysaari E. A. 2014. Using the unified relationship matrix adjusted by breed-wise allele frequencies in genomic evaluation of a multibreed population. J. Dairy Sci. 97:1117–1127. doi: 10.3168/jds.2013-7167
- Mäntysaari E. A., Evans R. D., and Strandén I. 2017. Efficient single-step genomic evaluation for a multibreed beef cattle population having many genotyped animals. J. Anim. Sci. 95:4728–4737. doi: 10.2527/jas2017.1912
- Misztal I., Tsuruta S., Lourenco D., Aguilar I., Legarra A., and Vitezica Z. 2014. Manual for BLUPF90 family of programs. Athens (GA): University of Georgia.
- Olson K. M., VanRaden P. M., and Tooker M. E. 2012. Multibreed genomic evaluations using purebred Holsteins, Jerseys, and Brown Swiss. J. Dairy Sci. 95:5378–5383. doi: 10.3168/jds.2011-5006
- Pocrnic I., Lourenco D. A. L., Chen C. Y., Herring W. O., and Misztal I. 2019. Crossbred evaluations using single-step genomic BLUP and algorithm for proven and young with different sources of data. J. Anim. Sci. 97:1513–1522. doi: 10.1093/jas/skz042
- Pocrnic I., Lourenco D. A. L., Masuda Y., and Misztal I. 2016a. Dimensionality of genomic information and performance of the Algorithm for Proven and Young for different livestock species. Genet. Sel. Evol. 48:82.
- Pocrnic I., Lourenco D. A., Masuda Y., Legarra A., and Misztal I. 2016b. The Dimensionality of genomic information and its effect on genomic prediction. Genetics 203:573–581. doi: 10.1534/genetics.116.187013
- Pryce J. E., Gredler B., Bolormaa S., Bowman P. J., Egger-Danner C., Fuerst C., Emmerling R., Sölkner J., Goddard M. E., and Hayes B. J. 2011. Short communication: genomic selection using a multi-breed, across-country reference population. J. Dairy Sci. 94:2625–2630. doi: 10.3168/jds.2010-3719
- Raymond B., Bouwman A. C., Schrooten C., Houwing-Duistermaat J., and Veerkamp R. F. 2018. Utility of whole-genome sequence data for across-breed genomic prediction. Genet. Sel. Evol. 50:27. doi: 10.1186/s12711-018-0396-8
- Sargolzaei M., and Schenkel F. S. 2009. QMSim: a large-scale genome simulator for livestock. Bioinformatics 25:680–681. doi: 10.1093/bioinformatics/btp045
- Schaeffer L. R. 2006. Strategy for applying genome-wide selection in dairy cattle. J. Anim. Breed. Genet. 123:218–223. doi: 10.1111/j.1439-0388.2006.00595.x
- Spelman R. J., Ford C. A., McElhinney P., Gregory G. C., and Snell R. G. 2002. Characterization of the DGAT1 gene in the New Zealand dairy population. J. Dairy Sci. 85:3514–3517. doi: 10.3168/jds.S0022-0302(02)74440-8
- Stam P. 1980. The distribution of the fraction of the genome identical by descent in finite random mating populations. Genet. Res. 35:131–155. doi: 10.1017/S0016672300014002
- Strandén I., and Christensen O. F. 2011. Allele coding in genomic evaluation. Genet. Sel. Evol. 43:25. doi: 10.1186/1297-9686-43-25
- Su G., Brøndum R. F., Ma P., Guldbrandtsen B., Aamand G. P., and Lund M. S. 2012. Comparison of genomic predictions using medium-density (∼54,000) and high-density (∼777,000) single nucleotide polymorphism marker panels in Nordic Holstein and Red Dairy Cattle populations. J. Dairy Sci. 95:4657–4665. doi: 10.3168/jds.2012-5379
- Thaller G., Krämer W., Winter A., Kaupe B., Erhardt G., and Fries R. 2003. Effects of DGAT1 variants on milk production traits in German cattle breeds. J. Anim. Sci. 81:1911–1918. doi: 10.2527/2003.8181911x
- Van den Berg I., Boichard D., Guldbrandtsen B., and Lund M. S. 2016. Using sequence variants in linkage disequilibrium with causative mutations to improve across breed prediction in dairy cattle: a simulation study. G3 6:2553–2561. doi: 10.1534/g3.116.027730
- VanRaden P. M. 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91:4414–4423. doi: 10.3168/jds.2007-0980
- Vitezica Z. G., Aguilar I., Misztal I., and Legarra A. 2011. Bias in genomic predictions for populations under selection. Genet. Res. (Camb.) 93:357–366. doi: 10.1017/S001667231100022X
- Wientjes Y. C. J., Bijma P., Vandenplas J., and Calus M. P. L. 2017. Multi-population genomic relationships for estimating current genetic variances within and genetic correlations between populations. Genetics 207:503–515. doi: 10.1534/genetics.117.300152
- Wientjes Y. C., Calus M. P., Goddard M. E., and Hayes B. J. 2015. Impact of QTL properties on the accuracy of multi-breed genomic prediction. Genet. Sel. Evol. 47:42. doi: 10.1186/s12711-015-0124-6
- Wright S. 1931. Evolution in Mendelian Populations. Genetics 16:97–159.
- Xiang T., Nielsen B., Su G., Legarra A., and Christensen O. F. 2016. Application of single-step genomic evaluation for crossbred performance in pig. J. Anim. Sci. 94:936–948. doi: 10.2527/jas.2015-9930
- Zhou L., Lund M. S., Wang Y., and Su G. 2014. Genomic predictions across Nordic Holstein and Nordic Red using the genomic best linear unbiased prediction model with different genomic relationship matrices. J. Anim. Breed. Genet. 131:249–257. doi: 10.1111/jbg.12089