Impact of QTL minor allele frequency on genomic evaluation using real genotype data and simulated phenotypes in Japanese Black cattle

Yoshinobu Uemoto; Shinji Sasaki; Takatoshi Kojima; Yoshikazu Sugimoto; Toshio Watanabe

doi:10.1186/s12863-015-0287-8

. 2015 Nov 19;16:134. doi: 10.1186/s12863-015-0287-8

PMCID: PMC4653875 PMID: 26586567

Abstract

Background

Genetic variance that is not captured by single nucleotide polymorphisms (SNPs) is due to imperfect linkage disequilibrium (LD) between SNPs and quantitative trait loci (QTLs), and the extent of LD between SNPs and QTLs depends on different minor allele frequencies (MAF) between them. To evaluate the impact of MAF of QTLs on genomic evaluation, we performed a simulation study using real cattle genotype data.

Methods

In total, 1368 Japanese Black cattle and 592,034 SNPs (Illumina BovineHD BeadChip) were used. We simulated phenotypes using real genotypes under different scenarios, varying the MAF categories, QTL heritability, number of QTLs, and distribution of QTL effect. After generating true breeding values and phenotypes, QTL heritability was estimated and the prediction accuracy of genomic estimated breeding value (GEBV) was assessed under different SNP densities, prediction models, and population size by a reference-test validation design.

Results

The extent of LD between SNPs and QTLs in this population was higher for QTLs with high MAF than for those with low MAF. The effect of the MAF of QTLs on genomic evaluation depended on the genetic architecture, evaluation strategy, and population size. Regarding genetic architecture, genomic evaluation was affected by the MAF of QTLs in combination with the QTL heritability and the distribution of QTL effect. The number of QTLs did not affect genomic evaluation when it was more than 50. Regarding evaluation strategy, different SNP densities and prediction models affected heritability estimation and genomic prediction, and this effect depended on the MAF of QTLs. In addition, accurate QTL heritability and GEBV were obtained using denser SNP information and a prediction model that accounted for SNPs with low and high MAFs. Regarding population size, a large sample size is needed to increase the accuracy of GEBV.

Conclusion

The MAF of QTLs had an impact on heritability estimation and prediction accuracy. Most genetic variance can be captured using denser SNPs and a prediction model that accounts for MAF, but a large sample size is needed to increase the accuracy of GEBV under all QTL MAF categories.

Electronic supplementary material

The online version of this article (doi:10.1186/s12863-015-0287-8) contains supplementary material, which is available to authorized users.

Keywords: BovineHD, Genomic prediction, Heritability estimation, Japanese Black cattle, Minor allele frequency, Simulation study

Background

The development of single nucleotide polymorphism (SNP) array technology has enhanced the genetic dissection of complex traits, and this SNP information can be directly utilized in cattle breeding programs using genomic selection [1, 2]. In addition, whole genome sequence (WGS) data are becoming increasingly available for cattle. Because WGS data account for all variants, including quantitative trait loci (QTLs), they are expected to yield a better understanding of complex traits, capture all of the genetic variance, and allow accurate prediction of the genomic estimated breeding value (GEBV) [3, 4].

A recent report showed that the SNPs significantly associated with a complex trait explain only a fraction of the phenotypic variance in human height, and this has been called the “missing heritability” problem [5]. It has been argued that missing heritability is due to imperfect linkage disequilibrium (LD) between SNPs and QTLs, and the extent of LD between SNPs and QTLs depends on differences in the minor allele frequency (MAF) between SNPs and QTLs [6]. Loci with similar MAF can potentially have high LD, but loci with very different MAF cannot. In cattle populations, QTLs may have a lower MAF than SNPs on low-density SNP arrays, because these arrays are designed to work in several different breeds. In this case, the genetic variation explained by SNPs will be reduced because of the low LD between SNPs and QTLs with low MAF. Meat from Japanese Black cattle is known to have the unique characteristic of a high degree of marbling, and the breed is genetically distant from European breeds at the genome level [7]. The extent of LD between SNPs and QTLs in Japanese Black cattle may differ from that in other cattle breeds, and it is necessary to evaluate the impact of the MAF of QTLs on genomic evaluation in this target population.

Heritability estimation and GEBV prediction are measures of goodness-of-fit in reference populations and of predictive ability in test populations, respectively. The amount of genetic variance not captured by SNPs affects the maximum predictive ability [8]. On the other hand, increasing the goodness-of-fit will not necessarily increase the predictive ability, because of the model over-fitting problem [9]. Heritability estimation and prediction accuracy depend on several factors such as the genetic architecture of a trait (e.g., QTL heritability, number of QTLs, and distribution of QTL effect), the evaluation strategy (e.g., SNP marker density and prediction method), and population size [6, 9–12]. Therefore, it is important to understand how heritability estimation and GEBV prediction depend on these factors for QTLs with different MAFs.

The objective of this study was to evaluate the impact of MAF of QTLs on heritability estimation and accuracy of GEBV prediction, and how that depends on the genetic architecture (QTL heritability, number of QTLs, and distribution of QTL effect), the evaluation strategy (SNP density and prediction model), and population size. We performed a simulation analysis based on a reference-test validation design, which used real genotype data to account for the extent of LD in Japanese Black cattle.

Methods

Genotypes for this study were obtained from previously published data [13]. All animal experiments were performed according to the Guidelines for the Care and Use of Laboratory Animals of Shirakawa Institute of Animal Genetics, and this research was approved by Shirakawa Institute of Animal Genetics Committee on Animal Research (H21-2). We have obtained the written agreement from the cattle owners to use the samples.

Data

In this simulation analysis, real genotype data were used to account for the extent of LD in Japanese Black cattle. Complete descriptions of the experimental population and SNP information were reported previously by Uemoto et al. [13]. Briefly, a total of 1444 Japanese Black cattle, comprising 653 steers from two slaughterhouses in Japan [14] and 791 cows from farms managed by a large cooperative farming company in Japan [15], were genotyped using the Illumina BovineHD BeadChip (HD) (Illumina, San Diego, CA, USA). After applying the exclusion criteria of MAF < 0.01, call rate < 0.95, and Hardy–Weinberg equilibrium test P < 0.001, 593,696 SNPs on autosomal chromosomes were retained. To avoid having very close relatives in the data, animals with large off-diagonal elements in the genomic relationship matrix (GRM) were excluded (a cut-off value of ± 0.4 for off-diagonal elements), and the SNPs were then reassessed by the same criteria. A total of 1368 animals and 592,034 SNPs were then used in the simulation study. These animals were distantly related and were the progeny of 438 sires; the mean, median, and maximum number of progeny per sire were 3.1, 2, and 24, respectively. The distribution of progeny per sire is shown in Additional file 1: Figure S1.

Simulation design

In this study, we simulated the true breeding value (TBV) and phenotypes under different scenarios that varied the following factors: MAF category, QTL heritability, number of QTLs, and distribution of QTL effect. After generating TBV and phenotypes, the QTL heritability was estimated and the prediction accuracy of GEBV was assessed under conditions that varied the following factors: SNP density, prediction model, and size of the reference and test populations in a reference-test validation design. The factors considered in the simulation study are summarized in Table 1 and described in detail below. The impact of the MAF of QTLs on genomic evaluation under different genetic architectures was evaluated in scenarios 1 and 2. In addition, the impact of the MAF of QTLs on genomic evaluation under different evaluation strategies and population sizes was evaluated in scenarios 3 and 4, respectively.

Table 1.

Factors for different scenarios in a simulation study

	Scenario
Factor	1	2	3	4
MAF^a	All, High, Low	All, High, Low	All, High, Low	All, High, Low
QTL heritability	0.2, 0.4, 0.8	0.4	0.4	0.4
Number of QTLs	500	50, 100, 300, 500, 1000, 2000	500	500
Distribution of QTL effect^b	EquV	Gamma, EquV	EquV	EquV
SNP density^c	50 K	50 K	7 K, 50 K, 7K_to_HD, 50 K_to_HD, HD	50 K
Prediction model^d	Model (1) with G_Y	Model (1) with G_Y	Model (1) with G_V, G_Y, and G_S, Model (2)	Model (1) with G_Y
Size of reference set	1231	1231	1231	200, 400, 800, 1200
Size of test set	137	137	137	1168, 968, 568, 168

^aMAF, Minor allele frequency; All, 0.01 ≤ MAF ≤ 0.5; High, 0.05 < MAF ≤ 0.5; Low, 0.01 ≤ MAF ≤ 0.05

^bGamma, Gamma distribution model; EquV, Equal variance model

^c7K, 50 K and HD, Illumina infinium BovineLDv1.1, BovineSNP50v2, and BovineHD BeadChips, respectively; 7 K_to_HD and 50 K_to_HD, Imputations were performed from 7 K and 50 K to HD, respectively

^dG_V, VanRaden's G matrix; G_Y, Yang's G matrix; G_S, Speed's G matrix

In this simulation, 36,478 and 6316 SNPs on the BovineSNP50v2 BeadChip (50 K) and the BovineLDv1.1 BeadChip (7 K) (Illumina, San Diego, CA, USA), respectively, were designated as SNP markers. The distribution density of MAF of SNPs on 7 K, 50 K, and HD is plotted in Fig. 1. At low MAF, the proportion of SNPs is low on 7 K and high on 50 K and HD. The remaining 555,556 SNPs that are present on the HD but not on the 50 K and 7 K were treated as candidate QTLs. For SNP density, three types of SNPs were used in this simulation. First, SNPs on 7 K and 50 K were used, and this scenario involved imperfect LD between SNPs and QTLs (referred to as the imperfect LD SNPs). Second, the HD genotype was imputed from SNPs on 50 K (50 K_to_HD) and 7 K (7 K_to_HD) by the BEAGLE (v4.0) software [16]. We performed a 10-fold cross-validation to obtain imputed HD genotypes in this population, and the details of the imputation were reported previously by Uemoto et al. [13]. The imputed SNPs were then reassessed by the same exclusion criteria as described above, and 585,015 and 588,547 SNPs were used in the 7 K_to_HD and 50 K_to_HD, respectively. The details of the imputation error ratio were reported by Uemoto et al. [13], and the average correlation between true and imputed genotypes was 0.98 in 50 K_to_HD and 0.93 in 7 K_to_HD. In this scenario, some SNPs were the QTLs themselves, but their genotypes carried a low imputation error ratio (referred to as the imputed SNPs). Third, all SNPs on the HD were used as SNPs, and this scenario assumed that WGS data were available and some SNPs were the QTLs themselves (referred to as the perfect LD SNPs).

Fig. 1 — Distribution of minor allele frequencies for SNPs under different SNP densities. The x-axis indicates the MAF of SNPs, and the y-axis represents the proportion of SNPs in each MAF category. 7 K, 50 K, and HD are SNP markers on Illumina infinium BovineLDv1.1, BovineSNP50v2, and BovineHD BeadChips, respectively

For candidate QTLs, three MAF categories were defined as follows: a low MAF group (0.01 ≤ MAF ≤ 0.05), a high MAF group (0.05 < MAF ≤ 0.5), and an all MAF group (0.01 ≤ MAF ≤ 0.5). A total of 50, 100, 300, 500, 1000, and 2000 QTLs were randomly selected from candidate QTLs in each MAF group. Hill et al. [17] showed that the distribution of allele frequency affecting additive genetic variance follows a U-shaped distribution, $f(p) \propto \frac{1}{p (1 - p)}$. For the all MAF group, the U-shaped distribution was assumed as the distribution of QTL allele frequency (0.01 ≤ p ≤ 0.5), and the proportions of the integrated values for low MAF, $\int_{0.01}^{0.05} f(p) \, dp$, and high MAF, $\int_{0.05}^{0.5} f(p) \, dp$, were 0.36 and 0.64, respectively. Therefore, QTLs with low and high MAFs in the all MAF group were randomly selected in the ratio 0.36:0.64, respectively.
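The 0.36:0.64 split can be checked directly, because the antiderivative of $\frac{1}{p (1 - p)}$ is $\ln \frac{p}{1 - p}$. The following is a minimal sketch of that arithmetic; the function name `u_shape_mass` is introduced here for illustration only and does not come from the original analysis.

```python
import math


def u_shape_mass(lo, hi):
    """Integral of 1 / (p * (1 - p)) from lo to hi.

    Uses the closed form ln(p / (1 - p)) for the antiderivative.
    """
    def logit(p):
        return math.log(p / (1.0 - p))

    return logit(hi) - logit(lo)


# Unnormalized mass in the low (0.01-0.05) and high (0.05-0.5) MAF ranges.
low = u_shape_mass(0.01, 0.05)
high = u_shape_mass(0.05, 0.5)

# Normalize over the whole range 0.01-0.5 to get the sampling proportions.
total = low + high
low_share = low / total
high_share = high / total

print(round(low_share, 2), round(high_share, 2))  # prints 0.36 0.64
```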

We assumed the use of a polygenic model in the simulation, because this is a reasonable assumption for the majority of complex traits in cattle. The phenotype was simulated by summing all true QTL genotypic values and the residual effect, that is, $y_{i} = \sum_{j = 1}^{m} x_{i j} b_{j} + e_{i}$, where m is the number of QTLs, x_ij is the genotype for the j-th QTL of the i-th animal (coded as 0, 1, or 2 for the homozygote, heterozygote, and the other homozygote, respectively), b_j is the allele substitution effect of the j-th QTL, and e_i is the residual effect generated from $N (0, \sigma_{g}^{2} (1 / h^{2} - 1))$. $\sum_{j = 1}^{m} x_{i j} b_{j}$ is the TBV, $\sigma_{g}^{2}$ is the total genetic variance of TBV, and h² is the setting value of QTL heritability. Three setting values of QTL heritability (h² = 0.20, 0.40, and 0.80) were used to generate phenotypes.

In this study, two different distributions of the QTL effect were assumed. The first was a gamma distribution model in which the QTL effect was generated from a gamma distribution with a shape parameter of 0.4 and a scale parameter of 1.66 [2]. The second was an equal variance model in which the QTL effect was set to $b_{j} = \frac{1}{\sqrt{2 p_{j} (1 - p_{j})}}$, where p_j is the MAF of the j-th QTL. Under the equal variance model, all QTLs contribute equally to the QTL variance (Var(b_j) = 1 in this assumption) if linkage equilibrium is assumed among QTLs. The signs of QTL effects were randomly selected, and total QTL variance was adjusted to 100 × h² in both distribution models.
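As an illustration of the phenotype simulation under the equal variance model, the sketch below builds TBV and phenotypes from a genotype matrix. It is a minimal sketch under stated assumptions, not the code used in the study: the function name `simulate_phenotypes`, the toy genotypes drawn from a binomial distribution, and the estimation of allele frequencies from the sample are all choices made here for illustration.

```python
import numpy as np


def simulate_phenotypes(X, h2, rng, total_qtl_var=100.0):
    """Simulate TBV and phenotypes under the equal variance model.

    X   : (n animals, m QTLs) array of genotype counts coded 0/1/2.
          Every QTL is assumed to be polymorphic in the sample.
    h2  : setting value of QTL heritability.
    rng : numpy random Generator.
    """
    p = X.mean(axis=0) / 2.0                      # sample allele frequency per QTL
    b = 1.0 / np.sqrt(2.0 * p * (1.0 - p))        # equal variance effect sizes
    b *= rng.choice([-1.0, 1.0], size=b.size)     # random signs
    tbv = X @ b                                   # true breeding values

    # Rescale so the genetic variance equals 100 * h2, as in the text.
    genetic_var = total_qtl_var * h2
    tbv = tbv * np.sqrt(genetic_var / tbv.var())

    # Residual variance sigma_g^2 * (1 / h2 - 1); phenotypic variance is then ~100.
    resid_var = genetic_var * (1.0 / h2 - 1.0)
    e = rng.normal(0.0, np.sqrt(resid_var), size=X.shape[0])
    return tbv, tbv + e


# Toy data (hypothetical): 500 animals, 50 QTLs in Hardy-Weinberg proportions.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.5, size=50)
X = rng.binomial(2, freqs, size=(500, 50)).astype(float)
tbv, y = simulate_phenotypes(X, h2=0.4, rng=rng)
```

Because the genetic variance is fixed at 100 × h² and the residual variance at 100 × (1 − h²), the expected phenotypic variance is 100 for every setting of h².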

Statistical analysis

The generated data were analyzed by the genomic best linear unbiased prediction (GBLUP) method with the following model:

$$y = 1_{n} \mu + X u + e \tag{1}$$

where y is the vector of phenotypic values, 1_n is a vector of n ones, μ is the mean, X is the design matrix for random effects, u is the additive genetic effect with $u \sim N (0, G \sigma_{u}^{2})$, and e is the residual effect with $e \sim N (0, I \sigma_{e}^{2})$. G is a GRM using all SNPs in each SNP density. $\sigma_{u}^{2}$ is the additive genetic variance, and $\sigma_{e}^{2}$ is the residual variance. We also used the following model:

$$y = 1_{n} \mu + X u_{L} + X u_{H} + e \tag{2}$$

where u_L is the additive genetic effect attributed to the low MAF SNPs with $u_{L} \sim N (0, G_{L} \sigma_{u_{L}}^{2})$, and u_H is the additive genetic effect attributed to the high MAF SNPs with $u_{H} \sim N (0, G_{H} \sigma_{u_{H}}^{2})$. G_L is a GRM using SNPs with low MAF, and G_H is a GRM using SNPs with high MAF in each SNP density. $\sigma_{u_{L}}^{2}$ and $\sigma_{u_{H}}^{2}$ are the additive genetic variances attributed to the SNPs with low and high MAFs, respectively, and $\sigma_{e}^{2}$ is the residual variance. We defined three different GRMs as follows:

VanRaden’s GRM (G_V): The first GRM, G_V, was proposed by VanRaden [18] and is calculated as follows:

$$G_{V} = \frac{Z Z'}{2 \sum_{j = 1}^{m} p_{j} (1 - p_{j})}$$

where m is the number of SNPs, p_j is the frequency of the second allele of j-th SNP, and the elements of Z are calculated as follows:

$$z_{i j} = x_{i j} - 2 p_{j}$$

where x_ij is the number of the second allele of the i-th individual at the j-th SNP.

Yang’s GRM (G_Y): The second GRM, G_Y, was proposed by Yang et al. [6] and is computed as follows:

$$G_{Y} = \frac{\bar{Z} \bar{Z}'}{m}$$

where $\bar{Z}$ is the Z matrix but with each element scaled based on the allele frequency of each locus as follows:

$${\bar{z}}_{i j} = \frac{z_{i j}}{\sqrt{2 p_{j} (1 - p_{j})}}$$

Speed’s GRM (G_S): The third GRM, G_S, was proposed by Speed et al. [19] and is calculated as follows:

$$G_{S} = \frac{W W'}{\sum_{j = 1}^{m} k_{j}}$$

where k_j is the weighting factor of the j-th SNP that accounts for LD, and the elements of W are calculated as follows:

$$w_{i j} = \sqrt{k_{j}} \, {\bar{z}}_{i j}$$

Speed et al. [19] proposed a method for weighting markers to account for LD. Their method, linkage-disequilibrium adjusted kinships (LDAK), examines the local SNP correlation caused by LD and computes optimal SNP weights by solving a linear program. We calculated the weighting factor k_j and the LD-adjusted GRM (G_S) by the LDAK software with default parameters and LD decay function. When analyzing high density SNPs (i.e., imputed SNPs and perfect LD SNPs), the weighting factors were calculated twice as suggested.
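To make the first two definitions concrete, the sketch below computes G_V and G_Y from a genotype matrix with NumPy. It is an illustrative sketch only: the function names are introduced here, allele frequencies are assumed to be estimated from the same sample, and every SNP is assumed to be polymorphic. G_S is not reproduced because its weights k_j come from the LDAK software.

```python
import numpy as np


def grm_vanraden(X):
    """G_V = Z Z' / (2 * sum_j p_j (1 - p_j)), with z_ij = x_ij - 2 p_j."""
    p = X.mean(axis=0) / 2.0          # frequency of the counted allele
    Z = X - 2.0 * p                   # center each SNP column
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))


def grm_yang(X):
    """G_Y = Zbar Zbar' / m, with each column of Z scaled by sqrt(2 p (1 - p))."""
    p = X.mean(axis=0) / 2.0
    Z = X - 2.0 * p
    Zbar = Z / np.sqrt(2.0 * p * (1.0 - p))
    return Zbar @ Zbar.T / X.shape[1]


# Toy genotypes (hypothetical): 100 animals, 400 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.5, size=400)
X = rng.binomial(2, freqs, size=(100, 400)).astype(float)

G_V = grm_vanraden(X)
G_Y = grm_yang(X)
```

The two matrices differ only in how SNPs are weighted: G_V divides by a single pooled scaling factor, so common SNPs dominate, whereas G_Y standardizes each SNP and gives rare and common SNPs equal weight.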

After calculating these three GRMs, 0.00001 was added to the diagonal elements of each GRM to avoid near-singularity problems. We used the three GRMs in model (1) and G_Y in model (2). The QTL heritabilities $h_{1}^{2}$ and $h_{2}^{2}$ for models (1) and (2), respectively, were calculated as follows:

$$h_{1}^{2} = \frac{\sigma_{u}^{2}}{\sigma_{u}^{2} + \sigma_{e}^{2}}$$

$$h_{2}^{2} = \frac{\sigma_{u_{L}}^{2} + \sigma_{u_{H}}^{2}}{\sigma_{u_{L}}^{2} + \sigma_{u_{H}}^{2} + \sigma_{e}^{2}}$$

Validation test of heritability estimation and prediction accuracy

Under each scenario, we replicated a reference-test validation design 300 times. In each replicate, the data were randomly split once into two disjoint sets, that is, 137 animals (one-tenth of all animals) in the test population and the remaining 1231 animals in the reference population. In addition, to evaluate the impact of the MAF of QTLs under different population sizes, 200, 400, 800, and 1200 animals were randomly selected as the reference population, and the remaining 1168, 968, 568, and 168 animals were used as the test population, respectively. Phenotypes of animals in the test population were masked in each replicate, and we estimated QTL heritability in the reference population and predicted the GEBV in the test population using the ASREML 3.0 program [20]. After predicting the GEBV, the prediction accuracy was assessed using Pearson’s correlation between TBV and GEBV in each test population of the validation set. The mean and standard deviation (SD) of the 300 replicates were then calculated.
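One replicate of this design can be sketched as follows. This is not the ASREML procedure used in the study: for brevity the sketch assumes the variance ratio is known (taken from the setting value of h²) rather than estimated by REML, and it estimates μ by the plain mean of the reference phenotypes. The function names and the toy data are hypothetical.

```python
import numpy as np


def gblup_predict(G, y, ref, test, h2):
    """BLUP of test-set breeding values given a known heritability h2."""
    lam = (1.0 - h2) / h2                         # sigma_e^2 / sigma_u^2
    y_ref = y[ref]
    mu = y_ref.mean()                             # simple mean as the estimate of mu
    A = G[np.ix_(ref, ref)] + lam * np.eye(ref.size)
    return G[np.ix_(test, ref)] @ np.linalg.solve(A, y_ref - mu)


def one_replicate(G, y, tbv, n_test, h2, rng):
    """Random reference/test split; returns Pearson r between TBV and GEBV."""
    perm = rng.permutation(y.size)
    test, ref = perm[:n_test], perm[n_test:]      # test phenotypes are never used
    gebv = gblup_predict(G, y, ref, test, h2)
    return np.corrcoef(tbv[test], gebv)[0, 1]


# Toy data (hypothetical): 300 animals, 200 SNPs that all act as QTLs.
rng = np.random.default_rng(1)
n, m, h2 = 300, 200, 0.4
X = rng.binomial(2, rng.uniform(0.1, 0.5, size=m), size=(n, m)).astype(float)
p = X.mean(axis=0) / 2.0
Zbar = (X - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
G = Zbar @ Zbar.T / m + 1e-5 * np.eye(n)          # G_Y plus the small diagonal constant

tbv = Zbar @ rng.normal(size=m)
tbv *= np.sqrt(100.0 * h2 / tbv.var())            # genetic variance 100 * h2
y = tbv + rng.normal(0.0, np.sqrt(100.0 * (1.0 - h2)), size=n)

accuracies = [one_replicate(G, y, tbv, n_test=30, h2=h2, rng=rng) for _ in range(20)]
mean_acc, sd_acc = np.mean(accuracies), np.std(accuracies)
```

Repeating `one_replicate` and summarizing with the mean and SD mirrors how the accuracy is reported in the text, although the study resimulated and re-estimated variance components in every replicate.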

Results

Extent of LD between SNPs and QTLs

Under all scenarios, three MAF categories were defined to evaluate the impact of MAF of QTLs. To evaluate the impact of MAF of QTLs on the extent of LD between SNPs and QTLs, the extent of LD between SNPs on 50 K and QTLs in each MAF category is shown in Fig. 2. The extent of LD between SNPs and QTLs was evaluated using the r² value, which is a measure of LD. The r² values between QTLs and both adjacent SNPs were calculated by PLINK software [21]. The maximum value of r² between two QTL-SNP intervals was chosen in each QTL, and the density distributions of r² for three MAF categories were then plotted. The parameters used were the same as those used in scenario 1. In this result, most QTLs with low MAF had a lower r² value than those with high MAF. The r² value of QTLs with all MAF was between that of QTLs with low and high MAFs. The mean values of r² for all, high, and low MAFs were 0.294, 0.360, and 0.184, respectively. This shows that the extent of LD between SNPs and QTLs is higher in the QTLs with high MAF than that in those with low MAF.
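The per-QTL summary described above (the larger of the two r² values with the flanking SNPs) can be sketched as below. As a simplifying assumption, r² is computed here as the squared Pearson correlation of 0/1/2 genotype counts; PLINK's own estimate may differ depending on version and options. The function names and toy genotypes are hypothetical.

```python
import numpy as np


def r2(a, b):
    """Squared Pearson correlation between two genotype count vectors."""
    return np.corrcoef(a, b)[0, 1] ** 2


def max_flanking_r2(qtl, left_snp, right_snp):
    """Larger of the r2 values between a QTL and its two adjacent SNPs."""
    return max(r2(qtl, left_snp), r2(qtl, right_snp))


# Toy genotypes for six animals (hypothetical values).
qtl = np.array([0, 1, 2, 1, 0, 2], dtype=float)
left_snp = qtl.copy()                                    # in complete LD with the QTL
right_snp = np.array([0, 0, 1, 2, 1, 0], dtype=float)    # weaker LD with the QTL

best = max_flanking_r2(qtl, left_snp, right_snp)
```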

Fig. 2 — Proportion of linkage disequilibrium value (r²) between QTLs and adjacent SNPs. The plot on the right upper corner is the zoomed area of the bigger plot. The x-axis indicates the r² value between QTLs and SNPs, and the y-axis represents the proportion of QTLs in each minor allele frequency (MAF) category (All, Low, and High). The r² values between QTLs and both adjacent SNPs were calculated, and then the maximum value of r² between two QTL-SNP intervals was chosen to plot in each QTL. The parameters used were the same as those under scenario 1

The genetic architecture

We evaluated the impact of the MAF of QTLs on genomic evaluation under different QTL heritabilities in scenario 1, and the estimated QTL heritability and correlation between TBV and GEBV are shown in Fig. 3. In each MAF category, the estimated QTL heritability was close to the setting value, and the correlation increased as the QTL heritability increased. With respect to the MAF of QTLs, the estimated QTL heritability and the correlation between TBV and GEBV were highest for QTLs with high MAF, and the values for all MAF were between those for low and high MAFs at each setting value of QTL heritability. In addition, as the setting value was increased from 0.20 to 0.80, the differences between the results for high and low MAFs increased in QTL heritability (from 0.06 to 0.15, respectively) and in the correlation between TBV and GEBV (from 0.14 to 0.16, respectively).

Fig. 3 — Results obtained from scenario 1. Estimated QTL heritability and correlation between true breeding and genomic estimated breeding values are calculated. The x-axis indicates the true QTL heritability, and the y-axis represents mean values of 300 replicates for the estimated QTL heritability (a) and the correlation between true breeding value (TBV) and genomic estimated breeding value (GEBV) (b). The results of varying minor allele frequency (MAF) categories (All, Low, and High) and QTL heritabilities (0.20, 0.40, and 0.80) are shown. The whiskers represent the standard deviation of 300 replicates

We evaluated the impact of the MAF of QTLs on genomic evaluation under different numbers of QTLs and distributions of the QTL effect in scenario 2, and the estimated QTL heritability and correlation between TBV and GEBV are shown in Fig. 4. With respect to QTL number, the estimated QTL heritability and correlation remained constant regardless of the number of QTLs in each MAF category.

Fig. 4 — Results obtained from scenario 2. Estimated QTL heritability and correlation between true breeding and genomic estimated breeding values are calculated. The x-axis indicates the number of QTLs, and the y-axis represents mean values of 300 replicates for the estimated QTL heritability (a) and the correlation between true breeding value (TBV) and genomic estimated breeding value (GEBV) (b). The results of varying minor allele frequency (MAF) categories (All, Low, and High), number of QTLs (50, 100, 300, 500, 1000, and 2000), and distribution of QTL allele substitution effect (Gamma, gamma distribution model; EquV, equal variance model) are shown

For the distribution of QTL effect, the results of the QTLs with high and low MAFs followed a similar trend between the two distribution models, whereas different results were observed between two distribution models in the QTLs with all MAFs. The results of high and all MAFs showed similar trends in the gamma distribution model, and the estimated QTL heritability and correlation between TBV and GEBV were about 0.39 and 0.50, respectively. On the other hand, the results of all MAFs were lower than those of high MAF in the equal variance model, and the values of estimated QTL heritability and correlation between TBV and GEBV were about 0.36 and 0.44 for all MAF and 0.39 and 0.50 for high MAF, respectively.

The evaluation strategy

We evaluated the impact of the MAF of QTLs on genomic evaluation under different evaluation strategies for SNP density and prediction model in scenario 3. Goodness-of-fit was measured by the Akaike information criterion (AIC) to compare the prediction models. The AIC is defined as $\mathrm{AIC} = 2 v - 2 \ln (\text{likelihood})$, where v is the number of variance components. A lower AIC therefore indicates a better fit. The estimated QTL heritability, AIC, and correlation between TBV and GEBV are shown in Table 2, Table 3, and Table 4, respectively.
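The AIC comparison can be illustrated with a short sketch. The log-likelihood values below are arbitrary placeholders, not results from the study, and the counts of variance components (two for model (1), three for model (2), counting the residual variance) are an assumption made here based on the model definitions.

```python
def aic(log_likelihood, n_variance_components):
    """AIC = 2 v - 2 ln(likelihood); lower values indicate a better fit."""
    return 2 * n_variance_components - 2 * log_likelihood


# Placeholder log-likelihoods (hypothetical values, for illustration only).
aic_model1 = aic(log_likelihood=-100.0, n_variance_components=2)   # sigma_u^2, sigma_e^2
aic_model2 = aic(log_likelihood=-98.5, n_variance_components=3)    # sigma_uL^2, sigma_uH^2, sigma_e^2

print(aic_model1, aic_model2)  # prints 204.0 203.0
```

In this placeholder example the extra variance component in model (2) costs 2 AIC units, so model (2) is preferred only because its log-likelihood improves by more than 1.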

Table 2.

Heritability estimation in scenario 3

		All MAF^a		High MAF^a		Low MAF^a
SNP^b	Prediction model^c	Mean	SD	Mean	SD	Mean	SD
7 K	Model (1) with G_V	0.28	0.05	0.32	0.05	0.20	0.06
	Model (1) with G_Y	0.30	0.05	0.33	0.05	0.23	0.06
	Model (1) with G_S	0.30	0.05	0.33	0.05	0.24	0.06
	Model (2)	0.30	0.05	0.33	0.05	0.23	0.06
50 K	Model (1) with G_V	0.33	0.06	0.38	0.06	0.24	0.06
	Model (1) with G_Y	0.36	0.06	0.39	0.06	0.30	0.06
	Model (1) with G_S	0.38	0.06	0.40	0.06	0.34	0.07
	Model (2)	0.37	0.06	0.39	0.06	0.34	0.06
7K_to_HD	Model (1) with G_V	0.34	0.06	0.39	0.06	0.24	0.06
	Model (1) with G_Y	0.37	0.06	0.40	0.06	0.30	0.06
	Model (1) with G_S	0.41	0.07	0.41	0.07	0.39	0.07
	Model (2)	0.39	0.06	0.40	0.06	0.38	0.06
50K_to_HD	Model (1) with G_V	0.34	0.06	0.39	0.06	0.25	0.06
	Model (1) with G_Y	0.37	0.06	0.41	0.06	0.30	0.07
	Model (1) with G_S	0.41	0.07	0.42	0.07	0.40	0.07
	Model (2)	0.40	0.06	0.40	0.06	0.40	0.06
HD	Model (1) with G_V	0.35	0.06	0.39	0.06	0.25	0.06
	Model (1) with G_Y	0.38	0.06	0.41	0.06	0.31	0.07
	Model (1) with G_S	0.42	0.07	0.41	0.07	0.40	0.07
	Model (2)	0.40	0.06	0.40	0.06	0.41	0.06

^aMAF, Minor allele frequency; All MAF, 0.01 ≤ MAF ≤ 0.5; High MAF, 0.05 < MAF ≤ 0.5; Low MAF, 0.01 ≤ MAF ≤ 0.05

^b7K, 50 K and HD, Illumina infinium BovineLDv1.1, BovineSNP50v2, and BovineHD BeadChips, respectively; 7 K_to_HD and 50 K_to_HD, Imputations were performed from 7 K and 50 K to HD, respectively

^cG_V, VanRaden's genome relationship matrix (GRM); G_Y, Yang's GRM; G_S, Speed's GRM

Table 3.

Model fitness measured by Akaike information criterion (AIC) in scenario 3

		All MAF^a		High MAF^a		Low MAF^a
SNP^b	Prediction model^c	Mean	SD	Mean	SD	Mean	SD
7 K	Model (1) with G_V	6164	63	6145	66	6191	61
	Model (1) with G_Y	6162	63	6145	66	6188	61
	Model (1) with G_S	6162	63	6146	66	6187	61
	Model (2)	6163	63	6147	66	6186	61
50 K	Model (1) with G_V	6159	63	6139	65	6188	62
	Model (1) with G_Y	6155	63	6139	65	6181	62
	Model (1) with G_S	6155	63	6142	65	6175	62
	Model (2)	6155	63	6140	65	6163	62
7K_to_HD	Model (1) with G_V	6158	63	6138	65	6189	62
	Model (1) with G_Y	6155	63	6138	65	6182	62
	Model (1) with G_S	6156	63	6147	65	6171	62
	Model (2)	6154	63	6139	65	6155	62
50K_to_HD	Model (1) with G_V	6157	63	6137	65	6188	62
	Model (1) with G_Y	6154	63	6137	65	6181	62
	Model (1) with G_S	6155	63	6146	65	6169	62
	Model (2)	6153	63	6138	65	6152	62
HD	Model (1) with G_V	6157	63	6136	65	6188	62
	Model (1) with G_Y	6154	63	6137	65	6180	62
	Model (1) with G_S	6155	63	6147	65	6168	62
	Model (2)	6152	63	6138	65	6150	62

^aMAF, Minor allele frequency; All MAF, 0.01 ≤ MAF ≤ 0.5; High MAF, 0.05 < MAF ≤ 0.5; Low MAF, 0.01 ≤ MAF ≤ 0.05

^b7K, 50 K and HD, Illumina infinium BovineLDv1.1, BovineSNP50v2, and BovineHD BeadChips, respectively; 7 K_to_HD and 50 K_to_HD, Imputations were performed from 7 K and 50 K to HD, respectively

^cG_V, VanRaden's genome relationship matrix (GRM); G_Y, Yang's GRM; G_S, Speed's GRM

Table 4.

Correlation between true breeding value and genomic breeding value in scenario 3

		All MAF^a		High MAF^a		Low MAF^a
SNP^b	Prediction model^c	Mean	SD	Mean	SD	Mean	SD
7 K	Model (1) with G_V	0.41	0.08	0.48	0.08	0.30	0.09
	Model (1) with G_Y	0.42	0.08	0.48	0.08	0.32	0.09
	Model (1) with G_S	0.42	0.08	0.48	0.08	0.33	0.09
	Model (2)	0.42	0.08	0.48	0.08	0.33	0.09
50 K	Model (1) with G_V	0.43	0.08	0.50	0.08	0.32	0.09
	Model (1) with G_Y	0.44	0.08	0.50	0.08	0.35	0.09
	Model (1) with G_S	0.44	0.08	0.49	0.08	0.37	0.09
	Model (2)	0.44	0.08	0.50	0.08	0.41	0.09
7K_to_HD	Model (1) with G_V	0.44	0.08	0.50	0.08	0.32	0.09
	Model (1) with G_Y	0.45	0.08	0.50	0.08	0.35	0.09
	Model (1) with G_S	0.44	0.08	0.48	0.08	0.38	0.08
	Model (2)	0.45	0.08	0.50	0.08	0.44	0.08
50K_to_HD	Model (1) with G_V	0.44	0.08	0.51	0.08	0.32	0.09
	Model (1) with G_Y	0.45	0.08	0.51	0.08	0.36	0.09
	Model (1) with G_S	0.44	0.08	0.48	0.08	0.39	0.08
	Model (2)	0.46	0.08	0.51	0.08	0.46	0.08
HD	Model (1) with G_V	0.44	0.08	0.51	0.08	0.32	0.09
	Model (1) with G_Y	0.45	0.08	0.51	0.08	0.36	0.08
	Model (1) with G_S	0.44	0.08	0.48	0.08	0.39	0.08
	Model (2)	0.46	0.08	0.51	0.08	0.47	0.08

^aMAF, Minor allele frequency; All MAF, 0.01 ≤ MAF ≤ 0.5; High MAF, 0.05 < MAF ≤ 0.5; Low MAF, 0.01 ≤ MAF ≤ 0.05

^b7K, 50 K and HD, Illumina infinium BovineLDv1.1, BovineSNP50v2, and BovineHD BeadChips, respectively; 7 K_to_HD and 50 K_to_HD, Imputations were performed from 7 K and 50 K to HD, respectively

^cG_V, VanRaden's genome relationship matrix (GRM); G_Y, Yang's GRM; G_S, Speed's GRM

Differences in the SNP density have an impact on heritability estimation and GEBV prediction. For model (1) with G_Y, the results of 50 K were higher than those of 7 K in all MAF categories. For example, from the QTLs with all MAFs, the results of 50 K and 7 K were 0.36 and 0.30 for QTL heritability and 0.44 and 0.42 for correlation between TBV and GEBV, respectively. The results of imputed SNPs (i.e., 7 K_to_HD and 50 K_to_HD) were higher than those of 7 K and 50 K, and were very close to the results of perfect LD SNPs (i.e., HD) in all MAF categories. For example, from the QTLs with all MAFs, the results of both 50 K_to_HD and 7 K_to_HD were 0.37 for QTL heritability and 0.45 for correlation between TBV and GEBV, and the results of HD were 0.38 for QTL heritability and 0.45 for correlation between TBV and GEBV. These results indicate that heritability estimation and GEBV prediction depend on the SNP density. However, the different results among SNP densities in each MAF category depend on the prediction model.

For the prediction model, the results of model (1) with G_V were similar to those with G_Y for QTLs with high MAF, but the difference between the results obtained from G_V and G_Y increased for QTLs with low MAF. For example, the differences between G_V and G_Y in the AIC and the correlation between TBV and GEBV with 50 K were 0 and 0.00 for QTLs with high MAF but 7 and 0.03 for QTLs with low MAF, respectively. Model (1) with G_S performed similarly to or better than model (1) with G_Y for QTLs with all and low MAFs, but performed worse for QTLs with high MAF. In particular, the difference between the results of G_S and G_Y for QTLs with high MAF increased at larger SNP density. For example, the differences between G_S and G_Y in the AIC and the correlation between TBV and GEBV were 1 and 0.00 in 7 K but 10 and 0.03 in HD. In addition, the results of G_S with HD in high MAF were 6147 in AIC and 0.48 in the correlation between TBV and GEBV, which were the worst of all models under the high MAF scenario. The results of model (2) were similar to or better than those of the other three models under all MAF categories. In particular, the results of model (2) with HD in low MAF, which were 6150 in AIC and 0.47 in the correlation between TBV and GEBV, were the best values among the low MAF results.

Population size

In this simulation, the impact of the MAF of QTLs on genomic evaluation under different population sizes was evaluated in scenario 4. The estimated QTL heritability and correlation between TBV and GEBV are shown in Fig. 5. The results of heritability estimation and GEBV prediction followed different trends. The mean values of estimated QTL heritability were close to the setting value (0.40) and were almost the same across population sizes, but the SD of the estimates decreased as the size of the population increased (e.g., from 0.47 to 0.07 as the reference size increased from 200 to 1200, respectively, for all MAFs). For QTL heritability, the ordering of mean values high MAF > all MAF > low MAF appeared when the size of the reference set was more than 800. These results indicate that heritability estimates at smaller population sizes are less precise than those at larger population sizes, even if the estimated value is close to the setting value. In addition, the impact of the MAF of QTLs became apparent at larger population sizes.

Fig. 5 — Results obtained from scenario 4. Estimated QTL heritability and correlation between true breeding and genomic estimated breeding values are calculated. The x-axis indicates the size of the reference set, and the y-axis represents mean values of 300 replicates for the estimated QTL heritability (a) and the correlation between true breeding value (TBV) and genomic estimated breeding value (GEBV) (b). The results of varying minor allele frequency (MAF) categories (All, Low, and High) and size of the reference set (200, 400, 800, and 1200) are shown. The whiskers represent the standard deviation of 300 replicates

In the GEBV prediction, the correlations between TBV and GEBV increased as the size of the reference set increased (e.g., from 0.11 to 0.41 as the reference size increased from 200 to 1200, respectively, for all MAFs). QTLs with high MAF had the highest value, and the values for all MAFs were between those for low and high MAFs at all reference sizes (e.g., 0.34, 0.41, and 0.50 for low, all, and high MAFs at reference size 1200, respectively). In addition, as the size of the reference set increased from 200 to 1200, the difference between the high and low MAFs in the correlation between TBV and GEBV increased from 0.07 to 0.15, respectively.