Sparse single-step method for genomic evaluation in pigs

Tage Ostersen; Ole F Christensen; Per Madsen; Mark Henryon

doi:10.1186/s12711-016-0227-8

2016 Jun 29;48:48. doi: 10.1186/s12711-016-0227-8

PMCID: PMC4926299 PMID: 27357825

Abstract

Background

In many animal breeding programs, with the increasing number of genotyped animals, estimation of genomic breeding values by the single-step method is becoming limited by excessive computing requirements. A recently proposed algorithm for proven and young animals (APY) is an approximation that reduces computing time drastically by dividing genotyped animals into core and non-core animals, with only computations for core animals being time-consuming. We hypothesized that choosing core animals based on representing all generations, minimizing the relatedness within the core group, or maximizing the number of genotyped offspring, would result in greater accuracies of estimated breeding values (EBV).

Methods

We compared eight different core groups for the three pig breeds DanAvl Duroc, DanAvl Landrace and DanAvl Yorkshire. These eight sparse approximations of the single-step method were evaluated based on correlations of EBV for genotyped animals obtained from the sparse methods with those obtained from the usual version of the single-step method. We used a single-trait model with daily gain as the trait.

Results

For core groups that distributed animals across generations, correlations for genotyped animals (from 0.977 to 0.989) were higher than for those that did not distribute core animals across generations (from 0.934 to 0.956). For core groups that maximized the number of genotyped offspring, correlations for genotyped animals (from 0.983 to 0.989) were higher than for other core groups (from 0.934 to 0.981). There was no clear association between low relatedness within the core group and accuracy of approximations.

Conclusions

We found that for core groups that represent all generations and that maximize the number of genotyped offspring, accurate approximations of EBV were obtained. However, we did not find a clear association between accuracy and relatedness within the core group. For the APY method, this is the first study that reports systematic criteria for the creation of core groups that result in more accurate EBV than a similar-sized random core group. Random core groups only ensure across-generation representation. Therefore, we recommend choosing a core group that represents all generations and that maximizes the number of genotyped offspring for single-step genomic evaluation using the APY method.

Background

To estimate genomic breeding values, the single-step method is the method-of-choice for many animal breeding programs [1–4]. A challenge when using this method is the long computing time when the number of genotyped animals increases [5, 6], which puts a constraint on the estimation of genomic breeding values in most breeding schemes. Misztal et al. [5] proposed a computationally efficient solution to this problem, called the algorithm for proven and young animals (APY). APY computes sparse approximations of the inverse genomic relationship matrix by allocating animals to two groups: core and non-core animals [7]. The algorithm is computationally efficient because it ignores genomic relationships among non-core animals and only requires inversion of the genomic relationship matrix for core animals. Estimated breeding values (EBV) computed with APY can be nearly identical to those computed using the full version of the single-step method, where all genomic relationships are included. The APY approximations become more accurate when the core group size increases [5, 6]. Fragomeni et al. [8] suggested that, when a small number of animals is allocated to the core group to reduce computing time, accurate approximations are obtained when animals in the core group are chosen at random from all genotyped animals. Although other ways of choosing core animals have been proposed [6, 9], no study has reported a formal choice of the core animals that results in more accurate approximations than choosing them at random while keeping the size of the core group constant [5–11]. This is surprising because based on [12], the accuracy of APY is related to how well the animals in the core group represent the independent chromosome segments that are present in the population. This suggests that approximations of EBV obtained with APY could be more accurate by choosing core animals that represent the most independent chromosome segments. 

We propose three criteria to increase accuracies by choosing animals that represent the most independent chromosome segments: (1) choosing animals from all generations, because new cross-overs occur in each generation and thus create new independent chromosome segments; (2) minimizing the degree of relatedness within the core group by increasing the number of families in the core group, which should lead to a better representation of independent chromosome segments; and (3) including genotyped parents of genotyped animals in the core group, since they represent the independent chromosome segments of their offspring. Based on these assumptions, we hypothesized that choosing core animals based on representing all generations, minimizing the relatedness within the core group, or maximizing the number of genotyped offspring, will increase the accuracy of the resulting EBV. We tested this hypothesis by estimating accuracies of approximations of EBV for daily gain for three Danish pig breeds.

Methods

We compared eight core groups for the three pig breeds DanAvl Duroc, DanAvl Landrace and DanAvl Yorkshire. EBV for genotyped animals from the sparse single-step methods were correlated with EBV from the usual version of the single-step method. We used a single-trait model on daily gain.

Sparse single-step

To understand the computing issues of the single-step procedure, we first summarize this method. We used the single-step procedure that was formulated in Christensen et al. [3], in which the inverse pedigree relationship matrix for all animals $A_{full}^{-1}$ is replaced by $H^{-1}$, where:

$$H^{-1} = A_{full}^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{pmatrix}$$

where $G = (1 - w_{a}) G_{a} + w_{a} A_{22}$, $A_{22}^{-1}$ is the inverse of the part of the pedigree relationship matrix for genotyped animals, $w_{a}$ is the weight on the pedigree relationship matrix, and $G_{a}$ is the genomic relationship matrix adjusted to the same scale as $A_{22}$ by the following calculation:

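$$G_{a} = \beta G_{m} + \alpha,$$

where β and α solve the equations:

$$\operatorname{mean}(\operatorname{diag}(G_{m}))\,\beta + \alpha = \operatorname{mean}(\operatorname{diag}(A_{22})),$$

and

$$\operatorname{mean}(G_{m})\,\beta + \alpha = \operatorname{mean}(A_{22}),$$

where mean() represents the mean of the elements and diag() represents the diagonal elements of a matrix.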
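The two equations for β and α form a 2 × 2 linear system, so the adjustment and the subsequent blending with $A_{22}$ can be illustrated in a few lines. The following is a minimal sketch with NumPy on toy matrices; the function names `scale_genomic_matrix` and `blend` and the example values are hypothetical and are not taken from the study.

```python
import numpy as np

def scale_genomic_matrix(G_m, A22):
    """Solve the two mean-matching equations and return (G_a, beta, alpha)."""
    # One row per equation; the columns are the coefficients of (beta, alpha).
    lhs = np.array([
        [np.mean(np.diag(G_m)), 1.0],  # mean(diag(G_m)) * beta + alpha
        [np.mean(G_m), 1.0],           # mean(G_m) * beta + alpha
    ])
    rhs = np.array([np.mean(np.diag(A22)), np.mean(A22)])
    beta, alpha = np.linalg.solve(lhs, rhs)
    # alpha is added to every element of the scaled matrix.
    return beta * G_m + alpha, beta, alpha

def blend(G_a, A22, w_a=0.25):
    """G = (1 - w_a) * G_a + w_a * A22, with w_a the weight on the pedigree."""
    return (1.0 - w_a) * G_a + w_a * A22

# Toy example with three genotyped animals (values are illustrative only).
G_m = np.array([[1.1, 0.2, -0.3],
                [0.2, 0.9, 0.1],
                [-0.3, 0.1, 1.0]])
A22 = np.array([[1.0, 0.5, 0.25],
                [0.5, 1.0, 0.25],
                [0.25, 0.25, 1.0]])

G_a, beta, alpha = scale_genomic_matrix(G_m, A22)
G = blend(G_a, A22, w_a=0.25)
```

After scaling, the mean of the diagonal and the mean of all elements of `G_a` agree with those of `A22`, which is exactly what the two equations require.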

In the above procedure, matrix $H$ is sub-divided into non-genotyped and genotyped animals and index 2 denotes the genotyped individuals. The genomic relationship matrix $G_{m}$ is defined as:

$$G_{m} = \frac{(M - 2 p 1')(M - 2 p 1')'}{\sum_{j} 2 p_{j} (1 - p_{j})},$$

where matrix $M$ contains the genotypes coded 0, 1, 2, vector $p$ contains the allele frequencies computed from all genotyped animals, and 1 denotes a vector of ones.
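As an illustration of this definition, the sketch below builds $G_{m}$ from a small genotype matrix. It assumes that animals are stored in rows and SNPs in columns, and that allele frequencies are computed from the genotyped animals in the example itself; the function name `genomic_relationship_matrix` and the toy genotypes are hypothetical.

```python
import numpy as np

def genomic_relationship_matrix(M):
    """Genomic relationship matrix from genotypes coded 0, 1, 2.

    M has one row per animal and one column per SNP (an assumption of
    this sketch).
    """
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0               # allele frequency per SNP
    Z = M - 2.0 * p                        # centre each SNP column by 2 * p_j
    scale = np.sum(2.0 * p * (1.0 - p))    # sum_j 2 p_j (1 - p_j)
    return Z @ Z.T / scale

# Toy genotypes: 4 animals, 4 SNPs (illustrative values only).
M = np.array([[0, 1, 2, 1],
              [1, 1, 0, 2],
              [2, 0, 1, 1],
              [1, 2, 1, 0]])
G_m = genomic_relationship_matrix(M)
```

Because the frequencies are computed from the same animals that are centred, each row of `G_m` sums to zero in this toy case.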

The computationally heavy load of the single-step procedure is partly due to the increasing number of non-zero elements in $H^{-1}$, which increases the time necessary for preconditioned conjugate gradient (PCG) iteration, but also partly due to the need to invert $A_{22}$ and $G$. Note that the definition of $H^{-1}$ includes both $G^{-1}$ and $A_{22}^{-1}$; if these two matrices had zeros in the same positions, an even sparser $H^{-1}$ would be achieved.
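The definition of $H^{-1}$ only modifies the block of $A_{full}^{-1}$ that belongs to genotyped animals, which the following sketch makes explicit. It uses NumPy and dense inverses on a toy pedigree of two unrelated founders and their offspring; the function name `single_step_h_inverse`, the index list `genotyped` and all matrix values are assumptions made for illustration.

```python
import numpy as np

def single_step_h_inverse(A_full_inv, G_inv, A22_inv, genotyped):
    """Add (G^-1 - A22^-1) to the genotyped block of the inverse pedigree matrix."""
    H_inv = A_full_inv.copy()
    H_inv[np.ix_(genotyped, genotyped)] += G_inv - A22_inv
    return H_inv

# Toy pedigree: animals 0 and 1 are unrelated founders, animal 2 is their offspring.
A_full = np.array([[1.0, 0.0, 0.5],
                   [0.0, 1.0, 0.5],
                   [0.5, 0.5, 1.0]])
genotyped = [1, 2]                          # animal 0 is not genotyped
A22 = A_full[np.ix_(genotyped, genotyped)]
G = np.array([[1.05, 0.45],                 # illustrative genomic relationships
              [0.45, 0.95]])

A_full_inv = np.linalg.inv(A_full)
H_inv = single_step_h_inverse(A_full_inv, np.linalg.inv(G),
                              np.linalg.inv(A22), genotyped)
```

Inverting `H_inv` in this toy case returns a matrix whose genotyped block equals `G`, and setting `G` equal to `A22` returns the pedigree-based inverse unchanged.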

Sparse inverse genomic relationship matrix

According to Misztal et al. [5], the inverse genomic relationship matrix, $G^{-1}$, can be approximated by separating the genotyped animals into two groups using the APY algorithm:

$$G^{-1} = \begin{bmatrix} G_{cc}^{-1} + G_{cc}^{-1} G_{cn} D_{nn}^{-1} G_{nc} G_{cc}^{-1} & -G_{cc}^{-1} G_{cn} D_{nn}^{-1} \\ -D_{nn}^{-1} G_{nc} G_{cc}^{-1} & D_{nn}^{-1} \end{bmatrix},$$

where index $c$ denotes animals in the core group, index $n$ denotes animals in the non-core group, and $D_{nn}$ is a diagonal matrix with dimension equal to the number of non-core animals and diagonal elements given by:

$$D_{nn,ii} = G_{ii} - G_{ic} G_{cc}^{-1} G_{ic}',$$

where $G_{ic}$ denotes the ith row of $G_{nc}$. The APY approximation only requires inversion of the submatrix $G_{cc}$, which is more time- and memory-efficient than inversion of the full genomic relationship matrix. Furthermore, the inverse of the genomic relationship matrix approximated with APY is sparse with non-zero blocks among the core animals and between core and non-core animals, but only non-zero diagonal elements among the non-core animals [5].

Although this approach does not require calculation of the full $G$, we did calculate it. For an implementation, it would be sufficient to calculate only $G_{cc}$, $G_{nc}$ and $D_{nn}$, where α and β are estimated based on the core animals only.
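The block structure of the APY inverse can be sketched as follows. This is an illustrative dense NumPy implementation on a simulated positive definite matrix, not the code used in the study; the function name `apy_inverse`, the choice of core indices and the simulated `G` are assumptions. For simplicity the sketch takes the full `G` as input, although only its core columns and its diagonal are read.

```python
import numpy as np

def apy_inverse(G, core):
    """APY approximation of G^-1 for a given set of core animal indices."""
    n = G.shape[0]
    core = np.asarray(core)
    noncore = np.setdiff1d(np.arange(n), core)

    G_cc_inv = np.linalg.inv(G[np.ix_(core, core)])   # the only dense inversion
    G_nc = G[np.ix_(noncore, core)]
    P = G_nc @ G_cc_inv                                # row i is G_ic G_cc^-1

    # D_nn,ii = G_ii - G_ic G_cc^-1 G_ic'
    d = G[noncore, noncore] - np.einsum("ij,ij->i", P, G_nc)
    D_inv = 1.0 / d

    G_inv = np.zeros((n, n))
    G_inv[np.ix_(core, core)] = G_cc_inv + P.T @ (D_inv[:, None] * P)
    G_inv[np.ix_(core, noncore)] = -P.T * D_inv
    G_inv[np.ix_(noncore, core)] = -(D_inv[:, None] * P)
    G_inv[noncore, noncore] = D_inv                    # diagonal among non-core
    return G_inv

# Simulated positive definite "genomic" matrix for six animals (illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 20))
G = X @ X.T / 20 + 0.01 * np.eye(6)

core = np.array([0, 2, 4])
G_apy_inv = apy_inverse(G, core)
```

The result is symmetric, and its non-core block is diagonal, which is the source of the sparsity discussed above.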

Sparse inverse pedigree relationship matrix for genotyped animals

The inverse pedigree relationship matrix for genotyped animals is also dense, partly because of the numerical inversion. However, APY works very poorly for $A_{22}^{- 1}$ because the resulting $H^{- 1}$ is not positive definite (preliminary results not shown). Therefore, $A_{22}^{- 1}$ needs to be made sparse using e.g., the approach proposed by Faux and Gengler [13] or by Misztal et al. [5].

For pig data, the number of genotyped animals is typically much larger than the number of their non-genotyped ancestors, or at least this will soon be the case with the increasing numbers of genotyped animals. Therefore, the inverse of the pedigree relationship matrix for genotyped animals can be calculated efficiently by absorbing non-genotyped ancestors using the following equation [14]:

$$A_{22}^{-1} = A^{22} - A^{21} (A^{11})^{-1} A^{12},$$

where the inverse of the pedigree relationship matrix for animals in the reduced pedigree for genotyped animals $A_{red}^{-1}$ is sub-divided into:

$$A_{red}^{-1} = \begin{bmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{bmatrix},$$

where superscript 1 denotes non-genotyped animals in the pedigree for genotyped animals, and superscript 2 denotes genotyped animals. This means that only $A^{11}$, the usually small part of the inverse pedigree relationship matrix that corresponds to non-genotyped animals in the pedigree for genotyped animals, needs to be inverted.
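The absorption step can be checked numerically on a small pedigree. The sketch below uses NumPy; the function name `absorb_nongenotyped` and the toy pedigree are hypothetical, and $A_{red}^{-1}$ is obtained here by dense inversion only because the example is tiny (in practice it would be built directly from the pedigree).

```python
import numpy as np

def absorb_nongenotyped(A_red_inv, nongenotyped, genotyped):
    """A22^-1 = A^22 - A^21 (A^11)^-1 A^12, using blocks of the inverse A_red^-1."""
    A11 = A_red_inv[np.ix_(nongenotyped, nongenotyped)]
    A12 = A_red_inv[np.ix_(nongenotyped, genotyped)]
    A22_block = A_red_inv[np.ix_(genotyped, genotyped)]
    # Solve with A^11 instead of forming its explicit inverse.
    return A22_block - A12.T @ np.linalg.solve(A11, A12)

# Toy pedigree: animals 0 and 1 are unrelated founders, animal 2 is their offspring.
A_red = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])
nongenotyped = [0]
genotyped = [1, 2]

A22_inv = absorb_nongenotyped(np.linalg.inv(A_red), nongenotyped, genotyped)
```

In this toy case the result agrees with direct inversion of the genotyped block of `A_red`, while only the 1 × 1 block for the non-genotyped ancestor had to be solved.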

According to Misztal et al. [5], sparsity of $A_{22}^{- 1}$ can be achieved without large consequences by simply setting small elements of $A_{22}^{- 1}$ equal to zero. A sparse version of $A_{22}^{- 1}$ was achieved here (95 to 99 % sparsity) by setting elements in the range from −0.0001 to 0.0001 equal to zero. We note that, although the resulting sparse version of $A_{22}^{- 1}$ was not positive definite for the three datasets studied here, the resulting approximation of $H^{- 1}$ was positive definite in all cases.
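The thresholding step and the check for positive definiteness can be sketched as below. The sketch uses NumPy on a small illustrative matrix; the function names `sparsify` and `is_positive_definite` are hypothetical, and treating the bounds ±0.0001 as inclusive is an assumption because the text does not specify it.

```python
import numpy as np

def sparsify(matrix, threshold=1e-4):
    """Set elements with absolute value <= threshold to zero.

    Returns the thresholded matrix and the proportion of zero elements.
    """
    sparse = np.where(np.abs(matrix) <= threshold, 0.0, matrix)
    sparsity = np.mean(sparse == 0.0)
    return sparse, sparsity

def is_positive_definite(matrix):
    """Check that all eigenvalues of a symmetric matrix are positive."""
    return bool(np.all(np.linalg.eigvalsh(matrix) > 0.0))

# Illustrative symmetric matrix with a few very small off-diagonal elements.
A22_inv = np.array([[2.0, 5e-5, -0.3],
                    [5e-5, 1.5, -2e-5],
                    [-0.3, -2e-5, 1.8]])

A22_inv_sparse, sparsity = sparsify(A22_inv)
```

For large matrices the thresholded result would normally be stored in a sparse format; the dense array here only keeps the example short.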

Data and model

We used data on records of daily gain for pigs of the DanAvl Duroc, DanAvl Landrace and DanAvl Yorkshire breeds that were born between 2009 and 2014; their pedigree was traced back to 1996 (see Table 1 for details on the data). A single-trait model for daily gain with variance components from the routine genomic evaluations was used. All animals were phenotyped for daily gain before genotyping and before selection and mating decisions. Further details on the data are in Ostersen et al. [15]. For all genomic evaluation models, a weight $w_{a}$ of 0.25 was put on the traditional relationships, which is the standard value used in the routine genomic evaluation for this trait (for details see Christensen et al. [3]).

Table 1.

Overview of data

	DanAvl Duroc	DanAvl Landrace	DanAvl Yorkshire
Number of observations	110,072	227,786	211,311
Number of animals in pedigree	119,930	239,378	220,998
Number of genotyped animals	13,809	21,681	21,634
Number of animals in pedigree for genotyped animals	25,425	28,774	28,318

Pigs born before August 2013 were genotyped with the Illumina PorcineSNP60 Bead chip and pigs born after this date were genotyped with the 8.5 K GGP-Porcine LD Illumina Bead chip. Missing genotypes were imputed using Beagle version 3.3.2 [16]. The following SNP quality controls were applied: SNPs with a call-rate lower than 90 % across all samples genotyped with the 60 K chip were removed; SNPs with a minor allele frequency lower than 0.01 were filtered out; SNPs that deviated strongly from Hardy–Weinberg equilibrium (p < 10⁻⁷) were excluded; SNPs that were not mapped in the porcine reference genome build 10.2 [17] were also excluded. A total of 33,028, 37,841 and 36,919 SNPs were retained for the Duroc, Landrace and Yorkshire datasets, respectively. An animal’s genotypes were only retained if they had a call frequency higher than 90 % for that animal. Except for quality control and imputation of SNPs, all other data preparations and analyses were run in R [18] and DMU [19].

Scenarios evaluated

NormalG

This scenario was used as reference for all other scenarios, and was the usual single-step procedure, where matrices for genotyped animals were fully inverted and no sparsity was gained.

Random10, Random30, Random50

We chose these core groups at random with subset sizes of 10, 30 and 50 % of the genotyped animals. These scenarios were intended to show the effect of across-generation distribution. In addition to Random10, we used Random30 and Random50 to evaluate the number of core animals required for these pig populations. We investigated different random subsets, but the difference in results between two random subsets of the same size was so small (less than 0.001 difference in correlations for all scenarios), that we only report one. This is in agreement with findings from bovine studies [6].

Unrelated10

For this scenario, we chose a core group that included 10 % of the genotyped animals and that minimized the average degree of relatedness between core animals. The optimization used a genetic algorithm [20], which simulates the evolution of a set of potential solutions through recombination and mutation; the fitness function was the average relationship within the core group.

Offspring10

In this scenario, 10 % of the genotyped animals were chosen based on their number of genotyped offspring. Thus, animals were ranked according to their number of genotyped offspring and the 10 % of animals with the largest number of genotyped offspring were chosen.
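The ranking described for this scenario can be illustrated with a short sketch in plain Python. The pedigree representation (a dictionary from animal to its sire and dam), the function name `select_by_genotyped_offspring`, the tie-breaking by animal identifier and the toy animals are all assumptions made for illustration; the example uses a fraction of 20 % so that the tiny core group contains more than one animal.

```python
from collections import Counter

def select_by_genotyped_offspring(pedigree, genotyped, fraction=0.10):
    """Return the genotyped animals with the most genotyped offspring.

    pedigree maps each animal to a (sire, dam) tuple; unknown parents are None.
    Ties are broken by animal identifier (an assumption of this sketch).
    """
    genotyped = set(genotyped)
    counts = Counter()
    for animal, parents in pedigree.items():
        if animal not in genotyped:
            continue                      # only genotyped offspring are counted
        for parent in parents:
            if parent in genotyped:
                counts[parent] += 1
    n_core = max(1, int(round(fraction * len(genotyped))))
    ranked = sorted(genotyped, key=lambda a: (-counts[a], a))
    return ranked[:n_core]

# Toy pedigree: S2 and D2 are not genotyped.
pedigree = {
    "S1": (None, None), "D1": (None, None),
    "O1": ("S1", "D1"), "O2": ("S1", "D1"), "O3": ("S1", "D1"),
    "O4": ("S1", "D2"), "O5": ("S1", "D2"),
    "O6": ("O1", "D2"), "O7": ("O1", "O4"), "O8": ("S2", "D2"),
}
genotyped = ["S1", "D1", "O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8"]

core = select_by_genotyped_offspring(pedigree, genotyped, fraction=0.20)
```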

OffspringRandom10

This was a combination of the Random10 and Offspring10 scenarios. For the youngest genotyped animals (last year of birth), 10 % of the animals were chosen at random. For the older genotyped animals (all years of birth except the last), 10 % of the animals were chosen based on the number of genotyped offspring. Thus, the resulting core group size across old and young animals was 10 % of all genotyped animals.

Old10

This core group consisted of the 10 % oldest genotyped animals. This scenario was used for comparison to the other scenarios, since its characteristics were the opposite of those of the Random10 scenario. Hence, the core group represented only the oldest generations.

Young10

This core group consisted of the 10 % youngest genotyped animals. This scenario was used for comparison to the other scenarios, since its characteristics were the opposite of those of the Random10, Unrelated10 and Offspring10 scenarios. Hence, none of the animals in the core group had genotyped offspring, and they were more related and represented only one generation.
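The age-based and random core groups differ only in how animals are picked from the list of genotyped animals, as the following plain-Python sketch shows. The function name `choose_core`, the scenario labels, the use of birth year as the age measure, the tie-breaking by identifier and the toy animals are assumptions; a fraction of 10 % of 20 animals gives a core group of two.

```python
import random

def choose_core(birth_year, scenario, fraction=0.10, seed=0):
    """Pick core animals by age ("old", "young") or at random ("random").

    birth_year maps each genotyped animal to its year of birth.
    """
    # Oldest first; ties are broken by identifier (an assumption of this sketch).
    ranked = sorted(birth_year, key=lambda a: (birth_year[a], a))
    n_core = max(1, int(round(fraction * len(ranked))))
    if scenario == "old":
        return ranked[:n_core]
    if scenario == "young":
        return ranked[-n_core:]
    if scenario == "random":
        return random.Random(seed).sample(ranked, n_core)
    raise ValueError(f"unknown scenario: {scenario}")

# Toy data: 20 genotyped animals born between 2009 and 2014.
birth_year = {f"pig{i:02d}": 2009 + i % 6 for i in range(20)}

old_core = choose_core(birth_year, "old")
young_core = choose_core(birth_year, "young")
random_core = choose_core(birth_year, "random", seed=1)
```

Only the random choice is expected to spread core animals over all birth years; the two age-based groups each come from a single end of the age range.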

NormalA

In this scenario, we discarded genotypic information completely to act as a baseline scenario.

Performance criteria

We evaluated each scenario based on four indicators. First, we calculated Pearson correlations of EBV from each scenario with EBV from the usual version of the single-step procedure for genotyped animals. Second, we calculated these correlations for all animals. The main criterion was the correlation of EBV for genotyped animals. Differences between scenarios were assessed using a Hotelling-Williams t test. Third, we evaluated each scenario based on the number of PCG iterations, since they are indicators of how numerically well-conditioned the equations are, which influences computation time. The fourth indicator was the sparsity of $G^{-1} - A_{22}^{-1}$, which also influences computation time.
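The first indicator and the comparison of two scenarios can be sketched in plain Python. The statistic below is the form commonly given for Williams' modification of Hotelling's test for two dependent correlations that share one variable (here, EBV from the full model), with n − 3 degrees of freedom; whether the study used exactly this form is an assumption, and the function names and toy EBV are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def williams_t(r12, r13, r23, n):
    """t statistic for H0: r12 == r13, where variable 1 is shared (df = n - 3)."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    numerator = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    denominator = math.sqrt(2 * ((n - 1) / (n - 3)) * det
                            + r_bar**2 * (1 - r23) ** 3)
    return numerator / denominator

# Toy EBV for six animals: full model and two sparse scenarios (illustrative).
ebv_full = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ebv_a = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
ebv_b = [1.5, 1.5, 3.5, 3.0, 5.5, 5.0]

r_a = pearson(ebv_full, ebv_a)
r_b = pearson(ebv_full, ebv_b)
r_ab = pearson(ebv_a, ebv_b)
t = williams_t(r_a, r_b, r_ab, n=len(ebv_full))
```

A positive `t` indicates that scenario A correlates more strongly with the full model than scenario B; the p value would follow from a t distribution with n − 3 degrees of freedom.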

Results

Ignoring genomic information, as in the NormalA scenario, resulted in correlations of EBV with EBV based on the full single-step method for genotyped animals that were on average equal to 0.873 across the three breeds. In the following, results for alternate APY scenarios are compared to those of the full single-step methods.

Core animals

Correlations between EBV from the full model and EBV from each of the eight core groups that were created for the three pig breeds were significantly different from each other (p < 0.05). The largest correlations were realized by core groups with animals that were distributed across generations and that had many genotyped offspring. For the scenarios that distributed core animals across generations, i.e. Random10 and OffspringRandom10, correlations for genotyped animals in the three breeds ranged from 0.977 to 0.989 (Tables 2, 3, 4, 5, 6, 7). For the scenarios that did not distribute core animals across generations, i.e. Unrelated10, Old10 and Young10, correlations for genotyped animals ranged from 0.934 to 0.956. Likewise, for core groups that maximized the number of genotyped offspring, i.e. Offspring10 and OffspringRandom10, correlations for genotyped animals ranged from 0.983 to 0.989 (Tables 2, 3, 4, 5, 6, 7). Finally, for the scenarios that did not maximize the number of genotyped offspring, i.e. Random10, Unrelated10, Old10 and Young10, correlations for genotyped animals ranged from 0.934 to 0.981 (Tables 2, 3, 4, 5, 6, 7).

Table 2.

Correlations between EBV from alternate core groups and EBV from the full single-step model for DanAvl Duroc

Scenario	Cor all	Cor genotyped	PCG iterations	Sparsity of $(G^{-1} - A_{22}^{-1})$
NormalG^a	1	1	301	0.0 %
Random10^b	0.993	0.981	309	76.9 %
Unrelated10^c	0.968	0.944	412	77.4 %
Offspring10^d	0.996	0.985	298	78.5 %
OffspringRandom10^e	0.997	0.989	287	78.1 %
Random30^b	0.999	0.997	306	46.5 %
Random50^b	1.000	0.999	284	23.7 %
Old10^f	0.947	0.939	405	77.7 %
Young10^g	0.963	0.934	370	76.1 %
NormalA^h	0.965	0.901	320	–

Correlations were calculated for all animals (Cor all) and genotyped animals (Cor genotyped)

Number of PCG iterations and sparsity of the matrix involved in the single-step formula $(G^{-1} - A_{22}^{-1})$

All correlations were significantly different from each other (p < 0.05)

^aNormalG is the usual single-step procedure without sparse approximations

^bRandom10, Random30, Random50 are the sparse single-step, where a random subset of animals (10, 30, 50 %) were treated as core

^cUnrelated10 is 10 % of animals chosen as core by minimizing the degree of relatedness between core animals

^dOffspring10 is 10 % of animals chosen based on the number of genotyped offspring

^eOffspringRandom10 is, for old animals (excluding last year of birth), 10 % of animals chosen based on the number of genotyped offspring, whereas for young animals (last year of birth) 10 % of the animals were chosen at random

^fOld10 is the sparse single-step, where the 10 % oldest animals were treated as core

^gYoung10 is the sparse single-step, where the 10 % youngest animals were treated as core

^hNormalA is where genotypes are ignored completely

Table 3.

Correlations between EBV from alternate core groups and EBV from the full single-step model for DanAvl Landrace

Scenario	Cor all	Cor genotyped	PCG iterations	Sparsity of $(G^{-1} - A_{22}^{-1})$
NormalG^a	1	1	306	0.0 %
Random10^b	0.995	0.977	377	80.1 %
Unrelated10^c	0.982	0.954	492	79.9 %
Offspring10^d	0.987	0.983	330	80.4 %
OffspringRandom10^e	0.991	0.984	340	80.3 %
Random30^b	0.997	0.996	346	48.4 %
Random50^b	0.996	0.999	312	24.7 %
Old10^f	0.944	0.936	463	80.1 %
Young10^g	0.897	0.937	444	79.7 %
NormalA^h	0.977	0.858	321	–

Correlations were calculated for all animals (Cor all) and genotyped animals (Cor genotyped)

Number of PCG iterations and sparsity of the matrix involved in the single-step formula $(G^{-1} - A_{22}^{-1})$

All correlations were significantly different from each other (p < 0.05)

^aNormalG is the usual single-step procedure without sparse approximations

^bRandom10, Random30, Random50 are the sparse single-step, where a random subset of animals (10, 30, 50 %) were treated as core

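^cUnrelated10 is 10 % of animals chosen as core by minimizing the degree of relatedness between core animals

^dOffspring10 is 10 % of animals chosen based on the number of genotyped offspring

^eOffspringRandom10 is, for old animals (excluding last year of birth), 10 % of animals chosen based on the number of genotyped offspring, whereas for young animals (last year of birth) 10 % of the animals were chosen at random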

^fOld10 is the sparse single-step, where the 10 % oldest animals were treated as core

^gYoung10 is the sparse single-step, where the 10 % youngest animals were treated as core

^hNormalA is where genotypes are ignored completely

Table 4.

Correlations between EBV from alternate core groups and EBV from the full single-step model for DanAvl Yorkshire

Scenario	Cor all	Cor genotyped	PCG iterations	Sparsity of $(G^{-1} - A_{22}^{-1})$
NormalG^a	1	1	303	0.0 %
Random10^b	0.995	0.978	348	80.1 %
Unrelated10^c	0.985	0.956	471	80.0 %
Offspring10^d	0.997	0.984	319	80.4 %
OffspringRandom10^e	0.997	0.985	325	80.4 %
Random30^b	0.999	0.996	321	48.4 %
Random50^b	1.000	0.999	292	24.7 %
Old10^f	0.967	0.946	442	80.2 %
Young10^g	0.980	0.943	439	79.8 %
NormalA^h	0.976	0.858	300	–