Skip to main content
Genetics, Selection, Evolution : GSE logoLink to Genetics, Selection, Evolution : GSE
. 2020 May 13;52:24. doi: 10.1186/s12711-020-00543-9

Computational strategies for the preconditioned conjugate gradient method applied to ssSNPBLUP, with an application to a multivariate maternal model

Jeremie Vandenplas 1,, Herwin Eding 2, Maarten Bosmans 3, Mario P L Calus 1
PMCID: PMC7222437  PMID: 32404053

Abstract

Background

The single-step single nucleotide polymorphism best linear unbiased prediction (ssSNPBLUP) is one of the single-step evaluations that enable a simultaneous analysis of phenotypic and pedigree information of genotyped and non-genotyped animals with a large number of genotypes. The aim of this study was to develop and illustrate several computational strategies to efficiently solve different ssSNPBLUP systems with a large number of genotypes on current computers.

Results

The different developed strategies were based on simplified computations of some terms of the preconditioner, and on splitting the coefficient matrix of the different ssSNPBLUP systems into multiple parts to perform its multiplication by a vector more efficiently. Some matrices were computed explicitly and stored in memory (e.g. the inverse of the pedigree relationship matrix), or were stored using a compressed form (e.g. the Plink 1 binary form for the genotype matrix), to permit the use of efficient parallel procedures while limiting the required amount of memory. The developed strategies were tested on a bivariate genetic evaluation for livability of calves for the Netherlands and the Flemish region in Belgium. There were 29,885,286 animals in the pedigree, 25,184,654 calf records, and 131,189 genotyped animals. The ssSNPBLUP system required around 18 GB Random Access Memory and 12 h to be solved with the most performing implementation.

Conclusions

Based on our proposed approaches and results, we showed that ssSNPBLUP provides a feasible approach in terms of memory and time requirements to estimate genomic breeding values using current computers.

Background

Genomic data for livestock often include around 50 thousand single nucleotide polymorphism (SNPs), and are used in genomic prediction to obtain genomic estimated breeding values [1]. While some challenges must be still solved, the method of choice for genomic prediction is currently the so-called single-step genomic best linear unbiased prediction (ssGBLUP) that simultaneously analyses phenotypic and pedigree information of genotyped and non-genotyped animals with genomic information of genotyped animals [1]. ssGBLUP considers genomic information by combining genomic and pedigree relationships into a combined genomic-pedigree relationship matrix [2, 3]. A drawback of ssGBLUP is that it requires the inverse of the genomic relationship matrix (G), which can be computed up to approximately 100,000 genotyped animals on current computers [4]. As a result of this limitation some methods were proposed to approximate, or to compute implicitly, the inverse of G [46].

Equivalent models that directly estimate SNP effects and that do not rely on G, hereafter called ssSNPBLUP, were also proposed [79]. However, these models have not yet been implemented and tested on a large scale due to several reasons, such as the lack of breeding value estimation software that is flexible enough to perform ssSNPBLUP, more complicated modeling compared to ssGBLUP, and convergence issues [1]. In Vandenplas et al. [10], we proposed a preconditioned conjugate gradient (PCG) method with a second-level preconditioner that is easy to implement, and that substantially improves the convergence issues associated with two ssSNPBLUP systems. The objective of this paper is to present several computational strategies that improve the efficiency of solving two different ssSNPBLUP systems efficiently with a PCG method. These strategies aim at taking advantage of existing shared-memory parallel libraries while limiting the amount of required random access memory (RAM). Some of these computational strategies can also be implemented in breeding value estimation software that rely on ssGBLUP.

Methods

Two ssSNPBLUP systems

In this study, we investigate the ssSNPBLUP linear equations system proposed by Mantysaari and Stranden [11] (ssSNPBLUP_MS) and the ssSNPBLUP linear equations system proposed by Liu et al. [9] (ssSNPBLUP_Liu). The two ssSNPBLUP systems are equivalent and both systems of equations can be summarized as:

Cixi=bi,

where i refers to the linear system proposed by Mantysaari and Stranden [11] (i = MS) or to the linear system proposed by Liu et al. [9] (i = Liu), Ci is a symmetric (semi-)definite coefficient matrix, xi is the vector of solutions, and bi is the right-hand side of the linear system.

For simplicity, and without loss of generality, the different matrices and vectors are described below for a univariate animal model. For the linear system of Mantysaari and Stranden [11], xMS=β^u^na^gg^ where β is the vector of fixed effects, the subscripts g and n refer to ng genotyped and nn non-genotyped animals, respectively, un is the vector of additive genetic effects for non-genotyped animals, ag is the vector of residual polygenic effects for genotyped animals, and g is the vector of SNP effects. The vector bMS is equal to bMS=XR-1yWnRn-1ynWgRg-1ygZWgRg-1yg where y is the vector of records, and the matrices X, Wn and Wg are incidence matrices relating records to the corresponding effects. The matrix Z contains the SNP genotypes (coded as 0 for one homozygous genotype, 1 for the heterozygous genotype, or 2 for the alternate homozygous genotype) centered by their observed means. The matrix R-1=Rn-100Rg-1 is the inverse of the residual (co)variance structure matrix. The coefficient matrix CMS is equal to:

CMS=XR-1XXnRn-1WnXgRg-1WgXgRg-1WgZWnRn-1XnWnRn-1Wn+ΣMS11ΣMS12ΣMS13WgRg-1XgΣMS21WgRg-1Wg+ΣMS22WgRg-1WgZ+ΣMS23ZWgRg-1XgΣMS31ZWgRg-1Wg+ΣMS32ZWgRg-1WgZ+ΣMS33

where ΣMS-1=ΣMS11ΣMS12ΣMS13ΣMS21ΣMS22ΣMS23ΣMS31ΣMS32ΣMS33=AnnAngAngZAgn1wAgg+1-1wQQZZAgnZQZQZ+m1-wIσu-2 and where σu-2 is the inverse of the additive genetic variance, w is the proportion (strictly between 0 and 1) of variance (due to additive genetic effects) considered as residual polygenic effects, and m=2pj1-pj with pj being the observed allele frequency of the j-th SNP. The matrix Q is equal to Q=AgnAnn-1Ang, where A-1=AnnAngAgnAgg is the inverse of the pedigree relationship matrix.

For the linear system of Liu et al. [9], xLiu=β^u^nu^gg^ where ug=ag+Zg is the vector of additive genetic effects for genotyped animals. The vector bLiu is equal to bLiu=XR-1yWnRn-1ynWgRg-1yg0. The coefficient matrix CLiu is equal to:

CLiu=XR-1XXnRn-1WnXgRg-1Wg0WnRn-1XnWnRn-1Wn+ΣLiu11ΣLiu12ΣLiu13WgRg-1XgΣLiu21WgRg-1Wg+ΣLiu22ΣLiu230ΣLiu31ΣLiu32ΣLiu33

where ΣLiu-1=ΣLiu11ΣLiu12ΣLiu13ΣLiu21ΣLiu22ΣLiu23ΣLiu31ΣLiu32ΣLiu33=AnnAng0AgnAgg+1w-1Agg-1-1wAgg-1Z0-1wZAgg-11wZAgg-1Z+m1-wIσu-2. It is worth noting that Agg-1=Agg-Q [12].

A PCG method

A PCG method is an iterative method that uses successive approximations to obtain more accurate solutions for a linear system at each iteration step [13]. Our implementation of the preconditioned system of linear equations of both ssSNPBLUP has the form:

D-1M-1Cx=D-1M-1b, 1

where M is a preconditioner defined below, and D is a second-level diagonal preconditioner proposed by Vandenplas et al. [10] and described in the “Analyses” section.

The main computational costs of the PCG method for solving ssSNPBLUP systems are the computation of some terms of the preconditioner M and the multiplication of the coefficient matrix C by a vector at each PCG iteration. In the next section, we propose computational approaches at approximating specific elements of M and to multiply C by a vector in an efficient manner.

Computation of the preconditioner M

In animal breeding, a (block-)diagonal preconditioner is commonly used [14]. The (block-)diagonal elements of matrices, such as XR-1X and WnRn-1Wn+Σi11 (i = MS, Liu), can be easily obtained, in contrast to WgRg-1Wg+Σi22 and ZWgRg-1WgZ+ΣMS33 which contain terms like Agg-1, ZAgg-1Z, or ZQZ. Since the preconditioner aims to approximate the coefficient matrix, we approximate diagAgg-1 with a Monte Carlo approach based on 1000 samples, as proposed by Masuda et al. [15]. Furthermore, the j-th diagonal element of diagZAgg-1Z was approximated to 2ngpj(1-pj), and the j-th diagonal element of diagZQZ was approximated to 2ng+noffspringpj1-pj, where noffspring is the total number of offspring of all the ng genotyped animals (see Additional file 1 for derivations). These approximations always provided the same convergence rate compared with the exact values (results not shown).

Computational strategies for the multiplication of C by a vector

Our approach for the efficient multiplication of C by a vector, e.g. x, relies on splitting the coefficient matrix C into multiple parts for which the multiplication by a vector is easier to perform.

For ssSNPBLUP_MS, the coefficient matrix CMS can be split into:

CMS=TCMSLS+CMSR1T+CMSR2 2

with T=I0000I0000I0000Z, CMSLS=XR-1XXnRn-1WnXgRg-1WgXgRg-1WgWnRn-1XnWnRn-1Wn00WgRg-1Xg0WgRg-1WgWgRg-1WgWgRg-1Xg0WgRg-1WgWgRg-1Wg, CMSR1=0000000Ang001-1wQQ0AgnQQσu-2, and CMSR2=00000AnnAng00Agn1wAgg0000m1-wIσu-2.

The multiplication of CMS by a vector, e.g. xMS, can be easily computed in multiple steps as follows:

CMSxMS=TCLSv1+CMSR1v1+CMSR2xMS

where the brackets . indicate the order of the matrix-vector operations, and v1=TxMS=b^u^na^gZg^.

For ssSNPBLUP_Liu, the multiplication of the coefficient matrix CLiu by a vector, e.g. xLiu, can be performed in multiple steps as:

CLiuxLiu=CLiuLSxLiu+TCLiuR1TxLiu+CLiuR2xLiu 3

with CLiuLS=XR-1XXnRn-1WnXgRg-1Wg0WnRn-1XnWnRn-1Wn00WgRg-1Xg0WgRg-1Wg00000, CLiuR1=00000000001w-1Agg-1-1wAgg-100-1wAgg-11wAgg-1σu-2, and CLiuR2=00000AnnAng00AgnAgg0000m1-wIσu-2.

It is worth noting that the multiplication of CMSLS and of CLiuLS by a vector can be performed with approaches that have already been developed in animal breeding, such as iteration-on-data approaches [1619], because these matrices are similar to those obtained with traditional pedigree BLUP. Similarly, the multiplication of CMSR2 and of CLiuR2 (both involving A-1) by a vector, can be easily computed using strategies such as those developed by Stranden and Lidauer [18], or as described below.

In the following, we describe in detail computational strategies for multiplying efficiently submatrices of T, of its transpose, of CMSR1, of CMSR2, of CLiuR1, and of CLiuR2, by a vector. It should be noted that the multiplication of these matrices requires the multiplication of the centered genotype matrix Z, its transpose Z, Q, and Agg-1, by an array. Furthermore, while the proposed computational strategies are described in the context of a univariate animal model, they are readily applicable to more complex models, such as multivariate maternal models (see Additional file 2 for a description of a ssSNPBLUP_MS system associated with a standard bivariate maternal model).

Multiplication of CiLS by a vector

The implemented approach for multiplying CiLS (i = MS, Liu) by a vector was the three-step approach combined with an iteration-on-data technique, as proposed by Stranden and Lidauer [18]. The phenotypes and associated levels for all effects were stored in RAM to allow shared-memory parallelization. Phenotypes were stored using double precision reals, and levels for all effects were stored using 4-byte integers. Each thread was associated with a same amount of records to make the computations involving submatrices of CiLS as even as possible across the threads. Furthermore, the records were sorted following an increasing order of the effect with the largest number of levels to minimize RAM required by the temporary arrays.

Multiplication of Z, or Z, by an array

The main cost of the multiplication of the matrix T, or its transpose, by an array is the multiplication of the centered genotyped matrix Z, or its transpose Z, by an array.

To benefit from shared-memory parallel programming while limiting the amount of RAM required, the SNP genotypes included in Z were stored in RAM using the Plink 1 binary form [20]. In brief, the value of each SNP locus (coded as 0 for one homozygous genotype, 1 for the heterozygous genotype, 2 for the alternate homozygous genotype, or missing) is coded using 2 bits, and each byte (B) stores the genotype of four genotyped animals for a same SNP (see [20] for more details). Observed allele frequencies needed for centering SNP genotypes were stored into a double precision real array. This approach requires ngnSNP4 B to store the genotypes and 8nSNP B to store the allele frequencies. For example, to store one million genotypes with 50,000 SNPs, this approach requires around 12 GB RAM. In comparison, the storage of the same information using a double precision real array would require 32 times more RAM, i.e. around 373 GB.

Because the matrix Z is stored in Plink 1 binary form in RAM, a custom implementation of a Matrix-Matrix product is needed. The matrix Z is split into small blocks intended to fit into the CPU cache. Each block of Z is converted into a small matrix of double-precision numbers (corresponding to centered genotypes or zero for missing values) and subsequently multiplied with part of the array. This implementation uses vectorization and loop unrolling to make optimal use of available hardware resources on modern CPUs.

While it might not be straightforward to implement, the proposed approach for multiplying Z, or its transpose, by an array could be also used in single-step evaluations that rely on genomic relationship matrices. Indeed, the multiplication of the inverse of the genomic relationship matrix by an array could be replaced by a system of equations that would be solved iteratively and that would require the multiplication of Z, and its transpose, by an array [21].

Multiplication of A-1 by an array

The multiplication of the matrices CMSR2, and CLiuR2, by an array requires the multiplication of A-1 by an array. Due to the small amounts of RAM available in the past, an approach that only requires reading the pedigree was developed to multiply A-1 by an array [16]. While such an approach is memory-efficient, it does not allow an efficient shared-memory parallelization of the multiplication of A-1 by an array.

With the current large amounts of RAM available, it is now possible to store A-1 in RAM, even for large pedigrees. For our implementation, since A-1 is a sparse and symmetric matrix, its upper triangular part is stored in RAM using the well-known and widely used 3-array variation of the compressed row storage (CRS3) format [13]. The CRS3 format of a sparse matrix is specified by two arrays of (4-byte) integers (named IA and JA) and one (double precision) real array (named AA). The array IA, of size equal to the number of rows of the sparse matrix plus one, contains the pointers to the beginning of each row of the sparse matrix in the arrays JA and AA. The array JA, of size equal to the number of non-zero real values, contains the column indices of the corresponding elements stored in AA. The array AA contains the non-zero real values of the sparse matrix [13].

Following Henderson’s rules to construct A-1 recursively [22], adding the contributions of one animal to A-1 leads to adding three diagonal elements and three off-diagonal elements to the upper-triangular part of A-1. Therefore, assuming that there are n animals in the pedigree, the maximum number of non-zeros elements in the upper-triangular part of A-1 is equal to 4n (that is, the sum of n diagonal elements and of 3n off-diagonal elements). With the CRS3 format, an upper bound of RAM needed to store the upper triangular part of A-1 as a sparse matrix using double precision reals is equal to the sum of 4(n+1) B for the array IA, of 44n B for the array JA, and of 84n B for the array AA, which is equal to a total of 52n+4 B. This upper bound increases linearly as the number of animals increases in the pedigree, and is equal, for example, to 1.45 GB for a pedigree with 30 million animals. Using shared-memory parallel programming, efficient libraries, such as sparse BLAS routines, can be used for multiplying A-1 by an array.

Multiplication of Q by an array

The multiplication of CMSR1 by an array implies several multiplications of the matrix Q by an array, and subsequently several multiplications of Ann-1 by an array. This matrix Ann-1 has a size almost equal to the number of animals in the pedigree, because, for most single-step genomic evaluations, the number of genotyped animals is a small fraction of the number of animals in the pedigree. An alternative computation of the matrix Q is as follows (see Additional file 3 for the derivation):

Q=AancgnAancgg-1Aancng+Δ

with the matrices Aancii being submatrices of the inverse of the pedigree relationship matrix that include only the genotyped animals and their ancestors, and the matrix Δ being equal to Δ=Agg-Aancgg.

Based on Henderson’s rules [22] to directly construct A-1, it follows that the matrix Δ contains only the contributions of the non-genotyped offspring of the genotyped animals that are not ancestors of genotyped animals (see Additional file 3 for details). Therefore, the matrix Δ can be easily and directly constructed by reading the pedigree only once.

Finally, it is worth noting that the multiplication of CMSR1 by a vector involves four multiplications of Q by an array. However, only two multiplications of Q by an array are actually required due to the presence of the same multiplications.

Multiplication of Agg-1 by an array

The multiplication of CLiuR1 by a vector implies two multiplications of the matrix Agg-1 by an array. As proposed by Stranden et al. [12], the multiplication of Agg-1 by an array is performed using sparse matrices:

Agg-1=Aancgg-AancgnAancnn-1Aancng.

The sparse matrices Aancgg, Aancgn, and Aancnn, were stored in RAM to enable shared-memory parallelization.

Data

The implementations of ssSNPBLUP as described in the previous sections were compared to each other in terms of computational efficiency. This comparison was based on data and associated variance components from the bivariate routine genetic evaluation published in April 2019 for livability of calves for the Netherlands and the Flemish region in Belgium [23, 24]. The data file included 25,184,654 calf records. The pedigree included 29,885,286 animals. The genotypes included 37,995 segregating SNPs, and were associated with 131,189 animals without phenotypes and with 129,402 animals with phenotypes.

The two traits are livability of calves born from heifers, and livability of calves born from multiparous cows. The bivariate mixed model included random effects (correlated additive direct and maternal genetic effects, permanent environmental effect and residual), fixed co-variables ((direct and maternal) heterosis and recombination effects) and fixed cross-classified effects (herd x year x season, year x month, age at calving, and parity). More details about the model and genetic parameters can be found in [23] and [24].

For both ssSNPBLUP_Liu and ssSNPBLUP_MS, the observed allele frequencies were used to center the genotype matrix, and the compatibility between pedigree and genomic information was guaranteed by fitting two J covariates (corresponding to the additive and maternal genetic effects) as fixed effects in the model [25]. The proportion of variance (due to additive genetic effects) considered as residual polygenic effects, w, was assumed to be equal to 0.05.

Analyses

Both ssSNPBLUP_MS and ssSNPBLUP_Liu were solved by using a Fortran 2003 program that implements the described computational approaches. The program also exploits BLAS and sparse BLAS routines, the parallel direct sparse solver PARDISO, all from the multi-threaded Intel Math Kernel Library 11.3.2, and OpenMP parallel computing. Except for the preconditioner, all real vectors and matrices were stored using double precision reals. For comparison, ssSNPBLUP_Liu was also performed with the centered genotyped matrix stored in RAM using double precision reals, instead of the Plink 1 binary form.

In this study, the preconditioner is defined for both ssSNPBLUP as:

M=diagCf1,f1000block_diagCrr000Cf2,f2+10-4diagCf2,f2

where the subscripts f1, f2, and r refer to the equations associated with the herd x year x season effect, the other fixed effects, and the random effects, respectively, and block_diagCrr is a block-diagonal matrix with blocks corresponding to equations for different traits within a level (e.g. an animal). The diagonal and block-diagonal elements of the preconditioner were stored using single precision reals, while the matrix Cf2,f2+10-4diagCf2,f2 was stored using the CRS3 format described earlier.

The diagonal elements of the second-level diagonal preconditioner D that correspond to the equations of the direct and maternal effects of the SNP effects were equal to 103 for ssSNPBLUP_MS, and 102 for ssSNPBLUP_Liu [10]. Other diagonal elements were equal to 1.

For both ssSNPBLUP systems, convergence was achieved when ri,kbi<10-6 with . being the 2-norm, and ri,k being the residual after k+1 iterations computed as ri,k=bi-Cixi,k, although it is not strictly comparable across systems. For all systems, the smallest and largest eigenvalues of the preconditioned coefficient matrices D-1M-1C that influence the convergence of the PCG method were estimated using the Lanczos method based on information obtained from the PCG method [26]. Effective condition numbers were computed from the ratio of these estimates, as this provides an indication of the properties of the preconditioned system of equations, with higher effective spectral condition numbers being associated with poorer convergence [27].

All computations were performed on a computer with 528 GB and running RedHat 7.4 (x86_64) with an Intel Xeon E5-2667 (3.20 GHz) processor with 16 cores. The number of OpenMP threads used for all computations was equal to 5. All reported times are indicative, because they may have been influenced by other jobs running simultaneously on the computer.

Results

Characteristics and results for different parts of the preparation and solving steps for ssSNPBLUP_MS and ssSNPBLUP_Liu using the Plink 1 binary form, or using double precision reals, are in Tables 1 and 2. All three ssSNPBLUP systems included 142,283,778 equations. Estimates for all fixed effects, additive direct and maternal genetic effects, and other random effects, of the three ssSNPBLUP systems were (almost) the same after convergence was reached (e.g., the Pearson correlations between all estimates for the direct and maternal genetic effects of the three systems were higher than 0.999).

Table 1.

Characteristics of different ssSNPBLUP systems

Characteristic ssSNPBLUP_Liu (Plink)a ssSNPBLUP_Liu (DP)b ssSNPBLUP_MSc
Number of iterations 3,358 3,359 6,334
Smallest eigenvalue 2.30410-6 2.30410-6 1.98910-6
Largest eigenvalue 3.813 3.813 5.194
Spectral condition number 1.655106 1.655106 2.612106
Software peak memory (MB)d 18,120.7 89,615.7 27,780.3

assSNPBLUP model proposed by Liu et al. [9] and using the Plink 1 binary form; or b using double precision reals; cssSNPBLUP model proposed by Mantysaari and Stranden [11] and using the Plink 1 binary form; dThe software peak memory is defined as the peak resident size (VmHWM) obtained from the Linux /proc virtual file system

Table 2.

Wall clock times for the preparation and solving processes of different ssSNPBLUP systems

Wall clock time (s) ssSNPBLUP_Liu (Plink)a ssSNPBLUP_Liu (DP)b ssSNPBLUP_MSc
diagAgg-1d 136.69 139.78 136.20
Preconditionerd 546.41 581.42 1,177.26
A-1d 50.62 57.40 50.77
Zve 3.47 7.77 3.42
Zve 1.53 4.61 1.45
Average time/iteration 12.82 20.23 16.89
Iterative process 43,074.48 67,961.84 107,041.71
Software total time 44,531.00 69,593.09 109,126.07

assSNPBLUP model proposed by Liu et al. [9] and using the Plink 1 binary form; or busing double precision reals; cssSNPBLUP model proposed by Mantysaari and Stranden [11] and using the Plink 1 binary form; dWall clock time needed for the computation of the mentioned matrix; eMultiplication of the centered genotype matrix, or its transpose, by an array

The wall clock time spent outside the iterative process varied between 1456 s for ssSNPBLUP_Liu using the Plink 1 binary form and 2084 s for ssSNPBLUP_MS. Those times include input/output operations and computations of several matrices. For example, the computation of the diagonal elements of the matrix Agg-1 using a Monte Carlo method [15] required less than 140 s for each of the three evaluations (Table 2). As described by Masuda et al. [15], the Monte Carlo method only requires Aancgg, Aancgn, and Aancnn. These three sparse matrices were computed using the pedigree of the 558,642 ancestors of the 260,591 genotyped animals. Also, the preparation of A-1 for the whole pedigree, i.e. for the 29,885,286 animals, required less than a minute (Table 2) and about 1.40 GB RAM. Finally, while the same amount of RAM (i.e. 807.71 MB) was required across the three evaluations, the computation of the preconditioner M for ssSNPBLUP_MS needed about twice the wall clock time of the computation of M for ssSNPBLUP_Liu (Table 2). This was due to the fact that the diagonal elements of ZWgRg-1WgZ were computed explicitly for ssSNPBLUP_MS. This additional computation also explains the additional wall clock time needed for ssSNPBLUP_MS outside the iterative process.

As expected, ssSNPBLUP_Liu using the Plink 1 binary form and ssSNPBLUP_Liu using double precision reals, converged in about the same number of iterations (i.e. around 3360 iterations; Fig.  1; Table 1). Their preconditioned coefficient matrices had an effective spectral condition number equal to 1.655106, resulting from the same extreme eigenvalues (Table 1). Differences between the two ssSNPBLUP_Liu were observed at the level of their performances. ssSNPBLUP_Liu using the Plink 1 binary form required a maximum of around 18 GB RAM and about 13 s per iteration. In comparison, ssSNPBLUP_Liu using double precision reals required a maximum of around 89 GB RAM and about 20 s per iteration (Tables 1 and 2). The increase in RAM was due to the fact that the centered genotyped matrix stored with double precision reals required about 74 GB RAM (versus <3 GB RAM with the Plink 1 binary form). The increase in time per iteration was due to the fact that the wall clock time for the multiplication of the centered genotyped matrix by an array using the Intel MKL DGEMM subroutine was more than twice the wall clock time needed for the same multiplication using our subroutine with the Plink 1 binary form (Table 2). Due to this increase in time per iteration, ssSNPBLUP_Liu using double precision reals needed about 56% more wall clock time to complete than ssSNPBLUP_Liu using Plink 1 binary form (that required about 12 h to complete) (Table 2). Using the Plink 1 binary form instead of double precision reals to store the genotype matrix in-memory is therefore beneficial for both memory and time requirements.

Fig. 1.

Fig. 1

Termination criteria for different ssSNPBLUP systems. The three systems investigated were a ssSNPBLUP system proposed by Liu et al. [9] using the Plink 1 binary form, or using double precision (DP) reals, and a ssSNPBLUP proposed by Mantysaari and Stranden [11] using the Plink 1 binary form

In comparison to ssSNPBLUP_Liu using the Plink 1 binary form, ssSNPBLUP_MS using the Plink 1 binary form was less efficient in terms of convergence, wall clock time, and RAM (Figure 1; Tables 1 and 2). The PCG method required a total of 6334 iterations to reach convergence, which can be partly explained by a larger spectral condition number, equal to 2.612106. Previously Vandenplas et al. [10] noted that spectral condition numbers and convergence of the PCG method for ssSNPBLUP_MS are worse than for ssSNPBLUP_Liu. Furthermore, ssSNPBLUP_MS required 5 additional seconds per iteration in comparison to ssSNPBLUP_Liu using the Plink 1 binary form (Table 2). This additional time per iteration is mainly due to additional computations needed for ssSNPBLUP_MS when multiplying CMSLS and CMSR1 by a vector. For CMSLS, compared to CLiuLS, this was due to the additional non-zero entries for the SNP equations. For CMSR1, compared to CLiuR1, the extra time needed was mainly due to the presence of Ang and its transpose. The larger number of iterations to reach convergence and the longer time per iteration are the two main reasons that explain that ssSNPBLUP_MS completed in almost three times the time needed for ssSNPBLUP_Liu using the Plink 1 binary form. Finally, ssSNPBLUP_MS also required more RAM than ssSNPBLUP_Liu (around 52% more) due to additional temporary arrays to perform the multiplication of the Eq. (2).

Discussion

In this study, several computational strategies were proposed to compute a preconditioner M for different ssSNPBLUP systems and to multiply the associated coefficient matrix C by a vector efficiently. The different strategies are based on approximations for the computation of the preconditioner, and on the splitting of the coefficient matrix C into multiple parts. Some matrices, such as A-1, are also computed explicitly and stored in RAM to enable the use of efficient parallel libraries (e.g. BLAS and sparse BLAS). We also developed an approach to multiply a centered genotype matrix by an array when the genotype matrix is stored using a Plink 1 binary form. In general, it is not possible to write a matrix-matrix product subroutine that outperforms a good BLAS DGEMM implementation like the one found in the Intel MKL by a significant margin, if at all. We have shown however that significantly better performance can be achieved by storing the genotype matrix in a compressed form and applying the computation directly to that form.

Across the three implemented evaluations, ssSNPBLUP_Liu using the Plink 1 binary form outperformed the two others in terms of RAM and time requirements. Regarding RAM requirements, the main gain can be explained by the use of the Plink 1 binary form. Assuming one million genotypes of 50,000 SNPs, using the Plink 1 binary form would require around 12 GB to store the genotype matrix, while using double precision reals would require around 373 GB. Even with dimensionality-reduction methods [6, 28], single-step evaluations will still require more RAM than with the Plink 1 binary form. For example, assuming that 20,000 eigenvalues explain 99% of the variation of the genomic information, around 149 GB would still be needed to store the reduced genotype matrix. Similar amounts of RAM would also be required for single-step evaluations using dosage scores (e.g. to account for imputation errors [29, 30]), or based on the algorithm for proven and young animals [4] or on the Woodbury decomposition of the genomic relationship matrix [5], because these approaches require real arrays. Therefore, for a same amount of RAM, ssSNPBLUP using the Plink 1 binary form allows more genotyped animals in a single-step evaluation than the other approaches. A second reason of smaller RAM requirements by ssSNPBLUP_Liu is that in our implementation fewer temporary arrays were needed for ssSNPBLUP_Liu than for ssSNPBLUP_MS.

Regarding the time requirements of the different approaches implemented, ssSNPBLUP_Liu using the Plink 1 binary form used the smallest amount of time per iteration due to its use of the Plink 1 binary form and to fewer multiplications needed than ssSNPBLUP_MS. In addition, the convergence properties of ssSNPBLUP_Liu are better than those of ssSNPBLUP_MS [10, 31]. Hence, it is preferable to implement ssSNPBLUP_Liu instead of ssSNPBLUP_MS. It is also worth noting that the actual runtimes could be shorter than those reported in this study. For example, for direct and maternal genetic effects, as well as for direct and maternal SNP effects, of ssSNPBLUP_Liu, the Pearson correlations between estimates obtained when the termination criterion reached 10-5 (i.e. after 2032 iterations) and when it reached 10-6 (i.e. after 3358 iterations; Table 1) were all higher than 0.999. Further investigation on convergence criteria applied to ssSNPBLUP are therefore needed.

Our splitting of the coefficient matrix C of the two ssSNPBLUP systems into multiple parts to efficiently calculate its multiplication by a vector, should facilitate the implementation of ssSNPBLUP in breeding value estimation software currently used in animal breeding. Indeed, current software that implement ssGBLUP have already procedures to perform efficiently multiplications involving matrices such as CiLS, CiR1, or CiR2 [16, 18]. To enable running ssSNPBLUP with those software, requires the implementation of at least two multiplications of the centered genotype matrix by an array. Finally, the computational strategies for two ssSNPBLUP systems proposed in this manuscript can be readily adapted for other ssSNPBLUP systems proposed in the literature [7, 8, 32].

Conclusions

Based on the proposed approaches and our results, we showed that ssSNPBLUP provides a feasible approach to estimate genomic breeding values using current computers without resource to graphics processing units or special architecture. Using the Plink 1 binary form efficiently throughout the whole breeding value estimation process is relatively straightforward with a ssSNPBLUP approach, and allows to include more genotyped animals in a single-step evaluation than other single-step approaches with a same amount of RAM. The ssSNPBLUP approach proposed by Liu et al. [9] in combination with the Plink 1 binary form and solved with a PCG method with a second-level preconditioner was shown to be the most efficient approach in terms of memory and time requirements.

Supplementary information

12711_2020_543_MOESM1_ESM.pdf (227.9KB, pdf)

Additional file 1. Derivation of formula to approximate diagZAgg-1Z and diag(ZAgn(Ann)-1AngZ) . Derivation of formula to approximate diagZAgg-1Zand diag(ZAgn(Ann)-1AngZ)[3335].

12711_2020_543_MOESM2_ESM.pdf (192.4KB, pdf)

Additional file 2.Description of the system of equations of ssSNPBLUP MS for a bivariate maternal model. Description of the system of equations of ssSNPBLUP MS for a bivariate maternal model, as well as of the different submatrices needed for the proposed computational strategies.

12711_2020_543_MOESM3_ESM.pdf (103.4KB, pdf)

Additional file 3. Derivation of an alternative computation of Agn(Ann)-1AngDerivation of an alternative computation of Agn(Ann)-1Angthat used only the ancestors and the progeny of the genotyped animals, instead of the complete pedigree.

Acknowledgements

The use of the high-performance cluster was made possible by CAT-AgroFood (Shared Research Facilities Wageningen UR, Wageningen, the Netherlands).

Authors' contributions

JV conceived the study design, ran the tests, and wrote the programs and the first draft. MB wrote the subroutine for the multiplication of a centered genotype matrix by an array using the Plink 1 binary form. HE prepared data. All authors provided valuable insights throughout the writing process. All authors read and approved the final manuscript.

Funding

This study was financially supported by the Dutch Ministry of Economic Affairs (TKI Agri \& Food Project 16022) and the Breed4Food partners Cobb Europe (Colchester, Essex, United Kingdom), CRV (Arnhem, the Netherlands), Hendrix Genetics (Boxmeer, the Netherlands), and Topigs Norsvin (Helvoirt, the Netherlands).

Ethics approval and consent to participate

The data used for this study were collected as part of routine data recording for a commercial breeding program. Samples collected for DNA extraction were only used for the breeding program. Data recording and sample collection were conducted strictly in line with the Dutch law on the protection of animals (Gezondheids- en welzijnswet voor dieren).

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Jeremie Vandenplas, Email: jeremie.vandenplas@wur.nl.

Herwin Eding, Email: herwin.eding@crv4all.com.

Maarten Bosmans, Email: maarten.bosmans@vortech.nl.

Mario P. L. Calus, Email: mario.calus@wur.nl

Supplementary information

Supplementary information accompanies this paper at 10.1186/s12711-020-00543-9.

References

  • 1.Legarra A, Christensen OF, Aguilar I, Misztal I. Single step, a general approach for genomic selection. Livest Sci. 2014;166:54–65. doi: 10.1016/j.livsci.2014.04.029. [DOI] [Google Scholar]
  • 2.Christensen OF, Lund MS. Genomic prediction when some animals are not genotyped. Genet Sel Evol. 2010;42:2. doi: 10.1186/1297-9686-42-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Legarra A, Aguilar I, Misztal I. A relationship matrix including full pedigree and genomic information. J Dairy Sci. 2009;92:4656–4663. doi: 10.3168/jds.2009-2061. [DOI] [PubMed] [Google Scholar]
  • 4.Misztal I. Inexpensive computation of the inverse of the genomic relationship matrix in populations with small effective population size. Genetics. 2016;202:401–409. doi: 10.1534/genetics.115.182089. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Mäntysaari EA, Evans RD, Strandén I. Efficient single-step genomic evaluation for a multibreed beef cattle population having many genotyped animals. J Anim Sci. 2017;95:4728–4737. doi: 10.2527/jas2017.1912. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Ødegård J, Indahl U, Strandén I, Meuwissen THE. Large-scale genomic prediction using singular value decomposition of the genotype matrix. Genet Sel Evol. 2018;50:6. doi: 10.1186/s12711-018-0373-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Fernando RL, Dekkers JC, Garrick DJ. A class of Bayesian methods to combine large numbers of genotyped and non-genotyped animals for whole-genome analyses. Genet Sel Evol. 2014;46:50. doi: 10.1186/1297-9686-46-50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Fernando RL, Cheng H, Golden BL, Garrick DJ. Computational strategies for alternative single-step Bayesian regression models with large numbers of genotyped and non-genotyped animals. Genet Sel Evol. 2016;48:96. doi: 10.1186/s12711-016-0273-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Liu Z, Goddard M, Reinhardt F, Reents R. A single-step genomic model with direct estimation of marker effects. J Dairy Sci. 2014;97:5833–5850. doi: 10.3168/jds.2014-7924. [DOI] [PubMed] [Google Scholar]
  • 10.Vandenplas J, Calus MPL, Eding H, Vuik C. A second-level diagonal preconditioner for single-step SNPBLUP. Genet Sel Evol. 2019;51:30. doi: 10.1186/s12711-019-0472-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Mäntysaari EA, Strandén I. Single-step genomic evaluation with many more genotyped animals. In: Proceedings of the 67th Annual Meeting of the European Association for Animal Production: 29 August-2 September 2016; Belfast; 2016.
  • 12.Strandén I, Matilainen K, Aamand GP, Mäntysaari EA. Solving efficiently large single-step genomic best linear unbiased prediction models. J Anim Breed Genet. 2017;134:264–274. doi: 10.1111/jbg.12257. [DOI] [PubMed] [Google Scholar]
  • 13.Saad Y. Other titles in applied mathematics. 2. Philadelphia: Society for Industrial and Applied Mathematics; 2003. Iterative methods for sparse linear systems. [Google Scholar]
  • 14.Strandén I, Tsuruta S, Misztal I. Simple preconditioners for the conjugate gradient method: experience with test day models. J Anim Breed Genet. 2002;119:166–174. doi: 10.1046/j.1439-0388.2002.00327.x. [DOI] [Google Scholar]
  • 15.Masuda Y, Misztal I, Legarra A, Tsuruta S, Fragomeni BO, et al. Lourenco DaL, Technical note: Avoiding the direct inversion of the numerator relationship matrix for genotyped animals in single-step genomic best linear unbiased prediction solved with the preconditioned conjugate gradient. J Anim Sci. 2017;95:49–52. doi: 10.2527/jas.2016.0699. [DOI] [PubMed] [Google Scholar]
  • 16.Misztal I, Gianola D. Indirect solution of mixed model equations. J Dairy Sci. 1988;71:99–106. doi: 10.1016/S0022-0302(88)79987-7. [DOI] [Google Scholar]
  • 17.Schaeffer LR, Kennedy BW. Computing strategies for solving mixed model equations. J Dairy Sci. 1986;69:575–579. doi: 10.3168/jds.S0022-0302(86)80441-6. [DOI] [Google Scholar]
  • 18.Strandén I, Lidauer M. Solving large mixed linear models using preconditioned conjugate gradient iteration. J Dairy Sci. 1999;82:2779–2787. doi: 10.3168/jds.S0022-0302(99)75535-9. [DOI] [PubMed] [Google Scholar]
  • 19.Mrode RA. Linear models for the prediction of animal breeding values. 2. Wallingford: CABI Publishing; 2005. [Google Scholar]
  • 20.Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaSci. 2015;4:7. doi: 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Legarra A, Ducrocq V. Computational strategies for national integration of phenotypic, genomic, and pedigree data in a single-step best linear unbiased prediction. J Dairy Sci. 2012;95:4629–4645. doi: 10.3168/jds.2011-4982. [DOI] [PubMed] [Google Scholar]
  • 22.Henderson CR. A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics. 1976;32:69–83. doi: 10.2307/2529339. [DOI] [Google Scholar]
  • 23.van Pelt ML, de Jong G. Genetic evaluation for direct and maternal livability in The Netherlands. Interbull Bull. 2011;44:235–239. [Google Scholar]
  • 24.CRV Animal Evaluation Unit. Kengetallen, E-23: Fokwaarde levensvatbaarheid bij geboorte - Fokwaarde levensvatbaarheid bij afkalven; 2017. https://cooperatiecrv-be6.kxcdn.com/wp-content/uploads/2019/05/E_23_LIV_20190424.pdf.
  • 25.Hsu WL, Garrick DJ, Fernando RL. The accuracy and bias of single-step genomic prediction for populations under selection. G3. 2017;7:2685–2694. doi: 10.1534/g3.117.043596. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Kaasschieter EF. A practical termination criterion for the conjugate gradient method. BIT Numer Math. 1988;28:308–322. doi: 10.1007/BF01934094. [DOI] [Google Scholar]
  • 27.Frank J, Vuik C. On the construction of deflation-based preconditioners. SIAM J Sci Comput. 2001;23:442–462. doi: 10.1137/S1064827500373231. [DOI] [Google Scholar]
  • 28.Vandenplas J, Eding H, Calus MPL, Vuik C. Deflated preconditioned conjugate gradient method for solving single-step BLUP models efficiently. Genet Sel Evol. 2018;50:51. doi: 10.1186/s12711-018-0429-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Hickey JM, Kinghorn BP, Tier B, van der Werf JH, Cleveland MA. A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation. Genet Sel Evol. 2012;44:9. doi: 10.1186/1297-9686-44-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Mulder HA, Calus MPL, Druet T, Schrooten C. Imputation of genotypes with low-density chips and its effect on reliability of direct genomic values in Dutch Holstein cattle. J Dairy Sci. 2012;95:876–889. doi: 10.3168/jds.2011-4490. [DOI] [PubMed] [Google Scholar]
  • 31.Vandenplas J, Veerkamp RF, Evans RD, Calus MPL, Napel Jt. Single-step evaluation for calving traits with 1.5 million genotypes: SNP-based approaches. In: Proceedings of the 70th Annual Meeting of the European Association for Animal Production: 26–30 August 2019; Ghent; 2019.
  • 32.Taskinen M, Mäntysaari EA, Strandén I. Single-step SNP-BLUP with on-the-fly imputed genotypes and residual polygenic effects. Genet Sel Evol. 2017;49:36. doi: 10.1186/s12711-017-0310-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Walsh B, Lynch M. Evolution and selection of quantitative traits. Oxford: Oxford University Press; 2018. [Google Scholar]
  • 34.Crow JF, Kimura M. An introduction to population genetics theory. New York: Harper and Row; 1970. [Google Scholar]
  • 35.Malécot G. Les mathématiques de l’hérédité. Paris: Masson; 1948. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

12711_2020_543_MOESM1_ESM.pdf (227.9KB, pdf)

Additional file 1. Derivation of formula to approximate diagZAgg-1Z and diag(ZAgn(Ann)-1AngZ) . Derivation of formula to approximate diagZAgg-1Zand diag(ZAgn(Ann)-1AngZ)[3335].

12711_2020_543_MOESM2_ESM.pdf (192.4KB, pdf)

Additional file 2.Description of the system of equations of ssSNPBLUP MS for a bivariate maternal model. Description of the system of equations of ssSNPBLUP MS for a bivariate maternal model, as well as of the different submatrices needed for the proposed computational strategies.

12711_2020_543_MOESM3_ESM.pdf (103.4KB, pdf)

Additional file 3. Derivation of an alternative computation of Agn(Ann)-1AngDerivation of an alternative computation of Agn(Ann)-1Angthat used only the ancestors and the progeny of the genotyped animals, instead of the complete pedigree.


Articles from Genetics, Selection, Evolution : GSE are provided here courtesy of BMC

RESOURCES