Abstract
Diagonal elements of the coefficient matrix are necessary to calculate the genomic prediction accuracy. Here an improved methodology is described, to update the inverse of the coefficient matrix (C) for new individuals with a genotype, with and without phenotypes. Computational performance is significantly improved by re-using parts of the coefficient matrix inverse calculations that do not change from one animal to another, in combination with updated calculations for those that do change. This method expedites calculation of accuracy for new individuals with genotypes, without re-doing the whole population, by using the previously calculated matrices.
Keywords: accuracy, efficient, breeding value, prediction, method
1. Introduction
In the last decade, technological advances have significantly decreased genotyping costs, particularly for agricultural livestock and cropping species. This reduction in costs has enabled regular genomic best linear unbiased prediction (GBLUP) (VanRaden, 2008) analyses for the production of genomic breeding values. Currently, genomic information is routinely used in the Australian beef industry in producing estimated breeding values (EBVs). Low costs and high industry uptake has resulted in a rapid increase in the number of new genotypes and thus the size of the genomic population is growing larger. GBLUP requires inversion of the genomic relationship matrix (G) and the coefficient matrix (C), which is computationally demanding. More efficient methods such as APY (Algorithm for Proven and Young animals) (Misztal, 2016) and PICD (Partial Incomplete Cholesky Decomposition) (Hancock, 2017) can handle large numbers of genotyped animals by approximating the inverse of G only and not the coefficient matrix. However, these approaches do not address the need for diagonal elements of the coefficient matrix inverse (left hand side) required for calculating EBV accuracies. With the increasing speed at which new genotypes are provided, inversion of the coefficient matrix for accuracy calculation is increasingly computationally demanding and time consuming, requiring more efficient methods.
Here we propose a method to calculate the accuracy of new individuals, with and without phenotypes, by updating the coefficient matrix inverse (C−1) for new individuals only, without re-doing the whole population. Using this method, we significantly reduce time and computational demand by updating the accuracy of new individuals and reducing redundancy in the reference population.
2. Methods
2.1. Theory
We consider a simple animal model without fixed effects. This model is
(1) |
where y, Z, u, and e are vector of observations, design matrix, vector of solutions, and vector of random residual effects, respectively. The solutions and residual variances are and . The mixed model equations (MME) for the above model are
(2) |
where C is the coefficient matrix and . Henderson derived a method by using the diagonal of C−1 and the diagonal of G to calculate the accuracy of each estimated breeding value (Henderson, 1975). Accordingly, the accuracy can be calculated with this formula , where cii is the diagonal element of C−1 for individual i, and gii is the diagonal element of G for individual i.
2.1.1. Updating C−1 to Illustrate the Proposed Method
To calculate the accuracy of individuals with or without phenotypes, each individual can be added to C−1 separately. In this case, the partitioned matrix of MME (Equation 2) is
(3) |
where subscript p and q are core individuals forming the reference population and new individuals respectively. New individuals may or may not have phenotypes.
As demonstrated in Equation (2), Z′Z becomes where is a diagonal matrix with dimension equal to the number of core animals in the population, and the in the lower diagonal represents new animals. Since G−1 becomes , and C−1 becomes
(4) |
based on Equations (2) and (3). Inverting C is computationally demanding, as both G and the entire C should be inverted in each analysis for all individuals. needs to be updated as the new individuals are added to G. This can be performed by the method explained in Meyer et al. (2013). However, since we want to know the accuracy, we must invert C as well as G. Equation (4) can be converted with the following inversion lemma which is equivalent to the Woodbury's formula (Henderson, 1963, 1975; Henderson and Searle, 1981):
(5) |
With C we can consider A = Z′Z, B = −I, D−1 = G−1 and E = αI. Thus C−1 is
(6) |
where M−1 as (G + α(Z′Z)−1)−1 is used for simplification and is shown below in Equation (8). For the partitioned matrices in Equation (4), C−1 becomes
(7) |
where ≈ is approximation sign. By using lemma (6) G−1 is not required and we only need to invert the middle matrix (M) in Equation (7). With this simplification M−1 can be updated for each new individual using Cholesky decomposition and multiplying the Cholesky factors, i.e., M−1 = L−TL−1 (Harville, 1997; Meyer et al., 2013).
(8) |
Therefore, Equation (7) can be written as
(9) |
based on Equation (8) and . By multiplying back the Cholesky factors of M the solutions for and are
(10) |
and
(11) |
and Cqq which is the inverse of Cqq becomes
(12) |
(13) |
where “—” is the right division sign (multiplying numerator by inverse of denominator) and so
(14) |
Only , Gpq, Gqq, and change with each new genotyped individual.
For animals without phenotypes is a null matrix and the denominator in equation 14 becomes zero. However, we can assume the limit approach to α as approach to zero. Thus, Cqq is
(15) |
In summary, Equations (14) and (15) can be used to calculate the prediction accuracies of individuals with and without phenotype, respectively.
2.1.2. Updating the M−1 for New Individuals
Based on Equations (14) and (15) only changes (see Equation 7) in order to update the reference population. The updated , i.e., is
(16) |
by regarding previous assumptions and Equation (8). M−1 is the largest matrix that was generated in the previous run and can be compressed and stored in binary format to avoid memory issues. The other matrices were small and can be built efficiently by using optimized Linear Algebra PACKage (LAPACK). The equations (14, 15, and 16) were implemented as an R function (Appendix A) to show the prototype and in C++ with Armadillo library (Sanderson and Curtin, 2016) to assess its performance—single thread.
2.1.3. Simulated Data
Matrices with seven columns representing seven single-nucleotide polymorphisms for each individual and 1000, 2000, 3000, … 24000, and 25000 rows were created and filled with 0 (AA), 1 (AB), and 2 (BB) randomly. The genomic relationship matrices (G) were built by using VanRaden (2008) method 1, with dimension individuals by individuals. Importantly, increasing the number to SNPs does not affect the computational time. These matrices were used to assess the performance of the proposed method to calculate accuracy.
2.1.4. Performance Evaluation
To evaluate performance, each set was run in three steps. In the first step, the elapsed time to build the coefficient matrix by using the classic approach (i.e., inverting G and C) was measured. In the second step, the time to build — initial matrices required to update cqq was measured. In the third step, the time to calculate cqq by using the initial matrices was measured.
3. Results and Discussions
By calculating the accuracy of young individuals using Equations (14) and (15) computational times have been significantly reduced. Computational performance using this method is considerably faster, in comparison with existing methods, as shown in Figure 1, with only negligible differences in accuracy due to rounding errors (less than 8.88 × 10−16). The proposed approach using Z′Z (a diagonal matrix) resulted in shorter time to build the matrices used to update cqq compared to when using the classic approach to calculate accuracies. This method can be extended in order to accommodate fixed effects and dense Z′Z when cqq is updated. Furthermore, the part of C for animals with phenotype (Cpp) must be updated as more individuals are phenotyped.
This method could be exploited within routine breeding value estimation for expidited accuracy calculations. Breeding value accuracy is based on an individual's relatedness to the core reference, such that high accuracy indicates high relatedness. This method to calculate accuracy will affect how the genotypes are used, based on how informative they are for the prediction, improving efficiency by reducing redundant information.
New individuals with phenotypes and low accuracy can be added to the core population, as it is likely these animals are lowly related. Their addition improves the diversity and informativity of the core reference population, and can further improve imputation accuracy of the missing genotypes, with added diversity into the imputation haplotype library. Individuals with high accuracy are not required to be added to the core, with or without phenotype, as their accuracy indicates their relatives are already included in this reference population, making their addition redundant. New individuals without phenotypes and low accuracy, should have relatives genotyped to improve accuracy and/or should have their phenotypes recorded to improve the core population.
It is possible to exploit the accuracy calculation as a type of quality control filter for population data, such that individuals with an expected level of relatedness to the reference population, obtains a low accuracy, this may be indicative of genotyping/sampling error, mis-assigned breed, etc. The rapid accuracy calculation for those individuals without phenotype can provide important context for quickly developing a phenotyping strategy.
4. Conclusion
Updating the inverse of C for new individuals with and without phenotype, using the method here, is shown to reduce the computational effort significantly. With increasing numbers of genotyped animals in genetic evaluations, computational efficiency is essential for frequent and timely evaluations. This method provides an improved and efficient method to deliver accurate and fast evaluations when few new young individuals are genotyped but may or may not have phenotypes.
Author Contributions
MF developed the method, structured the manuscript, and wrote the method and theory. NC wrote the introduction, result and discussion, and performed major revision. BT gave some comments to improve the final method and article.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgments
Parts of this article has been published in proceeding of World Congress on Genetics Applied to Livestock Production (WCGALP)—(Ferdosi et al., 2018). The authors want to thank Daniela A. L. Lourenco and Karin Meyer for their comments on the WCGALP paper. The authors also wish to thank the reviewers for comments leading to the clarification and improvement of this article.
Appendix A
R functions that show the prototype.
# Create a partition matrix from 4 matrices bind <- function(A, B, C, D) { rbind(cbind(A, B), cbind(C, D)) } # Create a zero matrix zM <- function(row, col) { matrix(0, nrow = row, ncol = col) } # Calculate X'X xpx <- function(x) { t(x) %*% x } # Calculate C inverse by using traditional method tradMethod <- function(Zp, Zq, Gpp, Gpq, Gqq, Alpha) { A <- xpx(Zp) D <- xpx(Zq) B <- zM(nrow(A), ncol(D)) C <- t(B) solve(bind(A, B, C, D) + Alpha * solve(bind(Gpp, Gpq, t(Gpq), Gqq))) } # Calculate Cqq when there are phenotypes for new # individuals cqqPhen <- function(Zp, Zq, Gpp, Gpq, Gqq, Alpha) { Minv <- solve(Gpp + Alpha * solve(xpx(Zp))) n1 <- -t(Gpq) %*% Minv %*% Gpq + Gqq n2 <- (Gqq + Alpha * solve(xpx(Zq)) - t(Gpq) %*% Minv %*% Gpq) %*% xpx(Zq) n1 %*% solve(n2) } # Calculate Cqq when there are no phenotypes for new individuals cqq <- function(Zp, Zq, Gpp, Gpq, Gqq, Alpha) { Minv <- solve(Gpp + Alpha * solve(xpx(Zp))) n1 <- -t(Gpq) %*% Minv %*% Gpq + Gqq n1 / Alpha }
Footnotes
Funding. MF, NC, and BT were supported by Meat and Livestock Australia project L.GEN.0174.
References
- Ferdosi M., Connors N., Tier B. (2018). “An efficient method to calculate accuracy of estimated breeding values for individuals without phenotypes,” in Proceedings of the World Congress on Genetics Applied to Livestock Production, Electronic Poster Session - Methods and Tools - Models and Computing Strategies (Auckland: ). [Google Scholar]
- Hancock T. P. (2017). “Approximate gblup for efficient routine evaluations,” in Association for the Advancement of Animal Breeding and Genetics (Townsville, QLD: ). [Google Scholar]
- Harville D. A. (1997). Matrix Algebra From a Statistician's Perspective, Vol. 1. Yorktown Heights, NY: Springer. [Google Scholar]
- Henderson C. R. (1963). Selection index and expected genetic advance. Stat. Genet. Plant Breed. 982, 141–163. [Google Scholar]
- Henderson C. R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics 31, 423–447. [PubMed] [Google Scholar]
- Henderson H. V., Searle S. R. (1981). On deriving the inverse of a sum of matrices. SIAM Rev. 23, 53–60. [Google Scholar]
- Meyer K., Tier B., Graser H.-U. (2013). Updating the inverse of the genomic relationship matrix. J. Anim. Sci. 91, 2583–2586. 10.2527/jas.2012-6056 [DOI] [PubMed] [Google Scholar]
- Misztal I. (2016). Inexpensive computation of the inverse of the genomic relationship matrix in populations with small effective population size. Genetics 202, 401–409. 10.1534/genetics.115.182089 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sanderson C., Curtin R. (2016). Armadillo: a template-based c++ library for linear algebra. J. Open Source Softw. 1:26 10.21105/joss.00026 [DOI] [Google Scholar]
- VanRaden P. M. (2008). Efficient methods to compute genomic predictions. J. Dairy Sci. 91, 4414–4423. 10.3168/jds.2007-0980 [DOI] [PubMed] [Google Scholar]