Skip to main content
Frontiers in Genetics logoLink to Frontiers in Genetics
. 2019 Jun 25;10:596. doi: 10.3389/fgene.2019.00596

An Efficient Method to Calculate Genomic Prediction Accuracy for New Individuals

Mohammad H Ferdosi 1,*, Natalie K Connors 1, Bruce Tier 1
PMCID: PMC6603240  PMID: 31293622

Abstract

Diagonal elements of the coefficient matrix are necessary to calculate the genomic prediction accuracy. Here an improved methodology is described, to update the inverse of the coefficient matrix (C) for new individuals with a genotype, with and without phenotypes. Computational performance is significantly improved by re-using parts of the coefficient matrix inverse calculations that do not change from one animal to another, in combination with updated calculations for those that do change. This method expedites calculation of accuracy for new individuals with genotypes, without re-doing the whole population, by using the previously calculated matrices.

Keywords: accuracy, efficient, breeding value, prediction, method

1. Introduction

In the last decade, technological advances have significantly decreased genotyping costs, particularly for agricultural livestock and cropping species. This reduction in costs has enabled regular genomic best linear unbiased prediction (GBLUP) (VanRaden, 2008) analyses for the production of genomic breeding values. Currently, genomic information is routinely used in the Australian beef industry in producing estimated breeding values (EBVs). Low costs and high industry uptake has resulted in a rapid increase in the number of new genotypes and thus the size of the genomic population is growing larger. GBLUP requires inversion of the genomic relationship matrix (G) and the coefficient matrix (C), which is computationally demanding. More efficient methods such as APY (Algorithm for Proven and Young animals) (Misztal, 2016) and PICD (Partial Incomplete Cholesky Decomposition) (Hancock, 2017) can handle large numbers of genotyped animals by approximating the inverse of G only and not the coefficient matrix. However, these approaches do not address the need for diagonal elements of the coefficient matrix inverse (left hand side) required for calculating EBV accuracies. With the increasing speed at which new genotypes are provided, inversion of the coefficient matrix for accuracy calculation is increasingly computationally demanding and time consuming, requiring more efficient methods.

Here we propose a method to calculate the accuracy of new individuals, with and without phenotypes, by updating the coefficient matrix inverse (C−1) for new individuals only, without re-doing the whole population. Using this method, we significantly reduce time and computational demand by updating the accuracy of new individuals and reducing redundancy in the reference population.

2. Methods

2.1. Theory

We consider a simple animal model without fixed effects. This model is

y=Zu+e (1)

where y, Z, u, and e are vector of observations, design matrix, vector of solutions, and vector of random residual effects, respectively. The solutions and residual variances are var(u)=Gσu2 and var(e)=Iσe2. The mixed model equations (MME) for the above model are

Cu=[ZZ+αG1]u=r, (2)

where C is the coefficient matrix and α=σe2σu2. Henderson derived a method by using the diagonal of C−1 and the diagonal of G to calculate the accuracy of each estimated breeding value (Henderson, 1975). Accordingly, the accuracy can be calculated with this formula 1-αciigii, where cii is the diagonal element of C−1 for individual i, and gii is the diagonal element of G for individual i.

2.1.1. Updating C−1 to Illustrate the Proposed Method

To calculate the accuracy of individuals with or without phenotypes, each individual can be added to C−1 separately. In this case, the partitioned matrix of MME (Equation 2) is

[CppCpqCpqCqq][upuq]=[rprq] (3)

where subscript p and q are core individuals forming the reference population and new individuals respectively. New individuals may or may not have phenotypes.

As demonstrated in Equation (2), Z′Z becomes ((ZZ)pp00(ZZ)qq) where (ZZ)pp is a diagonal matrix with dimension equal to the number of core animals in the population, and the (ZZ)qq in the lower diagonal represents new animals. Since G−1 becomes (GppGpqGpqGqq)-1, and C−1 becomes

[ZZ+αG1]1                          ([(ZZ)pp00(ZZ)qq]+α[GppGpqGpqGqq]1)1 (4)

based on Equations (2) and (3). Inverting C is computationally demanding, as both G and the entire C should be inverted in each analysis for all individuals. Gpp-1 needs to be updated as the new individuals are added to G. This can be performed by the method explained in Meyer et al. (2013). However, since we want to know the accuracy, we must invert C as well as G. Equation (4) can be converted with the following inversion lemma which is equivalent to the Woodbury's formula (Henderson, 1963, 1975; Henderson and Searle, 1981):

(ABD1E)1=E1D(DEA1B)1EA1. (5)

With C we can consider A = Z′Z, B = −I, D−1 = G−1 and E = αI. Thus C−1 is

(ZZ+αG1)1=α1G(G+α(ZZ)1)1α(ZZ)1                                                                               =GM1(ZZ)1 (6)

where M−1 as (G + α(Z′Z)−1)−1 is used for simplification and is shown below in Equation (8). For the partitioned matrices in Equation (4), C−1 becomes

([(ZZ)pp00(ZZ)qq]+α[GppGpqGpqGqq]1)1[GppGpqGpqGqq]([GppGpqGpqGqq]+α[(ZZ)pp100(ZZ)qq1])1                                    [(ZZ)pp100(ZZ)qq1][GppGpqGpqGqq][Gpp+α(ZZ)pp1GpqGpqGqq+α(ZZ)qq1]1                                                                    [(ZZ)pp100(ZZ)qq1], (7)

where ≈ is approximation sign. By using lemma (6) G−1 is not required and we only need to invert the middle matrix (M) in Equation (7). With this simplification M−1 can be updated for each new individual using Cholesky decomposition and multiplying the Cholesky factors, i.e., M−1 = LTL−1 (Harville, 1997; Meyer et al., 2013).

M1=[LppTLpp1+LppTLqpLqqTLqq1LqpLpp1LppTLpqLqqTLqq1LqqTLqq1LqpLpp1LqqTLqq1]                                                                                                  =[S1S2S3S4]. (8)

Therefore, Equation (7) can be written as

  [CppCpqCpqCqq]1=[GppGpqGpqGqq] [S1S2S3S4]  [(ZZ)pp100(ZZ)qq1]       =[GppS1+GpqS3GppS2+GpqS4GpqS1+GqqS3GpqS2+GqqS4][(ZZ)pp100(ZZ)qq1]=[(GppS1+GpqS3)(ZZ)pp1(GppS2+GpqS4)(ZZ)qq1(GpqS1+GqqS3)(ZZ)pp1(GpqS2+GqqS4)(ZZ)qq1], (9)

based on Equation (8) S2=-Lpp-TLqpLqq-TLqq-1 and S4=Lqq-TLqq-1. By multiplying back the Cholesky factors of M the solutions for Lpq and LqqLqqT are

Lqp=Lpp1Gpq (10)

and

LqqLqqT=Gqq+α(ZZ)qq1Gpq(Gpp+α(ZZ)pp1)1Gpq, (11)

and Cqq which is the inverse of Cqq becomes

Cqq=(Gpq(LppTLqpLqqTLqq1)+Gqq(LqqTLqq1))(ZZ)qq1                    =(Gpq(LppTLpqS4)+GqqS4)(ZZ)qq1                     =S4(Gpq(LppTLqp)+Gqq)(ZZ)qq1=GpqLppTLqp+Gqq(Gqq+α(ZZ)qq1Gpq(Gpp+α(ZZ)pp1)1Gpq)(ZZ)qq=GpqLppTLpp1Gpq+Gqq(Gqq+α(ZZ)qq1Gpq(Gpp+α(ZZ)pp1)1Gpq)(ZZ)qq (12)
Cqq=GpqLppTLpp1Gpq+Gqq(Gqq+α(ZZ)qq1Gpq(Gpp+α(ZZ)pp1)1Gpq)(ZZ)qq (13)

where “—” is the right division sign (multiplying numerator by inverse of denominator) and Lpp-TLpp-1=(Gpp+α(ZZ)pp-1)-1 so

Cqq=Gpq(Gpp+α(ZZ)pp1)1Gpq+Gqq(Gqq+α(ZZ)qq1Gpq(Gpp+α(ZZ)pp1)1Gpq)(ZZ)qq (14)

Only Gpq, Gpq, Gqq, and (ZZ)qq change with each new genotyped individual.

For animals without phenotypes (ZZ)qq is a null matrix and the denominator in equation 14 becomes zero. However, we can assume the limit approach to α as (ZZ)qq approach to zero. Thus, Cqq is

lim(ZZ)qq0Gpq(Gpp+α(ZZ)pp1)1Gpq+Gqq(Gqq+α(ZZ)qq1Gpq(Gpp+α(ZZ)pp1)1Gpq)(ZZ)qq                              =Gpq(Gpp+α(ZZ)pp1)1Gpq+Gqqα (15)

In summary, Equations (14) and (15) can be used to calculate the prediction accuracies of individuals with and without phenotype, respectively.

2.1.2. Updating the M−1 for New Individuals

Based on Equations (14) and (15) only (Gpp+α(ZZ)pp-1)-1=Mpp-1 changes (see Equation 7) in order to update the reference population. The updated Mpp-1, i.e., Mppnew-1 is

Mppnew1= [Sn1Sn2Sn3Sn4]=[M1S2GpqM1M1GpqS4Sn2(Gqq+α(ZZ)qq1GpqM1Gpq)1] (16)

by regarding previous assumptions and Equation (8). M−1 is the largest matrix that was generated in the previous run and can be compressed and stored in binary format to avoid memory issues. The other matrices were small and can be built efficiently by using optimized Linear Algebra PACKage (LAPACK). The equations (14, 15, and 16) were implemented as an R function (Appendix A) to show the prototype and in C++ with Armadillo library (Sanderson and Curtin, 2016) to assess its performance—single thread.

2.1.3. Simulated Data

Matrices with seven columns representing seven single-nucleotide polymorphisms for each individual and 1000, 2000, 3000, … 24000, and 25000 rows were created and filled with 0 (AA), 1 (AB), and 2 (BB) randomly. The genomic relationship matrices (G) were built by using VanRaden (2008) method 1, with dimension individuals by individuals. Importantly, increasing the number to SNPs does not affect the computational time. These matrices were used to assess the performance of the proposed method to calculate accuracy.

2.1.4. Performance Evaluation

To evaluate performance, each set was run in three steps. In the first step, the elapsed time to build the coefficient matrix by using the classic approach (i.e., inverting G and C) was measured. In the second step, the time to build (Gpp+α(ZZ)pp-1)-1— initial matrices required to update cqq was measured. In the third step, the time to calculate cqq by using the initial matrices was measured.

3. Results and Discussions

By calculating the accuracy of young individuals using Equations (14) and (15) computational times have been significantly reduced. Computational performance using this method is considerably faster, in comparison with existing methods, as shown in Figure 1, with only negligible differences in accuracy due to rounding errors (less than 8.88 × 10−16). The proposed approach using ZZ (a diagonal matrix) resulted in shorter time to build the matrices used to update cqq compared to when using the classic approach to calculate accuracies. This method can be extended in order to accommodate fixed effects and dense Z′Z when cqq is updated. Furthermore, the part of C for animals with phenotype (Cpp) must be updated as more individuals are phenotyped.

Figure 1.

Figure 1

The graph shows the elapsed time required to calculate cqq using different approaches. inv M is the elapsed time to calculate the M matrix which is the core element to calculate cqq for a new individual. Update Reference is the elapse time to update M matrix, the cqq with and without phenotype shows the elapsed time to calculate cqq when there is or there is not phenotype for the individual, respectively. Their performances were very similar, and as such the lines overlap.

This method could be exploited within routine breeding value estimation for expidited accuracy calculations. Breeding value accuracy is based on an individual's relatedness to the core reference, such that high accuracy indicates high relatedness. This method to calculate accuracy will affect how the genotypes are used, based on how informative they are for the prediction, improving efficiency by reducing redundant information.

New individuals with phenotypes and low accuracy can be added to the core population, as it is likely these animals are lowly related. Their addition improves the diversity and informativity of the core reference population, and can further improve imputation accuracy of the missing genotypes, with added diversity into the imputation haplotype library. Individuals with high accuracy are not required to be added to the core, with or without phenotype, as their accuracy indicates their relatives are already included in this reference population, making their addition redundant. New individuals without phenotypes and low accuracy, should have relatives genotyped to improve accuracy and/or should have their phenotypes recorded to improve the core population.

It is possible to exploit the accuracy calculation as a type of quality control filter for population data, such that individuals with an expected level of relatedness to the reference population, obtains a low accuracy, this may be indicative of genotyping/sampling error, mis-assigned breed, etc. The rapid accuracy calculation for those individuals without phenotype can provide important context for quickly developing a phenotyping strategy.

4. Conclusion

Updating the inverse of C for new individuals with and without phenotype, using the method here, is shown to reduce the computational effort significantly. With increasing numbers of genotyped animals in genetic evaluations, computational efficiency is essential for frequent and timely evaluations. This method provides an improved and efficient method to deliver accurate and fast evaluations when few new young individuals are genotyped but may or may not have phenotypes.

Author Contributions

MF developed the method, structured the manuscript, and wrote the method and theory. NC wrote the introduction, result and discussion, and performed major revision. BT gave some comments to improve the final method and article.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

Parts of this article has been published in proceeding of World Congress on Genetics Applied to Livestock Production (WCGALP)—(Ferdosi et al., 2018). The authors want to thank Daniela A. L. Lourenco and Karin Meyer for their comments on the WCGALP paper. The authors also wish to thank the reviewers for comments leading to the clarification and improvement of this article.

Appendix A

R functions that show the prototype.

 
# Create a partition matrix from 4 matrices
bind  <- function(A, B, C, D)
{
  rbind(cbind(A, B), cbind(C, D))
}
 
# Create a zero matrix
zM  <- function(row, col)
{
  matrix(0, nrow = row, ncol = col)
}
 
# Calculate X'X
xpx  <- function(x)
{
  t(x) %*% x
}
 
# Calculate C inverse by using traditional method
tradMethod  <- function(Zp, Zq, Gpp, Gpq, Gqq, Alpha)
{
  A  <- xpx(Zp)
  D  <- xpx(Zq)
  B  <- zM(nrow(A), ncol(D))
  C  <- t(B)
  solve(bind(A, B, C, D) + Alpha *
  solve(bind(Gpp, Gpq, t(Gpq), Gqq)))
}
 
# Calculate Cqq when there are phenotypes for new
# individuals
cqqPhen  <- function(Zp, Zq, Gpp, Gpq, Gqq, Alpha)
{
  Minv  <- solve(Gpp + Alpha * solve(xpx(Zp)))
  n1  <- -t(Gpq) %*% Minv %*% Gpq + Gqq
  n2  <- (Gqq + Alpha * solve(xpx(Zq)) - t(Gpq) %*%
  Minv %*% Gpq) %*% xpx(Zq)
  n1 %*% solve(n2)
}
 
# Calculate Cqq when there are no phenotypes for
new individuals
cqq  <- function(Zp, Zq, Gpp, Gpq, Gqq, Alpha)
{
  Minv  <- solve(Gpp + Alpha * solve(xpx(Zp)))
  n1  <- -t(Gpq) %*% Minv %*% Gpq + Gqq
  n1 / Alpha
}
 

Footnotes

Funding. MF, NC, and BT were supported by Meat and Livestock Australia project L.GEN.0174.

References

  1. Ferdosi M., Connors N., Tier B. (2018). “An efficient method to calculate accuracy of estimated breeding values for individuals without phenotypes,” in Proceedings of the World Congress on Genetics Applied to Livestock Production, Electronic Poster Session - Methods and Tools - Models and Computing Strategies (Auckland: ). [Google Scholar]
  2. Hancock T. P. (2017). “Approximate gblup for efficient routine evaluations,” in Association for the Advancement of Animal Breeding and Genetics (Townsville, QLD: ). [Google Scholar]
  3. Harville D. A. (1997). Matrix Algebra From a Statistician's Perspective, Vol. 1. Yorktown Heights, NY: Springer. [Google Scholar]
  4. Henderson C. R. (1963). Selection index and expected genetic advance. Stat. Genet. Plant Breed. 982, 141–163. [Google Scholar]
  5. Henderson C. R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics 31, 423–447. [PubMed] [Google Scholar]
  6. Henderson H. V., Searle S. R. (1981). On deriving the inverse of a sum of matrices. SIAM Rev. 23, 53–60. [Google Scholar]
  7. Meyer K., Tier B., Graser H.-U. (2013). Updating the inverse of the genomic relationship matrix. J. Anim. Sci. 91, 2583–2586. 10.2527/jas.2012-6056 [DOI] [PubMed] [Google Scholar]
  8. Misztal I. (2016). Inexpensive computation of the inverse of the genomic relationship matrix in populations with small effective population size. Genetics 202, 401–409. 10.1534/genetics.115.182089 [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Sanderson C., Curtin R. (2016). Armadillo: a template-based c++ library for linear algebra. J. Open Source Softw. 1:26 10.21105/joss.00026 [DOI] [Google Scholar]
  10. VanRaden P. M. (2008). Efficient methods to compute genomic predictions. J. Dairy Sci. 91, 4414–4423. 10.3168/jds.2007-0980 [DOI] [PubMed] [Google Scholar]

Articles from Frontiers in Genetics are provided here courtesy of Frontiers Media SA

RESOURCES