Abstract
Best linear unbiased prediction (BLUP) has been used to estimate the fixed effects and random effects of complex traits. Traditionally, genomic relationship matrix-based (GRM) and random marker-based BLUP analyses are prevalent to estimate the genetic values of complex traits. We used three methods: GRM-based prediction (G-BLUP), random marker-based prediction using an identity matrix (so-called single-nucleotide polymorphism [SNP]-BLUP), and SNP-SNP variance-covariance matrix (so-called SNP-GBLUP). We used 35,675 SNPs and R package "rrBLUP" for the BLUP analysis. The SNP-SNP relationship matrix was calculated using the GRM and Sherman-Morrison-Woodbury lemma. The SNP-GBLUP result was very similar to G-BLUP in the prediction of genetic values. However, there were many discrepancies between SNP-BLUP and the other two BLUPs. SNP-GBLUP has the merit to be able to predict genetic values through SNP effects.
Keywords: best linear unbiased estimation (BLUE), best linear unbiased prediction (BLUP), SNP genomic best linear unbiased prediction (SNP-GBLUP), SNP-SNP relationship matrix
Introduction
Many important human traits are moderately to highly heritable [1]. Yang et al. [2] found that roughly 45% of the genetic variance of human height can be explained by the regression on common single-nucleotide polymorphisms (SNPs). So-called dark matter of the genome in human height complex traits, especially, can be explained by the existence of gene-by-gene or gene-by-environment interactions [3,4,5]. Although the causal variants are not in complete linkage disequilibrium (LD) with the SNPs and although SNP-SNP interaction effects do exist, the traditional independent and identically distributed (IID) assumption has been used for so-called SNP-best linear unbiased predictor (SNP-BLUP). We calculated an SNP-SNP relationship matrix from a genomic relationship matrix (GRM) for best linear unbiased prediction (BLUP) analysis. This usage is called SNP-genomic linear unbiased prediction (SNP-GBLUP). The prediction of SNP effects is the best feature of this BLUP. This fact can help BLUP scientists predict the genetic values of other sample groups. Also, through the SNP-SNP relationship matrix, scientists can view the SNP-SNP interaction via the covariance terms of that matrix.
We used the BLUP to estimate the numerical values of genetic factors that reside in an individual's DNA information. The expression of diverse and complicated genetic factors can lead to phenotypic quantitative values, and the genetic factors can translate into numerical values. SNPs can be the representatives of genetic factors. Therefore, the invisible but real SNP effects can be translated into numerical values.
BLUP, including best linear unbiased estimation (BLUE), is a standard method for predicting the random effects and fixed effects of a mixed model [6]. In general, random effects can be SNP effects, and fixed effects can be factors, such as sex. This method was originally designed in the field of animal breeding for estimating breeding values but is now applicable to many areas of research [6]. In particular, the advent of DNA sequencing technology has made it possible for the application of the BLUP-BLUE mixed model to predict SNP effects. The BLUP method, originally developed by Henderson, which is now applicable to animals, plants, and humans, uses statistical regression and exploits fixed effects as well as random effects. It is a general model that is able to find polygenic effects. In fact, polygenic effects are the replacement of quantitative trait loci (QTL) effects or randomly distributed genomewide effects. The mixed linear model, including BLUP and BLUE, is in the form:
y = Xb + Zu + e, | (1) |
where y is the vector of phenotypic values of given quantitative traits and b and u are vectors of fixed and random effects, respectively. X and Z are the design matrices. The random effects and residual error effects are assumed to have a normal distribution as u~MVN(0,Gu) and e~MVN(0,R), where MVN indicates a multivariate normal distribution and E(y) = Xb, cov(u, e') = 0. The general solution using the maximum likelihood (ML) method, which shows the effects of b and u, is in the form of:
(2) |
Here, Gu-1 is the inverse matrix of the variance-covariance matrix of random effects, which indicates the SNP-SNP variance-covariance matrix. The GRM, G, can be calculated using R package "rrBLUP." For the feasible and precise prediction of the breeding value of organisms, genome-wide studies are quickly becoming mainstream. SNP information can show us the precise and broad application of BLUP, because it is widely applicable, and hundreds of thousands of genetic markers (SNPs) can be tested for association with a phenotype [7].
We used three methods: GRM-based BLUP (G-BLUP), SNP-BLUP, and SNP-GBLUP. G-BLUP is a GRM-based method. SNP-BLUP is assumed to be IID between SNPs [8, 9]. SNP-GBLUP exploits the SNP-SNP relationship matrix. We found out the calculation method of the SNP-SNP variance covariance matrix using the GRM G [9, 10, 11]. The tabular method, to calculate the Gu matrix, developed by Fernando and Grossman [12], uses the IBD concept, which is complicated and vulnerable to incorrect calculations in organisms. There is no R package available in Gu matrix calculation (http://www.r-project.org). However, the G matrix can be calculated from R codes [12, 13]. We proposed a simple method for calculating the Gu matrix from the G matrix. The SNP-SNP relationship matrix means the variance-covariance matrix between SNPs.
Height in humans is a classical quantitative trait, is easy to measure, and has high heritability. The heritability of height has been estimated to be ~0.8 [14, 15, 16, 17, 18]. We analyzed human height using the BLUP method. There were 997 individuals and 35,675 total SNPs in the analysis. Despite the use of many SNPs, we did not achieve satisfactory heritability results (about 30% of 0.8 total heritability). However, it was consistent with the results, which showed that 45% of height variance can be explained by considering all SNPs simultaneously.
Methods
Data preparation
We used the Ansan-Anseong cohort dataset in Korea. This dataset was established for a Korean chronic diseases study, with Ansan and Anseong representing urban and rural areas of Korea, respectively. The subjects were men between 40-69 years of age who had been residents of the region for at least 6 months. The basic survey was conducted from 2000 to 2001, and our study was based on the 3rd Ansan-Anseong cohort dataset version 2.1. We chose height as the phenotypic data and sex as a fixed effect. The SNPs of the cohort dataset were implemented using Affymetrix Genome-wide Human SNP Array 5.0 (Affymetrix, Santa Clara, CA, USA). The mean call rate was 99.01%, and the genetic analysis result, proved by SNPstream UHT 12 plex, was 99.934%. The total number of genotyped SNPs was 352,228, and they were filtered using the conditions of minor allele frequency (MAF; <0.01), Hardy-Weinberg equilibrium (<0.0001), and missing genotyping (missing >0.2). The SNPs were pruned using PLINK (http://pngu.mgh.harvard.edu/~purcell/plink), leaving 35,675 SNPs for the analysis of human height; 997 individuals were involved in the analysis.
The GRM was calculated by using R package "rrBLUP" with the option "Expectation-Maximization (EM) algorithm." Then, using the restricted maximum likelihood method (REML) of the same packages, we calculated the SNP effects, genetic values, error variance, and genetic value variance. The EM imputation algorithm was used for the GRM, because we dealt with high-density SNPs. The REML method is used in a rather small size sample instead of the ML method.
The method of statistical SNP-SNP variance-covariance matrix calculation
From the regression Eq. (1), Henderson [19] used the ML method to derive the BLUP equation, which is in the form:
f(y, u) = g(y|u)h(u) = g(Xb + Zu + e|u)h(u) = g(e)m(u). | (3) |
Then, we used Zu as the independent variable instead of u, which denotes the breeding value or genetic value Zu for the derivation of BLUP. We assumed the normality condition of the genetic value Zu and random effect u. Based on these assumptions, the variance-covariance matrix of genetic values, Zu, is the GRM, G.
f(y, Zu) = g(y|Zu)h(Zu) = g(e)m(Zu) = g(e)k(u) | (4) |
From Eqs. (3) and (4),
(5) |
(6) |
Eqs. (5) and (6) represent the probability density function (pdf) of the BLUP model (Eq. 1). Eqs. (5) and (6) are equivalent, and we can easily know that Gu-1 = ZTG-1Z. This important relationship links the SNP-SNP variance covariance matrix Gu to the GRM G.
The generalized least squares for the BLUP
We used the "rrBLUP" package for calculating the G matrix and predicting the genetic values of men's height complex traits. "rrBLUP" uses a practically generalized least squares (GLS) solution for the BLUP. For the G-BLUP, we used the G matrix, which was calculated from the R "rrBLUP" package using information on 35,675 SNPs. For the random effects estimation, such as SNP effects, we used two methods. One was SNP-BLUP, which is assumed to be IID between SNPs, and the other was SNP-GBLUP, which uses the statistical SNP-SNP variance covariance matrix. We will call this the SNP-SNP relationship matrix.
(7) |
(8) |
The Sherman-Morrison-Woodbury lemma
From the relationship Gu-1 = ZTG-1Z, we should calculate the inverse matrix of the Gu-1 matrix. Because a lot of time was needed for calculating the inverse matrix directly, we used the Sherman-Morrison-Woodbury (SMW) lemma [20, 21, 22, 23]. The formula was as follows:
(A + YGZ*)-1 = A-1 - A-1Y(G-1 + Z*A-1Y)-1Z*A-1, | (9) |
where A and G are both invertible, and A + YGZ* are invertible if and only if G-1 + Z*A-1 Y are invertible. Practically, to reduce the error, we used A as the identity matrix (I matrix), and the practically used formula was as follows:
(10) |
Calculating the SNP-SNP relationship matrix Gu using the lemma was faster than the direct calculation, because we used larger SNPs than the sample size.
Results
Genetic value prediction and SNP-GBLUP
The result of the genetic value prediction is shown in Fig. 1 (Fig. 1A, G-BLUP; Fig. 1B, SNP-BLUP; Fig. 1C, SNP-GBLUP). These results show the genetic value variances are nearly identical between G-BLUP and SNP-GBLUP. However, SNP-BLUP is very disparate from the other two BLUPs. Fig. 2 shows the histogram of SNP random effects (Fig. 2A, SNP-BLUP; Fig. 2B, SNP-GBLUP). Both cases indicate that the SNP effects in both BLUPs are distributed normally and that they suit the normal assumption of SNP effects. However, the result of the genetic values was very different in both cases, as shown in Fig. 3 and Supplementary Table 1. Supplementary Table 1 shows the height phenotype and the genetic value prediction of the three BLUPs. G-BLUP and SNP-GBLUP showed identical results. The estimated residual error variance was 21.39 in the three BLUP cases. The fixed effect was 166.6 (men) and 153.3 (women) in both G-BLUP and SNP-GBLUP and 167.4 (men) and 154.1 (women) in SNP-BLUP.
SNP-SNP relationship matrix
Supplementary Table 2 shows the calculated 1-8th SNP components of the Gu square matrix obtained using the SMW lemma. The variance components (diagonal parts) were the values close to 1. It suggests that the Gu matrix can be interpreted as the SNP-SNP relationship matrix. The covariance components can be interpreted as the SNP-SNP relationships or, in other words, SNP-SNP interaction terms, because the variance components represent the SNPs' relationships themselves.
Estimated heritability
We estimated the narrow-sense heritability (h2) using simple regression between the genetic values and the height phenotypic values. The heritability was 0.24 in G-BLUP and SNP-GBLUP and 0.20 in SNP-BLUP. According to Yang et al.'s article [2], only 45% of variance can be explained by total SNPs. Therefore, our estimated heritability was not poor, and of the generally accepted narrow-sense heritability of 0.8, we explained 33% in G-BLUP and SNP-GBLUP. We explained 53% of the heritability when using total SNPs. Also, we found the method that is widely applicable, SNP-GBLUP, because it can predict the SNP effects. It is impossible in G-BLUP, and Gu matrix components can be used widely, such as in genome-wide association studies (GWASs), because they can be interpreted as representing the interaction terms between SNPs.
Discussion
SNP-GBLUP and its applicability
The G-BLUP uses the GRM. This matrix contains the information between individuals. However, SNP-BLUP and SNP-GBLUP use the direct information between SNPs. SNP-BLUP is assumed to be IID between SNPs. This IID assumption is good but not accurate, because it ignores the interaction terms between SNPs. Instead, SNP-GBLUP uses the covariance structures directly. This BLUP is more proximal to actuality than SNP-BLUP.
The covariance terms of the SNP-SNP relationship matrix can be used as interaction terms in phenotype-related analysis. It can provide the scientist with clues for SNP-SNP interactions. The covariance terms of it are independent of the phenotypic values, because they came from the Z matrix and GRM. However, after proper constant multiplication, as in BLUP, they may be interpreted as interaction terms between SNPs.
Probably, the SNP effects through SNP-GBLUP rather than SNP-BLUP can be used in Bayesian BLUP. Bayesian BLUP excludes the low or zero effects of SNPs. The accurately calculated SNP effects through SNP-GBLUP can be classified into low-effect SNPs and high-effect SNPs as absolute values. Low-effect SNPs can be excluded, and Bayesian BLUP can exploit only high-effect SNPs [24, 25, 26].
Heritability and the GLS approach in BLUP
Narrow-sense heritability (h2) reflects the additive effects of QTL. The estimated heritability from our data was smaller than the generally accepted heritability. This is because causal variants were not in complete LD with the SNPs that were genotyped. Incomplete LD might occur if causal variants have a lower MAF than genotyped. The effects of the SNPs are treated statistically as random, and the SNPs have a small effect on the trait [2]. However, we achieved better heritability than the predicted value in common GWASs-i.e., only ~5% of the phenotypic variance in human height [27].
The present studies aim at predicting genetic values using the covariance structures between SNPs via GRM [28]. The numerator relationship matrix can be replaced by a genome-based GRM [29]. This G-BLUP, using GRM, uses information of SNPs [30]. We used three methods-particularly, the method of normal distribution-based SNP-SNP covariance structures. These structures were constructed using GRM and the Z design matrix. Algorithmically, the calculation of GRM uses the frequency of genotyped SNPs and the Z matrix. Also, in the previous study, the SNP-SNP covariance structures assumed that the identity matrix was multiplied by proper constants. However, we calculated the SNP-SNP relationship matrix using GRM and the Z matrix. This square matrix contains the information of relations between SNPs and can make it possible to estimate the SNP effects. Also, adequate use of the SMW lemma could make it plausible to estimate the SNP effects and genetic values simultaneously with faster computation time [21]. We expect that SNP-SNP covariance structures can make it feasible for calculating the genetic values and SNP effects that are linked to the QTL more accurately.
Genetic value estimation combines the performance and kinship information, which is based on a known pedigree [31]. The accuracy of genomic prediction depends on many factors, such as the genetic architecture of the trait and the population, the methodology of estimating the SNP effects, the distribution of the markers, and the degree of LD [32]. These can be important for accurate BLUP analysis, and there can be two approaches for solving BLUP. The GLS approach and Henderson's ML-based approach lead to identical solutions. The "rrBLUP" package uses the GLS approach. In that package, when the GRM is calculated, the imputation uses the mean method or EM algorithm. Especially, the EM algorithm can be used for haplotype inferences with SNPs. The EM algorithm is the most popular statistical algorithm because of its interpretability and stability. This algorithm is a deterministic procedure, consumes less computation time, and is easier for convergence checks [33]. The REML method is used in BLUP analysis with a small sample size. The use of REML for variance component estimation avoids the small sample bias associated with fixed effects and seems to avoid selection bias. Henderson [34] asserted the use of REML in BLUP models [35, 36]. Thus, we used the EM algorithm and REML method schemes.
Acknowledgments
We thank Dong-Hyun Shin and Min-Su Park for their help and Professor Hee-Seok Oh for helpful advice. The work was supported by the Project (PJ009260) of the National Livestock Institute of the Rural Development Administration, Republic of Korea. And this research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Ministry of Science, ICT & Future Planning (M3A9D1054622). We are also grateful to the provider of the KARE3 Ansan-Ansung cohort data.
Supplementary materials
Supplementary data including two tables can be found with this article at http://www.genominfo.org/src/sm/gni-12-254-s001.pdf.
References
- 1.de Los Campos G, Vazquez AI, Fernando R, Klimentidis YC, Sorensen D. Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS Genet. 2013;9:e1003608. doi: 10.1371/journal.pgen.1003608. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42:565–569. doi: 10.1038/ng.608. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Maher B. Personal genomes: the case of the missing heritability. Nature. 2008;456:18–21. doi: 10.1038/456018a. [DOI] [PubMed] [Google Scholar]
- 4.Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Frazer KA, Murray SS, Schork NJ, Topol EJ. Human genetic variation and its contribution to complex traits. Nat Rev Genet. 2009;10:241–251. doi: 10.1038/nrg2554. [DOI] [PubMed] [Google Scholar]
- 6.Piepho HP, Möhring J, Melchinger AE, Büchse A. BLUP for phenotypic selection in plant breeding and variety testing. Euphytica. 2008;161:209–228. [Google Scholar]
- 7.Goddard ME, Wray NR, Verbyla K, Visscher PM. Estimating effects and making predictions from genome-wide marker data. Stat Sci. 2009;24:517–529. [Google Scholar]
- 8.Shen X, Alam M, Fikse F, Rönnegård L. A novel generalized ridge regression method for quantitative genetics. Genetics. 2013;193:1255–1268. doi: 10.1534/genetics.112.146720. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Koivula M, Strandén I, Su G, Mäntysaari EA. Different methods to calculate genomic predictions: comparisons of BLUP at the single nucleotide polymorphism level (SNP-BLUP), BLUP at the individual level (G-BLUP), and the one-step approach (H-BLUP) J Dairy Sci. 2012;95:4065–4073. doi: 10.3168/jds.2011-4874. [DOI] [PubMed] [Google Scholar]
- 10.Goddard ME, Hayes BJ, Meuwissen TH. Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet. 2011;128:409–421. doi: 10.1111/j.1439-0388.2011.00964.x. [DOI] [PubMed] [Google Scholar]
- 11.Legarra A, Aguilar I, Misztal I. A relationship matrix including full pedigree and genomic information. J Dairy Sci. 2009;92:4656–4663. doi: 10.3168/jds.2009-2061. [DOI] [PubMed] [Google Scholar]
- 12.Fernando RL, Grossman M. Marker assisted selection using best linear unbiased prediction. Genet Sel Evol. 1989;21:467–477. [Google Scholar]
- 13.Wang T, Fernando RL, van der Beek S, Grossman M, van Arendonk JA. Covariance between relatives for a marked quantitative trait locus. Genet Sel Evol. 1995;27:251–274. [Google Scholar]
- 14.Fisher RA. XV. The correlation between relatives on the supposition of Mendelian inheritance. Trans R Soc Edinb. 1918;52:399–433. [Google Scholar]
- 15.Galton F. Hereditary stature. Nature. 1886;33:317. [Google Scholar]
- 16.Silventoinen K, Sammalisto S, Perola M, Boomsma DI, Cornes BK, Davis C, et al. Heritability of adult body height: a comparative study of twin cohorts in eight countries. Twin Res. 2003;6:399–408. doi: 10.1375/136905203770326402. [DOI] [PubMed] [Google Scholar]
- 17.Visscher PM, Hill WG, Wray NR. Heritability in the genomics era: concepts and misconceptions. Nat Rev Genet. 2008;9:255–266. doi: 10.1038/nrg2322. [DOI] [PubMed] [Google Scholar]
- 18.Macgregor S, Cornes BK, Martin NG, Visscher PM. Bias, precision and heritability of self-reported and clinically measured height in Australian twins. Hum Genet. 2006;120:571–580. doi: 10.1007/s00439-006-0240-z. [DOI] [PubMed] [Google Scholar]
- 19.Henderson CR. Best linear unbiased estimation and prediction under a selection model. Biometrics. 1975;31:423–447. [PubMed] [Google Scholar]
- 20.Deng CY. A generalization of the Sherman-Morrison-Woodbury formula. Appl Math Lett. 2011;24:1561–1564. [Google Scholar]
- 21.Sherman J, Morrison WJ. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann Math Stat. 1950;21:124–127. [Google Scholar]
- 22.Bartlett MS. An inverse matrix adjustment arising in discriminant analysis. Ann Math Stat. 1951;22:107–111. [Google Scholar]
- 23.Woodbury MA. Inverting modified matrices. Memo Rep. 1950;42:106. [Google Scholar]
- 24.Honarvar M, Rostami M. Accuracy of genomic prediction using RR-BLUP and Bayesian LASSO. Eur J Exp Biol. 2013;3:42–47. [Google Scholar]
- 25.Ben Gara A, Rekik B, Bouallègue M. Genetic parameters and evaluation of the Tunisian dairy cattle population for milk yield by Bayesian and BLUP analyses. Livest Sci. 2006;100:142–149. [Google Scholar]
- 26.Robinson GK. That BLUP is a good thing: the estimation of random effects. Stat Sci. 1991;6:15–32. [Google Scholar]
- 27.Gudbjartsson DF, Walters GB, Thorleifsson G, Stefansson H, Halldorsson BV, Zusmanovich P, et al. Many sequence variants affecting diversity of adult human height. Nat Genet. 2008;40:609–615. doi: 10.1038/ng.122. [DOI] [PubMed] [Google Scholar]
- 28.Henderson CR. Sire evaluation and genetic trends. J Anim Sci. 1973:10–41. [Google Scholar]
- 29.Ranade K, Chang MS, Ting CT, Pei D, Hsiao CF, Olivier M, et al. High-throughput genotyping with single nucleotide polymorphisms. Genome Res. 2001;11:1262–1268. doi: 10.1101/gr.157801. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Ober U, Ayroles JF, Stone EA, Richards S, Zhu D, Gibbs RA, et al. Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genet. 2012;8:e1002685. doi: 10.1371/journal.pgen.1002685. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Harris BL, Johnson DL. Genomic predictions for New Zealand dairy bulls and integration with national genetic evaluation. J Dairy Sci. 2010;93:1243–1252. doi: 10.3168/jds.2009-2619. [DOI] [PubMed] [Google Scholar]
- 32.Solberg TR, Heringstad B, Svendsen M, Grove H, Meuwissen TH. Genomic predictions for production- and functional traits in Norwegian red from BLUP analyses of imputed 54K and 777K SNP data. Interbull Bull. 2011;44:240–243. [Google Scholar]
- 33.Qin ZS, Niu T, Liu JS. Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am J Hum Genet. 2002;71:1242–1247. doi: 10.1086/344207. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Henderson CR. Estimation of variances in animal model and reduced animal model for single traits and single records. J Dairy Sci. 1986;69:1394–1402. [Google Scholar]
- 35.Graser HU, Smith SP, Tier B. A derivative-free approach for estimating variance components in animal models by restricted maximum likelihood. J Anim Sci. 1987;64:1362–1370. [Google Scholar]
- 36.Patterson HD, Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58:545–554. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.