Abstract
Background
The amount of functional genomic information has been growing rapidly but remains largely unused in genomic selection. Genomic prediction and estimation using haplotypes in genome regions with functional elements such as all genes of the genome can be an approach to integrate functional and structural genomic information for genomic selection. Towards this goal, this article develops a new haplotype approach for genomic prediction and estimation.
Results
A multi-allelic haplotype model treating each haplotype as an ‘allele’ was developed for genomic prediction and estimation based on the partition of a multi-allelic genotypic value into additive and dominance values. Each additive value is expressed as a function of h − 1 additive effects, where h = number of alleles or haplotypes, and each dominance value is expressed as a function of h(h − 1)/2 dominance effects. For a sample of q individuals, the limit number of effects is 2q − 1 for additive effects and is the number of heterozygous genotypes for dominance effects. Additive values are factorized as a product between the additive model matrix and the h − 1 additive effects, and dominance values are factorized as a product between the dominance model matrix and the h(h − 1)/2 dominance effects. Genomic additive relationship matrix is defined as a function of the haplotype model matrix for additive effects, and genomic dominance relationship matrix is defined as a function of the haplotype model matrix for dominance effects. Based on these results, a mixed model implementation for genomic prediction and variance component estimation that jointly use haplotypes and single markers is established, including two computing strategies for genomic prediction and variance component estimation with identical results.
Conclusion
The multi-allelic genetic partition fills a theoretical gap in genetic partition by providing general formulations for partitioning multi-allelic genotypic values and provides a haplotype method based on the quantitative genetics model towards the utilization of functional and structural genomic information for genomic prediction and estimation.
Keywords: Haplotype, Genomic selection, Variance component, Heritability, BLUP, REML
Background
Genomic best linear unbiased prediction (GBLUP) using genome-wide single nucleotide polymorphism (SNP) markers can utilize a wealth of theoretical results and computational strategies of best linear unbiased prediction (BLUP) [1] that has become a standard approach for genetic evaluation, with dairy cattle having the most widespread use of BLUP worldwide [2–5]. The implementation of GBLUP within the BLUP framework is made possible by a genomic relationship matrix that replaces the pedigree relationship matrix in BLUP [6]. With genomic relationship matrix established, genomic estimation of variance components can also readily use the method of restricted maximum likelihood estimation (REML) [7], to be referred to as GREML (genomic REML). Using a quantitative genetics model as the unifying model, genomic relationship matrix is formulated by equaling the covariance of genomic values between two individuals to the corresponding pedigree covariance [8, 9]. Previously defined genomic relationships based on standardization of SNP coding [6, 8, 10, 11] can be considered as special cases of this unifying approach [9]. The quantitative genetics model partitions a genotypic value as the summation of a common mean, breeding value and dominance deviation [12–18]. Using matrix notations, this partition can be expressed as: g = 1μ + a + d = 1μ + Wαα + Wδδ, where μ = common mean, 1 = column vector of 1’s, a = breeding values (additive values), d = dominance deviations (dominance values), α = SNP additive effects, δ = SNP dominance effects, Wα = model matrix of α as a function of SNP allele frequencies, and Wδ = model matrix of δ as a function of SNP allele frequencies. With the factorization of a = Wαα and d = Wδδ, genomic additive relationship is a function of WαWα ' and genomic dominance relationship is a function of WδWδ ' [9]. This approach for defining genomic relationships was only available for bi-allelic loci. Although SNPs are bi-allelic loci, the issue of multi-allelic loci for genomic prediction and estimation arises if each haplotype is treated as an ‘allele’ and the haplotype block containing the haplotypes is treated as a ‘locus’. For a multi-allelic locus, the partition of a genotypic value into additive and dominance values (g = 1μ + a + d) was available [17] and the multi-allelic factorization of a = Wαα and d = Wδδ was available for three alleles [19]. However, general factorization formulations for an arbitrary number of alleles were unavailable, and a method using such multi-allelic haplotype model for genomic prediction and estimation was unavailable.
Haplotype analysis is advantageous over single-locus analysis for several reasons: a haplotype is a functional unit [20], a haplotype contains combined effects of tightly linked cis-acting causal variants [21, 22], a phenotype is affected by multiple causal loci with weak LD (LD = linkage disequilibrium) [23], or a genomic region is subjected to selection with stronger LD than genome regions unaffected by selection [24, 25]. Haplotype analysis has been widely used in genetic and genomic studies [22, 26–28]. Relatively limited studies were available on using haplotypes compared to the literature on using single SNPs for genomic prediction. Methods to define haplotype blocks for genomic prediction included a constant number of SNPs per SNP block [29, 30], fixed block length [31], or LD blocks [32]. Haplotype coding methods for genomic prediction and estimation included 2-1-0 copies of a haplotype in the two-haplotype genotype [30, 33], or maternal or paternal haplotype [29]. Haplotype mixed model methods based on the quantitative genetics model with multi-allelic factorization of additive and dominance values were unavailable for genomic prediction and estimation. Functional genomic information has been growing rapidly but remains largely unused in genomic selection. Simulation study showed that genomic prediction using causal mutations could substantially improve prediction accuracy [34], and using SNPs in transcriptional regions [35] or location specific priors based on QTL mapping results [36] improved prediction accuracy. Haplotype analysis can be a useful tool to account for joint allelic effects unaccounted for by single-SNP analysis and we have obtained encouraging preliminary results of using haplotype analysis of functional genomic information [37, 38].
The purpose of this article is to develop a quantitative genetics based multi-allelic haplotype model as an alternative method to single-SNP analysis towards the integration of functional and structural genomic information for genomic selection. This development includes deriving general multi-allelic partition of genotypic values with factorization for defining genomic relationships using haplotypes, and deriving mixed model formulations for genomic prediction and estimation that can use haplotypes separately or jointly with single SNPs.
Methods
Allelic mean and population mean of multi-allelic genotypic values
A set of m SNP markers are assumed available, and r haplotype blocks are defined from some of the m SNPs across the genome. Each haplotype block is treated as a ‘locus’ and each haplotype within the haplotype block is treated as an ‘allele’. Each locus (haplotype block) is assumed to have h alleles (haplotypes) denoted by Ai, …, Ah, with allele frequency of pi for Ai, i = 1, …, h, and ∑hi = 1pi = 1. The allelic array in the population is ∑hi = 1piAi. Let Pij = frequency of AiAj genotype, ∑hi = 1∑hj = 1PijAiAj = the genotypic array of the population, and gij = genotypic value of AiAj genotype, i,j = 1,…,h. Hardy-Weinberg equilibrium (HWE) is assumed so that the genotypic array of the population is the squared allelic array, i.e., ∑hi = 1∑hj = 1PijAiAj = (∑hi = 1piAi)2. Allele frequency of Ai is calculated as:
1 |
The allelic mean of Ai allele is the weighted mean of all genotypic values with the Ai allele, with each genotypic value weighted by the number of copies of the Ai allele the genotype carries. The general expression of the allelic mean without requiring HWE is a conditional mean [13] and simplifies to a weighted average of genotypic values with allele frequencies as the weights under the HWE assumption [13, 17], i.e.,
2 |
The population mean is the mean of all genotypic values in the population. The general formula without requiring HWE and its expression as a weighted average of allelic means with allele frequencies as the weights requiring HWE are:
3 |
The expressions of μi = ∑hj = 1pjgij and μ = ∑hk = 1pkμk play an important role in the derivations to factorize additive and dominance values and in defining fundamental genetic parameters of quantitative traits.
Multi-allelic effect, additive effect, additive value
The allelic effect (average effect) of allele Ai (i = 1,…h) is the deviation of the allelic mean from the population mean. From Eqs. 2 and 3, the allelic effect of Ai is:
4 |
where αij is the additive effect or the average effect of gene substitution that is the difference between the allelic effects of the two alleles defined by Eq. 4, i.e.,
5 |
For h alleles, h(h − 1)/2 αij parameters of Eq. 5 are possible but these parameters are not independent for all ij values. An example of this dependency is:
6 |
Based on Eq. 6, h-1 independent additive effects can be defined:
7 |
where μ1 = allelic mean of allele 1 that is used as the reference allele (e.g., defining the most frequent allele as ‘allele 1’). It is readily seen that αii = 0. The derivation process will allow the presence of αii but the final results will be based on the h−1 independent additive effects of αlk defined by Eq. 7. All the h(h − 1)/2 possible αij parameters can be expressed in terms of the h−1 independent αlk parameters through Eq. 6. The additive value (breeding value) of genotype AiAj is the summation of the two allelic effects of the genotype, i.e.,
8 |
Each additive value defined by Eq. 8 will be shown to be a function of all h−1 additive effects defined by Eq. 7.
Dominance effect and dominance value
Dominance effect of AiAj genotype (δij) is the deviation of the heterozygous genotypic value from the average of the two homozygous genotypic values, i.e.,
9 |
With the above definition, dominance effect is the unique effect of a heterozygous genotype. Therefore, the number of dominance effects is the same as number of heterozygous genotypes, and the maximum number of dominance effects is h(h − 1)/2. It is readily seen from Eq. 9 that δii = 0. The derivation process will allow the presence of δii but the final results will not have δii. Dominance value or dominance deviation is the deviation of the genotypic value from the common mean and additive value, i.e.,
10 |
An important difference between ‘dominance value’ and ‘dominance effect’ is that a homozygous genotype may have non-zero dominance value but always has zero dominance effect. Each dominance value defined by Eq. 10 will be shown to be a function of all h(h − 1)/2 dominance effects defined by Eq. 9.
Multi-allelic partition of genotypic value and variance
The genotypic value of a multi-allelic genotype has the same partition as for a bi-allelic locus [17], i.e.,
11 |
with E(aij) = 0 and E(dij) = 0. The multi-allelic genotypic variance (σ2g) also has the same partition as for a bi-allelic locus [17], i.e., σ2g = σ2a + σ2d, where σ2a = additive variance, and σ2d = dominance variance. The multi-allelic haplotype model to be developed starts with the factorization of the additive and dominance values in Eq. 11.
Results and discussion
Factorization of additive and dominance values
From Eqs. 4–7, an allelic effect can be expressed as:
12 |
where αlk is defined by Eq. 7. Equation 12 shows that an allelic effect is a function of all h-1 parameters of additive effects denoted by αlk. The additive values (breeding values) of AiAj and AiAi genotypes can be expressed as:
13 |
14 |
In Eqs. 13 and 14, αli = 0 if i = 1 and α1j = 0 if j = 1. From Eqs. 1–3 and 9–10, the dominance value of the AiAj genotype can be expressed as
15 |
In Eq. 15, the quantity gij − gik − gjf + gkf has two positive terms and two negative terms, and each subscript is associated with a positive term and a negative term. Using this fact and the definition of dominance effect (δij) of Eq. 9 with δii = 0, gij − gik − gjf + gkf can be expressed as:
16 |
Combining Eqs. 15 and 16 with Eq. 10 and using pj = 1 − ∑hk ≠ jpk (Eq. 1) yields:
17 |
In Eq. 17,
18 |
Combining Eqs. 17 and 18 yields:
19 |
20 |
Equations 13 and 14 show that each additive value is a function of all h − 1 additive effects defined by Eq. 7, and Eqs. 19–20 show that each dominance value is a function of all h(h − 1)/2 dominance effects defined by Eq. 9. Equations 13 and 14 provide the additive coding and Eqs. 19 and 20 provide the dominance coding of each multi-allelic genotype for the mixed model implementation.
Multi-allelic haplotype model based on multi-allelic genetic partition
Using the results of factorization of additive and dominance values given by Eqs. 13–14 and 19–20, the multi-allelic haplotype model treating each haplotype as an ‘allele’ by Eq. 11 can be expressed as:
21 |
In wij,kα, superscripts ij are for the genotype of AiAj and superscript k is for αlk. In wij,kfδ, superscripts ij are for dij and superscripts kf are for δkf. From Eqs. 13 and 14, the additive coding (wij,kα) of a multi-allelic genotype is:
22 |
23 |
24 |
From Eqs. 19 and 20, the dominance coding (wij,kfδ) of a multi-allelic genotype is:
25 |
26 |
27 |
28 |
29 |
For convenience of computer programming, Eqs. 22–24 can be characterized by whether aij and αlk share no common allele (Eq. 22), or 1 common allele when i ≠ j (Eq. 23) or 1 common allele when i = j (Eq. 24). Similarly, between dij and δkf, Eq. 25 shares two common alleles, Eqs. 26 and 27 share 1 common allele with i ≠ j, Eq. 28 shares one common allele with i = j, and Eq. 29 share no common allele. In Eqs. 25–29, pi or pj is the allele frequency of the shared allele between dij and δkf and pk or pf is the allele frequency of the non-shared allele between dij and δkf. From Eqs. 21–29, the multi-allelic haplotype model for h(h + 1)/2 possible genotypic values (g) of a given haplotype block with h haplotypes can be expressed as:
30 |
where μ = common mean, 1 = [h(h + 1)/2] × 1 column vector of 1’s, ah = Wαhαh = [h(h + 1)/2] × 1 column vector of additive values (breeding values), dh = Wδhδh = [h(h + 1)/2] × 1 column vector of dominance values (dominance deviations), Wαh = [h(h + 1)/2] × (h − 1) model matrix of αhwith wij,kα defined by Eqs. 22–24, dh = [h(h + 1)/2] × 1 column vector of dominance values (dominance deviations), Wδh = [h(h + 1)/2] × [h(h − 1)/2] matrix of δh with wij,kfδ defined by Eqs. 25–29, and αh = (h − 1) × 1 column vector with αlk defined by Eq. 7, and δh = [h(h − 1)/2] × 1 column vector with δkf defined by Eq. 9.
Numerical example of multi-allelic genetic partition
A hypothetical numerical example is used to illustrate the genetic partition of multi-allelic genotypic values described by Eqs. 21–30. Four haplotypes as ‘alleles’ are assumed with frequencies in Table 1 and genotypic values in Table 2. The common mean of the genotypic values using Eq. 3 is: μ = 22.09. The additive effects of the four haplotypes defined by Eqs. 5–7, are:
and the dominance effects defined by Eq. 9 are:
Table 1.
Four hypothetical haplotypes and their frequencies (h = 4)
Haplotype | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Frequency | 0.4 | 0.3 | 0.2 | 0.1 |
Table 2.
Genotypic values of haplotype genotypes (gij = gji)
Haplotype | 1 | 2 | 3 | 4 |
---|---|---|---|---|
1 | g11 = 25 | g12 = 18 | g13 = 15 | g14 = 10 |
2 | g22 = 30 | g23 = 33 | g24 = 40 | |
3 | g33 = 17 | g34 = 12 | ||
4 | g44 = 35 |
Using Eqs. 13–14 and 22–24, the additive values (breeding values) are:
Using Eqs. 19–20 and 25–29, the dominance values (dominance deviations) are:
The genotypic values calculated as the summation of the additive and dominance values are:
By comparing with the genotypic values in Table 2, the above result verifies that the multi-allelic partition of g = 1μ + ah + dh = 1μ + Wαhαh + Wδhδh described by Eqs. 21–30 is correct. With the note that gij = gji, aij = aji and dij = dji, the genotypic variance (σ2g), additive variance (σ2a) and dominance variance (σ2d) are:
It is readily seen that σ2g = σ2a + σ2d.
Mixed model and multi-allelic genomic relationship matrices
A mixed model to implement the multi-allelic haplotype model of Eq. 30 can be established with appropriate changes of matrix dimensions for Wαh, Wδh, ah, dh, αh and δh in Eq. 30. A set of m SNP markers are assumed available, and r haplotype blocks of the m SNPs are defined across the genome. Haplotypes of all individuals are assumed known (e.g., constructed using a phasing or imputing software). Each haplotype block is treated as a ‘locus’ and each haplotype within a haplotype block is treated as an ‘allele’. The ith haplotype block has hi haplotypes, hi−1 additive effects, and nδi dominance effects or heterozygous genotypes. Let nα = total number of additive effects of all r haplotype blocks, nδ = total number of dominance effects (or heterozygous genotypes) of all r haplotype blocks. Then, nα = ∑ri = 1hi − r, and nδ = ∑ri = 1nδi. For a given sample of q individuals, the limit number of effects is 2q-1 for additive effects and is the number of heterozygous genotypes for dominance effects. For a sample with N observations on q individuals, the mixed model to implement the multi-allelic haplotype model of Eq. 30 can be expressed as:
31 |
where Z = N × q incidence matrix allocating phenotypic observations to each individual = identity matrix for one observation per individual (N = q), αh = nα × 1 column vector of haplotype additive effects, Wαh = q × nα model matrix of αh, δh = nδ × 1 column vector for dominance effects of haplotype genotypes, Wδh = q × nδ model matrix of δh, αs = m × 1 column vector of single-SNP additive effects, b = c × 1 column vector of fixed effects such as heard-year-season in dairy cattle (c = number of fixed effects), and X = N × c model matrix of b. To define two equivalent models with complementary computing advantages and identical GBLUP and GREML results, the mixed model of Eq. 31 needs to be expressed as [8]:
32 |
where ah = Tαhαh = multi-allelic genomic breeding values, dh = Tδhδh = multi-allelic genomic dominance values, and each T matrix can be defined by any of the six definitions of genomic relationships we previously discussed and implemented [9]. For simplicity of notations, the T matrices are defined as: Tαh = Wαh/k1/2αh, Tδh = Wδh/k1/2δh, where kαh = the average of diagonal elements of WαhWαh ', and kδh = the average of diagonal elements of WδhWδh '. The genomic relationship matrices of Eq. 31 can thus be defined as:
33 |
34 |
Interpretation of multi-allelic and haplotype genomic relationship matrices
The multi-allelic genomic relationships of Eqs. 33 and 34 using multi-allelic markers such as microsatellite markers have the same interpretation and theoretical expectation as using SNP markers that are bi-allelic, e.g., a genomic additive relationship is expected to be twice the coancestry coefficient [8, 9]. Using either multi-allelic or bi-allelic markers under the assumption of no inbreeding, the theoretical expectation of genomic additive relationships is 0.5, 0.5, 0.25 and 0 for parent-offspring, full-sibs, half-sibs and unrelated individuals respectively, and the corresponding theoretical expectation of genomic dominance relationships is 0, 0.25, 0 and 0.
It is important to distinguish between single-locus multi-allelic markers such as microsatellite markers from haplotypes where each haplotype is treated as an ‘allele’ and each haplotype block is treated as a ‘locus’, because recombination between loci within a haplotype block generally exists, leading to lowered haplotype similarity than single-locus similarity among relatives. As the number of loci increases in each haplotype block, genomic relationships using haplotypes are expected to decrease from those using single-locus markers. Therefore, the utility of haplotype genomic relationships using Eqs. 33 and 34 is for genomic prediction using haplotypes, not for measuring relationships among individuals. The optimal block size and hence the number of haplotypes per block is an important issue for genomic prediction and could be determined by validation studies, as to be further discussed towards the end of this article.
Two equivalent mixed models with complementary computing strategies
To establish mixed models using multi-allelic markers or haplotypes, assumptions for the first and second moments of the mixed model of Eq. 32 are: E(y) = Xb, E(αh) = E(δh) = E(αs) = E(δs) = 0, Var(αh) = σ2αhInα, Var(ah) = Gαh = σ2αhAh, Var(δh) = σ2δhInδ, Var(dh) = Gδh = σ2δhDh, and Var(e) = R = σ2eIN, where σ2αh = variance of multi-allelic additive effects, σ2δh = variance of multi-allelic dominance effects, σ2e = residual variance, and Inα, Inδ, Im and IN are identity matrices of orders nα, nδ, m and N, respectively. All random effects are assumed to be uncorrelated so that the phenotypic variance-covariance matrix is:
35 |
To simply notations for the two equivalent mixed models, terms in Eqs. 32–35 are re-written as αh = τ1, δh = τ2; Tαh = T1, Tδh = T2; ui = Tiτi, i = 1,2; Ah = S1, Dh = S2; and σ2αh = σ21, σ2δh = σ22. Then, Eqs. 32 and 35 can be expressed as:
36 |
37 |
By defining Zi = ZTi, an equivalent model of Eqs. 36 and 37 can be re-written as:
38 |
39 |
Equations 36 and 37 will be referred to as Model-I, and Eqs. 38 and 39 as Model-II. Model-I and Model-II are equivalent models because both models have identical E(y) and V, but these two models have different computational advantages that can be complementary to each other. For each model, two methods can be established for genomic prediction and estimation: the method of conditional expectation (CE) and the method of mixed model equations (MME), yielding a total of four methods for the two equivalent models. Model-I using CE is the best method for large numbers of SNP markers and multiple genetic factors, Model-II using MME is the best method for large numbers of individuals, and Model-I using MME and Model-II using CE have no computing advantage. Therefore, Model-I using CE and Model-II using MME will be used for genomic prediction and estimation. Using our previous naming of these two methods, GBLUP and GREML of Model-I using CE will be referred to as the CE set of formulations, and GBLUP and GREML of Model-II using MME as the QM set of formulation, where QM means ‘q > m’. These two methods yield identical results of prediction and estimation and are applicable to singular genomic relationship matrices. Assuming one observation per individual, CE based on Eqs. 36 and 37 is approximately easier to compute than QM based on Eqs. 38 and 39 if q < c + nα + nδ according to the size of the largest matrix to invert for each method (Table 3). Model-I using MME has no computing advantage over Model-I using CE due to the large coefficient matrix of MME and the requirement for full-rank relationship matrices; and Model-II using CE has no computing advantage over Model-I using CE due to the large T matrices to store in memory.
Table 3.
Comparison of computational feasibility of four methods from the two equivalent models with haplotypes and SNPs for GBLUP and GREML
Method of for calculating GBLUP | |||
---|---|---|---|
Conditional expectation (CE) | Mixed model equations (MME) | ||
Model I, Eqs. 36 and 37 | Largest matrix to invert | V, phenotypic variance-covariance matrix | C, coefficient matrix of MME |
Size of largest matrix to invert | q × q, assuming one observation per individual | c + 2q for C | |
Largest matrix to store in memory | q × q P matrix | c + 2q for C | |
Applicable to singular genomic relationship matrices | Yes, inverse relationship matrices avoided | No, inverse relationship matrices required | |
Model II, Eqs. 38 and 39 | Largest matrix to invert | V, phenotypic variance-covariance matrix | C, coefficient matrix of MME |
Size of largest matrix to invert | q × q, assuming one observation per individual | c + nα + nδ for C | |
Largest matrix to store in memory | q × nα and q × nδ T matrices, q × q P matrix | c + nα + nδ for C | |
Applicable to singular genomic relationship matrices | Yes, inverse relationship matrices avoided | Yes, inverse relationship matrices avoided |
Genomic best linear unbiased prediction of genetic values (GBLUP)
Using the CE method of Model-I (Eqs. 36 and 37), GBLUP of the ith type of genetic values for individuals in the training population is obtained as:
40 |
where = best linear unbiased estimator (BLUE) of fixed non-genetic effects, P = V− 1 − V− 1X(X ' V− 1X)−X ' V− 1, and column vector of regressed phenotypic values of the training population as a regression of the ith type of genetic values on the phenotypic values in the training population. Two equivalent methods with identical results can be used to predict genetic values of individuals without phenotypic observations (validation population): placing all individuals with or without records in the same mixed model by setting to zero the Z matrix for the validation population, or calculate predictions separately based on the regressed phenotypic values of the training population [8, 39]. Using this second method, GBLUP of the ith type of genetic values for individuals in the validation population is calculated as:
41 |
where Si01 = q0 × q genomic relationship matrix between the training and validation populations for the ith type of genetic values (q0 = number of individuals in the validation population).
Using the QM method (MME method of Model-II of Eqs. 38 and 39), genomic prediction first calculates the GBLUP of haplotype effects and then calculates GBLUP of genetic values. GBLUP of haplotype effects is obtained from solving the following MME:
42 |
where , Zg = (Z1, Z2), λi = σ2e/σ2i, t = nα, nδ, m and N for i = 1,2, respectively, and ⊕ denotes direct sum that defines a block diagonal matrix. With haplotype and SNP effects from Eq. 42, GBLUP of the ith type of genetic values for individuals in the training and validation populations are obtained as:
43 |
44 |
where Ti0 = the Ti matrix calculated using SNPs of the validation population. Equations 43 and 44 yield identical results as those of Eqs. 40 and 41. The prediction of total genotypic values in either training or validation population can be obtained from Eqs. 40 and 41 or 43 and 44 as: ĝ = ∑2i = 1ûi = predicted genotypic values of all individuals, and ĝ0 = ∑2i = 1ûi0 = predicted genotypic values of the validation population. Prediction reliabilities of additive, dominance and genotypic predictions as the squared correlations between the genomic and true values has the same formulations as the R2ai, R2di and R2gi formulae in [8], and prediction accuracy is obtained as the square root of the reliability estimate.
Genomic restricted maximum likelihood estimation (GREML) of variance components
Using the CE method of Model-I (Eqs. 36 and 37), the EM type GREML estimates of variance components are:
45 |
46 |
where k = iteration number. Using the QM method (Eqs. 38 and 39), the EM type GREML estimates of variance components are
47 |
48 |
where r is the rank of the coefficient matrix of Eq. 42, , and Cii is defined by:
where M = IN − X(X ' X)−X ', and ti = nα for i = 1 and ti = nδ for i = 2.
The EM-REML of Eqs. 45–48 are known to be slow but reliable to yield non-negative estimates of variance components. The AI-REML algorithm is fast but may be sensitive to starting values of variance components and may fail for extreme heritability levels. Formulations of AI-REML for the multi-allelic haplotype model in this article are straightforward extensions of the formulations we implemented for GVCBLUP [40].
Integration of haplotype and single SNP effects in genomic prediction and estimation
Haplotype analysis and single SNP analysis can be analyzed jointly for genomic prediction in the same mixed model by adding single SNP effects from our previous work [8] to the mixed model of Eq. 31, i.e.,
49 |
50 |
where αs = m × 1 column vector of SNP additive effects, Tαs = q × m model matrix of αs, δs = m × 1 column vector of SNP dominance effects, Tδs = q × m model matrix of δs, Var(αs) = σ2αsIm, Var(as) = Gαs = σ2αsAs, Var(δs) = σ2δsIm, Var(ds) = Gδs = σ2δsDs, As = genomic additive relationship matrix, and Ds = SNP genomic dominance relationship matrix, and where As = TαsTαs ' and Ds = TδsTδs '. Let αs = τ3, δ = τ4; ui = Tiτi, i = 1,…,4; As = S3, Dh = S4; and σ2αs = σ23, σ2δs = σ24. The GBLUP and GREML formulations to jointly include haplotype and single SNP additive and dominance effects essentially entails to extending the range of the subscript i from 2 to 4 for Eqs. 38–50.
GREML estimation using the joint mixed model with haplotype and SNP effects offer flexibility to estimate the heritability for various types of functional genomic information in any given autosome regions based on formulations we implemented in GVCBLUP [40], e.g., the additive and dominance heritabilities of haplotype blocks of all genes, all LD blocks, or all single SNPs. The heritability estimate for each type of genetic effects is: h2i = σ2i/σ2y, where σ2y = ∑4i = 1σ2i + σ2e = phenotypic variance. The total heritability of all types of genetic effects is the summation of all effect heritabilities, i.e., H2 = ∑4i = 1h2i. Genomic heritability estimation has flexibility unavailable from heritability estimation using pedigree relationships: the heritability estimation for a single SNP, a chromosome region, or a set of selected SNPs. Using the GREML formulae of Eqs. 35 and 36, the heritability for haplotype block j or SNP set j can be estimated as: , where = subset j of , i = 1,…,4. Given sufficient computing power and sample sizes for extensive validation studies, these heritability estimates could help identify genomic regions and genes relevant to phenotypes within the framework of genomic prediction.
Defining haplotype blocks using functional genomic information
The multi-allelic haplotype model can be used for the integration of functional genomic information with genomic prediction and estimation. This integration defines haplotype blocks using functional genomic information under the hypothesis that a chromosome region with functional information required more than a single point to affect a phenotype, followed by genomic prediction and estimation using a haplotype analysis such as the methods developed in this article. Each gene could be a ‘natural haplotype block’ and the use of gene blocks improved the prediction accuracy for some human phenotypes in our preliminary results [37]. Other types of functional information can also be used to define haplotype blocks, including ChIP-seq sites, DNA methylation sites, CNV, protein interaction, pathway information, GWAS results and selection signatures (Fig. 1). Other than ‘natural haplotype blocks’, the optimal block sizes for functional information with best prediction accuracy could be determined by extensive validation studies.
Fig. 1.
Integration of functional and structural genomic information for genomic selection. Haplotype blocks are defined using functional genomic information and are analyzed using the multi-allelic haplotype model in this article for genomic prediction and estimation. Single SNPs as structural genomic information can be used jointly with the haplotype analysis. (DHS = DNase I hypersensitive site)
Rare haplotypes, missing genotypic values
The mixed model approach outlined above allows rare haplotypes. In the extreme case of rare haplotypes with one observation per haplotype or haplotype frequency of 1/h when h is large, the multi-allelic model with the mixed model implementation still is applicable for additive effects and values. Missing genotypic values is a problem for dominance effects and values. The dominance effect defined by Eq. 9 requires the availability of all three genotypic values of a haplotype pair. Consequently, dominance effect is undefined with any missing genotypic value. We currently recommend ignoring any haplotype pair with missing genotypic value or values for defining dominance effects. For large haplotype blocks, nearly all individuals could be heterozygous so that such large blocks may not contribute to genomic prediction and estimation of dominance effects and values. This loss of dominance information should be a factor to consider in defining the block size.
Conclusions
A multi-allelic haplotype model for genomic prediction and estimation is established using the quantitative genetics model that partitions a multi-allelic genotypic value into additive and dominance values, factorizes each additive value into a product between a function of allele frequencies and additive effect, and factorizes each dominance value into a product between a function of allele frequencies and dominance effect. Haplotype genomic additive and dominance relationship matrices and formulations are then derived for GBLUP and GREML utilizing haplotypes in haplotype blocks. These results fill a gap in the theory of quantitative genetics for multi-allelic genetic partition and provide a haplotype approach within the theory of quantitative genetics towards the integration of functional and structural genomic information for genomic selection.
Availability of supporting data
The only data set used in this article is shown in Tables 1–2.
Acknowledgements
This research was supported by USDA National Institute of Food and Agriculture Grant no. 2011-67015-30333 and by project MN-16-043 of the Agricultural Experiment Station at the University of Minnesota. Dzianis Prakapenka and Chunkao Wang implemented the methodology in this article by the GVCHAP computer program. Cheng Tan and Dzianis Prakapenka evaluated the methodology. John R. Garbe provided summary and discussion of human functional genomic information. Li Ma processed a dataset for methodology evaluation.
Abbreviations
- SNP
single nucleotide polymorphism
- BLUP
best unbiased linear prediction
- GBLUP
genomic BLUP
- REML
restricted maximum likelihood estimation
- GREML
genomic REML
- EM
expectation-maximization
- AI-REML
average information REML
- CE
conditional expectation
- MME
mixed model equations
Footnotes
Competing interests
The author declares to have no competing interests.
References
- 1.Henderson C. Applications of Linear Models in Animal Breeding. Guelph: University of Guelph; 1984. [Google Scholar]
- 2.Fikse W, Philipsson J. Development of international genetic evaluations of dairy cattle for sustainable breeding programs. Anim Genet Resour Inf. 2007;41:29–43. [Google Scholar]
- 3.Powell R, VanRaden P. International dairy bull evaluations expressed on national, subglobal, and global scales. J Dairy Sci. 2002;85(7):1863–1868. doi: 10.3168/jds.S0022-0302(02)74260-4. [DOI] [PubMed] [Google Scholar]
- 4.VanRaden P. Invited Review: Selection on Net Merit to Improve Lifetime Profit. J Dairy Sci. 2004;87(10):3125–3131. doi: 10.3168/jds.S0022-0302(04)73447-5. [DOI] [PubMed] [Google Scholar]
- 5.Wiggans G, Misztal I, Van Vleck L. Implementation of an animal model for genetic evaluation of dairy cattle in the United States. J Dairy Sci. 1988;71:54–69. doi: 10.1016/S0022-0302(88)79979-8. [DOI] [Google Scholar]
- 6.VanRaden P. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91(11):4414–4423. doi: 10.3168/jds.2007-0980. [DOI] [PubMed] [Google Scholar]
- 7.Patterson HD, Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58(3):545–554. doi: 10.1093/biomet/58.3.545. [DOI] [Google Scholar]
- 8.Da Y, Wang C, Wang S, Hu G. Mixed model methods for genomic prediction and variance component estimation of additive and dominance effects using SNP markers. PLoS One. 2014;9(1):e87666. doi: 10.1371/journal.pone.0087666. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Wang C, Da Y. Quantitative genetics model as the unifying model for defining genomic relationship and inbreeding coefficient. PLoS ONE. 2014;9:e114484. doi: 10.1371/journal.pone.0114484. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Hayes B, Goddard M. Genome-wide association and genomic selection in animal breeding. Genome. 2010;53(11):876–883. doi: 10.1139/G10-076. [DOI] [PubMed] [Google Scholar]
- 11.Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565–569. doi: 10.1038/ng.608. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Fisher RA. The Correlation between Relatives on the Supposition of Mendelian Inheritance. Trans Roy Soc Edinb. 1918;52(02):399–433. doi: 10.1017/S0080456800012163. [DOI] [Google Scholar]
- 13.Fisher RA. Average excess and average effect of a gene substitution. Ann. Eugen. 1941;11(1):53–63. doi: 10.1111/j.1469-1809.1941.tb02272.x. [DOI] [Google Scholar]
- 14.Cockerham CC. An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics. 1954;39(6):859. doi: 10.1093/genetics/39.6.859. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Kempthorne O. The correlation between relatives in a random mating population. Proc. R. Soc. Lond. B Biol. Sci. 1954;143(910):103–113. doi: 10.1098/rspb.1954.0056. [DOI] [PubMed] [Google Scholar]
- 16.Lynch M, Walsh B. Genetics and analysis of quantitative traits. Massachusetts: Sinauer Sunderland; 1998. [Google Scholar]
- 17.Kempthorne O. An introduction to genetic statistics. New York: Wiley; 1957. [Google Scholar]
- 18.Falconer DS, Mackay TFC. Introduction to Quantitative Genetics. 4. Harlow, Essex: Longmans Green; 1996. [Google Scholar]
- 19.Álvarez-Castro JM, Yang R-C. Multiallelic models of genetic effects and variance decomposition in non-equilibrium populations. Genetica. 2011;139(9):1119–1134. doi: 10.1007/s10709-011-9614-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Vormfelde SV, Brockmöller J: On the value of haplotype-based genotype–phenotype analysis and on data transformation in pharmacogenetics and-genomics. Nature Reviews Genetics 2007, 8(12), doi:10.1038/nrg1916-c1. [DOI] [PubMed]
- 21.Balding DJ. A tutorial on statistical methods for population association studies. Nat Rev Genet. 2006;7(10):781–791. doi: 10.1038/nrg1916. [DOI] [PubMed] [Google Scholar]
- 22.Garnier S, Truong V, Brocheton J, Zeller T, Rovital M, Wild PS, et al. Genome-wide haplotype analysis of cis expression quantitative trait loci in monocytes. PLoS Genet. 2013;9(1):e1003240. doi: 10.1371/journal.pgen.1003240. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Morris RW, Kaplan NL. On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genet Epidemiol. 2002;23(3):221–233. doi: 10.1002/gepi.10200. [DOI] [PubMed] [Google Scholar]
- 24.Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21(2):263–265. doi: 10.1093/bioinformatics/bth457. [DOI] [PubMed] [Google Scholar]
- 25.Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, et al. Genome-wide detection and characterization of positive selection in human populations. Nature. 2007;449(7164):913–918. doi: 10.1038/nature06250. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009;84(2):210–223. doi: 10.1016/j.ajhg.2009.01.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78(4):629–644. doi: 10.1086/502802. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Von Holdt BM, Pollinger JP, Lohmueller KE, Han E, Parker HG, Quignon P, et al. Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. Nature. 2010;464(7290):898–902. doi: 10.1038/nature08837. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Calus M, De Roos A, Veerkamp R. Accuracy of genomic selection using different methods to define haplotypes. Genetics. 2008;178(1):553–561. doi: 10.1534/genetics.107.080838. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Villumsen T, Janss L, Lund M. The importance of haplotype length and heritability using genomic selection in dairy cattle. J Anim Breed Genet. 2009;126(1):3–13. doi: 10.1111/j.1439-0388.2008.00747.x. [DOI] [PubMed] [Google Scholar]
- 31.Sun X, L. FR, Garrick DJ, Dekkers JCM: Improved accuracy of genomic prediction for traits with rare QTL by fitting haplotypes. Proceedings, 10th World Congress of Genetics Applied to Livestock Production Vancouver, BC, Canada https://asas.org/docs/default-source/wcgalp-proceedings-oral/209_paper_9178_manuscript_1682_0.pdf?sfvrsn=2 [Last accessed December 8 2015].
- 32.Cuyabano BC, Su G, Lund MS. Selection of haplotype variables from a high-density marker map for genomic prediction. Genet Sel Evol. 2015;47(1):1–11. doi: 10.1186/s12711-015-0143-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Mulder HA, Calus MP, Veerkamp RF. Prediction of haplotypes for ungenotyped animals and its effect on marker-assisted breeding value estimation. Genet Sel Evol. 2010;42:10. doi: 10.1186/1297-9686-42-10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Meuwissen T, Goddard M. Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics. 2010;185(2):623–631. doi: 10.1534/genetics.110.116590. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Erbe M, Hayes B, Matukumalli L, Goswami S, Bowman P, Reich C, et al. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci. 2012;95(7):4114–4129. doi: 10.3168/jds.2011-5019. [DOI] [PubMed] [Google Scholar]
- 36.Brøndum RF, Su G, Lund MS, Bowman PJ, Goddard ME, Hayes BJ. Genome position specific priors for genomic prediction. BMC Genomics. 2012;13(1):543. doi: 10.1186/1471-2164-13-543. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Da Y, Wang C, Tan C, Prakapenka D, Shigematsu M, Garbe J, Ma L: Multi-allelic haplotype model for genomic prediction and estimation. Abstract P1176. Plant and Animal Genome XXIII, January 10–14, 2015. San Diego. https://pag.confex.com/pag/xxiii/webprogram/Paper14435.html [Last accessed December 8 2015].
- 38.Tan C, Prakapenka D, Wang C, Ma L, Garbe JR, Hu X, Da Y: Integration of haplotype analysis of functional genomic information with single SNP analysis improved accuracy of genomic prediction. ADSA/ASAS 2015, Orlando, July 12–16 2015. Abstract M84. http://m.jtmtg.org/abs/t/65063. [Last accessed December 8 2015].
- 39.Henderson C. Best linear unbiased prediction of breeding values not in the model for records. J Dairy Sci. 1977;60(5):783–787. doi: 10.3168/jds.S0022-0302(77)83935-0. [DOI] [Google Scholar]
- 40.Wang C, Prakapenka D, Wang S, Pulugurta S, Runesha HB, Da Y. GVCBLUP: a computer package for genomic prediction and variance component estimation of additive and dominance effects. BMC bioinformatics. 2014;15(1):270. doi: 10.1186/1471-2105-15-270. [DOI] [PMC free article] [PubMed] [Google Scholar]