Genetics. 2021 Aug 7;219(2):iyab130. doi: 10.1093/genetics/iyab130

Genetic evaluation including intermediate omics features

Ole F Christensen 1, Vinzent Börner 1, Luis Varona 2, Andres Legarra 3
PMCID: PMC8633135  PMID: 34849886

Abstract

In animal and plant breeding and genetics, there has been an increasing interest in intermediate omics traits, such as metabolomics and transcriptomics, which mediate the effect of genetics on the phenotype of interest. For inclusion of such intermediate traits into a genetic evaluation system, there is a need for a statistical model that integrates phenotypes, genotypes, pedigree, and omics traits, and a need for associated computational methods that provide estimated breeding values. In this paper, a joint model for phenotypes and omics data is presented, and a formula for the breeding values on individuals is derived. For complete omics data, three equivalent methods for best linear unbiased prediction of breeding values are presented. In all three cases, this requires solving two mixed model equation systems. Estimation of parameters using restricted maximum likelihood is also presented. For incomplete omics data, extensions of two of these methods are presented, where in both cases, the extension consists of extending an omics-related similarity matrix to incorporate individuals without omics data. The methods are illustrated using a simulated data set.

Keywords: genetic evaluation, breeding value, mixed model equations, single-step method, transcriptomics, metabolomics, Genomic Prediction, GenPred, Shared Data Resource

Introduction

In plant and animal breeding and genetics, there has been an increasing interest in intermediate omics traits that mediate the effects of genetics on the phenotype of interest. Such traits are for example metabolomics (Riedelsheimer et al. 2012; Hayes et al. 2017) or transcriptomics (Morgante et al. 2020), and they are becoming increasingly available in larger quantities at decreasing costs. For inclusion of such intermediate traits into a genetic evaluation system, there is a need for statistical models and methods that integrate phenotypes, genotypes, pedigree, and such omics traits, and provide estimated breeding values (EBVs).

Other intermediate traits where such methods may be applied are metagenome (usually gut or rumen microbiome composition, e.g., Camarinha-Silva et al. 2017; Difford et al. 2018; Weishaar et al. 2020); chemometric traits such as Near- and Mid-Infrared Spectroscopy that are used to either predict another trait of interest which is costly to measure, such as methane emissions (Negussie et al. 2017), or purely to describe similarities among individuals (Rincent et al. 2018); kinetic traits using accelerometers which are used as predictors of other traits, such as lameness (O’Leary et al. 2020). However, as stated in the title, in this paper, we will focus on traits that are deemed to be intermediate between DNA action and the trait of interest for the breeder.

For omics data, phenotype may be predicted using all available omics features rather than selected features based on a statistical significance threshold (Riedelsheimer et al. 2012). This is similar to genomic selection (Meuwissen et al. 2001) where using the whole set of single nucleotide polymorphism (SNP) markers provides a more accurate prediction than using only SNPs based on a P-value threshold. An equivalent formulation that incorporates omics data via a similarity matrix has been used for metabolomic data (Riedelsheimer et al. 2012; Guo et al. 2016) and transcriptomic data (Guo et al. 2016; Morgante et al. 2020). A similar idea has been proposed for microbiota data (Camarinha-Silva et al. 2017). Although such an approach provides a measure of the importance of the omics on the trait, it does not lead in itself to genetic evaluations usable for genetic improvement.

An approach suitable for genetic evaluation is to include an omics-derived phenotype as a correlated trait in a multitrait genetic model (see, e.g., Hayes et al. 2017). This gives EBVs on all individuals, allows incomplete omics data, and extends the usual model for genetic evaluation. However, the prediction equations for omics-derived phenotype would need to be periodically updated, the procedure consists of several steps, which seems suboptimal, and the approach is not based on a theoretical derivation of breeding values in the context of having both phenotypes and omics data.

A theoretical derivation of breeding values for feed efficiency and with microbiota composition as the intermediate trait was made in Weishaar et al. (2020). Here, they derived the effect of microbiota composition on breeding value, and from this, they decomposed the breeding value of the trait into a heritable contribution from genes affecting microbiota composition and a heritable contribution from genes directly affecting the trait. The derivation of the approach incorporated two models: a microbial model describing how the microbial abundances affected the phenotype, and a genetic model for the microbial abundances. The first model does not include the genetic effect which directly affects the trait (not through the microbiome). Instead, the authors proposed to include this in a selection index. For genetic evaluation using omics data, the use of two such models, i.e., a model that describes how omics expression affects phenotypes, and a genetic model for omics expression, seems an interesting approach, but the use of a selection index and a procedure that is based on several steps can be improved.

The use of two such models, a model that describes how transcriptomics expression affects phenotypes, and a genetic model for transcriptomics expression, has been suggested in recent human genetic studies (Mancuso et al. 2019; Qian et al. 2019). Here, they promote methods based on the full likelihood of phenotypes and expression data, and they argue that this incorporates uncertainty into the results, reduces bias, and allows for incomplete expression data by considering this as a missing data problem. The model proposed is a hierarchical model with two model equations. Restricting to a linear model, and mostly following the description in Mancuso et al. (2019), the model is as follows. First, phenotype is a linear function of gene expression of genes of interest, residual genetic effects, and environmental effects, $y = M\alpha + a_r + \epsilon$, where matrix $M$ contains expression levels of genes of interest, vector $\alpha$ contains the associated effect sizes, $a_r$ contains genetic effects not mediated through expression levels of the genes of interest, and $\epsilon$ is environmental noise. Second, the expression levels $M$ are a linear function of genotype weighted by expression quantitative trait loci (eQTL) effects, i.e., a genetic model for the expression levels. In these papers (Mancuso et al. 2019; Qian et al. 2019), the focus is on testing associations, which is different from our aim of genetic evaluation, where the focus is on prediction of genetic effects. However, this type of joint model for phenotypes and intermediate omics traits seems to be an appropriate approach also for genetic evaluation.

The aim of this paper is to develop models and methods for genetic evaluation including the whole intermediate omics features. First, a joint model for phenotypes and intermediate omics data is presented, and a formula for breeding values is derived. Second, best linear unbiased prediction (BLUP) and restricted maximum likelihood (REML) estimation are presented for the specific situation with complete omics data. Third, BLUP is presented for the case with incomplete omics data. Fourth, a simulated data set is analyzed to illustrate the model and methods.

Methods

We assume phenotypes on a trait of interest, and that omics data are summarized into intensities or expression levels of different omics features. For simplicity, we will use the term “expression levels” in the remaining part of the paper. An assumption is that expression levels have been normalized and standardized such that magnitudes of different features are of similar size. In principle, the approach is robust to correlated omics features, but for highly correlated omics features, a possibility would be to decorrelate during the normalization procedure.

First, we present the model, and then we present inference in the model with complete and incomplete omics data, respectively. Finally, the simulated data are presented.

Model for phenotypes and intermediate omics data

We assume that omics expression levels affect the phenotype, but not the opposite. We do not require that all features have a known effect on the trait, but we will allow these effects to be estimated from data. We assume that individuals are genetically related with an additive genetic relationship matrix $H$ constructed from pedigree (Henderson 1976), or marker genotypes (VanRaden 2008), or both as in Legarra et al. (2014), that there are $k$ omics features, and that for $i = 1,\ldots,k$, $m_i$ is a vector containing expression levels of feature $i$ for the individuals.

The model is

$$y = X\beta + ZM\alpha + Z_r a_r + \epsilon, \tag{1}$$
$$m_i = \tilde{X}\tilde{\beta}_i + \tilde{Z} g_i + e_i, \quad i = 1,\ldots,k. \tag{2}$$

Equation (1) describes the relationship between phenotypes and omics expression levels, and here $y$ is the vector of phenotypes, $\beta$ is the vector containing fixed effects, matrix $X$ relates fixed effects to phenotypes, matrix $Z$ relates individuals with omics data to phenotypes, matrix $M$ contains $m_1,\ldots,m_k$ as columns, vector $\alpha$ contains regression effects of omics expression levels on phenotypes, vector $a_r$ contains residual genetic effects, i.e., genetic effects on the phenotypes that are not mediated through the observed omics expression levels, matrix $Z_r$ relates genetic effects to phenotypes, and $\epsilon$ is the vector of residuals. Equation (2) for omics feature $i = 1,\ldots,k$ describes the relationship between omics expression levels $m_i$ and genetic effects for the individuals, where matrix $\tilde{X}$ relates fixed effects to omics expression levels (assuming that variables in fixed effects are the same for different omics features), matrix $\tilde{Z}$ relates individuals to omics expression levels, and for the $i$th omics feature, vector $\tilde{\beta}_i$ contains fixed effects, vector $g_i$ is the vector of genetic effects, and $e_i$ is the vector of residuals. The genetic effects are specified for all individuals of interest, i.e., including individuals without phenotypes and omics data, and incidence matrices $Z_r$ and $\tilde{Z}$ are defined accordingly. We note the relationship between incidence matrices, $Z_r = Z\tilde{Z}$, and in the case where the variables in the fixed effects are the same in the two parts of the model, then $X = Z\tilde{X}$. Accordingly, let $G$ be the matrix containing $g_1,\ldots,g_k$ as columns, and $E$ be the matrix containing $e_1,\ldots,e_k$ as columns, so that $M = \tilde{X}\tilde{B} + \tilde{Z}G + E$, where $\tilde{B} = [\tilde{\beta}_1,\ldots,\tilde{\beta}_k]$. Note that here, matrix $G$ is not a genomic relationship matrix.

Distributional assumptions are that $\alpha \sim N(0, \sigma_\alpha^2 I)$, $\epsilon \sim N(0, \sigma_\epsilon^2 I)$, $a_r \sim N(0, \sigma_{a,r}^2 H)$, $g_i \sim N(0, \sigma_{g,i}^2 H)$ and $e_i \sim N(0, \sigma_{e,i}^2 I)$ for $i = 1,\ldots,k$, where all these vectors are independent. For computational reasons described later, when needed, we further make the assumption that all the omics features have the same heritability $h_m^2 = \sigma_{g,i}^2/(\sigma_{e,i}^2 + \sigma_{g,i}^2)$ not depending on $i$, even if variances $\sigma_{e,i}^2, \sigma_{g,i}^2$ differ across $i$.
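The hierarchical structure of equations (1) and (2) may be easier to follow when written as a sampling scheme. The following is a minimal sketch only, under simplifying assumptions that are not part of the paper: no fixed effects, and every individual has exactly one phenotype and one omics record, so that all incidence matrices are identity matrices. The function name `simulate_joint_model` and its arguments are hypothetical.

```python
import numpy as np

def simulate_joint_model(H, k, var_alpha, var_ar, var_eps, var_g, var_e, seed=0):
    """Draw (y, M, alpha, G, a_r) from equations (1)-(2).

    Illustrative simplifications (assumptions of this sketch): no fixed
    effects, and Z, Z_r and Z-tilde are all identity matrices.
    var_g and var_e are length-k arrays of per-feature variances.
    """
    rng = np.random.default_rng(seed)
    n = H.shape[0]
    L = np.linalg.cholesky(H)                         # H = L L^T
    alpha = rng.normal(0.0, np.sqrt(var_alpha), k)    # alpha ~ N(0, var_alpha I)
    a_r = L @ rng.normal(0.0, np.sqrt(var_ar), n)     # a_r ~ N(0, var_ar H)
    G = (L @ rng.normal(size=(n, k))) * np.sqrt(var_g)  # column i ~ N(0, var_g[i] H)
    E = rng.normal(size=(n, k)) * np.sqrt(var_e)        # column i ~ N(0, var_e[i] I)
    M = G + E                                         # equation (2) without fixed effects
    y = M @ alpha + a_r + rng.normal(0.0, np.sqrt(var_eps), n)  # equation (1)
    return y, M, alpha, G, a_r
```

With these draws, the quantity `G @ alpha + a_r` is the vector of true breeding values in the sense of the model, which is convenient when checking predictions on simulated data.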

Our proposed model has some strong similarities to the approach in Weishaar et al. (2020), although the authors do not view their approach as a joint model, and they also do not emphasize the second part of the model in equation (2) in their paper. A difference to Weishaar et al. (2020) is that in their paper the first part of the model in equation (1) does not contain the residual genetic effect ar, which instead is incorporated as a final step in a selection index. Our model is also a particular case of a structural equation model (SEM; Gianola and Sorensen 2004; Varona et al. 2007), but with some additional features to simplify the complexity of the problem: the omics features are uncorrelated to each other, and do not influence each other. Also, using a prior distribution for α introduces more parsimony and seems a natural choice, assuming that none of the omics features should greatly influence the phenotype. This particular SEM can be seen as a parsimonious multiple trait model, contrary to the full SEM which is equivalent to a classical multiple trait model (Varona et al. 2007).

It is useful in such a context to revisit the concept of breeding values, as the genotypic effect of an individual can now be separated into omics-mediated effects (with corresponding breeding values) and residual genetic effects (with breeding value). In Appendix A, we derive that the vector of breeding values is

$$a = G\alpha + a_r. \tag{3}$$

In other words, breeding value equals the genetic effects on omics expression levels times the effect of omics data on phenotype, plus the genetic effect not mediated by omics data. Similar to Weishaar et al. (2020), our model therefore contains an effect $a_m = G\alpha$ mediated by omics data $M$ on breeding value. Different from Weishaar et al. (2020), our model directly contains an effect $a_r$ of genes on the trait not being mediated through the omics expression levels, i.e., our model partitions the breeding value into these two genetic effects. In the following, we will denote $a_m$ as the omics-mediated breeding value and $a_r$ as the residual breeding value.

The model in (1) and (2) can be viewed as an extension of the usual model for genetic evaluation

$$y = X\beta + Za + e,$$

in the sense that with omics expression levels being unobserved for all individuals, inference in this model will be the same as inference in a linear Gaussian mixed model with first- and second-order moments derived from the model in (1) and (2) by marginalizing the unobserved omics expression levels, which is explained in detail in Appendix B. In other words, if nothing is known about omics expression levels, their genetic parts $g_i\alpha_i$, $i = 1,\ldots,k$, combine with the genetic effect $a$ for the usual genetic model, and the residual parts $e_i\alpha_i$, $i = 1,\ldots,k$, combine with the usual residual effect $e$. We therefore conclude that the model in equations (1) and (2) is an extension of the usual model for genetic evaluation, at least as long as the variables in the fixed effects in both equations (1) and (2) are the same as in the usual genetic evaluation model. Note that in case the variables in the fixed effects are different in equations (1) and (2), then the variables in the fixed effects in the usual model for genetic evaluation should be a superset of those in the two equations.

For interpretation purposes, variances and ratios of variances for the model in (1) and (2) are now derived. The phenotypic variance is $\sigma_y^2 = \sum_i \sigma_{m,i}^2\sigma_\alpha^2 + \sigma_{a,r}^2 + \sigma_\epsilon^2$, where $\sigma_{m,i}^2 = \sigma_{g,i}^2 + \sigma_{e,i}^2$, and this can be decomposed in two ways. First, the genetic variance is $\sigma_{a,m}^2 + \sigma_{a,r}^2$, where $\sigma_{a,m}^2 = \sum_i \sigma_{g,i}^2\sigma_\alpha^2$ is the variance of the genetic effects mediated by the intermediate omics data and $\sigma_{a,r}^2$ is the residual genetic variance, and the environmental variance is $\sigma_{e,m}^2 + \sigma_\epsilon^2$, where $\sigma_{e,m}^2 = \sum_i \sigma_{e,i}^2\sigma_\alpha^2$ is the variance of the nongenetic effects mediated by the intermediate omics data and $\sigma_\epsilon^2$ is the residual variance. Second, the variance of phenotype mediated by omics data equals $\sum_i \sigma_{m,i}^2\sigma_\alpha^2$, and the variance not mediated by omics data equals $\sigma_{a,r}^2 + \sigma_\epsilon^2$. From these variances, we obtain that the heritability equals $h^2 = (\sigma_{a,m}^2 + \sigma_{a,r}^2)/\sigma_y^2$, and the proportion of variance explained by omics data equals $c_m^2 = \sum_i \sigma_{m,i}^2\sigma_\alpha^2/\sigma_y^2$. The term $c_m^2$ was named “microbiability” by Difford et al. (2018) in the context of microbiota data, whereas we here will use the term “omics variance ratio” to cover the range of possible intermediate omics traits. Assuming that all omics features have the same heritability $h_m^2 = \sigma_{g,i}^2/(\sigma_{g,i}^2 + \sigma_{e,i}^2)$, i.e., $h_m^2$ does not depend on $i$, we see that

$$h^2 = c_m^2 h_m^2 + h_r^2,$$

where $h_r^2 = \sigma_{a,r}^2/\sigma_y^2$. We therefore have a decomposition of the heritability of the phenotype into an omics-mediated heritability, $c_m^2 h_m^2$, being a product of the omics variance ratio and heritability of the omics expression levels, and residual heritability $h_r^2$. This has consequences for the interpretation of studies, e.g., a high omics variance ratio (microbiability) can coexist with a low trait heritability.
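The decomposition of heritability can be checked with a few lines of arithmetic. The sketch below computes the three ratios from variance components; the function name `variance_ratios` and the numerical values are hypothetical and chosen only to illustrate that a high omics variance ratio can coexist with a low trait heritability.

```python
def variance_ratios(var_alpha, var_ar, var_eps, var_g, var_e):
    """Return (h2, c2m, h2r) for the joint model.

    var_g and var_e are sequences with one genetic and one residual
    variance per omics feature.
    """
    var_m = [g + e for g, e in zip(var_g, var_e)]        # sigma^2_{m,i}
    var_y = var_alpha * sum(var_m) + var_ar + var_eps    # phenotypic variance
    h2 = (var_alpha * sum(var_g) + var_ar) / var_y       # heritability
    c2m = var_alpha * sum(var_m) / var_y                 # omics variance ratio
    h2r = var_ar / var_y                                 # residual heritability
    return h2, c2m, h2r

# Hypothetical example: 100 features, each with heritability h2m = 0.1.
h2m = 0.1
h2, c2m, h2r = variance_ratios(0.01, 0.05, 0.2, [0.1] * 100, [0.9] * 100)
# Here c2m is about 0.8 while h2 is only about 0.12,
# and h2 agrees with c2m * h2m + h2r up to rounding.
```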

Inference in the model with complete omics data

Here, we consider the situation with complete omics data, i.e., individuals with phenotypes are a subset of individuals with omics data. This would for instance be the case for selection candidates, which are measured for omics but not for the phenotype of interest, which occurs later in life.

Method 1:

The likelihood function of the data, $p(y, M)$, splits into a product

$$p(y, M) = \iiint p(y \mid M, \alpha, a_r)\, p(M \mid G)\, p(\alpha)\, p(G)\, p(a_r)\, dG\, d\alpha\, da_r = \iint p(y \mid M, \alpha, a_r)\, p(\alpha)\, p(a_r)\, d\alpha\, da_r \int p(M \mid G)\, p(G)\, dG.$$

Similarly, the conditional density of random effects also splits into a product

$$p(\alpha, G, a_r \mid y, M) = p(y \mid M, \alpha, a_r)\, p(M \mid G)\, p(G)\, p(\alpha)\, p(a_r) / p(y, M) = p(\alpha, a_r \mid y, M)\, p(G \mid M).$$

In both these products, the first term contains parameters $\sigma_\alpha^2, \sigma_{a,r}^2, \sigma_\epsilon^2$ and random effects $\alpha$ and $a_r$, and the second term contains parameters $\sigma_g^2, \sigma_e^2$, and random effects $G$. Inference about parameters in the model should be conducted using the likelihood function of the data, and inference about random effects should be conducted using the conditional density of random effects. Since these both split into products with each random effect and each parameter appearing in only one of the terms, inference can therefore be conducted separately for first, $\sigma_\alpha^2, \sigma_{a,r}^2, \sigma_\epsilon^2$, $\alpha$ and $a_r$, conditional on $M$ by using the model in equation (1), and second $\sigma_g^2, \sigma_e^2$, and $G$ using the model in equation (2). In both cases, these equations describe a linear mixed model, and therefore parameters can be estimated by using REML and effects can be predicted by using BLUP. Below, we describe BLUP for these two models.

First, BLUP of regression effects of omics expression levels and residual genetic effects are obtained as solutions to the mixed model equations (MME)

$$\begin{bmatrix} X^T X & X^T ZM & X^T Z_r \\ (ZM)^T X & (ZM)^T ZM + \xi_1 I & (ZM)^T Z_r \\ Z_r^T X & Z_r^T ZM & Z_r^T Z_r + \xi_2 H^{-1} \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{\alpha} \\ \hat{a}_r \end{bmatrix} = \begin{bmatrix} X^T y \\ (ZM)^T y \\ Z_r^T y \end{bmatrix}, \tag{4}$$

where $\xi_1 = \sigma_\epsilon^2/\sigma_\alpha^2$ and $\xi_2 = \sigma_\epsilon^2/\sigma_{a,r}^2$. Note that $ZM$ is the submatrix of $M$ for individuals with phenotype, and omics expression levels on individuals without phenotype are not included in this first step. The MME in equation (4) is in essence identical to SNP-BLUP with residual polygenic effects, where the omics expression levels enter in place of the genotype codes. Prediction of phenotypes (without fixed effects) is $M\hat{\alpha} + \tilde{Z}\hat{a}_r$, and this might be useful in itself, for instance in plant breeding with clonal propagation (e.g., poplar trees; Stanton et al. 2010). However, we need a further step to predict breeding values, as the $M$ levels are not fully transmitted from parents to offspring.
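As an illustration of the first step, equation (4) can be assembled and solved with dense linear algebra. This is a sketch for small problems only, assuming that the variance ratios $\xi_1$ and $\xi_2$ and the inverse relationship matrix are already available; the function name `solve_mme4` is hypothetical, and a production system would use sparse or iterative solvers instead.

```python
import numpy as np

def solve_mme4(y, X, Z, M, Zr, Hinv, xi1, xi2):
    """Solve the mixed model equations (4) by a dense direct solve.

    Returns (beta_hat, alpha_hat, ar_hat). Intended for small examples.
    """
    W = Z @ M                      # omics covariates for phenotyped individuals
    p, k = X.shape[1], M.shape[1]
    lhs = np.block([
        [X.T @ X,  X.T @ W,                    X.T @ Zr],
        [W.T @ X,  W.T @ W + xi1 * np.eye(k),  W.T @ Zr],
        [Zr.T @ X, Zr.T @ W,                   Zr.T @ Zr + xi2 * Hinv],
    ])
    rhs = np.concatenate([X.T @ y, W.T @ y, Zr.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:p], sol[p:p + k], sol[p + k:]
```

The returned `alpha_hat` and `ar_hat` correspond to $\hat{\alpha}$ and $\hat{a}_r$ in equation (4).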

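Second, BLUP of genetic effects $g_i$ are solutions to the MME

$$\begin{bmatrix} \tilde{X}^T\tilde{X} & \tilde{X}^T\tilde{Z} \\ \tilde{Z}^T\tilde{X} & \tilde{Z}^T\tilde{Z} + \zeta_i H^{-1} \end{bmatrix} \begin{bmatrix} \hat{\tilde{\beta}}_i \\ \hat{g}_i \end{bmatrix} = \begin{bmatrix} \tilde{X}^T m_i \\ \tilde{Z}^T m_i \end{bmatrix}, \tag{5}$$

for $i = 1,\ldots,k$, where $\zeta_i = \sigma_{e,i}^2/\sigma_{g,i}^2$. Assuming constant heritability of omics features, then $\zeta_i = \sigma_{e,i}^2/\sigma_{g,i}^2 = (1 - h_m^2)/h_m^2$ does not depend on $i$, i.e., $\zeta = \zeta_i$. The coefficient matrix in equation (5) does therefore not depend on $i$, and the systems of equations can be solved simultaneously by solving

$$\begin{bmatrix} \tilde{X}^T\tilde{X} & \tilde{X}^T\tilde{Z} \\ \tilde{Z}^T\tilde{X} & \tilde{Z}^T\tilde{Z} + \zeta H^{-1} \end{bmatrix} \begin{bmatrix} \hat{\tilde{B}} \\ \hat{G} \end{bmatrix} = \begin{bmatrix} \tilde{X}^T M \\ \tilde{Z}^T M \end{bmatrix},$$

where matrix $\tilde{B} = [\tilde{\beta}_1,\ldots,\tilde{\beta}_k]$.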
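Because the coefficient matrix is shared across features, the $k$ systems in equation (5) reduce to a single solve with a matrix right-hand side. A minimal dense sketch, assuming a common $\zeta$ and a precomputed inverse relationship matrix (the function name `solve_mme5_all` is hypothetical):

```python
import numpy as np

def solve_mme5_all(M, Xt, Zt, Hinv, zeta):
    """Solve equation (5) for all omics features at once.

    Xt and Zt play the roles of X-tilde and Z-tilde. Returns
    (B_hat, G_hat) with one column per omics feature.
    """
    p = Xt.shape[1]
    lhs = np.block([
        [Xt.T @ Xt, Xt.T @ Zt],
        [Zt.T @ Xt, Zt.T @ Zt + zeta * Hinv],
    ])
    rhs = np.vstack([Xt.T @ M, Zt.T @ M])   # one right-hand side per feature
    sol = np.linalg.solve(lhs, rhs)         # the coefficient matrix is used once
    return sol[:p], sol[p:]
```

Given $\hat{\alpha}$ and $\hat{a}_r$ from equation (4), the estimated breeding values of Method 1 would then be computed as `G_hat @ alpha_hat + ar_hat`.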

EBVs are computed from the solutions of these MME as follows. Using that $G$ and $\alpha$ are conditionally independent given $(y, M)$, then the EBVs equal

$$\hat{a} = E[a \mid y, M] = E[G\alpha + a_r \mid y, M] = E[G \mid M]\, E[\alpha \mid y, M] + \hat{a}_r = \hat{G}\hat{\alpha} + \hat{a}_r,$$

where predicted effects $\hat{\alpha}$, $\hat{G} = [\hat{g}_1,\ldots,\hat{g}_k]$, $\hat{a}_r$ are solutions to equations (4) and (5). Our approach has some similarity to the approach by Weishaar et al. (2020). In this paper, they solve equation (4) without the $a_r$ effect, solve equation (5), compute $\hat{G}\hat{\alpha}$ to obtain the heritable contribution on the phenotypes mediated by the microbiota composition (the intermediate traits in their context, corresponding to omics data in our context), compute the contribution from genes directly affecting the trait by solving MME from the usual genetic model with a genomic relationship matrix, and finally, they construct a selection index from these two genetic components. From our results, we see that $\hat{G}\hat{\alpha}$ is indeed the EBV in the case where the whole genetic effect is mediated through the intermediate trait. However, when this is not the case and $\hat{\alpha}$ is computed from equation (4) without the $a_r$ effect, then $\hat{\alpha}$ is not BLUP in the correct model, and $\hat{G}\hat{\alpha}$ is not BLUP of the heritable contribution mediated by the intermediate trait. Therefore, the approach by Weishaar et al. (2020) only has a theoretical foundation when the whole genetic effect is mediated through the intermediate trait, i.e., when there is no residual genetic effect $a_r$.

Solving equations (4) and (5) provides predictions of effects in the model, and from these, predicted genetic effects on individuals can be computed.

Method 2:

EBVs for the model in equations (1) and (2) can be derived in an equivalent way that directly provides predicted genetic effects on individuals. First, we note that the property of BLUP implies that $E[g_i \mid m_i]$ is a linear function of $m_i$ for $i = 1,\ldots,k$, which can also be seen from equation (5). From equation (5), we in addition see that when assuming constant heritability of omics features, then the coefficient matrix does not depend on $i$, and therefore $E[g_i \mid m_i] = \tilde{C} m_i$, where $\tilde{C}$ is a matrix that does not depend on $m_i$ and $i$. This gives that $E[G\alpha \mid y, M, \alpha] = \sum_i E[g_i \mid y, M, \alpha]\alpha_i = \sum_i E[g_i \mid m_i]\alpha_i = \tilde{C}M\alpha$, which is a linear function of $M\alpha$, and therefore equals $E[G\alpha \mid M\alpha]$. Using the formula for conditional expectation, we obtain

$$\hat{a} = E[G\alpha \mid y, M] + E[a_r \mid y, M] = E\big[E[G\alpha \mid y, M, \alpha] \mid y, M\big] + E[a_r \mid y, M] = E[G\alpha \mid M\alpha]\big|_{M\alpha = E[M\alpha \mid y, M]} + E[a_r \mid y, M],$$

where notation $E[u \mid v]\big|_{v = v_0}$ refers to $E[u \mid v]$ being a function of $v$ and evaluated at $v = v_0$, and the last equality is due to $E[G\alpha \mid M\alpha]$ being a linear function of $M\alpha$ and expectation being linear. Here, $M\alpha$ is a vector containing the sum of omics contributions on phenotype for each individual, and the formula above shows that given the assumption of common heritabilities of omics features, a procedure for prediction of breeding values can be formulated, where BLUP of $M\alpha$ is computed instead of BLUP of $\alpha$.

Therefore, for the computation of EBV, we need to compute $E[M\alpha \mid y, M]$, $E[a_r \mid y, M]$, and $E[G\alpha \mid M\alpha]$ with $E[M\alpha \mid y, M]$ substituted for $M\alpha$. Denoting $u = M\alpha$, this can be done in two steps. First, $\hat{u} = E[u \mid y, M]$ and $\hat{a}_r = E[a_r \mid y, M]$ are obtained by solving MME

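$$\begin{bmatrix} X^T X & X^T Z & X^T Z_r \\ Z^T X & Z^T Z + \xi_1 (MM^T)^{-1} & Z^T Z_r \\ Z_r^T X & Z_r^T Z & Z_r^T Z_r + \xi_2 H^{-1} \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{u} \\ \hat{a}_r \end{bmatrix} = \begin{bmatrix} X^T y \\ Z^T y \\ Z_r^T y \end{bmatrix}, \tag{6}$$

where $\xi_1$ and $\xi_2$ have been defined previously. Note that the solution from these MME is equivalent to the solution from MME in equation (4), which follows from combining formulas 5 and 6 in Strandén and Garrick (2009). Therefore, this is also in essence identical to GBLUP with residual polygenic effects, where the omics similarity matrix $MM^T$ enters in place of a genomic relationship matrix. For all individuals, we are predicting their individual contributions of omics to phenotypes, $\hat{u}$. Second, $\hat{u}$ is plugged into the MME below, which is solved to obtain the solution for $a_m = G\alpha$, i.e., solving

$$\begin{bmatrix} \tilde{X}^T\tilde{X} & \tilde{X}^T\tilde{Z} \\ \tilde{Z}^T\tilde{X} & \tilde{Z}^T\tilde{Z} + \zeta H^{-1} \end{bmatrix} \begin{bmatrix} \widehat{\tilde{\beta}\alpha} \\ \hat{a}_m \end{bmatrix} = \begin{bmatrix} \tilde{X}^T\hat{u} \\ \tilde{Z}^T\hat{u} \end{bmatrix}, \tag{7}$$

where $\zeta = \sigma_e^2/\sigma_{a,m}^2$ with $\sigma_e^2 = \sum_i \sigma_{e,i}^2\sigma_\alpha^2$ and $\sigma_{a,m}^2 = \sum_i \sigma_{g,i}^2\sigma_\alpha^2$. Note that $\zeta$ is equivalent to the previously defined $\zeta = \sigma_{e,i}^2/\sigma_{g,i}^2 = (1 - h_m^2)/h_m^2$ for all $i = 1,\ldots,k$, since equal heritability of omics features is assumed. In this step, the predicted additive genetic part, $\hat{a}_m$, of the predicted overall omics effect, $\hat{u}$, is computed. Note that $\hat{u}$ is an omics-derived phenotype, and the equation shows that the appropriate heritability to use for such a phenotype is equal to the heritability of omics features, $h_m^2$, based on the assumptions in this paper. The EBV equals $\hat{a} = \hat{a}_m + \hat{a}_r$.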
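The stated equivalence of Methods 1 and 2 can be checked numerically on a toy data set. The sketch below is an illustration under assumptions not made in the paper: an intercept as the only fixed effect, one phenotype and one omics record per individual (all incidence matrices are identities), arbitrary variance ratios treated as known, and at least as many omics features as individuals so that $MM^T$ is invertible. All variable and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 8                      # k is at least n, so M M^T is invertible
M = rng.normal(size=(n, k))      # toy expression levels
X = np.ones((n, 1))              # intercept only; also used as X-tilde
H = 0.5 * np.eye(n) + 0.5        # toy relationship matrix (positive definite)
Hinv = np.linalg.inv(H)
y = rng.normal(size=n)
xi1, xi2, zeta = 2.0, 3.0, 1.5   # variance ratios, treated as known
I = np.eye(n)

def solve_genetic_step(records):
    """MME (5)/(7) with Z-tilde = I; returns the genetic solutions only."""
    lhs = np.block([[X.T @ X, X.T], [X, I + zeta * Hinv]])
    rhs = np.concatenate([X.T @ records, records])
    return np.linalg.solve(lhs, rhs)[1:]

# Method 1: equation (4) for alpha and a_r, equation (5) for G.
lhs4 = np.block([[X.T @ X, X.T @ M, X.T],
                 [M.T @ X, M.T @ M + xi1 * np.eye(k), M.T],
                 [X, M, I + xi2 * Hinv]])
sol4 = np.linalg.solve(lhs4, np.concatenate([X.T @ y, M.T @ y, y]))
alpha_hat, ar_hat_1 = sol4[1:1 + k], sol4[1 + k:]
G_hat = solve_genetic_step(M)
ebv_1 = G_hat @ alpha_hat + ar_hat_1

# Method 2: equation (6) for u = M alpha and a_r, equation (7) for a_m.
lhs6 = np.block([[X.T @ X, X.T, X.T],
                 [X, I + xi1 * np.linalg.inv(M @ M.T), I],
                 [X, I, I + xi2 * Hinv]])
sol6 = np.linalg.solve(lhs6, np.concatenate([X.T @ y, y, y]))
u_hat, ar_hat_2 = sol6[1:1 + n], sol6[1 + n:]
ebv_2 = solve_genetic_step(u_hat) + ar_hat_2

# Both routes should agree up to numerical precision.
print(np.allclose(M @ alpha_hat, u_hat), np.allclose(ebv_1, ebv_2))
```

The design choice between the two routes is computational: equation (4) has one equation per omics feature, whereas equation (6) has one equation per individual with omics data.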

Each of the two MME above corresponds to a linear mixed model. Equation (6) corresponds to the model

$$y = X\beta + Zu + Z_r a_r + \epsilon,$$

where $u \sim N(0, \sigma_\alpha^2 MM^T)$ is an overall effect of omics on the trait $y$, $a_r \sim N(0, \sigma_{a,r}^2 H)$ is a residual genetic effect and $\epsilon \sim N(0, \sigma_\epsilon^2 I)$ is a residual effect. Equation (7) with $\hat{u} = M\hat{\alpha}$ as pseudo-records, $\theta = \tilde{\beta}\alpha$ and $\nu = e\alpha$ corresponds to the model

$$\hat{u} = \tilde{X}\theta + \tilde{Z}a_m + \nu,$$

where $a_m \sim N(0, \sigma_{a,m}^2 H)$ is an overall breeding value across all omics affecting the phenotype, and $\nu \sim N(0, \sigma_e^2 I)$. REML estimation of parameters in these two models provides estimates of $\xi_1$ and $\xi_2$ in (6), and $\zeta$ in (7).

Solving the MME for each of these two last models provides us with BLUP in the model defined by equations (1) and (2), but note that the combination of these two last models does not actually specify a joint model for phenotypes $y$ and omics expression levels $M$, whereas the combination of (1) and (2) does define a joint model. The fact that only five parameters (one residual variance for each of the models, and respectively, two and one genetic variances for the two models) need to be estimated, and that this can be done using standard software for REML estimation, makes this approach to compute EBVs for the model in equations (1) and (2) attractive.

When a genomic relationship matrix is used, we name the method GOBLUP, from Genomic Omics BLUP. When pedigree or combined pedigree and genomic relationships are used, the methods could, respectively, be called AOBLUP or HOBLUP (from matrices A and H, respectively). In the simulated example below, we will use full genomic matrices and hence we will use the wording GOBLUP. Similar names [M-BLUP by Camarinha-Silva et al. (2017), where M stands for Metagenomics; TBLUP and GTBLUP by Morgante et al. (2020), where T stands for transcriptomics] have been used, but they were restricted to metagenome or transcriptome and did not consider the influence of the genetics of the individual on the metagenome/transcriptome.

Method 3:

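Finally, we present a third equivalent approach for computing EBVs for the model in equations (1) and (2). Using rules for conditional expectations, EBVs can be expressed as $\hat{a} = E\big[E[G\alpha + a_r \mid y, M, G, \tilde{B}] \mid y, M\big]$. Since $E[G\alpha + a_r \mid y, M, G, \tilde{B}] = \sum_i g_i E[\alpha_i \mid y, M] + E[a_r \mid y, M]$, where the first term is linear in $G$ and does not depend on $\tilde{B}$, and the second term does not depend on either $G$ or $\tilde{B}$, we see that

$$\hat{a} = E\big[E[G\alpha + a_r \mid y, M, G, \tilde{B}] \mid y, M\big] = E[G\alpha + a_r \mid y, M, G, \tilde{B}]\big|_{G = \hat{G},\, \tilde{B} = \hat{\tilde{B}}},$$

where $\hat{\tilde{B}} = E[\tilde{B} \mid M]$ and $\hat{G} = E[G \mid M]$. Since $M = \tilde{X}\tilde{B} + \tilde{Z}G + E$, and there is a one-to-one correspondence between $(\tilde{B}, G, M)$ and $(\tilde{B}, G, E)$, we obtain that

$$\hat{a} = E[G\alpha + a_r \mid y, G, E, \tilde{B}]\big|_{G = \hat{G},\, E = \hat{E},\, \tilde{B} = \hat{\tilde{B}}},$$

where $\hat{E} = E[E \mid M]$. The approach therefore consists of first solving equation (5) to obtain $\hat{g}_i$ and $\hat{\tilde{\beta}}_i$, for $i = 1,\ldots,k$, stacking these into matrices $\hat{G}$ and $\hat{\tilde{B}}$, respectively, and computing $\hat{E} = M - \tilde{X}\hat{\tilde{B}} - \tilde{Z}\hat{G}$. Then $M\alpha = \tilde{X}\hat{\tilde{B}}\alpha + \tilde{Z}\tilde{a}_m + \tilde{e}_m$, where $\tilde{a}_m = \hat{G}\alpha$ and $\tilde{e}_m = \hat{E}\alpha$, and the model for inferring the effects becomes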

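$$y = X\beta + Z\tilde{X}\hat{\tilde{B}}\alpha + Z\tilde{Z}\tilde{a}_m + Z\tilde{e}_m + Z_r a_r + \epsilon.$$

Since incidence matrices satisfy $Z\tilde{Z} = Z_r$ and $Z\tilde{X} = X$, then $X\beta + Z\tilde{X}\hat{\tilde{B}}\alpha = X(\beta + \hat{\tilde{B}}\alpha)$ and inference is the same as in the model with $Z\tilde{X}\hat{\tilde{B}}\alpha$ excluded, and BLUP can be obtained from solving MME for the model

$$y = X\beta + Z_r\tilde{a}_m + Z\tilde{e}_m + Z_r a_r + \epsilon,$$

where

$$\mathrm{Var}\begin{bmatrix} \tilde{a}_m \\ \tilde{e}_m \end{bmatrix} = \sigma_\alpha^2 Q,$$

with

$$Q = \begin{bmatrix} \hat{G}\hat{G}^T & \hat{G}\hat{E}^T \\ \hat{E}\hat{G}^T & \hat{E}\hat{E}^T \end{bmatrix}. \tag{8}$$

Note that since BLUP $\hat{\tilde{a}}_m = E[\hat{G}\alpha \mid y, M] = E[G\alpha \mid y, M] = \hat{a}_m$, then BLUP of $\tilde{a}_m$ is actually BLUP of $a_m$. Therefore, $\hat{a}_m$ and $\hat{a}_r$ are obtained by solving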

$$\begin{bmatrix} X^T X & X^T [Z_r \; Z] & X^T Z_r \\ [Z_r \; Z]^T X & [Z_r \; Z]^T [Z_r \; Z] + \xi_1 Q^{-1} & [Z_r \; Z]^T Z_r \\ Z_r^T X & Z_r^T [Z_r \; Z] & Z_r^T Z_r + \xi_2 H^{-1} \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \begin{bmatrix} \hat{a}_m \\ \hat{\tilde{e}}_m \end{bmatrix} \\ \hat{a}_r \end{bmatrix} = \begin{bmatrix} X^T y \\ [Z_r \; Z]^T y \\ Z_r^T y \end{bmatrix}, \tag{9}$$

where $\xi_1$ and $\xi_2$ have been defined previously, and $Q$ is defined in equation (8). Note that matrix $Q$ is a quadratic form of $[\hat{G}, \hat{E}]$, and this matrix is therefore not full rank when the number of omics features is smaller than two times the number of individuals. In that case, a small number should be added to the diagonal to make the matrix invertible, which in combination with the quadratic form would also allow the use of the Woodbury matrix identity for inversion. For computational reasons, this approach of solving equation (5) followed by solving equation (9) is of no practical relevance for the complete omics case presented here, but we will return to it in the incomplete omics case in the next subsection.
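The remark about the Woodbury matrix identity can be made concrete. Writing $W$ for $\hat{G}$ stacked on top of $\hat{E}$, so that $Q = WW^T$, the regularized inverse is $(\delta I + WW^T)^{-1} = \delta^{-1}\big(I - W(\delta I_k + W^T W)^{-1}W^T\big)$, which only requires solving a $k \times k$ system. The sketch below assumes a user-chosen ridge value `delta`, which is not specified in the paper, and the function name is hypothetical.

```python
import numpy as np

def regularized_q_inverse(G_hat, E_hat, delta=1e-3):
    """Inverse of Q + delta * I via the Woodbury identity.

    Q = W W^T with W = [G_hat; E_hat] stacked row-wise (2n x k), so only
    a k x k system is solved instead of inverting a 2n x 2n matrix.
    """
    W = np.vstack([G_hat, E_hat])
    k = W.shape[1]
    small = delta * np.eye(k) + W.T @ W      # k x k inner matrix
    return (np.eye(W.shape[0]) - W @ np.linalg.solve(small, W.T)) / delta
```

This pays off when the number of omics features is small relative to twice the number of individuals, which is exactly the rank-deficient case described above.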

BLUP in the model with incomplete omics data

Here, we consider the case with incomplete omics data, meaning that some individuals (for instance, only individuals in a database born after a certain year, or males in test stations and artificial insemination centers) are being measured for omics whereas other individuals are not. We denote the subset of individuals with omics data by subscript $o$, and the remaining individuals by subscript $no$. The phenotypes are then contained in vectors $y_{no}$ and $y_o$, and omics expression levels are contained in matrix $M_o$. Concerning fixed effects, we make the restrictive assumption that these are specific to the two sets, so that $\beta_{no}$ and $\beta_o$ are vectors of fixed effects, and matrices $X_{no}$ and $X_o$ relate fixed effects in $\beta_{no}$ to phenotypes in $y_{no}$ and fixed effects in $\beta_o$ to phenotypes in $y_o$, respectively. This assumption means for example that each contemporary group (corresponding to herd-year-season effects) is either entirely with or entirely without omics data (if necessary, a contemporary group effect should be split into two), and that covariates like age or weight have separate regression coefficients for the two subsets. Concerning the incidence matrices, assuming that individuals are sorted such that those with omics come last, then

$$Z = \begin{bmatrix} Z_{no} & 0 \\ 0 & Z_o \end{bmatrix},$$

and similarly for $\tilde{Z}$ and $Z_r$, and the restriction above on fixed effects being separate for the two subsets implies that $\tilde{Z}_o = I$, $Z_{r,o} = Z_o$ and $Z_{r,no} = Z_{no}$. The model in (1) and (2) with incomplete omics data then becomes

$$y_{no} = X_{no}\beta_{no} + Z_{no}(G_{no} + E_{no})\alpha + Z_{no}a_{r,no} + \epsilon_{no}, \tag{10}$$
$$y_o = X_o\beta_o + Z_o M_o\alpha + Z_o a_{r,o} + \epsilon_o, \tag{11}$$
$$m_{o,i} = \tilde{X}_o\tilde{\beta}_{o,i} + g_{o,i} + e_{o,i}, \quad i = 1,\ldots,k, \tag{12}$$

where the notation is similar to the previously used notation, except that subindices $o$ and $no$ denote individuals with and without omics data, respectively, and $E_{no} = [e_{no,1},\ldots,e_{no,k}]$ denotes the matrix of residual effects on omics for individuals without omics data. Distributional assumptions are also defined as previously. Note that $G_{no}$ and $E_{no}$ are unobserved. The genetic relationship matrix $H$ is

$$H = \begin{bmatrix} H_{no,no} & H_{no,o} \\ H_{o,no} & H_{o,o} \end{bmatrix}, \tag{13}$$

where submatrices $H_{no,no}$ and $H_{o,o}$ are genetic relationship matrices for individuals without omics data and with omics data, respectively, and submatrix $H_{o,no} = H_{no,o}^T$ contains relationships between individuals with and without omics data. Note that these submatrices are not to be confused with submatrices of $H$ in the single-step method (Legarra et al. 2014) for genotyped and nongenotyped individuals, respectively.

Equation (10) describes the relation between phenotypes and genetic effects for individuals without omics data, breeding values are $a_{no} = G_{no}\alpha + a_{r,no}$, and first- and second-order moments are equivalent to the ones for the usual model for genetic evaluation. Equations (11) and (12) are similar to equations (1) and (2). Concerning equation (10), then compared to the situation with complete omics data, the term $\tilde{X}_{no}\tilde{\beta}_{no}\alpha$ from the omics part of the model has been absorbed into $X_{no}\beta_{no}$ for the phenotype part of the model.

Below, we present extensions of methods 2 and 3 to incomplete omics data. For both approaches, we note the following two special cases: when all individuals have omics data, they simplify to the corresponding approach in the subsection about inference in the model for complete omics data, and when no individuals have omics data, they simplify to the usual model for genetic evaluation. In addition, both approaches are based on statistical models with valid positive definite variance–covariance matrices. Therefore, both of these approaches are valid extensions to handle incomplete omics data.

Method 2 extended to incomplete omics data:

EBVs for the model in (10), (11), and (12) can be derived in a similar way as how EBVs were derived for complete omics data. For individuals without omics data, denote by Mno=Gno+Eno the unobserved omics expression levels, here noting that X˜noβ˜no=0 since X˜noβ˜noα has been absorbed into Xnoβno. Furthermore, let M be the stacked matrix with Mno and Mo on top of each other. Then similar to previously shown, E[Gα|y,Mo,Mno,α]=E[Gα|Mα] is a linear function of Mα, and hence

a^=E[Gα+ar|y,Mo]=E[E[Gα|y,Mo,Mno,α]|y,Mo]+E[ar|y,Mo]=E[Gα|Mα]|Mα=E[Mα|y,Mo]+E[ar|y,Mo].

Therefore, computation of EBVs again requires two steps. Defining $u=M\alpha$, the first step consists of computing $\hat{u}=\mathrm{E}[u \mid y,M_o]$, and the second step consists of computing $\mathrm{E}[G\alpha \mid u]$.
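The second step can be illustrated with a minimal sketch. It assumes that fixed effects are ignored, that $\mathrm{E}[G\alpha \mid u]$ is the best linear predictor based on $\mathrm{Cov}(G\alpha,u)\propto\sigma_g^2H$ and $\mathrm{Var}(u)\propto\sigma_g^2H+\sigma_e^2I$, and that the variance ratio is $\zeta=\sigma_e^2/\sigma_g^2$ (an assumed definition). The function names and toy values are hypothetical. The sketch shows that the direct formula and the mixed-model form give the same predictor.

```python
import numpy as np

def mediated_bv_from_u(u, H, sigma_g2, sigma_e2):
    """Best linear predictor of G*alpha given u, ignoring fixed effects.

    The common factor sigma_alpha^2 in Cov(G*alpha, u) and Var(u) cancels.
    """
    n = H.shape[0]
    V = sigma_g2 * H + sigma_e2 * np.eye(n)
    return sigma_g2 * H @ np.linalg.solve(V, u)

def mediated_bv_from_mme(u, H, sigma_g2, sigma_e2):
    """Same predictor written as mixed model equations without fixed effects.

    zeta = sigma_e2 / sigma_g2 is an assumed definition of the variance ratio.
    """
    n = H.shape[0]
    zeta = sigma_e2 / sigma_g2
    return np.linalg.solve(np.eye(n) + zeta * np.linalg.inv(H), u)

# Toy relationship matrix and toy vector u (hypothetical values).
H = np.array([[1.0, 0.5, 0.25],
              [0.5, 1.0, 0.25],
              [0.25, 0.25, 1.0]])
u = np.array([0.8, -0.3, 1.1])
a_m_direct = mediated_bv_from_u(u, H, sigma_g2=0.61, sigma_e2=0.39)
a_m_mme = mediated_bv_from_mme(u, H, sigma_g2=0.61, sigma_e2=0.39)
```

With fixed effects present, the mixed-model form would be extended with the corresponding rows and columns.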

For computation of $\hat{u}$, we extend the omics similarity matrix $M_oM_o^{T}$ for individuals with omics data to a similarity matrix for all individuals by combining it with the genetic relationships in matrix $H$ in equation (13). The extension is similar to the way a single-step method for genetic evaluation (Legarra et al. 2009; Christensen and Lund 2010; Legarra et al. 2014) extends the genomic relationship matrix to a combined pedigree and genomic relationship matrix for all individuals. The extension of the omics similarity matrix $M_oM_o^{T}$ is as follows. The expected variance–covariance matrix without any omics data is

$$
\mathrm{Var}(u)=\sigma_\alpha^2\sum_i\sigma_{g,i}^2H+\sigma_\alpha^2\sum_i\sigma_{e,i}^2I=\sigma_\alpha^2(\sigma_g^2H+\sigma_e^2I),
$$

where $\sigma_g^2=\sum_i\sigma_{g,i}^2$ and $\sigma_e^2=\sum_i\sigma_{e,i}^2$, and where it should be noted that this does not incorporate fixed effects. In Appendix C, the central result in the derivation of the single-step method is stated using a general notation. Defining $z=u$, $v_i=m_i-\tilde{X}_o\tilde{\beta}_{o,i}$ for $i=1,\ldots,m$, and $A=\sigma_g^2H+\sigma_e^2I$, noting that $\mathrm{Var}(v_i)=\sigma_{g,i}^2H+\sigma_{e,i}^2I=\sigma_i^2A$ with $\sigma_i^2=\sigma_{g,i}^2/\sigma_g^2=\sigma_{e,i}^2/\sigma_e^2$ because omics features have equal heritabilities, and noting that $\sum_i\sigma_i^2=1$, we obtain from Appendix C that with incomplete omics data $\mathrm{Var}(u \mid M_o)=\sigma_\alpha^2\Omega$, where $\Omega$ has the inverse matrix

$$
\Omega^{-1}=(\sigma_g^2H+\sigma_e^2I)^{-1}+\begin{bmatrix} 0 & 0 \\ 0 & O_{o,o}^{-1}-(\sigma_g^2H_{o,o}+\sigma_e^2I)^{-1} \end{bmatrix}, \tag{14}
$$

where $O_{o,o}=(M_o-\tilde{X}_o\tilde{B}_o)(M_o-\tilde{X}_o\tilde{B}_o)^{T}$. Note here that $M_o$ is centered by $\tilde{X}_o\tilde{B}_o$, so that the omics similarity matrix $O_{o,o}$ is compatible with the expected one, $\sigma_g^2H_{o,o}+\sigma_e^2I$. Computation of $\hat{u}$ then proceeds by first solving the MME

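[Equation (15): mixed model equations for $\hat{u}$ and $\hat{a}_r$; displayed as an image (iyab130m15.jpg) in the original and not reproduced here.] (15)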

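for $\hat{u}$ and $\hat{a}_r$, and second, plugging $\hat{u}$ into the MME below, and solving this for $\hat{a}_m$,

$$
\begin{bmatrix} \tilde{X}^{T}\tilde{X} & \tilde{X}^{T} \\ \tilde{X} & I+\zeta H^{-1} \end{bmatrix}
\begin{bmatrix} \widehat{\tilde{\beta}\alpha} \\ \hat{a}_m \end{bmatrix}
=
\begin{bmatrix} \tilde{X}^{T}\hat{u} \\ \tilde{Z}^{T}\hat{u} \end{bmatrix}. \tag{16}
$$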

The EBVs again equal $\hat{a}_m+\hat{a}_r$. Note that since the term $\tilde{X}_o\tilde{B}_o$ used in the centering of the omics similarity matrix $O_{o,o}$ in equation (14) is not known a priori, preadjustment of omics expression levels is needed. In the simulated example, where a genomic relationship matrix is used, we call this method G-ΩBLUP.
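As an illustration of equation (14), the following minimal sketch builds $\Omega^{-1}$ with NumPy. It assumes that individuals with omics data are ordered last and that the centered similarity matrix $O_{o,o}$ has already been made invertible; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def omega_inverse(H, O_oo, sigma_g2, sigma_e2, n_o):
    """Inverse of Omega following equation (14).

    Assumes the n_o individuals with omics data are ordered last, so the
    lower-right block of every matrix refers to them.
    """
    n = H.shape[0]
    A = sigma_g2 * H + sigma_e2 * np.eye(n)   # expected matrix, all individuals
    A_oo = A[n - n_o:, n - n_o:]              # expected matrix, omics individuals
    omega_inv = np.linalg.inv(A)
    omega_inv[n - n_o:, n - n_o:] += np.linalg.inv(O_oo) - np.linalg.inv(A_oo)
    return omega_inv

# Toy example: 4 individuals, the last 2 with omics data (hypothetical values).
H = np.array([[1.0, 0.5, 0.25, 0.25],
              [0.5, 1.0, 0.25, 0.25],
              [0.25, 0.25, 1.0, 0.5],
              [0.25, 0.25, 0.5, 1.0]])
O_oo = np.array([[1.2, 0.4],
                 [0.4, 0.9]])
omega_inv = omega_inverse(H, O_oo, sigma_g2=0.61, sigma_e2=0.39, n_o=2)
omega = np.linalg.inv(omega_inv)
```

In this sketch, the block of `omega` for individuals with omics data reproduces `O_oo`, and when all individuals have omics data the function returns the inverse of `O_oo`, in line with the special cases noted for both extended methods.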

Method 3 extended to incomplete omics data:

Another approach for computing EBVs can be derived by generalizing the third approach for complete omics data to incomplete omics data. For individuals with omics data, indicated by subscript $o$, this approach consists of first solving equation (5) to obtain $\hat{g}_{o,i}$ and $\hat{\tilde{\beta}}_{o,i}$ for $i=1,\ldots,k$, stacking these into matrices $\hat{G}_o$ and $\hat{\tilde{B}}_o$, respectively, and computing $\hat{E}_o=M_o-\tilde{X}_o\hat{\tilde{B}}_o-\hat{G}_o$. Matrix $Q_{o,o}$ is then defined similarly to (8),

$$
Q_{o,o}=\begin{bmatrix} \hat{G}_o\hat{G}_o^{T} & \hat{G}_o\hat{E}_o^{T} \\ \hat{E}_o\hat{G}_o^{T} & \hat{E}_o\hat{E}_o^{T} \end{bmatrix}. \tag{17}
$$

Below, we first present the argument showing that we can generalize Method 3 to incomplete omics data by extending $Q_{o,o}$ to a combined similarity matrix for all individuals, combining $Q_{o,o}$ with relationship matrix $H$, and second, we present what this combined matrix looks like.

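Looking at all individuals, both with and without omics data, $M$ is a matrix with $M_{no}$ and $M_o$ stacked on top of each other as previously defined, and matrix $G$ is defined similarly. Noting that there is a one-to-one correspondence between $(\tilde{B}_o,G_o,M_o)$ and $(\tilde{B}_o,G_o,E_o)$, we approximate the EBV in the model by

$$
\begin{aligned}
\hat{a} &= \mathrm{E}[G\alpha+a_r \mid y,M_o] \\
&= \mathrm{E}\big[\mathrm{E}[G\alpha+a_r \mid y,\tilde{B}_o,G_o,M_o] \mid y,M_o\big] \\
&= \mathrm{E}\big[\mathrm{E}[G\alpha+a_r \mid y,\tilde{B}_o,G_o,E_o] \mid y,M_o\big] \\
&= \mathrm{E}\big[\mathrm{E}[G\alpha+a_r \mid y,G_o,E_o] \mid y,M_o\big] \\
&\approx \mathrm{E}[G\alpha+a_r \mid y,G_o,E_o]\big|_{G_o=\mathrm{E}[G_o \mid y,M_o],\,E_o=\mathrm{E}[E_o \mid y,M_o]} \\
&\approx \mathrm{E}[G\alpha+a_r \mid y,G_o,E_o]\big|_{G_o=\hat{G}_o,\,E_o=\hat{E}_o}.
\end{aligned}
$$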

An approximation of BLUP can therefore be obtained from solving MME for the model

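$$
y=X\beta+Z\tilde{a}_m+\begin{bmatrix} 0 \\ Z_o \end{bmatrix}\tilde{e}_m+Za_r+\begin{bmatrix} E_{no}\alpha+\epsilon \\ \epsilon \end{bmatrix},
$$

where $\tilde{a}_m$ is defined from the conditional distribution of $G\alpha$ given $G_o=\hat{G}_o$, and $\tilde{e}_m=\hat{E}_o\alpha$, i.e.,

$$
\mathrm{Var}\begin{bmatrix} \tilde{a}_m \\ \tilde{e}_m \end{bmatrix}=\sigma_\alpha^2Q^{\mathrm{ext}},
$$

with matrix $Q^{\mathrm{ext}}$ derived below. Note that effects $\tilde{a}_m$ are defined for all individuals, whereas effects $\tilde{e}_m$ are only defined for individuals with omics data. Note also that, due to the inclusion of $E_{no}\alpha$ in the residual, the residual variance equals $\sigma_\epsilon^2+\sigma_\alpha^2\sigma_e^2$ for individuals without omics data, and equals $\sigma_\epsilon^2$ for individuals with omics data. The prediction of effects is obtained by solving

$$
\begin{bmatrix}
X^{T}R^{-1}X & X^{T}R^{-1}Z_Q & X^{T}R^{-1}Z \\
Z_Q^{T}R^{-1}X & Z_Q^{T}R^{-1}Z_Q+(Q^{\mathrm{ext}})^{-1}/\sigma_\alpha^2 & Z_Q^{T}R^{-1}Z \\
Z^{T}R^{-1}X & Z^{T}R^{-1}Z_Q & Z^{T}R^{-1}Z+H^{-1}/\sigma_{g,r}^2
\end{bmatrix}
\begin{bmatrix}
\hat{\beta} \\ \begin{bmatrix} \hat{a}_m \\ \hat{\tilde{e}}_m \end{bmatrix} \\ \hat{a}_r
\end{bmatrix}
=
\begin{bmatrix}
X^{T}R^{-1}y \\ Z_Q^{T}R^{-1}y \\ Z^{T}R^{-1}y
\end{bmatrix},
$$

where

$$
Z_Q=\begin{bmatrix} Z & \begin{bmatrix} 0 \\ Z_o \end{bmatrix} \end{bmatrix},\qquad
R^{-1}=\begin{bmatrix} I/(\sigma_\epsilon^2+\sigma_\alpha^2\sigma_e^2) & 0 \\ 0 & I/\sigma_\epsilon^2 \end{bmatrix}.
$$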
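The design matrix $Z_Q$ and the heterogeneous residual weights in $R^{-1}$ can be assembled as in the following minimal sketch. It assumes one phenotypic record per individual (so that $Z=I$), individuals without omics data ordered first, and hypothetical sample sizes and variance values.

```python
import numpy as np

n_no, n_o = 3, 2                  # hypothetical numbers of individuals
n = n_no + n_o
sigma_eps2, sigma_alpha2, sigma_e2 = 0.5, 0.1, 0.4   # hypothetical variances

Z = np.eye(n)                     # assumption: one record per individual
Z_o = np.eye(n_o)
# Columns for e~_m only reach the records of individuals with omics data.
Z_e = np.vstack([np.zeros((n_no, n_o)), Z_o])
Z_Q = np.hstack([Z, Z_e])

# Larger residual variance for individuals without omics data,
# because E_no * alpha is part of their residual.
r_diag = np.concatenate([
    np.full(n_no, 1.0 / (sigma_eps2 + sigma_alpha2 * sigma_e2)),
    np.full(n_o, 1.0 / sigma_eps2),
])
R_inv = np.diag(r_diag)
```

The number of columns of `Z_Q` equals the total number of individuals plus the number of individuals with omics data, matching the dimension of $(Q^{\mathrm{ext}})^{-1}$.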

The derivation of $Q^{\mathrm{ext}}$ based on the conditional distribution $[(G,E_o) \mid G_o=\hat{G}_o,E_o=\hat{E}_o]$ is similar to how the single-step method for genetic evaluation (Legarra et al. 2009; Christensen and Lund 2010; Legarra et al. 2014) extends the genomic relationship matrix to a combined pedigree and genomic relationship matrix for all individuals. The expected variance–covariance matrix without any omics data is

$$
\mathrm{Var}\begin{bmatrix} \tilde{a}_m \\ \tilde{e}_m \end{bmatrix}=\sigma_\alpha^2\begin{bmatrix} \sigma_g^2H & 0 \\ 0 & \sigma_e^2I \end{bmatrix},
$$

where the identity matrix has dimension equal to the number of individuals with omics data. We use the result in Appendix C with

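$$
z=\begin{bmatrix} \tilde{a}_m \\ \tilde{e}_m \end{bmatrix},\qquad
v_i=\begin{bmatrix} \begin{bmatrix} g_{no,i} \\ g_{o,i} \end{bmatrix} \\ e_{o,i} \end{bmatrix},\qquad
A=\begin{bmatrix} \sigma_g^2H & 0 \\ 0 & \sigma_e^2I \end{bmatrix},
$$

$\sigma_z^2=\sigma_\alpha^2$ and $\sigma_i^2=\sigma_{g,i}^2/\sigma_g^2=\sigma_{e,i}^2/\sigma_e^2$ when assuming equal heritabilities of omics features. Here, $g_{o,i}$ and $e_{o,i}$ are the observed elements in $v_i$ so that

$$
V=\begin{bmatrix} V_{no} \\ V_o \end{bmatrix}=\begin{bmatrix} G_{no} \\ \begin{bmatrix} \hat{G}_o \\ \hat{E}_o \end{bmatrix} \end{bmatrix},
$$

and submatrices of $A$ are $A_{no,no}=\sigma_g^2H_{no,no}$, $A_{no,o}=\sigma_g^2H_{no,o}$, $A_{o,no}=\sigma_g^2H_{o,no}$, and

$$
A_{o,o}=\begin{bmatrix} \sigma_g^2H_{o,o} & 0 \\ 0 & \sigma_e^2I \end{bmatrix}.
$$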

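Noting that $\sum_i\sigma_i^2=1$, we obtain from Appendix C that with incomplete omics data

$$
\mathrm{Var}\begin{bmatrix} \tilde{a}_m \\ \tilde{e}_m \end{bmatrix}=\sigma_\alpha^2Q^{\mathrm{ext}},
$$

where $Q^{\mathrm{ext}}$ has inverse matrix

$$
(Q^{\mathrm{ext}})^{-1}=
\begin{bmatrix} H^{-1}/\sigma_g^2 & 0 \\ 0 & I/\sigma_e^2 \end{bmatrix}
+
\begin{bmatrix}
\begin{bmatrix} 0 & 0 \\ 0 & Q_{o,o}^{HH}-H_{o,o}^{-1}/\sigma_g^2 \end{bmatrix} & \begin{bmatrix} 0 \\ Q_{o,o}^{HI} \end{bmatrix} \\
\begin{bmatrix} 0 & Q_{o,o}^{IH} \end{bmatrix} & Q_{o,o}^{II}-I/\sigma_e^2
\end{bmatrix},
$$

with

$$
Q_{o,o}^{-1}=\begin{bmatrix} Q_{o,o}^{HH} & Q_{o,o}^{HI} \\ Q_{o,o}^{IH} & Q_{o,o}^{II} \end{bmatrix},
$$

where $Q_{o,o}$ is defined in equation (17), and the subdivision is as in that equation. In this formula, the two $I/\sigma_e^2$ matrices both have dimension equal to the number of individuals with omics data, and therefore they cancel, so that we obtain

$$
(Q^{\mathrm{ext}})^{-1}=
\begin{bmatrix} H^{-1}/\sigma_g^2 & 0 \\ 0 & 0 \end{bmatrix}
+
\begin{bmatrix}
\begin{bmatrix} 0 & 0 \\ 0 & Q_{o,o}^{HH}-H_{o,o}^{-1}/\sigma_g^2 \end{bmatrix} & \begin{bmatrix} 0 \\ Q_{o,o}^{HI} \end{bmatrix} \\
\begin{bmatrix} 0 & Q_{o,o}^{IH} \end{bmatrix} & Q_{o,o}^{II}
\end{bmatrix}. \tag{18}
$$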

Here, the dimension of $H^{-1}$ is the total number of individuals, the dimensions of $Q_{o,o}^{HH}$, $Q_{o,o}^{HI}$, $Q_{o,o}^{IH}$, and $Q_{o,o}^{II}$ are the number of individuals with omics data, and the dimension of $(Q^{\mathrm{ext}})^{-1}$ is the total number of individuals plus the number of individuals with omics data. In the simulated example, where a genomic relationship matrix is used, we call this method GQBLUP. Note that it is not fully equivalent to G-ΩBLUP, as there are some approximations involved.
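The block structure of equation (18) can be made concrete with a minimal NumPy sketch. It assumes that individuals with omics data are ordered last in $H$, that $Q_{o,o}$ is ordered as in equation (17), and that a small ridge has been added to $Q_{o,o}$ to make it invertible. The function name, random toy inputs, and variance value are hypothetical.

```python
import numpy as np

def q_ext_inverse(H, Q_oo, sigma_g2, n_o):
    """Inverse of Q^ext following equation (18).

    Assumes individuals with omics data are ordered last in H, and that
    Q_oo (2*n_o x 2*n_o) has the genetic block first and the residual
    block second, as in equation (17).
    """
    n = H.shape[0]
    Q_inv = np.linalg.inv(Q_oo)
    Q_HH, Q_HI = Q_inv[:n_o, :n_o], Q_inv[:n_o, n_o:]
    Q_IH, Q_II = Q_inv[n_o:, :n_o], Q_inv[n_o:, n_o:]
    H_oo = H[n - n_o:, n - n_o:]

    out = np.zeros((n + n_o, n + n_o))
    out[:n, :n] = np.linalg.inv(H) / sigma_g2
    out[n - n_o:n, n - n_o:n] += Q_HH - np.linalg.inv(H_oo) / sigma_g2
    out[n - n_o:n, n:] = Q_HI
    out[n:, n - n_o:n] = Q_IH
    out[n:, n:] = Q_II
    return out

# Toy inputs: 4 individuals, the last 2 with omics data on 6 features.
rng = np.random.default_rng(1)
H = np.array([[1.0, 0.5, 0.25, 0.25],
              [0.5, 1.0, 0.25, 0.25],
              [0.25, 0.25, 1.0, 0.5],
              [0.25, 0.25, 0.5, 1.0]])
S = np.vstack([rng.normal(size=(2, 6)), rng.normal(size=(2, 6))])  # [G_hat; E_hat]
Q_oo = S @ S.T
Q_oo += 0.01 * np.mean(np.diag(Q_oo)) * np.eye(4)   # ridge for invertibility
Q_ext_inv = q_ext_inverse(H, Q_oo, sigma_g2=0.61, n_o=2)
```

When every individual has omics data, the function returns the inverse of `Q_oo`, which is the complete-data special case.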

Simulation

Here, we will illustrate the models and methods developed in this paper in terms of predictive performance using a simulated data set. Methods for complete (GOBLUP) and incomplete (G-ΩBLUP and GQBLUP) omics data will be investigated, in addition to methods not including omics data (GBLUP and pedigree-based BLUP), and the previously developed method by Weishaar et al. (2020).

A simulated data set was generated using QMsim (Sargolzaei and Schenkel 2009). Twenty chromosomes were simulated, each of length 1 Morgan and containing 1000 biallelic loci evenly distributed, of which 750 were SNP marker loci and 250 were putative quantitative trait loci (QTL). In total, there were 5000 QTL and 15,000 marker loci. The initial set of these loci had allele frequencies drawn from independent uniform distributions, but linkage disequilibrium (LD) was generated through 20 previous generations with an effective population size of 200. Omics data consisted of 1200 features, where each feature depended on 500 QTL randomly sampled from among the 5000 QTL. Residual genetic effects depended on 500 randomly sampled QTL. All QTL effects were generated following independent Gaussian distributions. Heritabilities and the omics variance ratio were roughly based on the actual parameters for metagenomics estimated by Weishaar et al. (2020) for residual feed intake (as trait) and abundance of operational taxonomic units (as omics feature): the heritability of omics expression levels was assumed to be $h_m^2=0.61$, the heritability of the phenotype was assumed to be $h^2=0.47$, and the omics variance ratio was $c_m^2=0.44$. Using the formula $h^2=c_m^2h_m^2+h_r^2$ results in $h_r^2=0.20$ being the heritability of the residual breeding value, and $0.47-0.20=0.27$ being the heritability of the omics-mediated breeding value.
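The numbers in this paragraph can be checked with a few lines of arithmetic. The variable names below are hypothetical; the values are those stated above.

```python
n_chromosomes = 20
qtl_per_chromosome = 250
markers_per_chromosome = 750
n_qtl = n_chromosomes * qtl_per_chromosome           # total number of QTL
n_markers = n_chromosomes * markers_per_chromosome   # total number of marker loci

h2 = 0.47     # heritability of the phenotype
h2_m = 0.61   # heritability of omics expression levels
c2_m = 0.44   # omics variance ratio

# h2 = c2_m * h2_m + h2_r, solved for the residual part.
h2_mediated = c2_m * h2_m     # heritability of omics-mediated breeding value
h2_r = h2 - h2_mediated       # heritability of residual breeding value
print(round(h2_mediated, 2), round(h2_r, 2))
```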

The data consisted of 11 generations, with the first generation consisting of 1000 females and 100 males. For the following generations, each male was mated to 10 females, each mating producing 1 male and 1 female offspring (in total, 2000 individuals in each generation), and the 10% of the males with the largest phenotypes were used for mating to the 1000 females in the next generation, with a constraint that minimized inbreeding in the next generation through assortative mating (Sargolzaei and Schenkel 2009). It was assumed that all 21,100 individuals had genotypes available, whereas phenotypes were available for all individuals except those in the last generation. First, we assumed that all 21,100 individuals had omics data, and second, we assumed that only the last two generations, consisting of 4000 individuals, had omics data. In all cases, we used the last generation for validation, comparing predicted genetic effects with true genetic effects.
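The population sizes quoted here follow directly from the mating design, as the following short calculation with hypothetical variable names shows.

```python
n_generations = 11
founders = 1000 + 100            # females + males in the first generation
males_per_generation = 100
females_per_male = 10
offspring_per_mating = 2         # one male and one female

per_generation = males_per_generation * females_per_male * offspring_per_mating
n_total = founders + (n_generations - 1) * per_generation
n_with_omics = 2 * per_generation            # incomplete scenario: last two generations
n_without_omics = n_total - n_with_omics
n_validation = per_generation                # last generation, without phenotypes
```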

Relationship matrices were constructed across all individuals. A pedigree-based additive genetic relationship matrix was constructed using the pedigree of the 11 generations. The genomic relationship matrix was constructed including the 15,000 marker loci, according to VanRaden (2008) with centering and scaling by allele frequencies in the first generation. In order to be able to invert the matrix, we assumed a 5% contribution from the pedigree-based relationship matrix. The matrices $MM^{T}$ and $Q$ in equations (6) and (8) were made invertible by adding 1% of the average of their diagonal elements to the diagonal of the matrices.
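The two regularization steps can be sketched as follows. The reading of "5% contribution from the pedigree-based relationship matrix" as the weighted average $0.95\,G+0.05\,A$ is an assumption, and the function names and toy matrices are hypothetical.

```python
import numpy as np

def blend_with_pedigree(G, A, weight=0.05):
    """Weighted average of genomic (G) and pedigree (A) relationship matrices.

    Assumption: a 5% pedigree contribution means 0.95 * G + 0.05 * A.
    """
    return (1.0 - weight) * G + weight * A

def add_ridge(S, fraction=0.01):
    """Add a fraction of the average diagonal element to the diagonal of S."""
    return S + fraction * np.mean(np.diag(S)) * np.eye(S.shape[0])

# Toy example: 4 individuals but only 2 omics features, so M M^T is singular.
M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, -1.0]])
MMt = M @ M.T
MMt_reg = add_ridge(MMt)
```

After the ridge is added, all eigenvalues of `MMt_reg` are strictly positive, so the matrix can be inverted.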

For the inverses of the similarity matrices $\Omega$ in (14) and $Q^{\mathrm{ext}}$ in (18) used for incomplete omics data, submatrices of the omics similarity, genomic relationship, and $Q$ matrices for individuals with omics data were extracted from the full matrices.

In all analyses, we assumed variance parameters to be known and equal to their true values. For complete omics data on all 21,100 individuals, we used the following methods in Table 1 with the corresponding predictors of breeding values.

Table 1.

Methods with the corresponding predictors of breeding values for complete omics data

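| Method | Predictor | Description |
| --- | --- | --- |
| BLUP | $\hat{a}_{\mathrm{BLUP}}$ | Pedigree-based BLUP |
| GBLUP | $\hat{a}_{\mathrm{GBLUP}}$ | Genomic BLUP |
| GOBLUP | $\hat{a}_{\mathrm{GOBLUP}}$ | Solving (6) and (7); components $\hat{a}_m$ and $\hat{a}_r$; genomic relationship matrix |
| | $\hat{a}_m$ | Predicted omics-mediated breeding value in GOBLUP |
| | $\hat{a}_r$ | Predicted residual breeding value in GOBLUP |
| Weishaar | $\hat{a}_m^{\mathrm{weishaar}}$ | The Weishaar et al. (2020) procedure, which is GOBLUP without residual $a_r$ |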

For complete omics data, we in addition implemented the other two equivalent methods, i.e., combining (4) and (5), and combining (5), (8), and (9), respectively, to confirm that predicted genetic effects on individuals were the same (results not shown). Similarly, for incomplete omics data, we used the methods in Table 2.

Table 2.

Methods with the corresponding predictors of breeding values for incomplete omics data

| Method | Predictor | Description |
| --- | --- | --- |
| GOBLUP | $\hat{a}_{\mathrm{GOBLUP}}$ | Solving (6) and (7); components $\hat{a}_m$ and $\hat{a}_r$; genomic relationship matrix; subset of 4000 individuals with complete omics data |
| G-ΩBLUP | $\hat{a}_{\mathrm{G\Omega BLUP}}$ | Solving (15) and (16); components $\hat{a}_{m,\Omega}$ and $\hat{a}_{r,\Omega}$; genomic relationship matrix; extended omics relationship matrix in (14) |
| GQBLUP | $\hat{a}_{\mathrm{GQBLUP}}$ | Solving (5) and (9); components $\hat{a}_{m,Q^{\mathrm{ext}}}$ and $\hat{a}_{r,Q^{\mathrm{ext}}}$; genomic relationship matrix; extended matrix $Q^{\mathrm{ext}}$ in (18) |

Methods are compared based on accuracies, i.e., the correlation between predicted and true genetic effect, and inflation/deflation, i.e., whether the estimated regression coefficient for true effect on predicted effect deviates from 1, for all three components $a_r$, $a_m$, and $a$. In addition, we also study correlations between pairs of effects, where one is a predicted effect and the other one is a true effect, e.g., $\mathrm{Cor}(\hat{a}_r,a_m)$, to investigate a possible confounding of effects.
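Both criteria are simple to compute. The sketch below uses only the Python standard library; the function name and the toy vectors are hypothetical. In the toy example the predictions are exactly twice the true values, so the correlation is 1 while the slope of true on predicted is 0.5, i.e., the predictions are inflated.

```python
from math import sqrt

def accuracy_and_slope(true_effect, predicted_effect):
    """Return (correlation, slope of the regression of true on predicted)."""
    n = len(true_effect)
    mean_t = sum(true_effect) / n
    mean_p = sum(predicted_effect) / n
    s_tp = sum((t - mean_t) * (p - mean_p)
               for t, p in zip(true_effect, predicted_effect))
    s_tt = sum((t - mean_t) ** 2 for t in true_effect)
    s_pp = sum((p - mean_p) ** 2 for p in predicted_effect)
    return s_tp / sqrt(s_tt * s_pp), s_tp / s_pp

true_bv = [1.0, 2.0, 3.0, 4.0]
predicted_bv = [2.0, 4.0, 6.0, 8.0]   # inflated predictions
cor, slope = accuracy_and_slope(true_bv, predicted_bv)
```

A slope below 1 indicates inflation of the predicted effects, and a slope above 1 indicates deflation.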

Results

Accuracies and, more generally, correlations between predicted and true effects for complete omics data are shown in Table 3, where we see that the accuracy of EBV for GOBLUP (0.79) is higher than for GBLUP (0.70), which is higher than for BLUP (0.43). This is as expected, as more information is successively included in GBLUP (genotypes) and GOBLUP (genotypes and omics data), although we note that the sizes of the differences are due to assumptions in the simulation (high heritability of the trait and high heritability of omics expression levels, among others). Accuracies of $\hat{a}_m$ and $\hat{a}_r$ are both high, 0.82 and 0.66, respectively. Seeing that $\mathrm{Cor}(\hat{a}_r,a)=0.43$, and comparing $\mathrm{Cor}(\hat{a}_m,a)=0.65$ and $\mathrm{Cor}(\hat{a}_{\mathrm{GOBLUP}},a)=0.79$, we see that adding $\hat{a}_r$ to the prediction results in a substantial increase of the accuracy for this data set. For $\hat{a}_m^{\mathrm{weishaar}}$, accuracies are $\mathrm{Cor}(\hat{a}_m^{\mathrm{weishaar}},a_m)=0.78$ and $\mathrm{Cor}(\hat{a}_m^{\mathrm{weishaar}},a)=0.70$, which for this data set are lower than the accuracies from the theoretically optimal method, i.e., $\mathrm{Cor}(\hat{a}_m,a_m)=0.82$ and $\mathrm{Cor}(\hat{a}_{\mathrm{GOBLUP}},a)=0.79$. For BLUP and GBLUP, we see that predictions $\hat{a}_{\mathrm{BLUP}}$ and $\hat{a}_{\mathrm{GBLUP}}$ are positively correlated with both $a_r$ and $a_m$, which is expected, since these methods do not decompose breeding value $a$ into these two components. For GOBLUP, we see that $\mathrm{Cor}(\hat{a}_m,a_r)$ and $\mathrm{Cor}(\hat{a}_r,a_m)$ are close to zero, meaning that for this method, the prediction of each of the two components captures its respective component of the breeding value and not the other component, suggesting that these two components are well separated for this simulated data set.

Table 3.

Complete omics data

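| Predictor | $a_r$ | $a_m$ | $a$ | Slope |
| --- | --- | --- | --- | --- |
| $\hat{a}_{\mathrm{BLUP}}$ | 0.25 | 0.32 | 0.43 | 0.97 |
| $\hat{a}_{\mathrm{GBLUP}}$ | 0.42 | 0.52 | 0.70 | 0.91 |
| $\hat{a}_r$ | 0.66 | −0.03 | 0.43 | 0.97 |
| $\hat{a}_m$ | 0 | 0.82 | 0.65 | 0.98 |
| $\hat{a}_{\mathrm{GOBLUP}}$ | 0.38 | 0.66 | 0.79 | 0.97 |
| $\hat{a}_m^{\mathrm{weishaar}}$ | 0.12 | 0.78 | 0.70 | 0.78 |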

Methods in Table 1. Correlations across last generation. The estimated slope from regression of true effect on predicted effect.

For complete omics data, estimated regression coefficients of true effect on predicted effect are also shown in Table 3. For all models and effects, the coefficients do not deviate strongly from 1, and hence do not show strong inflation/deflation, with the exception of $\hat{a}_m^{\mathrm{weishaar}}$, where the regression coefficient is 0.78, showing an inflation. However, there is a small tendency for other predicted effects, in particular $\hat{a}_{\mathrm{GBLUP}}$, to also be a bit inflated for this particular data set.

Accuracies and correlations between predicted and true effects for incomplete omics data are shown in Table 4. The G-ΩBLUP method on all 21,100 individuals results in a higher accuracy of EBV than GBLUP in Table 3 (0.74 vs 0.70) and in higher accuracies of prediction of $a_r$, $a_m$, and $a$ than the GOBLUP method on the complete subset of individuals, i.e., the 4000 individuals in the last two generations (0.45 vs 0.41 for $a_r$, 0.69 vs 0.66 for $a_m$, and 0.74 vs 0.68 for $a$), but in lower accuracies than the corresponding ones for GOBLUP in Table 3 (0.66, 0.82, and 0.79, respectively). This is as expected, since G-ΩBLUP utilizes more information (omics data on the youngest 4000 individuals) than GBLUP, more information (phenotypes and genotypes on the oldest 17,100 individuals) than GOBLUP on the complete subset of individuals, and less information (omics data on the oldest 17,100 individuals) than GOBLUP in Table 3. The results for GQBLUP look similar to the results for G-ΩBLUP, but with some small deviations, which fits with the fact that an additional approximation is used for this method compared to G-ΩBLUP. For all three methods, we see that correlations between predicted $a_m$ and true $a_r$ and vice versa, e.g., $\mathrm{Cor}(\hat{a}_m,a_r)$ and $\mathrm{Cor}(\hat{a}_r,a_m)$, are larger than the ones in Table 3, suggesting that these two components are less well separated in this case.

Table 4.

Incomplete omics data

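| Predictor | $a_r$ | $a_m$ | $a$ | Slope |
| --- | --- | --- | --- | --- |
| $\hat{a}_r$ | 0.41 | 0.17 | 0.41 | 1.07 |
| $\hat{a}_m$ | 0.17 | 0.66 | 0.65 | 1.05 |
| $\hat{a}_{\mathrm{GOBLUP}}$ | 0.31 | 0.59 | 0.68 | 1.07 |
| $\hat{a}_{r,\Omega}$ | 0.45 | 0.18 | 0.45 | 0.67 |
| $\hat{a}_{m,\Omega}$ | 0.21 | 0.69 | 0.69 | 1.01 |
| $\hat{a}_{\mathrm{G\Omega BLUP}}$ | 0.41 | 0.58 | 0.74 | 0.93 |
| $\hat{a}_{r,Q}$ | 0.46 | 0.22 | 0.49 | 0.69 |
| $\hat{a}_{m,Q}$ | 0.20 | 0.69 | 0.69 | 1.03 |
| $\hat{a}_{\mathrm{GQBLUP}}$ | 0.40 | 0.59 | 0.74 | 0.93 |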

Methods in Table 2. Correlations across last generation. The estimated slope from regression of true effect on predicted effect.

For incomplete omics data, the estimated slopes from regression of true effect on predicted effect are also shown in Table 4. For GOBLUP, the slopes show some deviation from 1, but not to a large degree. For G-ΩBLUP and GQBLUP, the slopes show only small deviations from 1 for $a_m$ and some deviation from 1 for $a$, but the slopes for $a_r$ are clearly smaller than 1. We note here that deflation or inflation of predicted $a_r$ or $a_m$ in itself is less of a problem, as long as the resulting EBV is not inflated or deflated. Hence, for GOBLUP on 4000 individuals, the predicted $a_r$, $a_m$, and $a$ are a bit deflated, whereas for both G-ΩBLUP and GQBLUP the predicted $a_r$ is much inflated and $a$ is a bit inflated, i.e., there is no large dispersion bias on EBV for this data set.

Discussion

In this paper, we have introduced a joint model for phenotypes and intermediate omics data and developed methods for genetic evaluation using this model. For complete omics data, three equivalent methods for computing EBV were presented, and two of those methods were extended to the case with incomplete omics data. The two extended methods for incomplete omics data are not equivalent. The methods were illustrated using a simulated data set.

For the case of complete omics data, considering the results of our analysis on the simulated data set, we observe that including omics data for selection candidates without phenotypic records leads to higher accuracy, as expected. This is in fact similar to the effect of including a correlated trait. Note that the increase in accuracy will depend on the strength of the relationship between the omics features and the phenotype, and also on the heritability of omics features, given that only the genetic part of each omics feature is passed on to the offspring. A manner of appraising this is through the expression $h^2=c_m^2h_m^2+h_r^2$, where $c_m^2$ measures the strength of the association between omics and the trait of interest, and $h_m^2$ measures the part of each omics feature that can be transmitted to the offspring. The three equivalent methods yielded, as expected, the same results. The method of Weishaar et al. (2020) resulted in smaller accuracies, but this result will not hold generally, for instance if $h_r^2$ captures a small fraction of $h^2$. More importantly, the selection index suggested by Weishaar et al. (2020) that combines (in our notation) $\hat{a}_m^{\mathrm{weishaar}}$ and $\hat{a}_{\mathrm{GBLUP}}$ requires specification of weights, and this is not obvious, whereas our method produces a neat split of breeding value $a$ into omics-mediated and residual breeding values $a_m$ and $a_r$.

For the case of incomplete omics data, the results were also promising and according to expectations. The optimal method consists of using all available information, instead of ignoring phenotypic data on individuals without omics data to fit a simpler model. The two methods, G-ΩBLUP including matrix $\Omega$, and GQBLUP including matrix $Q^{\mathrm{ext}}$, gave similar accuracies for this data set. We note that this may not always be the case, since the latter method is an approximation. Having methods for incomplete omics data is important for many plant and animal selection schemes. For instance, subgroups of animals like prospective future bulls or boars could be measured for omics. Also, in plant breeding, in addition to testing some varieties in trials, omics data could be measured, for example, in a separate facility and for a subset of all plants. The methods for incomplete omics data make it possible to study cases where individuals measured for phenotypes are different from individuals measured for omics data, as long as the two sets of individuals are genetically related. These methods, together with the well-known single-step methods (Legarra et al. 2014) for combining pedigree and genomic relationships, give ample flexibility to consider all possible cases of incomplete phenotyping, incomplete omics data, and incomplete genotyping.

The flexibility of methods for incomplete omics data comes, however, with fine-tuning details. In principle, all information entering into the system must refer to the same conceptual base population. Hence, if the H matrix from single-step GBLUP (Legarra et al. 2014) is used, care must be taken so that genomic and pedigree information are compatible. In the same manner, omics measures need to be included in a way that they refer to the same population as the genetic relationships (whether pedigree-based, genomic, or a combination of both). This implies that correct heritabilities must be used and that any manipulation (centering, scaling, etc.) of the measures in M must retain compatibility with the genetic base.

As mentioned above, our analysis of a simulated data set shows results that are both promising and according to expectations. However, we would like to stress that these results are only for one simulated data set, where specific assumptions have been made about the architecture of genetics, omics expression levels, and phenotypes. We note for example that our simulation did not include fixed effects and that both the heritability of the trait and the heritability of the omics expression levels are high compared to the fact that many traits of interest for selection have a heritability in the range 0.1–0.3, and for example, for metabolomic data, the estimated heritabilities of features are mostly low to medium, with very few features having heritability above 0.4 (Aliakbari et al. 2019; Guo et al. 2020). Furthermore, in real data sets, the amount of LD may be larger than the amount generated by 20 generations of random mating in our study. Results from a simulated data set with increased amount of LD are presented in Supplementary information and show that the accuracies are very high for all methods, but with the same general pattern as the accuracies in the paper. Finally, our simulated data were mostly made according to the assumptions in the model, with only few violations being made. Therefore, our methods have not been tested to a full extent.

The model that we have developed, in its formulation, assumes that omics expression levels directly affect the phenotype of interest. This assumption has to be taken with a grain of salt. It is reasonable to assume that gene expressions affect phenotypes; it may not be reasonable to assume that all possible omics measures (and also including measures of metagenome or of NIR spectra) affect all possible traits. Thus, our model is causal in the same sense that genomic prediction is causal: for some omics features, there will be a causal effect on the phenotype, whereas for others, the omics feature acts as a proxy to a biological phenomenon affecting phenotype. Furthermore, a consequence of omics expression levels directly affecting the phenotype is that the vector of regression coefficients α is the same for both the genetic component of omics expression levels, G, and the environmental components, X˜B˜ and E, but in reality, these vectors of regression coefficients may be different. For example, the environmental component E may contain measurement noise, which is not supposed to affect the phenotype y. Relaxing this model assumption seems worthy of further investigation, both theoretically and in analysis of data.

In turn, this raises the question of giving more importance to some omics features than to others, in the line of BayesB (Meuwissen et al. 2001) and similar approaches for genomic evaluation. Our first method, with omics regression effects $\alpha$ (similar in spirit to SNP-BLUP), leads naturally to it, as the a priori variance structure of omics effects, $I\sigma_\alpha^2$, can be easily replaced by a more sophisticated structure $D\sigma_\alpha^2$, where $D$ is a diagonal matrix with elements $d_{ii}$, $i=1,\ldots,k$, giving more or less shrinkage. This leads naturally to iterative methods such as Markov chain Monte Carlo. In our second method, with an omics similarity matrix (similar to GBLUP), this is more complex, as it would involve creating a “weighted” omics similarity matrix $MDM^{T}$, similar to a trait-specific relationship matrix (Zhang et al. 2010; Su et al. 2012; Campbell et al. 2021).
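A weighted omics similarity matrix of the form $MDM^{T}$ is straightforward to form once the weights are given. The sketch below is only an illustration with a hypothetical function name, toy omics matrix, and arbitrary weights; how to choose the weights is the open question discussed above.

```python
import numpy as np

def weighted_similarity(M, d):
    """Return M D M^T with D = diag(d), one weight per omics feature.

    Multiplying the columns of M by d avoids forming the k x k matrix D.
    """
    return (M * d) @ M.T

# Toy omics matrix: 3 individuals, 4 features (hypothetical values).
M = np.array([[0.2, -1.0, 0.5, 0.0],
              [1.1, 0.3, -0.4, 0.7],
              [-0.6, 0.8, 0.1, -0.2]])
d = np.array([2.0, 0.5, 1.0, 0.1])   # arbitrary feature weights
S_weighted = weighted_similarity(M, d)
S_unweighted = weighted_similarity(M, np.ones(4))
```

With all weights equal to 1, the result reduces to the unweighted similarity matrix $MM^{T}$.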

A second, similar question is about relaxing the assumption that heritabilities of omics features ($h_m^2$) are the same across all features. The first and third methods for complete data are both based on first estimating genetic effects on omics expression levels for each feature using equation (5). In this equation, constant heritability is actually not assumed, whereas the second method based on the omics similarity matrix does assume constant heritability. In practice, having different heritabilities of omics features requires estimation of a large number of parameters, which may be hard for small data sets, so assuming the same heritability is also a more robust alternative.

Third, the assumption of independence across omics features is also important. The dependence can be strong across features in metabolomics data, since there is not a one-to-one correspondence between peaks in the nuclear magnetic resonance spectra and specific metabolites (Aliakbari et al. 2019), whereas it is likely to be lower for gene expression profiles. Possibly, one may assume that the signal of an important feature will spread out among similar features, but definitely, this deserves further exploration. In principle, the hierarchical structure of the model in (1) and (2) could be modified to accommodate dependencies (correlations) across omics features, but inference would then become more complicated. Also, non-normality, detection thresholds, and many kinds of particularities of omics features could be addressed.

Fourth, the model in (1) and (2) does not include nongenetic random effects commonly used in genetic evaluation, such as litter effects or pen effects used for animals, or location-year and batch effects used for plants. The methods presented here do generalize to such a case with some added complexity. In particular, we note that for the GQBLUP method, the matrix $Q$ would then have additional rows and columns and would contain cross-products of four terms, instead of $\hat{G}$ and $\hat{E}$ only.

Finally, the model presented here is a univariate model. An extension of the model and associated methods for genetic evaluation to multiple traits (e.g., yield and composition) is important, but requires a specification of both the model and the associated MME for multiple traits using Kronecker products. The MME in this paper can equivalently be specified using variance components instead of ratios of variance components (see Supplementary information), and these may serve as starting points for deriving such methods.

Concerning computational details, for complete data, predicted effects in our first method can be computed using one run to predict all $\hat{\alpha}_i$ simultaneously, and then another run for each of the $k$ omics features to obtain $\hat{g}_i$. Our second method involves only two runs. In both cases, the extra cost is a cross-product of observed omics, either of the form $MM^{T}$ or of the form $M^{T}M$. For the second method, there is an extra cost due to inversion of $MM^{T}$, which from experience is easily done for up to 50,000 individuals, and with difficulty for more than 100,000 individuals. For incomplete omics data, the main cost in G-ΩBLUP is the computation of the matrix $\Omega^{-1}$, for which we do not have a simple shortcut or alternative expression, although one may possibly be derived. The matrix $(Q^{\mathrm{ext}})^{-1}$ in GQBLUP is less expensive computationally, but it is an approximation that should be carefully explored.

The methods that we propose make an optimal use of the information and, if our assumptions hold, are expected to be more accurate than alternative methods, such as the multitrait approach of Hayes et al. (2017). However, this can only be verified through analyses of real data and by means of cross-validation. Such cross-validation, however, needs to be done carefully, since a naive cross-validation, where predictions are compared to phenotypes on individuals, would favor methods that accurately predict phenotypes from omics data, compared to methods that accurately predict breeding values from omics data. Comparing predicted breeding values on parents from omics data with phenotypes on offspring would provide an assessment of the accuracy of the heritable contribution, and may therefore solve the issue, but not all data sets span several generations. In addition, we suggest that our model or future extensions of the model may be useful for studies that would investigate possible genetic progress by including intermediate omics data into genetic evaluation.

To conclude, we think that our methods provide a useful framework for inclusion of omics data (transcriptomics, metabolomics) and similar intermediate data (metagenome, spectra) in genetic evaluation, and we think they would be useful for plant and animal breeders. However, much further work is needed both in terms of method development and in terms of suitability and accuracy of these methods in real-life data sets.

Data availability

Scripts for the simulation and analysis (using the BLUPF90 software) reported in this paper are available at http://genoweb.toulouse.inra.fr/~alegarra/GOBLUP/. A simpler, computationally inefficient implementation in software R was made to confirm the results in the paper, to confirm that three methods for complete omics data gave identical results and to provide the readers a starting point to program and modify the methods for themselves. These scripts are available at https://genetics.ghpc.au.dk/ofch/GOBLUP/.

Supplementary material is available at GENETICS online.

Supplementary Material

iyab130_Supplementary_Data

Acknowledgments

Three reviewers and the associate editor are thanked for comments that led to an improvement of the manuscript. Mark Henryon, Just Jensen, Pernille Sarup, and Tage Ostersen are thanked for pointing out that for real data sets with intermediate omics data, it is not straightforward how to validate EBVs using cross-validation. Jette Odgaard Villemoes is thanked for careful reading and language editing of the manuscript.

Funding

O.F.C. acknowledges a grant from the Green Development and Demonstration Programme (GUDP), funded by the Ministry of Food, Agriculture and Fisheries of Denmark (Grant number: 34009-19-1586).

Conflicts of interest

The authors declare that there is no conflict of interest.

Appendix A

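Breeding values are two times expected values of phenotypes on offspring. For a given individual $j$, let $y$ be the phenotype and $m_i$, $i=1,\ldots,k$, be the omics expression levels of a hypothetical offspring resulting from mating $j$ with a randomly chosen mate from a specific group of individuals. Ignoring fixed effects, $y=\sum_{i=1}^{k}m_i\alpha_i+a_r+\epsilon$ and $m_i=g_i+e_i$ for $i=1,\ldots,k$, and the genetic effects equal $g_i=(g_i^{j}+g_i^{\mathrm{mate}})/2+g_i^{M}$, where $g_i^{M}$ is the Mendelian sampling term, for $i=1,\ldots,k$, and $a_r=(a_r^{j}+a_r^{\mathrm{mate}})/2+a_r^{M}$, where $a_r^{M}$ is the Mendelian sampling term. Here, the randomness is due to sampling of mates, Mendelian sampling terms, and residual effects, whereas the genetic effects and effects of omics on phenotypes in the model are fixed (nonrandom). Taking expectation over replications of this (drawing mates, Mendelian sampling terms, and residual effects independently), we obtain

$$
\begin{aligned}
\mathrm{E}_s[y] &= \mathrm{E}_s\Big[\sum_{i=1}^{k}m_i\alpha_i+a_r+\epsilon\Big]
= \sum_{i=1}^{k}\mathrm{E}_s[g_i+e_i]\alpha_i+\mathrm{E}_s[a_r] \\
&= \sum_{i=1}^{k}\mathrm{E}_s\big[(g_i^{j}+g_i^{\mathrm{mate}})/2+g_i^{M}\big]\alpha_i+\mathrm{E}_s\big[(a_r^{j}+a_r^{\mathrm{mate}})/2+a_r^{M}\big] \\
&= \Big(\sum_{i=1}^{k}\mathrm{E}_s[g_i^{\mathrm{mate}}]\alpha_i+\mathrm{E}_s[a_r^{\mathrm{mate}}]\Big)/2+\Big(\sum_{i=1}^{k}g_i^{j}\alpha_i+a_r^{j}\Big)/2 \\
&= \text{constant}+\Big(\sum_{i=1}^{k}g_i^{j}\alpha_i+a_r^{j}\Big)/2,
\end{aligned}
$$

since $\mathrm{E}_s[\epsilon]=\mathrm{E}_s[e_i]=\mathrm{E}_s[g_i^{M}]=\mathrm{E}_s[a_r^{M}]=0$, and $\mathrm{E}_s[g_i^{\mathrm{mate}}]$, $i=1,\ldots,k$, and $\mathrm{E}_s[a_r^{\mathrm{mate}}]$ are constants. Since only half the genetic contribution is from one of the parents, the breeding value of individual $j$ is $\sum_{i=1}^{k}g_i^{j}\alpha_i+a_r^{j}$, and hence the vector of breeding values is

$$
a=G\alpha+a_r,
$$

where $G$ is the matrix containing $g_1,\ldots,g_k$ as columns.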

Appendix B

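We consider here the model in (1) and (2) and the situation where all omics expression levels are unobserved; i.e., expression levels for individuals with phenotypes are unobserved and $\mathbf{Z}$ is an identity matrix, so that $\mathbf{X}=\tilde{\mathbf{X}}$ and $\mathbf{Z}_r=\tilde{\mathbf{Z}}$. Substituting equation (2) into equation (1), we obtain

$$
\mathbf{y} = \mathbf{X}(\boldsymbol{\beta} + \tilde{\boldsymbol{\beta}}\boldsymbol{\alpha}) + \tilde{\mathbf{Z}}(\mathbf{G}\boldsymbol{\alpha} + \mathbf{a}_r) + (\mathbf{E}\boldsymbol{\alpha} + \boldsymbol{\epsilon}).
$$

We require that the model in this case be equivalent (up to first- and second-order moments) to the usual model for genetic evaluation. We see that the model above is equivalent to a model with fixed effects $\mathbf{X}\boldsymbol{\beta}$ and variance–covariance matrix $\mathrm{Var}[\tilde{\mathbf{Z}}(\mathbf{G}\boldsymbol{\alpha} + \mathbf{a}_r) + (\mathbf{E}\boldsymbol{\alpha} + \boldsymbol{\epsilon})]$. What remains to be demonstrated is that the variance–covariance structure is as in the usual genetic model.

Using formulas for conditional means, variances, and covariances, the variances and covariances equal

$$
\begin{aligned}
\mathrm{Var}[\mathbf{G}\boldsymbol{\alpha} + \mathbf{a}_r] &= \mathrm{E}\big[\mathrm{Var}[\mathbf{G}\boldsymbol{\alpha}\mid\boldsymbol{\alpha}]\big] + \mathrm{Var}\big[\mathrm{E}[\mathbf{G}\boldsymbol{\alpha}\mid\boldsymbol{\alpha}]\big] + \sigma_{a,r}^2\mathbf{H} \\
&= \mathrm{E}\Big[\sum_i \mathrm{Var}[\mathbf{g}_i]\,(\alpha_i)^2\Big] + \mathrm{Var}[\mathbf{0}] + \sigma_{a,r}^2\mathbf{H} = \Big(\sum_i \sigma_{g,i}^2\sigma_\alpha^2 + \sigma_{a,r}^2\Big)\mathbf{H}, \\
\mathrm{Var}[\mathbf{E}\boldsymbol{\alpha} + \boldsymbol{\epsilon}] &= \mathrm{E}\big[\mathrm{Var}[\mathbf{E}\boldsymbol{\alpha}\mid\boldsymbol{\alpha}]\big] + \mathrm{Var}\big[\mathrm{E}[\mathbf{E}\boldsymbol{\alpha}\mid\boldsymbol{\alpha}]\big] + \mathrm{Var}[\boldsymbol{\epsilon}] \\
&= \mathrm{E}\Big[\sum_i \mathrm{Var}[\mathbf{e}_i]\,(\alpha_i)^2\Big] + \mathrm{Var}[\mathbf{0}] + \sigma_\epsilon^2\mathbf{I} = \Big(\sum_i \sigma_{e,i}^2\sigma_\alpha^2 + \sigma_\epsilon^2\Big)\mathbf{I}, \\
\mathrm{Cov}[\mathbf{G}\boldsymbol{\alpha} + \mathbf{a}_r,\ \mathbf{E}\boldsymbol{\alpha} + \boldsymbol{\epsilon}] &= \mathrm{E}\big[\mathrm{Cov}[\mathbf{G}\boldsymbol{\alpha}, \mathbf{E}\boldsymbol{\alpha}\mid\boldsymbol{\alpha}]\big] + \mathrm{Cov}\big[\mathrm{E}[\mathbf{G}\boldsymbol{\alpha}\mid\boldsymbol{\alpha}],\ \mathrm{E}[\mathbf{E}\boldsymbol{\alpha}\mid\boldsymbol{\alpha}]\big] + 0 + 0 + 0 \\
&= \mathrm{E}[\mathbf{0}] + \mathrm{Cov}[\mathbf{0},\mathbf{0}] = \mathbf{0}.
\end{aligned}
$$

Therefore, we see that first- and second-order moments for this model are the same as those for a genetic model with genetic variance $\sum_i \sigma_{g,i}^2\sigma_\alpha^2 + \sigma_{a,r}^2$, genetic relationship matrix $\mathbf{H}$, and residual variance $\sum_i \sigma_{e,i}^2\sigma_\alpha^2 + \sigma_\epsilon^2$. This means that the model in equations (1) and (2) is a proper model for genetic evaluation, in the sense that it extends the usual models for genetic evaluation.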
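The implied variances can be illustrated with a small numerical sketch. It is not taken from the paper's scripts: the function names, the parameter values, and the choice to simulate a single non-inbred individual (so that the relevant diagonal element of $\mathbf{H}$ is 1) with normally distributed effects are assumptions made for illustration. The first function evaluates the two formulas; the second approximates the same moments by Monte Carlo, drawing $\boldsymbol{\alpha}$ afresh in each replicate because the moments are marginal over $\boldsymbol{\alpha}$.

```python
import numpy as np


def implied_variances(sigma2_g, sigma2_e, sigma2_alpha, sigma2_ar, sigma2_eps):
    """Genetic and residual variance of the equivalent usual genetic model."""
    genetic = sigma2_alpha * np.sum(sigma2_g) + sigma2_ar
    residual = sigma2_alpha * np.sum(sigma2_e) + sigma2_eps
    return genetic, residual


def simulate_moments(sigma2_g, sigma2_e, sigma2_alpha, sigma2_ar, sigma2_eps,
                     n_rep=200_000, seed=0):
    """Monte Carlo moments of (g'alpha + a_r) and (e'alpha + eps) for one individual."""
    rng = np.random.default_rng(seed)
    k = len(sigma2_g)
    alpha = rng.normal(0.0, np.sqrt(sigma2_alpha), size=(n_rep, k))
    g = rng.normal(0.0, np.sqrt(sigma2_g), size=(n_rep, k))   # one scale per feature
    e = rng.normal(0.0, np.sqrt(sigma2_e), size=(n_rep, k))
    a_r = rng.normal(0.0, np.sqrt(sigma2_ar), size=n_rep)
    eps = rng.normal(0.0, np.sqrt(sigma2_eps), size=n_rep)
    genetic_part = (g * alpha).sum(axis=1) + a_r
    residual_part = (e * alpha).sum(axis=1) + eps
    cov = np.cov(genetic_part, residual_part)[0, 1]
    return genetic_part.var(), residual_part.var(), cov


# Hypothetical parameter values for two omics features
params = dict(sigma2_g=[0.5, 0.3], sigma2_e=[1.0, 0.7],
              sigma2_alpha=0.4, sigma2_ar=0.2, sigma2_eps=1.0)
genetic, residual = implied_variances(**params)
sim_genetic, sim_residual, sim_cov = simulate_moments(**params)
print(genetic, residual, genetic / (genetic + residual))  # last value: implied heritability
print(sim_genetic, sim_residual, sim_cov)
```

The simulated variances should lie close to the formula values and the simulated covariance close to zero, up to Monte Carlo error.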

Appendix C

The derivations of the extensions of omics similarity matrices to individuals without omics data are based on the same idea as the extension of the genomic relationship to nongenotyped individuals, i.e., the so-called single-step methods (Legarra et al. 2009; Christensen and Lund 2010; Legarra et al. 2014). For this purpose, we here state the central result in that approach, but using a general notation.

Result

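Assume that the distribution of the random vectors $\mathbf{v}_1,\ldots,\mathbf{v}_m,\mathbf{z}$ is described by $\mathbf{v}_1,\ldots,\mathbf{v}_m$ being independent, $\mathbf{v}_i \sim N(\mathbf{0},\sigma_i^2\mathbf{A})$ with $\mathbf{A}$ being a positive definite and symmetric matrix with inverse $\mathbf{A}^{-1}$, and that, conditional on $\mathbf{v}_1,\ldots,\mathbf{v}_m$, the distribution is $[\mathbf{z}\mid\mathbf{V}] \sim N(\mathbf{0},\sigma_z^2\mathbf{V}\mathbf{V}^T)$, where $\mathbf{V}=[\mathbf{v}_1,\ldots,\mathbf{v}_m]$. Then the marginal distribution of $\mathbf{z}$ has mean $\mathrm{E}[\mathbf{z}]=\mathbf{0}$ and variance–covariance $\mathrm{Var}[\mathbf{z}]=\sigma_z^2\sum_i\sigma_i^2\mathbf{A}$.

Assume further that the last elements in $\mathbf{v}_1,\ldots,\mathbf{v}_m$ are observed and the remaining are not, i.e., the last rows in matrix $\mathbf{V}$ are observed,

$$
\mathbf{V} = \begin{bmatrix} \mathbf{V}_{no} \\ \mathbf{V}_{o} \end{bmatrix},
$$

with subscript $o$ denoting observed and subscript $no$ denoting nonobserved, and define submatrices of matrix $\mathbf{A}$ accordingly,

$$
\mathbf{A} = \begin{bmatrix} \mathbf{A}_{no,no} & \mathbf{A}_{no,o} \\ \mathbf{A}_{o,no} & \mathbf{A}_{o,o} \end{bmatrix}.
$$

Then, the conditional distribution of $\mathbf{z}$ given $\mathbf{V}_o$ has mean $\mathrm{E}[\mathbf{z}\mid\mathbf{V}_o]=\mathbf{0}$ and variance–covariance $\mathrm{Var}[\mathbf{z}\mid\mathbf{V}_o]=\sigma_z^2\sum_i\sigma_i^2\boldsymbol{\Psi}$, where matrix $\boldsymbol{\Psi}$ is invertible with inverse

$$
\boldsymbol{\Psi}^{-1} = \mathbf{A}^{-1} + \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \big(\mathbf{V}_o\mathbf{V}_o^T/\sum_i\sigma_i^2\big)^{-1} - (\mathbf{A}_{o,o})^{-1} \end{bmatrix},
$$

when we assume that $\mathbf{V}_o\mathbf{V}_o^T$ is of full rank.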

Proof

A proof of this result along the lines of the proofs in Christensen and Lund (2010) and Legarra et al. (2014) is briefly outlined here.

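Using properties of conditional expectations and variances, we see that $\mathrm{E}[\mathbf{z}\mid\mathbf{V}_o]=\mathrm{E}\big[\mathrm{E}[\mathbf{z}\mid\mathbf{V}]\mid\mathbf{V}_o\big]=\mathbf{0}$, and that $\mathrm{Var}[\mathbf{z}\mid\mathbf{V}_o]=\mathrm{Var}\big[\mathrm{E}[\mathbf{z}\mid\mathbf{V}]\mid\mathbf{V}_o\big]+\mathrm{E}\big[\mathrm{Var}[\mathbf{z}\mid\mathbf{V}]\mid\mathbf{V}_o\big]=\sigma_z^2\sum_{i=1}^{m}\mathrm{E}[\mathbf{v}_i\mathbf{v}_i^T\mid\mathbf{v}_{i,o}]$, where

$$
\mathrm{E}[\mathbf{v}_i\mathbf{v}_i^T\mid\mathbf{v}_{i,o}] = \begin{bmatrix}
\mathrm{Var}[\mathbf{v}_{i,no}\mid\mathbf{v}_{i,o}] + \mathrm{E}[\mathbf{v}_{i,no}\mid\mathbf{v}_{i,o}]\,\mathrm{E}[\mathbf{v}_{i,no}\mid\mathbf{v}_{i,o}]^T & \mathrm{E}[\mathbf{v}_{i,no}\mid\mathbf{v}_{i,o}]\,\mathbf{v}_{i,o}^T \\
\mathbf{v}_{i,o}\,\mathrm{E}[\mathbf{v}_{i,no}\mid\mathbf{v}_{i,o}]^T & \mathbf{v}_{i,o}\mathbf{v}_{i,o}^T
\end{bmatrix}.
$$

Results on conditional expectations and variances in the multivariate normal distribution show that $\mathrm{E}[\mathbf{v}_{i,no}\mid\mathbf{v}_{i,o}]=\mathbf{A}_{no,o}\mathbf{A}_{o,o}^{-1}\mathbf{v}_{i,o}$ and $\mathrm{Var}[\mathbf{v}_{i,no}\mid\mathbf{v}_{i,o}]=\sigma_i^2\big(\mathbf{A}_{no,no}-\mathbf{A}_{no,o}\mathbf{A}_{o,o}^{-1}\mathbf{A}_{o,no}\big)$. Inserting these results and summing over $i$, we obtain that $\mathrm{Var}[\mathbf{z}\mid\mathbf{V}_o]=\sigma_z^2\sum_i\sigma_i^2\boldsymbol{\Psi}$, where

$$
\boldsymbol{\Psi} = \begin{bmatrix}
\mathbf{A}_{no,no} + \mathbf{A}_{no,o}\mathbf{A}_{o,o}^{-1}\big(\mathbf{V}_o\mathbf{V}_o^T/\sum_i\sigma_i^2 - \mathbf{A}_{o,o}\big)\mathbf{A}_{o,o}^{-1}\mathbf{A}_{o,no} & \mathbf{A}_{no,o}\mathbf{A}_{o,o}^{-1}\,\mathbf{V}_o\mathbf{V}_o^T/\sum_i\sigma_i^2 \\
\big(\mathbf{V}_o\mathbf{V}_o^T/\sum_i\sigma_i^2\big)\mathbf{A}_{o,o}^{-1}\mathbf{A}_{o,no} & \mathbf{V}_o\mathbf{V}_o^T/\sum_i\sigma_i^2
\end{bmatrix}.
$$

Inverting this matrix, we obtain the result.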
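The final inversion step can be verified numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: the function names `psi_blockwise` and `psi_inverse`, the toy dimensions, and the randomly generated $\mathbf{A}$, $\mathbf{V}_o$ and $\sigma_i^2$ are hypothetical. It builds $\boldsymbol{\Psi}$ block by block as in the proof, builds $\boldsymbol{\Psi}^{-1}$ from the formula in the Result, and checks that their product is the identity. Since the identity is purely algebraic, any positive definite $\mathbf{A}$ and any $\mathbf{V}_o$ with $\mathbf{V}_o\mathbf{V}_o^T$ of full rank will do.

```python
import numpy as np


def psi_blockwise(A, V_o, sigma2, n_no):
    """Psi assembled from its four blocks (nonobserved rows first)."""
    A_nn, A_no_o = A[:n_no, :n_no], A[:n_no, n_no:]
    A_oo = A[n_no:, n_no:]
    S = V_o @ V_o.T / np.sum(sigma2)          # V_o V_o^T / sum_i sigma_i^2
    B = A_no_o @ np.linalg.inv(A_oo)          # A_{no,o} A_{o,o}^{-1}
    top_left = A_nn + B @ (S - A_oo) @ B.T    # B.T = A_{o,o}^{-1} A_{o,no} (A symmetric)
    return np.block([[top_left, B @ S],
                     [S @ B.T, S]])


def psi_inverse(A, V_o, sigma2, n_no):
    """Psi^{-1} = A^{-1} + [[0, 0], [0, S^{-1} - A_{o,o}^{-1}]]."""
    S = V_o @ V_o.T / np.sum(sigma2)
    A_oo = A[n_no:, n_no:]
    Psi_inv = np.linalg.inv(A)
    Psi_inv[n_no:, n_no:] += np.linalg.inv(S) - np.linalg.inv(A_oo)
    return Psi_inv


rng = np.random.default_rng(2)
n, n_no, m = 6, 2, 10                  # toy sizes; n - n_no <= m keeps V_o V_o^T full rank
L = rng.normal(size=(n, n))
A = L @ L.T + n * np.eye(n)            # symmetric positive definite stand-in for A
V_o = rng.normal(size=(n - n_no, m))   # observed rows of V
sigma2 = rng.uniform(0.5, 1.5, size=m)

Psi = psi_blockwise(A, V_o, sigma2, n_no)
Psi_inv = psi_inverse(A, V_o, sigma2, n_no)
print(np.allclose(Psi @ Psi_inv, np.eye(n)))
```

The practical point of the inverse formula is that only the block for individuals with observed data is modified, so a sparse $\mathbf{A}^{-1}$ can be reused, as in the single-step methods cited above.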

Literature cited

  1. Aliakbari A, Ehsani A, Torshizi R, Løvendahl P, Esfandyari H, et al. 2019. Genetic variance of metabolomic features and their relationship with body weight and body weight gain in Holstein cattle. J Anim Sci. 97:3832–3844.
  2. Camarinha-Silva A, Maushammer M, Wellmann R, Vital M, Preuss S, et al. 2017. Host genome influence on gut microbial composition and microbial prediction of complex traits in pigs. Genetics. 206:1637–1644.
  3. Campbell M, Hu H, Yeats T, Brzozowski L, Caffe-Treml M, et al. 2021. Improving genomic prediction for seed quality traits in oat (Avena sativa L.) using trait specific relationship matrices. Front Genet. 12:643733.
  4. Christensen OF, Lund MS. 2010. Genomic prediction when some animals are not genotyped. Genet Sel Evol. 42:2.
  5. Difford G, Plichta D, Løvendahl P, Lassen J, Noel S, et al. 2018. Host genetics and the rumen microbiome jointly associate with methane emissions in dairy cows. PLoS Genet. 14:e1007580.
  6. Gianola D, Sorensen D. 2004. Quantitative genetics models describing simultaneous and recursive relationships between phenotypes. Genetics. 167:1407–1424.
  7. Guo X, Sarup P, Jensen J, Orabi J, Kristensen N, et al. 2020. Genetic variance of metabolomic features and their relationship with malting quality traits in spring barley. Front Plant Sci. 11:575467.
  8. Guo Z, Magwire M, Basten C, Xu Z, Wang D. 2016. Evaluation of the utility of gene expression and metabolic information for genomic prediction in maize. Theor Appl Genet. 129:2413–2427.
  9. Hayes B, Panozzo J, Walker C, Choy A, Kant S, et al. 2017. Accelerating wheat breeding for end-use quality with multi-trait genomic predictions incorporating near infrared and nuclear magnetic resonance-derived phenotypes. Theor Appl Genet. 130:2505–2519.
  10. Henderson CR. 1976. A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics. 32:69–83.
  11. Legarra A, Aguilar I, Misztal I. 2009. A relationship matrix including full pedigree and genomic information. J Dairy Sci. 92:4656–4663.
  12. Legarra A, Christensen OF, Aguilar I, Misztal I. 2014. Single step, a general approach for genomic selection. Livest Sci. 166:54–65.
  13. Mancuso N, Freund M, Johnson R, Shi H, Kichaev G, et al. 2019. Probabilistic fine-mapping of transcriptome-wide association studies. Nat Genet. 51:675–682.
  14. Meuwissen THE, Hayes BJ, Goddard ME. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 157:1819–1829.
  15. Morgante F, Huang W, Sørensen P, Maltecca C, Mackay T. 2020. Leveraging multiple layers of data to predict Drosophila complex traits. G3 (Bethesda). 10:4599–4613.
  16. Negussie E, Lehtinen J, Mäntysaari P, Bayat A, Liinamo A-E, et al. 2017. Non-invasive individual methane measurement in dairy cows. Animal. 11:890–899.
  17. O'Leary NW, Byrne DT, O'Connor AH, Shalloo L. 2020. Invited review: cattle lameness detection with accelerometers. J Dairy Sci. 103:3895–3911.
  18. Qian J, Ray E, Brecha R, Reilly M, Foulkes A. 2019. A likelihood-based approach to transcriptome association analysis. Stat Med. 38:1357–1373.
  19. Riedelsheimer C, Czedik-Eysenberg A, Grieder C, Lisec J, Technow F, et al. 2012. Genomic and metabolic prediction of complex heterotic traits in hybrid maize. Nat Genet. 44:217–220.
  20. Rincent R, Charpentier J, Faivre-Rampant P, Pauz E, Gouis C, et al. 2018. Phenotypic selection is a low-cost and high-throughput method based on indirect predictions: proof of concept on wheat and poplar. G3 (Bethesda). 8:3961–3972.
  21. Sargolzaei M, Schenkel FS. 2009. QMSim: a large-scale genome simulator for livestock. Bioinformatics. 25:680–681.
  22. Stanton BJ, Neale DB, Shanwen L. 2010. Populus breeding: from the classical to the genomic approach. In: Jansson RBS, Groover A, editors. Genetics and Genomics of Populus. New York, NY: Springer. p. 309–348.
  23. Strandén I, Garrick DJ. 2009. Technical note: derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J Dairy Sci. 92:2971–2975.
  24. Su G, Christensen OF, Ostersen T, Henryon M, Lund MS. 2012. Estimating additive and non-additive genetic variances and predicting genetic merits using genome-wide dense single nucleotide polymorphism markers. PLoS One. 7:e45293.
  25. VanRaden PM. 2008. Efficient methods to compute genomic predictions. J Dairy Sci. 91:4414–4423.
  26. Varona L, Sorensen D, Thompson R. 2007. Analysis of litter size and average litter weights in pigs using a recursive model. Genetics. 177:1791–1799.
  27. Weishaar R, Wellmann R, Camarinha-Silva A, Rodehutscord M, Bennewitz J. 2020. Selecting the hologenome to breed for an improved feed efficiency in pigs—a novel selection index. J Anim Breed Genet. 137:14–22.
  28. Zhang Z, Liu J, Ding X, Bijma P, de Koning D-J, et al. 2010. Best linear unbiased prediction of genomic breeding values using a trait-specific marker-derived relationship matrix. PLoS One. 5:e12648.


Data Availability Statement

Scripts for the simulation and analysis (using the BLUPF90 software) reported in this paper are available at http://genoweb.toulouse.inra.fr/~alegarra/GOBLUP/. A simpler, computationally inefficient implementation in R was made to confirm the results in the paper, to confirm that the three methods for complete omics data give identical results, and to provide readers with a starting point for programming and modifying the methods themselves. These scripts are available at https://genetics.ghpc.au.dk/ofch/GOBLUP/.

Supplementary material is available at GENETICS online.


Articles from Genetics are provided here courtesy of Oxford University Press
