2022 Sep 6;135(10):3393–3415. doi: 10.1007/s00122-022-04186-w

Genomic selection using random regressions on known and latent environmental covariates

Daniel J Tolhurst 1, R Chris Gaynor 2, Brian Gardunia 2, John M Hickey 3, Gregor Gorjanc 1
PMCID: PMC9519718  PMID: 36066596

Abstract

Key message

The integration of known and latent environmental covariates within a single-stage genomic selection approach provides breeders with an informative and practical framework to utilise genotype by environment interaction for prediction into current and future environments.

Abstract

This paper develops a single-stage genomic selection approach which integrates known and latent environmental covariates within a special factor analytic framework. The factor analytic linear mixed model of Smith et al. (2001) is an effective method for analysing multi-environment trial (MET) datasets, but has limited practicality since the underlying factors are latent so the modelled genotype by environment interaction (GEI) is observable, rather than predictable. The advantage of using random regressions on known environmental covariates, such as soil moisture and daily temperature, is that the modelled GEI becomes predictable. The integrated factor analytic linear mixed model (IFA-LMM) developed in this paper includes a model for predictable and observable GEI in terms of a joint set of known and latent environmental covariates. The IFA-LMM is demonstrated on a late-stage cotton breeding MET dataset from Bayer CropScience. The results show that the known covariates predominately capture crossover GEI and explain 34.4% of the overall genetic variance. The most notable covariates are maximum downward solar radiation (10.1%), average cloud cover (4.5%) and maximum temperature (4.0%). The latent covariates predominately capture non-crossover GEI and explain 40.5% of the overall genetic variance. The results also show that the average prediction accuracy of the IFA-LMM is 0.02-0.10 higher than conventional random regression models for current environments and 0.06-0.24 higher for future environments. The IFA-LMM is therefore an effective method for analysing MET datasets which also utilises crossover and non-crossover GEI for genomic prediction into current and future environments. This is becoming increasingly important with the emergence of rapidly changing environments and climate change.

Supplementary Information

The online version contains supplementary material available at 10.1007/s00122-022-04186-w.

Introduction

This paper develops a single-stage genomic selection (GS) approach which integrates known and latent environmental covariates within a special factor analytic framework. The factor analytic linear mixed model of Smith et al. (2001) is an effective method for analysing multi-environment trial (MET) datasets, which includes a parsimonious model for genotype by environment interaction (GEI). The advantage of using random regressions on known environmental covariates, such as soil moisture and maximum temperature, is that the modelled GEI becomes predictable. The GS approach developed in this paper exploits the desirable features of both classes of model.

Genomic selection is a form of marker-assisted selection that can improve the genetic gain in animal and plant breeding programmes (Meuwissen et al. 2001). In plant breeding, however, GS is often restricted by the presence of GEI, that is the change in genotype response to a change in environment. There are two appealing features of using known environmental covariates for GS; (i) meaningful biological interpretation can be ascribed to GEI and (ii) predictions can be obtained for any tested or untested genotype into any current or future environment. These features represent two long-standing objectives of many plant breeding programmes.

Regressions on known environmental covariates were first used in plant breeding by Yates and Cochran (1938). Their approach was later popularised by Finlay and Wilkinson (1963), and includes a fixed coefficient regression on a set of environmental mean yields (covariates) with a separate intercept and slope for each genotype. Hardwick and Wood (1972) extended the fixed regression model to include a more complex set of environmental covariates, such as moisture and temperature (also see Wood 1976). These approaches have distinct limitations when used to analyse MET datasets, however (Smith et al. 2005). An alternative approach is to use a linear mixed model with a random coefficient regression. This approach was popularised by Laird and Ware (1982), and requires an appropriate variance model for the intercepts and slopes which ensures the regression is scale and translational invariant. Heslot et al. (2014) extended the random regression model for GS using a set of genotype covariates derived from marker data and a set of environmental covariates derived from weather data. They were unable to fit an appropriate variance model for the intercepts and slopes, however, so that the regression was not translational invariant. At a similar time, Jarquín et al. (2014) demonstrated an even simpler random regression model for a very large set of correlated environmental covariates. They found that the environmental covariates explained only 23% of the overall genetic variance. These examples highlight the current limitations of using known environmental covariates for GS. That is, they are often highly correlated and only explain a small proportion of GEI, and fitting an appropriate variance model is typically computationally prohibitive (Brancourt-Hulmel et al. 2000; Buntaran et al. 2021).

The factor analytic linear mixed model of Smith et al. (2001) includes a latent regression model for GEI in terms of a small number of common factors (also see Piepho 1997). This approach is a linear mixed model analogue to AMMI (Gauch 1992) and GGE (Yan et al. 2000), or more specifically factor analysis (Mardia et al. 1979), where the factors involve some combination of latent environmental covariates. It also bears similarities to the ordinary regression models with one important difference; the environmental covariates are estimated from the data as well as the genotype slopes. Several authors have discussed the addition of intercepts to the factor analytic model in an attempt to obtain a simple average (simple main effect) for each genotype, but note there are issues which limit their interpretability (Smith 1999).

The factor analytic linear mixed model has been widely adopted for the analysis of MET datasets (Ukrainetz et al. 2018). The two main variants involve pedigree or marker data (Oakey et al. 2007, 2016). Recently, Tolhurst et al. (2019) demonstrated a factor analytic linear mixed model for GS within a major Australian plant breeding programme. They demonstrated genomic selection tools to obtain a measure of overall performance (generalised main effect) and stability for each genotype (Smith and Cullis 2018). There is one limitation of this approach, however. The common factors are latent so the modelled GEI is observable, rather than predictable. This limitation has led to ad hoc post-processing of the latent factors with known covariates (Oliveira et al. 2020).

Until now, the analysis of MET datasets has involved only one set of known or latent environmental covariates. The aim of this paper is to extend the GS approach of Tolhurst et al. (2019) to integrate both known and latent environmental covariates. This new approach is hereafter referred to as the integrated factor analytic linear mixed model (IFA-LMM). There are three appealing features of the IFA-LMM:

  1. The IFA-LMM includes a regression model for GEI in terms of a small number of known and latent common factors. This simultaneously reduces the dimension of the known and latent environmental covariates.

  2. The regression model captures predictable GEI in terms of known covariates. This enables meaningful interpretation of GEI and genomic prediction into any current or future environment.

  3. The regression model also captures observable GEI in terms of latent covariates, which are orthogonal to the known covariates. This enables the regression model to capture a large proportion of GEI overall, and thence enables the IFA-LMM to be an effective method for analysing MET datasets.

The IFA-LMM is demonstrated on a late-stage cotton breeding MET dataset from Bayer CropScience. The predictive ability of the IFA-LMM is compared to several popular random regression models.

Materials and methods

The Bayer CropScience Cotton Breeding Programme evaluates the commercial merit of test genotypes by annually conducting multi-environment field trials. There are two late-stages of field evaluation considered in this paper, referred to as preliminary commercial P1 and P2. The 2017 P1 MET dataset comprises the current set of environments and will be used to train all random regression models. The 2018 P2 MET dataset will be used to assess the predictive ability into future environments.

Data description

Experimental design and phenotypic data

Table 1 presents a summary of the 2017 P1 MET dataset for seed cotton yield. There were 72 field trials conducted in 24 environments across eight states in the Southeast, Midsouth and Texas growing regions of the USA (Fig. 1). A total of 208 genotypes were evaluated across all environments. Each environment consisted of three trials. Each trial was designed as a randomised complete block design with 144 plots comprising two replicate blocks of 68 test genotypes plus four checks. Yield data were recorded on most plots, with 6.54% missing. The number of non-missing plots per test genotype ranged from 39 to 47, with a mean of 45. The number of non-missing genotypes in common between environments ranged from 173 to 208, with a mean of 204. The mean yield and generalised narrow-sense heritability (Oakey et al. 2006) varied substantially between environments and growing regions.
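As a quick consistency check on the design figures quoted above, the plot counts and the missing-data percentage can be recomputed from the stated numbers. This is only an arithmetic sketch; the variable names are illustrative and the count of 678 missing plots is taken from the Overall row of Table 1.

```python
# Design arithmetic for the 2017 P1 MET dataset, using figures quoted in the text.
test_genotypes_per_trial = 68
checks_per_trial = 4
replicate_blocks = 2
trials = 72
missing_plots = 678  # "Overall" row of Table 1

# Each trial has two replicate blocks of (68 test genotypes + 4 checks).
plots_per_trial = replicate_blocks * (test_genotypes_per_trial + checks_per_trial)
total_plots = trials * plots_per_trial
missing_pct = 100 * missing_plots / total_plots

print(plots_per_trial)        # 144
print(total_plots)            # 10368
print(round(missing_pct, 2))  # 6.54
```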

Table 1.

Summary of the 2017 P1 MET dataset for seed cotton yield

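| State | Env | Trials | Genotypes (total) | Genotypes (1 rep) | Genotypes (2 reps) | Genotypes (>2 reps) | Plots (total) | Plots (NAs) | Mean yield (t/ha) | h² |
|---|---|---|---|---|---|---|---|---|---|---|
| North Carolina | 17NC1 | 3 | 208 | 15 | 189 | 4 | 432 | 16 | 1.43 | 0.48 |
| South Carolina | 17SC1 | 3 | 206 | 0 | 202 | 4 | 432 | 5 | 1.63 | 0.59 |
| South Carolina | 17SC2 | 3 | 183 | 52 | 127 | 4 | 432 | 107 | 1.94 | 0.46 |
| South Carolina | 17SC3 | 3 | 208 | 5 | 199 | 4 | 432 | 5 | 2.32 | 0.50 |
| Georgia | 17GA1 | 3 | 208 | 2 | 202 | 4 | 432 | 3 | 1.72 | 0.59 |
| Georgia | 17GA2 | 3 | 208 | 2 | 202 | 4 | 432 | 2 | 1.92 | 0.64 |
| Georgia | 17GA3 | 3 | 208 | 2 | 202 | 4 | 432 | 2 | 1.74 | 0.50 |
| Georgia | 17GA4 | 3 | 208 | 2 | 202 | 4 | 432 | 2 | 1.62 | 0.49 |
| ° Missouri | 17MO1 | 3 | 207 | 69 | 134 | 4 | 432 | 76 | 1.95 | 0.61 |
| ° Arkansas | 17AR1 | 3 | 207 | 18 | 185 | 4 | 432 | 20 | 0.99 | 0.24 |
| ° Arkansas | 17AR2 | 3 | 205 | 2 | 199 | 4 | 432 | 9 | 1.63 | 0.83 |
| ° Mississippi | 17MS1 | 3 | 204 | 9 | 191 | 4 | 432 | 19 | 1.21 | 0.57 |
| ° Mississippi | 17MS2 | 3 | 207 | 6 | 197 | 4 | 432 | 10 | 1.93 | 0.63 |
| ° Mississippi | 17MS3 | 3 | 207 | 140 | 63 | 4 | 432 | 150 | 0.91 | 0.55 |
| ° Louisiana | 17LA1 | 3 | 208 | 4 | 200 | 4 | 432 | 6 | 1.32 | 0.72 |
| ° Louisiana | 17LA2 | 3 | 208 | 11 | 193 | 4 | 432 | 12 | 1.16 | 0.60 |
| × Texas | 17TX1 | 3 | 208 | 1 | 203 | 4 | 432 | 1 | 2.12 | 0.62 |
| × Texas | 17TX2 | 3 | 208 | 2 | 202 | 4 | 432 | 2 | 1.79 | 0.59 |
| × Texas | 17TX3 | 3 | 207 | 4 | 199 | 4 | 432 | 7 | 2.05 | 0.72 |
| × Texas | 17TX4 | 3 | 208 | 4 | 200 | 4 | 432 | 4 | 1.86 | 0.38 |
| × Texas | 17TX5 | 3 | 198 | 132 | 62 | 4 | 432 | 161 | 1.38 | 0.56 |
| × Texas | 17TX6 | 3 | 206 | 29 | 173 | 4 | 432 | 33 | 1.95 | 0.43 |
| × Texas | 17TX7 | 3 | 208 | 7 | 197 | 4 | 432 | 7 | 1.77 | 0.56 |
| × Texas | 17TX8 | 3 | 208 | 18 | 186 | 4 | 432 | 19 | 2.57 | 0.40 |
| Overall | | 72 | 208 | | | | 10,368 | 678 | 1.70 | 0.55 |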

Presented for each environment is the number of trials, genotypes (with one, two or more replicates) and plots (total and missing), as well as the mean yield (t/ha) and generalised narrow-sense heritability (h2)

Note: Symbols distinguish the Southeast, ° Midsouth and × Texas growing regions

*Total number after missing plots removed

Fig. 1.


Map of the cotton growing environments in the 2017 P1 and 2018 P2 MET datasets. Note: States and years are distinguished by colour and growing regions are distinguished by shape

Supplementary Table 9 presents a summary of the 2018 P2 MET dataset for seed cotton yield. There were 20 field trials conducted in 20 environments across six states of the USA (Fig. 1). Eleven trials were conducted in the same locations as the 2017 P1 trials and nine were conducted in new locations. A total of 55 genotypes were evaluated in all trials, with all genotypes previously evaluated in 2017 P1. Each trial was designed as a completely randomised design with a single replicate of all 55 genotypes. Note that only three environments were harvested in the Southeast due to severe weather.

Environmental data

Table 2 and Supplementary Table 10 present a summary of the known environmental covariates in the 2017 P1 and 2018 P2 MET datasets. There were 18 covariates available for all 44 environments, including latitude and longitude as well as 11 covariates derived from daily weather data and 5 covariates derived from daily soil data. These tables show that the known covariates vary substantially within and between growing regions, as well as between years. Each covariate was then centred and scaled to unit length for all subsequent analyses. The practical implication of this will be discussed in  “Regressions on latent covariates”.

Table 2.

Summary of the known environmental covariates in the 2017 P1 MET dataset

| Covariate | Description (units) | Southeast Min | Southeast Mean | Southeast Max | Midsouth Min | Midsouth Mean | Midsouth Max | × Texas Min | × Texas Mean | × Texas Max |
|---|---|---|---|---|---|---|---|---|---|---|
| LAT | latitude (°) | 31.0 | 33.0 | 35.4 | 31.6 | 33.6 | 36.4 | 31.4 | 33.2 | 34.9 |
| LONG | longitude (°) | −84.7 | −81.7 | −78.0 | −91.9 | −91.1 | −89.7 | −102.3 | −101.1 | −99.5 |
| avgCCR | average cloud cover (%) | 53.4 | 56.0 | 59.1 | 46.6 | 48.7 | 52.2 | 32.1 | 34.5 | 37.0 |
| minHUM | min humidity (%) | 43.7 | 47.7 | 53.7 | 52.0 | 53.4 | 55.9 | 30.1 | 34.0 | 40.4 |
| maxDSR | max downward solar radiation (W/m²) | 0.74 | 0.76 | 0.77 | 0.75 | 0.76 | 0.77 | 0.82 | 0.85 | 0.87 |
| maxNSR | max net solar radiation (W/m²) | 0.62 | 0.64 | 0.66 | 0.63 | 0.64 | 0.65 | 0.68 | 0.68 | 0.70 |
| maxPRP | max precipitation (mm/hr) | 2.4 | 2.9 | 3.4 | 1.7 | 2.6 | 3.6 | 1.1 | 1.4 | 1.8 |
| totPRP | total precipitation (mm/day) | 3.2 | 3.5 | 4.2 | 3.0 | 3.7 | 4.9 | 1.3 | 1.6 | 2.1 |
| maxDPT | max dew point temperature (°C) | 20.5 | 21.1 | 22.1 | 18.9 | 20.7 | 22.0 | 13.5 | 15.7 | 17.6 |
| maxTMP | max temperature (°C) | 28.5 | 30.3 | 31.5 | 27.6 | 28.9 | 29.6 | 28.7 | 30.3 | 32.1 |
| minTMP | min temperature (°C) | 19.0 | 20.1 | 21.0 | 17.9 | 19.5 | 20.4 | 15.4 | 17.4 | 19.5 |
| minWSP | min wind speed (km/hr) | 4.9 | 5.2 | 5.7 | 4.7 | 4.9 | 5.0 | 7.4 | 8.1 | 9.4 |
| avgWDR | average wind direction (azimuth degrees) | 166.7 | 175.8 | 181.5 | 152.3 | 161.9 | 174.0 | 144.7 | 152.9 | 162.9 |
| maxST1 | max soil temperature 1 (°C) | 27.6 | 29.9 | 31.3 | 27.0 | 28.3 | 29.1 | 29.5 | 32.2 | 34.5 |
| minST1 | min soil temperature 1 (°C) | 19.8 | 21.8 | 23.2 | 19.3 | 20.6 | 21.5 | 19.0 | 20.6 | 22.8 |
| avgSM3 | soil moisture 3 (%) | 7.0 | 23.8 | 42.3 | 28.0 | 30.1 | 32.7 | 11.4 | 19.0 | 25.6 |
| avgSM4 | soil moisture 4 (%) | 10.0 | 29.5 | 44.6 | 29.8 | 32.9 | 35.3 | 8.3 | 15.8 | 21.8 |
| minST4 | min soil temperature 4 (°C) | 20.0 | 22.4 | 24.2 | 20.2 | 22.0 | 23.0 | 21.0 | 22.9 | 25.2 |

Note: Values presented are prior to centring and scaling

Presented for each covariate is the minimum, mean and maximum for the Southeast, Midsouth and × Texas growing regions

Marker data

Marker data were available for 204 (of the 208) genotypes in 2017 P1, which included all 55 genotypes in 2018 P2. The markers correspond to a high confidence set of 36,009 single-nucleotide polymorphisms. Genotypes were coded as either −1, 0 or 1 for the homozygous minor, heterozygous or homozygous major alleles at each marker. The frequency of heterozygous markers was low given the level of selfing accumulated up to the P1 stage. Monomorphic markers were then removed and missing markers were imputed using the k-nearest neighbour approach of Troyanskaya et al. (2001), with k=10. Note that the four genotypes without marker data are of no practical interest (see Tolhurst et al. 2019, for further details).

The genomic relationship matrix was constructed using the pedicure package (Butler 2019) in R (R Core Team 2021). The default settings in pedicure were used as filters, with minor allele frequency > 0.002% and missing marker frequency < 0.998%. A total of 24,265 markers were retained using these criteria. The diagonal elements of the relationship matrix ranged from 0.004 to 2.022, with a mean of 1.234. The off-diagonals ranged from −0.388 to 1.322, with a mean of −0.006.

Statistical models

Preliminaries

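Assume the MET dataset comprises $v=204$ genotypes evaluated in $t=72$ field trials conducted across $p=24$ environments, where $t=\sum_{j=1}^{p}t_j$ and $t_j=3$ is the number of trials in environment $j$. Let the $n$-vector of phenotypic data be given by $\mathbf{y}=(\mathbf{y}_1^\top,\mathbf{y}_2^\top,\ldots,\mathbf{y}_p^\top)^\top$, where $\mathbf{y}_j=(\mathbf{y}_{j;1}^\top,\mathbf{y}_{j;2}^\top,\ldots,\mathbf{y}_{j;t_j}^\top)^\top$ is the $n_j$-vector for environment $j$ and $\mathbf{y}_{j;k}$ is the $n_{jk}$-vector for trial $k$ in environment $j$. The length of $\mathbf{y}$ is therefore given by:

$$n=\sum_{j=1}^{p}\sum_{k=1}^{t_j}n_{jk}=\sum_{j=1}^{p}n_j.$$

Lastly, assume all $p=24$ environments have $q=18$ known covariates available, that is assume $p>q$. Let the $p\times q$ matrix of covariates be given by $\mathbf{S}=[\mathbf{s}_1\ \mathbf{s}_2\ \ldots\ \mathbf{s}_q]$, with columns given by the centred and scaled environment scores for each covariate, such that $\mathbf{s}_i^\top\mathbf{s}_i=1$.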
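The centring and scaling of the covariates to unit length can be sketched as follows. This is a minimal illustration on simulated values, not the covariate data of Table 2; the function name `centre_and_scale` is hypothetical.

```python
import numpy as np

def centre_and_scale(raw):
    """Column-centre a p x q covariate matrix and scale each column to unit length."""
    raw = np.asarray(raw, dtype=float)
    centred = raw - raw.mean(axis=0)
    norms = np.linalg.norm(centred, axis=0)
    if np.any(norms == 0):
        raise ValueError("a covariate with no variation cannot be scaled")
    return centred / norms

rng = np.random.default_rng(1)
p, q = 24, 18                                   # environments and covariates, as in the text
S = centre_and_scale(rng.normal(size=(p, q)))   # simulated stand-in for the known covariates

column_sums = S.sum(axis=0)            # zero after centring
squared_lengths = np.sum(S**2, axis=0)  # s_i' s_i, equal to one after scaling
```

With this scaling every covariate has the same length, so a slope variance refers to the same units across covariates.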

Linear mixed model

The linear mixed model for y can be written as:

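$$\mathbf{y}=\mathbf{X}\boldsymbol{\tau}+\mathbf{Z}\mathbf{u}+\mathbf{Z}_p\mathbf{u}_p+\mathbf{e},\tag{1}$$

where $\boldsymbol{\tau}$ is a vector of fixed effects with design matrix $\mathbf{X}$, $\mathbf{u}$ is a $vp$-vector of random genotype by environment (GE) effects with $n\times vp$ design matrix $\mathbf{Z}$, $\mathbf{u}_p$ is a vector of random non-genetic peripheral effects with design matrix $\mathbf{Z}_p$ and $\mathbf{e}$ is the $n$-vector of residuals.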

The vector of fixed effects, τ, includes the mean parameter for each environment. This vector is fitted as fixed following a classical quantitative genetics approach where the GE effects in different environments are regarded as different correlated traits (Falconer and Mackay 1996). This vector can be extended to a regression on known environmental covariates, with:

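$$\boldsymbol{\tau}=\mathbf{1}_p\mu+\mathbf{S}\boldsymbol{\tau}_s+\boldsymbol{\omega},\tag{2}$$

where $\mu$ is the overall mean parameter (intercept), $\mathbf{S}$ is the $p\times q$ matrix of known covariates, $\boldsymbol{\tau}_s$ is a $q$-vector with elements given by the mean response of genotypes to each covariate and $\boldsymbol{\omega}$ is a $p$-vector of residual environmental effects, with $\boldsymbol{\omega}\sim\mathrm{N}(\mathbf{0},\sigma_\omega^2\mathbf{I}_p)$.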

The vector of random non-genetic effects, $\mathbf{u}_p$, accommodates the plot structures of trials and environments (Bailey 2008). This vector is fitted as random to enable recovery of information across incomplete blocks and trials (Patterson and Thompson 1971). Other effects in $\mathbf{u}_p$ may accommodate extraneous variations across field columns and rows (Gilmour et al. 1997).

It is assumed that:

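$$\begin{bmatrix}\mathbf{u}\\ \mathbf{u}_p\\ \mathbf{e}\end{bmatrix}\sim\mathrm{N}\left(\begin{bmatrix}\mathbf{0}\\ \mathbf{0}\\ \mathbf{0}\end{bmatrix},\begin{bmatrix}\mathbf{G}&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\mathbf{G}_p&\mathbf{0}\\ \mathbf{0}&\mathbf{0}&\mathbf{R}\end{bmatrix}\right).\tag{3}$$

Following Tolhurst et al. (2019), $\mathbf{G}_p=\oplus_{j=1}^{p}\mathbf{G}_{p_j}$ is diagonal with a separate variance component model for the $j$th environment and $\mathbf{R}=\oplus_{j=1}^{p}\mathbf{R}_j$ is block diagonal with a two-dimensional spatial model for the $j$th environment. The form of $\mathbf{G}$ is presented below, but note that all variance matrices in Eq. 3 are fitted at the environment level, not trial level. This completely aligns the non-genetic and residual variance models with the genetic variance model.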

Variance model for the GE effects

The GE effects are modelled using r=24,265 markers, and therefore referred to as the additive GE effects. This model is an extension of the univariate GBLUP model (Stranden and Garrick 2009), with:

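$$\mathbf{u}=(\mathbf{I}_p\otimes\mathbf{M})\mathbf{u}_m\quad\text{and}\quad\mathbf{G}=\mathbf{G}_e\otimes\mathbf{M}\mathbf{M}^\top/m=\mathbf{G}_e\otimes\mathbf{G}_g,\tag{4}$$

where $\mathbf{M}=[\mathbf{m}_1\ \mathbf{m}_2\ \ldots\ \mathbf{m}_r]$ is a $v\times r$ design matrix with columns given by the centred genotype scores for each marker, $\mathbf{u}_m$ is an $rp$-vector of additive marker by environment effects, $\mathbf{G}_e$ is a $p\times p$ additive genetic variance matrix between environments and $\mathbf{G}_g=\mathbf{M}\mathbf{M}^\top/m$ is the $v\times v$ genomic relationship matrix between genotypes (VanRaden 2008).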
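The separable structure in Eq. 4 can be illustrated numerically. The sketch below uses simulated markers and toy dimensions, and makes two assumptions that are not stated in the text: the scaling constant $m$ is taken as $2\sum_i p_i(1-p_i)$ in the style of VanRaden (2008), and the marker effects are given the implied variance $\mathbf{G}_e\otimes\mathbf{I}_r/m$. The filters and defaults of the pedicure package are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(2)
v, r, p = 6, 200, 3   # toy sizes, not the 204 genotypes and 24,265 markers of the paper

# Simulated marker scores coded -1, 0, 1 (hypothetical data).
raw = rng.choice([-1.0, 0.0, 1.0], size=(v, r), p=[0.3, 0.1, 0.6])
raw = raw[:, raw.std(axis=0) > 0]        # drop monomorphic markers
r = raw.shape[1]

M = raw - raw.mean(axis=0)               # centred genotype scores
freq = (raw.mean(axis=0) + 1.0) / 2.0    # frequency of the allele coded +1
m = 2.0 * np.sum(freq * (1.0 - freq))    # ASSUMED scaling constant
Gg = M @ M.T / m                         # genomic relationship matrix

A0 = rng.normal(size=(p, p))
Ge = A0 @ A0.T + np.eye(p)               # arbitrary positive definite Ge for illustration

# Variance of the GE effects in Eq. 4.
G = np.kron(Ge, Gg)

# Same matrix built from the marker level, assuming var(u_m) = Ge (x) I_r / m.
design = np.kron(np.eye(p), M)
var_um = np.kron(Ge, np.eye(r)) / m
G_from_markers = design @ var_um @ design.T
```

The two constructions agree because of the mixed-product property of the Kronecker product.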

The random regression models for u considered in this paper include:

  1. Latent covariates; models with simple or generalised main effects.

  2. Known covariates; models with or without translational invariance.

  3. Known and latent covariates; models with generalised main effects and translational invariance.

All regression models are summarised in Table 3, with full details provided below.

Table 3.

Summary of the variance models for the additive GE effects considered in this paper

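| Model | Description | $\mathbf{G}_e$ | Notation | Parameters | Reference |
|---|---|---|---|---|---|
| id | Identity | $\sigma_{ge}^2\mathbf{I}_p$ | | $1$ | |
| diag | Diagonal | $\boldsymbol{\Sigma}_{ge}$ | $\boldsymbol{\Sigma}_{ge}=\oplus_{j=1}^{p}\sigma_{ge_j}^2$ | $p$ | |
| comp | Compound symmetry | $\sigma_g^2\mathbf{J}_p+\sigma_{ge}^2\mathbf{I}_p$ | $\mathbf{J}_p=\mathbf{1}_p\mathbf{1}_p^\top$ | $2$ | Patterson et al. (1977) |
| mdiag | Main effects plus diagonal | $\sigma_g^2\mathbf{J}_p+\boldsymbol{\Sigma}_{ge}$ | | $p+1$ | Cullis et al. (1998) |
| FAMk | Factor analytic plus main effects | $\sigma_1^2\mathbf{J}_p+\boldsymbol{\Lambda}\mathbf{D}\boldsymbol{\Lambda}^\top+\boldsymbol{\Psi}$ | $\mathbf{D}=\oplus_{l=1}^{k}d_l$ | $p(k+1)-k(k-1)/2+1$ | Smith et al. (2001) |
| FAk | Factor analytic | $\boldsymbol{\Lambda}\mathbf{D}\boldsymbol{\Lambda}^\top+\boldsymbol{\Psi}$ | $\boldsymbol{\Psi}=\oplus_{j=1}^{p}\psi_j$ | $p(k+1)-k(k-1)/2$ | Smith et al. (2001) |
| rreg1 | Random regression 1 | $\sigma_g^2\mathbf{J}_p+\sigma_s^2\mathbf{S}\mathbf{S}^\top+\boldsymbol{\Psi}$ | | $p+2$ | Jarquín et al. (2014) |
| rreg2 | Random regression 2 | $\sigma_g^2\mathbf{J}_p+\mathbf{S}\boldsymbol{\Sigma}_s\mathbf{S}^\top+\boldsymbol{\Psi}$ | $\boldsymbol{\Sigma}_s=\oplus_{i=1}^{q}\sigma_{s_i}^2$ | $p+q+1$ | Heslot et al. (2014) |
| FARk | Factor analytic regression | $[\bar{\mathbf{1}}_p\ \mathbf{S}]\boldsymbol{\Lambda}_a\mathbf{D}\boldsymbol{\Lambda}_a^\top[\bar{\mathbf{1}}_p\ \mathbf{S}]^\top+\boldsymbol{\Psi}$ | $\boldsymbol{\Lambda}_a=[\boldsymbol{\Lambda}_g^\top\ \boldsymbol{\Lambda}_s^\top]^\top$ | $p+k(2q-k+3)/2$ | Jennrich and Schluchter (1986) |
| IFAk | Integrated factor analytic | $[\mathbf{S}\ \boldsymbol{\Gamma}]\boldsymbol{\Lambda}_b\mathbf{D}\boldsymbol{\Lambda}_b^\top[\mathbf{S}\ \boldsymbol{\Gamma}]^\top+\boldsymbol{\Psi}$ | $\boldsymbol{\Lambda}_b=[\boldsymbol{\Lambda}_s^\top\ \boldsymbol{\Lambda}_r^\top]^\top$ | $p(k+1)-k(k-1)/2$ | This paper |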

Presented for each model is the structure of the additive genetic variance matrix between environments (Ge), number of estimated variance parameters and the reference

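Note: The $vp$-vector of additive GE effects is given by $\mathbf{u}$ with $\mathrm{var}(\mathbf{u})=\mathbf{G}_e\otimes\mathbf{G}_g$, where $\mathbf{G}_e$ ($p\times p$) is the variance matrix between environments and $\mathbf{G}_g$ ($v\times v$) is the genomic relationship matrix between genotypes. Also note that $\bar{\mathbf{1}}_p=\mathbf{1}_p/\sqrt{p}$, $\boldsymbol{\Lambda}$ ($p\times k$) is a matrix of latent covariates with $p$ environments and $k$ factors, $\mathbf{S}$ ($p\times q$) is a matrix of known covariates with $q$ covariates and $\boldsymbol{\Gamma}$ ($p\times(p-q)$) is an orthogonal projection matrix, with $\mathbf{S}^\top\boldsymbol{\Gamma}=\mathbf{0}$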
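The parameter counts in Table 3 can be tabulated with a small helper. This is a sketch only: the function name and the model keys are hypothetical, and the choice of $k=3$ in the usage lines is an arbitrary illustration rather than the order fitted in the paper.

```python
def n_variance_parameters(model, p, q=0, k=0):
    """Number of estimated variance parameters for each Ge structure in Table 3."""
    fa = p * (k + 1) - k * (k - 1) // 2           # conventional FAk count
    counts = {
        "id": 1,
        "diag": p,
        "comp": 2,
        "mdiag": p + 1,
        "FAM": fa + 1,
        "FA": fa,
        "rreg1": p + 2,
        "rreg2": p + q + 1,
        "FAR": p + k * (2 * q - k + 3) // 2,
        "IFA": fa,
    }
    return counts[model]

# p = 24 environments and q = 18 covariates as in the text; k = 3 is illustrative.
print(n_variance_parameters("FA", 24, 18, 3))    # 93
print(n_variance_parameters("FAR", 24, 18, 3))   # 78
print(n_variance_parameters("IFA", 24, 18, 3))   # 93
```

The IFAk and FAk counts coincide, which matches the statement in the text that the choice of $\boldsymbol{\Gamma}$ keeps the number of variance parameters equal to the conventional FAk model.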

Regressions on latent covariates

The factor analytic model is effective for modelling the covariances between additive GE effects in terms of a small number of latent common factors (Kelly et al. 2007). The two variants considered in this paper include simple or generalised main effects.

Models with simple main effects

Smith et al. (2001) demonstrated an extension of the factor analytic model which includes an explicit intercept for each genotype. This extension will be referred to as the FAMk model, where k denotes the number of latent factors. The FAMk model is given by:

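$$\mathbf{u}=(\bar{\mathbf{1}}_p\otimes\mathbf{I}_v)\boldsymbol{\gamma}_1+(\boldsymbol{\lambda}_1\otimes\mathbf{I}_v)\mathbf{f}_1+\ldots+(\boldsymbol{\lambda}_k\otimes\mathbf{I}_v)\mathbf{f}_k+\boldsymbol{\delta}=(\bar{\mathbf{1}}_p\otimes\mathbf{I}_v)\boldsymbol{\gamma}_1+(\boldsymbol{\Lambda}\otimes\mathbf{I}_v)\mathbf{f}+\boldsymbol{\delta},\tag{5}$$

with $\bar{\mathbf{1}}_p=\mathbf{1}_p/\sqrt{p}$, where $\boldsymbol{\gamma}_1=(\gamma_{11},\gamma_{12},\ldots,\gamma_{1v})^\top$ is a $v$-vector of genotype intercepts, $\boldsymbol{\Lambda}=[\boldsymbol{\lambda}_1\ \boldsymbol{\lambda}_2\ \ldots\ \boldsymbol{\lambda}_k]$ is a $p\times k$ matrix of latent environmental loadings (covariates), $\mathbf{f}=(\mathbf{f}_1^\top,\mathbf{f}_2^\top,\ldots,\mathbf{f}_k^\top)^\top$ is a $vk$-vector of genotype scores (slopes) in which $\mathbf{f}_l$ is the $v$-vector for the $l$th latent factor and $\boldsymbol{\delta}=(\boldsymbol{\delta}_1^\top,\boldsymbol{\delta}_2^\top,\ldots,\boldsymbol{\delta}_p^\top)^\top$ is a $vp$-vector of regression residuals (deviations) in which $\boldsymbol{\delta}_j$ is the $v$-vector specific to the $j$th environment. This specification highlights the analogy to an ordinary random regression, with the difference that the environmental covariates are estimated from the data as well as the genotype slopes (see Eq. 13).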

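Following Smith et al. (2021), the loadings are assumed to have orthonormal columns, with $\boldsymbol{\Lambda}^\top\boldsymbol{\Lambda}=\mathbf{I}_k$, and the scores are assumed to be independent across factors, with non-unit variance. It therefore follows that:

$$\begin{bmatrix}\boldsymbol{\gamma}_1\\ \mathbf{f}\\ \boldsymbol{\delta}\end{bmatrix}\sim\mathrm{N}\left(\begin{bmatrix}\mathbf{0}\\ \mathbf{0}\\ \mathbf{0}\end{bmatrix},\begin{bmatrix}p\sigma_1^2&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\mathbf{D}&\mathbf{0}\\ \mathbf{0}&\mathbf{0}&\boldsymbol{\Psi}\end{bmatrix}\otimes\mathbf{G}_g\right),$$

where $\sigma_1^2$ is the intercept variance, $\mathbf{D}=\oplus_{l=1}^{k}d_l$ is a diagonal matrix in which $d_l$ is the score variance for the $l$th latent factor ordered as $d_1>d_2>\ldots>d_k$ and $\boldsymbol{\Psi}=\oplus_{j=1}^{p}\psi_j$ is a diagonal matrix in which $\psi_j$ is the specific variance for the $j$th environment. The variance matrix for $\mathbf{u}$ is then given by:

$$\mathbf{G}=\left(\begin{bmatrix}\bar{\mathbf{1}}_p&\boldsymbol{\Lambda}\end{bmatrix}\begin{bmatrix}p\sigma_1^2&\mathbf{0}\\ \mathbf{0}&\mathbf{D}\end{bmatrix}\begin{bmatrix}\bar{\mathbf{1}}_p&\boldsymbol{\Lambda}\end{bmatrix}^\top+\boldsymbol{\Psi}\right)\otimes\mathbf{G}_g,\tag{6}$$

where $\mathbf{G}_e\equiv\sigma_1^2\mathbf{J}_p+\boldsymbol{\Lambda}\mathbf{D}\boldsymbol{\Lambda}^\top+\boldsymbol{\Psi}$ and $\mathbf{J}_p=\mathbf{1}_p\mathbf{1}_p^\top$. This variance matrix highlights the analogy to a random regression without translational invariance, that is where the intercepts and slopes are independent (see Eq. 14).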
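The two ways of writing the between-environment variance matrix of the FAMk model, the block form in Eq. 6 and the sum $\sigma_1^2\mathbf{J}_p+\boldsymbol{\Lambda}\mathbf{D}\boldsymbol{\Lambda}^\top+\boldsymbol{\Psi}$, can be checked against each other numerically. The sketch below uses arbitrary simulated parameter values, and it assumes the scaled vector of ones is $\mathbf{1}_p/\sqrt{p}$, which is what makes the intercept variance $p\sigma_1^2$ reduce to $\sigma_1^2\mathbf{J}_p$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 6, 2                                        # toy sizes

Lam, _ = np.linalg.qr(rng.normal(size=(p, k)))     # loadings with orthonormal columns
D = np.diag([2.0, 0.5])                            # score variances, d1 > d2
Psi = np.diag(rng.uniform(0.1, 0.3, size=p))       # specific variances
sigma1_sq = 0.8                                    # intercept variance (illustrative)

one_bar = np.ones((p, 1)) / np.sqrt(p)             # ASSUMED scaling of the ones vector
J = np.ones((p, p))

# Block form: [1bar Lam] diag(p*sigma1^2, D) [1bar Lam]' + Psi
left = np.hstack([one_bar, Lam])
middle = np.block([
    [np.array([[p * sigma1_sq]]), np.zeros((1, k))],
    [np.zeros((k, 1)), D],
])
Ge_block = left @ middle @ left.T + Psi

# Sum form: sigma1^2 J + Lam D Lam' + Psi
Ge_sum = sigma1_sq * J + Lam @ D @ Lam.T + Psi
```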

Note that the intercepts in $\boldsymbol{\gamma}_1$ reflect the fitted value of each genotype at zero values of the environmental loadings. In order for the intercepts to reflect true main effects, however, the average values of the loadings must also be zero. The analogy to ordinary regression models is when the known covariates are column centred, so that the intercepts will reflect main effects taken at average (zero) values of the covariates.

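Smith (1999) uses a Gram-Schmidt process to column centre the environmental loadings (see “Appendix”). The variance matrix in Eq. 6 can therefore be written as:

$$\mathbf{G}=\left(\begin{bmatrix}\bar{\mathbf{1}}_p&\boldsymbol{\Lambda}\end{bmatrix}\begin{bmatrix}p\sigma_g^2&\mathbf{D}_{12}\\ \mathbf{D}_{21}&\mathbf{D}_{22}\end{bmatrix}\begin{bmatrix}\bar{\mathbf{1}}_p&\boldsymbol{\Lambda}\end{bmatrix}^\top+\boldsymbol{\Psi}\right)\otimes\mathbf{G}_g,\tag{7}$$

where $\mathbf{G}_e\equiv\sigma_g^2\mathbf{J}_p+\bar{\mathbf{1}}_p\mathbf{D}_{12}\boldsymbol{\Lambda}^\top+\boldsymbol{\Lambda}\mathbf{D}_{21}\bar{\mathbf{1}}_p^\top+\boldsymbol{\Lambda}\mathbf{D}_{22}\boldsymbol{\Lambda}^\top+\boldsymbol{\Psi}$, with $\boldsymbol{\Lambda}^\top\mathbf{1}_p=\mathbf{0}$. This variance matrix highlights the analogy to a random regression with translational invariance, that is where the main effects and slopes are dependent (see Eq. 19). This variance matrix also highlights the analogy to a special FA($k+1$) model, where the first factor loadings are constrained to be equal and the higher order loadings sum to zero.

The simple main effects are now equivalent to simple averages across environments, with:

$$\boldsymbol{\gamma}_g=\boldsymbol{\gamma}_1+\sqrt{p}\sum_{l=1}^{k}\bar{\lambda}_l\mathbf{f}_l\quad\text{and}\quad\boldsymbol{\gamma}_g\sim\mathrm{N}\left(\mathbf{0},p\sigma_g^2\mathbf{G}_g\right),\tag{8}$$

where $\sigma_g^2=\sigma_1^2+\sum_{l=1}^{k}d_l\bar{\lambda}_l^2$ is the simple main effect variance and $\bar{\lambda}_l=\mathbf{1}_p^\top\boldsymbol{\lambda}_l/p$ is the mean loading for the $l$th latent factor. The distinguishing feature compared to the intercepts in Eq. 5 is that the simple main effects now reflect the fitted value of each genotype at average (zero) values of the loadings.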

The percentage of additive genetic variance explained by the simple main effects is given by:

$$v_g=100\,p\sigma_g^2/\mathrm{tr}(\mathbf{G}_e),\tag{9}$$

where $\mathbf{G}_e$ is defined in Eq. 7.
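The simple main effect variance and the percentage in Eq. 9 can be illustrated with a direct centring of the loadings. Note the hedge: this sketch subtracts the column means from the loadings rather than applying the Gram-Schmidt process of Smith (1999), so the centred loadings below are not re-orthonormalised. It is only meant to show why $\sigma_g^2=\sigma_1^2+\sum_l d_l\bar{\lambda}_l^2$ appears once the loadings are centred. All parameter values are simulated.

```python
import numpy as np

rng = np.random.default_rng(4)
p, k = 6, 2

Lam, _ = np.linalg.qr(rng.normal(size=(p, k)))   # orthonormal loadings
d = np.array([2.0, 0.5])
D = np.diag(d)
Psi = np.diag(rng.uniform(0.1, 0.3, size=p))
sigma1_sq = 0.8
ones = np.ones(p)
J = np.outer(ones, ones)

# FAMk matrix with intercepts taken at zero loadings.
Ge = sigma1_sq * J + Lam @ D @ Lam.T + Psi

# Split each loading into its mean and a centred part.
lam_bar = Lam.mean(axis=0)                       # mean loading per factor
Lam_c = Lam - lam_bar                            # centred loadings, columns sum to zero
sigma_g_sq = sigma1_sq + np.sum(d * lam_bar**2)  # simple main effect variance

# Same matrix rewritten with main effects, covariances and centred slopes.
cross = Lam_c @ D @ lam_bar
Ge_centred = (sigma_g_sq * J + np.outer(ones, cross) + np.outer(cross, ones)
              + Lam_c @ D @ Lam_c.T + Psi)

# Percentage of additive genetic variance explained by the simple main effects.
v_g = 100.0 * p * sigma_g_sq / np.trace(Ge)
```

The nonzero `cross` term is the covariance between main effects and slopes that a model with independent intercepts and slopes cannot represent.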

Models with generalised main effects

The conventional factor analytic (FAk) model is a simplification of the FAMk model in Eq. 5, with:

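$$\mathbf{u}=(\boldsymbol{\Lambda}\otimes\mathbf{I}_v)\mathbf{f}+\boldsymbol{\delta}\quad\text{and}\quad\mathbf{G}=(\boldsymbol{\Lambda}\mathbf{D}\boldsymbol{\Lambda}^\top+\boldsymbol{\Psi})\otimes\mathbf{G}_g,\tag{10}$$

where $\mathbf{G}_e=\boldsymbol{\Lambda}\mathbf{D}\boldsymbol{\Lambda}^\top+\boldsymbol{\Psi}$. The distinguishing feature of this model is that intercepts are not explicitly fitted for each genotype (see “Appendix”).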

Smith and Cullis (2018) discuss the ability of factor analytic models to capture heterogeneity of scale variance, that is non-crossover GEI, within the first factor. They proposed a set of generalised main effects based on this factor, with:

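$$\boldsymbol{\gamma}_g=\bar{\lambda}_1\mathbf{f}_1\quad\text{and}\quad\boldsymbol{\gamma}_g\sim\mathrm{N}\left(\mathbf{0},d_1\bar{\lambda}_1^2\mathbf{G}_g\right),\tag{11}$$

where $\bar{\lambda}_1=\mathbf{1}_p^\top\boldsymbol{\lambda}_1/p$ and $\boldsymbol{\lambda}_1$ is the $p$-vector of first factor loadings which are assumed to be positive. The generalised main effects can therefore be viewed as weighted averages across environments. This highlights an important difference to the simple main effects in the FAMk model, which are simple averages across environments.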

The percentage of additive genetic variance explained by the generalised main effects is equivalent to the variance explained by the first factor, which is given by:

$$v_1=100\,d_1/\mathrm{tr}(\mathbf{G}_e),\tag{12}$$

where $\mathbf{G}_e$ is defined in Eq. 10. This measure will be compared to the variance explained by the simple main effects in “Results”.
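The generalised main effects in Eq. 11 and the percentage in Eq. 12 can be sketched for a toy FAk model. All values are simulated; the sign flip after the QR step is there only because QR fixes each column up to sign, whereas the text assumes positive first factor loadings.

```python
import numpy as np

rng = np.random.default_rng(5)
p, k, v = 6, 2, 4                                  # toy sizes

# First column positive so that the first factor loadings are positive.
start = np.column_stack([rng.uniform(0.5, 1.5, size=p), rng.normal(size=p)])
Lam, _ = np.linalg.qr(start)
if Lam[:, 0].sum() < 0:
    Lam[:, 0] *= -1.0                              # resolve the sign ambiguity of QR

d = np.array([3.0, 0.7])
D = np.diag(d)
Psi = np.diag(rng.uniform(0.1, 0.3, size=p))
Ge = Lam @ D @ Lam.T + Psi

lam_bar_1 = Lam[:, 0].mean()                       # mean first factor loading
v1 = 100.0 * d[0] / np.trace(Ge)                   # Eq. 12

# Hypothetical first factor scores for v genotypes.
f1 = rng.normal(size=v)
first_factor_effects = np.outer(Lam[:, 0], f1)     # p x v fitted first-factor component
gen_main = lam_bar_1 * f1                          # Eq. 11
```

Because the loadings have orthonormal columns, the first factor contributes exactly $d_1$ to $\mathrm{tr}(\mathbf{G}_e)$, and averaging the first-factor component over environments returns the generalised main effects.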

Regressions on known covariates

The ordinary random regression model is given by:

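$$\mathbf{u}=(\bar{\mathbf{1}}_p\otimes\mathbf{I}_v)\boldsymbol{\gamma}_g+(\mathbf{s}_1\otimes\mathbf{I}_v)\boldsymbol{\gamma}_{s_1}+\ldots+(\mathbf{s}_q\otimes\mathbf{I}_v)\boldsymbol{\gamma}_{s_q}+\boldsymbol{\delta}=(\bar{\mathbf{1}}_p\otimes\mathbf{I}_v)\boldsymbol{\gamma}_g+(\mathbf{S}\otimes\mathbf{I}_v)\boldsymbol{\gamma}_s+\boldsymbol{\delta},\tag{13}$$

where $\boldsymbol{\gamma}_g=(\gamma_{g1},\gamma_{g2},\ldots,\gamma_{gv})^\top$ is the $v$-vector of simple main effects, $\mathbf{S}=[\mathbf{s}_1\ \mathbf{s}_2\ \ldots\ \mathbf{s}_q]$ is the $p\times q$ matrix of centred and scaled known environmental covariates, $\boldsymbol{\gamma}_s=(\boldsymbol{\gamma}_{s_1}^\top,\boldsymbol{\gamma}_{s_2}^\top,\ldots,\boldsymbol{\gamma}_{s_q}^\top)^\top$ is a $vq$-vector of genotype slopes in which $\boldsymbol{\gamma}_{s_i}$ is the $v$-vector for the $i$th known covariate and $\boldsymbol{\delta}=(\boldsymbol{\delta}_1^\top,\boldsymbol{\delta}_2^\top,\ldots,\boldsymbol{\delta}_p^\top)^\top$ is the $vp$-vector of regression residuals. This specification highlights the analogy to the FAMk model in Eq. 5. Note, however, that the known covariates are already column centred so that the intercepts already reflect simple main effects.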

Models without translational invariance

The random regression model in Heslot et al. (2014) assumes independent main effects and slopes, with:

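$$\begin{bmatrix}\boldsymbol{\gamma}_g\\ \boldsymbol{\gamma}_s\end{bmatrix}\sim\mathrm{N}\left(\begin{bmatrix}\mathbf{0}\\ \mathbf{0}\end{bmatrix},\begin{bmatrix}p\sigma_g^2&\mathbf{0}\\ \mathbf{0}&\boldsymbol{\Sigma}_s\end{bmatrix}\otimes\mathbf{G}_g\right),$$

where $\sigma_g^2$ is the simple main effect variance and $\boldsymbol{\Sigma}_s=\oplus_{i=1}^{q}\sigma_{s_i}^2$ is a diagonal matrix in which $\sigma_{s_i}^2$ is the slope variance for the $i$th known covariate. The distributional assumption for $\boldsymbol{\gamma}_s$ may restrict interpretation, however, when the mean response to specific covariates is expected to be nonzero. The regression form of $\boldsymbol{\tau}$ in Eq. 2 overcomes this issue, with $\boldsymbol{\gamma}_s\sim\mathrm{N}(\boldsymbol{\tau}_s\otimes\mathbf{1}_v,\boldsymbol{\Sigma}_s\otimes\mathbf{G}_g)$. The variance matrix for $\mathbf{u}$ is then given by:

$$\mathbf{G}=\left(\begin{bmatrix}\bar{\mathbf{1}}_p&\mathbf{S}\end{bmatrix}\begin{bmatrix}p\sigma_g^2&\mathbf{0}\\ \mathbf{0}&\boldsymbol{\Sigma}_s\end{bmatrix}\begin{bmatrix}\bar{\mathbf{1}}_p&\mathbf{S}\end{bmatrix}^\top+\boldsymbol{\Psi}\right)\otimes\mathbf{G}_g,\tag{14}$$

where $\mathbf{G}_e\equiv\sigma_g^2\mathbf{J}_p+\mathbf{S}\boldsymbol{\Sigma}_s\mathbf{S}^\top+\boldsymbol{\Psi}$.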

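The random regression model in Jarquín et al. (2014) uses an even simpler variance matrix for the slopes, with $\mathrm{var}(\boldsymbol{\gamma}_s)=\sigma_s^2\mathbf{I}_q\otimes\mathbf{G}_g$, where $\sigma_s^2$ is the slope variance across all known covariates. The variance matrix for $\mathbf{u}$ is then given by:

$$\mathbf{G}=\left(\begin{bmatrix}\bar{\mathbf{1}}_p&\mathbf{S}\end{bmatrix}\begin{bmatrix}p\sigma_g^2&\mathbf{0}\\ \mathbf{0}&\sigma_s^2\mathbf{I}_q\end{bmatrix}\begin{bmatrix}\bar{\mathbf{1}}_p&\mathbf{S}\end{bmatrix}^\top+\boldsymbol{\Psi}\right)\otimes\mathbf{G}_g,\tag{15}$$

where $\mathbf{G}_e\equiv\sigma_g^2\mathbf{J}_p+\sigma_s^2\mathbf{S}\mathbf{S}^\top+\boldsymbol{\Psi}$. Note that this random regression is neither scale nor translational invariant.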
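The two random regression structures in Eqs. 14 and 15 can be built side by side, and the remark on scale invariance can be illustrated by rescaling one covariate. This is a sketch on simulated covariates with arbitrary variance values; the function names `ge_rreg1` and `ge_rreg2` are hypothetical labels that follow Table 3. It shows only that separate slope variances can absorb a change of scale whereas an unchanged common variance cannot.

```python
import numpy as np

def ge_rreg2(S, sigma_g_sq, slope_vars, psi):
    """Ge with a separate slope variance per covariate (Eq. 14)."""
    p = S.shape[0]
    return sigma_g_sq * np.ones((p, p)) + S @ np.diag(slope_vars) @ S.T + np.diag(psi)

def ge_rreg1(S, sigma_g_sq, sigma_s_sq, psi):
    """Ge with one slope variance shared by all covariates (Eq. 15)."""
    q = S.shape[1]
    return ge_rreg2(S, sigma_g_sq, np.full(q, sigma_s_sq), psi)

rng = np.random.default_rng(6)
p, q = 8, 3
raw = rng.normal(size=(p, q))
S = raw - raw.mean(axis=0)
S = S / np.linalg.norm(S, axis=0)            # centred, unit-length covariates
psi = rng.uniform(0.1, 0.3, size=p)
slope_vars = np.array([1.0, 0.5, 0.2])

Ge2 = ge_rreg2(S, 0.6, slope_vars, psi)
Ge1 = ge_rreg1(S, 0.6, 0.4, psi)

# Rescale the first covariate by 3 (for example, a change of units).
S_rescaled = S.copy()
S_rescaled[:, 0] *= 3.0

# Separate slope variances absorb the change: divide the first variance by 9.
rescaled_vars = slope_vars.copy()
rescaled_vars[0] /= 9.0
Ge2_rescaled = ge_rreg2(S_rescaled, 0.6, rescaled_vars, psi)

# With the same common slope variance, the fitted structure changes.
Ge1_rescaled = ge_rreg1(S_rescaled, 0.6, 0.4, psi)
```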

Models with translational invariance

Jennrich and Schluchter (1986) proposed an extension of the random regression model which includes a factor analytic model for the known environmental covariates. This extension will be referred to as the FARk model, where k denotes the number of known factors. The FARk model for the simple main effects and slopes in Eq. 13 is given by:

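$$\boldsymbol{\gamma}_g=(\boldsymbol{\Lambda}_g\otimes\mathbf{I}_v)\mathbf{f}+\boldsymbol{\delta}_g\quad\text{and}\quad\boldsymbol{\gamma}_s=(\boldsymbol{\Lambda}_s\otimes\mathbf{I}_v)\mathbf{f}+\boldsymbol{\delta}_s,\tag{16}$$

where $\mathbf{f}=(\mathbf{f}_1^\top,\mathbf{f}_2^\top,\ldots,\mathbf{f}_k^\top)^\top$ is the $vk$-vector of genotype scores which correspond to the $k$ known factors. The FARk model constructs a joint regression across the main effects and slopes, with loadings given by:

$$\boldsymbol{\Lambda}_g=\begin{bmatrix}\lambda_{g1}&\lambda_{g2}&\ldots&\lambda_{gk}\end{bmatrix}\quad\text{and}\quad\boldsymbol{\Lambda}_s=\begin{bmatrix}\boldsymbol{\lambda}_{s1}&\boldsymbol{\lambda}_{s2}&\ldots&\boldsymbol{\lambda}_{sk}\end{bmatrix},$$

where $\boldsymbol{\Lambda}_g$ is a $k$-vector and $\boldsymbol{\Lambda}_s$ is a $q\times k$ matrix. The deviations in Eq. 16 are given by:

$$\boldsymbol{\delta}_g=\left(\delta_{g1},\delta_{g2},\ldots,\delta_{gv}\right)^\top\quad\text{and}\quad\boldsymbol{\delta}_s=\left(\boldsymbol{\delta}_{s_1}^\top,\boldsymbol{\delta}_{s_2}^\top,\ldots,\boldsymbol{\delta}_{s_q}^\top\right)^\top,$$

where $\boldsymbol{\delta}_g$ is a $v$-vector and $\boldsymbol{\delta}_s$ is a $vq$-vector.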

The inclusion of the deviations in Eq. 16 may be unnecessary, however, particularly for higher order FARk models in which the percentage of variance explained by these effects is small. This leads to a reduced rank factor analytic model for the simple main effects and slopes (Kirkpatrick and Meyer 2004), with:

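$$\boldsymbol{\gamma}_g=(\boldsymbol{\Lambda}_g\otimes\mathbf{I}_v)\mathbf{f}\quad\text{and}\quad\boldsymbol{\gamma}_s=(\boldsymbol{\Lambda}_s\otimes\mathbf{I}_v)\mathbf{f}.\tag{17}$$

The main effects and slopes are assumed to be dependent, with:

$$\begin{bmatrix}\boldsymbol{\gamma}_g\\ \boldsymbol{\gamma}_s\end{bmatrix}\sim\mathrm{N}\left(\begin{bmatrix}\mathbf{0}\\ \mathbf{0}\end{bmatrix},\begin{bmatrix}\boldsymbol{\Lambda}_g\mathbf{D}\boldsymbol{\Lambda}_g^\top&\boldsymbol{\Lambda}_g\mathbf{D}\boldsymbol{\Lambda}_s^\top\\ \boldsymbol{\Lambda}_s\mathbf{D}\boldsymbol{\Lambda}_g^\top&\boldsymbol{\Lambda}_s\mathbf{D}\boldsymbol{\Lambda}_s^\top\end{bmatrix}\otimes\mathbf{G}_g\right),$$

where $\mathbf{D}=\oplus_{l=1}^{k}d_l$ is the score variance matrix with diagonal elements ordered as $d_1>d_2>\ldots>d_k$.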

The FARk model is then obtained by substituting the vectors in Eq. 17 into Eq. 13, which gives:

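$$\mathbf{u}=\left(\left[\bar{\mathbf{1}}_p\boldsymbol{\Lambda}_g+\mathbf{S}\boldsymbol{\Lambda}_s\right]\otimes\mathbf{I}_v\right)\mathbf{f}+\boldsymbol{\delta}.\tag{18}$$

The variance matrix for $\mathbf{u}$ is then given by:

$$\mathbf{G}=\left(\mathbf{A}\begin{bmatrix}\boldsymbol{\Lambda}_g\mathbf{D}\boldsymbol{\Lambda}_g^\top&\boldsymbol{\Lambda}_g\mathbf{D}\boldsymbol{\Lambda}_s^\top\\ \boldsymbol{\Lambda}_s\mathbf{D}\boldsymbol{\Lambda}_g^\top&\boldsymbol{\Lambda}_s\mathbf{D}\boldsymbol{\Lambda}_s^\top\end{bmatrix}\mathbf{A}^\top+\boldsymbol{\Psi}\right)\otimes\mathbf{G}_g,\tag{19}$$

where $\mathbf{G}_e\equiv\mathbf{A}\boldsymbol{\Lambda}_a\mathbf{D}\boldsymbol{\Lambda}_a^\top\mathbf{A}^\top+\boldsymbol{\Psi}$, $\mathbf{A}=[\bar{\mathbf{1}}_p\ \mathbf{S}]$ and $\boldsymbol{\Lambda}_a=[\boldsymbol{\Lambda}_g^\top\ \boldsymbol{\Lambda}_s^\top]^\top$, with $\boldsymbol{\Lambda}_a^\top\mathbf{A}^\top\mathbf{A}\boldsymbol{\Lambda}_a=\mathbf{I}_k$. This variance matrix is equivalent to the conventional FAk variance matrix in Eq. 10 when $\mathbf{A}$ is square and has full rank.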
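Translational invariance can be illustrated numerically. In the sketch below, a constant is added to each known covariate; the FARk structure reproduces the same matrix after a linear change of its loadings, whereas the independent structure of Eq. 14 would need nonzero covariances between main effects and slopes to do the same. The values are simulated, the transformation matrix `T` is introduced here for illustration, and the identifiability constraints on the loadings are ignored.

```python
import numpy as np

rng = np.random.default_rng(7)
p, q, k = 8, 3, 2

raw = rng.normal(size=(p, q))
S = raw - raw.mean(axis=0)
S = S / np.linalg.norm(S, axis=0)
one_bar = np.ones((p, 1)) / np.sqrt(p)
A = np.hstack([one_bar, S])

Lam_a = rng.normal(size=(q + 1, k))        # arbitrary joint loadings (illustrative)
D = np.diag([1.5, 0.4])
Ge_core = A @ Lam_a @ D @ Lam_a.T @ A.T    # structured part of Eq. 19, without Psi

# Translate the covariates: add a constant to every column of S.
shift = np.array([0.3, -1.2, 2.0])
A_shift = np.hstack([one_bar, S + shift])

# The translated basis is a linear transformation of the original one, A_shift = A T.
T = np.eye(q + 1)
T[0, 1:] = np.sqrt(p) * shift

# FARk: transformed loadings reproduce exactly the same matrix.
Lam_a_shift = np.linalg.solve(T, Lam_a)
Ge_core_shift = A_shift @ Lam_a_shift @ D @ Lam_a_shift.T @ A_shift.T

# Independent main effects and slopes: the equivalent middle matrix on the
# translated scale is no longer block diagonal.
K = np.diag(np.concatenate([[p * 0.6], [1.0, 0.5, 0.2]]))
T_inv = np.linalg.inv(T)
K_shift = T_inv @ K @ T_inv.T
```

The nonzero first row of `K_shift` is the covariance between main effects and slopes that the translated covariates require.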

Regressions on known and latent covariates

The integrated factor analytic (IFAk) model is an extension of the FARk model to include generalised main effects based on latent environmental covariates, instead of simple main effects. The IFAk model can also be viewed as a special FAk model with loadings constrained to be linear combinations of two orthogonal sources of GEI, that is known and latent environmental covariates. The loadings matrix in Eq. 5 can therefore be written as:

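$$\boldsymbol{\Lambda}=\mathbf{S}\boldsymbol{\Lambda}_s+\boldsymbol{\Gamma}\boldsymbol{\Lambda}_r=\mathbf{B}\begin{bmatrix}\boldsymbol{\Lambda}_s\\ \boldsymbol{\Lambda}_r\end{bmatrix}\quad\text{or}\quad\boldsymbol{\Lambda}=\begin{bmatrix}\mathbf{S}\boldsymbol{\Lambda}_s&\boldsymbol{\Gamma}\boldsymbol{\Lambda}_r\end{bmatrix}=\mathbf{B}\begin{bmatrix}\boldsymbol{\Lambda}_s&\mathbf{0}\\ \mathbf{0}&\boldsymbol{\Lambda}_r\end{bmatrix},\tag{20}$$

where $\mathbf{B}=[\mathbf{S}\ \boldsymbol{\Gamma}]$ is a $p\times p$ matrix of basis functions which is assumed to have full rank, $\mathbf{S}=[\mathbf{s}_1\ \mathbf{s}_2\ \ldots\ \mathbf{s}_q]$ is the $p\times q$ matrix of known environmental covariates and $\boldsymbol{\Gamma}=[\mathbf{r}_1\ \mathbf{r}_2\ \ldots\ \mathbf{r}_{p-q}]$ is a $p\times(p-q)$ orthogonal projection matrix, with $\mathbf{S}^\top\boldsymbol{\Gamma}=\mathbf{0}$. The two loadings matrices in Eq. 20 correspond to the dependent and independent formulations of the IFAk model. The dependent formulation is translational invariant, and thence the focus of this paper. No further reference will be made to the independent formulation, but full details are provided in the Supplementary Material.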

The dependent formulation constructs a joint regression across the known and latent environmental covariates. The p×k matrix of joint factor loadings is given by:

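$$\begin{bmatrix}\boldsymbol{\Lambda}_s\\ \boldsymbol{\Lambda}_r\end{bmatrix}=\begin{bmatrix}\boldsymbol{\lambda}_{s1}&\boldsymbol{\lambda}_{s2}&\ldots&\boldsymbol{\lambda}_{sk}\\ \boldsymbol{\lambda}_{r1}&\boldsymbol{\lambda}_{r2}&\ldots&\boldsymbol{\lambda}_{rk}\end{bmatrix},\tag{21}$$

where $\boldsymbol{\Lambda}_s$ is a $q\times k$ matrix corresponding to the known covariates and $\boldsymbol{\Lambda}_r$ is a $(p-q)\times k$ matrix corresponding to the latent covariates. The common factors underlying $\boldsymbol{\Lambda}_s$ and $\boldsymbol{\Lambda}_r$ are therefore referred to as the known and latent factors, and collectively as the joint factors.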

The projection matrix in Eq. 20 is chosen to ensure that $\mathbf{B}$ has full rank and that the known and latent factors are orthogonal. This is achieved by projecting $\boldsymbol{\Lambda}_r$ into the orthogonal complement to the space spanned by $\mathbf{S}$. A convenient choice for $\boldsymbol{\Gamma}$ is the first $(p-q)$ columns in:

$$\left[\mathbf{I}_p-\mathbf{S}(\mathbf{S}^\top\mathbf{S})^{-1}\mathbf{S}^\top\right],\tag{22}$$

assuming that $p>q$. This choice ensures that the same number of variance parameters are estimated as the conventional FAk model in Eq. 10. When $p\gg q$, however, it may be desirable to take fewer than $(p-q)$ columns in Eq. 22, and thence estimate fewer variance parameters. This enables the IFAk model to be scalable to a very large number of environments.
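The construction of the projection matrix from Eq. 22 can be sketched as follows, using simulated covariates and toy dimensions. The check that the basis has full rank is numerical and holds for this simulated example; it is not a general proof that the first $(p-q)$ columns are always linearly independent.

```python
import numpy as np

rng = np.random.default_rng(8)
p, q = 8, 3                                   # toy sizes with p > q

raw = rng.normal(size=(p, q))
S = raw - raw.mean(axis=0)
S = S / np.linalg.norm(S, axis=0)             # centred, unit-length known covariates

# Projector onto the orthogonal complement of the column space of S (Eq. 22).
P = np.eye(p) - S @ np.linalg.solve(S.T @ S, S.T)

# Convenient choice described in the text: the first (p - q) columns.
Gamma = P[:, : p - q]

# Basis for the joint regression.
B = np.hstack([S, Gamma])
rank_B = np.linalg.matrix_rank(B)
```

Every column of `P` is orthogonal to the columns of `S`, so any subset of its columns satisfies $\mathbf{S}^\top\boldsymbol{\Gamma}=\mathbf{0}$; taking fewer columns simply yields a smaller latent basis.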

The IFAk model is obtained by substituting the first loadings matrix in Eq. 20 into Eq. 10, which gives:

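$$\mathbf{u}=\left(\left[\mathbf{S}\boldsymbol{\Lambda}_s+\boldsymbol{\Gamma}\boldsymbol{\Lambda}_r\right]\otimes\mathbf{I}_v\right)\mathbf{f}+\boldsymbol{\delta},\tag{23}$$

where $\mathbf{f}=(\mathbf{f}_1^\top,\mathbf{f}_2^\top,\ldots,\mathbf{f}_k^\top)^\top$ is the $vk$-vector of genotype scores which correspond to the $k$ joint factors.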

The main difference to the FARk model in Eq. 18 is that there are now two vectors of slopes, with:

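$$\boldsymbol{\gamma}_s=(\boldsymbol{\Lambda}_s\otimes\mathbf{I}_v)\mathbf{f}\quad\text{and}\quad\boldsymbol{\gamma}_r=(\boldsymbol{\Lambda}_r\otimes\mathbf{I}_v)\mathbf{f},\tag{24}$$

where $\boldsymbol{\gamma}_s$ is a $vq$-vector corresponding to the known covariates and $\boldsymbol{\gamma}_r$ is a $v(p-q)$-vector corresponding to the latent covariates. Another important difference is the addition of generalised main effects in $\boldsymbol{\gamma}_r$, with:

$$\boldsymbol{\gamma}_g=\bar{\lambda}_{r1}\mathbf{f}_1\quad\text{and}\quad\boldsymbol{\gamma}_g\sim\mathrm{N}\left(\mathbf{0},d_1\bar{\lambda}_{r1}^2\mathbf{G}_g\right),\tag{25}$$

where $\bar{\lambda}_{r1}=\mathbf{1}_{p-q}^\top\boldsymbol{\lambda}_{r1}/p$. The IFAk model can therefore be viewed as a special random regression with generalised main effects as well as translational invariance.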

The slopes in Eq. 24 are assumed to be dependent, with:

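$$\begin{bmatrix}\gamma_s\\ \gamma_r\end{bmatrix} \sim N\left(\begin{bmatrix}0\\ 0\end{bmatrix},\; \begin{bmatrix}\Lambda_s D \Lambda_s^\top & \Lambda_s D \Lambda_r^\top\\ \Lambda_r D \Lambda_s^\top & \Lambda_r D \Lambda_r^\top\end{bmatrix} \otimes G_g\right),$$

where $D = \oplus_{l=1}^{k} d_l$ is the score variance matrix with diagonal elements ordered as $d_1 > d_2 > \ldots > d_k$. The variance matrix for $u$ is then given by:

$$G = \left(B\begin{bmatrix}\Lambda_s D \Lambda_s^\top & \Lambda_s D \Lambda_r^\top\\ \Lambda_r D \Lambda_s^\top & \Lambda_r D \Lambda_r^\top\end{bmatrix}B^\top + \Psi\right) \otimes G_g, \tag{26}$$

where $G_e \equiv \Lambda D \Lambda^\top + \Psi$ and $\Lambda = B[\Lambda_s^\top \; \Lambda_r^\top]^\top$, with $\Lambda^\top\Lambda = I_k$. This variance matrix is equivalent to the conventional FAk model in Eq. 10, where the factors are constrained to be linear combinations of known and latent environmental covariates.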

Model estimation

All variance models for the additive GE effects were implemented within the linear mixed model in Eq. 1. The two factor analytic linear mixed models with simple and generalised main effects are referred to as the FAM-LMM and FA-LMM, respectively. The other two linear mixed models developed in this paper are derived below.

The factor analytic regression linear mixed model (FAR-LMM) is obtained by substituting Eq. 18 into Eq. 1, which gives:

$$y = X\tau + Z_{\Lambda_a} f + Z\delta + Z_p u_p + e, \tag{27}$$

where $Z_{\Lambda_a} = Z(A\Lambda_a \otimes I_v)$. In this model, the covariances between the simple main effects and slopes are based on a reduced rank factor analytic model.

The integrated factor analytic linear mixed model (IFA-LMM) is obtained by substituting Eq. 23 into Eq. 1, which gives:

$$y = X\tau + Z_{\Lambda_b} f + Z\delta + Z_p u_p + e, \tag{28}$$

where $Z_{\Lambda_b} = Z(B\Lambda_b \otimes I_v)$ and $\Lambda_b = [\Lambda_s^\top \; \Lambda_r^\top]^\top$. In this model, the covariances between the known and latent environmental covariates are based on a reduced rank factor analytic model. The IFA-LMM will now be used to demonstrate all remaining methods. Similar results can be obtained for the other three linear mixed models where required.

Rotation of loadings and scores

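Constraints are required in the IFA-LMM during estimation to ensure unique solutions for $[\Lambda_s^\top \; \Lambda_r^\top]^\top$ and $D$. Following Smith et al. (2021), the upper right elements of $[\Lambda_s^\top \; \Lambda_r^\top]^\top$ are set to zero when $k > 1$ and $D$ is set to $I_k$. Let the loadings and scores with these constraints be denoted by $[\Lambda_s^{*\top} \; \Lambda_r^{*\top}]^\top$ and $f^*$, with $f^* \sim N(0, I_k \otimes G_g)$. The loadings and scores can be rotated back to their original form in Eq. 23 for interpretation. This rotation is given by:

$$\begin{bmatrix}\Lambda_s\\ \Lambda_r\end{bmatrix} = \begin{bmatrix}\Lambda_s^*\\ \Lambda_r^*\end{bmatrix} V D^{-1/2} \quad\text{and}\quad f = (D^{1/2}V^\top \otimes I_v)f^*, \tag{29}$$

where $V$ is a $k \times k$ orthonormal matrix of right singular vectors and $D^{1/2}$ is a $k \times k$ diagonal matrix of singular values sorted in decreasing order, with $f \sim N(0, D \otimes G_g)$. These matrices can be obtained from the singular value decomposition given by:

$$B\begin{bmatrix}\Lambda_s^*\\ \Lambda_r^*\end{bmatrix} = U D^{1/2} V^\top \quad\text{or}\quad \Lambda^* = U D^{1/2} V^\top, \tag{30}$$

where $U$ is a $p \times k$ orthonormal matrix of left singular vectors, with $[\Lambda_s^\top \; \Lambda_r^\top]^\top = B^{-1}U$ and $[\Lambda_s^{*\top} \; \Lambda_r^{*\top}]^\top = B^{-1}\Lambda^*$, where $\Lambda^*$ is the loadings matrix in Eq. 10 with upper right elements set to zero (see “Appendix”). This demonstrates how the factor loadings in the IFA-LMM can be obtained directly from the fit of the conventional FA-LMM.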
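The rotation can be checked numerically. The following is a minimal sketch, assuming simulated constrained loadings and scores; it works directly with the $p \times k$ environment loadings (the product of $B$ and the joint loadings), and all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k, v = 10, 3, 5

# Hypothetical constrained loadings: upper-right elements set to zero,
# with the score variance matrix fixed at the identity.
L_star = rng.normal(size=(p, k))
L_star[np.triu_indices(k, 1)] = 0.0

# Thin SVD: L_star = U diag(s) V', singular values in decreasing order.
U, s, Vt = np.linalg.svd(L_star, full_matrices=False)
D = np.diag(s**2)                        # score variances d_1 >= ... >= d_k

# Rotate the loadings: L = L_star V D^{-1/2}.
L = L_star @ Vt.T @ np.diag(1.0 / s)

# Rotate hypothetical scores; rows are genotypes, columns are factors.
# This is the row-wise form of f = (D^{1/2} V' kron I_v) f_star.
F_star = rng.normal(size=(v, k))
F = F_star @ Vt.T @ np.diag(s)

print(np.allclose(L.T @ L, np.eye(k)))            # rotated loadings are orthonormal
print(np.allclose(F @ L.T, F_star @ L_star.T))    # fitted effects are unchanged
```

In this sketch the rotated loadings coincide with the left singular vectors `U`, so premultiplying `U` by the inverse of the basis matrix would give the known and latent parts separately.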

Computation

The IFA-LMM was coded in R (R Core Team 2021) using open source libraries. The computational approach for fitting the IFA-LMM is provided in the Supplementary Material. This approach obtains REML estimates of the variance parameters using an extension of the sparse formulation of the average information algorithm (Thompson et al. 2003). Let the REML estimates of the key variance parameters be denoted by $[\hat{\Lambda}_s^\top \; \hat{\Lambda}_r^\top]^\top$ and $\hat{\Psi}$, with EBLUPs of the key random effects denoted by $\tilde{f}$ and $\tilde{\delta}$. All linear mixed models were also fitted in ASReml-R (Butler 2020), with known environmental covariates included using the mbf argument. An example R script is provided in the Supplementary Material.

Model selection

Order selection in the IFA-LMM was achieved using a combination of formal and informal criteria. Formal selection was achieved using the Akaike Information Criterion (AIC) and informal selection was achieved using two measures of variance explained. These measures are an extension of Smith et al. (2021) to include known environmental covariates, and are similar to the $R^2$ goodness-of-fit statistic in multiple regression. These measures are derived in the Supplementary Material.

The percentage of additive genetic variance explained by the known covariates and overall by the known and latent covariates is given by:

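$$\bar{v}_s = 100\,\frac{\operatorname{tr}(S\hat{\Lambda}_s\hat{D}\hat{\Lambda}_s^\top S^\top)}{\operatorname{tr}(\hat{G}_e)} \quad\text{and}\quad \bar{v} = 100\,\frac{\operatorname{tr}(\hat{D})}{\operatorname{tr}(\hat{G}_e)}, \tag{31}$$

where $G_e$ is defined in Eq. 26. Similar measures are also obtained for the $j$th environment, that is $v_{sj}$ and $v_j$. The final model order is typically chosen such that $\bar{v}_s$ and $\bar{v}$ are sufficiently high and the number of environments with low values of $v_{sj}$ and $v_j$ is small. Note that this may require a different number of known and latent factors, that is $k_s$ and $k_r$.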
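The two overall measures can be sketched numerically as follows. This is an illustration under stated assumptions only: the covariates, loadings, score variances and specific variances are all simulated, and the environment loadings are made orthonormal to mirror the constraint stated with Eq. 26.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, k = 8, 3, 2

# Hypothetical centred covariates, projection matrix and basis B = [S Gamma].
S = rng.normal(size=(p, q))
S -= S.mean(axis=0)
P = np.eye(p) - S @ np.linalg.solve(S.T @ S, S.T)
Gamma = P[:, : p - q]
B = np.hstack([S, Gamma])

# Hypothetical orthonormal environment loadings, split into known and latent
# parts by solving B [Lam_s; Lam_r] = Lam.
Lam, _ = np.linalg.qr(rng.normal(size=(p, k)))
joint = np.linalg.solve(B, Lam)
Lam_s, Lam_r = joint[:q], joint[q:]

D = np.diag([2.0, 0.5])                        # hypothetical score variances
Psi = np.diag(rng.uniform(0.1, 0.3, size=p))   # hypothetical specific variances
G_e = Lam @ D @ Lam.T + Psi

known = S @ Lam_s @ D @ Lam_s.T @ S.T
v_s = 100 * np.trace(known) / np.trace(G_e)    # explained by known covariates
v_all = 100 * np.trace(D) / np.trace(G_e)      # explained by known and latent
print(round(v_s, 1), round(v_all, 1))
```

Because the known and latent parts are orthogonal in this construction, the known-covariate measure cannot exceed the overall measure.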

Model assessment

Model assessment of the IFA-LMM was achieved using the prediction accuracy for current and future environments. Prediction into current environments was assessed using leave-one-out cross-validation, where yield data for a single environment were excluded and then predicted. The additive GE effects for environment j were predicted as:

$$\tilde{u}_j = ([S_j\hat{\Lambda}_{s_{-j}} + \bar{\Lambda}_{r_{-j}}] \otimes I_v)\tilde{f}_j, \tag{32}$$

where $S_j$ is a $q$-vector of known covariates, $\tilde{f}_j$ is a $v_jk$-vector of predicted scores for the $v_j$ genotypes in the $j$th current environment and $\bar{\Lambda}_{r_{-j}} = \mathbf{1}_{p-q-1}^\top\hat{\Lambda}_{r_{-j}}/(p-1)$ ensures the scores are appropriately scaled by the latent covariates. Note that the factor loadings, $\hat{\Lambda}_{s_{-j}}$ and $\hat{\Lambda}_{r_{-j}}$, are estimated using data on the $(p-1)$ environments excluding the $j$th environment. The prediction accuracy for environment $j$ was then calculated as:

$$r_j = \operatorname{cor}(\bar{y}_j, \tilde{u}_j), \tag{33}$$

where $\bar{y}_j$ is a $v_j$-vector of genotype mean yields for the $j$th current environment.

Prediction into future environments was assessed using a similar measure, but note that yield data for the entire year were excluded at once. The additive GE effects for environment j were then predicted as:

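$$\tilde{u}_j = ([S_j\hat{\Lambda}_s + \bar{\Lambda}_r] \otimes I_v)\tilde{f}_j, \tag{34}$$

where $S_j$ is a $q$-vector, $\tilde{f}_j$ is a $v_jk$-vector for the $v_j$ genotypes in the $j$th future environment and $\bar{\Lambda}_r = \mathbf{1}_{p-q}^\top\hat{\Lambda}_r/p$. In this case, the factor loadings, $\hat{\Lambda}_s$ and $\hat{\Lambda}_r$, are estimated using data on the $p$ current environments only.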
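Prediction into a new environment can be sketched as follows. This is a minimal illustration, assuming simulated loadings, scores and covariates; the variable names are hypothetical, and the "observed" mean yields are generated only so that a correlation can be computed.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, k, v = 8, 3, 2, 5

# Hypothetical estimates from the current environments.
Lam_s = rng.normal(size=(q, k))          # known loadings
Lam_r = rng.normal(size=(p - q, k))      # latent loadings
F = rng.normal(size=(v, k))              # predicted scores, one row per genotype

# Known covariates for the new environment (a row vector of length q).
s_new = rng.normal(size=(1, q))

# Mean latent loadings, 1' Lam_r / p, stand in for the unobserved latent covariates.
lam_r_bar = Lam_r.sum(axis=0, keepdims=True) / p

# Joint loadings for the new environment, then the additive GE effects.
lam_new = s_new @ Lam_s + lam_r_bar      # 1 x k
u_new = (F @ lam_new.T).ravel()

# Same result from the Kronecker form, with scores stacked factor by factor.
f_stacked = F.T.reshape(-1)
u_kron = np.kron(lam_new, np.eye(v)) @ f_stacked

# Prediction accuracy as a Pearson correlation with hypothetical mean yields.
y_bar = u_new + rng.normal(scale=0.5, size=v)
r = np.corrcoef(y_bar, u_new)[0, 1]
print(np.allclose(u_new, u_kron))
```

The matrix form `F @ lam_new.T` avoids building the Kronecker product explicitly, which matters when the number of genotypes is large.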

Model summaries and interpretation

The main limitation of the conventional FA-LMM is that the common factors are latent so they cannot be used for interpretation or prediction. The IFA-LMM overcomes this limitation since it integrates known environmental covariates into the common factors. Interpretation is then achieved using a series of regression plots and four measures of variance explained. The regression plots are an extension of Cullis et al. (2014) and the measures of variance explained are an extension of Eq. 31.

The percentage of additive genetic variance explained by known covariate i is given by:

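$$v_{si} = 100\,\frac{[s_i^\top S\hat{\Lambda}_s\hat{D}\hat{\lambda}_{si}]^2}{[\hat{\lambda}_{si}^\top\hat{D}\hat{\lambda}_{si}]\operatorname{tr}(\hat{G}_e)}, \tag{35}$$

where $G_e$ is defined in Eq. 26. Note that $\bar{v}_s \ne \sum_{i=1}^{q} v_{si}$ since the known covariates are not orthogonal. This issue is addressed in the Supplementary Material.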

The percentage of additive genetic variance explained by known factor l and by joint factor l is given by:

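$$v_{sl} = 100\,\frac{\hat{d}_l\hat{\lambda}_{sl}^\top S^\top S\hat{\lambda}_{sl}}{\operatorname{tr}(\hat{G}_e)} \quad\text{and}\quad v_l = 100\,\frac{\hat{d}_l}{\operatorname{tr}(\hat{G}_e)}. \tag{36}$$

Note that $\bar{v}_s = \sum_{l=1}^{k} v_{sl}$ and $\bar{v} = \sum_{l=1}^{k} v_l$ since the known and joint factors are orthogonal.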
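The additivity across factors can be seen from a short derivation. This is a sketch, assuming the estimated environment loadings satisfy the orthonormality constraint stated with Eq. 26 and that the estimated score variance matrix is diagonal:

```latex
\begin{aligned}
\operatorname{tr}(\hat{\Lambda}\hat{D}\hat{\Lambda}^\top)
  &= \operatorname{tr}(\hat{D}\hat{\Lambda}^\top\hat{\Lambda})
   = \operatorname{tr}(\hat{D})
   = \sum_{l=1}^{k}\hat{d}_l, \\
\operatorname{tr}(S\hat{\Lambda}_s\hat{D}\hat{\Lambda}_s^\top S^\top)
  &= \sum_{l=1}^{k}\hat{d}_l\,\hat{\lambda}_{sl}^\top S^\top S\hat{\lambda}_{sl}.
\end{aligned}
```

Dividing each line by the trace of the estimated between-environment variance matrix and multiplying by 100 gives the two sums over factors.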

Lastly, the percentage of additive genetic variance in joint factor l explained by known covariate i is given by:

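$$v_{li} = 100\,\big(s_i^\top[S\hat{\lambda}_{sl} + \Gamma\hat{\lambda}_{rl}]\big)^2. \tag{37}$$

The percentage of variance explained by all covariates is then given by $v_{l\cdot} = 100[\hat{\lambda}_{sl}^\top S^\top S\hat{\lambda}_{sl}]$, which is equivalent to $v_{sl}/v_l$ in Eq. 36.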

Results

This section presents the results of model fitting using the 2017 P1 MET dataset and model assessment using the 2018 P2 MET dataset. The P1 dataset is summarised in Tables 1 and 2, and comprises v=204 genotypes evaluated in p=24 current environments with q=18 known covariates. The P2 dataset is summarised in Supplementary Tables 9 and 10, and comprises v=55 (of the 204) genotypes evaluated in p=20 future environments with the same known covariates. The results are presented according to model selection, assessment and interpretation.

Model selection

Tables 4 and 5 present the model selection criteria previously described in “Model selection”. The important results from each model fit are detailed below.

Table 4.

Linear mixed models with random regressions on latent environmental covariates

Regressions on latent covariates

(a) Models with simple main effects

| Model | Pars | Loglik | AIC | $v_g$ | $\bar{v}$ |
|---|---|---|---|---|---|
| comp | 2 | 10,504.2 | −20,748.4 | 36.2 | 36.2 |
| mdiag | 25 | 10,563.6 | −20,821.1 | 33.6 | 33.6 |
| FAM1 | 49 | 10,765.4 | −21,176.8 | 36.8 | 54.4 |
| FAM2 | 72 | 10,893.8 | −21,387.6 | 37.2 | 67.5 |
| FAM3 | 94 | 10,942.9 | −21,441.8 | 38.2 | 72.0 |
| **FAM4** | 115 | 10,981.7 | −21,477.5 | 38.1 | 76.9 |
| FAM5 | 135 | 11,011.2 | −21,496.5 | 38.7 | 80.0 |

(b) Models with generalised main effects

| Model | Pars | Loglik | AIC | $v_1$ | $\bar{v}$ |
|---|---|---|---|---|---|
| id | 1 | 10,156.9 | −20,055.9 | - | - |
| diag | 24 | 10,249.3 | −20,194.7 | - | - |
| FA1 | 48 | 10,667.1 | −20,982.2 | 43.2 | 43.2 |
| FA2 | 71 | 10,827.4 | −21,256.8 | 44.1 | 60.4 |
| FA3 | 93 | 10,940.3 | −21,438.5 | 43.8 | 70.7 |
| **FA4** | 114 | 10,978.3 | −21,472.5 | 43.8 | 75.2 |
| FA5 | 134 | 11,010.1 | −21,496.1 | 44.3 | 79.0 |

Presented for each model is the number of estimated genetic variance parameters, residual log-likelihood, AIC and percentage of variance explained by the simple ($v_g$) or generalised ($v_1$) main effects and overall ($\bar{v}$)

Note: 128 non-genetic and residual variance parameters estimated in all models. The selected FAM4 and FA4 models are distinguished with bold font

*Models where intercepts are not explicitly fitted

Table 5.

Linear mixed models with random regressions on known and latent environmental covariates

(a) Regressions on known covariates: models with simple main effects

| Model | Pars | Loglik | AIC | $\bar{v}_s$ | $\bar{v}$ |
|---|---|---|---|---|---|
| rreg1 | 26 | 10,721.2 | −21,134.3 | 20.8 | 57.1 |
| rreg2 | 43 | 10,750.7 | −21,159.3 | 23.2 | 58.5 |
| FAR1 | 43 | 10,636.7 | −20,931.4 | 6.2 | 40.0 |
| FAR2 | 61 | 10,791.4 | −21,204.8 | 19.2 | 57.0 |
| FAR3 | 78 | 10,887.0 | −21,361.9 | 29.2 | 66.7 |
| **FAR4** | 94 | 10,911.7 | −21,379.4 | 33.2 | 70.7 |
| FAR5 | 109 | 10,931.3 | −21,388.7 | 36.2 | 73.8 |

(b) Regressions on known and latent covariates: models with generalised main effects*

| Model | Pars | Loglik | AIC | $\bar{v}_s$ | $\bar{v}$ |
|---|---|---|---|---|---|
| id | 1 | 10,156.9 | −20,055.9 | - | - |
| diag | 24 | 10,249.3 | −20,194.7 | - | - |
| IFA1 | 48 | 10,667.1 | −20,982.2 | 7.0 | 43.2 |
| IFA2 | 71 | 10,827.4 | −21,256.8 | 20.1 | 60.4 |
| IFA3 | 93 | 10,940.3 | −21,438.5 | 30.1 | 70.7 |
| **IFA4-3** | 108 | 10,971.9 | −21,471.9 | 34.4 | 74.9 |
| IFA5-3 | 122 | 10,996.4 | −21,492.8 | 36.2 | 78.0 |

Presented for each model is the number of estimated genetic variance parameters, residual log-likelihood, AIC and percentage of variance explained by the known covariates ($\bar{v}_s$) and overall ($\bar{v}$)

Note: 128 non-genetic and residual variance parameters estimated in all models. The models rreg1 and rreg2 correspond to the random regressions in Jarquín et al. (2014) and Heslot et al. (2014). The selected FAR4 and IFA4-3 models are distinguished with bold font.

*Models where intercepts are not explicitly fitted

Baseline linear mixed models

The analyses commenced by fitting a linear mixed model with a diagonal model for the additive GE effects (diag; Table 4b). This approach reflects the initial single-site analyses routinely performed on MET datasets, where the additive GE effects in different environments are assumed to be independent. The single-site analyses are typically used to inspect the experimental design, address spatial variations and identify potential outliers.

The analyses continued by fitting a linear mixed model with a compound symmetry model for the additive GE effects (comp; Table 4a). This approach reflects many current applications of GS in plant breeding, where the additive GE effects in different environments are assumed to be correlated. The compound symmetry model is very restrictive, however, since it comprises a single variance component for the simple genotype main effects and genotype by environment interaction effects. This model can be extended to include heterogeneous interaction variances across environments, that is the main effects plus diagonal model (mdiag; Table 4a). The AIC for this model is much lower, and thence much better, than the standard compound symmetry model. There are negligible differences between the overall additive genetic variance explained, however, with $\bar{v} \approx 35\%$ for both models.
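The AIC values in Table 4 can be cross-checked from the other columns. The sketch below assumes the usual form, minus twice the residual log-likelihood plus twice the total number of variance parameters, where the total adds the 128 non-genetic and residual parameters from the table note; under that assumption the tabulated values appear to be reproduced to within rounding.

```python
# (label, genetic parameters, residual log-likelihood, reported AIC) from Table 4.
models = [
    ("comp", 2, 10504.2, -20748.4),
    ("mdiag", 25, 10563.6, -20821.1),
    ("FAM4", 115, 10981.7, -21477.5),
    ("diag", 24, 10249.3, -20194.7),
    ("FA4", 114, 10978.3, -21472.5),
]
OTHER_PARS = 128  # non-genetic and residual variance parameters (table note)


def aic(loglik, genetic_pars, other_pars=OTHER_PARS):
    """Assumed form: -2 * log-likelihood + 2 * total number of variance parameters."""
    return -2.0 * loglik + 2.0 * (genetic_pars + other_pars)


# Absolute difference between the computed and reported AIC for each model.
differences = {name: abs(aic(ll, pars) - rep) for name, pars, ll, rep in models}
for name, diff in differences.items():
    print(f"{name:6s} |computed - reported| = {diff:.2f}")
```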

Regressions on latent covariates

A series of factor analytic linear mixed models were then fitted with either (a) simple or (b) generalised main effects (Table 4). The most notable differences between the FAM-LMMs and FA-LMMs are observed in the lower orders, where the overall additive genetic variance explained by the latent common factors is low. At the higher orders, where the overall variance explained is sufficiently high, the differences are negligible. Both models required $k = 4$ latent factors to reach a sufficient percentage of additive genetic variance explained for individual environments and overall, with $v_j > 40\%$ and $\bar{v} > 75\%$. Lastly, note that the generalised main effects in (b) explain 5.7% more variance than the simple main effects in (a), despite very similar overall variance explained. This feature is now discussed.

The simple and generalised main effects are demonstrated in Fig. 2. This figure presents a series of regression plots for checks C1 and C2 in terms of the (a) FAM4 and (b/c) FA4 models. Recall that the FAM4 model can be viewed as a special FA5 model where the first factor loadings are equal and correspond to the simple main effects, whereas the higher order loadings sum to zero and correspond to the interaction effects. The first two factors are plotted for the FAM4 model in Fig. 2a where the simple main effects are denoted by the fitted values of the second factor regressions at the mean loading of zero, that is 0.06 and − 0.09 t/ha for C1 and C2. In contrast, the generalised main effects for the FA4 model in Fig. 2b are denoted by the fitted values of the first factor regressions at the mean loading of 0.19, that is 0.05 and − 0.06 t/ha. There are two important differences between these approaches:

  1. The generalised main effects capture heterogeneity of scale variance, that is non-crossover GEI, whereas the simple main effects do not capture GEI. This is demonstrated in Fig. 2b where the regression lines diverge across environments so the genotype rankings never cross over, whereas the first factor regression lines in the FAM4 model are always parallel (not shown).

  2. The higher order factors in the FA4 model predominately capture crossover GEI only, whereas those in the FAM4 model capture some mixture of non-crossover and crossover GEI. This is demonstrated in Fig. 2c where the regression lines intersect so the genotype rankings cross over, whereas the regression lines in Fig. 2a diverge as well as cross over.

Fig. 2.

Regression plots for checks C1 and C2 in terms of the first two factors obtained from the (a) FAM4 and (b/c) FA4 models. Note: The simple main effects in (a) and the generalised main effects in (b) are denoted with closed circles and the growing regions are distinguished by shape. The percentage of additive genetic variance explained by each factor is labelled. The additive GE effects in (c) have been adjusted for those in (b)

Regressions on known covariates

The next two linear mixed models fitted include random regressions without translational invariance. The random regression in Jarquín et al. (2014) reflects a popular application of GS in plant breeding (rreg1; Table 5a). Like the compound symmetry model, however, this model is very restrictive since it only comprises two variance components. The only difference is that the interaction effects are now parametrised by known environmental covariates. This model can be extended to include heterogeneous interaction variances across covariates (rreg2; Table 5a). The AIC for the random regression in Heslot et al. (2014) is much better than the simpler random regression. There are negligible differences between the additive genetic variance explained, however, with $\bar{v}_s \approx 23\%$ and $\bar{v} \approx 58\%$ for both models. Interestingly, the former measure matches that reported in Jarquín et al. (2014).

A series of FAR-LMMs with translational invariance were then fitted (Table 5a). This approach required $k = 4$ known factors to reach a sufficient percentage of additive genetic variance explained for individual environments and overall, with $v_j > 40\%$ and $\bar{v} = 70.7\%$. The AIC for the FAR4 model is substantially better than the random regressions in Jarquín et al. (2014) and Heslot et al. (2014). The FAR4 model also explains more additive genetic variance in the known covariates, with $\bar{v}_s = 33.2\%$ compared to only 20.8% and 23.2%. This demonstrates the importance of appropriately modelling the variance structure between known covariates.

Regressions on known and latent covariates

The analyses concluded by fitting a series of IFA-LMMs with generalised main effects and translational invariance (Table 5b). This approach required $k_s = 4$ known and $k_r = 3$ latent factors to reach a sufficient percentage of additive genetic variance explained for individual environments and overall, with $v_j > 45\%$ and $\bar{v} = 74.9\%$. The AIC for the IFA4-3 model is substantially better than the FAR4 model. The IFA4-3 model also explains more overall variance, that is $\bar{v} = 74.9\%$ compared to 70.7%, despite similar variance explained by the known covariates, with $\bar{v}_s \approx 35\%$ for both models. This demonstrates the advantage of including generalised main effects based on latent environmental covariates, instead of simple main effects.

Model comparison

The IFA4-3 model provides a good fit to the MET dataset and captures a large proportion of additive genetic variance (Table 5). The FAM4 and FA4 models also provide a good fit and capture a large proportion of variance, but they cannot be used for prediction into future environments (Table 4). The random regression models in Jarquín et al. (2014) and Heslot et al. (2014) can be used for prediction, but they provide a poor fit, capture the lowest variance of all models and are not translational invariant. The FAR4 model provides a better fit and captures more variance than the simpler random regression models, and is translational invariant. The IFA4-3 model provides an even better fit, captures more variance than the FAR4 model and is also translational invariant, making it the preferred method of analysis in this paper.

Model assessment

The mean prediction accuracy of the IFA4-3 model is considerably higher than all other random regression models (Table 6). The prediction accuracy was calculated in terms of 24 current environments in 2017 P1 and 20 future environments in 2018 P2. The most notable differences between models are observed for the 2018 environments in Texas, where the accuracy of the IFA4-3 model is at least 0.22 higher. In the Southeast and Midsouth, the accuracies are at least 0.06 and 0.10 higher, respectively. The differences in Texas are negligible for the 2017 environments, where the accuracies are generally higher for all models. In the Southeast and Midsouth, however, the accuracies of the IFA4-3 model are still at least 0.09 higher.

Table 6.

Summary of the prediction accuracies for the 2017 current and 2018 future environments

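| Year | Model | Southeast Min | Southeast Mean | Southeast Max | ° Midsouth Min | ° Midsouth Mean | ° Midsouth Max | × Texas Min | × Texas Mean | × Texas Max | Overall Min | Overall Mean | Overall Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017 | rreg1 | 0.27 | 0.51 | 0.68 | 0.30 | 0.58 | 0.77 | 0.27 | 0.47 | 0.60 | 0.27 | 0.52 | 0.77 |
| | rreg2 | 0.27 | 0.52 | 0.69 | 0.29 | 0.58 | 0.76 | 0.27 | 0.47 | 0.61 | 0.27 | 0.52 | 0.76 |
| | FAR4 | 0.25 | 0.50 | 0.66 | 0.34 | 0.59 | 0.77 | 0.25 | 0.48 | 0.64 | 0.25 | 0.52 | 0.77 |
| | IFA4-3 | 0.33 | 0.60 | 0.76 | 0.45 | 0.68 | 0.79 | 0.29 | 0.50 | 0.65 | 0.29 | 0.60 | 0.79 |
| 2018 | rreg1 | 0.58 | 0.60 | 0.64 | 0.30 | 0.50 | 0.71 | −0.03 | 0.20 | 0.34 | −0.03 | 0.42 | 0.71 |
| | rreg2 | 0.58 | 0.61 | 0.64 | 0.28 | 0.49 | 0.70 | −0.02 | 0.21 | 0.36 | −0.02 | 0.42 | 0.70 |
| | FAR4 | 0.58 | 0.61 | 0.67 | 0.26 | 0.49 | 0.71 | 0.02 | 0.22 | 0.36 | 0.02 | 0.43 | 0.71 |
| | IFA4-3 | 0.60 | 0.67 | 0.71 | 0.31 | 0.60 | 0.79 | 0.30 | 0.44 | 0.62 | 0.30 | 0.56 | 0.79 |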

Presented for each model is the minimum, mean and maximum prediction accuracy for the Southeast, ° Midsouth and × Texas, as well as overall across all regions

Note: The models rreg1 and rreg2 correspond to the random regressions in Jarquín et al. (2014) and Heslot et al. (2014). The highest accuracy is distinguished with bold font

Model summaries and interpretation

Tables 7, 8 and Figs. 3, 4 present the model summaries previously described in  “Model summaries and interpretation”. These summaries are presented for the IFA4-3 model in terms of environments, covariates and genotypes.

Table 7.

The selected IFA4-3 model, Part 1: Summary of growing environments

| State | Env | Var | $v_{sj}$ | $v_j$ | $\hat{\lambda}_1$ | $\hat{\lambda}_2$ | $\hat{\lambda}_3$ | $\hat{\lambda}_4$ |
|---|---|---|---|---|---|---|---|---|
| North Carolina | 17NC1 | 0.01 | 85.4 | 69.3 | 0.06 | −0.04 | 0.33 | 0.06 |
| South Carolina | 17SC1 | 0.02 | 12.5 | 56.4 | 0.18 | −0.06 | 0.17 | −0.15 |
| | 17SC2 | 0.01 | 40.7 | 48.6 | 0.07 | −0.03 | 0.27 | −0.09 |
| | 17SC3 | 0.02 | 23.8 | 90.5 | 0.23 | −0.14 | 0.26 | −0.03 |
| Georgia | 17GA1 | 0.03 | 23.1 | 63.8 | 0.20 | −0.08 | 0.29 | −0.02 |
| | 17GA2 | 0.03 | 19.1 | 54.0 | 0.19 | −0.10 | 0.31 | 0.01 |
| | 17GA3 | 0.02 | 29.8 | 82.3 | 0.21 | −0.12 | 0.20 | −0.09 |
| | 17GA4 | 0.02 | 26.9 | 67.6 | 0.18 | −0.10 | 0.28 | 0.14 |
| ° Missouri | 17MO1 | 0.06 | 26.6 | 82.2 | 0.39 | −0.17 | −0.15 | 0.39 |
| ° Arkansas | 17AR1 | 0.01 | 49.1 | 100.0 | 0.14 | 0.00 | −0.32 | 0.09 |
| | 17AR2 | 0.06 | 32.1 | 89.2 | 0.39 | −0.16 | −0.34 | 0.30 |
| ° Mississippi | 17MS1 | 0.03 | 46.0 | 81.6 | 0.23 | 0.00 | −0.26 | −0.44 |
| | 17MS2 | 0.03 | 47.5 | 77.6 | 0.24 | −0.12 | −0.15 | 0.23 |
| | 17MS3 | 0.03 | 37.3 | 100.0 | 0.26 | −0.09 | −0.23 | −0.43 |
| ° Louisiana | 17LA1 | 0.03 | 19.9 | 71.8 | 0.23 | −0.17 | −0.01 | −0.32 |
| | 17LA2 | 0.02 | 22.5 | 76.4 | 0.20 | −0.10 | 0.11 | −0.07 |
| × Texas | 17TX1 | 0.02 | 61.4 | 91.8 | 0.15 | 0.39 | 0.04 | 0.09 |
| | 17TX2 | 0.02 | 36.6 | 61.9 | 0.12 | 0.28 | 0.10 | 0.17 |
| | 17TX3 | 0.05 | 41.5 | 74.0 | 0.21 | 0.46 | 0.07 | −0.06 |
| | 17TX4 | 0.01 | 32.6 | 64.6 | 0.10 | 0.22 | 0.07 | −0.17 |
| | 17TX5 | 0.04 | 29.9 | 62.0 | 0.20 | 0.34 | 0.01 | −0.18 |
| | 17TX6 | 0.01 | 80.7 | 44.3 | 0.06 | 0.17 | 0.05 | 0.10 |
| | 17TX7 | 0.02 | 44.4 | 66.5 | 0.12 | 0.33 | −0.01 | 0.02 |
| | 17TX8 | 0.02 | 24.1 | 72.0 | 0.13 | 0.28 | 0.12 | 0.19 |
| Overall | | 0.03 | 34.4 | 74.9 | 43.7 | 16.2 | 11.0 | 4.0 |

Presented are the REML estimates of additive genetic variance, percentage of variance explained by the known covariates ($v_{sj}$) and overall ($v_j$), as well as estimates of the joint factor loadings ($\hat{\lambda}_l$)

Note: The percentage of variance explained across all environments ($\bar{v}_s$ and $\bar{v}$), as well as by individual factors ($v_l$) is presented in the final row. The measure $v_{sj}$ is greater than $v_j$ for 17NC1 and 17TX6 since the known and latent covariates are not orthogonal for individual environments

Table 8.

The selected IFA4-3 model, Part 2: Summary of known environmental covariates

| Covariate | Covar | $v_{si}$ | $\hat{\lambda}_{s1}$ | $\hat{\lambda}_{s2}$ | $\hat{\lambda}_{s3}$ | $\hat{\lambda}_{s4}$ |
|---|---|---|---|---|---|---|
| LAT | 0.02 | 0.4 | 0.02 | 0.10 | −0.21 | −0.20 |
| LONG | 0.05 | 0.5 | −0.18 | 0.04 | 0.56 | 0.33 |
| avgCCR | −0.18 | 4.5 | −0.37 | 0.31 | −0.02 | 0.29 |
| maxDPT | 0.25 | 3.7 | 0.47 | −0.46 | −0.68 | −0.22 |
| maxDSR | 0.25 | 10.1 | −0.30 | 0.41 | −0.10 | 0.17 |
| minHUM | −0.33 | 3.5 | −0.62 | 0.24 | 1.03 | 1.10 |
| maxNSR | 0.04 | 1.9 | 0.05 | 0.11 | −0.19 | −0.29 |
| maxPRP | −0.01 | 0.1 | 0.04 | 0.05 | −0.18 | −0.55 |
| totPRP | 0.03 | 1.6 | 0.11 | −0.01 | 0.05 | −0.15 |
| maxTMP | 0.18 | 4.0 | −0.31 | 0.09 | 0.58 | 0.32 |
| minTMP | −0.18 | 3.1 | −0.05 | 0.44 | −0.67 | −1.00 |
| minWSP | 0.01 | 0.1 | −0.13 | −0.09 | 0.31 | 0.16 |
| avgWDR | −0.03 | 1.5 | 0.03 | 0.14 | −0.01 | −0.33 |
| maxST1 | −0.04 | 1.0 | 0.09 | 0.06 | −0.27 | −0.25 |
| minST1 | 0.04 | 0.1 | 0.37 | −0.48 | 0.15 | 0.96 |
| avgSM3 | −0.02 | 0.4 | 0.10 | 0.12 | 0.10 | 0.19 |
| avgSM4 | 0.05 | 1.2 | −0.10 | −0.15 | −0.25 | −0.41 |
| minST4 | 0.09 | 1.4 | −0.30 | 0.32 | 0.10 | −0.48 |
| Overall | 0.01 | 34.4 | 5.4 | 15.3 | 9.8 | 3.9 |

Presented are the REML estimates of additive genetic covariance, percentage of variance explained by individual known covariates ($v_{si}$) and estimates of the known factor loadings ($\hat{\lambda}_{sl}$)

Note: The percentage of variance explained by all known covariates ($\bar{v}_s$) and by individual factors ($v_{sl}$) is presented in the final row

Fig. 3.

Heatmaps of the additive genetic correlation matrices between environments in terms of the (a) known covariates and (b) known and latent covariates. Note: Both matrices are ordered using the dendrogram applied to (b). Black lines distinguish the Southeast, ° Midsouth and × Texas cotton growing regions. The colourkey ranges from 1 (agreement in rankings) through zero (dissimilarity in rankings) to −1 (reversal of rankings)

Fig. 4.

(a) Regression plots for checks C1 and C2 in terms of four joint factors and (b) percentage of additive genetic variance in the joint factors explained by the known covariates. Note: The generalised main effects in (a) are denoted with closed circles and the growing regions are distinguished by shape. The percentage of variance explained by each factor is labelled in (a) and the percentage of variance in each factor explained by all known covariates is labelled in (b). The additive GE effects for the higher order factors are adjusted for the preceding factor(s). Only 10 (of the 18) known covariates are displayed in (b) for brevity

Summary of environments and covariates

Table 7 presents a summary of the growing environments in the 2017 P1 MET dataset. The additive genetic variances of individual environments range from 0.01 to 0.06, with a mean of 0.03. These variances are obtained from the diagonal elements of the denominator in Eq. 31. The overall variance explained by the known and latent covariates is much higher than the variance explained by the known covariates alone, that is $v_j = 44.3$–$100.0\%$ with $\bar{v} = 74.9\%$ compared to $v_{sj} = 12.5$–$85.4\%$ with $\bar{v}_s = 34.4\%$. Most variance is explained overall in the Midsouth (84.9% compared to only 66.6% and 69.3%), whereas most variance is explained by the known covariates in Texas (41.1% compared to only 28.4% and 33.4%). Table 7 also presents REML estimates of the joint factor loadings. The first factor comprises positive loadings only, and explains $v_1 = 43.7\%$ of the additive genetic variance. The higher order factors comprise both positive and negative loadings, and explain $v_l = 4.0$–$16.2\%$, with 31.2% in total. The signs of the loadings indicate that the first factor captures non-crossover GEI only, whereas the higher order factors predominately capture crossover GEI only (Smith and Cullis 2018).

Table 8 presents a similar summary for the known environmental covariates in the MET dataset. The additive genetic covariances of individual covariates range from −0.33 to 0.25, with a mean of 0.01. These covariances are obtained from the square-root of the elements in Eq. 37. The variance explained by individual covariates is $v_{si} = 0.1$–$10.1\%$, with $\bar{v}_s = 34.4\%$. The most notable covariates are maxDSR (10.1%), avgCCR (4.5%) and maxTMP (4.0%). Table 8 also presents REML estimates of the known factor loadings. The interpretation of these loadings is similar to above, but note that the higher order factors explain more additive genetic variance than the first factor, with 29.0% in total compared to only 5.4%. This will be discussed further below.
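The totals quoted above can be cross-checked from the final rows of Tables 7 and 8. The short sketch below uses only the tabulated percentages, so the ratios agree with the values quoted elsewhere in the text only to within rounding of the table entries; the variable names are illustrative.

```python
# "Overall" rows of Tables 7 and 8: variance explained (%) by each factor.
v_joint = [43.7, 16.2, 11.0, 4.0]   # joint factors 1-4 (Table 7)
v_known = [5.4, 15.3, 9.8, 3.9]     # known factors 1-4 (Table 8)

total_joint = sum(v_joint)          # text reports 74.9
total_known = sum(v_known)          # text reports 34.4
higher_joint = sum(v_joint[1:])     # text reports 31.2
higher_known = sum(v_known[1:])     # text reports 29.0

# Share of each joint factor attributable to the known covariates.
share_known = [100 * s / j for s, j in zip(v_known, v_joint)]

print(round(total_joint, 1), round(total_known, 1))
print(round(higher_joint, 1), round(higher_known, 1))
print([round(x, 1) for x in share_known])
```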

Correlations between environments

Figure 3 presents heatmaps of the additive genetic correlation matrices between environments in terms of the (a) known covariates and (b) known and latent covariates. These matrices are ordered based on the dendrogram constructed using the agnes function in the cluster package (Maechler et al. 2019). This dendrogram generally places environments closer together that have more similar GEI patterns than those further apart. Figure 3 suggests there is structure to the GEI underlying the heatmaps. There are three notable features:

  1. The overall correlations based on the known and latent covariates are considerably higher than the correlations based on the known covariates alone.

  2. The highest overall correlations generally occur between environments in the same growing region. Environments in the Southeast and Midsouth are also well correlated.

  3. The overall correlations between environments in the same growing region are less than one. This indicates that crossover GEI is present within regions.
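The correlation matrices displayed in the heatmaps are scaled versions of between-environment variance matrices. A minimal sketch of that scaling is given below, using a small simulated factor analytic variance matrix; the loadings and specific variances are hypothetical, and the dendrogram ordering is not reproduced.

```python
import numpy as np


def cov_to_cor(G):
    """Scale a variance matrix to a correlation matrix."""
    sd = np.sqrt(np.diag(G))
    return G / np.outer(sd, sd)


rng = np.random.default_rng(5)
p, k = 6, 2
Lam = rng.normal(size=(p, k))                  # hypothetical loadings
Psi = np.diag(rng.uniform(0.1, 0.3, size=p))   # hypothetical specific variances
G_e = Lam @ Lam.T + Psi                        # factor analytic variance matrix

C = cov_to_cor(G_e)
print(np.round(C, 2))
```

Applying the same function to the known-covariate part of the variance matrix alone would give the analogue of panel (a).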

Regression plots for genotypes

Figure 4a presents a series of regression plots for checks C1 and C2 in terms of the k=4 joint factors in the IFA4-3 model. These plots are used to assess genotype performance and stability in response to the known and latent environmental covariates. These plots show that check C1 is generally higher performing than C2 since it has a higher predicted slope for the first factor regression, that is 0.26 compared to − 0.32. Both checks are considerably unstable, however, since they have large slopes for the higher order factors and therefore have large deviations about the first factor regression. Figure 4a also suggests that the second factor is correlated with longitude (Pearson's r=0.80), where the loadings on the left correspond to the Southeast and Midsouth while the loadings on the right correspond to Texas. This highlights an important limitation of the conventional FA-LMM, where interpretation is often limited to post-processing of the latent factors. This will be discussed further below.

Figure 4b presents direct interpretation of the factors in terms of the variance explained by the known environmental covariates. This figure suggests there is structure to the GEI underlying the regression plots. There are three notable features:

  1. The known covariates predominately model crossover GEI, with $v_{l\cdot} = 89.2$–$97.9\%$ of the additive genetic variance explained in the higher order factors compared to only $v_{1\cdot} = 12.4\%$ explained in the first factor. These measures are obtained from Eq. 36, and are equivalent to $v_{sl}/v_l$ in Tables 7 and 8.

  2. The second factor is well explained by multiple known covariates. This demonstrates the biological drivers of crossover GEI in this factor, that is the drivers of crossover GEI due to changes in LONG.

  3. The third and fourth factors are not well explained by individual covariates. This indicates that crossover GEI in these factors is driven by a combination of known covariates as well as their interaction.

Discussion

This paper developed a single-stage GS approach which integrates known and latent environmental covariates within a special factor analytic framework. The FA-LMM of Smith et al. (2001) is an effective method for analysing MET datasets, but has limited practicality since the underlying factors are latent so the modelled GEI is observable, rather than predictable. The advantage of using random regressions on known environmental covariates is that the modelled GEI becomes predictable. The IFA-LMM developed in this paper includes a model for predictable and observable GEI in terms of a joint set of known and latent environmental covariates.

Regressions on known environmental covariates were first used in plant breeding by Yates and Cochran (1938). Their work was later popularised by Finlay and Wilkinson (1963), and includes a fixed coefficient regression on a set of environmental mean yields (covariates). Despite its popularity, however, there is a fundamental problem with using mean yields as covariates (Knight 1970; Freeman and Perkins 1971). This problem can be overcome by implementing environmental covariates which are independent of the genotypes under study, such as soil moisture and daily temperature (Hardwick and Wood 1972; Fripp 1972). Several authors have also used fixed regressions on genotype covariates, such as disease resistance and maturity, in addition to the environmental covariates. This approach is often referred to as fixed factorial regression (Denis 1980, 1988).

An alternative approach is to use a linear mixed model with a random coefficient regression. This approach was popularised by Laird and Ware (1982), and requires an appropriate variance model for the intercepts and slopes which ensures the regression is scale and translational invariant. An appropriate choice is the fully unstructured variance model; however, this model becomes computationally prohibitive as the number of covariates increases. Recently, Heslot et al. (2014) extended the random regression model for GS, but they were unable to fit an appropriate variance model (also see Jarquín et al. 2014). The FAR-LMM developed in this paper includes a reduced rank factor analytic variance model for the intercepts and slopes. This ensures the regression is computationally efficient as well as both scale and translational invariant, regardless of the number of covariates. The selected FAR-LMM also provides a substantially better fit and captures more additive genetic variance than the simpler random regression models.

The FAR-LMM includes a set of simple main effects which reflect simple averages across environments. Smith and Cullis (2018) discuss the limitations of simple main effects, and demonstrate how generalised main effects can be obtained from FA-LMMs. They also discuss how the generalised main effects capture heterogeneity of scale variance, that is non-crossover GEI, whereas the simple main effects do not. The generalised main effects can therefore be viewed as weighted averages across environments which are based on differences in scale variance. This highlights an important difference to the simple main effects, which are more restrictive and based on a single genetic variance across environments. This feature is demonstrated in Fig. 2 for the FA-LMM and the FAM-LMM, where the generalised main effects capture 6% more additive genetic variance than the simple main effects.

The IFA-LMM is an effective method for analysing MET datasets which also utilises crossover and non-crossover GEI for genomic prediction into current and future environments. The IFA-LMM is effective since it exploits the desirable features of the FAR-LMM and the FA-LMM. That is, it exploits the ability of random regression models to capture crossover GEI for prediction using known covariates and the ability of factor analytic models to capture non-crossover GEI using latent covariates. The IFA-LMM can therefore be viewed as a random factorial regression, with known genotype covariates derived from marker data, known environmental covariates derived from weather and soil data as well as latent environmental covariates estimated from the phenotypic data itself. The IFA-LMM can also be viewed as a linear mixed model analogue to redundancy analysis (Van Den Wollenberg 1977), where the factors are constrained to be linear combinations of known and latent environmental covariates. The selected IFA-LMM provides a substantially better fit and captures more additive genetic variance than the selected FAR-LMM and the simpler random regression models.

There are three appealing features of the IFA-LMM which address several long-standing objectives of many plant breeding programmes:

  1. The IFA-LMM includes a regression model for GEI in terms of a small number of known and latent factors. This simultaneously reduces the dimension of the known and latent environmental covariates.

  2. The regression model captures predictable GEI in terms of known environmental covariates. This is predominately in the form of crossover GEI, and enables meaningful interpretation and prediction into any current or future environment.

  3. The regression model also captures observable GEI in terms of latent environmental covariates, which are orthogonal to the known covariates. This is predominately in the form of non-crossover GEI, and enables a large proportion of GEI to be captured by the regression model overall.

The IFA-LMM was demonstrated on a late-stage cotton breeding MET dataset. This dataset is an example of a small in situ training population which comprises a subset of current test genotypes and growing environments in 2017. A larger MET dataset across multiple years and locations is required, however, to capture the extent of transient and static GEI in the cotton growing regions of the USA. This will ensure the scope of the known and latent covariates is relevant for prediction into future environments. Computational challenges are anticipated for these larger MET datasets and finding efficient ways to scale the IFA-LMM is the topic of current research.

There are four important points from “Results”:

  1. The IFA4-3 model has fewer genetic variance parameters compared to the FA4 and FAM4 models, despite very similar model selection criteria (Tables 4 and 5). This highlights an important advantage of implementing known environmental information into the common factors. The IFA4-3 model also has better selection criteria than the FAR4 model. This also highlights the advantage of implementing generalised main effects based on latent environmental covariates, instead of simple main effects.

  2. The known environmental covariates explain $\bar{v}_s = 34.4\%$ of the overall additive genetic variance, which represents 93.0% of the crossover GEI captured by the regression model. This is at least 11% more variance compared to the random regression models in Jarquín et al. (2014) and Heslot et al. (2014).

  3. The latent environmental covariates explain 40.5% of the overall additive genetic variance, which represents 87.6% of the non-crossover GEI. This feature can be visualised in Fig. 3 where the overall correlations based on the known and latent covariates are much higher than those based on the known covariates alone.

  4. The mean prediction accuracy of the IFA4-3 model is 0.02–0.10 higher than all other random regression models for current environments and 0.06–0.24 higher for future environments (Table 6). This highlights another important advantage of implementing known environmental information into the common factors.

Point 4 is now discussed further. The mean prediction accuracy of the IFA4-3 model was considerably higher than that of all other random regression models, especially for future environments in Texas. The prediction accuracy was calculated in terms of 24 current environments in 2017 (P1) and 20 future environments in 2018 (P2) (Table 6). The accuracy of all models was generally low for Texas in 2018, with means of 0.20–0.44 across models. This suggests that GEI is more complex in Texas and that there is substantial transient GEI present across years in addition to static GEI across locations (Cullis et al. 2000). It also suggests that the crossover GEI captured by the known covariates may not be repeatable across years and that the generalised main effects based on the latent covariates may not accurately capture the true non-crossover GEI across years. That is, the current scope of the known and latent covariates is less relevant for Texas compared to the Southeast and Midsouth. The application of a larger multi-year MET dataset should overcome these issues.

Another key feature of the IFA-LMM is the ability to identify the biological drivers of GEI, such as maximum downward solar radiation and average cloud cover. Interpretation within the IFA-LMM was demonstrated using a series of regression plots (Fig. 4). These plots are used to assess genotype performance and stability in response to the known and latent environmental covariates. Previously, interpretation within factor analytic linear mixed models was limited to post-processing of model terms, for example by correlating known covariates with latent factors (Oliveira et al. 2020) or by examining the response of reference genotypes in different environments (Mathews et al. 2011). The distinguishing feature of the IFA-LMM is the ability to ascribe direct biological interpretation to the modelled GEI. This feature has three important practical implications:

  1. The first factor captures non-crossover GEI only, and is predominately explained by the latent environmental covariates. The higher order factors capture crossover GEI, and are predominately explained by the known environmental covariates. This enables the drivers of GEI across a set of target environments to be identified.

  2. The importance of known covariates as drivers of GEI can be quantified. This provides information on which covariates should be measured with high accuracy and which covariates may be less important or do not need to be measured at all. This is particularly appealing with the advent of high-throughput environmental data.

  3. Genomic selection tools can be applied to obtain measures of overall performance and stability for each genotype. This will enable the drivers of genotype performance and stability across a set of target environments to be identified. This is the topic of a subsequent paper.

The IFA-LMM is an effective method for analysing MET datasets which also utilises crossover and non-crossover GEI for genomic prediction into current and future environments. This is becoming increasingly important with the emergence of rapidly changing environments and climate change.

Acknowledgements

The authors thank Bayer CropScience for funding and use of their data. We also thank Kolbyn Joy, Nicholas Ames and Nilesh Dighe for their stimulating discussions and insights into the cotton breeding programme. Lastly, we sincerely thank the referees whose comments have led to an improved manuscript.

Appendix: Orthogonal matrix rotations

This appendix demonstrates how simple or generalised main effects can be obtained from factor analytic models regardless of whether intercepts are explicitly fitted. The simple main effects require rotation of the loadings and scores using a Gram-Schmidt process, whereas the generalised main effects require rotation to a principal component solution. The two rotations are detailed below.

Gram-Schmidt process

Smith (1999) discusses the need to column centre the environmental loadings in the FAMk model so they are orthogonal to the simple main effects. This is achieved using a Gram-Schmidt process, with:

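$$[\bar{1}_p \;\; \Lambda^{*}] = [\bar{1}_p \;\; \Lambda]\,U^{-1} \quad\text{and}\quad [\gamma_g^{\top} \;\; f^{*\top}]^{\top} = (U \otimes I_v)\,[\gamma_1^{\top} \;\; f^{\top}]^{\top},$$

with $\Lambda^{*\top} 1_p = 0$, where $\bar{1}_p = 1_p/\sqrt{p}$ and $U = Q^{\top}[\bar{1}_p \;\; \Lambda]$ is a $(k+1)\times(k+1)$ upper triangular matrix in which $Q = [q_1 \; q_2 \; \ldots \; q_{k+1}]$ is a $p\times(k+1)$ matrix with orthonormal columns given by:

$$q_1 = \bar{1}_p \quad\text{and}\quad q_{l+1} = \Big(\lambda_l - \sum_{h=1}^{l} q_h^{\top}\lambda_l\, q_h\Big)\Big/ c_{l+1}, \tag{38}$$

where $l = 1, 2, \ldots, k$ and $c_{l+1}$ is a constant chosen to ensure $q_{l+1}$ has unit length.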
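The Gram-Schmidt step in Eq. 38 can be sketched numerically. The snippet below is a minimal illustration, assuming NumPy, randomly generated (hypothetical) loadings and a unit-length vector of ones as the first column; the function and variable names are illustrative and are not taken from the ASReml-R scripts.

```python
import numpy as np


def gram_schmidt_rotation(Lambda):
    """Orthonormalise the columns of [1bar_p, Lambda], starting from 1bar_p.

    Returns (Q, U), where Q = [1bar_p, Lambda_star] has orthonormal columns
    and U = Q^T [1bar_p, Lambda] is upper triangular.
    """
    p, k = Lambda.shape
    one_bar = np.ones(p) / np.sqrt(p)            # unit-length vector of ones
    Q = np.zeros((p, k + 1))
    Q[:, 0] = one_bar                            # q_1
    for l in range(k):
        lam = Lambda[:, l]                       # lambda_{l+1} in 1-based notation
        # Remove the projections onto the columns of Q built so far.
        resid = lam - Q[:, : l + 1] @ (Q[:, : l + 1].T @ lam)
        Q[:, l + 1] = resid / np.linalg.norm(resid)   # scale to unit length
    A = np.column_stack([one_bar, Lambda])
    U = Q.T @ A
    return Q, U


rng = np.random.default_rng(1)
Lambda = rng.normal(size=(6, 2))   # hypothetical: p = 6 environments, k = 2 factors
Q, U = gram_schmidt_rotation(Lambda)
Lambda_star = Q[:, 1:]             # rotated loadings, orthogonal to the vector of ones
```

The rotated loadings sum to zero within each column, which is the column centring required for the simple main effects.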

It is assumed that:

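$$[\gamma_g^{\top} \;\; f^{*\top}]^{\top} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix},\; \begin{bmatrix} p\sigma_g^2 & D^{*}_{12} \\ D^{*}_{21} & D^{*}_{22} \end{bmatrix} \otimes G_g\right), \tag{39}$$

where $D^{*} = U \begin{bmatrix} p\sigma_1^2 & 0 \\ 0 & D \end{bmatrix} U^{\top}$, $\sigma_g^2 = \sigma_1^2 + \sum_{l=1}^{k} d_l \bar{\lambda}_l^2$ and $\bar{\lambda}_l = 1_p^{\top}\lambda_l/p$. The FAMk variance matrix in Eq. 6 is now given by:

$$G = \left([\bar{1}_p \;\; \Lambda^{*}] \begin{bmatrix} p\sigma_g^2 & D^{*}_{12} \\ D^{*}_{21} & D^{*}_{22} \end{bmatrix} [\bar{1}_p \;\; \Lambda^{*}]^{\top} + \Psi\right) \otimes G_g, \tag{40}$$

where $G_e \equiv \sigma_g^2 J_p + \bar{1}_p D^{*}_{12}\Lambda^{*\top} + \Lambda^{*} D^{*}_{21}\bar{1}_p^{\top} + \Lambda^{*} D^{*}_{22}\Lambda^{*\top} + \Psi$.

The conventional FAk model can be viewed as a special FAMk model where the intercept variance, $\sigma_1^2$, is constrained to zero. The variance matrix in Eq. 10 can therefore be written as:

$$G = \left([\bar{1}_p \;\; \Lambda] \begin{bmatrix} 0 & 0 \\ 0 & D \end{bmatrix} [\bar{1}_p \;\; \Lambda]^{\top} + \Psi\right) \otimes G_g. \tag{41}$$

Simple main effects can be obtained from this model using a similar Gram-Schmidt process as above. The FAk variance matrix in Eq. 41 is now given by:

$$G = \left([\bar{1}_p \;\; \Lambda^{*}] \begin{bmatrix} p\bar{\lambda}^2 & D^{*}_{12} \\ D^{*}_{21} & D^{*}_{22} \end{bmatrix} [\bar{1}_p \;\; \Lambda^{*}]^{\top} + \Psi\right) \otimes G_g, \tag{42}$$

where $G_e \equiv \bar{\lambda}^2 J_p + \bar{1}_p D^{*}_{12}\Lambda^{*\top} + \Lambda^{*} D^{*}_{21}\bar{1}_p^{\top} + \Lambda^{*} D^{*}_{22}\Lambda^{*\top} + \Psi$ and $\bar{\lambda}^2 = \sum_{l=1}^{k} d_l \bar{\lambda}_l^2$ is the simple main effect variance, which is equal to $\sigma_g^2$ in Eq. 40 when $\sigma_1^2 = 0$.

The FAk model in Eq. 10 can therefore be written as:

$$u = (\bar{1}_p \otimes I_v)\gamma_g + (\Lambda^{*} \otimes I_v) f^{*} + \delta, \tag{43}$$

where $\gamma_g$ is a $v$-vector of simple main effects, with:

$$\gamma_g = \sqrt{p}\sum_{l=1}^{k} \bar{\lambda}_l f_l \quad\text{and}\quad \gamma_g \sim N\big(0,\; p\bar{\lambda}^2 G_g\big). \tag{44}$$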
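The simple main effect variance in the FAk case can be checked numerically. The sketch below is illustrative only: it assumes NumPy, hypothetical loadings and score variances, and uses a QR factorisation in place of an explicit Gram-Schmidt loop (the two give the same rotation once the signs are fixed). All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 6, 2
Lambda = rng.normal(size=(p, k))     # hypothetical loadings
d = np.array([2.0, 0.5])             # hypothetical score variances, D = diag(d)
sigma1_sq = 0.0                      # FAk case: intercept variance constrained to zero

one_bar = np.ones(p) / np.sqrt(p)    # unit-length vector of ones
A = np.column_stack([one_bar, Lambda])

# QR stands in for Gram-Schmidt; flip signs so the diagonal of U is positive.
Q, U = np.linalg.qr(A)
signs = np.sign(np.diag(U))
Q, U = Q * signs, signs[:, None] * U

# Rotated score variance matrix, D* = U diag(p * sigma_1^2, D) U^T.
inner = np.diag(np.concatenate([[p * sigma1_sq], d]))
D_star = U @ inner @ U.T

lambda_bar = Lambda.mean(axis=0)                      # mean loading of each factor
main_var = sigma1_sq + np.sum(d * lambda_bar**2)      # simple main effect variance

before = A @ inner @ A.T             # common variance before rotation
after = Q @ D_star @ Q.T             # common variance after rotation
```

The leading element of `D_star` equals `p * main_var`, and the between-environment common variance is unchanged by the rotation.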

Principal component rotation

Constraints are required in the FAM-LMM and FA-LMM during estimation to ensure unique solutions for $\Lambda$ and $D$. Following Smith et al. (2021), the upper right elements of $\Lambda$ are set to zero when $k>1$ and $D$ is set to $I_k$. Let the loadings and scores with these constraints be denoted by $\tilde{\Lambda}$ and $\tilde{f}$, with $\tilde{f} \sim N(0, I_k \otimes G_g)$. The loadings and scores can be rotated back to their original form for interpretation. This rotation is given by:

$$\Lambda = \tilde{\Lambda} V D^{-1/2} \quad\text{and}\quad f = (D^{1/2} V^{\top} \otimes I_v)\tilde{f}, \tag{45}$$

where $V$ is a $k\times k$ orthonormal matrix of right singular vectors and $D^{1/2} = \oplus_{l=1}^{k}\sqrt{d_l}$ is a diagonal matrix of singular values sorted in decreasing order, with $f \sim N(0, D \otimes G_g)$. These matrices are obtained from the singular value decomposition $\tilde{\Lambda} = U D^{1/2} V^{\top}$, where $U$ is a $p\times k$ orthonormal matrix of left singular vectors and $\Lambda \equiv U$ in Eq. 45.
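The principal component rotation in Eq. 45 can be sketched with a thin singular value decomposition. The snippet below is a minimal illustration, assuming NumPy and randomly generated (hypothetical) constrained loadings with zeros in the upper right; variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 6, 3
# Hypothetical constrained loadings: upper right elements set to zero, scores with D = I_k.
Lambda_con = np.tril(rng.normal(size=(p, k)))

# Thin SVD: Lambda_con = U diag(s) V^T, with singular values in decreasing order.
U, s, Vt = np.linalg.svd(Lambda_con, full_matrices=False)
V = Vt.T
D_half = np.diag(s)                  # D^{1/2}
D = np.diag(s**2)                    # rotated score variances

Lambda_rot = Lambda_con @ V @ np.diag(1.0 / s)   # rotated loadings, equal to U
score_rotation = D_half @ V.T                    # applied to the constrained scores

common_before = Lambda_con @ Lambda_con.T
common_after = Lambda_rot @ D @ Lambda_rot.T
```

The rotated loadings are orthonormal, the rotated scores have variance `D`, and the common variance `common_before` is preserved.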

The loadings and scores can then be rotated using the Gram-Schmidt process in the previous section to obtain simple main effects for either model. Alternatively, generalised main effects can be obtained for the FA-LMM using Eq. 11. In terms of the FAM-LMM, however, an alternative rotation is required which consumes the intercept variance, σ12, into the factors. This rotation is given by:

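$$\Lambda = [1_p \;\; \tilde{\Lambda}]\, V D^{-1/2} \quad\text{and}\quad f = (D^{1/2} V^{\top} \otimes I_v)\,[\gamma_1^{\top} \;\; \tilde{f}^{\top}]^{\top},$$

where $V$ is a $(k+1)\times(k+1)$ orthonormal matrix and $D^{1/2} = \oplus_{l=1}^{k+1}\sqrt{d_l}$ is a diagonal matrix, with $f \sim N(0, D \otimes G_g)$. These matrices are obtained from the singular value decomposition $[1_p \;\; \tilde{\Lambda}] = U D^{1/2} V^{\top}$, where $U$ is a $p\times(k+1)$ orthonormal matrix and $\Lambda \equiv U$.

The FAMk model in Eq. 5 can therefore be written as:

$$u = (\Lambda \otimes I_v) f + \delta, \tag{46}$$

where $\Lambda$ is a $p\times(k+1)$ matrix and $f$ is a $v(k+1)$-vector. The generalised main effects are based on the first factor, with:

$$\gamma_g = \bar{\lambda}_1 f_1 \quad\text{and}\quad \gamma_g \sim N\big(0,\; d_1\bar{\lambda}_1^2 G_g\big), \tag{47}$$

where $\bar{\lambda}_1 = 1_p^{\top}\lambda_1/p$.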
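The way this rotation absorbs the intercept variance into the factors can be illustrated numerically. The sketch below is hedged: it assumes NumPy, hypothetical constrained loadings and a hypothetical intercept variance, and it scales the intercept column by the intercept standard deviation so that every score entering the rotation has unit variance. That scaling is an assumption of the sketch, made so the common variance is reproduced exactly, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
p, k = 6, 2
Lambda_con = np.tril(rng.normal(size=(p, k)))   # hypothetical constrained loadings
sigma1_sq = 1.5                                 # hypothetical intercept variance

# Assumption: scale the intercept column by sigma_1 so all k + 1 scores have unit variance.
A = np.column_stack([np.sqrt(sigma1_sq) * np.ones(p), Lambda_con])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Lambda_rot = U                 # p x (k + 1) rotated loadings
d = s**2                       # rotated score variances, in decreasing order

# Generalised main effect variance from the first factor, d_1 * lambda_bar_1^2 (Eq. 47).
lambda_bar_1 = Lambda_rot[:, 0].mean()
gen_main_var = d[0] * lambda_bar_1**2

common_before = sigma1_sq * np.ones((p, p)) + Lambda_con @ Lambda_con.T
common_after = Lambda_rot @ np.diag(d) @ Lambda_rot.T
```

The sign of the first singular vector is arbitrary, so `lambda_bar_1` may be negative, but `gen_main_var` is unaffected because it depends only on the square.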

Author contribution statement

DT conceived and developed the methodology, curated the data, conducted the analyses and wrote the manuscript. CG and BG provided input on plant breeding perspectives. JH organised the research project and secured funding. GG provided input on quantitative genetics perspectives. All authors have read and approved the final manuscript.

Funding

This research was funded by Bayer CropScience through collaboration with The Roslin Institute.

Data availability

The data that support the findings of this study are available from Bayer CropScience. Restrictions apply to the availability of these data, which were used under license for this study.

Code availability

The R scripts to fit all linear mixed models in Table 3 using ASReml-R are provided in the Supplementary Material.

Declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Change history

8/4/2023

A Correction to this paper has been published: 10.1007/s00122-023-04417-8

References

  1. Bailey RA. Design of comparative experiments. Cambridge: Cambridge University Press; 2008.
  2. Brancourt-Hulmel M, Denis JB, Lecomte C. Determining environmental covariates which explain genotype environment interaction in winter wheat through probe genotypes and biadditive factorial regression. Theoretical and Applied Genetics. 2000;100:285–298. doi: 10.1007/s001220050038.
  3. Buntaran H, Forkman J, Piepho HP. Projecting results of zoned multi-environment trials to new locations using environmental covariates with random coefficient models: accuracy and precision. Theoretical and Applied Genetics. 2021;134:1513–1530. doi: 10.1007/s00122-021-03786-2.
  4. Butler DG. pedicure: pedigree tools. R package version 2.0.1; 2019. http://mmade.org/pedicure/
  5. Butler DG. asreml: Fits the Linear Mixed Model. R package version 4.1.0; 2020. http://vsni.co.uk/software/asreml-r
  6. Cullis BR, Gogel BJ, Verbyla AP, Thompson R. Spatial analysis of multi-environment early generation trials. Biometrics. 1998;54:1–18. doi: 10.2307/2533991.
  7. Cullis BR, Smith AB, Hunt C, Gilmour AR. An examination of the efficiency of Australian crop variety evaluation programmes. The Journal of Agricultural Science, Cambridge. 2000;135:213–222. doi: 10.1017/S0021859699008163.
  8. Cullis BR, Jefferson P, Thompson R, Smith AB. Factor analytic and reduced animal models for the investigation of additive genotype by environment interaction in outcrossing plant species with application to a pinus radiata breeding program. Theoretical and Applied Genetics. 2014;127:2193–2210. doi: 10.1007/s00122-014-2373-0.
  9. Denis JB. Analyse de régression factorielle. Biométrie-Praximétrie. 1980;20:1–34.
  10. Denis JB. Two way analysis using covariates. Statistics. 1988;19:123–132. doi: 10.1080/02331888808802080.
  11. Falconer DS, Mackay T. Introduction to Quantitative Genetics. 4th ed. Essex, England: Longman; 1996.
  12. Finlay KW, Wilkinson GN. The analysis of adaptation in a plant-breeding programme. Australian Journal of Agricultural Research. 1963;14:742–754. doi: 10.1071/AR9630742.
  13. Freeman GH, Perkins JM. Environmental and genotype-environmental components of variability VIII. Relations between genotypes grown in different environments and measures of these environments. Heredity. 1971;27:15–23. doi: 10.1038/hdy.1971.67.
  14. Fripp YJ. Genotype-environmental interactions in Schizophyllum commune. II. Assessing the environment. Heredity. 1972;28:223–228. doi: 10.1038/hdy.1972.27.
  15. Gauch HG. Statistical analysis of regional yield trials: AMMI analysis of factorial designs. Amsterdam: Elsevier; 1992.
  16. Gilmour AR, Cullis BR, Verbyla AP. Accounting for Natural and Extraneous Variation in the Analysis of Field Experiments. Journal of Agricultural, Biological, and Environmental Statistics. 1997;2:269–293. doi: 10.2307/1400446.
  17. Hardwick R, Wood J. Regression methods for studying genotype-environment interactions. Heredity. 1972;28:209–222. doi: 10.1038/hdy.1972.26.
  18. Heslot N, Akdemir D, Sorrells ME, Jannink JL. Integrating environmental covariates and crop modeling into the genomic selection framework to predict genotype by environment interactions. Theoretical and Applied Genetics. 2014;127:463–480. doi: 10.1007/s00122-013-2231-5.
  19. Jarquín D, Crossa J, Lacaze X, Du Cheyron P, Daucourt J, Lorgeou J, Piraux F, Guerreiro L, Pérez P, Calus M, Burgueño J, de los Campos G. A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theoretical and Applied Genetics. 2014;127:595–607. doi: 10.1007/s00122-013-2243-1.
  20. Jennrich RI, Schluchter MD. Unbalanced Repeated-Measures Models with Structured Covariance Matrices. Biometrics. 1986;42:805–820. doi: 10.2307/2530695.
  21. Kelly AM, Smith AB, Eccleston JA, Cullis BR. The Accuracy of Varietal Selection Using Factor Analytic Models for Multi-Environment Plant Breeding Trials. Crop Science. 2007;47:1063–1070. doi: 10.2135/cropsci2006.08.0540.
  22. Kirkpatrick M, Meyer K. Direct estimation of genetic principal components: Simplified analysis of complex phenotypes. Genetics. 2004;168:2295–2306. doi: 10.1534/genetics.104.029181.
  23. Knight R. The measurement and interpretation of genotype-environment interactions. Euphytica. 1970;19:225–235. doi: 10.1007/BF01902950.
  24. Laird NM, Ware JH. Random-Effects Models for Longitudinal Data. Biometrics. 1982;38:963–974. doi: 10.2307/2529876.
  25. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K. cluster: Cluster Analysis Basics and Extensions. R package version 2.1.0; 2019.
  26. Mardia KV, Kent JT, Bibby JM. Multivariate analysis. London: Academic Press; 1979.
  27. Mathews KL, Trethowan R, Milgate AW, Payne T, van Ginkel M, Crossa J, DeLacy I, Cooper M, Chapman SC. Indirect selection using reference and probe genotype performance in multi-environment trials. Crop and Pasture Science. 2011;62:313–327. doi: 10.1071/CP10318.
  28. Meuwissen THE, Hayes BJ, Goddard ME. Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps. Genetics. 2001;157:1819–1829. doi: 10.1093/genetics/157.4.1819.
  29. Oakey H, Verbyla AP, Pitchford W, Cullis BR, Kuchel H. Joint modeling of additive and non-additive genetic line effects in single field trials. Theoretical and Applied Genetics. 2006;113:809–819. doi: 10.1007/s00122-006-0333-z.
  30. Oakey H, Verbyla AP, Cullis BR, Wei X, Pitchford WS. Joint modelling of additive and non-additive (genetic line) effects in multi-environment trials. Theoretical and Applied Genetics. 2007;114:1319–1332. doi: 10.1007/s00122-007-0515-3.
  31. Oakey H, Cullis BR, Thompson R, Comadran J, Halpin C, Waugh R. Genomic Selection in Multi-environment Crop Trials. G3: Genes|Genomes|Genetics. 2016;6:1313–1326. doi: 10.1534/g3.116.027524.
  32. Oliveira ICM, Guilhen JHS, de Oliveira Ribeiro PC, Gezan SA, Schaffert RE, Simeone MLF, Damasceno CMB, de Souza Carneiro JE, Carneiro PCS, da Costa Parrella RA, Pastina MM. Genotype-by-environment interaction and yield stability analysis of biomass sorghum hybrids using factor analytic models and environmental covariates. Field Crops Research. 2020;257:107929. doi: 10.1016/j.fcr.2020.107929.
  33. Patterson H, Silvey V, Talbot M, Weatherup S. Variability of yields of cereal varieties in U.K. trials. The Journal of Agricultural Science, Cambridge. 1977;89:238–245. doi: 10.1017/S002185960002743X.
  34. Patterson HD, Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58:545–554. doi: 10.2307/2334389.
  35. Piepho HP. Analyzing genotype-environment data by mixed models with multiplicative terms. Biometrics. 1997;53:761–766. doi: 10.2307/2533976.
  36. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2021. http://www.R-project.org/
  37. Smith A, Norman A, Kuchel H, Cullis B. Plant variety selection using interaction classes derived from factor analytic linear mixed models: Models with independent variety effects. Frontiers in Plant Science. 2021;12:737462. doi: 10.3389/fpls.2021.737462.
  38. Smith AB. Multiplicative mixed models for the analysis of multi-environment trial data. PhD thesis, University of Adelaide; 1999. http://hdl.handle.net/2440/19539
  39. Smith AB, Cullis BR. Plant breeding selection tools built on factor analytic mixed models for multi-environment trial data. Euphytica. 2018;214:143. doi: 10.1007/s10681-018-2220-5.
  40. Smith AB, Cullis BR, Thompson R. Analyzing Variety by Environment Data Using Multiplicative Mixed Models and Adjustments for Spatial Field Trend. Biometrics. 2001;57:1138–1147. doi: 10.1111/j.0006-341X.2001.01138.x.
  41. Smith AB, Cullis BR, Thompson R. The analysis of crop cultivar breeding and evaluation trials: an overview of current mixed model approaches. The Journal of Agricultural Science, Cambridge. 2005;143:449–462. doi: 10.1017/S0021859605005587.
  42. Stranden I, Garrick DJ. Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. Journal of Dairy Science. 2009;92:2971–2975. doi: 10.3168/jds.2008-1929.
  43. Thompson R, Cullis BR, Smith AB, Gilmour AR. A sparse implementation of the average information algorithm for factor analytic and reduced rank variance models. Australian and New Zealand Journal of Statistics. 2003;45:445–459. doi: 10.1111/1467-842X.00297.
  44. Tolhurst DJ, Mathews KL, Smith AB, Cullis BR. Genomic selection in multi-environment plant breeding trials using a factor analytic linear mixed model. Journal of Animal Breeding and Genetics. 2019;136:279–300. doi: 10.1111/jbg.12404.
  45. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001;17:520–525. doi: 10.1093/bioinformatics/17.6.520.
  46. Ukrainetz NK, Yanchuk AD, Mansfield S. Climatic drivers of genotype-environment interactions in lodgepole pine based on multi-environment trial data and a factor analytic model of additive covariance. Canadian Journal of Forest Research. 2018;48:835–854. doi: 10.1139/cjfr-2017-0367.
  47. Van Den Wollenberg AL. Redundancy analysis: an alternative for canonical correlation analysis. Psychometrika. 1977;42:207–219. doi: 10.1007/BF02294050.
  48. VanRaden PM. Efficient Methods to Compute Genomic Predictions. Journal of Dairy Science. 2008;91:4414–4423. doi: 10.3168/jds.2007-0980.
  49. Wood J. The use of environmental variables in the interpretation of genotype-environment interaction. Heredity. 1976;37:1–7. doi: 10.1038/hdy.1976.61.
  50. Yan W, Hunt LA, Sheng Q, Szlavnics Z. Cultivar evaluation and mega-environment investigation based on the GGE biplot. Crop Science. 2000;40:597–605. doi: 10.2135/cropsci2000.403597x.
  51. Yates F, Cochran WG. The analysis of groups of experiments. The Journal of Agricultural Science, Cambridge. 1938;28:556–580. doi: 10.1017/S0021859600050978.

Articles from TAG. Theoretical and Applied Genetics. Theoretische Und Angewandte Genetik are provided here courtesy of Springer
