Plant Methods. 2025 Aug 25;21:119. doi: 10.1186/s13007-025-01434-0

Disentangling soybean GxE effects in an integrated genomic prediction and machine learning-GWAS workflow

Niel Verbrigghe 1, Hilde Muylle 1, Marie Pegard 2, Hendrik Rietman 3, Vuk Đorđević 4, Marina Ćeran 4, Isabel Roldán-Ruiz 1
PMCID: PMC12376716  PMID: 40855500

Abstract

Integrating genotype-by-environment (GxE) interactions into genomic prediction models has been demonstrated to enhance the accuracy of predictions for crops exposed to unfavourable environmental conditions. However, despite the increasing complexity of machine learning models in genomic prediction, no model or approach has been found to be overall superior in comparison to a classical genomic best linear unbiased prediction (GBLUP) model. In this paper, we compared two GBLUP models (Linear Mixed Effects model and Bayesian GBLUP) with two machine learning models (Random Forest and Extreme Gradient Boosting) on the EUCLEG soybean genotype set phenotyped in Belgium and Serbia. We found similar performance for the Bayesian GBLUP and the two machine learning methods. However, using a workflow that decomposed the environment-specific BLUPs into a main genetic and an interaction GxE effect, we found increased predictive ability for the interaction component compared to a single-component approach. Furthermore, conducting a machine learning-genome-wide association study (ML-GWAS) on both components allowed us to identify important markers for the main genetic effect, as well as environment-specific markers. These could then be associated with correlated markers in other environments. By building a small random forest model using only 50 uncorrelated, important markers, we obtained a genomic prediction model with similar predictive ability over all scenarios when compared to the large models including all markers. The results demonstrate a new, integrated genomic prediction and ML-GWAS approach, aimed at high predictive ability and coupled marker detection in the soybean genome for traits phenotyped in different environments.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13007-025-01434-0.

Keywords: Genomic prediction, Machine learning, GWAS, Random forest, Soybean, GxE

Introduction

Soybean is a crucial global crop. Essential in food and feed as a source of both oil and protein, it is also used in various industries. However, the EU relies heavily on imports to meet its needs, with over 80% of its soybean supply coming from non-EU countries [1]. One of the most important challenges for expanding soybean cultivation in Europe is the need for varieties that are specifically adapted to the diverse European climates. A major fitness trait is early maturity, but genotypes belonging to maturity group 000 suffer from lower variability [2, 3] and might therefore benefit most from targeted integration of protein- or yield-improving alleles in their genomes. Genomic prediction can enable a more efficient and accurate selection of desirable traits in genotypes from other maturity groups and so accelerate variety development [4, 5]. Several genomic prediction methods have been implemented in breeding of multiple crops, including soybean [6–8]. However, there is still no statistical, machine learning or deep learning approach that performs best across different genotype sets, cross-validation schemes or environments [9–11]. Hence, testing different methods, both well-established and novel, in specific scenarios remains necessary.

Genomic best linear unbiased prediction (GBLUP)-type models are commonly used for solving genomic prediction equations [12, 13]. GBLUP methods assume that all markers have an effect on the trait variability and that these marker effects follow a normal distribution [13, 14]. Linear mixed effects models (LMMs) are such a parametric GBLUP approach used to estimate the genotype BLUPs and are based on the variance–covariance matrix calculated from the genomic relationship matrix [15]. An alternative for estimating BLUPs is the semi-parametric Reproducing Kernel Hilbert Spaces (RKHS) regression, which has been shown to be particularly suited for traits with low heritability [12, 16]. In the RKHS method, different types of kernel functions, such as Gaussian or exponential kernels, are used to map the independent, non-linear marker effects into a new space, after which conventional machine learning techniques are implemented for genomic prediction [17]. Bayesian GBLUP is a special case using a linear kernel function, which comes down to calculating the same genomic relationship matrix as with the LMM method [12].

In contrast to the GBLUP methods, machine learning methods using an untransformed marker matrix make no assumptions about the distribution of marker effects, i.e., they allow for the presence of predominantly zero-effect markers in combination with some high-effect markers. They are also able to account for marker interaction effects, permitting the prediction of dominance and epistatic effects. A well-established example is Random Forest, an ensemble machine learning method that constructs a large number of decision trees using bagging (bootstrap aggregation), where each decision tree is trained on a different subset of the data [18]. One of the main reasons why Random Forest has become a subject of research in the field of genomic prediction is that the method enables computation of a measure of importance for each marker, reflecting its contribution to predictive ability [19]. Another machine learning technique being explored is Extreme Gradient Boosting (XGB), an advanced implementation of an ensemble method that builds decision trees sequentially, called boosting, as opposed to parallel bagging in Random Forest [20]. At each new iteration, a tree is fitted to the remaining errors (the gradient of the loss) of the current ensemble, so that the errors of the previous iterations are corrected. Although XGB has been shown to have higher predictive ability than Random Forest for many tasks, it is computationally demanding and more vulnerable to overfitting than parallel bagging when the set of hyperparameters is not tuned precisely [21, 22].

Independent of the genomic prediction approach implemented, one of the biggest challenges in genomic prediction is how to deal with the effects of environmental conditions on the plant phenotypes. This is of paramount importance, as high genotype-by-environment (GxE) effects and low correlations between environments can significantly reduce predictive ability when ignored [23, 24]. From a breeder’s perspective, high GxE effects become especially important when predicting the performance of unphenotyped soybean genotypes in an environment where only other genotypes have been phenotyped. Hence, it is important to assess model performance across different CV schemes. For GBLUP models, several approaches have been introduced to include GxE effects [12, 25, 26] and it has been repeatedly demonstrated that including GxE effects improves predictive ability across environments [27–30]. Next to considering only the GxE effect, one can also explicitly model the marker-by-environment (MxE) effects, as is done in the machine learning methods. Because of the way the marker matrix for such models is constructed, the dimensions of this matrix are huge. One way to reduce the dimensions of the marker matrix is to train the model on the main and interaction components separately. The main component then encompasses the genetic merit of a genotype, regardless of the environment in which it is grown. The interaction component, on the other hand, reflects the genetic merit of a genotype in a specific environment only, since a genotype may be genetically more adapted to certain environmental conditions than to others. Next to reducing matrix size, this kind of approach could also benefit from more specifically tuned hyperparameter sets for the machine learning methods, as a different set can be used for the main and the interaction model, leading to higher predictive ability.

When marker effects are modelled explicitly instead of genomic relations, the variable importance (VI) can be estimated for both the main and the MxE markers. In this way, markers that contribute to a more accurate prediction can be identified. The advantage of such a machine learning genome-wide association study (ML-GWAS) over conventional GWAS is that it is able to take into account the interaction effects between markers, enabling simultaneous consideration of a wide range of interconnected processes defining complex traits [31]. For Random Forest, the VI of a marker is defined as the average decrease in model accuracy on the out-of-bag samples when the values of the respective marker are randomly permuted [18]. Although the markers can be ranked according to their importance in that way, it has been shown that the VI can be biased [32]. Hence, Altmann et al. [33] developed a way of normalising the VI values in order to correct the bias by repeatedly permuting the response vector. In that way, a non-informative (null) distribution is estimated for the VI of each marker, and a corresponding p-value can be calculated. However, this approach becomes computationally very demanding for large models such as genomic prediction models estimating MxE effects. A solution was found by Janitza et al. [34], who developed a computationally fast method for datasets with a large number of variables without any effect, such as the MxE design, to estimate a null distribution for all marker effects combined. With this non-parametric method a p-value is determined for each marker VI, calculated on a test set of samples. An important limitation of this method is that the lowest p-value that can be assigned to a VI depends on the number of VIs in the null distribution, which in general does not suffice to accurately assess the p-value for highly important markers. Fortunately, using the minimum value of the null distribution to calculate a relative importance score instead of a p-value, Szymczak et al. [35] developed and validated a feasible workaround. The application of ML-GWAS is relatively new, although it has been used in several (soybean) studies [31, 36, 37].

This study aims to evaluate the benefits of using advanced genomic prediction methods in soybean breeding. The evaluation was divided into three stages. First (i), the predictive ability of four genomic prediction methods was evaluated in two breeder-relevant CV schemes. The second (ii) aim was to test whether a model design where main and interaction effects were explicitly separated performed better than a comprehensive, all-in-one model. The third stage (iii) involved identifying important, uncorrelated genomic markers that contributed to a more accurate prediction of both main and MxE effects in an ML-GWAS analysis. We then used these markers in a smaller genomic prediction model. This workflow enabled us to evaluate and extend the available genomic prediction tools, and to present a tailored approach to facilitate soybean breeding in Europe.

Methods

Experimental design and phenotyping

Field experiments were installed in 2018 and 2019 in Kessenich, Belgium (51°08′N 5°48′E) and Novi Sad, Serbia (45°20′N 19°51′E), making up four different environments: BE2018, BE2019, SRB2018 and SRB2019 as described in [38]. Weather data for the four environments were fetched from the Agri4Cast research portal [39]. Cumulative water deficit, growing degree days, cumulative precipitation and cumulative global radiation were calculated from the first day after sowing until the last genotype was harvested, and are shown in Figure S2. Cumulative water deficit was calculated as the positive cumulative excess of potential evapotranspiration from a crop canopy over precipitation, and growing degree days as the cumulative daily average temperature (T) above a base T of 5 °C. On a homogeneous field without shading, 360 of the original EUCLEG genotypes were sown in an augmented design, where for the four maturity groups (MGI/II, MG0, MG00 and MG000), each containing 90 genotypes, 2 genotypes were replicated 6 times, 8 were replicated twice and 80 were included only once (see Supplementary Material). Seeds were sown at 5 cm depth, in plots of 1.5 × 4 m with varying row distance. To avoid border effects, only the middle 2.5 m of the inner rows was harvested in each plot. The average beginning of flowering (R1) and full maturity (R8) according to Fehr [40] are summarised in Table 1.
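The two weather indices defined above can be illustrated with a minimal Python sketch. The function names and the toy daily series are hypothetical. The water deficit is read here as the running balance of potential evapotranspiration minus precipitation, reported only where it is positive; this is one possible interpretation of the description and not necessarily the exact calculation used for Figure S2.

```python
from itertools import accumulate


def growing_degree_days(daily_mean_temp, base_temp=5.0):
    """Cumulative growing degree days: running sum of max(T - base, 0)."""
    return list(accumulate(max(t - base_temp, 0.0) for t in daily_mean_temp))


def cumulative_water_deficit(pet, precipitation):
    """Running sum of (PET - precipitation), floored at zero.

    Assumption: a negative balance (water surplus) is reported as 0.
    """
    balance = accumulate(e - p for e, p in zip(pet, precipitation))
    return [max(b, 0.0) for b in balance]


# Hypothetical daily values (degrees C and mm), for illustration only.
temps = [4.0, 10.0, 15.0, 5.0]
pet = [2.0, 3.0, 4.0, 1.0]
rain = [5.0, 0.0, 0.0, 10.0]

print(growing_degree_days(temps))           # [0.0, 5.0, 15.0, 15.0]
print(cumulative_water_deficit(pet, rain))  # [0.0, 0.0, 4.0, 0.0]
```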

Table 1.

Sowing details for the four environments, and average number of Days After Sowing (DAS) at which specific developmental stages (R1 and R8) were achieved at each environment

| Environment | Sowing date | Average start of R1 (DAS ± st. dev) | Average start of R8 (DAS ± st. dev) |
|---|---|---|---|
| BE2018 | 2018-05-05 | 58 ± 9 | 121 ± 14 |
| BE2019 | 2019-05-12 | 67 ± 10 | 137 ± 17 |
| SRB2018 | 2018-04-19 | 42 ± 7 | 112 ± 14 |
| SRB2019 | 2019-04-07 | 66 ± 5 | 132 ± 13 |

Before sowing, seeds at the Kessenich site were inoculated with Nitrazon (Farma Žiro, Ltd., Czech Republic). In Novi Sad, inoculation was not necessary because of frequent soybean cultivation in the past. Only Kessenich was irrigated during the growing season in 2018. NPK fertiliser was applied before sowing at 36–28–140 kg/ha in Kessenich and at 45–45–45 kg/ha in Novi Sad. No fertilisation was used after sowing.

Three phenotypic traits were chosen for genomic prediction: days after sowing when 50% of the plants were fully mature (R8), dry matter seed yield and seed protein content. We considered that a plant had attained the R8 developmental stage when 95% of its pods reached their mature brown pod colour. After harvesting, seeds were dried for 72 h at 70 °C in a drying oven before weighing. Seed protein content was estimated using near-infrared reflectance spectroscopy (NIRS) on dry samples milled over a 1 mm screen using a cutting mill, as described in Pannecoucque et al. [41]. From the 360 phenotyped EUCLEG genotypes sown, 317 had valid observations in the four environments for all three traits under consideration, and were retained for further analysis (see Supplementary Material).

Genotyping

Single-nucleotide polymorphism (SNP) calling of the EUCLEG genotypes and quality control of the 355 K SoySNP microarray SNP data were conducted as described in Saleem et al. [2]. That manuscript also discusses the EUCLEG population structure analysis using fastSTRUCTURE and the division into five subgroups. SNP calling resulted in a marker matrix XM which was coded 0/1/2, with 0 indicating homozygosity for the most frequent allele, 1 heterozygosity, and 2 homozygosity for the least frequent allele. This allele coding was chosen because a marker matrix containing mostly zeros can be stored more compactly as a sparse matrix (class ‘dgCMatrix’ in R [42]). We had valid data for 226,798 SNP markers, positioned onto 20 chromosomes and 234 unanchored scaffolds in the Glyma.Wm82.a2 soybean reference genome. Markers with a minor allele frequency (MAF) < 0.01 were removed, and missing values (0.27% of the entries) were replaced by the average allele frequency of that marker over the entire population. The MAF distribution of the remaining markers was visually checked for outliers with a histogram. This resulted in 163,926 valid markers, of which 162,849 (99.3%) were located on the chromosomes and 1,077 (0.66%) on unanchored scaffolds.
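The MAF filter and the imputation step can be sketched as follows in Python. This is an illustration under stated assumptions, not the pipeline of Saleem et al. [2]: the function names and the two toy markers are hypothetical, missing calls are represented by `None`, and a missing value is replaced by the mean observed dosage of that marker (one reading of "average allele frequency").

```python
def minor_allele_frequency(column):
    """MAF of one marker from its observed 0/1/2 dosages."""
    observed = [g for g in column if g is not None]
    freq = sum(observed) / (2 * len(observed))  # frequency of the counted allele
    return min(freq, 1 - freq)


def filter_and_impute(markers, maf_threshold=0.01):
    """Drop markers with MAF below the threshold, then mean-impute missing calls.

    markers: dict mapping a marker name to a list of dosages (0, 1, 2 or None).
    Assumption: missing calls are replaced by the mean observed dosage.
    """
    kept = {}
    for name, column in markers.items():
        if minor_allele_frequency(column) < maf_threshold:
            continue
        observed = [g for g in column if g is not None]
        mean_dosage = sum(observed) / len(observed)
        kept[name] = [mean_dosage if g is None else g for g in column]
    return kept


# Hypothetical toy data: snp_b is monomorphic and is filtered out.
toy = {"snp_a": [0, 1, 2, None], "snp_b": [0, 0, 0, 0]}
print(filter_and_impute(toy))  # {'snp_a': [0, 1, 2, 1.0]}
```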

BLUP estimation

For the traits seed yield and protein content, environment-specific best linear unbiased predictions (envBLUPs) were calculated from the observational data using the following generalised additive mixed models,

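$$y = X_{\mathrm{env}}\beta_{\mathrm{env}} + X_{R8}\beta_{R8} + Z_{G}b_{G} + Z_{GE}b_{GE} + f_{\mathrm{env}}(\mathrm{col}_{\mathrm{env}}, \mathrm{row}_{\mathrm{env}}) + \varepsilon \tag{1}$$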

where y is the response vector of length n·s (n individuals in s environments) for seed yield or protein content; βenv is the vector of the s fixed environmental effects; βR8 is the vector of length n·s with the fixed effects of maturity (growth stage R8) in days after sowing; bG and bGE are the vectors containing the overall genotypic random effect with length n (the main component) and the genotype-by-environment interaction effect with length n·s (the interaction component), respectively. The design matrices of these effects are Xenv with dimensions n·s × s, XR8 and ZGE with dimensions n·s × n·s, and ZG with dimensions n·s × n. Finally, fenv is a two-dimensional smoothing spline for each environment on the rows and columns of that environment and ε is the residual error. The maturity term was included in the equation to account for the bias introduced by a longer growing season.

Maturity (R8) itself was modelled in a similar way as a response variable, leaving out the maturity term as fixed effect,

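$$y_{R8} = X_{\mathrm{env}}\beta_{\mathrm{env}} + Z_{G}b_{G} + Z_{GE}b_{GE} + f_{\mathrm{env}}(\mathrm{col}_{\mathrm{env}}, \mathrm{row}_{\mathrm{env}}) + \varepsilon \tag{2}$$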

The envBLUPs were then calculated by summing the main and interaction effects,

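$$\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \otimes b_{G} + b_{GE} \tag{3}$$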

where for each trait, zj is the vector with length n in environment j of which the envBLUPs will be used to fit the investigated genomic prediction models.

The generalised additive mixed models above were fitted with the R-package gamm4 [43], and the random effects were extracted with the ranef function from the R-package lme4 [44].

Full model genomic prediction methods

Linear mixed effects model

The GxE GBLUP linear mixed effects model was constructed using the method described in Lopez-Cruz et al. [12]. The effect of the kth marker in the genomic marker matrix (XM) for the jth environment ($\alpha_{jk}$) can be decomposed into a random effect common to all environments ($b_{0k}$, with $b_0 \sim N(0, I\sigma_{b_0}^2)$) and a random effect specific to environment j ($b_{jk}$, with $b_j \sim N(0, I\sigma_{b_j}^2)$), yielding $\alpha_{jk} = b_{0k} + b_{jk}$. Hence, the equation for the ith genotype in the jth environment becomes $z_{ij} = \mu_j + \sum_{k=1}^{p} x_{ijk}(b_{0k} + b_{jk}) + \varepsilon_{ij}$ (i = 1, 2, …, n individuals; j = 1, 2, …, s environments; k = 1, 2, …, p markers). Since the fixed environment effect βenv was not extracted from Eqs. 1 and 2, the intercept µj for each environment is set to zero. In matrix notation, the following is then obtained for the four environments,

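$$\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix} b_0 + \begin{bmatrix} X_1 & 0 & 0 & 0 \\ 0 & X_2 & 0 & 0 \\ 0 & 0 & X_3 & 0 \\ 0 & 0 & 0 & X_4 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \end{bmatrix} \tag{4}$$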

where zj is the response vector of n individuals for each trait containing the fitted envBLUP in environment j (Eq. 3); Xj is the n × p design matrix for the genetic markers in environment j and εj ($\varepsilon_j \sim N(0, I\sigma_{\varepsilon}^2)$) is the residual error term. Since the design is full factorial for all traits, meaning all lines are evaluated in all environments, X1 = X2 = X3 = X4 = XM and this model can be represented as a two-variance-component GBLUP model, with,

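$$z = \begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix}, \quad u_0 = \begin{bmatrix} X_M \\ X_M \\ X_M \\ X_M \end{bmatrix} b_0, \quad u_1 = \begin{bmatrix} X_M & 0 & 0 & 0 \\ 0 & X_M & 0 & 0 \\ 0 & 0 & X_M & 0 \\ 0 & 0 & 0 & X_M \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} \tag{5}$$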

becoming,

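$$z = u_0 + u_1 + \varepsilon \tag{6}$$

where $u_0 \sim N(0, \sigma_{u_0}^2 G_0 \otimes J_4)$, $u_1 \sim N(0, G_1)$ and $\varepsilon \sim N(0, I\sigma_{\varepsilon}^2)$, with J4 an all-ones square matrix of order 4.

G0 and G1, with respective dimensions n × n and (s·n) × (s·n), are defined as follows, with p the number of markers and XMS the n × p centred and standardised matrix of marker genotypes,

$$G_0 = \frac{X_{MS}X_{MS}'}{p}, \quad G_1 = \begin{bmatrix} \sigma_{u_1}^2 G_0 & 0 & 0 & 0 \\ 0 & \sigma_{u_2}^2 G_0 & 0 & 0 \\ 0 & 0 & \sigma_{u_3}^2 G_0 & 0 \\ 0 & 0 & 0 & \sigma_{u_4}^2 G_0 \end{bmatrix} \tag{7}$$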
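As an illustration of Eq. 7, the following pure-Python sketch computes G0 from a small dosage matrix and assembles the block structure of G1. It is not the sommer or BGLR implementation. The helper names, the toy matrix and the variance values are hypothetical, markers are scaled with the population standard deviation (an assumption), and the vectors are assumed to be stacked environment by environment, so that the blocks are laid out as `kron(J4, G0)` and `kron(diag, G0)`.

```python
from statistics import mean, pstdev


def standardise_columns(x):
    """Centre and scale each marker column (population SD; an assumption)."""
    scaled_cols = []
    for col in zip(*x):
        m, s = mean(col), pstdev(col)  # s is 0 for monomorphic markers: filter first
        scaled_cols.append([(v - m) / s for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def genomic_relationship(x):
    """G0 = X_MS X_MS' / p for an n x p dosage matrix."""
    xs = standardise_columns(x)
    p = len(xs[0])
    return [[sum(a * b for a, b in zip(ri, rj)) / p for rj in xs] for ri in xs]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[av * bv for av in arow for bv in brow] for arow in a for brow in b]


# Hypothetical 4 genotypes x 3 markers.
x_m = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
g0 = genomic_relationship(x_m)

# Main effect: every environment pair shares G0 (all-ones J4 pattern).
j4 = [[1] * 4 for _ in range(4)]
main_structure = kron(j4, g0)

# Interaction effect: block diagonal with one variance per environment.
sigma2 = [0.5, 1.0, 1.5, 2.0]  # hypothetical variance components
diag = [[sigma2[i] if i == j else 0.0 for j in range(4)] for i in range(4)]
g1 = kron(diag, g0)

print(len(g1), len(g1[0]))  # 16 16
```

With this scaling the diagonal of G0 averages exactly 1, which is a quick sanity check on the standardisation.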

The linear mixed effects model was fitted using the function mmer of the R-package sommer [45].

Bayesian GBLUP genomic model

For the Bayesian GBLUP approach the same model construction was used as for the linear mixed effects model described above. The model was evaluated using MxE GBLUP with RKHS as described in Lopez-Cruz et al. [12] with the R-package BGLR [46]. Default implemented hyperparameters were used for RKHS, with 12,000 iterations and a burn-in of 2000 iterations.

Random forest

The response vector (z) and the predictor data used in the Random Forest model are structured as follows in matrix notation for four environments,

$$z \sim \begin{bmatrix} J_{4,1} \otimes X_M & I_4 \otimes X_M \end{bmatrix} + \varepsilon \tag{8}$$

where z is the vector of envBLUPs (Eq. 5), XM is the n x p design matrix for the genetic markers; J4,1 an all-ones 4 × 1 column vector; I4 the identity matrix of order 4 and ε the residual error term.
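The predictor layout of Eq. 8 can be made concrete with a small pure-Python sketch (the function names and the toy matrix are hypothetical; in practice a sparse matrix would be used). The first p columns repeat XM for every environment and carry the main marker effects; the next 4·p columns hold XM only in the block of the matching environment and carry the MxE effects. With the 317 genotypes and 163,926 markers retained here, this layout would have 4 × 317 = 1,268 rows and 5 × 163,926 = 819,630 columns.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[av * bv for av in arow for bv in brow] for arow in a for brow in b]


def mxe_feature_matrix(x_m, n_env=4):
    """Column-bind (J_{n_env,1} kron X_M) and (I_{n_env} kron X_M)."""
    ones = [[1] for _ in range(n_env)]
    identity = [[1 if i == j else 0 for j in range(n_env)] for i in range(n_env)]
    main = kron(ones, x_m)             # (n_env * n) x p
    interaction = kron(identity, x_m)  # (n_env * n) x (n_env * p)
    return [m + i for m, i in zip(main, interaction)]


# Hypothetical 2 genotypes x 2 markers.
x_m = [[0, 2], [1, 0]]
features = mxe_feature_matrix(x_m)

print(len(features), len(features[0]))  # 8 10
print(features[2])  # genotype 1 in environment 2: [0, 2, 0, 0, 0, 2, 0, 0, 0, 0]
```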

Random Forest was trained and executed using the ranger package in R [47]. Hyperparameters were tuned by defining three inner folds on the training data for each (outer) fold as described by Montesinos-López et al. [48]. A Bayesian optimisation algorithm iteratively updating the prior response distribution was used to identify the best parameters. This tuning algorithm is implemented in the function bayesOpt from the package ParBayesianOptimization [49]. We used 10 points chosen with Latin hypercube sampling to initialise the process, and 10 optimisation iterations after process initialisation. A different hyperparameter set was tuned for each cross-validation method; the sets are listed in Table S1.

Extreme gradient boosting

Extreme gradient boosting (XGB) model training and prediction were done using the package xgboost [50] implemented in R. The structure of the response vector and feature matrix for the XGB model were the same as for Random Forest (Eq. 8). Also, hyperparameter tuning was done using the same inner-outer fold approach and Bayesian optimisation method [49]. We used 50 points chosen with Latin hypercube sampling to initialise the process, and 10 optimisation iterations after process initialisation. Once a set of hyperparameters was identified, the parameter nrounds was fine-tuned for each trait, by running the model five times on the inner folds and averaging the nrounds-parameter. The resulting hyperparameter sets are listed in Table S1. The model was run on the GPU (NVIDIA® RTX A5000) with a gradient based sampling method, while for other settings default values were used.

Split-component model

For all four genomic prediction methods, the main (bG) and interaction (bGE) components of the envBLUPs (Eq. 3) were also predicted separately. For the LMM and Bayesian GBLUP, Eq. 6 was split up in two parts,

$$b_{G} = u_0 + \varepsilon \tag{9}$$
$$b_{GE} = u_1 + \varepsilon \tag{10}$$

with u0 and u1 as defined in Eq. 5. For Random Forest and XGB the same was done,

$$b_{G} \sim X_M + \varepsilon \tag{11}$$
$$b_{GE} \sim I_4 \otimes X_M + \varepsilon \tag{12}$$

where bG is the coefficient vector of overall genotypic random effects and XM is the design matrix for the genetic markers; bGE is the vector of GxE interaction effects.

All model-specific settings described above were also applied to this split-component model, and for Random Forest and XGB, a separate set of hyperparameters was determined for the main model (Eq. 11) and for the interaction model (Eq. 12). The nrounds parameter was also fine-tuned for each trait in a similar way as for the full model. The hyperparameter sets are listed in Table S1.

Model comparison

The first measure of prediction quality, the slope, was calculated as the slope of the linear regression of the observed on the predicted values.

The relative root mean square error (rRMSE) for model performance was calculated as,

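$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(b_i - \hat{b}_i)^2} \tag{14}$$
$$\mathrm{rRMSE} = \frac{\mathrm{RMSE}}{\sigma_b} \tag{15}$$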

with n the number of observations considered, b and b^ the observed and predicted values respectively, and σb the standard deviation of the observed values vector.

Pearson’s product-moment correlation is commonly used as a measure of predictive ability [51] and was calculated as follows,

$$r = \frac{\mathrm{cov}(b, \hat{b})}{\sigma_b \sigma_{\hat{b}}} \tag{16}$$

where cov is the covariance and σ is the standard deviation of the vectors of the observed b and predicted b^ values.
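The three performance measures (slope, rRMSE in Eqs. 14 and 15, and predictive ability in Eq. 16) can be sketched in a few lines of Python. The function names and the toy vectors are hypothetical, and population standard deviations are used throughout, which is an assumption; the article does not state which estimator was used.

```python
from math import sqrt
from statistics import mean, pstdev


def regression_slope(observed, predicted):
    """Slope of the least-squares regression of observed on predicted values."""
    mo, mp = mean(observed), mean(predicted)
    sxy = sum((p - mp) * (o - mo) for o, p in zip(observed, predicted))
    sxx = sum((p - mp) ** 2 for p in predicted)
    return sxy / sxx


def rrmse(observed, predicted):
    """RMSE divided by the standard deviation of the observed values."""
    mse = mean([(o - p) ** 2 for o, p in zip(observed, predicted)])
    return sqrt(mse) / pstdev(observed)


def predictive_ability(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    mo, mp = mean(observed), mean(predicted)
    cov = mean([(o - mo) * (p - mp) for o, p in zip(observed, predicted)])
    return cov / (pstdev(observed) * pstdev(predicted))


# Hypothetical example: predictions are perfectly correlated but twice too large.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 4.0, 6.0, 8.0]
print(regression_slope(obs, pred))    # 0.5
print(predictive_ability(obs, pred))
print(rrmse(obs, pred))
```

The example shows why the three measures are reported together: a correlation close to 1 can coexist with a slope far from 1 and a large rRMSE when predictions are badly scaled.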

Cross-validation methods

In order to compare the predictive ability of the models tested, two different cross-validation (CV) strategies were applied, each using 10 cross-validations: k-fold CV and random CV with environments [12]. In all cases 20% of the data was used as test set and 80% as training set, and this was repeated ten times. For random CV with environments, genotypes assigned to the test set in one environment were kept in the training set in the other environments (Figure S3A). This was done in such a way that all environments were represented in the test set for as many genotypes as possible. For k-fold CV, genotypes were included in the test set for all environments (Figure S3B).

From a breeder’s perspective, the k-fold CV method is useful to predict performance of non-phenotyped genotypes in a known environment, whereas random CV with environments is useful to predict the performance of genotypes phenotyped in some, but not all known environments.
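The difference between the two CV schemes can be illustrated with a single train/test split of (genotype, environment) records. This Python sketch is an assumption-laden illustration, not the authors' fold assignment: the function names, the seed handling and the way environments are cycled over test genotypes are all hypothetical.

```python
import random


def kfold_split(genotypes, environments, test_fraction=0.2, seed=0):
    """k-fold CV style: test genotypes are held out in every environment."""
    rng = random.Random(seed)
    n_test = round(test_fraction * len(genotypes))
    held_out = set(rng.sample(genotypes, n_test))
    test = {(g, e) for g in held_out for e in environments}
    train = {(g, e) for g in genotypes for e in environments} - test
    return train, test


def random_env_split(genotypes, environments, test_fraction=0.2, seed=0):
    """Random CV with environments: a test genotype is held out in one
    environment only and stays in the training set elsewhere."""
    rng = random.Random(seed)
    n_test = round(test_fraction * len(genotypes) * len(environments))
    if n_test > len(genotypes):
        raise ValueError("too few genotypes for one test environment each")
    shuffled = rng.sample(genotypes, len(genotypes))
    # Cycle through environments so each is represented about equally often.
    test = {(g, environments[i % len(environments)])
            for i, g in enumerate(shuffled[:n_test])}
    train = {(g, e) for g in genotypes for e in environments} - test
    return train, test


genos = [f"g{i}" for i in range(10)]
envs = ["BE2018", "BE2019", "SRB2018", "SRB2019"]
train_kf, test_kf = kfold_split(genos, envs)
train_re, test_re = random_env_split(genos, envs)
print(len(train_kf), len(test_kf))  # 32 8
print(len(train_re), len(test_re))  # 32 8
```

Both splits hold out 20% of the records, but only in the second one does every test genotype still contribute three phenotyped environments to the training set.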

Random forest VI for ML-GWAS

The variable importances (VIs) of the predictor variables for the Random Forest method were estimated using the permutation method implemented in the ranger package [47]. This was only done for the k-fold CV method, and not for the random CV with environments. For each fold, all genotypes in the training set were randomly assigned to two subsets of equal size. Each of the two subsets was then used once for creating the forest and once for deriving importance scores. Next, a hold-out VI for the main effect, $VI_{k0}^{HO}$, and for the interaction effects, $VI_{kj}^{HO}$, of predictor variable $x_k$ in environment j was calculated for each fold according to [34],

$$VI_{k0}^{HO} = \frac{1}{2}\sum_{l=1}^{2} VI_{k0}^{l} \tag{17}$$
$$VI_{kj}^{HO} = \frac{1}{2}\sum_{l=1}^{2} VI_{kj}^{l} \tag{18}$$

To identify important markers across the entire dataset, a mean VI per marker was calculated using all ten CV folds. However, for subsequent use of the markers in genomic prediction with the small random forest model, it was essential to preserve the test set. In this case, hold-out VIs obtained from the training set were used without modification in the Random Forest model.

The minimal $VI_k^{HO}$ was then used to calculate a relative importance score, or f-factor [35],

$$f = \frac{VI_k^{HO}}{\left|\min(VI_k^{HO})\right|}$$

In this approach, f = 1 corresponds to the smallest P-value that can be calculated with the number of VIs in the null distribution estimated with the testing approach of Janitza et al. [34]. This null distribution is approximated based on the observed non-positive importance scores, where two sets are defined,

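$$M_1 = \{VI_k^{HO} \mid VI_k^{HO} < 0;\ k = 1, \ldots, p\} \ \text{(all negative importance scores)}, \quad M_2 = \{VI_k^{HO} \mid VI_k^{HO} = 0;\ k = 1, \ldots, p\} \ \text{(all zero importance scores)}$$

and the empirical cumulative null-distribution function $\hat{F}_0$ is computed on $M_1 \cup M_2 \cup (-M_1)$, where $-M_1$ contains the negative scores mirrored around zero. The f-factor defined important markers on three stringency levels (f = 1, 3 and 5).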
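The hold-out importance, the f-factor and the mirrored null distribution can be sketched together in Python. This is a minimal illustration and not the ranger implementation: the function names and the toy scores are hypothetical, and the P-value is computed from the number of null values strictly smaller than the observed score, an assumption chosen so that a marker with f = 1 receives the smallest non-zero P-value, 1 divided by the size of the null distribution.

```python
from bisect import bisect_left


def hold_out_importance(vi_first_half, vi_second_half):
    """Average the importance scores obtained with the two data halves."""
    return [(a + b) / 2 for a, b in zip(vi_first_half, vi_second_half)]


def relative_importance(vi):
    """f-factor: importance divided by the absolute minimal importance."""
    smallest = min(vi)
    if smallest >= 0:
        raise ValueError("no negative importance score to scale by")
    return [v / abs(smallest) for v in vi]


def null_p_values(vi):
    """P-values from the null built on negative, zero and mirrored scores."""
    negatives = [v for v in vi if v < 0]
    zeros = [v for v in vi if v == 0]
    null = sorted(negatives + zeros + [-v for v in negatives])
    # Assumption: count null values strictly smaller than the observed score.
    return [1 - bisect_left(null, v) / len(null) for v in vi]


# Hypothetical importance scores for five markers from the two halves.
vi = hold_out_importance([-3.0, 0.0, 0.5, 1.0, 5.0], [-1.0, 0.0, 1.5, 3.0, 7.0])
print(vi)                       # [-2.0, 0.0, 1.0, 2.0, 6.0]
print(relative_importance(vi))  # [-1.0, 0.0, 0.5, 1.0, 3.0]
print(null_p_values(vi))
```

In the toy example the null distribution holds three values, so the marker with f = 1 gets a P-value of 1/3, while the marker with f = 3 lies beyond the null distribution and can only be ranked by its f-factor.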

For each marker, the correlations with the more important markers were calculated based on the values in the marker matrix. Markers were discarded if they showed a high correlation (|r| > 0.5) with a more important marker [31]. For the interaction MxE effects, markers were assigned to one or more environments. After elimination of correlated markers, an uncorrelated marker was assigned to an environment if f > 3 in that environment. That marker was also assigned to another environment if correlated markers with f > 2 were detected in that other environment (threshold used by Canella Vieira et al. [31]). In that way, genetic regions which were important in more than one environment were identified.
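The correlation filter can be sketched in Python as follows. The function names and the toy data are hypothetical. The sketch takes the description literally and compares each marker with every more important marker, including markers that were themselves discarded; a greedy variant that compares only with retained markers is also conceivable and would keep more markers.

```python
from statistics import mean, pstdev


def pearson(a, b):
    """Pearson correlation of two equally long dosage vectors."""
    ma, mb = mean(a), mean(b)
    cov = mean([(x - ma) * (y - mb) for x, y in zip(a, b)])
    return cov / (pstdev(a) * pstdev(b))


def prune_correlated(markers, importance, max_abs_r=0.5):
    """Keep a marker only if |r| <= max_abs_r with all more important markers.

    markers: dict marker name -> list of dosages.
    importance: dict marker name -> importance score.
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    kept = []
    for i, name in enumerate(ranked):
        more_important = ranked[:i]
        if all(abs(pearson(markers[name], markers[other])) <= max_abs_r
               for other in more_important):
            kept.append(name)
    return kept


# Hypothetical dosages for five genotypes; b is strongly correlated with a.
toy_markers = {
    "a": [0, 1, 2, 0, 1],
    "b": [0, 1, 2, 0, 2],
    "c": [0, 2, 1, 1, 1],
}
toy_importance = {"a": 5.0, "b": 4.0, "c": 3.0}
print(prune_correlated(toy_markers, toy_importance))  # ['a', 'c']
```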

For creating a ‘small’ Random Forest model, the uncorrelated markers with high VI (Eq. 10) determined in the ML-GWAS were incrementally added as model features, starting with the markers with the highest VI. This process was repeated for each k-fold CV, so that for each trait ten separate models were trained on the training set of that cross-validation fold. Hence, the test set was not used for determining marker VIs, nor was information from one fold transferred to another.

The GWAS performed for method comparison on the main component used the multi-locus mixed-model approach [52]. The extended Bayesian Information Criterion threshold with the default lambda penalty was used to control for the false-discovery rate.

All data analyses were done with the statistical programming language R [53].

Results

Environment-specific BLUPs

The envBLUPs for the traits P and R8 showed a higher correlation between environments than those for Y. Notably, correlation for Y between sites in Serbia and Belgium was lower than the correlation within a site for different years (Figure S1). Examining the GxE interaction effect revealed a strong negative correlation between Serbia and Belgium for all three traits. This indicates that genotypes that performed well in Belgium performed relatively poorly in Serbia and vice versa. In most cases, the between-year correlation at one location was positive or close to zero, although significant negative correlations were found for P between BE2018 and BE2019 (− 0.27; P < 0.001) and for R8 between SRB2018 and SRB2019 (− 0.15; P < 0.001).

For the three traits, the proportion of variation explained by bG and bGE out of the total variation present in the observations was extracted from the mixed-effects models in Eqs. 1 and 2 and is shown in Table 2. For P and R8, most variation was explained by the main genetic effect bG, and less by the GxE interaction effect bGE, while for Y, most variation was explained by the GxE interaction effect. For all three traits, the proportion of variation explained by the envBLUP was around or slightly more than one third of the total variation.

Table 2.

Proportion of total variance explained by main genetic (bG), interaction genotype-by-environment (bGE) effects and calculated environment-specific BLUP (envBLUP) out of total variance in observations for the traits maturity (R8), seed yield (Y) and protein content (P)

| Proportion of variance explained | bG (%) | bGE (%) | envBLUP (bG + bGE; %) |
|---|---|---|---|
| Maturity (R8) | 30.3 | 7.6 | 37.9 |
| Yield (Y) | 12.6 | 24.4 | 37.0 |
| Protein (P) | 25.6 | 7.7 | 33.3 |

Genomic prediction model comparison

The performance of four ‘full’ genomic prediction models, predicting the envBLUPs, combining the main and interaction components, (Fig. 1) was compared using two different CV strategies. For k-fold CV, there were only small differences between models for predictive ability and rRMSE, and the small significant differences between models were mostly inconsistent across traits. However, the LMM consistently was the worst-performing model in terms of rRMSE and regression slope, which was furthest from 1. This was in contrast to the Bayesian GBLUP slope, which was closest to 1 (best prediction). For random CV with environments, broadly the same trends as for k-fold CV were observed. Here, LMM also performed significantly worse for all measures across all traits, now with large, consistent differences observed. The differences between the other three models were small for predictive ability and rRMSE, and for regression slope the Bayesian GBLUP was consistently among the best.

Fig. 1.

Comparison of genomic prediction models: GBLUP linear mixed effects model (LMM), Bayesian GBLUP (Bayes), Random Forest (RF) and extreme gradient boosting (XGB) on k-fold CV (kF) and random CV with environments (RE). Slope, relative root mean square error (rRMSE) and predictive ability are shown as performance indicators. The traits maturity (R8), seed yield (Y) and protein content (P) are predicted. Different letters within a model-, trait-, and CV group indicate significantly different model performance on 10 CV runs

A clear and large difference was observed between the two CV strategies. For all models except LMM, the predictive ability was lower for k-fold CV than for random CV with environments, the rRMSE was higher for k-fold CV than for random CV with environments, and the slope was closer to 1 for random CV with environments than for k-fold CV. This difference, although still significant, was smaller for seed yield than for protein content and maturity.

Full model vs split-component model

Figure 2 compares the predictive ability of the full model, where the main and interaction components were predicted as one component, the envBLUP, and of the split-component model, where the main and interaction components were predicted separately. Also the predictive ability of our Random Forest model using only the top 50 important uncorrelated markers is indicated. The prediction results obtained from the full model with k-fold CV were split into their main and interaction components, and both were compared, as well as the envBLUPs. The results show that there is no statistically significant difference for the main component between the full and split model approach for any of the genomic prediction models tested. However, significant differences were identified in the interaction component between the full- and the split-component model for the three investigated traits. For seed yield, the split-component model showed statistically significant improvements compared to the full model for all models tested, except LMM. For protein content, LMM, Random Forest and XGB showed higher predictive ability. For maturity, XGB performed better when the split-component model approach was used. All models performed equally well for envBLUP prediction when the split model approach was used, and no difference was found for any trait between the split and full models.

Fig. 2.

Comparison of the predictive ability for the full model and split-component model approach, showing the main and interaction components and envBLUP. Only k-fold CV is assessed, for the three traits maturity (R8), seed yield (Y) and protein content (P). The dashed vertical line indicates the predictive ability of the Random Forest (RF) model with only 50 uncorrelated markers with highest VI included. Asterisks indicate significance level and error bars the standard error

Overall, the predictive ability of the envBLUP was the highest for R8 (P < 0.001), and the lowest for seed yield (P < 0.001). For all traits and models, the predictive ability of the interaction component was much lower than that of the main component (P < 0.001). XGB was the model that performed poorest in terms of predicting interaction components in the full model approach (P < 0.001), although XGB model performance for the interaction components in the split-component approach significantly increased over the full model approach.

Machine learning-GWAS

Figures 3 and 4 illustrate the mean of the marker variable importances (VIs) over all CV folds for the main (bG) and interaction (bGE) components, respectively, estimated with the Random Forest model. For R8, Y and P, the total number of features in the null distribution is 148,564, 151,222 and 151,242, respectively, corresponding to a minimum computable P-value of 6.7 × 10⁻⁶ (f > 1) for R8 and 6.6 × 10⁻⁶ for Y and P. More stringent f-factor thresholds of 3 and 5, which could be a good trade-off between sensitivity and power [35], are also indicated in the figures. As shown in Figs. 3 and 4, there were several regions on the genome where numerous markers have importance scores above the thresholds. After filtering out the markers that were correlated with a marker with a higher VI, most of the markers in such a genomic region were discarded. When only considering uncorrelated markers with f > 3 in the main genetic effects (Fig. 3), 22 markers were detected for R8, 15 for Y and 9 for P. All important (f > 1) uncorrelated markers from the main component were highlighted on a Manhattan plot showing the results from a traditional GWAS for all traits (Figure S4).
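The minimum computable P-values quoted above are consistent with the reciprocal of the null-distribution size, i.e. the smallest non-zero step an empirical P-value can take with N null features. The short check below reproduces that arithmetic; the function name is illustrative and not taken from the original analysis code.

```python
def min_computable_p(n_null_features: int) -> float:
    """Smallest non-zero empirical P-value resolvable with n null features."""
    if n_null_features <= 0:
        raise ValueError("the null distribution must contain at least one feature")
    return 1.0 / n_null_features


# Null-distribution sizes as reported in the text.
null_sizes = {"R8": 148_564, "Y": 151_222, "P": 151_242}
min_p = {trait: min_computable_p(n) for trait, n in null_sizes.items()}

for trait, p in min_p.items():
    print(f"{trait}: {p:.1e}")
# R8: 6.7e-06
# Y: 6.6e-06
# P: 6.6e-06
```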

Fig. 3.

Variable importance (VI) of Random Forest predictor variables for main effects of the traits maturity (R8), seed yield (Y) and protein content (P). Chromosome numbers are indicated, and markers with f > 1 are indicated in red, or pink if correlated with a more important marker
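The removal of markers that are correlated with a more important marker can be pictured as a greedy pass over the markers in descending order of VI. The sketch below is one possible reading of that step, not the authors' implementation: the function names, the toy genotype data and the cut-off of |r| > 0.5 (borrowed here from the Fig. 5 caption) are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def filter_correlated(genotypes, vi, r_max=0.5):
    """Keep markers in descending VI order, dropping any marker whose
    |r| with an already kept (more important) marker exceeds r_max."""
    kept = []
    for marker in sorted(vi, key=vi.get, reverse=True):
        if all(abs(pearson(genotypes[marker], genotypes[k])) <= r_max for k in kept):
            kept.append(marker)
    return kept


# Hypothetical allele dosages (0/1/2) for six genotypes and made-up VIs.
genotypes = {
    "m1": [0, 1, 2, 2, 0, 1],
    "m2": [0, 1, 2, 2, 0, 2],  # nearly identical to m1
    "m3": [1, 0, 1, 2, 1, 1],
}
vi = {"m1": 0.9, "m2": 0.8, "m3": 0.5}

kept = filter_correlated(genotypes, vi)
print(kept)  # ['m1', 'm3']
```

In this toy example m2 is discarded because it is strongly correlated with the more important m1, which mirrors how most markers in a high-VI genomic region drop out after filtering.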

Fig. 4.

Variable importance (VI) of Random Forest predictor variables for interaction effects of the traits maturity (R8), seed yield (Y) and protein content (P) per environment. Chromosome numbers are indicated, and markers with f > 1 are indicated in red, or pink if correlated with a more important marker in any environment

When examining the GxE interaction effects, markers were assigned to an environment if they were found to be significant in that environment. Consequently, a marker or a marker strongly correlated with a significant marker could be assigned to one or more environments. Figure 5 shows a Venn diagram illustrating the number of markers assigned to the four environments. Across all environments, 225, 60 and 93 markers were identified as being significant for R8, P and Y, respectively. Decreasing the f-factor threshold substantially increased the number of significant markers identified, as well as the number of markers that could be assigned to multiple environments (see Fig. 4).

Fig. 5.

Venn diagram depicting the percentage of important markers (f > 3) having correlated markers (|r|> 0.5 with f > 2) between environments, for the prediction of MxE interactions for the traits maturity (R8), seed yield (Y) and protein content (P)

ML-GWAS based selection of genomic prediction features

When the uncorrelated markers with high VI determined in the ML-GWAS were incrementally added as Random Forest model features, starting with the markers with the highest VI, the predictive ability in both the test and training sets increased with the number of features for the main effect models (Fig. 6). The predictive ability of the test set stabilised once around 50 markers were included, at a level that was not significantly different from that of the large model for all three traits (paired t-test; P > 0.05). The predictive ability of the training set also increased as more markers were included. However, with 50 markers added, the training-set predictive ability of the small model was still increasing and had not yet reached that of the large model. This was true for all traits and for both the main and GxE interaction models. Incremental inclusion of markers revealed that for all three traits the test-set plateau was not reached at f = 2, and for P it was not reached even at f = 1.

Fig. 6.

Predictive ability of Random Forest genomic prediction models for the main components (bG) as a function of the increasing number of important markers included in the model. A k-fold CV and split-component approach was used to predict the traits maturity (R8), seed yield (Y) and protein content (P). The solid horizontal lines represent the predictive ability of the GP models with all markers (> 150,000) included, in the test and training set, and the dashed vertical lines the f-factor thresholds defining the mean number of uncorrelated markers across all CV folds. The shaded area shows the standard deviation on 10 folds for the test and training set

Similar dynamics can be observed for the interaction effect models when the process was repeated (Fig. 7). Although the maximum predictive ability achieved by the interaction effect models differed between environments for the three traits, and even the ranking of the traits by predictive ability differed, the small models systematically showed a predictive ability similar to that of the large models when the 50 uncorrelated markers with the highest VI were included as model features (paired t-test; P > 0.05). Furthermore, as for the main effect models, the average training-set predictive ability of the small model did not reach that of the large model when only 50 markers were included. An examination of the markers exceeding the f-factor thresholds for the interaction effect models showed that, in some cases, the stringent f-factor threshold of 3 already included a sufficient number of markers for the test-set plateau to be reached, e.g., for all traits in BE2018. In other cases, the test-set plateau was reached at an f-factor of 2 or lower.

Fig. 7.

Predictive ability of Random Forest genomic prediction models for the interaction components (bGE) as a function of the increasing number of important markers included in the model. A k-fold CV and split-component approach was used to predict the traits maturity (R8), seed yield (Y) and protein content (P). The solid horizontal lines represent the predictive ability of the GP models with all markers (> 150,000 per environment) included, in the test and training set, and the dashed vertical lines the f-factor thresholds defining the mean number of uncorrelated markers across all CV folds. The shaded area shows the standard deviation on 10 folds for the test and training set

As shown in Figs. 6 and 7, the predictive ability of the test set in both the main and interaction effect models had reached a plateau before 50 markers and did not start to decrease again before that threshold. Therefore, a model using 50 uncorrelated markers with a high VI was compared to the other split-component models. In line with its similarity to the large Random Forest split-component models, the predictive ability of the small genomic prediction model did not differ significantly from that of the best performing model for the main and interaction components or for the full envBLUP prediction (Fig. 2; P > 0.05). All P-values for the paired t-test comparisons between the large Random Forest model using all markers and the small Random Forest model using only the top 50 most important, uncorrelated markers are given in Table S2.
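The paired t-test comparison of a large and a small model over the same CV runs can be illustrated with a few lines of standard-library Python. This is only a sketch: the predictive abilities below are made-up numbers, the function name is hypothetical, and the decision uses the tabulated two-sided 5% critical value of Student's t for 9 degrees of freedom (2.262) rather than an exact P-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b.
    Note: stdev is zero if all paired differences are identical."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Two-sided 5% critical value of Student's t with 9 degrees of freedom.
T_CRIT_DF9 = 2.262

# Hypothetical test-set predictive abilities over 10 CV runs (not real results).
large = [0.61, 0.58, 0.63, 0.60, 0.59, 0.62, 0.57, 0.64, 0.60, 0.61]
small = [0.60, 0.59, 0.62, 0.61, 0.58, 0.63, 0.56, 0.63, 0.61, 0.60]

t, df = paired_t_statistic(large, small)
significant = abs(t) > T_CRIT_DF9
print(f"t = {t:.2f}, df = {df}, significant at 5%: {significant}")
```

Pairing the runs matters because both models are evaluated on the same folds, so fold-to-fold variation cancels out in the differences.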

Discussion

Genomic prediction

Our results showed similar predictive ability for the full ML models and the models based on the genomic relationship matrix (the GBLUP methods) in the k-fold CV approach. However, when calculating the genomic relationship matrix, only additive marker effects are considered, and these effects are assumed to be linear and normally distributed [13]. This means that dominance and epistatic effects are ignored in these genomic prediction models. Also, high-effect markers that may be present in some genotypes but not in others are shrunk back to fit the normal distribution, meaning their effect is not sufficiently taken into account for prediction. In that sense, it is surprising that ML methods do not outperform GBLUP methods. Nevertheless, the lack of superiority of ML or deep learning models (i.e. models using neural networks with many layers) over GBLUP models has been widely documented in other studies [9–11, 48, 54, 55]. Several techniques have been proposed to improve ‘naive’ ML models, including intelligent feature selection methods [56, 57], hyperparameter tuning algorithms [57–59], and complex model designs [58, 60]. The results of these approaches are mixed, and none of the methods performed best in all cases, even in the manuscripts published by the developers themselves. One possible explanation for why the deep learning models underperform is provided by Ubbens et al. [61], who state that these models are prone to ‘shortcut learning’ during training, i.e., focussing on high-performing individuals who share large haploblocks rather than seeking out small and noisy single-marker effects. Consequently, they would consider high-level genetic relatedness to be a viable shortcut for predicting the phenotype, as the GBLUP methods do, thereby also ignoring epistatic and dominance effects. 
While this has not been documented for ‘simpler’ machine learning models such as Random Forest and XGB, the ensembles of trees they construct are far too complex to analyse manually in order to understand the models' decisions. Therefore, a proxy analysis could be executed for the ML models, as Ubbens et al. [61] did, to determine the extent to which marker interaction effects are taken into account for prediction. Although model structures of Random Forest and XGB are too complex to analyse, unlike deep learning models, Random Forest provides straightforward access to metrics that indicate the importance of features in the final prediction [62, 63], thereby offering insight into the model's dynamics. Another plausible explanation for the similar performance of the ML models compared to the GBLUP models is the small number of observations (n) compared to the large number of parameters (p), a design which is generally present in genomic prediction studies. The large number of parameters allows for a high degree of flexibility in fitting the lower number of observations in the training set, thereby capturing dynamics that cannot be generalised to the test set.

Several studies have shown that including GxE or MxE effects in a genomic prediction model can improve predictive ability compared to models that only take main effects into account, or models that estimate a marker’s or genotype’s effect across environments [12, 27–30]. However, by taking it one step further and extracting the main genetic (bG) and interaction GxE (bGE) effects from the observations, and predicting each component separately rather than using one envBLUP per soybean genotype per environment, predictive ability increased substantially for most interaction terms (Fig. 2). There are two possible mechanisms that can explain the increase in predictive ability for the split-component model. First, tuning hyperparameters separately on the main and interaction components allows for an optimal ML model structure in both cases. As the interaction component of the marker matrix is four times larger than the main component, and the information is sparser, it may benefit from a specifically tuned hyperparameter set [64]. XGB is especially sensitive to uncalibrated hyperparameters [21, 65], which could explain the poor performance of the full model for the interaction component (Fig. 2). Second, separate modelling of main and interaction components adds extra mechanistic information to the genomic prediction models, which otherwise gets lost when summing both components into one envBLUP. This could explain why the GBLUP models, as well as the ML models, have higher predictive ability for the interaction components and for the envBLUP in R8 in most cases. From a breeder’s perspective, accurate genomic predictions for main genetic effects (bG) enable the selection of varieties with high inherent genetic potential and that are more stable across environments. Meanwhile, predictions for G × E interactions (bGE) provide insights into how these varieties are likely to perform in different environments. 
This distinction facilitates targeted breeding strategies, optimising both genetic gain and adaptation to specific environmental conditions.
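The relation between the three quantities discussed above is additive: summing the main and interaction components gives the envBLUP. The sketch below illustrates that relation with a deliberately simple decomposition in which bG is taken as the mean envBLUP of a genotype across environments and bGE as the deviation from it. This averaging rule is an assumption made for illustration only; the way the components were actually extracted in the study is described in its methods and may differ. The environment label "ENV_2" and all values are hypothetical.

```python
from statistics import mean


def split_components(env_blup):
    """Split envBLUPs into a main part (bG) and an interaction part (bGE).

    env_blup maps genotype -> {environment: envBLUP}.
    Assumption: bG is the across-environment mean, bGE the deviation from it.
    """
    b_g = {g: mean(by_env.values()) for g, by_env in env_blup.items()}
    b_ge = {
        g: {env: value - b_g[g] for env, value in by_env.items()}
        for g, by_env in env_blup.items()
    }
    return b_g, b_ge


# Hypothetical envBLUPs for two genotypes in two environments.
env_blup = {
    "G1": {"BE2018": 2.0, "ENV_2": 4.0},
    "G2": {"BE2018": 1.0, "ENV_2": 1.0},
}

b_g, b_ge = split_components(env_blup)

# Summing both components recovers the envBLUP.
recombined = {
    g: {env: b_g[g] + b_ge[g][env] for env in by_env}
    for g, by_env in env_blup.items()
}
print(b_g)         # {'G1': 3.0, 'G2': 1.0}
print(b_ge["G1"])  # {'BE2018': -1.0, 'ENV_2': 1.0}
```

Here G2 is perfectly stable (all interaction terms are zero), whereas G1 has the higher main effect but performs differently in the two environments, which is the distinction a breeder would exploit.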

When comparing the two CV strategies, the genomic prediction models have higher predictive ability in the random CV with environments than in the k-fold cross validation, except for the LMM (k-fold CV; Fig. 1). This makes sense because the random CV with environments approach allows the model to borrow information about the genetic effect of a genotype from its phenotyping in another environment. The difference in predictive ability was particularly large for P and R8 and smaller for Y, as reflected in the lower variance explained by the main genetic effect (bG) for Y than for P and R8 (Table 2). This means that less genetic information could be borrowed from the phenotype of the same genotype in another environment.

From a breeder's perspective, the random CV with environments approach reflects scenarios where genotypes are tested in multiple environments, mimicking real-world breeding programmes where some data on genotypes is already available. From a modeller's perspective, however, the random CV with environments approach is less suitable for evaluating the ability of the model to predict entirely new genotypes, as some genotype information is shared between the training and test sets. Here, the k-fold CV approach can be useful in understanding the model’s ability to generalise to new genotypes, which is crucial for breeding programmes that aim to select new genotypes. While more extensive resampling of the folds, or numerous repetitions of each of the CV strategies, could further enhance model stability, we consider the current approach a reasonable balance between computational feasibility and reliable estimation of predictive ability.
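The practical difference between the two CV strategies lies in how records are assigned to folds. The sketch below reflects one reading of the description above, under the assumption that k-fold CV keeps all records of a genotype in the same fold, whereas random CV with environments assigns each genotype-environment record to a fold independently. Function names, genotype labels and environment labels are hypothetical.

```python
import random


def kfold_by_genotype(records, k, seed=0):
    """Assign folds so that every record of a genotype lands in the same fold.
    records is a list of (genotype, environment) tuples."""
    genotypes = sorted({g for g, _ in records})
    rng = random.Random(seed)
    rng.shuffle(genotypes)
    fold_of = {g: i % k for i, g in enumerate(genotypes)}
    return [fold_of[g] for g, _ in records]


def random_cv_with_environments(records, k, seed=0):
    """Assign each genotype-environment record to a fold at random, so a
    genotype held out in one environment may be in training in another."""
    order = list(range(len(records)))
    rng = random.Random(seed)
    rng.shuffle(order)
    folds = [0] * len(records)
    for position, record_index in enumerate(order):
        folds[record_index] = position % k
    return folds


# Six hypothetical genotypes phenotyped in four hypothetical environments.
records = [(f"G{i}", env) for i in range(6) for env in ("E1", "E2", "E3", "E4")]

kf = kfold_by_genotype(records, k=3)
rc = random_cv_with_environments(records, k=3)

# Number of distinct folds each genotype appears in under both strategies.
folds_per_genotype_kf = {
    g: len({f for (gg, _), f in zip(records, kf) if gg == g}) for g, _ in records
}
folds_per_genotype_rc = {
    g: len({f for (gg, _), f in zip(records, rc) if gg == g}) for g, _ in records
}
print(folds_per_genotype_kf)
print(folds_per_genotype_rc)
```

Under the first scheme the test genotypes are entirely unseen during training; under the second, information on a test genotype can be borrowed from its records in other environments.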

ML-GWAS

Although some GWAS approaches include adaptations to account for dominance and epistatic effects [66], traditional GWAS methods generally identify markers associated with a trait using linear or logistic regression analysis. This analysis is performed separately for each marker, meaning it is unable to account for interaction effects between markers. Another disadvantage of traditional GWAS is that it is susceptible to overfitting, resulting in the identification of spurious associations between markers and phenotypic traits [31, 67]. The ML-GWAS method described here uses Random Forest and the assignment of an f-score to address the problem of overfitting in several ways. First, in the Random Forest model, the trees of a forest are constructed with access to only half of the training set, which is randomly sampled from the full training set [18]. Next, the permutation importance of a marker is defined on the other half of the training set, and not on the out-of-bag samples. Thus, none of the trees in the forest are constructed using the samples that are used to define the VI of the markers. The resulting hold-out VI is shown to have a symmetric null distribution, in contrast to classical variable importance, which is calculated using the same set on which the trees were constructed. Therefore, this approach is expected to control the type I error exactly [34]. Our precautions ensure that overfitting is overcome as much as possible for the identification of important markers in the analysed set of genotypes. Although a detailed comparison of the power of traditional GWAS methods and our ML-GWAS is beyond the scope of this manuscript, there appears to be some overlap between the two methods that could serve as a basis for comparison (Figure S4).
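Because the hold-out VI has a symmetric null distribution, the non-positive importances can be used to approximate that distribution. The sketch below shows one common construction, in which the negative VIs are mirrored around zero and combined with the zero VIs; it is given as an assumption-laden illustration and is not confirmed to be the exact procedure used in this study. Function names and VI values are made up.

```python
def mirrored_null(vi_values):
    """Build a null distribution from the negative VIs, the zero VIs and
    the mirror images of the negative VIs (symmetric around zero)."""
    negative = [v for v in vi_values if v < 0]
    zero = [v for v in vi_values if v == 0]
    return negative + zero + [-v for v in negative]


def empirical_p(vi, null):
    """Share of null importances that are at least as large as vi."""
    return sum(1 for x in null if x >= vi) / len(null)


# Hypothetical hold-out VIs for six markers.
vi_values = [-0.02, -0.01, 0.0, 0.0, 0.005, 0.03]

null = mirrored_null(vi_values)
p_weak = empirical_p(0.005, null)
p_strong = empirical_p(0.03, null)
print(len(null), p_weak, p_strong)
```

A marker whose VI exceeds every value in the null distribution receives an empirical P-value of zero in this sketch; in practice it would be reported as below the smallest non-zero value the null distribution can resolve.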

In this paper, we extracted the genotype main effect (bG) and the genotype-by-environment interaction effect (bGE) from the phenotypic observations, to fit a genomic prediction model to both components separately. To the best of our knowledge, this is the first time that genomic prediction models have been used to explicitly predict bG and bGE effects, and to find markers that are important in one environment or that have correlated important markers in multiple environments. The split-component model shows higher predictive ability for some components (Fig. 2), has different prediction results for main and GxE interaction effects, and calculates a different set of VIs for the markers. Especially for Y, where 24.4% of the variance observed in the phenotypes is explained by bGE compared to only 12.6% by bG (Table 2), the ability to locate markers specifically important for certain environments is very useful for targeted breeding efforts. Although the number of environments in our dataset is too small to make assumptions about which markers would be overall important to target in Belgium or which in Serbia, the technique used seems promising. If trial data from more environments were available, marker associations with reduced susceptibility to unfavourable weather patterns, such as drought during flowering or low temperatures during the maturing phase, could also be identified using a split-component model. This also applies to dedicated stress experiments, such as drought experiments [68, 69].

When the markers with the highest VI and low mutual correlation were incrementally included in the genomic prediction models (Figs. 6 and 7), several dynamics were observed. First, the predictive ability in the training set increased as more markers were included, indicating a more complex model that better fits the data. This increase in predictive ability was also seen in the test set, suggesting improved model generalisation with more markers added. While the test-set predictive ability stabilised at a level similar to that of the large model at around 50 markers, the training-set predictive ability had still not reached that of the large model. This shows that although the large model is more flexible, it may be overfitting to the training data and capturing noise that does not help in predicting the test set.

The incremental inclusion of markers makes it possible to evaluate the effectiveness of the f-factor threshold used to identify markers that contribute to higher predictive ability, and thus could be linked to a locus related to a phenotypic trait. From Figs. 6 and 7, it is clear that the f-factors of 3 and 5, which were found to be a good compromise between sensitivity and power [35], are too stringent for defining which markers to include in a genomic prediction model. For the main effect model, an f-factor of 2 or even 1 might be a more suitable option. For the interaction model, on the other hand, the f-factor of 3 is mostly on the plateau where the model does not improve by adding more markers, but may be too tolerant for the selection of truly significant markers in some environments and for some traits. These findings suggest that the f-factor based only on the minimum value of all VIs may not be universally applicable and that there is potential for refinement to enhance its efficacy. In our dataset, however, selecting around 50 markers for genomic prediction results in a model whose predictive ability is not significantly lower than that of one that includes over 150,000 markers.
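To make the role of the threshold concrete, the sketch below assumes that the f-factor of a marker is its VI divided by the absolute value of the most negative VI, so that f > 1 selects markers whose importance exceeds the largest negative importance. This definition is inferred from the statement that the f-factor is based on the minimum value of all VIs; it is an assumption, as are the function names and the VI values.

```python
def f_factors(vi):
    """Express each VI as a multiple of |min(VI)| (assumed f-factor definition)."""
    lowest = min(vi.values())
    if lowest >= 0:
        raise ValueError("at least one negative VI is needed to set the scale")
    return {marker: value / abs(lowest) for marker, value in vi.items()}


def select_markers(vi, f_threshold):
    """Markers whose f-factor exceeds the threshold, most important first."""
    f = f_factors(vi)
    selected = [marker for marker in vi if f[marker] > f_threshold]
    return sorted(selected, key=vi.get, reverse=True)


# Hypothetical hold-out VIs for five markers.
vi = {"m1": 0.030, "m2": 0.020, "m3": 0.004, "m4": 0.0, "m5": -0.005}

by_threshold = {f: select_markers(vi, f) for f in (1, 3, 5)}
print(by_threshold)  # {1: ['m1', 'm2'], 3: ['m1', 'm2'], 5: ['m1']}
```

A lower threshold admits more markers into the prediction model, which is the trade-off discussed above: stringent thresholds favour confident marker detection, while looser ones may be needed to reach the test-set plateau.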

Taking into account only important, uncorrelated markers instead of basing prediction models on genomic similarities, as GBLUP models do [13, 14], can yield lean, generalisable models showing similar predictive abilities. The small models are in fact a distillation of the complex, large Random Forest models that were initially trained. With a similar predictive ability, model distillation leading to smaller models has other practical advantages such as lower computational costs, higher interpretability of the model features such as locus-interactions [70], and avoiding the issue of ‘shortcut learning’ [61]. Another significant advantage of this approach is that the smaller number of significant markers could allow marker-environment interactions to be included in process-based crop growth models that predict GxE phenotypes.

Conclusion

By separately predicting main genetic effects (bG) and genotype-by-environment interaction effects (bGE), predictive ability for bGE can be improved. This approach provides valuable insights into how different genotypes perform across various environments and is particularly useful for breeders aiming to develop varieties that are both high-yielding and adaptable to specific environmental conditions. In addition, the combined approach of machine learning-based genome-wide association studies (ML-GWAS) to select important markers, followed by genomic prediction, has demonstrated significant potential in our study. By using ML-GWAS, we can identify markers that are associated with phenotypic traits, overcoming the limitations of traditional GWAS methods that often struggle with overfitting and the inability to account for interaction effects between markers. Then, selecting uncorrelated markers with high variable importance and integrating them into small genomic prediction models with similar predictive ability can lead to less computation time and lower cost, while providing greater interpretability of model features and greater flexibility for use in scenarios with many environments.

Overall, the combined use of ML-GWAS and genomic prediction can offer an effective approach to identifying significant genetic markers and constructing high-performing genomic prediction models. This can ultimately lead to improved breeding strategies and the development of more resilient and productive crop varieties.

Supplementary Information

Acknowledgements

We would like to thank Aamir Saleem and Jonas Aper for previous work on the EUCLEG collection we could build on, and Jonas Aper for reading and commenting on the manuscript. Also a thank you to Tom De Swaef for brainstorming about the models and the technical teams in Serbia and Belgium that made the collection of data possible.

Abbreviations

GxE: Genotype-by-environment
GBLUP: Genomic best linear unbiased prediction
ML-GWAS: Machine learning-genome wide association study
LMM: Linear mixed-effects model
RKHS: Reproducing kernel Hilbert spaces
XGB: Extreme gradient boosting
MxE: Marker-by-environment
R1: Beginning of flowering maturity stage
R8: Full maturity stage
SNP: Single-nucleotide polymorphism
MAF: Minor allele frequency
envBLUP: Environment-specific best linear unbiased prediction
rRMSE: Relative root mean square error
VI: Variable importance
Y: Seed yield
P: Protein content

Author contributions

NV conceptualised the design of the analysis and drafted the manuscript; HR, VD and MC provided the data and revised the manuscript; HM, MP and IRR made substantial contributions to the design of the work and revised the manuscript.

Funding

This project has received funding from the European Union’s Horizon 2020 Program for Research & Innovation under grant Agreement no. 727312 and the Ministry of Science and Technology of China under grant no. 2017YFE0111000 for the project titled “EUCLEG – Breeding forage and grain legumes to increase EU’s and China’s protein self-sufficiency”. Additional funding has been received from the Horizon Europe program of the European Union for the BELIS project (Breeding European Legumes for Increased Sustainability) under Grant no. 101081878.

Data availability

The datasets used and source code for analysis during the current study are available on a public GitLab repository (https://gitlab.ilvo.be/nverbrigghe/GP_MLGWAS) under the GNU General Public Licence v3.0.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. FAO (No. FAO-AMIS (Distributed by AMIS Statistics)); 2024.
  • 2. Saleem A, Muylle H, Aper J, Ruttink T, Wang J, Yu D, Roldán-Ruiz I. A genome-wide genetic diversity scan reveals multiple signatures of selection in a European soybean collection compared to Chinese collections of wild and cultivated soybean accessions. Front Plant Sci. 2021;12:631767. doi: 10.3389/fpls.2021.631767
  • 3. Yao X, Xu J, Liu Z, Pachner M, Molin EM, Rittler L, Hahn V, Leiser W, Gu Y, Lu Y, Qiu L, Vollmann J. Genetic diversity in early maturity Chinese and European elite soybeans: a comparative analysis. Euphytica. 2023;219(1):17. doi: 10.1007/s10681-022-03147-0
  • 4. Crossa J, Pérez-Rodríguez P, Cuevas J, Montesinos-López O, Jarquín D, de los Campos G, Burgueño J, González-Camacho JM, Pérez-Elizalde S, Beyene Y, Dreisigacker S, Singh R, Zhang X, Gowda M, Roorkiwal M, Rutkoski J, Varshney RK. Genomic selection in plant breeding: methods, models, and perspectives. Trends Plant Sci. 2017;22(11):961–75. doi: 10.1016/j.tplants.2017.08.011
  • 5. Jannink J-L, Lorenz AJ, Iwata H. Genomic selection in plant breeding: from theory to practice. Brief Funct Genom. 2010;9(2):166–77. doi: 10.1093/bfgp/elq001
  • 6. Gill M, Anderson R, Hu H, Bennamoun M, Petereit J, Valliyodan B, Nguyen HT, Batley J, Bayer PE, Edwards D. Machine learning models outperform deep learning models, provide interpretation and facilitate feature selection for soybean trait prediction. BMC Plant Biol. 2022;22(1):180. doi: 10.1186/s12870-022-03559-z
  • 7. Jarquín D, Kocak K, Posadas L, Hyma K, Jedlicka J, Graef G, Lorenz A. Genotyping by sequencing for genomic prediction in a soybean breeding population. BMC Genom. 2014;15(1):740. doi: 10.1186/1471-2164-15-740
  • 8. Jarquin D, Specht J, Lorenz A. Prospects of genomic prediction in the USDA soybean germplasm collection: historical data creates robust models for enhancing selection of accessions. G3 Genes|Genomes|Genetics. 2016;6(8):2329–41. doi: 10.1534/g3.116.031443
  • 9. Lourenço VM, Ogutu JO, Rodrigues RAP, Posekany A, Piepho H-P. Genomic prediction using machine learning: a comparison of the performance of regularized regression, ensemble, instance-based and deep learning methods on synthetic and empirical data. BMC Genom. 2024;25(1):152. doi: 10.1186/s12864-023-09933-x
  • 10. Waldmann P, Pfeiffer C, Mészáros G. Sparse convolutional neural networks for genome-wide prediction. Front Genet. 2020. doi: 10.3389/fgene.2020.00025
  • 11. Westhues CC, Mahone GS, Da Silva S, Thorwarth P, Schmidt M, Richter J-C, Simianer H, Beissinger TM. Prediction of maize phenotypic traits with genomic and environmental predictors using gradient boosting frameworks. Front Plant Sci. 2021;12:699589. doi: 10.3389/fpls.2021.699589
  • 12. Lopez-Cruz M, Crossa J, Bonnett D, Dreisigacker S, Poland J, Jannink J-L, Singh RP, Autrique E, De Los Campos G. Increased prediction accuracy in wheat breeding trials using a marker × environment interaction genomic selection model. G3 Genes|Genomes|Genetics. 2015;5(4):569–82. doi: 10.1534/g3.114.016097
  • 13. VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91(11):4414–23. doi: 10.3168/jds.2007-0980
  • 14. Meher PK, Rustgi S, Kumar A. Performance of Bayesian and BLUP alphabets for genomic prediction: analysis, comparison and results. Heredity. 2022;128(6):519–30. doi: 10.1038/s41437-022-00539-9
  • 15. Montesinos López OA, Montesinos López A, Crossa J. Multivariate statistical machine learning methods for genomic prediction. Springer International Publishing; 2022. doi: 10.1007/978-3-030-89010-0
  • 16. Espigolan R. Parametric and semi-parametric models for predicting genomic breeding values of complex traits in nelore cattle; 2017.
  • 17. Montesinos-López A, Montesinos-López OA, Montesinos-López JC, Flores-Cortes CA, De La Rosa R, Crossa J. A guide for kernel generalized regression methods for genomic-enabled prediction. Heredity. 2021;126(4):577–96. doi: 10.1038/s41437-021-00412-1
  • 18. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. doi: 10.1023/A:1010933404324
  • 19. Ogutu JO, Piepho H-P, Schulz-Streeck T. A comparison of random forests, boosting and support vector machines for genomic selection. BMC Proc. 2011;5(3):S11. doi: 10.1186/1753-6561-5-S3-S11
  • 20. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. p. 785–794. doi: 10.1145/2939672.2939785
  • 21. Bentéjac C, Csörgő A, Martínez-Muñoz G. A Comparative Analysis of XGBoost. Artif Intell Rev. 2021;54(3):1937–67. doi: 10.1007/s10462-020-09896-5
  • 22. Mohammed A, Kora R. A comprehensive review on ensemble deep learning: opportunities and challenges. J King Saud Univ Comput Inf Sci. 2023;35(2):757–74. doi: 10.1016/j.jksuci.2023.01.014
  • 23. Đorđević V, Ćeran M, Miladinović J, Balešević-Tubić S, Petrović K, Miladinov Z, Marinković J. Exploring the performance of genomic prediction models for soybean yield using different validation approaches. Mol Breed. 2019;39(5):74. doi: 10.1007/s11032-019-0983-6
  • 24. Pembleton LW, Inch C, Baillie RC, Drayton MC, Thakur P, Ogaji YO, Spangenberg GC, Forster JW, Daetwyler HD, Cogan NOI. Exploitation of data from breeding programs supports rapid implementation of genomic selection for key agronomic traits in perennial ryegrass. Theor Appl Genet. 2018;131(9):1891–902. doi: 10.1007/s00122-018-3121-7
  • 25. Burgueño J, Crossa J, Cornelius PL, Trethowan R, McLaren G, Krishnamachari A. Modeling additive × environment and additive × additive × environment using genetic covariances of relatives of wheat genotypes. Crop Sci. 2007;47(1):311–20. doi: 10.2135/cropsci2006.09.0564
  • 26. Burgueño J, Crossa J, Cotes JM, Vicente FS, Das B. Prediction assessment of linear mixed models for multienvironment trials. Crop Sci. 2011;51(3):944–54. doi: 10.2135/cropsci2010.07.0403
  • 27. Heslot N, Akdemir D, Sorrells ME, Jannink J-L. Integrating environmental covariates and crop modeling into the genomic selection framework to predict genotype by environment interactions. Theor Appl Genet. 2014;127(2):463–80. doi: 10.1007/s00122-013-2231-5
  • 28. Jarquín D, Lemes Da Silva C, Gaynor RC, Poland J, Fritz A, Howard R, Battenfield S, Crossa J. Increasing genomic-enabled prediction accuracy by modeling genotype × environment interactions in Kansas Wheat. The Plant Genome. 2017. doi: 10.3835/plantgenome2016.12.0130
  • 29. Lado B, Barrios PG, Quincke M, Silva P, Gutiérrez L. Modeling genotype × environment interaction for genomic selection with unbalanced data from a wheat breeding program. Crop Sci. 2016;56(5):2165–79. doi: 10.2135/cropsci2015.04.0207
  • 30. Skøt L, Nay MM, Grieder C, Frey LA, Pégard M, Öhlund L, Amdahl H, Radovic J, Jaluvka L, Palmé A, Ruttink T, Lloyd D, Howarth CJ, Kölliker R. Including marker x environment interactions improves genomic prediction in red clover (Trifolium pratense L.). Front Plant Sci. 2024;15:1407609. doi: 10.3389/fpls.2024.1407609
  • 31. Canella Vieira C, Zhou J, Usovsky M, Vuong T, Howland AD, Lee D, Li Z, Zhou J, Shannon G, Nguyen HT, Chen P. Exploring machine learning algorithms to unveil genomic regions associated with resistance to Southern Root-Knot Nematode in soybeans. Front Plant Sci. 2022;13:883280. doi: 10.3389/fpls.2022.883280
  • 32. Strobl C, Boulesteix A-L, Zeileis A, Hothorn T. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform. 2007;8(1):25. doi: 10.1186/1471-2105-8-25
  • 33. Altmann A, Toloşi L, Sander O, Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010;26(10):1340–7. doi: 10.1093/bioinformatics/btq134
  • 34.Janitza S, Celik E, Boulesteix A-L. A computationally fast variable importance test for random forests for high-dimensional data. Working paper; 2015. 10.5282/ubm/epub.25587.
  • 35.Szymczak S, Holzinger E, Dasgupta A, Malley JD, Molloy AM, Mills JL, Brody LC, Stambolian D, Bailey-Wilson JE. r2VIM: a new variable selection method for random forests in genome-wide association studies. BioData Min. 2016;9(1):7. 10.1186/s13040-016-0087-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Xavier A, Rainey KM. Quantitative genomic dissection of soybean yield components. G3 Genes|Genomes|Genetics. 2020;10(2):665–75. 10.1534/g3.119.400896. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Yoosefzadeh-Najafabadi M, Eskandari M, Torabi S, Torkamaneh D, Tulpan D, Rajcan I. Machine-learning-based genome-wide association studies for uncovering QTL underlying soybean yield and its components. Int J Mol Sci. 2022;23(10):5538. 10.3390/ijms23105538. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Vymyslický T, Trněný O, Rietman H, Balko C, Đorđević V, Ranđelović P, Dybová M. Phenotypic characterization of soybean genetic resources at multiple locations: Breeding implications for enhancing environmental resilience, yield and protein content. Front Plant Sci. 2025;16:1422162. 10.3389/fpls.2025.1422162. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Agri4Cast. [Agri4Cast dataset from the Joint Research Center of European Commission, Institute of Environment, MARS unit]. Retrieved 12 September 2024; n.d. from https://agri4cast.jrc.ec.europa.eu/DataPortal/Index.aspx?o=d
  • 40.Fehr WR. Stages of Soybean Development; 1977.
  • 41.Pannecoucque J, Goormachtigh S, Ceusters N, Bode S, Boeckx P, Roldan-Ruiz I. Soybean response and profitability upon inoculation and nitrogen fertilisation in Belgium. Eur J Agron. 2022;132: 126390. 10.1016/j.eja.2021.126390. [Google Scholar]
  • 42.Bates D, Maechler M, Jagan M. Matrix: Sparse and dense matrix classes and methods. R Package Version 0.999375-43; 2010. http://cran.r-project.org/package=matrix. http://matrix.r-forge.r-project.org/slides/2009-07-10-Rennes/figs/Matrix-CRAN-depend-2.pdf.
  • 43.Wood S, Scheipl F. gamm4: Generalized Additive Mixed Models using ‘mgcv’ and ‘lme4’; 2009. p. 0.2–6. 10.32614/CRAN.package.gamm4.
  • 44.Bates D, Mächler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models using lme4 (No. arXiv:1406.5823); 2014. arXiv. http://arxiv.org/abs/1406.5823.
  • 45.Covarrubias-Pazaran G. Genome-assisted prediction of quantitative traits using the R package sommer. PLoS ONE. 2016;11(6): e0156744. 10.1371/journal.pone.0156744. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Perez P. BGLR: a statistical package for whole genome regression and prediction; n.d. [DOI] [PMC free article] [PubMed]
  • 47.Wright MN, Ziegler A. ranger: a fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw. 2017;77(1). 10.18637/jss.v077.i01
  • 48.Montesinos-López A, Montesinos-López OA, Gianola D, Crossa J, Hernández-Suárez CM. Multi-environment genomic prediction of plant traits using deep learners with dense architecture. G3 Genes|Genomes|Genetics. 2018;8(12):3813–28. 10.1534/g3.118.200740. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Wilson S. ParBayesianOptimization: Parallel Bayesian optimization of hyperparameters. R Package Version 2021;1(4).
  • 50.Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, Chen K, Mitchell R, Cano I, Zhou T, Li M, Xie J, Lin M, Geng Y, Li Y, Yuan J. xgboost: Extreme Gradient Boosting; 2014. p. 1.7.7.1. 10.32614/CRAN.package.xgboost.
  • 51.Daetwyler HD, Calus MPL, Pong-Wong R, de los Campos G, Hickey JM. Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics. 2013;193(2):347–65. 10.1534/genetics.112.147983. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Segura V, Vilhjálmsson BJ, Platt A, Korte A, Seren Ü, Long Q, Nordborg M. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat Genet. 2012;44(7):825–30. 10.1038/ng.2314. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2024. https://www.R-project.org/
  • 54.Azodi CB, Bolger E, McCarren A, Roantree M, De Los Campos G, Shiu S-H. Benchmarking parametric and machine learning models for genomic prediction of complex traits. G3 Genes|Genomes|Genetics. 2019;9(11):3691–702. 10.1534/g3.119.400498. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Farooq M, de Ridder D. Genomic prediction in plants: Opportunities for ensemble machine learning based approaches [version 2; peer review: 1 approved, 2 approved with reservations]; 2023. [DOI] [PMC free article] [PubMed]
  • 56.Yan Q, Fruzangohar M, Taylor J, Gong D, Walter J, Norman A, Shi JQ, Coram T. Improved genomic prediction using machine learning with variational Bayesian sparsity. Plant Methods. 2023;19(1):96. 10.1186/s13007-023-01073-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Zhou G, Gao J, Zuo D, Li J, Li R. MSXFGP: Combining improved sparrow search algorithm with XGBoost for enhanced genomic prediction. BMC Bioinform. 2023;24(1):384. 10.1186/s12859-023-05514-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Wang K, Abid MA, Rasheed A, Crossa J, Hearne S, Li H. DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Mol Plant. 2023;16(1):279–93. 10.1016/j.molp.2022.11.004. [DOI] [PubMed] [Google Scholar]
  • 59.Zingaretti LM, Gezan SA, Ferrão LFV, Osorio LF, Monfort A, Muñoz PR, Whitaker VM, Pérez-Enciso M. Exploring deep learning for complex trait genomic prediction in polyploid outcrossing species. Front Plant Sci. 2020. 10.3389/fpls.2020.00025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Ma W, Qiu Z, Song J, Cheng Q, Ma C. DeepGS: Predicting phenotypes from genotypes using Deep Learning; 2017. p. 241414. bioRxiv. 10.1101/241414.
  • 61.Ubbens J, Parkin I, Eynck C, Stavness I, Sharpe AG. Deep neural networks for genomic prediction do not estimate marker effects. Plant Genome. 2021;14(3): e20147. 10.1002/tpg2.20147. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Fisher A, Rudin C, Dominici F. All models are wrong, but many are useful: learning a variable’s importance by studying an entire class of prediction models simultaneously. J Mach Learn Res. 2019;20(177):1–81. [PMC free article] [PubMed] [Google Scholar]
  • 63.Pichler M, Hartig F. Machine learning and deep learning—A review for ecologists. Methods Ecol Evol. 2023;14(4):994–1016. 10.1111/2041-210X.14061. [Google Scholar]
  • 64.De Coninck A, De Baets B, Kourounis D, Verbosio F, Schenk O, Maenhout S, Fostier J. Needles: toward large-scale genomic prediction with marker-by-environment interaction. Genetics. 2016;203(1):543–55. 10.1534/genetics.115.179887. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Li B, Zhang N, Wang Y-G, George AW, Reverter A, Li Y. Genomic prediction of breeding values using a subset of SNPs identified by three machine learning methods. Front Genet. 2018;9:237. 10.3389/fgene.2018.00237. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Li M, Zhang Y-W, Xiang Y, Liu M-H, Zhang Y-M. IIIVmrMLM: the R and C++ tools associated with 3VmrMLM, a comprehensive GWAS method for dissecting quantitative traits. Mol Plant. 2022;15(8):1251–3. 10.1016/j.molp.2022.06.002. [DOI] [PubMed] [Google Scholar]
  • 67.Chen Z, Boehnke M, Wen X, Mukherjee B. Revisiting the genome-wide significance threshold for common variant GWAS. G3 Genes|Genomes|Genetics. 2021;11(2): jkaa056. 10.1093/g3journal/jkaa056. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Saleem A, Aper J, Muylle H, Borra-Serrano I, Quataert P, Lootens P, De Swaef T, Roldán-Ruiz I. Response of a diverse European soybean collection to “short duration” and “long duration” drought stress. Front Plant Sci. 2022;13: 818766. 10.3389/fpls.2022.818766. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Saleem A, Roldán-Ruiz I, Aper J, Muylle H. Genetic control of tolerance to drought stress in soybean. BMC Plant Biol. 2022;22(1):615. 10.1186/s12870-022-03996-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Boix-Adsera E. Towards a theory of model distillation (No. arXiv:2403.09053); 2024. arXiv. 10.48550/arXiv.2403.09053.

Data Availability Statement

The datasets used and source code for analysis during the current study are available on a public GitLab repository (https://gitlab.ilvo.be/nverbrigghe/GP_MLGWAS) under the GNU General Public Licence v3.0.


Articles from Plant Methods are provided here courtesy of BMC