Abstract
Main conclusion
We systematically evaluated three key determinants of prediction accuracy and the performance differences among fifteen state-of-the-art GP methods, and found the LSTM network particularly suitable for capturing additive and epistatic effects.
Genomic prediction (GP) has developed into an important method supporting crop breeding. By using the phenotype values predicted by GP, breeders can make selection decisions at the seedling stage, which in turn reduces costs. In recent years, machine learning has emerged as an efficient technology for solving modeling problems in many fields, including crop breeding. However, the sheer number of available modeling approaches has hindered the application of GP, as breeders struggle to choose among them. A comprehensive methodological study with practical guidance is therefore urgently needed. In the present study, we systematically evaluated three key determinants of prediction accuracy, as well as the performance differences among fifteen state-of-the-art GP methods. For genomic feature processing, we found that feature selection (SNP filtering) performed better than feature extraction (the PCA method). Specifically, the feature-relationship-dependent methods (GBLUP, RNN, and LSTM), as well as the DNN architecture, showed superior performance with feature selection. Marker density was positively correlated with prediction accuracy up to a limiting threshold. A comparison of population-size effects demonstrated a positive correlation between the genetic complexity of a trait and the optimal population size required. Among the fifteen modeling methods tested, the LSTM network displayed superior performance, achieving the highest average STScore (0.967) across six datasets. Further experiments using either all cell states or only the latest cell state as LSTM inputs demonstrated that its architecture is particularly adept at capturing additive and epistatic QTL effects among SNPs. In conclusion, our findings provide basic principles for implementing GP in breeding projects to maximize prediction accuracy while maintaining cost-effectiveness.
Supplementary Information
The online version contains supplementary material available at 10.1007/s00425-025-04843-6.
Keywords: Crop breeding, Genomic prediction, Machine learning, Deep learning, Long short-term memory (LSTM)
Introduction
Genomic prediction (GP) has emerged as a transformative tool in plant breeding in the past two decades, accelerating genetic gains by predicting genomic estimated breeding values (GEBVs) of candidate individuals based on genomic and phenotypic data (Meuwissen et al. 2001; Crossa et al. 2024). Most GP methods currently employed in genomic selection (GS) projects can be broadly classified into three categories (Fig. S1): statistical methods, machine learning methods, and deep learning methods. Traditional statistical methods of GP, such as Bayesian approaches (Pérez and de los Campos 2014), genomic best linear unbiased prediction (GBLUP) (VanRaden 2008), and ridge regression best linear unbiased prediction (RR-BLUP) (Endelman 2011), have been widely used in many plant and animal breeding programs (Crossa et al. 2017). However, these linear models exhibit limitations in processing high-dimensional genomic data and in effectively capturing complex, non-linear relationships between predictor and response variables (Meuwissen et al. 2001). Recent advancements in machine learning (ML) and deep learning (DL) algorithms have revolutionized GP by addressing these limitations, offering superior predictive accuracy and adaptability across a wide range of crops, including rice (Ma et al. 2024), maize (Wang et al. 2023; Wu et al. 2023), tomato (Gao et al. 2023), soybean (Gao et al. 2023; Liu et al. 2019), and wheat (Wang et al. 2023).
Bayesian methodologies have been extensively adopted for GP, with prominent approaches including Bayes A, Bayes B, Bayes C, and Bayesian LASSO (BL) (López et al. 2023). These methods incorporate probabilistic frameworks by establishing prior distributions and subsequently updating posterior distributions through Bayesian inference based on the observational data. Grenier et al. (2016) demonstrated the efficacy of Bayesian approaches through genomic selection analysis of 343 S2:4 lines derived from an upland rice synthetic population. The results revealed that Bayesian LASSO achieved the highest predictive ability for grain yield (0.309) and Bayesian ridge regression outperformed in plant height prediction (0.538).
Best linear unbiased prediction (BLUP) methods, such as genomic BLUP (GBLUP) (VanRaden 2008) and ridge regression BLUP (RR-BLUP) (Endelman 2011), have gained widespread application in GP (Crossa et al. 2017). GBLUP assumes that all markers contribute equally to genetic variance and employs a genomic relationship matrix for phenotype prediction without estimating marker effects (VanRaden 2008). Notably, RR-BLUP has been mathematically demonstrated to be equivalent to BLUP within the framework of mixed models (Meuwissen et al. 2001). A comprehensive comparative study (Bhering et al. 2015) evaluated the performance of RR-BLUP, GBLUP, and BL across F2 populations of varying sizes. The results demonstrated that RR-BLUP outperformed the other two methods in selecting genetically superior individuals.
Machine learning (ML) methods, including random forest (RF), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and support vector machines (SVM), have emerged as pivotal tools in GP. These methods are transforming plant and animal breeding by improving prediction accuracy, optimizing breeding strategies, and enhancing the integration of multi-omic datasets (Yan et al. 2021; Shen et al. 2023; Jeong et al. 2020). RF, an ensemble learning algorithm, builds multiple decision trees using bootstrap samples of the training data and random subsets of features for node splitting. This approach reduces model variance while slightly increasing bias, enabling RF to achieve competitive prediction accuracy in breeding studies. For example, RF achieved the highest correlation (0.529) in predicting days to flowering time (DTF) in rice (Jeong et al. 2020). XGBoost, a gradient-boosting framework, sequentially constructs decision trees to minimize the residuals of preceding models. Gill et al. (2022) demonstrated that XGBoost and RF outperformed deep learning models in 13 out of 14 prediction tasks. LightGBM, an advanced gradient-boosting variant, optimizes high-dimensional data processing through leaf-wise tree growth, gradient-based one-side sampling, and exclusive feature bundling. Yan et al. (2021) highlighted that LightGBM offers superior prediction precision, model stability, and computational efficiency. SVM, a supervised learning method, classifies data by identifying optimal separating hyperplanes, or fits regression models by minimizing deviations within a tolerance margin. Annicchiarico et al. (2015) reported that SVM achieved the highest genomic selection accuracy for alfalfa biomass yield across different reference populations.
Deep learning (DL) methodologies, encompassing multi-layer perceptron (MLP), deep neural network (DNN), convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory network (LSTM), and transformer, have been increasingly integrated into the field of GP alongside advancements in artificial intelligence (AI). MLP, a fundamental architecture in feed-forward artificial neural networks (ANNs), serves as a cornerstone for deep learning. CNNs represent a robust category of deep learning models extensively employed across diverse applications, including but not limited to object detection (Galvez et al. 2018), speech recognition (Abdel-Hamid et al. 2012), computer vision (Karpathy et al. 2014), image classification (Jia et al. 2014), and bioinformatics (Tasdelen and Sen 2021). In the field of GP, numerous models integrating CNN have been developed, including CNN-RNN (Khaki et al. 2019), DeepGS (Ma et al. 2018), DNNGP (Wang et al. 2023), SoyDNGP (Gao et al. 2023), DeepCCR (Ma et al. 2024), DLGWAS (Liu et al. 2019), and ResGS (Xie et al. 2024). DNNGP (Wang et al. 2023) integrates multi-omics data through a multilayered hierarchical structure with CNN layers, batch normalization, and dropout regularization. This design enables DNNGP to achieve superior performance in wheat, surpassing GBLUP, LightGBM, SVR, DeepGS, and DLGWAS by an average of 234.2%, 2.5%, 48.9%, 16.8%, and 8.2%, respectively. RNNs, characterized by their internal memory, excel at capturing sequential dependencies. Despite this potential, RNNs have seen fewer applications in GP than CNNs. Khaki et al. (2019) introduced a CNN-RNN framework for crop yield prediction, utilizing the RNN component to model the temporal dependencies of environmental factors and the genetic improvement of seeds over time, without genomic data. The results showed that CNN-RNN (Khaki et al. 2019) achieved root-mean-square errors (RMSE) of 9% and 8% of the average corn and soybean yields, respectively, substantially outperforming RF, deep fully connected neural networks (DFNN), and LASSO. LSTM mitigates the vanishing-gradient limitation of standard RNNs through gated memory cells, excelling at modeling long-distance dependencies. Like RNN, LSTM has been implemented in GP less frequently than CNN. Ma et al. (2024) developed DeepCCR, a method integrating CNN with LSTM to improve rice breeding. The experimental results revealed that DeepCCR significantly outperformed the runner-up among XGBoost, LightGBM, DNNGP, and GBLUP by 17.2%, 11.7%, 19.9%, 12.8%, 9.6%, 12.6%, 6.6%, 12.8%, 10.3%, and 12.6% for the traits Y, HD, PH, PL, TN, GP, SSR, GL, GW, and TGW, respectively.
Although several reviews have comprehensively described the methods and development of GP (Zou et al. 2019; Farooq et al. 2024), comparative analyses of their prediction accuracy and practical constraints are still lacking. Identifying optimal strategies for implementing these methods in GP applications is therefore an urgent research priority. To comprehensively investigate the advantages and limitations of GP methods, our study systematically evaluated fifteen state-of-the-art methods across three dimensions known to influence GP accuracy: feature processing methods, marker density, and population size. The fifteen GP methods comprise four Bayesian approaches (BayesA, BayesB, BayesC, and BL), two BLUP approaches (GBLUP and RR-BLUP), four ML algorithms (XGBoost, LightGBM, RF, and SVM), and five DL architectures (DNN, RNN, LSTM, ResNet34, and ResNet18). Notably, all ML and DL methods employed hyper-parameter optimization strategies to achieve optimal results. Model performance was systematically assessed on six crop datasets (rice439, maize1404, tomato398, soybean20087, cotton1037, and wheat599) to identify the optimal strategy for GP. This study explored the main characteristics of GP methods and provides suggestions on how to use these methods effectively in GS.
Materials and methods
Crop materials
A total of six datasets spanning six critical crops were employed to evaluate the performance of GP methods. These datasets were named as rice439, maize1404, wheat599, cotton1037, tomato398, and soybean20087 according to crop type and corresponding sample number.
The rice439 dataset is a collection of 439 rice accessions, including major inbred cultivars, landraces, and introduced varieties (Ye et al. 2022). High-throughput re-sequencing was conducted on these accessions. Following rigorous quality control criteria (QD < 20.0, ReadPosRankSum < 8.0, FS > 10.0, and QUAL < Mean QUAL), a total of 3,010,765 high-quality single nucleotide polymorphisms (SNPs) were identified, with an average density of 8.05 SNPs per kilobase (kb). The 439 accessions were planted and phenotyped in Lingshui, Hainan Province, China, in 2019. We focused on four key agronomic traits: plant height (PH), grain length (GL), grain width (GW), and grain ratio (GR, calculated as GL/GW ratio).
The maize1404 dataset comprises 1,404 maize inbred lines derived from 24 elite Chinese maize founder lines belonging to four heterotic subgroups: LvDaHongGu, ZI330, SiPingTou, and Yugoslavia-improved germplasm (Liu et al. 2020). Through whole-genome sequencing and subsequent variant calling and imputation, approximately 14 million high-quality SNPs were identified across these lines. Our study focused on eight key agronomic traits: ear leaf length (ELL), ear leaf width (ELW), cob weight (CW), days to silking (DTS), days to tasseling (DTT), days to anthesis (DTA), ear height (EH), and plant height (PH).
The wheat599 dataset comprises 599 historical wheat accessions obtained from the Global Wheat Program at the International Maize and Wheat Improvement Center (CIMMYT) (McLaren et al. 2005). Genotyping of these accessions was performed using 1,447 Diversity Array Technology (DArT) markers, which were generated by Triticarte (Canberra, ACT, Australia; https://www.diversityarrays.com). To ensure data quality, markers with a minor allele frequency (MAF) below 0.05 were excluded. Missing genotypes were imputed based on the marginal distribution of marker genotypes. Following stringent quality control procedures, a final set of 1,279 high-quality markers was retained for further analysis. The average grain yield measurements across four environments (designated yield1-yield4) were used in the subsequent analysis. The processed genotypic and phenotypic data of wheat599 were downloaded from the BGLR package for R, version 1.1.3 (Pérez and de los Campos 2014).
The cotton1037 dataset consists of 1,037 upland cotton accessions obtained from the National Mid-term Gene Bank of China (Wang et al. 2024). The accessions were cultivated in greenhouse conditions, with controlled environmental parameters (16 h light/8 h dark photoperiod, 28 °C/25 °C day/night temperatures). Whole-genome re-sequencing was performed on the Illumina HiSeq X Ten platform, generating sequence data with an average coverage depth of 9.89 × per accession. Following stringent quality control measures, including a MAF threshold > 0.05 and a genotype missing rate < 0.2, a total of 2,759,902 high-quality SNPs were identified across the accessions. Our study focused on one key agronomic trait, leaf pubescence amount (LPA), which was characterized and analyzed in detail.
The tomato398 dataset comprises 398 genetically diverse accessions, representing modern, heirloom, and wild accessions (Tieman et al. 2017). This population provides a comprehensive genomic baseline to quantify the impact of domestication and modern breeding on flavor-associated metabolic traits. Whole-genome sequencing (Illumina platform) generated 2,014,488 high-confidence SNPs (MAF > 5%, genotype missing rate < 10%). Phenotypic characterization was conducted on accessions cultivated under standardized conditions in Florida, USA. Six key flavor-associated compounds were selected for in-depth analysis, including soluble solids, glucose, fructose, 1-nitro-2-phenylethane, isovaleraldehyde, and phenylacetaldehyde, which are critical determinants of tomato flavor profiles.
Soybean20087 comprises a comprehensive collection of 20,087 soybean accessions, representing a wide range of genetic diversity. Genotypic data for these accessions were obtained from the SoyBase database (https://www.soybase.org/) (Gao et al. 2023), which includes 42,509 high-confidence SNPs derived from the SoySNP50K iSelect BeadChip platform. Phenotypic data for each accession were retrieved from the GRIN-Global database (https://npgsweb.ars-grin.gov/gringlobal/search/) (Postman et al. 2010). Our study focused on six key agronomic traits: plant height (PH), flowering time (FT), seed oil content (SOC), seed protein content (SPC), seed weight (SW), and yield, which are critical for soybean breeding and improvement. To evaluate the influence of population size on model performance, 100, 200, 400, 800, 1200, 1500, 2000, and 3000 samples were randomly selected from the soybean20087 dataset. To mitigate potential bias introduced by random sampling, we performed five independent sampling iterations, resulting in five distinct datasets with varying population sizes. The average accuracy of these five datasets was used to evaluate the model performance.
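The replicated random subsampling described above can be sketched as follows. This is a minimal illustration, not the authors' script; the function name, seeding scheme, and the specific sizes shown are assumptions.

```python
import numpy as np

def subsample_indices(n_total, sizes, n_reps=5, base_seed=0):
    """Draw repeated random subsets of accession indices for each target
    population size, so that model accuracy can later be averaged over the
    replicates to mitigate sampling bias."""
    subsets = {}
    for size in sizes:
        reps = []
        for rep in range(n_reps):
            rng = np.random.default_rng(base_seed + rep)  # one seed per replicate
            reps.append(rng.choice(n_total, size=size, replace=False))
        subsets[size] = reps
    return subsets

# Five replicate subsets for each population size, drawn from 20,087 accessions
subsets = subsample_indices(20087, sizes=[100, 200, 400, 800], n_reps=5)
print(len(subsets[800]), len(subsets[800][0]))  # 5 replicates of 800 indices each
```

Sampling without replacement guarantees that no accession appears twice within a replicate.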
Genotypic data processing
Genotypic data were processed and analyzed using a series of bioinformatics tools and statistical methods. To ensure data quality, missing genotypes were first filtered using Vcftools (Danecek et al. 2011), with the max-missing parameter set to 0.98 to remove variants with more than 2% missing data. Subsequently, missing genotypes were imputed using Beagle (version 5.4) (Browning et al. 2021) to enhance data completeness. The genotypic data were then encoded using PLINK (Purcell et al. 2007). For markers with two alleles, the genotypes AA, Aa, and aa were encoded as 2, 1, and 0, respectively.
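The additive encoding produced by PLINK can be reproduced with a few lines of Python. This is an illustrative sketch of the 2/1/0 dosage mapping only, not the actual pipeline; the function name is hypothetical.

```python
import numpy as np

def encode_additive(genotypes):
    """Map diploid genotypes at biallelic markers to additive dosage codes:
    homozygous 'AA' -> 2, heterozygous 'Aa'/'aA' -> 1, homozygous 'aa' -> 0."""
    mapping = {"AA": 2, "Aa": 1, "aA": 1, "aa": 0}
    return np.array([[mapping[g] for g in row] for row in genotypes])

# Two samples genotyped at three markers
X = encode_additive([["AA", "Aa", "aa"],
                     ["aa", "AA", "Aa"]])
print(X)  # [[2 1 0]
          #  [0 2 1]]
```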
To evaluate the effect of feature processing methods and marker density on model performance, one feature extraction method and two feature selection methods were applied to reduce the dimensionality of the genomic data. The feature extraction method, principal component analysis (PCA), was conducted using the sklearn.decomposition module in Python (version 3.10.9), with the n_components parameter set to 0.85, 0.90, 0.95, and 0.99. The first feature selection method, linkage disequilibrium (LD) pruning, was performed using PLINK (Purcell et al. 2007), with the r2 threshold ranging from 0.1 to 0.8 in steps of 0.1. This procedure was designed to reduce marker redundancy while retaining informative variants. The second feature selection method, variance-correlation (VC) analysis, was conducted using the "scipy" (version 1.14.1) package in Python, employing a variance threshold of 0.1 and correlation thresholds of 0.7 and 0.8. The results of LD0.8 filtering were used for the comparison of GP methods.
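The variance-correlation filter described above can be sketched with NumPy alone. This is a minimal interpretation of the two-step procedure, assuming a greedy pairwise pruning strategy; the function name and greedy order are assumptions, not the authors' exact implementation.

```python
import numpy as np

def variance_correlation_filter(X, var_thresh=0.1, corr_thresh=0.8):
    """Two-step marker filter: (1) drop SNP columns whose variance falls
    below var_thresh; (2) greedily drop one SNP from every remaining pair
    whose absolute Pearson correlation exceeds corr_thresh.
    Returns the indices of retained columns of X."""
    keep = np.flatnonzero(X.var(axis=0) > var_thresh)   # step 1: variance filter
    corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    selected = []
    for j in range(len(keep)):                          # step 2: correlation pruning
        if all(corr[j, k] <= corr_thresh for k in selected):
            selected.append(j)
    return keep[selected]

# Column 1 duplicates column 0 (corr = 1); column 3 is constant (zero variance)
X = np.array([[0, 0, 2, 1],
              [1, 1, 0, 1],
              [2, 2, 1, 1],
              [0, 0, 1, 1],
              [2, 2, 1, 1]], dtype=float)
print(variance_correlation_filter(X))  # duplicate and constant columns are dropped
```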
Phenotypic data processing
For the phenotypic data, we performed rigorous quality control by implementing a two-step filtering procedure. Initially, all missing values were systematically removed to ensure data completeness. Subsequently, we applied a statistical outlier detection method based on the three-sigma rule, where extreme values falling outside the range of mean ± 3 standard deviations were excluded from the data. This stringent filtering approach was implemented to maintain data integrity and minimize the potential impact of measurement errors or biological anomalies on subsequent analyses.
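The two-step phenotype filter above can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def three_sigma_filter(values):
    """Two-step phenotype QC: (1) drop missing entries; (2) remove values
    outside mean +/- 3 standard deviations (the three-sigma rule)."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]                       # step 1: remove missing values
    mu, sigma = v.mean(), v.std()
    return v[np.abs(v - mu) <= 3 * sigma]     # step 2: three-sigma filtering

# Twenty typical records, one missing value, and one extreme outlier
pheno = [100.0] * 20 + [np.nan, 1000.0]
print(three_sigma_filter(pheno).size)  # -> 20 (NaN and outlier removed)
```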
GP methods used for comparison
We evaluated the performance of fifteen state-of-the-art methodologies spanning multiple computational strategies. These methods include: (1) four Bayesian methods (BayesA, BayesB, BayesC, and BL); (2) two BLUP methods, specifically GBLUP and RR-BLUP; (3) four ML algorithms, including LightGBM, RF, XGBoost, and SVM; and (4) five DL architectures, specifically DNN, RNN, LSTM, along with two CNNs (ResNet18 and ResNet34).
The implementation of the various GP methods was conducted using established computational packages in R (version 4.4.1) or Python (version 3.10.9) environments. For the Bayesian approaches, we employed the "BGLR" package (version 1.1.3) (Pérez and de los Campos 2014) in R, which provides a comprehensive framework for Bayesian generalized linear regression. GBLUP was also implemented using the "BGLR" package with the model parameter set to reproducing kernel Hilbert space (RKHS) (de los Campos et al. 2010). RR-BLUP was constructed using the dedicated "rrBLUP" package (version 4.6.2) (Endelman 2011).
For the ML method implementations: LightGBM was implemented using the "lightgbm" package (version 4.5.0) (Ke et al. 2017) in Python; XGBoost was employed through the "xgboost" package (version 2.1.1) (Chen and Guestrin 2016); and both the RF and SVM algorithms were implemented using the "scikit-learn" package (version 1.5.2) (Pedregosa et al. 2011). Hyper-parameter optimization was performed using the sklearn.model_selection.GridSearchCV module in Python to identify optimal configurations. For LightGBM, the following parameter space was explored: learning rate (0.05, 0.01), n_estimators (100, 200), feature_fraction (0.9, 1.0), and bagging_fraction (0.8, 1.0). Similarly, XGBoost parameters were optimized across these ranges: eta (0.1, 0.01), max_depth (6, 8), and subsample (0.7, 0.9). For RF, n_estimators was set to 100.
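The GridSearchCV workflow can be illustrated with scikit-learn alone. The random forest estimator, the toy data, and the small parameter grid below are stand-ins (the actual LightGBM/XGBoost grids are given in the text); this is a sketch of the search mechanics, not the study's configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-in for an encoded SNP matrix and a phenotype vector
X, y = make_regression(n_samples=200, n_features=50, random_state=0)

# Exhaustive grid search with internal cross-validation; every parameter
# combination is scored and the best configuration is retained
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_features": ["sqrt", 1.0]},
    cv=3,
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_)
```

`search.best_estimator_` then holds a model refit on the full data with the winning configuration.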
For the DL method implementations, we utilized the PyTorch framework (version 2.4.1) to construct and train five distinct neural network architectures: DNN, RNN, LSTM, ResNet18, and ResNet34. These methods were optimized using mean squared error (MSE) as the loss function, with gradient descent performed by the Adam optimizer incorporating L2 regularization (weight decay = 0.000001) (Kingma and Ba 2014). To enhance training efficiency and model convergence, we implemented an exponential learning-rate scheduling strategy; this adaptive approach balances exploration during the initial training phases against convergence precision in later stages. A comprehensive hyper-parameter search was conducted, evaluating multiple configurations of the learning rate (0.001, 0.0001) and dropout rate (0.1, 0.2, 0.3, 0.4).
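The exponential learning-rate schedule mentioned above follows lr_t = lr_0 · γ^t, the rule implemented by PyTorch's `torch.optim.lr_scheduler.ExponentialLR`. A dependency-free sketch (the lr_0 and γ values below are illustrative, not the study's settings):

```python
def exponential_lr(lr0, gamma, epoch):
    """Exponentially decayed learning rate: lr_t = lr0 * gamma**epoch,
    mirroring torch.optim.lr_scheduler.ExponentialLR."""
    return lr0 * gamma ** epoch

# Larger steps early (exploration), progressively smaller steps later (convergence)
for t in (0, 10, 50):
    print(t, exponential_lr(1e-3, 0.95, t))
```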
To ensure robust and unbiased evaluation of model performance, we implemented a rigorous ten-fold cross-validation (CV) procedure using the KFold function from the sklearn.model_selection module in Python (version 3.10.9). The CV configuration was set with n_splits = 10, random_state = 100, and shuffle = True. Setting random_state = 100 ensured reproducibility of the fold splits across different methods, while enabling shuffle guaranteed that the data was randomly permuted prior to partitioning. In this framework, each dataset was partitioned into 10 mutually exclusive and approximately equal-sized subsets. During each iteration, 9 subsets were utilized as the training set for model development, while the remaining subset served as the testing set for performance evaluation. This rotation process was repeated iteratively until all subsets had been employed exactly once as the testing set, ensuring comprehensive utilization of the entire dataset for both training and validation purposes.
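The cross-validation setup above can be sketched directly with scikit-learn's KFold; the toy matrix is a stand-in for a genotyped population.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)   # toy matrix standing in for 50 genotyped samples

# Same CV configuration as described in the text: 10 folds, shuffled,
# fixed seed so every method is evaluated on identical splits
kf = KFold(n_splits=10, shuffle=True, random_state=100)
for train_idx, test_idx in kf.split(X):
    # 45 samples for training and 5 for testing in each of the 10 iterations;
    # across folds, every sample serves exactly once as a test example
    assert len(train_idx) == 45 and len(test_idx) == 5
```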
Model performance evaluation
The Pearson correlation coefficient (PCC) was employed to assess and compare the predictive performance of the implemented methods. PCC, ranging from − 1 to 1, quantifies the strength of the linear relationship between predicted and observed values, with higher values indicating stronger predictive performance.
We developed a novel metric, the standardized score (STScore), to evaluate and compare the performance of different methods across multiple datasets. The STScore of each method was calculated by dividing its PCC by the highest PCC achieved among all methods for that specific trait:

$$\mathrm{STScore}_i = \frac{\mathrm{PCC}_i}{\mathrm{PCC}_{\max}}$$

where $\mathrm{STScore}_i$ is the score of method $i$, $\mathrm{PCC}_i$ denotes the PCC value obtained by method $i$ for a specific trait, and $\mathrm{PCC}_{\max}$ indicates the maximum PCC value observed for that trait across all methods. This scoring approach not only facilitates direct comparison of method performance but also accounts for variability in prediction accuracy across different traits and datasets.
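The STScore calculation reduces to a single vectorized division; the PCC values below are hypothetical.

```python
import numpy as np

def stscore(pcc_by_method):
    """Standardized score: each method's PCC divided by the best PCC
    achieved by any method on the same trait, so the best method scores 1.0."""
    pcc = np.asarray(pcc_by_method, dtype=float)
    return pcc / pcc.max()

# Hypothetical PCC values of three methods for one trait
print(stscore([0.80, 0.85, 0.76]))  # the second (best) method scores exactly 1.0
```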
To statistically assess the performance differences among marker types across the different methodological approaches, we conducted independent two-sample t-tests using the ttest_ind function from the "SciPy" (version 1.14.1) package in Python. This statistical test was employed to evaluate whether significant differences (p < 0.01) existed in predictive performance metrics between distinct marker types within each methodological framework.
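The marker-type comparison can be sketched with `scipy.stats.ttest_ind`. The two PCC samples below are synthetic draws invented purely for illustration (their means loosely echo the SNP-vs-PCA contrast reported in the Results), not the study's data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical PCC samples for one method cluster under the two feature types
pcc_snp = rng.normal(0.84, 0.02, size=40)   # SNP features
pcc_pca = rng.normal(0.62, 0.05, size=40)   # PCA-based features

# Independent two-sample t-test, as used for the marker-type comparison
stat, pvalue = ttest_ind(pcc_snp, pcc_pca)
print(pvalue < 0.01)  # significant at the threshold used in the text
```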
Results
The effects of feature processing methods on model performance
The rapid advancement of re-sequencing technologies has facilitated the identification of vast numbers of SNPs. However, the majority of these variants are non-informative, potentially leading to computational inefficiency and overfitting in GP models. Selecting an optimal subset of markers is therefore critical when building a GP model. To evaluate the impact of feature processing strategies on model performance, we assessed three dimensionality reduction (DR) methods: one feature extraction method, principal component analysis (PCA), and two feature selection methods, linkage disequilibrium (LD) pruning and variance-correlation (VC) analysis. These approaches were systematically applied to the rice439 dataset to analyze their effects on model accuracy, providing insights into their utility for enhancing GP applications. The number of features retained after DR processing is described in Table S1. Based on the prediction results, the feature subsets were stratified into two distinct types: a PCA-based feature type derived from PCA decomposition, and a SNPs feature type sourced from the LD pruning and VC methods (Fig. 1a). This stratification highlights a fundamental difference in approach: PCA captures global variance patterns through linear transformations, whereas LD pruning and VC directly use SNP marker information to reduce redundancy.
Fig. 1.
The comparison of different feature processing methods and sample sizes using fifteen GP methods in the rice439 dataset. a The PCC values of the fifteen methods across different DR methods. Z-score standardization was applied to the PCC values for straightforward comparison across traits. PCC Pearson correlation coefficient, VC Variance-correlation, LD Linkage disequilibrium, PCA Principal component analysis. b The PCC values of the six method classes across different marker types. Kernel Kernel method cluster, BLRR Bayesian linear regression and RR-BLUP cluster, EL Ensemble learning cluster, CNN Convolutional neural network cluster, DNN Deep neural network cluster, FRD Feature relationship dependence cluster. c The PCC values of twelve methods processed by LD pruning at different r2 thresholds. GL Grain length, GR Grain ratio, GW Grain width, PH Plant height. d The PCC values of the fourteen methods across varying population sizes. FT Flowering time, SOC Seed oil content, SW Seed weight
The fifteen GP methods were systematically classified into six distinct clusters based on their methodological frameworks: the ensemble learning (EL) cluster, comprising tree-based ensemble algorithms (LightGBM, RF, XGBoost); the Bayesian linear regression and RR-BLUP (BLRR) cluster, which includes the Bayesian regression variants (BayesA, BayesB, BayesC, BL) alongside RR-BLUP; the feature relationship dependence (FRD) cluster, featuring GBLUP, RNN, and LSTM; the deep neural network (DNN) cluster, represented by a standard DNN architecture; the convolutional neural network (CNN) cluster, containing the residual networks (ResNet18, ResNet34); and the kernel method (Kernel) cluster, exclusively comprising support vector machines (SVM) with radial basis function kernels. To systematically evaluate the performance and applicability of the fifteen GP methods across feature types, we performed independent t-tests comparing PCA-based features with SNPs features within each methodological cluster. The statistical analysis, presented in Fig. 1b, revealed significant differences (p < 0.01) in three clusters: the FRD cluster (p = 1.728E-31), the DNN cluster (p = 0.0001), and the CNN cluster (p = 0.0011). These findings demonstrated that both the FRD and DNN clusters show significantly higher compatibility with SNPs features than with PCA-based features, as reflected in their superior average PCC values (0.84 vs. 0.62 and 0.85 vs. 0.78, respectively). In contrast, within the CNN cluster, PCA-based features outperformed SNPs features, with average PCC values of 0.84 and 0.80, respectively. We compared the training and test results of ResNet18 and ResNet34 using the optimal parameters for SNPs features (LD0.8, LD pruning with the r2 threshold set to 0.8) and PCA-based features (PCA0.85, PCA with the n_components parameter set to 0.85), as depicted in Fig. S2a and Fig. S2b.
The results revealed that this discrepancy in the CNN cluster is primarily attributed to over-fitting issues during the training process. These results indicated the potential risk when designing CNN architectures for GP. Furthermore, this differential sensitivity to feature types provides valuable insights for selecting appropriate modeling strategies based on the specific feature processing method in GP studies.
To determine the optimal DR parameters for each method, we conducted a systematic performance comparison across varying parameter sets, as summarized in Table 1 and Table S2. The results demonstrated that the top three parameter configurations were LD0.8, LD0.7, and LD0.6 across all the 60 experimental sets (comprising 4 traits × 15 methods). Among these, LD0.8 exhibited the highest performance, achieving superior results in 11 experimental sets with an average PCC of 0.845. This comprehensive analysis highlights the variability in model performance across different DR configurations and provides valuable insights for selecting the most effective parameter settings based on specific modeling approaches.
Table 1.
The mean PCC values of fifteen GP methods under different DR configurations in the rice439 dataset
| | DNN | LSTM | RNN | LightGBM | RF | XGBoost | RR-BLUP | BayesA | BayesB | BayesC | BL | GBLUP | ResNet18 | ResNet34 | SVM | Average PCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LD0.1 | 0.815 | 0.756 | 0.798 | 0.8 | 0.804 | 0.81 | 0.81 | 0.818 | 0.813 | 0.818 | 0.819 | 0.8 | 0.812 | 0.81 | 0.761 | 0.803 |
| LD0.2 | 0.816 | 0.813 | 0.812 | 0.808 | 0.801 | 0.803 | 0.837 | 0.836 | 0.836 | 0.837 | 0.833 | 0.824 | 0.79 | 0.79 | 0.736 | 0.811 |
| LD0.3 | 0.85 | 0.844 | 0.845 | 0.822 | 0.819 | 0.824 | 0.848 | 0.849 | 0.85 | 0.847 | 0.843 | 0.837 | 0.773 | 0.762 | 0.746 | 0.824 |
| LD0.4 | 0.857 | 0.853 | 0.855 | 0.836 | 0.834 | 0.839 | 0.854 | 0.854 | 0.856 | 0.853 | 0.85 | 0.843 | 0.787 | 0.792 | 0.759 | 0.835 |
| LD0.5 | 0.861 | 0.856 | 0.857 | 0.842 | 0.839 | 0.842 | 0.857 | 0.859 | 0.861 | 0.856 | 0.852 | 0.847 | 0.793 | 0.782 | 0.766 | 0.838 |
| LD0.6 | 0.867 | 0.862 | 0.86 | 0.852 | 0.848 | 0.851 | 0.858 | 0.859 | 0.86 | 0.856 | 0.853 | 0.85 | 0.801 | 0.811 | 0.77 | 0.844 |
| LD0.7 | 0.867 | 0.859 | 0.86 | 0.854 | 0.846 | 0.844 | 0.858 | 0.859 | 0.857 | 0.858 | 0.855 | 0.852 | 0.804 | 0.813 | 0.771 | 0.844 |
| LD0.8 | 0.864 | 0.867 | 0.862 | 0.858 | 0.849 | 0.852 | 0.859 | 0.859 | 0.86 | 0.858 | 0.855 | 0.854 | 0.805 | 0.808 | 0.772 | 0.845 |
| PCA0.85 | 0.797 | 0.581 | 0.707 | 0.824 | 0.812 | 0.819 | 0.85 | 0.851 | 0.851 | 0.851 | 0.852 | 0.826 | 0.86 | 0.824 | 0.787 | 0.806 |
| PCA0.9 | 0.797 | 0.6 | 0.678 | 0.818 | 0.81 | 0.815 | 0.852 | 0.855 | 0.852 | 0.852 | 0.854 | 0.789 | 0.851 | 0.825 | 0.783 | 0.802 |
| PCA0.95 | 0.78 | 0.514 | 0.594 | 0.813 | 0.803 | 0.803 | 0.849 | 0.853 | 0.85 | 0.851 | 0.852 | 0.683 | 0.834 | 0.832 | 0.78 | 0.779 |
| PCA0.99 | 0.752 | 0.521 | 0.564 | 0.805 | 0.793 | 0.797 | 0.857 | 0.856 | 0.851 | 0.853 | 0.854 | 0.413 | 0.827 | 0.831 | 0.775 | 0.757 |
| VC0.7 | 0.852 | 0.842 | 0.853 | 0.837 | 0.825 | 0.823 | 0.852 | 0.849 | 0.833 | 0.85 | 0.847 | 0.836 | 0.782 | 0.792 | 0.751 | 0.828 |
| VC0.8 | 0.866 | 0.862 | 0.857 | 0.848 | 0.836 | 0.838 | 0.857 | 0.85 | 0.842 | 0.853 | 0.847 | 0.848 | 0.81 | 0.812 | 0.769 | 0.84 |
The effect of marker density on model performance
The quantity and linkage of genomic markers exert a significant influence on model performance. To evaluate the effect of marker density on model efficacy, we conducted a comprehensive analysis of performance across varying r2 thresholds in LD pruning. As depicted in Fig. 1c, model performance improved with increasing r2 thresholds, with diminishing returns observed once r2 exceeded 0.4, in the EL, BLRR, FRD, and DNN clusters. As described in Table S3, the BLRR cluster at an r2 threshold of 0.4 exhibited substantial PCC improvements of 4.87% (GL), 3.72% (GR), 4.80% (GW), and 5.35% (PH) relative to an r2 threshold of 0.1, while showing only slightly reduced PCC values of -0.28%, -0.19%, -0.22%, and -1.72% for these traits compared to an r2 threshold of 0.8. This finding indicated that an r2 threshold of 0.4 represents the optimal setting for rice genomic prediction, striking a balance between computational efficiency and predictive accuracy. In contrast, the CNN cluster and the Kernel cluster exhibited no sensitivity to variations in marker density, as illustrated in Fig. S3.
The effect of population size on model performance
Plant breeders aim to balance accuracy and cost when estimating genetic values. To determine the optimal population size for predicting genomic estimated breeding values, we assessed eight subsets drawn from the soybean20087 dataset, with sample sizes ranging from 100 to 3000 (100, 200, 400, 800, 1200, 1500, 2000, and 3000). The results demonstrated that the PCC values of all fourteen methods (the BL method was excluded from this analysis due to its limitation in processing small population sizes) increased only marginally once population size exceeded 800 for three traits (FT, SOC, SW), as shown in Fig. 1d and Table S4. For trait FT, a population size of 800 yielded PCC improvements of 43.97%, 15.46%, 110.38%, 49.87%, 113.93%, and 55.62% for the CNN, Kernel, FRD, BLRR, DNN, and EL clusters, respectively, compared to a size of 100, whereas these clusters showed only modestly reduced PCC values of −2.87%, −12.84%, −4.58%, −5.78%, −3.15%, and −9.51% when compared to a population size of 3000. For trait SW, these model clusters demonstrated significant PCC improvements of 13.97%, 15.76%, 6.99%, 8.59%, 4.65%, and 15.38% at a population size of 800 compared to 100, but only moderate decreases of −3.53%, −7.71%, −1.59%, −2.16%, −1.62%, and −3.93% when compared to a population size of 3000. In contrast, as depicted in Fig. 1d, the complex trait yield, which is governed by hundreds of quantitative trait loci (QTLs), exhibited progressively improved performance without reaching an inflection point, even at a population size of 3000. The PCC values at population size 800 increased by 0.75% (CNN), 12.54% (Kernel), −0.01% (FRD), 11.82% (BLRR), −2.76% (DNN), and 19.60% (EL) compared to population size 100, and kept increasing by 6.58% (CNN), 9.1% (Kernel), 6.96% (FRD), 9.35% (BLRR), 5.97% (DNN), and 10.78% (EL) as population size reached 3000.
These results demonstrated that the required population size for GP is positively correlated with trait complexity, with more genetically complex traits demanding larger population sizes.
Comparative analysis on model performance across six datasets
Six datasets were utilized to evaluate model performance, aiming to identify the most suitable method for GP in different crops. Based on the results of the feature processing analysis, we selected the optimal LD parameter LD0.8 (LD pruning with the r2 threshold set to 0.8) to compare model performance. The number of markers retained in each of the six datasets after LD0.8 pruning is detailed in Table S3.
For the four key traits in the rice439 dataset, as illustrated in Fig. 2a and Table 2, the top-performing methods were BayesA, BayesB, and DNN for traits GL, GW, and PH, respectively. For trait GR, three methods (DNN, RNN, and RR-BLUP) achieved the same highest PCC value of 0.923. Notably, when considering the average STScore ± SD (standard deviation) values across all four traits, the top three methods were LSTM (0.996 ± 0.003), DNN (0.992 ± 0.009), and RNN (0.990 ± 0.007). The LSTM model, which ranked first, achieved STScores of 0.993, 0.999, 0.995, and 0.996 for traits GL, GR, GW, and PH, respectively. In contrast, the SVM model, which performed the worst, yielded an average STScore ± SD value of 0.882 ± 0.164; the LSTM architecture achieved a statistically significant performance improvement of 12.9% over SVM. The DNN model performed best in two of the four traits, while the RNN model attained STScores of 0.983, 1.0, 0.989, and 0.989 for traits GL, GR, GW, and PH, respectively.
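The exact STScore formula is defined in the Methods (not reproduced in this excerpt). A plausible reading, consistent with the top method scoring 1.0 for each trait in Table 2, is each method's PCC normalized by the per-trait maximum and then averaged over traits; the sketch below implements that assumption and should be treated as illustrative only.

```python
import numpy as np

def st_scores(pcc_matrix):
    """Average standardized score per method, assuming each method's PCC is
    divided by the best PCC achieved for that trait (top method -> 1.0)."""
    pcc = np.asarray(pcc_matrix, dtype=float)
    per_trait = pcc / pcc.max(axis=1, keepdims=True)
    return per_trait.mean(axis=0)

# Toy example: rows are traits, columns are methods.
scores = st_scores([[0.90, 0.85, 0.80],
                    [0.70, 0.72, 0.60]])
```

Normalizing per trait before averaging keeps traits with intrinsically lower PCC (e.g., highly polygenic ones) from dominating the method ranking.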
Fig. 2.
The PCC values of fifteen methods in the rice439 (a), maize1404 (b), wheat599 (c), tomato398 (d), soybean20087 (e), and cotton1037 (f) datasets. GL Grain length, GR Grain ratio, GW Grain width, PH Plant height, CW Cob weight, DTA Days to anthesis, DTS Days to silking, DTT Days to tasseling, EH Ear height, ELL Ear leaf length, ELW Ear leaf width
Table 2.
The STScores of fifteen GP methods across six datasets
| Dataset | Trait | LSTM | DNN | RNN | ResNet34 | ResNet18 | BayesA | LightGBM | BayesB | BayesC | RR-BLUP | BL | RF | GBLUP | XGBoost | SVM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rice439 | GL | 0.993 | 0.985 | 0.983 | 0.961 | 0.954 | 1.000 | 0.994 | 0.995 | 0.983 | 0.983 | 0.979 | 0.991 | 0.982 | 0.987 | 0.963 |
| rice439 | GR | 0.999 | 1.000 | 1.000 | 0.928 | 0.945 | 0.996 | 0.994 | 0.996 | 0.997 | 1.000 | 0.995 | 0.975 | 0.995 | 0.983 | 0.972 |
| rice439 | GW | 0.995 | 0.983 | 0.989 | 0.885 | 0.867 | 0.993 | 0.979 | 1.000 | 0.994 | 0.994 | 0.989 | 0.973 | 0.986 | 0.973 | 0.958 |
| rice439 | PH | 0.996 | 1.000 | 0.989 | 0.937 | 0.929 | 0.956 | 0.973 | 0.957 | 0.966 | 0.966 | 0.964 | 0.963 | 0.959 | 0.97 | 0.637 |
| maize1404 | CW | 0.983 | 0.998 | 1.000 | 0.869 | 0.893 | 0.959 | 0.866 | 0.958 | 0.962 | 0.960 | 0.958 | 0.862 | 0.932 | 0.833 | 0.798 |
| maize1404 | DTA | 0.786 | 0.768 | 0.538 | 0.908 | 0.904 | 0.993 | 0.937 | 1.000 | 0.993 | 0.987 | 0.989 | 0.915 | 0.928 | 0.896 | 0.855 |
| maize1404 | DTS | 0.801 | 0.786 | 0.759 | 0.912 | 0.950 | 1.000 | 0.940 | 1.000 | 0.999 | 0.999 | 0.993 | 0.930 | 0.930 | 0.908 | 0.905 |
| maize1404 | DTT | 0.815 | 0.814 | 0.829 | 0.937 | 0.936 | 0.999 | 0.938 | 1.000 | 0.997 | 0.993 | 0.995 | 0.930 | 0.925 | 0.919 | 0.853 |
| maize1404 | EH | 0.990 | 0.964 | 1.000 | 0.918 | 0.913 | 0.993 | 0.939 | 0.991 | 0.991 | 0.991 | 0.975 | 0.904 | 0.933 | 0.893 | 0.706 |
| maize1404 | ELL | 0.954 | 0.951 | 0.898 | 0.921 | 0.948 | 0.999 | 0.912 | 0.997 | 0.997 | 1.000 | 0.995 | 0.897 | 0.959 | 0.905 | 0.838 |
| maize1404 | ELW | 0.911 | 0.921 | 0.934 | 0.876 | 0.883 | 1.000 | 0.957 | 0.999 | 0.975 | 0.955 | 0.965 | 0.957 | 0.914 | 0.931 | 0.929 |
| maize1404 | PH | 0.957 | 0.960 | 0.962 | 0.971 | 0.965 | 1.000 | 0.917 | 0.998 | 0.999 | 0.998 | 0.992 | 0.908 | 0.956 | 0.907 | 0.777 |
| wheat599 | yield1 | 0.987 | 0.998 | 0.998 | 0.975 | 1.000 | 0.812 | 0.844 | 0.803 | 0.81 | 0.817 | 0.808 | 0.917 | 0.845 | 0.942 | 0.927 |
| wheat599 | yield2 | 0.995 | 1.000 | 0.977 | 0.944 | 0.963 | 0.898 | 0.845 | 0.899 | 0.9 | 0.896 | 0.898 | 0.873 | 0.902 | 0.845 | 0.905 |
| wheat599 | yield3 | 1.000 | 0.976 | 0.977 | 0.999 | 0.955 | 0.803 | 0.82 | 0.799 | 0.803 | 0.801 | 0.803 | 0.907 | 0.847 | 0.824 | 0.901 |
| wheat599 | yield4 | 1.000 | 0.953 | 0.997 | 0.966 | 0.955 | 0.795 | 0.836 | 0.781 | 0.792 | 0.798 | 0.787 | 0.873 | 0.78 | 0.806 | 0.898 |
| tomato398 | 1-nitro-2-phenylethane | 1.000 | 0.878 | 0.813 | 0.875 | 0.704 | 0.598 | 0.903 | 0.592 | 0.600 | 0.605 | 0.599 | 0.600 | 0.558 | 0.575 | 0.539 |
| tomato398 | fructose | 1.000 | 0.955 | 0.945 | 0.837 | 0.895 | 0.761 | 0.735 | 0.761 | 0.764 | 0.763 | 0.766 | 0.756 | 0.766 | 0.727 | 0.681 |
| tomato398 | glucose | 0.988 | 1.000 | 0.964 | 0.866 | 0.938 | 0.750 | 0.707 | 0.751 | 0.753 | 0.757 | 0.758 | 0.699 | 0.756 | 0.643 | 0.676 |
| tomato398 | isovaleraldehyde | 1.000 | 0.924 | 0.949 | 0.978 | 0.975 | 0.811 | 0.873 | 0.805 | 0.810 | 0.804 | 0.816 | 0.860 | 0.820 | 0.874 | 0.742 |
| tomato398 | phenylacetaldehyde | 1.000 | 0.984 | 0.953 | 0.783 | 0.652 | 0.755 | 0.915 | 0.759 | 0.758 | 0.759 | 0.762 | 0.847 | 0.763 | 0.843 | 0.632 |
| tomato398 | soluble-solids | 0.977 | 1.000 | 0.976 | 0.812 | 0.818 | 0.772 | 0.792 | 0.767 | 0.767 | 0.767 | 0.764 | 0.766 | 0.769 | 0.757 | 0.710 |
| soybean20087 | FT | 0.978 | 0.979 | 0.977 | 0.958 | 0.964 | 0.990 | 1.000 | 0.990 | 0.990 | 0.989 | 0.988 | 0.975 | 0.986 | 0.997 | - |
| soybean20087 | PH | 0.998 | 0.989 | 1.000 | 0.975 | 0.979 | 0.991 | 0.994 | 0.991 | 0.987 | 0.980 | 0.987 | 0.982 | 0.966 | 0.992 | - |
| soybean20087 | SOC | 0.993 | 0.994 | 0.994 | 0.985 | 0.986 | 1.000 | 0.997 | 1.000 | 0.999 | 0.999 | 1.000 | 0.993 | 0.996 | 0.998 | - |
| soybean20087 | SPC | 0.988 | 0.990 | 0.985 | 0.964 | 0.966 | 0.997 | 0.977 | 0.997 | 0.995 | 0.993 | 0.997 | 0.977 | 1.000 | 0.990 | - |
| soybean20087 | SW | 0.999 | 1.000 | 1.000 | 0.988 | 0.991 | 0.997 | 0.989 | 0.997 | 0.997 | 0.997 | 0.997 | 0.967 | 0.995 | 0.986 | - |
| soybean20087 | YIELD | 0.998 | 0.994 | 1.000 | 0.984 | 0.983 | 0.988 | 1.000 | 0.987 | 0.988 | 0.987 | 0.988 | 0.987 | 0.997 | 0.996 | - |
| cotton1037 | LPA | 0.968 | 0.951 | 0.967 | 0.846 | 0.903 | 0.997 | 0.994 | 0.997 | 0.933 | 0.902 | 0.917 | 1 | 0.896 | 0.953 | 0.700 |
Eight agronomic traits were assessed in the maize1404 dataset, spanning flowering time, developmental, and ear-related traits (Fig. 2b; Table 2). The top three methods were BayesB (0.993 ± 0.014), BayesA (0.993 ± 0.014), and BayesC (0.989 ± 0.013). Among these, BayesB and BayesA each demonstrated superior performance in three of the eight traits. Specifically, the BayesB method achieved STScores of 0.958, 1.0, 1.0, 1.0, 0.991, 0.997, 0.999, and 0.998 for the respective traits, while BayesA attained STScores of 0.959, 0.993, 1.0, 0.999, 0.993, 0.999, 1.0, and 1.0. Furthermore, BayesB and BayesA both showed a significant improvement of 19.2% compared with SVM, which exhibited the lowest performance with an average STScore ± SD value of 0.833 ± 0.072.
For the wheat599 dataset, yield traits measured in four environments were analyzed, as detailed in Fig. 2c and Table 2. The top-performing methods were LSTM, RNN, and DNN, achieving average STScore ± SD values of 0.995 ± 0.006, 0.987 ± 0.012, and 0.982 ± 0.022, respectively. These were followed by the CNN models, ResNet34 and ResNet18, with average STScore ± SD values of 0.971 ± 0.023 and 0.968 ± 0.022. Notably, the lowest-performing methods were BL and BayesB, with average STScore ± SD values of 0.824 ± 0.050 and 0.821 ± 0.053, respectively; compared to BayesB, LSTM demonstrated a significant improvement of 21.19%. Furthermore, the XGBoost architecture exhibited the highest standard deviation (SD = 0.061) in prediction accuracy among all models, indicating markedly lower robustness for wheat yield prediction than the other approaches.
In addition to agronomic traits, flavor traits were also used to evaluate model performance. For the tomato398 dataset, six flavor traits were considered. The comparison results, as depicted in Fig. 2d and Table 2, indicated that LSTM, DNN, and RNN were the top-performing methods, achieving average STScore ± SD values of 0.994 ± 0.010, 0.957 ± 0.048, and 0.933 ± 0.060, respectively. Notably, the LSTM demonstrated superior performance in four of the six traits (1-nitro-2-phenylethane, fructose, isovaleraldehyde, and phenylacetaldehyde), highlighting its high accuracy for predicting tomato flavor traits. The DNN performed best in the remaining two traits, glucose and soluble solids. These were followed by the CNN models, ResNet34 and ResNet18, with average STScore ± SD values of 0.859 ± 0.068 and 0.830 ± 0.130. The poorest-performing method was SVM, with an average STScore ± SD value of 0.663 ± 0.071; compared to SVM, the LSTM demonstrated a significant improvement of 49.92%.
The large-scale soybean20087 dataset was utilized to evaluate model performance across six traits: PH, FT, SOC, SPC, SW, and yield. Interestingly, the results, as illustrated in Fig. 2e and Table 2, revealed that all methods exhibited minimal differences in average STScores. The top-performing method was BayesA, with an STScore of 0.994, while the poorest-performing method was ResNet34, with an STScore of 0.976. Additionally, the SD values for these fourteen methods (the SVM method was excluded from this analysis due to the substantial computational resources and time required) were consistently low, ranging from 0.005 to 0.013, indicating high prediction stability across the six traits. This methodological stability can be primarily explained by the substantial sample size of soybean20087, which provides sufficient statistical power for these complex agronomic traits and thereby ensures consistent robustness across all analytical approaches.
In the cotton1037 dataset, only one key agronomic trait, leaf pubescence amount (LPA), was evaluated to compare the performance of the fifteen GP methods. The results, as illustrated in Fig. 2f and Table 2, revealed that RF, BayesB, BayesA, and LightGBM were the top-performing methods, achieving STScores of 1.0, 0.997, 0.997, and 0.994, respectively. These were followed by LSTM and RNN models, with STScores of 0.968 and 0.967. In contrast, the SVM was the poorest-performing method, with an STScore of 0.70.
The advantage of LSTM in genome analysis
Considering the results across all six datasets, the top-performing methods were LSTM, DNN, and RNN, with average STScores of 0.967, 0.955, and 0.943, respectively, as described in Table S5. Among these, the LSTM architecture demonstrated superior performance in three of the six datasets, specifically rice439, wheat599, and tomato398. The LSTM network is a specialized type of RNN designed to address the vanishing gradient problem inherent in standard RNNs, as described in Fig. 3. The core structure of LSTM revolves around a memory cell regulated by three adaptive gates: the forget gate, input gate, and output gate. The forget gate determines which information to discard from the previous cell state. The input gate controls how much of a candidate state, computed from the current input and the previous hidden state, is written into the cell state; the cell state is then updated by combining the gated previous state with the gated candidate state. Finally, the output gate generates the new hidden state from the updated cell state, which serves as input for the subsequent time step. This architecture makes LSTM particularly powerful for modeling complex sequential data with long-term dependencies.
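The gate updates described above can be written compactly. The following numpy sketch implements a single LSTM step from first principles with random toy weights (not trained parameters, and not the study's implementation), so each gate's role is explicit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of the forget (f),
    input (i), candidate (g), and output (o) transforms."""
    z = W @ x + U @ h_prev + b                # shape (4 * hidden,)
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)                            # candidate state
    c = f * c_prev + i * g                    # forget old info, write new info
    h = o * np.tanh(c)                        # output gate -> new hidden state
    return h, c

# Tiny demo: feed 5 steps of 4 SNPs each through one cell (random toy weights).
rng = np.random.default_rng(1)
hid, inp = 3, 4
W = rng.normal(size=(4 * hid, inp))
U = rng.normal(size=(4 * hid, hid))
b = np.zeros(4 * hid)
h, c = np.zeros(hid), np.zeros(hid)
for x in rng.integers(0, 3, size=(5, inp)).astype(float):
    h, c = lstm_step(x, h, c, W, U, b)
```

Because the cell state `c` is carried forward additively, gradients can flow across many time steps, which is the mechanism behind LSTM's resistance to vanishing gradients.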
Fig. 3.
The architecture of LSTM network. Xt, the input features at time step t. Ot, the output at time step t. Ht−2, the hidden state at time step t−2. Ct−2, the candidate state at time step t−2
Quantitative traits are often controlled by multiple quantitative trait loci (QTLs), which exhibit epistatic interactions and additive effects. We proposed that LSTM can effectively capture and process information about epistatic interactions and additive effects among QTLs, as well as relationships among SNPs, during training, making it highly suited to genomic data. The forget gate selectively retains biologically relevant genetic associations while filtering out unrelated information before passing it to the next cell state. The input gate updates the candidate cell state by merging novel genetic information with conserved relationships from the previous hidden state. Finally, the output gate generates the updated hidden state containing the refined genetic information for downstream propagation. To substantiate this claim, we compared the performance of using the outputs from all cell states with that of using the output from only the last cell state for phenotype prediction. The results, as illustrated in Fig. 4, demonstrated that the average PCC for quantitative traits in each dataset was significantly higher when utilizing the outputs from all cell states: improvements were 38.50% in rice439, 241.11% in maize1404, 20.47% in wheat599, 494.17% in cotton1037, 99.55% in tomato398, and 22.08% in soybean20087. These results indicated that each cell-state output retains valuable features of the genomic data, such as epistatic interactions and additive effects, whereas the last cell state alone discards this critical information. This unique structure gives LSTM a significant advantage in handling the complexities of genomic data, further validating its suitability for genomic prediction tasks.
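The two readout strategies compared here can be sketched minimally, assuming a linear prediction head on top of the recurrent outputs (the weight shapes and names are illustrative, not the study's implementation):

```python
import numpy as np

# hs: hidden-state outputs from every LSTM time step, shape (n_steps, hidden).
def readout_all(hs, w):
    """Predict from the concatenated outputs of all cell states."""
    return hs.reshape(-1) @ w            # w: (n_steps * hidden,)

def readout_last(hs, w_last):
    """Predict from the output of the last cell state only."""
    return hs[-1] @ w_last               # w_last: (hidden,)

# Toy example: 3 time steps, hidden size 2, unit weights.
hs = np.arange(6, dtype=float).reshape(3, 2)
y_all = readout_all(hs, np.ones(6))      # uses every step's output
y_last = readout_last(hs, np.ones(2))    # discards all but the final step
```

The all-states readout exposes `n_steps * hidden` features to the prediction head instead of `hidden`, which is consistent with the interpretation that intermediate cell states carry SNP information the final state alone would discard.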
Fig. 4.
The PCC values of LSTM model across the six datasets. All cell state: using all the output of cell states to predict phenotype; Last cell state: using the output of the last cell state to predict phenotype
Discussion
Genomic selection (GS), which utilizes GP approaches to identify superior candidate individuals, has undergone remarkable advancements over the past two decades (Alemu et al. 2024; Farooq et al. 2024). However, determining how to effectively use these GP methods for individual selection based on genomic information remains an urgent challenge (Zou et al. 2019). The accuracy of GP is profoundly influenced by several key factors, including feature processing methods, marker density, population size, genetic diversity, and model architecture (Alemu et al. 2024).
Recent studies have demonstrated that selecting optimal marker subsets can significantly enhance GP accuracy (Bermingham et al. 2015; Weber et al. 2023). Wang et al. (2023) successfully applied PCA to reduce SNP marker density in GP. Another widely used feature selection method in genomic prediction is LD pruning, which minimizes redundancy among correlated markers (Weber et al. 2023; Alemu et al. 2023). Zhao et al. (2025) introduced a novel feature extraction method, Pearson-collinearity selection (PCS), which combines Pearson correlation analysis with collinearity removal. To systematically evaluate the impact of feature processing methods on model performance, we compared fifteen GP methods across three DR approaches (PCA, LD, and VC) in the rice439 dataset. Our findings revealed that the FRD methods (LSTM, RNN, and GBLUP) as well as the DNN exhibited high sensitivity to feature types, performing best with SNP features. This aligns with their underlying architectures: FRD methods rely on feature relationships, and DNNs, as feed-forward networks, are well suited to the 0–1–2 SNP encoding. Furthermore, the CNN architecture should be carefully designed to prevent over-fitting when processing SNP feature data. These results provide valuable insights into feature processing strategies for different GP methods, enabling more efficient and accurate genomic prediction in breeding programs.
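As an illustration of PCA-based feature extraction (the PCA0.85–PCA0.99 settings evaluated above), the following numpy sketch projects centered 0/1/2 genotypes onto the fewest principal components whose cumulative explained variance reaches the chosen fraction. This is a generic sketch of the technique, not the authors' code:

```python
import numpy as np

def pca_features(genotypes, var_kept=0.85):
    """Project 0/1/2 genotypes onto the top principal components that together
    explain at least `var_kept` of the variance (cf. PCA0.85..PCA0.99)."""
    X = genotypes - genotypes.mean(axis=0)         # center each SNP column
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()          # per-component variance share
    k = int(np.searchsorted(np.cumsum(explained), var_kept)) + 1
    return U[:, :k] * s[:k]                        # PC scores, (n_samples, k)

# Demo: 30 individuals, 12 SNPs, keep components covering 85% of variance.
G = np.random.default_rng(2).integers(0, 3, size=(30, 12)).astype(float)
Z = pca_features(G, var_kept=0.85)
```

Unlike SNP filtering, the extracted components are linear mixtures of markers, which destroys marker-level relationships; this is one plausible explanation for why the feature-relationship-dependent methods in this study preferred filtered SNPs over PCA features.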
Marker density, which is determined by the feature selection method, is crucial in GP, as most SNPs in large genomic datasets are phenotypically neutral, with only a limited subset contributing to specific traits (Bermingham et al. 2015; Weber et al. 2023; Zhang et al. 2017). Our analysis demonstrated that prediction accuracy increased only gradually once the r2 threshold exceeded 0.4, suggesting that r2 = 0.4 represents the optimal marker density threshold for rice genomic prediction. This evidenced that although a sufficient number of genetic markers is necessary for models to accurately characterize complex trait inheritance patterns, the improvement in accuracy becomes marginal once the marker number surpasses a critical threshold, particularly when the most contributing QTL markers of a target trait are included in the prediction model (Alemu et al. 2024). Furthermore, the optimal marker density in GP models depends on both the number and effects of QTLs underlying the target trait and the extent of LD across the genome (Kaler et al. 2022). Traits controlled by numerous QTLs coupled with low genome-wide LD generally require a higher density of SNP markers to achieve accurate prediction (Kaler et al. 2022). Selecting a model-specific optimal marker density can achieve an optimal balance between prediction accuracy and computational efficiency.
Several studies have established that population size substantially affects GP accuracy, with prediction performance generally improving with larger sample sizes until reaching a plateau at an optimal population threshold (Lorenzana and Bernardo 2009; Isidro et al. 2015; Arruda et al. 2015). Identifying the optimal training population size is crucial for maximizing prediction accuracy while minimizing phenotyping costs. Arruda et al. (2015) suggested that a training population of 192 lines achieves an optimal balance between prediction accuracy of Fusarium head blight resistance and breeding costs in wheat when employing RR-BLUP, LASSO, and elastic net methods. In contrast, our study identified that the optimal population size required for GP methods depends on the genetic complexity of the traits. Shook et al. (2021) conducted a meta-GWAS analysis involving 73 published studies in soybean, covering 17,566 unique soybean accessions sourced from the USDA Soybean Germplasm Collection and genotyped with the SoySNP50K iSelect BeadChip. The results demonstrated that yield, governed by numerous genes and strongly influenced by the environment, exhibits high genetic complexity. Notably, pleiotropic effects were observed, with gene colocalization between yield and other traits, including flowering date, branching, and plant height. In contrast, flowering time showed a relatively simpler genetic architecture, with major-effect loci (e.g., E2) playing predominant roles in its regulation. Our results demonstrated that the FT, SW, and SOC traits showed only modest improvements once population size reached 800, whereas yield exhibited continued improvement without reaching an inflection point, even at a population size of 3000. In addition, although increasing the population size may benefit complex traits, it may also shorten the LD decay distance, thereby increasing the demand for marker density.
Our results demonstrated that the outputs from all LSTM cell states are critical for achieving high prediction accuracy. However, different population backgrounds may generate distinct genomic epistatic effects, which may require adjusting the cell-state configuration when building LSTM models.
We comprehensively compared the performance of the fifteen GP methods across six datasets (wheat, maize, rice, soybean, cotton, and tomato) to identify optimal prediction algorithms. Xie et al. (2024) employed rice and wheat datasets to assess their ResGS model, and Gao et al. (2023) expanded this approach to five crops (soybean, cotton, tomato, maize, and rice) for testing SoyDNGP. Our results revealed that the LSTM network achieved superior performance in three of the six crop datasets and robust performance across the remaining datasets.
The unique architecture of LSTM gives it a great advantage in processing genomic data, as it can capture epistatic interactions, additive effects, and relationships among SNPs during model training. LSTM architectures have been successfully applied in various bioinformatics domains, including protein engineering (Alley et al. 2019; Heffernan et al. 2017), promoter prediction (Oubounyt et al. 2019), single-cell type detection (Jiang et al. 2023), and non-coding Y RNA prediction (Lima et al. 2023). Notably, Alley et al. (2019) proposed UniRep, an LSTM-based model that distills essential protein features into compact statistical representations, and Heffernan et al. (2017) found that LSTM-based approaches (SPIDER3) substantially improved residue-wise protein structure prediction accuracy over conventional methods such as SPIDER2. However, LSTM-based approaches remain underutilized in GP, with DeepCCR (Ma et al. 2024) being one of the few implementations combining CNN with LSTM. The demonstrated advantage of the LSTM architecture across multiple crop species suggests substantial potential for improving prediction accuracy in GS programs.
Supplementary Information
Below is the link to the electronic supplementary material.
Acknowledgements
This work is supported by the Biological Breeding-Major Projects (2023ZD04076); the Nanfan Special Project, CAAS (YBXM2421); the Agricultural Science and Technology Innovation Program (CAASZDRW202503); and the Chinese Academy of Agricultural Sciences (CAAS-ASTIP-201X-CNRRI).
Abbreviations
- GP
Genomic prediction
- GS
Genomic selection
- SNP
Single nucleotide polymorphism
- PCA
Principal component analysis
- DR
Dimensionality reduction
- VC
Variance-correlation
- DL
Deep learning
- ML
Machine learning
- BL
Bayesian LASSO
- GBLUP
Genomic best linear unbiased prediction
- RR-BLUP
Ridge regression best linear unbiased prediction
- LightGBM
Light gradient boosting machine
- XGBoost
Extreme gradient boosting
- SVM
Support vector machines
- RF
Random forest
- LSTM
Long short-term memory networks
- RNN
Recurrent neural networks
- DNN
Deep neural networks
- CNN
Convolutional neural networks
- PCC
Pearson correlation coefficient
- STScore
Standardized score
- QTL
Quantitative trait locus
- PH
Plant height
- GL
Grain length
- GW
Grain width
- GR
Grain ratio
- ELL
Ear leaf length
- ELW
Ear leaf width
- CW
Cob weight
- DTS
Days to silking
- DTT
Days to tasseling
- DTA
Days to anthesis
- EH
Ear height
- LPA
Leaf pubescence amount
- FT
Flowering time
- SOC
Seed oil content
- SPC
Seed protein content
- SW
Seed weight
- EL
Ensemble learning cluster
- BLRR
Bayesian linear regression and RR-BLUP cluster
- FRD
Feature relationship dependence cluster
Author contributions
RQP, XHW and MCZ designed and supervised this study. RQP and YLY performed experiments. YYZ and SPH performed data collection. QX, YF, JYC and WL participated in data analysis. RQP wrote the manuscript, MCZ revised the paper. All authors read and approved the manuscript.
Funding
Biological Breeding-Major Projects, 2023ZD04076, Yaolong Yang and Yuanyuan Zhang; Nanfan special project, CAAS(YBXM2421), Xinghua Wei; The Agricultural Science and Technology Innovation Program, CAASZDRW202503, Mengchen Zhang, Chinese Academy of Agricultural Sciences, CAAS-ASTIP-201X-CNRRI, Xinghua Wei.
Data availability
The script and a subset of training data are available on GitHub: https://github.com/CNRRI-RGRT/GPcompare. The genomic and phenotypic data of the rice439 dataset were sourced from Ye et al. (2022) (NCBI BioProject PRJNA880974). The maize1404 dataset was obtained from Liu et al. (2020) (NCBI BioProject PRJNA597703). The VCF and phenotypic data for the cotton1037 dataset were obtained from the corresponding author of Wang et al. (2024). The wheat599 dataset was downloaded from the BGLR package for R, version 1.1.3 (Pérez and de los Campos 2014). The tomato398 dataset was obtained from the author of Tieman et al. (2017). The soybean20087 dataset was downloaded from the SoyBase database (https://data.soybase.org/Glycine/max/diversity/) and the GRIN-Global database (https://npgsweb.ars-grin.gov/gringlobal/search/).
Declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Xinghua Wei, Email: weixinghua@caas.cn.
Mengchen Zhang, Email: zhangmengchen@caas.cn.
References
- Abdel-Hamid O, Mohamed AR, Jiang H, Penn G (2012) Applying convolutional neural networks concepts to hybrid Nn-Hmm model for speech recognition. Int Conf Acoust Spee. 10.1109/icassp.2012.6288864 [Google Scholar]
- Alemu A, Batista L, Singh PK, Ceplitis A, Chawade A (2023) Haplotype-tagged SNPs improve genomic prediction accuracy for Fusarium head blight resistance and yield-related traits in wheat. Theor Appl Genet. 10.1007/s00122-023-04352-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Alemu A, Astrand J, Montesinos-Lopez OA, Isidro YSJ, Fernandez-Gonzalez J, Tadesse W, Vetukuri RR, Carlsson AS, Ceplitis A, Crossa J, Ortiz R, Chawade A (2024) Genomic selection in plant breeding: key factors shaping two decades of progress. Mol Plant 17(4):552–578. 10.1016/j.molp.2024.03.007 [DOI] [PubMed] [Google Scholar]
- Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM (2019) Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16(12):1315. 10.1038/s41592-019-0598-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Annicchiarico P, Nazzicari N, Li XH, Wei YL, Pecetti L, Brummer EC (2015) Accuracy of genomic selection for alfalfa biomass yield in different reference populations. BMC Genomics. 10.1186/s12864-015-2212-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- Arruda MP, Brown PJ, Lipka AE, Krill AM, Thurber C, Kolb FL (2015) Genomic selection for predicting head blight resistance in a wheat breeding program. Plant Genome-Us. 10.3835/plantgenome2015.01.0003 [Google Scholar]
- Bermingham ML, Pong-Wong R, Spiliopoulou A, Hayward C, Rudan I, Campbell H, Wright AF, Wilson JF, Agakov F, Navarro P, Haley CS (2015) Application of high-dimensional feature selection: evaluation for genomic prediction in man. Sci Rep. 10.1038/srep10312 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bhering LL, Junqueira VS, Peixoto LA, Cruz CD, Laviola BG (2015) Comparison of methods used to identify superior individuals in genomic selection in plant breeding. Genet Mol Res 14(3):10888–10896. 10.4238/2015.September.9.26 [DOI] [PubMed] [Google Scholar]
- Browning BL, Tian XW, Zhou Y, Browning SR (2021) Fast two-stage phasing of large-scale sequence data. Am J Hum Genet 108(10):1880–1890. 10.1016/j.ajhg.2021.08.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen TQ, Guestrin C (2016). XGBoost: a scalable tree boosting system. Kdd'16: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining pp. 785–794. 10.1145/2939672.2939785
- Crossa J, Pérez-Rodríguez P, Cuevas J, Montesinos-López O, Jarquín D, de los Campos G, Burgueño J, González-Camacho JM, Pérez-Elizalde S, Beyene Y, Dreisigacker S, Singh R, Zhang XC, Gowda M, Roorkiwal M, Rutkoski J, Varshney RK (2017) Genomic selection in plant breeding: methods, models, and perspectives. Trends Plant Sci 22(11):961–975. 10.1016/j.tplants.2017.08.011 [DOI] [PubMed] [Google Scholar]
- Crossa J, Montesinos-Lopez OA, Costa-Neto G, Vitale P, Martini JWR, Runcie D, Fritsche-Neto R, Montesinos-Lopez A, Perez-Rodriguez P, Gerard G, Dreisigacker S, Crespo-Herrera L, Pierre CS, Lillemo M, Cuevas J, Bentley A, Ortiz R (2024) Machine learning algorithms translate big data into predictive breeding accuracy. Trends Plant Sci. 10.1016/j.tplants.2024.09.011 [DOI] [PubMed] [Google Scholar]
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, Genomes Project Anal G (2011) The variant call format and VCFtools. Bioinformatics 27(15):2156–2158. 10.1093/bioinformatics/btr330 [DOI] [PMC free article] [PubMed] [Google Scholar]
- de los Campos G, Gianola D, Rosa GJM, Weigel KA, Crossa J (2010) Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet Res 92(4):295–308. 10.1017/S0016672310000285 [Google Scholar]
- Endelman JB (2011) Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4(3):250–255. 10.3835/plantgenome2011.08.0024 [Google Scholar]
- Farooq MA, Gao S, Hassan MA, Huang Z, Rasheed A, Hearne S, Prasanna B, Li X, Li H (2024) Artificial intelligence in plant breeding. Trends Genet 40(10):891–908. 10.1016/j.tig.2024.07.001 [DOI] [PubMed] [Google Scholar]
- Galvez RL, Bandala AA, Dadios EP, Vicerra RRP, Maningo JMZ (2018) Object detection using convolutional neural networks. Tencon Ieee Region. 10.1109/TENCON.2018.8650517 [Google Scholar]
- Gao P, Zhao H, Luo Z, Lin Y, Feng W, Li Y, Kong F, Li X, Fang C, Wang X (2023) SoyDNGP: a web-accessible deep learning framework for genomic prediction in soybean breeding. Brief Bioinform. 10.1093/bib/bbad349 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gill M, Anderson R, Hu H, Bennamoun M, Petereit J, Valliyodan B, Nguyen HT, Batley J, Bayer PE, Edwards D (2022) Machine learning models outperform deep learning models, provide interpretation and facilitate feature selection for soybean trait prediction. BMC Plant Biol 22(1):180. 10.1186/s12870-022-03559-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grenier C, Cao TV, Ospina Y, Quintero C, Châtel MH, Tohme J, Courtois B, Ahmadi N (2016) Accuracy of genomic selection in a rice synthetic population developed for recurrent selection breeding. PLoS ONE. 10.1371/journal.pone.0154976
- Heffernan R, Yang YD, Paliwal K, Zhou YQ (2017) Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility. Bioinformatics 33(18):2842–2849. 10.1093/bioinformatics/btx218
- Isidro J, Jannink JL, Akdemir D, Poland J, Heslot N, Sorrells ME (2015) Training set optimization under population structure in genomic selection. Theor Appl Genet 128(1):145–158. 10.1007/s00122-014-2418-4
- Jeong S, Kim JY, Kim N (2020) GMStool: GWAS-based marker selection tool for genomic prediction from genomic data. Sci Rep 10(1):19653. 10.1038/s41598-020-76759-y
- Jia YQ, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 2014 ACM Conference on Multimedia (MM '14), pp 675–678. 10.1145/2647868.2654889
- Jiang HJ, Huang YB, Li QP, Feng BY (2023) scLSTM: single-cell type detection by Siamese recurrent network and hierarchical clustering. BMC Bioinformatics. 10.1186/s12859-023-05494-8
- Kaler AS, Purcell LC, Beissinger T, Gillman JD (2022) Genomic prediction models for traits differing in heritability for soybean, rice, and maize. BMC Plant Biol 22(1):87. 10.1186/s12870-022-03479-y
- Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10.1109/CVPR.2014.223
- Ke GL, Meng Q, Finley T, Wang TF, Chen W, Ma WD, Ye QW, Liu TY (2017) LightGBM: a highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems 30 (NIPS 2017)
- Khaki S, Wang L, Archontoulis SV (2019) A CNN-RNN framework for crop yield prediction. Front Plant Sci 10:1750. 10.3389/fpls.2019.01750
- Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
- Lima DD, Amichi LJA, Fernandez MA, Constantino AA, Seixas FAV (2023) NCYPred: a bidirectional LSTM network with attention for Y RNA and short non-coding RNA classification. IEEE/ACM Trans Comput Biol Bioinform 20(1):557–565. 10.1109/TCBB.2021.3131136
- Liu Y, Wang D, He F, Wang J, Joshi T, Xu D (2019) Phenotype prediction and genome-wide association study using deep convolutional neural network of soybean. Front Genet. 10.3389/fgene.2019.01091
- Liu H-J, Wang X, Xiao Y, Luo J, Qiao F, Yang W, Zhang R, Meng Y, Sun J, Yan S, Peng Y, Niu L, Jian L, Song W, Yan J, Li C, Zhao Y, Liu Y, Warburton ML, Zhao J, Yan J (2020) CUBIC: an atlas of genetic architecture promises directed maize improvement. Genome Biol. 10.1186/s13059-020-1930-x
- López OMA, González BAM, López AM, Crossa J (2023) Statistical machine-learning methods for genomic prediction using the SKM library. Genes (Basel). 10.3390/genes14051003
- Lorenzana RE, Bernardo R (2009) Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor Appl Genet 120(1):151–161. 10.1007/s00122-009-1166-3
- Ma W, Qiu Z, Song J, Li J, Cheng Q, Zhai J, Ma C (2018) A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta 248(5):1307–1318. 10.1007/s00425-018-2976-9
- Ma X, Wang H, Wu S, Han B, Cui D, Liu J, Zhang Q, Xia X, Song P, Tang C, Geng L, Yang Y, Yan S, Zhou K, Han L (2024) DeepCCR: large-scale genomics-based deep learning method for improving rice breeding. Plant Biotechnol J. 10.1111/pbi.14384
- McLaren CG, Bruskiewich RM, Portugal AM, Cosico AB (2005) The International Rice Information System: a platform for meta-analysis of rice crop data. Plant Physiol 139(2):637–642. 10.1104/pp.105.063438
- Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157(4):1819–1829
- Oubounyt M, Louadi Z, Tayara H, Chong KT (2019) DeePromoter: robust promoter predictor using deep learning. Front Genet. 10.3389/fgene.2019.00286
- Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
- Pérez P, de los Campos G (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics 198(2):483–495. 10.1534/genetics.114.164442
- Postman J, Hummer K, Ayala-Silva T, Bretting P, Franko T, Kinard G, Bohning M, Emberland G, Sinnott Q, Mackay M, Cyr P, Millard M, Gardner C, Guarino L, Weaver B (2010) GRIN-Global: an international project to develop a global plant genebank information management system. Acta Hortic 859:49–55
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575. 10.1086/519795
- Shen Z, Shen E, Zhu Q-H, Fan L, Zou Q, Ye C-Y (2023) GSCTool: a novel descriptor that characterizes the genome for applying machine learning in genomics. Adv Intell Syst. 10.1002/aisy.202300426
- Shook JM, Zhang J, Jones SE, Singh A, Diers BW, Singh AK, Brown P (2021) Meta-GWAS for quantitative trait loci identification in soybean. G3 Genes|Genomes|Genetics. 10.1093/g3journal/jkab117
- Tasdelen A, Sen BH (2021) A hybrid CNN-LSTM model for pre-miRNA classification. Sci Rep. 10.1038/s41598-021-93656-0
- Tieman D, Zhu GT, Resende MFR, Lin T, Taylor M, Zhang B, Ikeda H, Liu ZY, Fisher J, Zemach I, Monforte AJ, Zamir D, Granell A, Kirst M, Huang S, Klee H, Nguyen C, Bies D, Rambla JL, Beltran KSO (2017) A chemical genetic roadmap to improved tomato flavor. Science 355(6323):391–394. 10.1126/science.aal1556
- VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91(11):4414–4423. 10.3168/jds.2007-0980
- Wang K, Abid MA, Rasheed A, Crossa J, Hearne S, Li H (2023) DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Mol Plant 16(1):279–293. 10.1016/j.molp.2022.11.004
- Wang X, Dai P, Li H, Wang J, Gao X, Wang Z, Peng Z, Tian C, Fu G, Hu D, Chen B, Xing A, Tian Y, Nazir MF, Ma X, Rong J, Liu F, Du X, He S (2024) The genetic basis of leaf hair development in upland cotton (Gossypium hirsutum). Plant J 120(2):729–747. 10.1111/tpj.17017
- Weber SE, Frisch M, Snowdon RJ, Voss-Fels KP (2023) Haplotype blocks for genomic prediction: a comparative evaluation in multiple crop datasets. Front Plant Sci. 10.3389/fpls.2023.1217589
- Wu C, Zhang Y, Ying Z, Li L, Wang J, Yu H, Zhang M, Feng X, Wei X, Xu X (2023) A transformer-based genomic prediction method fused with knowledge-guided module. Brief Bioinform. 10.1093/bib/bbad438
- Xie Z, Xu X, Li L, Wu C, Ma Y, He J, Wei S, Wang J, Feng X (2024) Residual networks without pooling layers improve the accuracy of genomic predictions. Theor Appl Genet 137(6):138. 10.1007/s00122-024-04649-2
- Yan J, Xu Y, Cheng Q, Jiang S, Wang Q, Xiao Y, Ma C, Yan J, Wang X (2021) LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biol 22(1):271. 10.1186/s13059-021-02492-y
- Ye J, Zhang M, Yuan X, Hu D, Zhang Y, Xu S, Li Z, Li R, Liu J, Sun Y, Wang S, Feng Y, Xu Q, Yang Y, Wei X (2022) Genomic insight into genetic changes and shaping of major inbred rice cultivars in China. New Phytol 236(6):2311–2326. 10.1111/nph.18500
- Zhang A, Wang HW, Beyene Y, Semagn K, Liu YB, Cao SL, Cui ZH, Ruan YY, Burgueño J, San Vicente F, Olsen M, Prasanna BM, Crossa J, Yu HQ, Zhang XC (2017) Effect of trait heritability, training population size and marker density on genomic prediction accuracy estimation in 22 bi-parental tropical maize populations. Front Plant Sci. 10.3389/fpls.2017.01916
- Zhao LY, Tang P, Luo JJ, Liu JX, Peng X, Shen MY, Wang CR, Zhao JL, Zhou DG, Fan ZL, Chen YB, Wang RF, Tang XY, Xu Z, Liu Q (2025) Genomic prediction with NetGP based on gene network and multi-omics data in plants. Plant Biotechnol J. 10.1111/pbi.14577
- Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A (2019) A primer on deep learning in genomics. Nat Genet 51(1):12–18. 10.1038/s41588-018-0295-5
Data Availability Statement
The scripts and a subset of the training data are available on GitHub: https://github.com/CNRRI-RGRT/GPcompare. The genomic and phenotypic data of the rice439 dataset were sourced from Ye et al. (2022) (NCBI BioProject PRJNA880974). The Maize1404 dataset was obtained from Liu et al. (2020) (NCBI BioProject PRJNA597703). The VCF and phenotypic data for the cotton1037 dataset were obtained from the corresponding author of Wang et al. (2024). The Wheat599 dataset was downloaded from the BGLR package for R, version 1.1.3 (Pérez and de los Campos 2014). The Tomato398 dataset was obtained from the author of Tieman et al. (2017). The Soybean20087 dataset was downloaded from the SoyBase database (https://data.soybase.org/Glycine/max/diversity/) and the GRIN-Global database (https://npgsweb.ars-grin.gov/gringlobal/search/).




