Abstract
Genomic selection (GS) and phenotypic selection (PS) are widely used for accelerating plant breeding. However, the accuracy, robustness, and transferability of these two selection methods are underexplored, especially when addressing complex traits. In this study, we introduce a novel data fusion framework, GPS (genomic and phenotypic selection), designed to enhance predictive performance by integrating genomic and phenotypic data through three distinct fusion strategies: data fusion, feature fusion, and result fusion. The GPS framework was rigorously tested using an extensive suite of models, including statistical approaches (GBLUP and BayesB), machine learning models (Lasso, RF, SVM, XGBoost, and LightGBM), a deep learning method (DNNGP), and a recent phenotype-assisted prediction model (MAK). These models were applied to large datasets from four crop species, maize, soybean, rice, and wheat, demonstrating the versatility and robustness of the framework. Our results indicated that: (1) data fusion achieved the highest accuracy compared with the feature fusion and result fusion strategies. The top-performing data fusion model (Lasso_D) improved the selection accuracy by 53.4% compared to the best GS model (LightGBM) and by 18.7% compared to the best PS model (Lasso). (2) Lasso_D exhibited exceptional robustness, achieving high predictive accuracy even with a sample size as small as 200 and demonstrating resilience to single-nucleotide polymorphism (SNP) density variations, underscoring its adaptability to diverse data conditions. Moreover, the model’s accuracy improved with the number of auxiliary traits and their correlation strength with target traits, further highlighting its adaptability to complex trait prediction. (3) Lasso_D demonstrated broad transferability, with substantial improvements in predictive accuracy when incorporating multi-environmental data. 
This enhancement resulted in only a 0.3% reduction in accuracy compared to predictions generated using data from the same environment, affirming the model’s reliability in cross-environmental scenarios. This study provides groundbreaking insights, pushing the boundaries of predictive accuracy, robustness, and transferability in trait prediction. These findings represent a significant contribution to plant science, plant breeding, and the broader interdisciplinary fields of statistics and artificial intelligence.
Key words: crop, data fusion framework, genomic selection, phenotypic selection, cross-environment prediction
This study presents a novel data fusion framework, GPS, that integrates genomic and phenotypic data to enhance the accuracy, robustness, and transferability of trait prediction. Through systematic evaluation across multiple statistical, machine learning, and deep learning models, the GPS strategy consistently outperforms conventional selection methods, even under challenging conditions such as small sample sizes and cross-environment predictions, marking a significant advancement in predictive breeding methodologies for complex plant traits.
Introduction
Global food security is facing unprecedented challenges due to climate change, farmland degradation, and a growing population (Tester and Langridge, 2010; Foley et al., 2011). Food demand is expected to exceed 3 billion tons by 2050, necessitating a 60% increase in food production (Godfray et al., 2010). Plant breeding plays a crucial role in selecting ideal varieties and promoting food production (He and Li, 2020; Qaim, 2020). However, traditional breeding practices often require multi-year and multi-location trials, with performance variability across environments, resulting in inefficient variety screening and slow progress in yield (YD) improvement. Thus, there is an urgent need to develop robust and transferable breeding strategies to secure global food production.
Genomic selection (GS) is a pivotal technology in crop breeding that predicts breeding values using whole-genome molecular markers (Meuwissen et al., 2001; Crossa et al., 2017; Wang et al., 2023a). Over the past two decades, GS has been widely adopted in plant and animal breeding for its ability to accelerate genetic gain while reducing time and cost (Bernardo and Yu, 2007; Crossa et al., 2017). Contemporary genomic prediction models primarily include GBLUP (VanRaden, 2008), BayesB (Meuwissen et al., 2001), random forest (RF) (Ogutu et al., 2011), least absolute shrinkage and selection operator (Lasso) (Usai et al., 2009), support vector machine (SVM) (Maenhout et al., 2007), extreme gradient boosting (XGBoost) (Deng et al., 2022), light gradient boosting machine (LightGBM) (Ke et al., 2017), deep neural network genomic prediction (DNNGP) (Wang et al., 2023a), among others (Wainberg et al., 2018; Alemu et al., 2024). Despite improvements in GS modeling, prediction accuracy for low-heritability traits—such as YD and protein (PT) content in wheat—remains limited (Juliana et al., 2019). GS also has limitations in cross-environment prediction: accuracy often declines when models trained in one environment are applied to another, primarily due to genotype-by-environment interactions (Crossa et al., 2017). As a result, complex traits with low heritability and cross-environment prediction continue to constrain the effectiveness of GS (Boomsma et al., 2010; Singh et al., 2016b).
Phenotypic selection (PS), an indirect selection method that leverages correlated auxiliary traits, has been widely recognized as a valuable complement to GS in multi-environment breeding programs (Bandillo et al., 2023). Unlike phenomic selection, which mainly relies on high-throughput phenotyping technologies (Rincent et al., 2018; Zhu et al., 2021), PS refers to the use of phenotypic data from conventional agronomic traits. PS has outperformed GS in cross-environment prediction of quantitative traits, as demonstrated in recent studies on various species, including wheat, poplar, and rapeseed (Rincent et al., 2018; Roscher-Ehrig et al., 2024). Moreover, auxiliary traits can facilitate indirect selection of target traits such as grain YD. Although these estimations may not match the accuracy of direct selection, they remain highly useful in the early stages of the breeding cycle (Gizaw et al., 2018; Winn et al., 2023). Given its reliance on indirect estimates of auxiliary traits, PS may fail to fully capture the underlying genetic complexity of target traits.
Integrating genomic and phenotypic data may be an effective strategy for improving the accuracy of model predictions (Xu et al., 2021; Gao et al., 2023a). While genomic data provide a foundation for understanding plant physiological mechanisms, and auxiliary traits offer indirect insights into their formation, combining both enables a more comprehensive understanding of the genetic basis of complex traits (Jackson et al., 2023; Luo et al., 2023). Traditional GS models predict single traits without fully leveraging the correlations between target and non-target traits (Anche et al., 2020). In contrast, by incorporating auxiliary traits, multi-trait GS models (e.g., MAK) have been shown to enhance the prediction accuracy of low-heritability traits (Calus and Veerkamp, 2011; Liang et al., 2023). For example, Brault et al. (2022) demonstrated the significant impact of phenotypic data on GS model performance by integrating near-infrared spectroscopy data of grapevines into traditional GS models, thereby improving prediction accuracy. Similarly, the Gene-SGAN approach effectively decomposes phenotypic variations associated with significant loci into distinct disease-related subtypes, integrating phenotypic and genetic data to analyze disease heterogeneity. This method highlights the potential of integrating significant genetic loci with image recognition models (Yang et al., 2024). These studies underscore the necessity of combining genomic and phenotypic data in trait prediction and selection.
Machine learning (ML) has become a powerful tool for multi-omics data analysis, capable of autonomous pattern recognition and model optimization (Ma et al., 2014; Abdollahi-Arpanahi et al., 2020). For example, the Lasso method outperforms traditional statistical methods in multi-omics-based phenotype prediction by reducing model complexity through shrinking coefficients and effectively selecting biologically relevant features, which reduces noise and improves model stability and accuracy (Usai et al., 2009). RF has powerful feature selection capabilities, identifying key genetic markers that influence target traits and uncovering non-linear relationships among genomic, transcriptomic, and phenomic data (Holliday et al., 2012). LightGBM efficiently processes large multi-omics datasets, balancing speed and accuracy with low memory usage and strong performance on sparse data (Yan et al., 2021). Convolutional neural networks (CNNs) improve prediction by capturing local patterns through hierarchical structures and have been successfully applied to genomic pattern recognition, such as identifying gene network signatures linked to complex traits (Zou et al., 2019). Recurrent neural networks excel at modeling sequential data and the temporal dynamics of multi-omics datasets, such as time-series gene expression (Singh et al., 2016a). Models like DeepGS (Ma et al., 2017), DNNGP (Wang et al., 2023a), and SoyDNGP (Gao et al., 2023b) have further improved prediction accuracy. Overall, a wide range of ML and deep learning (DL) models have advanced GS, with multi-source data fusion emerging as a promising strategy for improving prediction accuracy (Montesinos-López et al., 2018). Multi-omics data integration has become a research hotspot due to its strong potential for improving breeding efficiency. However, key challenges remain, including the “curse of dimensionality” and “data heterogeneity” (Mohr et al., 2024).
The “curse of dimensionality” requires the selection of appropriate models capable of effective feature selection and extraction, while “data heterogeneity” necessitates the use of different integration strategies to achieve effective fusion of multi-source data. Despite recent progress, practical fusion strategies and the factors that maximize the potential of genomic and phenotypic selection (GPS) have yet to be elucidated (Wainberg et al., 2018; Alemu et al., 2024).
In addition to accuracy, model sensitivity is also an important factor affecting the practical application of predictive models. Sensitivity analysis evaluates how responsive a model is to changes in input variables, helping to assess its predictive capability and optimize performance (Heslot et al., 2012). In both GS and PS models, sample size plays a critical role, as most models exhibit improved prediction accuracy with larger sample sizes (Crossa et al., 2014). In traditional statistical GS models, the quantity and quality of single-nucleotide polymorphisms (SNPs) are closely related to genomic prediction accuracy. An excessive number of SNPs or the inclusion of low-quality SNPs can increase model complexity and introduce noise, ultimately reducing accuracy (Daetwyler et al., 2013; Liu et al., 2023; Zhou et al., 2023). In PS models, the diversity and quantity of phenotypic data are equally important. Measurement errors in phenotypic data and weak correlations with target traits can significantly impact accuracy (Crain et al., 2018). Because model sensitivity to input quality directly influences predictive outcomes, identifying contributing factors affecting model accuracy and refining data inputs are essential for optimization.
Model transferability is another key factor in determining a model’s broad applicability. It refers to the ability to maintain predictive accuracy across different time periods, locations, environments, and populations (Azodi et al., 2019). In cross-year analyses of wheat and canola, ML models showed substantial variations in predicted means and variances, significantly constraining their predictive accuracy. In contrast, cross-environment prediction studies on wheat demonstrated strong predictive accuracy across diverse conditions when genotypic and phenotypic data were integrated (Zhu et al., 2022). Similarly, cross-location prediction studies in rice and sorghum demonstrated that expanding the geographic range of training data enhanced prediction accuracy (Guo et al., 2024). Although models generally perform best when predicting within the same environment in which they were trained, improving model transferability is essential for achieving accurate cross-environment predictions and broader practical application (Cuevas et al., 2017).
To address the limitations of existing GS and PS models, this study introduces a broadly applicable fusion strategy and identifies key factors for improving the fused model’s predictive accuracy, robustness, and transferability. Four major crop species—maize, soybean, rice, and wheat—along with five ML models (RF, Lasso, SVM, XGBoost, and LightGBM) and one advanced DL model (DNNGP) were selected for evaluation across three levels: data level, feature level, and result level. The study has three main objectives: (1) to identify the optimal fusion strategy and the most suitable model in terms of predictive accuracy; (2) to identify key factors that contribute to model robustness; and (3) to explore effective approaches for enhancing transferability in cross-environment prediction. This study provides new insights into enhancing the accuracy, robustness, and transferability of ML-driven selection methods in crop breeding.
Results
Design and comparison of fusion strategies
Comparison of different fusion strategies revealed differences in model response. The optimal fusion strategy for each model, based on average prediction accuracy across the four species, was identified as DNNGP_R, Lasso_D, LightGBM_D, RF_R, SVM_D, and XGBoost_D (Figure 1A and Supplemental Figure 1). The data fusion strategy consistently demonstrated superior performance compared to the other two fusion strategies across most models (Figure 1A and Supplemental Table 1). Models that demonstrated the highest accuracy were Lasso_D (75.6%), LightGBM_D (73.4%), and RF_R (71.8%), with Lasso_D being the most accurate method.
Figure 1.
Accuracy comparisons of different fusion strategies.
(A) Comparison of prediction accuracy across the three fusion strategies.
(B) Comparison of prediction accuracy between the data fusion strategy and the GS and PS models.
RFg, RFp, RF_D, RF_F, and RF_R represent genomic, phenotypic, data fusion, feature fusion, and result fusion predictions using the RF model, respectively. The same symbol combination rules apply to other models. Dots of the same color indicate prediction accuracies for different traits within the same species. The top horizontal line of each box plot represents the upper quartile, the middle horizontal line indicates the median, and the bottom horizontal line indicates the lower quartile. Vertical lines extending from the box indicate the maximum and minimum values, while dots outside this range represent outliers. Prediction accuracy was evaluated using the Pearson correlation between predicted and observed values in the testing dataset.
Comparison of the fusion strategies with traditional GS and PS models revealed that: (1) Data fusion significantly improved prediction accuracy across all models compared to traditional GS models. In the four datasets analyzed, the prediction accuracy of Lasso, RF, SVM, XGBoost, LightGBM, and DNNGP increased by 56.8%, 24.1%, 31.8%, 68.5%, 48.9%, and 49.7%, respectively, compared to GS methods (Figure 1B and Supplemental Table 2). (2) Most models also outperformed traditional PS models when using data fusion. The prediction accuracies of Lasso, XGBoost, LightGBM, and DNNGP increased by 18.7%, 17.8%, 15.3%, and 24.7%, respectively, compared to the PS methods. However, although data fusion improved the prediction accuracy of RF and SVM compared with GS, their performance was still inferior to that of PS (Figure 1B and Supplemental Figure 2). (3) The best-performing data fusion model (Lasso_D) increased selection accuracy by 53.4% compared to the best GS model (LightGBM) and by 18.7% compared to the best PS model (Lasso). (4) While the multitrait models MTGBLUP and MAK outperformed conventional GS models, they remained less effective than Lasso_D, which demonstrated an average improvement of 44.4% over MAK and 36.5% over MTGBLUP across the four species.
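The data-level fusion idea can be sketched in a few lines: concatenate the SNP matrix and the auxiliary-trait matrix before any model-specific processing, then fit a single model and score it with the Pearson correlation used throughout this study. This is a toy illustration with simulated data; the dimensions and variable names are hypothetical, and a ridge-style least-squares fit stands in for the actual Lasso_D model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy inputs: 200 lines x 500 SNPs coded 0/1/2,
# plus 3 auxiliary phenotypic traits per line.
X_snp = rng.integers(0, 3, size=(200, 500)).astype(float)
X_aux = rng.normal(size=(200, 3))
y = X_snp[:, 0] - X_snp[:, 1] + 2.0 * X_aux[:, 0] + rng.normal(scale=0.5, size=200)

# Data-level fusion: merge both sources into one feature matrix
# *before* feature processing, so genomic and phenotypic inputs
# pass through a common workflow.
X_fused = np.hstack([X_snp, X_aux])

# Joint standardization (one shared normalization step for both sources).
X_std = (X_fused - X_fused.mean(axis=0)) / (X_fused.std(axis=0) + 1e-12)

# Ridge-style least squares as a stand-in for the Lasso_D model.
lam = 1.0
p = X_std.shape[1]
coef = np.linalg.solve(X_std.T @ X_std + lam * np.eye(p), X_std.T @ y)
y_hat = X_std @ coef

# Accuracy metric used in the paper: Pearson correlation between
# predicted and observed values (here on the training data only).
accuracy = np.corrcoef(y, y_hat)[0, 1]
```

Feature-level fusion would instead run feature selection on each source separately before merging, and result-level fusion would combine the two models' predictions afterward; the concatenation-first ordering is what distinguishes the data-level strategy.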
Effects of sample quantity and quality on fusion model performance
Sensitivity analysis using the Wheat2000 dataset revealed that the performance of all models declined as sample size decreased, with a more pronounced decrease when the sample size dropped below 1000. Prediction accuracy for TW (test weight), GP (grain protein), and GH (grain hardness) decreased by 26.5%, 30.6%, and 32.4%, respectively, when the sample size was reduced from 1800 to 200 (Figure 2A; Supplemental Figure 3A and 3B). Among the models, Lasso_D demonstrated strong performance on small datasets, outperforming RF_D, SVM_D, XGBoost_D, LightGBM_D, and DNNGP_D for TW prediction by 46.3%, 40.8%, 50.9%, 7.3%, and 3.5%, respectively, at a sample size of 200 (Figure 2A). Lasso_D showed similar advantages for GP and GH prediction (Supplemental Figure 3A and 3B).
Figure 2.
Sensitivity analysis across different scenarios.
The predicted trait shown is test weight (TW).
(A) Effect of sample size on prediction accuracy.
(B) Effect of SNP quality on prediction accuracy.
(C) Effect of phenotype quantity on prediction accuracy.
(D) Effect of correlation between the target trait and auxiliary trait on prediction accuracy.
Each line represents the mean prediction accuracy from 100 replicates per model. Shaded areas indicate 95% confidence intervals, calculated as mean ± 1.96 × standard error.
Sensitivity analysis of SNP density revealed that enhancing SNP quality reduced the number of SNPs but had no significant impact on model accuracy. Lasso_D consistently maintained high prediction accuracy despite changes in SNP density (Figure 2B; Supplemental Figure 3C and 3D). The prediction accuracy of nearly all models increased as the number of phenotypes used in the model increased, except for RF_D and SVM_D (Figure 2C; Supplemental Figure 3E and 3F). Additionally, the correlation coefficient between auxiliary and target traits significantly influenced the prediction accuracy of all models except RF_D and SVM_D. A higher absolute value of the correlation coefficient corresponded to increased prediction accuracy (Figure 2D; Supplemental Figure 3G and 3H). Sensitivity analysis using the soybean dataset produced consistent results (Supplemental Figure 4).
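The sample-size sensitivity procedure amounts to repeatedly subsampling the training set, refitting, and reporting the mean accuracy with a 95% confidence interval (mean ± 1.96 × standard error, as in the Figure 2 legend). The sketch below uses simulated data and a ridge stand-in for the fusion models; sizes, feature counts, and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def fit_predict(X_tr, y_tr, X_te, lam=1.0):
    # Ridge-style stand-in for any of the data fusion models.
    p = X_tr.shape[1]
    coef = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(p), X_tr.T @ y_tr)
    return X_te @ coef

# Toy fused dataset (hypothetical): 2000 samples, 50 features,
# with the signal carried by the first five features.
X = rng.normal(size=(2000, 50))
y = X[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=2000)
X_test, y_test = X[1800:], y[1800:]   # held-out testing set

results = {}
for n in (200, 600, 1000, 1400, 1800):
    accs = []
    for _ in range(100):              # 100 replicates, as in Figure 2
        idx = rng.choice(1800, size=n, replace=False)
        y_hat = fit_predict(X[idx], y[idx], X_test)
        accs.append(pearson(y_test, y_hat))
    accs = np.array(accs)
    se = accs.std(ddof=1) / np.sqrt(len(accs))
    # 95% CI half-width: 1.96 * standard error.
    results[n] = (accs.mean(), 1.96 * se)
```

The same loop structure carries over to the other sensitivity axes: replace the row subsampling with SNP filtering, auxiliary-trait subsetting, or trait-correlation thresholds.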
Model transferability analysis and improvement strategies
Model transferability analysis using public datasets
Transferability analysis of soybean oil content prediction was conducted using Lasso_D over three years and three locations (Figures 3A and 3B). The results revealed that (Figure 3C): (1) models trained and tested within the same year consistently outperformed cross-year models. Same-year prediction accuracies were 0.953 (2011), 0.947 (2012), and 0.958 (2013), compared to average cross-year testing accuracies of 0.93, 0.937, and 0.912, respectively. These correspond to improvements of 2.5%, 1.1%, and 5.1% for same-year models over cross-year averages. (2) Combining data from multiple years, excluding the testing year (ETY), improved cross-year prediction accuracy compared to single-year trained models. Prediction accuracies of ETY were 1%, 1.6%, and 1% higher than the average cross-year accuracies in 2011, 2012, and 2013, respectively. (3) The testing accuracy of ETY was comparable to that of same-year trained models. Although slightly lower, ETY accuracies decreased by only 2%, 1.8%, and 1.2% in 2011, 2012, and 2013, respectively, compared to the corresponding same-year training models.
Figure 3.
Model prediction accuracy under different environments.
(A) Distribution of oil content in 1260 soybean samples across different environments.
(B) Pearson correlation coefficients among seven traits in the soybean dataset. PH: plant height, YD: yield, ME: moisture, PT: protein, OIL: oil, FR: fiber, GW: grain weight.
(C) Cross-year oil prediction results. ETY indicates that the training data include all years except the testing year.
(D) Cross-location oil prediction results. ETL indicates that the training data include all locations except the testing location.
(E) Cross-environment oil prediction results. ETE indicates that the training data include all environments except the testing environment.
“Value” represents the Pearson correlation between predicted values and actual values in the testing dataset.
Location-based transferability analysis showed that models trained on data from the same location consistently outperformed cross-location models (Figure 3D). In Illinois (IL), Nebraska (NE), and Indiana (IN), same-location testing accuracies were 0.902, 0.928, and 0.951, respectively, compared to cross-location averages of 0.795, 0.898, and 0.873. These differences translated into 13.5%, 3.4%, and 9% increases compared to cross-location averages. Combining data from multiple locations, excluding the testing location (ETL), improved prediction accuracy compared to cross-location predictions trained on a single location. The testing accuracy of ETL was 4.3%, 7.4%, and 8.8% higher than the cross-location averages in IL, IN, and NE, respectively. ETL models were comparable to same-location trained models, with prediction accuracy decreases of only 1.5%, 1.9%, and 1.2% in IL, IN, and NE, respectively, compared with models trained on the same location.
Environment-based transferability analysis showed a similar trend. Models trained on data from the same environment consistently outperformed cross-environment models (Figure 3E). In 2011_IL, 2011_NE, 2012_IL, 2012_IN, 2012_NE, 2013_IL, and 2013_IN, same-environment prediction accuracies were 0.941, 0.945, 0.945, 0.958, 0.961, 0.96, and 0.948, respectively, outperforming the cross-environment averages of 0.917, 0.892, 0.928, 0.935, 0.884, 0.89, and 0.846. Same-environment accuracies were 2.5%, 6%, 1.9%, 2.5%, 8.8%, 7.8%, and 12% higher than those of the corresponding cross-environment averages. Combining data from multiple environments excluding the testing environment (ETE) improved accuracy compared to cross-environment predictions trained on a single environment. The testing accuracy of ETE was 7.2%, 5.3%, 4.6%, 5.1%, 5.3%, 5.5%, and 5.2% higher than the cross-environment averages in 2011_IL, 2011_NE, 2012_IL, 2012_IN, 2012_NE, 2013_IL, and 2013_IN, respectively. ETE models were comparable to same-environment trained models, with slight prediction accuracy decreases of only 0.1%, 0.3%, 0.4%, 0.6%, 0.3%, 0.3%, and 0.3% for 2011_IL, 2011_NE, 2012_IL, 2012_IN, 2012_NE, 2013_IL, and 2013_IN, respectively.
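The ETY/ETL/ETE training sets above all follow one split rule: train on every year, location, or environment except the one being tested. A minimal sketch of that leave-one-environment-out split (function and variable names are illustrative, not from the study's code):

```python
def split_leave_one_out(records, test_env):
    """Split (environment, sample) pairs into train/test sets,
    keeping the held-out environment entirely out of training."""
    train = [sample for env, sample in records if env != test_env]
    test = [sample for env, sample in records if env == test_env]
    return train, test

# Hypothetical year_location environments, as in Figure 3E.
records = [
    ("2011_IL", "line_a"), ("2011_NE", "line_b"),
    ("2012_IL", "line_c"), ("2013_IN", "line_d"),
]
train, test = split_leave_one_out(records, "2012_IL")
```

Grouping records by year alone gives the ETY split and by location alone gives the ETL split; the rest of the pipeline is unchanged.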
Model transferability from public to proprietary datasets
Consistent with results from public datasets, application of the model in two-year soybean field trials demonstrated that: (1) same-year prediction accuracies were higher than those of cross-year predictions. For GW (grain weight), the average same-year testing accuracy was 0.52, 15.6% higher than the average accuracy of cross-year testing at 0.45 (Supplemental Figure 6). (2) Training the model on data from both years further improved prediction accuracy. The testing accuracy using combined data was 2.9% higher than the average cross-year accuracy in 2022 and 2023 (Supplemental Figure 6). These results indicate that increasing the size of the training dataset improves prediction accuracy, even when the additional samples are from different genotypes and environments.
Discussion
Data fusion enhances model prediction accuracy
This study proposes a novel trait selection framework, GPS, incorporating three types of fusion strategies. The data-level fusion strategy consistently demonstrated the best performance, with prediction accuracy 5.9% and 9.9% higher than that of feature-level and result-level fusion, respectively. By enabling earlier feature extraction when handling multi-source data, data fusion significantly improved predictive accuracy compared to other fusion strategies (Picard et al., 2021; Płuciennik et al., 2022). Rather than focusing on the development of novel data integration methods, this study thoroughly analyzed the key challenges in current data integration research with the goal of improving the accuracy of GPS. Existing data integration methods, such as DIABLO (Singh et al., 2019) and MOFA (Argelaguet et al., 2018), primarily focus on combining multiple data types but often overlook the impact of the integration layer itself. Most approaches merge data at the feature level. In contrast, our study systematically compared fusion at three different levels across multiple models and found that integration at the data level consistently yielded the highest prediction accuracy. These findings provide valuable insights for the development of future data integration strategies. Data fusion enables direct discovery of genotype–phenotype mappings and genetic correlations across traits. Moreover, it subjects both -omics datasets to a common feature-processing workflow (e.g., normalization and kernel construction), harmonizing noise characteristics and regularization strength (Meng et al., 2016). In contrast, feature-level and result-level fusion rely on independent feature selection processes, which may omit subtle interaction signals (Duan et al., 2021).
The sub-optimal accuracy of feature fusion may be caused by the complexity of identifying the optimal fusion stage and method. Research has shown that different fusion strategies are suited to different applications. For example, early fusion can fully exploit intermodal information but may increase computational burdens due to higher dimensionality, whereas late fusion offers flexibility in handling high-dimensional data but may miss deeper intermodal correlations (Pawłowski et al., 2023). Future research should explore diverse fusion techniques to further enhance model performance. Recent studies have shown that DL models, such as CNNs or transformers, significantly enhance both the efficacy and granularity of feature fusion processes (He et al., 2023). Additionally, attention mechanisms can further refine fusion processes by dynamically weighting the interactions among features.
Although result-level fusion was generally less accurate than data fusion, it demonstrated the potential to achieve the highest prediction accuracy in specific cases (e.g., RF_R > RF_F > RF_D) (Figure 1A). This may be attributed to the use of the DEoptim (differential evolution optimization) algorithm for global optimization by assigning optimal weights to GS and PS. However, DEoptim requires extensive iterations, increasing both computation time and memory usage. Analysis of DEoptim-derived weight distribution patterns revealed correlations with heritability for GS models and with absolute trait correlations for PS models. To address these limitations, we proposed an alternative weight distribution method, FastW (fast weight) (Equation S1), which eliminates the need for global optimization by replacing DEoptim. Correlation coefficients between FastW and DEoptim weights, calculated for three traits per species, were 0.96, 0.99, and 0.99 for maize; 0.77, 0.99, and 0.94 for soybean; 0.98, 0.86, and 0.98 for rice; and 0.97, 0.82, and 0.96 for wheat (Supplemental Figure 7). These results demonstrate that FastW offers a computationally efficient alternative to DEoptim for weight distribution between GS and PS models (Supplemental Figure 8).
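The result-level fusion objective can be written compactly: combine the GS and PS predictions as y = w·GS + (1 − w)·PS and choose w to maximize the Pearson correlation with observed values. The sketch below uses a simple grid search as a lightweight stand-in for DEoptim's global optimization (the data are simulated and all names are illustrative; this is not the FastW formula of Equation S1).

```python
import numpy as np

rng = np.random.default_rng(2)

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

# Toy validation-set predictions from a GS and a PS model
# (hypothetical values; here PS is deliberately the stronger model).
y_true = rng.normal(size=300)
y_gs = y_true + rng.normal(scale=0.8, size=300)   # noisier GS prediction
y_ps = y_true + rng.normal(scale=0.4, size=300)   # stronger PS prediction

# Result-level fusion: y = w * GS + (1 - w) * PS.
# Grid search over w replaces DEoptim's iterative global optimization.
weights = np.linspace(0.0, 1.0, 101)
scores = [pearson(y_true, w * y_gs + (1 - w) * y_ps) for w in weights]
best_w = float(weights[int(np.argmax(scores))])
```

Because w = 0 and w = 1 are in the search grid, the fused score can never fall below the better single model on the data used for weight fitting; and, consistent with the weight-distribution patterns described above, the weaker GS predictor receives the smaller weight.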
The accuracy of the data-fusion-based GPS was consistently better than GS and PS, except for PS predictions using RF and SVM (Figure 1B). This discrepancy may be attributed to the inability of RF and SVM to effectively handle high-dimensional, heterogeneous feature spaces. Specifically, genomic data such as SNPs are typically high dimensional and sparse, whereas phenotypic traits are low-dimensional and continuous. Simple concatenation of these data types can lead to feature imbalance and noise accumulation, which may hinder the learning capability of RF and SVM. In contrast, models such as Lasso, XGBoost, LightGBM, and DNNGP incorporate built-in mechanisms for feature selection and regularization, enabling them to extract relevant signals more effectively from both data types. These findings underscore the importance of selecting appropriate models and integration strategies when combining multi-source biological data for prediction tasks.
Data fusion improved prediction accuracy over models trained on single data sources, consistent with findings from previous studies (Singh et al., 2019). Unlike previous research, which primarily focused on data fusion within a single -omics category, this study addressed the challenges of integrating data across different -omics categories (Salcedo-Sanz et al., 2020), including differences in data types, feature diversity, and dimensional heterogeneity. Notably, the best-performing data fusion model (Lasso_D) improved selection accuracy by 53.4% compared to the best GS model (LightGBM) and by 18.7% compared to the best PS model (Lasso). SHAP (Shapley additive explanations) analysis further revealed that for rice GW, where GS outperformed PS, SNPs contributed more prominently to the model (Supplemental Figure 9A), whereas for soybean PT content, where PS outperformed GS, auxiliary phenotypic traits were more important (Supplemental Figure 9B). These results suggest that the proposed framework effectively integrates GS and PS data, identifies key predictive features, and improves prediction accuracy, thereby offering valuable guidance for breeding practices.
Sample quantity and quality are major factors affecting model accuracy
Model sensitivity is a critical determinant of predictive accuracy (Heffner et al., 2009). In this study, we systematically analyzed the impact of sample quantity and quality, especially SNP density, the number of phenotypes, and trait correlations, on model prediction accuracy. As sample size increases, predictive accuracy improves, especially for complex traits. Larger datasets enhance the model’s ability to capture patterns, reduce overfitting, and improve generalizability (Wientjes et al., 2013). However, given the challenges in obtaining large datasets that combine genomic and phenotypic data, it is important to develop models that perform well even with smaller sample sizes. Our study demonstrates that the correlation between auxiliary and target traits is a critical factor influencing model performance. Even models trained on small datasets can achieve high predictive power when they effectively leverage the relationships between auxiliary and target traits. Multi-objective optimization strategies may further enhance the performance of small-sample models.
SNP numbers and the strength of their association with traits are key factors influencing model accuracy (Desta and Ortiz, 2014). An excessively high number of SNPs or poor SNP quality can increase model complexity and introduce noise, thereby reducing prediction accuracy. In our study, filtering SNPs to retain only those strongly associated with target traits did not significantly affect the performance of ML models, suggesting that our approach maintains high accuracy even when using unfiltered SNPs. This may be due to the inclusion of auxiliary phenotypic traits and the feature selection capabilities inherent to the ML and DL models employed, making them less sensitive to noise and well suited for application to raw SNP data. This finding highlights the adaptability of our framework to whole-genome sequencing datasets. Additionally, developing low-density SNP chips is essential for cost-effective breeding programs. The number of phenotypes and their correlations with target traits also play a critical role in model performance. Prediction accuracy was positively correlated with the absolute correlation coefficient between auxiliary and target traits, indicating that incorporating information from auxiliary traits can significantly improve the prediction accuracy for target traits. High correlation between auxiliary and target traits may reflect shared genetic foundations, allowing auxiliary traits to serve as effective predictors in GS models (Arojju et al., 2020; Muvunyi et al., 2022; Gu et al., 2025).
In this study, we focused on the effects of sample size, SNP density, the number of phenotypes, and trait correlations on prediction accuracy. Although the wheat dataset used in this study contained 2,000 samples and multiple traits, improvements in predictive accuracy were limited, likely constrained by low genetic diversity and trait heritability. Future studies should explore the roles of genetic diversity and heritability in greater depth, as a better understanding of these factors could enhance the generalizability of prediction models across species and traits (Acosta-Pech et al., 2017).
Trait selection models with broad transferability through integration of multiyear and multilocation data
This study demonstrates the feasibility of constructing trait selection models with broad transferability by leveraging multi-source data fusion. This approach has significant implications for addressing prediction challenges across diverse environments, particularly through the integration of multi-year and multi-location datasets.
Model predictions across different years were generally more accurate than those across locations, likely due to smaller environmental variation between years. Notably, the training subsets 2012_IL and 2012_IN significantly improved predictions compared to other subsets, which may be attributed to broader trait diversity in these subsets (Figure 3A). This suggests that broader phenotypic coverage enhances prediction accuracy across environments. Training with historical data spanning multiple years enables models to effectively capture the performance characteristics of target traits under varying environmental conditions, thereby improving model transferability. To further validate this, we compared Lasso models trained with the same number of samples drawn from either single or multiple environments (see supplemental information). The GPS model trained on multi-environment data consistently outperformed those trained on data from a single environment, with an average improvement of ∼3.5% in prediction accuracy (Supplemental Figure 5). However, the PS model trained on the mixed dataset underperformed compared to models trained on 2012_IN or 2012_NE, and the GS model trained on the mixed dataset only outperformed those trained on 2012_NE and 2013_IL. These results highlight that GPS benefits from integrating multiple data types during data fusion, enabling better adaptation to cross-environment predictions.
Beyond transferability, data fusion also improves prediction accuracy for target traits that are difficult to measure by incorporating easily measurable auxiliary traits. Advances in low-cost, high-throughput phenotyping technologies—such as unmanned aerial vehicles and hyperspectral imaging—offer unprecedented opportunities to apply our methods in practice (Tao et al., 2022). For example, although soybean oil content is a complex trait influenced by both genetic and environmental factors, with low correlation across different years (Yuan et al., 2024), our model achieved significantly enhanced prediction accuracy by integrating genomic data with auxiliary traits (Figure 3B), thereby effectively leveraging trait correlations. This proposed framework is expected to significantly improve prediction accuracy for target traits by integrating genomic and early-stage phenomic data.
Consistent results were also observed when the data fusion model was applied to a private dataset (Supplemental Figure 4). Although cross-year predictions were generally less accurate than same-year predictions, models trained on data from 2 years outperformed those trained on data from a single year. These results underscore the practical utility of our framework in real-world production settings and demonstrate that integrating data across multiple years can effectively enhance prediction accuracy.
To further improve model transferability, it is crucial to develop hybrid models that integrate both data-driven and knowledge-driven approaches. For data-driven models, integrating genomic, phenotypic, and environmental data at the data level is essential. Incorporating advanced DL architectures—such as graph neural networks (Zhou et al., 2020) and attention mechanisms (Vaswani, 2017)—can better capture complex relationships in high-dimensional and unstructured data. Meanwhile, knowledge-driven models incorporate domain expertise and biological principles—such as known genetic mechanisms or functional gene annotations—to guide model development and enhance prediction accuracy (Li et al., 2024). These approaches also enhance model interpretability by integrating genetic and biological knowledge, thereby enabling biologically meaningful predictions and a deeper understanding of the relationships between genotypes and traits (Mbebi and Nikoloski, 2023). To ensure adaptability to new data and environments, future models should incorporate techniques such as online or incremental learning as databases continue to expand. By leveraging the complementary strengths of data-driven and knowledge-driven methodologies, hybrid models can address complex biological questions more effectively and offer broad applicability across diverse environments, species, and traits.
Contributions and implications
This study makes significant advances in trait selection modeling by proposing a generalizable trait prediction framework, GPS, which integrates the strengths of GS and PS using three fusion strategies: data-level, feature-level, and result-level fusion. Through systematic comparison across five ML models and one DL model, we identified data fusion as the most effective strategy for enhancing predictive accuracy. We also assessed model robustness and identified key factors that improve its accuracy. Robustness analysis demonstrated that GPS performs well with small datasets and is insensitive to SNP density. Furthermore, prediction accuracy was influenced by the number of auxiliary traits and their correlation with target traits. The fused model also showed strong transferability, performing well in cross-environment predictions. Importantly, training with multi-year and multi-location data markedly enhanced cross-environment prediction accuracy, making it comparable to same-environment predictions.
Despite these significant advances, several implications emerge that warrant further investigation. First, while the data fusion strategy demonstrated excellent performance in this study, the optimization of feature fusion remains underexplored. Most existing models operate at a single-feature level and may not fully leverage the rich information available in multi-layered data. Future research should explore fusion techniques that operate across different feature layers. This could involve developing methods that span multiple feature layers, combining high-level features from raw data with abstract features from deeper network layers (Yamanaka et al., 2017). Techniques such as layer normalization and skip connections may help manage scale differences and variability among these features, thereby enhancing overall model performance (De and Smith, 2020). Second, the inclusion of environmental factors in our current models has been limited. Future research should focus on integrating these external variables more comprehensively alongside multi-omics data. While plant genomes remain relatively stable during growth, gene expression is highly dynamic across different developmental stages. Incorporating epigenetic data, such as DNA methylation and proteomics data, can help decipher gene expression patterns and enable more accurate phenotypic prediction (Ren et al., 2024). This integration should span multiple levels of biological organization—from genomics to phenomics—to better understand how environmental conditions impact these levels (Yang et al., 2021; Montesinos-López et al., 2024). A layered data integration approach could uncover novel insights into gene–environment interactions, thereby refining prediction models and breeding strategies (Subramanian et al., 2020). 
Given that the inclusion of overall environmental factors does not significantly improve model accuracy (Gillberg et al., 2019), future work should focus on extracting micro-environmental conditions for each sample (Alemu et al., 2024). Additionally, incorporating information such as differences in the length of crop growth periods can help characterize unique environmental characteristics for each variety (Gobin et al., 2023). These approaches would enable each sample to possess distinct environmental features, even within the same environment. Third, future studies should aim to incorporate longitudinal genomic and phenotypic data that capture trait variability across multiple growing seasons or developmental stages (Cilas et al., 2011). This approach would improve our understanding of trait evolution under varying environmental conditions (Gobin et al., 2023) and facilitate the development of predictive models that assess the stability of traits over time, which is crucial for traits such as YD stability in agriculture and disease resistance (Baba et al., 2020).
Methods
The experimental workflow primarily included three steps: (1) comparing the accuracy of three fusion strategies to identify the most effective fusion strategy, (2) conducting sensitivity analysis on the selected strategy to determine key factors affecting model accuracy and to select the optimal predictive model, and (3) evaluating the effectiveness of fusion strategies in cross-environment prediction (Supplemental Figure 10).
Datasets
Public datasets
To assess the applicability of data fusion strategies, this study utilized publicly available datasets containing both genomic and phenotypic data for four species: rice, maize, wheat, and soybean (Table 1).
Table 1.
Description of the public and proprietary datasets.
| Dataset | Number of Samples | Number of SNPs | Number of Traits | Year × Location | Website |
|---|---|---|---|---|---|
| Rice^a | 516 | 4 687 807 | 10 | 1 × 1 | https://ricevarmap.ncpgr.cn/ |
| Maize^a | 244 | 9486 | 7 | 1 × 1 | http://ibi.zju.edu.cn/BreedingAIDB/ |
| Wheat | 2000 | 33 709 | 8 | 1 × 1 | https://genomics.cimmyt.org/ |
| SoyNAM^a | 1260 | 4236 | 7 | 3 × 3 | https://www.soybase.org/SoyNAM/ |
| Soybean^a | 600 | 848 529 | 3 | 2 × 1 | – |
^a Genomic data were filtered and phenotypic data were normalized.
The rice dataset was obtained from the Chinese Plant Genome Research Center (Wuhan) and represents genetic diversity among cultivated rice varieties. The dataset includes genomic and phenotypic data for 529 rice samples grown in Wuhan, China, in 2011. After removing samples with missing phenotypes, 516 samples remained. The genomic data include 4 687 807 SNPs, and the phenotypic data cover 10 agronomic traits: heading date, plant height (PH), number of panicles, number of effective panicles, YD, GW, spikelet length, grain length, grain width, and grain thickness (Zhao et al., 2021).
The maize dataset was collected from the Institute of Crop Science at Zhejiang University (Zhejiang) and includes genomic and phenotypic data for 326 maize samples grown in Hainan, China, in 2016. To ensure consistency, genomic sequences were processed together. After filtering out varieties with insufficient SNP marker data and incomplete phenotypic records, the final dataset comprised 244 samples. The genomic data consist of 9486 SNPs, and the phenotypic data cover seven agronomic traits: days to anthesis, ear height, leaf length, leaf width, lower leaf angle, PH, and upper leaf angle (Shen et al., 2024).
The wheat dataset was obtained from the wheat gene bank of the International Maize and Wheat Improvement Center (CIMMYT). It includes genomic and phenotypic data for 2000 wheat samples grown in Ciudad Obregón, Sonora, in northwestern Mexico in 2011, representing Iranian wheat (Triticum aestivum) landraces. The genomic data comprise 33 709 SNPs, while the phenotypic data cover eight agronomic traits: grain length, grain width, GH, thousand kernel weight, TW, sodium dodecyl sulfate, GP, and PH (Crossa et al., 2016). Due to its large sample size, high SNP density, and numerous phenotypic traits, this dataset was selected for subsequent model sensitivity analysis.
The SoyNAM dataset was collected from the USDA Agricultural Research Service and includes genomic and phenotypic data for 4312 soybean samples grown at eight locations across the United States from 2011 to 2013. For model transferability analysis, 1260 soybean accessions consistently present in IL, NE, and IN during this period were selected. The genomic data include 4236 SNPs, and the phenotypic data encompass seven agronomic traits: PH, YD, moisture, PT, oil, fiber, and GW (Song et al., 2017).
To ensure comparability across analyses, all datasets were processed uniformly. Genomic data were preprocessed using PLINK (Chang et al., 2015) and filtered by a minor allele frequency threshold of 0.05 and a missing rate threshold of 0.1. Phenotypic data were normalized using Z-score normalization (Fei et al., 2021) prior to data fusion to meet model requirements. Phenotypes with more than 10% missing data were excluded, and samples with any missing values for any of the remaining phenotypes were removed from subsequent analyses.
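The phenotype-side preprocessing described above (dropping traits with >10% missing data, removing samples with any remaining missing value, and Z-score normalization) can be sketched as follows. This is a minimal illustration with a hypothetical array `pheno`; genotype filtering was performed separately in PLINK and is not reproduced here.

```python
import numpy as np

def preprocess_phenotypes(pheno, max_missing=0.10):
    """Drop traits with more than `max_missing` missing values, then drop
    samples with any remaining missing value, and Z-score normalize each
    retained trait."""
    pheno = np.asarray(pheno, dtype=float)
    # Keep traits whose per-trait missing rate does not exceed the threshold.
    trait_ok = np.isnan(pheno).mean(axis=0) <= max_missing
    pheno = pheno[:, trait_ok]
    # Remove samples with any missing value in the retained traits.
    sample_ok = ~np.isnan(pheno).any(axis=1)
    pheno = pheno[sample_ok]
    # Z-score normalization per trait: (x - mean) / std.
    return (pheno - pheno.mean(axis=0)) / pheno.std(axis=0)
```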
Proprietary dataset
A total of 600 soybean accessions were collected from Chengdu, Sichuan Province, China, with 300 samples planted in 2022 and another 300 in 2023. Phenotypic data were collected for each year. The experiment was conducted with a single replicate, using a row spacing of 0.5 m and plant spacing of 0.2 m. For phenotyping, at least five plants from the center of each plot were selected. Three traits were measured: flowering period, PH, and GW.
To obtain genomic data, young leaves from all 600 soybean accessions were collected, and genomic DNA was extracted using the cetyltrimethylammonium bromide method. Paired-end reads (150 bp × 2) were generated on a DNBSEQ-Tx sequencer (MGI Tech) following the manufacturer’s instructions. Quality control of raw sequencing data was carried out using FastQC (v0.10.1), and low-quality reads were removed. High-quality reads were then aligned to the soybean reference genome Wm82 using BWA-MEM (v0.7.13) (Li and Durbin, 2009; Wang et al., 2023b). The resulting alignments were sorted and indexed using SAMtools (v1.3.1) (Li et al., 2009). SNP and insertion/deletion variants were detected using the HaplotypeCaller, CombineGVCFs, and GenotypeGVCFs modules in GATK (v4.1.9.0) (McKenna et al., 2010). Finally, SNPs with a missing rate greater than 10% and a minor allele frequency below 5% were excluded using PLINK (v1.90) (Chang et al., 2015), resulting in 848 529 valid SNPs (Table 1).
Model selection
To identify the most suitable ML models for fusion, five widely adopted ML models (RF, Lasso, SVM, XGBoost, and LightGBM) and a cutting-edge DL model (DNNGP) were selected. Four GS models for benchmarking, including three statistical models (GBLUP, BayesB, and MTGBLUP) and one ML model (MAK), were also selected. To ensure robust performance estimation, each model was evaluated using 10-fold cross-validation with 10 repetitions. In each repetition, the dataset was randomly split into 10 folds; nine were used for training and one for testing, resulting in a total of 100 evaluations per model. To determine whether differences in prediction accuracy among models were statistically significant, one-way analysis of variance was performed, followed by Tukey’s honest significant difference test for pairwise comparisons.
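As a concrete illustration of this evaluation protocol, the sketch below runs 10-fold cross-validation with 10 repetitions and scores each held-out fold by the Pearson correlation between predicted and observed values. It is a minimal Python analogue (the study implemented most models in R); the Lasso estimator and data arrays are placeholders.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso
from sklearn.model_selection import RepeatedKFold

def repeated_cv_accuracy(X, y, model, n_splits=10, n_repeats=10, seed=0):
    """10-fold CV with 10 repetitions (100 evaluations in total); accuracy
    is the Pearson correlation between predicted and observed values in
    each held-out fold."""
    rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats,
                        random_state=seed)
    accs = []
    for train_idx, test_idx in rkf.split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        accs.append(stats.pearsonr(pred, y[test_idx])[0])
    return np.array(accs)  # 100 accuracy estimates per model
```

The resulting vector of 100 accuracies per model can then be compared across models with one-way ANOVA (`scipy.stats.f_oneway`) followed by Tukey's HSD (`scipy.stats.tukey_hsd`).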
RF is an ensemble learning method that combines predictions from multiple decision trees to enhance model accuracy and stability (Ogutu et al., 2011). RF is commonly applied in both GS and PS for extracting important features. The R package ranger (Wright and Ziegler, 2015) was used to implement RF, with all parameters set to their default values.
Lasso is a regression analysis method primarily used for variable selection and regularization, particularly suited to high-dimensional datasets such as genomic data (Usai et al., 2009). The R package glmnet (Friedman et al., 2010) was used to implement Lasso, with all parameters set to default.
SVM is a supervised learning algorithm used for both classification and regression tasks. SVM demonstrates robust performance with high-dimensional data and noise through the use of kernel functions (Maenhout et al., 2007). The R package e1071 (Dimitriadou et al., 2008) was used to implement SVM, with all parameters set to default.
XGBoost is an efficient gradient boosting algorithm widely used for both classification and regression. Unlike traditional gradient boosting, it introduces regularization terms to reduce overfitting and improve model generalization (Deng et al., 2022). The R package xgboost (Chen and Guestrin, 2016) was used to implement XGBoost, with all parameters set to default.
LightGBM is a highly efficient gradient boosting method designed for handling high-dimensional features and large-scale structured data, making it particularly suitable for future big-data-driven genomic prediction tasks. A key distinction between LightGBM and other decision-tree-based models lies in its use of a leaf-wise tree growth strategy, as opposed to the traditional level-wise approach, significantly improving model accuracy. The R package lightgbm (Ke et al., 2017) was used to implement LightGBM, with all parameters set to default.
DNNGP is a genomic prediction model leveraging deep CNNs. It achieves prediction accuracy comparable to that of ML models and demonstrates superior accuracy among existing DL models. To reduce computational time and memory usage while improving predictive performance, DNNGP incorporates principal-component analysis for dimensionality reduction (Wang et al., 2023a). Compared to other DL models such as DeepGS (Ma et al., 2018), DLGWAS (Liu et al., 2019), and SoyDNGP (Gao et al., 2023b), DNNGP’s dimensionality reduction techniques enhance the integration of genomic and phenotypic data (Wang et al., 2023a). We used DNNGP (https://github.com/AIBreeding/DNNGP) to test the effectiveness of the three fusion strategies. DNNGP was implemented using TensorFlow (v2.18.0) (Martín et al., 2015) in Python.
GBLUP and BayesB are two classical and widely applied GS models. In this study, both models were used exclusively for genomic prediction tasks. The R package BGLR (Pérez and de Los Campos, 2014) was used to implement GBLUP and BayesB.
MTGBLUP is a GS model that incorporates auxiliary traits and was also implemented using the R package BGLR (Pérez and de Los Campos, 2014).
MAK is a recently developed GS model that improves prediction accuracy by constructing multi-target ensemble regressor chains and extracting valuable information from phenotypic data (Liang et al., 2023). The MAK model also incorporates auxiliary traits during the GS process, significantly improving the prediction accuracy of the target trait in both animal and plant breeding. The MAK method was implemented following the public repository (https://github.com/ML-MAK/MAK).
DEoptim is a population-based global optimization algorithm. It generates new candidate solutions through mutation and crossover of existing solutions, selecting the best-performing individuals based on the objective function. This iterative process continues until convergence. DEoptim is effective for optimizing complex, non-linear, and high-dimensional problems and is commonly used for parameter tuning in ML models (Mullen et al., 2011).
Design and comparison of fusion strategies
To explore the optimal fusion strategy for GS and PS, this study proposed and systematically compared three types of fusion strategies using the six aforementioned models (Supplemental Figure 10A).
Data fusion (Figure 4A): the phenotypic data were represented by an n × t matrix P, where n is the sample size and t is the number of phenotypic traits. The genomic data were represented by an n × m matrix G, where m is the number of SNPs. The fused data were represented by an n × k matrix D, where k = t + m. In ML models (RF, Lasso, SVM, XGBoost, and LightGBM), D serves as the direct input. However, since DNNGP requires dimensionality reduction for genomic data, G is dimensionally reduced to an n × r matrix G′, where r is the number of reduced features. Dimensionality reduction was performed using principal-component analysis in PLINK (Chang et al., 2015). G′ was then fused with P to form an n × k′ matrix D′, where k′ = t + r, which served as the final input for DNNGP.
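Data-level fusion reduces to column-wise concatenation. A minimal Python sketch, with hypothetical arrays `G` (n samples × m SNPs) and `P` (n samples × t traits); the study ran the dimensionality reduction in PLINK, so scikit-learn's PCA stands in here for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_data(G, P, n_components=None):
    """Data-level fusion: concatenate the genomic (n x m) and phenotypic
    (n x t) matrices along the feature axis to form an n x (m + t) input.

    If n_components is given (as required for DNNGP), the genomic matrix
    is first reduced by PCA before concatenation."""
    if n_components is not None:
        G = PCA(n_components=n_components).fit_transform(G)
    return np.hstack([G, P])
```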
Figure 4.
Technical roadmap of the three fusion strategies.
(A) Data fusion.
(B) Feature fusion.
(C) Result fusion.
The modeling framework included six models: Lasso, RF, SVM, XGBoost, LightGBM, and DNNGP. For each of the three fusion strategies (A–C), only one model was applied at a time. For example, in the feature fusion strategy, RF was first used independently to extract features from genomic and phenotypic data. After the fusion step, RF was applied again to extract features from the combined data.
Feature fusion (Figure 4B): genomic and phenotypic data were first processed independently for feature extraction, and the resulting features were fused at an intermediate layer. The phenotypic matrix P and the genomic matrix G were used to extract features separately, producing an n × u matrix F_P, where u is the number of features extracted from the phenotypic data, and an n × v matrix F_G, where v is the number of features extracted from the genomic data. These matrices were then fused along the feature dimension to form an n × (u + v) matrix F. F was subsequently used for additional feature extraction, resulting in the final prediction. For ML models, P and G were individually fed into the model for training, and the final layer’s features (F_P and F_G) were merged to form F, which was then fed into the model to obtain the result. For ensemble methods (RF, XGBoost, and LightGBM), the “final layer’s features” refer to the outputs of all individual decision trees before they are aggregated to generate the final prediction. For Lasso, the “final layer’s features” refer to the products of the feature coefficients assigned by the model and the corresponding feature values. For SVM, the “final layer’s features” refer to the distance of each sample from the decision boundary, also known as the decision value, an intermediate result generated prior to prediction in the regression model. For the DL model, P and G were used as inputs, and the final layer’s features (F_P and F_G) were merged along the feature dimension to form F, which was subsequently fed into the model to generate the final prediction. For DNNGP, the “final layer’s features” refer to the features extracted by the last layer of the model’s network. The same model was used for feature extraction, feature fusion, and prediction. For example, when the RF model was selected, it first extracted features from genomic and phenotypic data separately; the merged features were then input into the RF model to obtain the final prediction.
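For the ensemble models, the “final layer’s features” are the per-tree outputs before averaging, which can be extracted directly. The sketch below uses Python with scikit-learn as an illustrative stand-in for the R `ranger` implementation; the matrix names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def tree_features(X_train, y_train, X_eval, n_trees=100, seed=0):
    """Fit an RF on one modality and return its 'final-layer features' for
    the training and evaluation sets: the per-tree predictions
    (n_samples x n_trees), taken before they are averaged."""
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    rf.fit(X_train, y_train)
    def feats(X):
        return np.column_stack([t.predict(X) for t in rf.estimators_])
    return feats(X_train), feats(X_eval)

def feature_fusion_predict(G_tr, P_tr, y_tr, G_te, P_te, seed=0):
    """Feature fusion with RF: extract features from genomic and phenotypic
    data separately, concatenate them along the feature dimension, then fit
    a second RF on the fused features to produce the final prediction."""
    Fg_tr, Fg_te = tree_features(G_tr, y_tr, G_te, seed=seed)
    Fp_tr, Fp_te = tree_features(P_tr, y_tr, P_te, seed=seed)
    fused_tr = np.hstack([Fg_tr, Fp_tr])
    fused_te = np.hstack([Fg_te, Fp_te])
    final = RandomForestRegressor(n_estimators=100, random_state=seed)
    final.fit(fused_tr, y_tr)
    return final.predict(fused_te)
```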
Result fusion (Figure 4C): genomic and phenotypic data were processed separately for model training, and the prediction results were fused using the DEoptim algorithm for global optimization (Mullen et al., 2011). The DEoptim algorithm was selected because it can optimize model weight allocation based on prediction accuracy, ensuring that prediction results remain stable and close to true values. The genomic matrix G was fed into the model to generate the prediction Y_G, while the phenotypic matrix P was used to generate Y_P. The DEoptim algorithm was then employed to optimize and assign weights w_1 and w_2 to the predictions, yielding the final result Y = w_1 Y_G + w_2 Y_P, where w_1 + w_2 = 1. For the ML models, G and P were independently fed into the model for training and prediction, generating Y_G and Y_P, respectively; the DEoptim algorithm then assigned the optimal weights, resulting in the final prediction Y. For the DL model, G and P were provided as inputs to generate Y_G and Y_P, with model result extraction carried out using the TensorFlow framework. The DEoptim algorithm assigned weights w_1 and w_2 to the genomic and phenotypic predictions, producing the final result Y = w_1 Y_G + w_2 Y_P, where w_1 + w_2 = 1.
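The weight search reduces to a one-dimensional differential-evolution problem. The study used R's DEoptim; `scipy.optimize.differential_evolution` is used below as an analogous optimizer, not the authors' implementation, and maximizing the Pearson correlation of the fused prediction is an assumed objective.

```python
import numpy as np
from scipy.optimize import differential_evolution

def fuse_results(y_true, pred_g, pred_p):
    """Result-level fusion: find weights w1 and w2 = 1 - w1 that maximize
    the Pearson correlation between the weighted prediction and observed
    values, via differential evolution (an analogue of R's DEoptim)."""
    def neg_corr(w):
        fused = w[0] * pred_g + (1 - w[0]) * pred_p
        return -np.corrcoef(fused, y_true)[0, 1]
    res = differential_evolution(neg_corr, bounds=[(0.0, 1.0)], seed=0)
    w1 = res.x[0]
    return w1, 1.0 - w1
```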
The accuracy of these models was compared across the three fusion strategies and four datasets. Notably, GBLUP, BayesB, MTGBLUP, and MAK were applied exclusively to GS. To account for variation in traits across the datasets, three agronomic traits were randomly selected from each dataset to assess prediction accuracy. In the rice dataset, the selected traits were PH, YD, and GW; in the maize dataset, leaf length, ear height, and PH were chosen; in the wheat dataset, TW, GP, and GH were selected; and in the SoyNAM dataset, YD, PT, and GW were selected.
To maintain consistency in training and testing data across models and to address the need for validation data in DL to prevent overfitting, all datasets were uniformly partitioned into training, validation, and testing sets using an 8:1:1 ratio. The two statistical learning models and five ML models used only the training and testing sets, whereas the DL model utilized all three sets. Model accuracy was evaluated using the correlation coefficient between predicted and actual values. The model demonstrating the highest accuracy was subsequently used for transferability and sensitivity analyses (Table 2).
Table 2.
Models selected for studying three types of fusion strategies.
| Type | Model | GS | PS | Data Fusion | Feature Fusion | Result Fusion |
|---|---|---|---|---|---|---|
| Deep learning | DNNGP | √ | √ | √ | √ | √ |
| Machine learning | RF | √ | √ | √ | √ | √ |
| | Lasso | √ | √ | √ | √ | √ |
| | SVM | √ | √ | √ | √ | √ |
| | XGBoost | √ | √ | √ | √ | √ |
| | LightGBM | √ | √ | √ | √ | √ |
√ indicates that this operation was performed using the model.
Sensitivity analysis
The wheat and SoyNAM datasets were selected to assess model performance at different scales due to their large sample sizes, high SNP density, and extensive phenotypic data. Sensitivity was evaluated for three traits in the wheat dataset (TW, GP, and GH) and three traits in the SoyNAM dataset (YD, PT, and GW) using the data fusion strategy that demonstrated the highest predictive accuracy (Supplemental Figure 10B).
Sensitivity to sample size: population size significantly affects prediction performance. To evaluate the influence of sample size on six models (RF_D, Lasso_D, SVM_D, XGBoost_D, LightGBM_D, and DNNGP_D), the 2000 wheat samples were partitioned into a testing set of 100 samples, a validation set of 100 samples (for the DL model), and 1800 training samples, with experiments repeated 100 times. The 1800 training samples were randomly subsampled into sets of 200, 500, 800, 1000, 1200, 1500, and 1800 (Supplemental Table 3), while the validation and testing sets remained constant throughout the study. This process generated various combinations of training, validation, and testing samples to explore the impact of different training sample sizes on prediction accuracy. For the SoyNAM dataset, the training sets consisted of 200, 400, 600, 800, and 1000 samples.
Sensitivity to SNP density: the number of SNPs significantly affects genomic prediction performance. To evaluate the influence of SNP density on model performance, SNPs were screened and selected using varying significance thresholds (p values: 0.1, 0.01, 0.001, 1e−4, and 1e−5) based on linear regression analyses of the three target traits in the wheat dataset. The SNPs were subsequently partitioned into five subsets based on these significance thresholds, and the effect of SNP density on model accuracy was analyzed (Supplemental Table 4). The total sample size was consistently maintained at 2000, with 1800 for training, 100 for validation, and 100 for testing.
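A sketch of this screening step: single-marker linear regressions yield one p value per SNP, and the significance thresholds then define nested SNP subsets. Genotypes are assumed coded 0/1/2, and the array names are hypothetical.

```python
import numpy as np
from scipy import stats

def snp_pvalues(G, y):
    """Single-marker linear regression: regress the trait on each SNP in
    turn and return the per-SNP p value for the slope."""
    return np.array([stats.linregress(G[:, j], y).pvalue
                     for j in range(G.shape[1])])

def snp_subsets(G, y, thresholds=(0.1, 0.01, 0.001, 1e-4, 1e-5)):
    """Partition SNPs into density subsets by significance threshold,
    returning the indices of SNPs passing each threshold."""
    p = snp_pvalues(G, y)
    return {t: np.where(p < t)[0] for t in thresholds}
```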
Sensitivity to phenotype: in phenotypic prediction, both the number of auxiliary traits and their correlation with target traits are crucial determinants of model performance. To assess the impact of phenotypic correlations on prediction accuracy, we incorporated one auxiliary trait into the model at a time, evaluating how the correlation between each auxiliary trait and the target trait affected prediction accuracy. To investigate the effect of the number of phenotypes on model performance, we used recursive feature elimination (Guyon et al., 2002) with the Wheat2000 dataset to determine the contribution of auxiliary traits to the prediction of the three target traits. Based on the ranking generated by recursive feature elimination, auxiliary traits were sequentially added to the model in ascending order of their contribution, allowing us to assess how the number of phenotypic traits influences prediction accuracy (Supplemental Tables 5 and 6).
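The ranking step can be illustrated with scikit-learn's RFE; the base estimator (here ordinary least squares) is an assumption, as the study does not specify one, and the array names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

def rank_auxiliary_traits(P_aux, y_target):
    """Rank auxiliary traits by their contribution to predicting the target
    trait, using recursive feature elimination with a linear base model.
    Returns trait indices ordered from most to least important."""
    rfe = RFE(LinearRegression(), n_features_to_select=1).fit(P_aux, y_target)
    # ranking_[j] == 1 for the last trait retained; larger values were
    # eliminated earlier, so sorting by ranking_ orders traits by importance.
    return np.argsort(rfe.ranking_)
```

Auxiliary traits can then be added to the fusion model one at a time following this ordering, as described above.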
Transferability analysis
Model transferability analysis using public datasets
To assess the transferability of the fused models across diverse environments, the best-performing model was applied to the SoyNAM dataset, which contains data collected from different years and locations. The impact of integrating multi-year, multi-location, and multi-environment training data on the model’s generalization capability was evaluated. Because the same samples were measured across different environments, we partitioned the training and test sets within each environment using 10-fold cross-validation to prevent data leakage in cross-environment predictions, thereby safeguarding prediction accuracy (Supplemental Figure 10C).
Transferability across years: data from a single year were used for training, followed by testing on both the same year and the cross-year validation (Supplemental Figure 11A). Additionally, training data from multiple years, excluding a specific year (Year A), were aggregated for cross-year testing on Year A (Supplemental Figure 11A).
Transferability across locations: data from a single location were used for training, followed by testing on both the same location and the cross-location validation (Supplemental Figure 11B). Additionally, training data from multiple locations, excluding a specific location (Location A), were aggregated for cross-location validation on Location A (Supplemental Figure 11B).
Transferability across environments: the dataset was partitioned into seven subsets based on location and year. A single subset was used for training, followed by testing on both the same environment and the cross-environment validation (Supplemental Figure 11C). Additionally, training data from multiple environmental subsets, excluding a specific subset (Environment Subset A), were aggregated for cross-environment validation on Environment Subset A (Supplemental Figure 11C).
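Because the same genotypes recur across environments, a leakage-free partition assigns each genotype to a single cross-validation fold that is shared by all environments, so no genotype appears in both training and test data. A minimal sketch under this assumption (the sample identifiers are hypothetical):

```python
import numpy as np

def environment_folds(sample_ids, n_splits=10, seed=0):
    """Assign each genotype to one CV fold, shared across all environments,
    so a genotype never appears in both training and test sets even when it
    was phenotyped in several environments (preventing data leakage)."""
    ids = np.unique(sample_ids)
    rng = np.random.default_rng(seed)
    # Randomly spread the unique genotypes across folds.
    fold_of = dict(zip(ids, rng.permutation(len(ids)) % n_splits))
    return np.array([fold_of[s] for s in sample_ids])
```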
Model transferability from public to proprietary datasets
To demonstrate the practical application of our models, we evaluated their transferability using a proprietary dataset comprising data collected over 2 distinct years. We designed our evaluation to test the model’s performance for a key trait: GW, using two distinct training and testing approaches.
First, the model was trained exclusively on data from one year. It was then tested both on data from the same year (to evaluate immediate performance) and on data from the other year (to assess cross-year transferability). This design enabled a direct evaluation of the model’s robustness to year-to-year variations in genetic and environmental factors.
Second, the model was trained using the combined data from both years. This approach was intended to incorporate a broader range of genetic and environmental variables, potentially enhancing the model’s generalization capabilities. The trained model was then tested using data from both 2022 and 2023. This step was critical for determining how well the model could predict traits across different years when trained on a multi-year dataset.
Data and code availability
The GPS scripts are available in the release package on GitHub (https://github.com/Jinlab-AiPhenomics/BioGPS). All public datasets used in this study are listed in the main text (Table 1).
Funding
This work was supported in part by the National Key Research and Development Program of China (2022YFD2300700), the Fundamental Research Funds for the Central Universities (YDZX2025021; KYT2024005; QTPY2025006), the Jiangsu Province Key Research and Development Program (BE2023369), the Natural Science Foundation of Jiangsu Province (BK20231469), the Hainan Yazhou Bay Seed Laboratory (B21HJ1005), the National Natural Science Foundation of China (32201656), the Sichuan Provincial Finance Department Project of China (1+3 ZYGG001), the JBGS Project of Seed Industry Revitalization in Jiangsu Province (JBGS [2021] 007), the Young Elite Scientists Sponsorship Program by CAST (YESS), the Science and Technology Innovation 2030–Major Project (2023ZD04034; 2023ZD0405605), the Zhongshan Biological Breeding Laboratory (ZSBBL-KY2023-03), and the Jiangsu Provincial Special Fund for Basic Research (Major Innovation Platform Plan) (BM2024005).
Acknowledgments
We would like to express our gratitude to Professor Wanneng Yang from Huazhong Agricultural University and Professor Huihui Li from the Chinese Academy of Sciences for their assistance in reviewing the manuscript. No conflict of interest is declared.
Author contributions
S.J. and H.W. designed the research; C.X. performed field experiments and collected data; H.W. analyzed the data; H.W. and S.J. wrote the manuscript. All authors contributed to revising the manuscript and approved the final version.
Published: June 11, 2025
Footnotes
Supplemental information is available at Plant Communications Online.