Dear Editor,
Machine learning (ML) is one of the key technologies for next-generation breeding, and “big data” is the cornerstone for development of ML algorithms that are applicable to crop breeding practices. Currently, there is a shortage of databases containing phenotype data and corresponding genomic data, i.e., genome-to-phenotype (G2P) paired data, that can be used in the development of ML algorithms for breeding. To fill this gap, we constructed a user-friendly database named the BreedingAIDB (http://ibi.zju.edu.cn/BreedingAIDB) to provide breeders and ML experts with easily accessible G2P paired data for crops, as well as ML tools.
Conventional breeding techniques are insufficient to address the challenges posed by climate change and the increasing global demand for agricultural products (Jägermeyr et al., 2021). Meeting these challenges demands reorientation of breeding methods toward next-generation technologies, approaches collectively known as Breeding 4.0 (Wallace et al., 2018). Central to this frontier is G2P prediction.
A major limitation to the use of ML in current crop breeding practices is the limited availability of G2P paired data, i.e., well-curated genomic data labeled with phenotypic data (Bayer and Edwards, 2021; Danilevicz et al., 2022). As shown in Figure 1A, building an accurate ML model requires combined expertise in feature engineering, feature selection algorithms or strategies, and models with well-tuned hyperparameters. Assessing the accuracy of a novel feature engineering or feature extraction algorithm requires properly labeled data. At the same time, ML models must be optimized on a large amount of high-quality labeled data to achieve highly accurate prediction. A database with both curated genome information and phenotype data would facilitate studies on the application of ML to G2P prediction and crop breeding practices; however, to the best of our knowledge, no such database is currently available.
Although there are multiple genomic databases for mining genetic variations linked to traits, their data formats do not meet the requirements of breeding ML research. Development of new coding methods, e.g., feature engineering, is one of the important directions for breeding ML research (Danilevicz et al., 2022). Feature engineering for breeding ML requires genomic information that is closer to the raw data, e.g., genomic variant call format (GVCF) files (Figure 1A). As far as we know, current databases mainly provide loci for genome-wide association studies and genotype matrices of populations. For example, CropGS-Hub provides genomic data in a single format, the genotype matrix, which is currently the predominant way of encoding genomes (Chen et al., 2024). Unlike genotype matrices of populations, individual GVCF and VCF files can support research in various areas of breeding ML. Construction of a large-scale database of high-quality G2P pairs is not an easy task.
First, the data aggregation demands significant time and effort, as vast quantities of data are currently located in isolated data silos and lack transparency (Bayer and Edwards, 2021). Second, data integration is complicated by inconsistencies in the naming of phenotypic traits and the use of different units for the same traits by different data generators, requiring significant manpower for data collation and standardization (Bayer and Edwards, 2021). In addition, genomic data are not directly applicable to the construction of ML models, necessitating preprocessing and transformation, which consume substantial time and computational resources (Bayer and Edwards, 2021; Danilevicz et al., 2022). Most importantly, for genome and phenotype data to be useful for ML, they must appear in pairs, i.e., G2P pairs. Unfortunately, this is not currently the case, although substantial genome resequencing data and a large body of phenotypic data are available for several crops.
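The harmonization and pairing steps described above can be illustrated with a minimal Python sketch. The synonym table, trait names, and function names below are hypothetical and chosen only for illustration; they are not the curation pipeline used for the BreedingAIDB.

```python
# Hypothetical synonym table: (raw trait name, raw unit) -> (canonical name, factor).
# The factor converts the raw value into the canonical unit (here millimetres).
TRAIT_SYNONYMS = {
    ("grain length", "mm"): ("grain_length_mm", 1.0),
    ("grain_len", "mm"): ("grain_length_mm", 1.0),
    ("GL", "cm"): ("grain_length_mm", 10.0),
}


def harmonize(records):
    """Map (sample, trait, unit, value) records onto canonical names and units."""
    harmonized = {}
    for sample, trait, unit, value in records:
        key = (trait, unit)
        if key not in TRAIT_SYNONYMS:
            # Unknown trait/unit combinations are skipped and left for manual curation.
            continue
        canonical, factor = TRAIT_SYNONYMS[key]
        harmonized[(sample, canonical)] = value * factor
    return harmonized


def pair_g2p(genotyped_samples, phenotypes):
    """Keep only phenotype values whose sample also has genomic data (a G2P pair)."""
    return {
        key: value
        for key, value in phenotypes.items()
        if key[0] in genotyped_samples
    }
```

In this sketch a phenotype record survives only if its trait can be mapped to a canonical name and unit and its sample also has genomic data, which mirrors why unpaired resequencing or phenotype data cannot be used directly for ML.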
Figure 1.
Summary and functions of the BreedingAIDB.
(A) Genomic data in the BreedingAIDB support the development of ML breeding.
(B) Summary of G2P paired data from the three crops in the BreedingAIDB.
(C) Downloading G2P paired data from the BreedingAIDB.
(D) Performance of ML models for rice grain width and length. R², coefficient of determination; MSE, mean-squared error; r, Pearson correlation coefficient.
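The three metrics named in the legend of Figure 1D follow their standard definitions and can be computed as in the minimal Python sketch below. The function names are illustrative assumptions and are not part of the BreedingAIDB.

```python
import math


def mse(y_true, y_pred):
    """Mean-squared error between observed and predicted phenotype values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

The metrics are complementary: predictions that are shifted by a constant offset still give r = 1, whereas the MSE and R² penalize the offset.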
To address this critical need and to provide a robust data foundation for G2P ML research, we have built the BreedingAIDB, a comprehensive database containing extensive G2P paired data from several crops (the 1.0 release includes rice, soybean, and maize) (Figure 1B). The database comprises 143 477 rice G2P pairs covering 41 distinct phenotypes, 284 395 soybean G2P pairs covering 114 phenotypes, and 12 654 maize G2P pairs covering 18 phenotypes.
We processed the genomic data into two forms of unstructured data and one form of structured data that are directly applicable to breeding ML (Figure 1A; Supplemental Data 1) (Poplin et al., 2018; Shen et al., 2023). The unstructured data consist of VCF and GVCF files. The VCF files contain high-quality SNP information generated by a strict genomic analysis process with common filtering criteria (Poplin et al., 2018). The GVCF files cover all genome loci, regardless of whether there is a genomic variant or not, thus offering a broader range of genomic information, e.g., insertions or deletions, copy-number variations, and so on, which can be represented by new feature engineering algorithms (Danilevicz et al., 2022). For structured data, we used GSCtool to generate GSCtool features, which can be input directly into the G2P model (Shen et al., 2023). For the phenotypic data, we followed the Crop Ontology, which was created by the Consultative Group on International Agricultural Research for standardization of important traits (Arnaud et al., 2020). Collected phenotypes that have corresponding trait ontology terms were annotated with those terms, and phenotypes lacking trait ontology terms were harmonized in terms of names and units. All three types of G2P paired data are accessible through the download function module of the BreedingAIDB (Figure 1C). Users can select their desired phenotype and obtain phenotype information and corresponding genomic data by clicking “export csv” or “export excel” (Figure 1C).
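To make the difference between per-sample VCF files and a population genotype matrix concrete, the following minimal Python sketch converts the GT fields of VCF data lines into alternate-allele dosages (0, 1, or 2 for diploid calls). This is the conventional additive encoding, not the GSCtool descriptor, and the function names are illustrative assumptions.

```python
def gt_to_dosage(gt):
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None  # missing genotype call
    # Any non-reference allele counts as 1, so multi-allelic sites are collapsed.
    return sum(1 for allele in alleles if allele != "0")


def vcf_to_matrix(lines):
    """Return (samples, variant_ids, matrix) with one matrix row per sample."""
    samples, variant_ids, columns = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos = fields[0], fields[1]
        gt_index = fields[8].split(":").index("GT")
        variant_ids.append(f"{chrom}:{pos}")
        columns.append(
            [gt_to_dosage(sample.split(":")[gt_index]) for sample in fields[9:]]
        )
    # Transpose from one list per variant to one list per sample.
    matrix = [list(row) for row in zip(*columns)]
    return samples, variant_ids, matrix
```

The encoding keeps only the allele count at variant sites, which illustrates why formats that retain more of the original information, such as GVCF files, leave more room for new feature engineering methods.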
In addition to G2P paired data, the BreedingAIDB offers three core functional modules: feature extraction, phenotype prediction, and ML project. The engine of the feature extraction module is GSCtool (Shen et al., 2023). The module currently supports the three crops included in the database: rice, maize, and soybean. By uploading variant detection results (VCF files containing SNP information) generated using IRGSP 1.0 (rice), B73 (maize), or Williams 82 (soybean) as the reference genome and selecting the applicable reference genome (Shen et al., 2023), users can easily complete the feature extraction process using GSCtool and obtain features for use in subsequent analyses and model construction. Currently, the phenotype prediction module offers predictions for grain length and width in rice. These two traits had the best prediction results, whereas the prediction accuracy of other traits was variable and lower. The GRU (gated recurrent unit) models for phenotype prediction were optimized from our previous work (Shen et al., 2023), achieving Pearson correlation coefficients of 0.76 and 0.82 for rice grain length and width, respectively (Figure 1D). Prediction can easily be performed by uploading GSCtool feature files that include sample names. The ML project module integrates Optuna and LightGBM to provide users with a highly customizable model optimization framework to meet their specific research needs (Ke et al., 2017; Akiba et al., 2019). It uses Optuna to search the hyperparameter space for the optimal LightGBM model configuration. The search involves multiple iterations of training and evaluation with different hyperparameter combinations to identify the best-performing model.
Users can upload extracted feature data generated by any feature representation algorithm, provided that the data include a label column named “labels.” All other columns are treated as feature information. Users can customize the hyperparameter search space to identify the model configuration that best suits their data. Once the data are uploaded, ML project immediately initiates model training and optimization. After training and optimization are complete, ML project returns an optimized LightGBM model with the best hyperparameter configuration found for G2P prediction. This module offers users significant flexibility in constructing LightGBM models with hyperparameters suited to various datasets and research questions.
We collected and processed genomic and phenotypic data from disparate data silos to construct the BreedingAIDB, a manually curated G2P paired database specifically designed to support the application of ML to G2P prediction and crop breeding practices. The ML tools of the BreedingAIDB not only perform phenotype-specific prediction tasks but can also help researchers build their own G2P models, thereby accelerating the progress of breeding research. The data and tools of the BreedingAIDB will facilitate interdisciplinary collaboration among researchers, advance areas such as Breeding 4.0, and expedite the integration of ML into breeding practices.
As the current database is still in preliminary form, we will continuously devote our efforts to updating the BreedingAIDB via data collection, model development, and integration of innovative tools. Specifically, we will continue to collect and process existing data for the three crops in the current database and further expand data collection to encompass a wider variety of crops. We will train or collect high-quality predictive models for various phenotypes of different crops, thus enabling breeders to carry out comprehensive phenotype prediction. In addition, we plan to develop auto-ML in the BreedingAIDB to simplify the application of ML to breeding practices. By providing high-quality data, user-friendly tools, and an ever-expanding library of models, the BreedingAIDB will facilitate interdisciplinary collaboration and advance the field of breeding ML.
Funding
This work was supported by the STI2030-Major Projects (2023ZD04076) and the Hainan Province Science and Technology Special Fund (ZDYF2022XDNY271).
Author contributions
C.-Y.Y. conceived the study. Z.S. collected and processed the data and built the database. E.S., K.Y., and Z.F. advised on the data analysis. Z.S. and C.-Y.Y. wrote the manuscript. Q.-H.Z. and L.F. edited the manuscript. All authors read and contributed to the manuscript.
Acknowledgments
All whole-genome sequencing analyses were performed at the Yazhou Bay Science and Technology City Advanced Computing Center. No conflict of interest is declared.
Published: April 3, 2024
Footnotes
Published by the Plant Communications Shanghai Editorial Office in association with Cell Press, an imprint of Elsevier Inc., on behalf of CSPB and CEMPS, CAS.
Supplemental information is available at Plant Communications Online.
References
- Akiba T., Sano S., Yanase T., Ohta T., Koyama M. Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.
- Arnaud E., Laporte M.A., Kim S., Aubert C., Leonelli S., Miro B., Cooper L., Jaiswal P., Kruseman G., Shrestha R., et al. The Ontologies Community of Practice: A CGIAR Initiative for Big Data in Agrifood Systems. Patterns. 2020;1:100105. doi: 10.1016/j.patter.2020.100105.
- Bayer P.E., Edwards D. Machine learning in agriculture: from silos to marketplaces. Plant Biotechnol. J. 2021;19:648–650. doi: 10.1111/pbi.13521.
- Chen J., Tan C., Zhu M., Zhang C., Wang Z., Ni X., Liu Y., Wei T., Wei X., Fang X., et al. CropGS-Hub: a comprehensive database of genotype and phenotype resources for genomic prediction in major crops. Nucleic Acids Res. 2024;52:D1519–D1529. doi: 10.1093/nar/gkad1062.
- Danilevicz M.F., Gill M., Anderson R., Batley J., Bennamoun M., Bayer P.E., Edwards D. Plant genotype to phenotype prediction using machine learning. Front. Genet. 2022;13:822173. doi: 10.3389/fgene.2022.822173.
- Jägermeyr J., Müller C., Ruane A.C., Elliott J., Balkovic J., Castillo O., Faye B., Foster I., Folberth C., Franke J.A., et al. Climate impacts on global agriculture emerge earlier in new generation of climate and crop models. Nat. Food. 2021;2:873–885. doi: 10.1038/s43016-021-00400-y.
- Ke G., Meng Q., Finley T., Wang T., Chen W., Ma W., Ye Q., Liu T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017;30.
- Poplin R., Ruano-Rubio V., DePristo M.A., Fennell T.J., Carneiro M.O., Van der Auwera G.A., Kling D.E., Gauthier L.D., Levy-Moonshine A., Roazen D., et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv. 2018. doi: 10.1101/201178.
- Shen Z., Shen E., Zhu Q.H., Fan L., Zou Q., Ye C.Y. GSCtool: A Novel Descriptor that Characterizes the Genome for Applying Machine Learning in Genomics. Adv. Intell. Syst. 2023;5. doi: 10.1002/aisy.202300426.
- Wallace J.G., Rodgers-Melnick E., Buckler E.S. On the Road to Breeding 4.0: Unraveling the Good, the Bad, and the Boring of Crop Quantitative Genomics. Annu. Rev. Genet. 2018;52:421–444. doi: 10.1146/annurev-genet-120116-024846.