Abstract
Objectives
The Genomes to Fields (G2F) 2022 Maize Genotype by Environment (GxE) Prediction Competition aimed to develop models for predicting grain yield for the 2022 Maize GxE project field trials, leveraging the datasets previously generated by this project and other publicly available data.
Data description
This resource used data from the Maize GxE project within the G2F Initiative [1]. The dataset included phenotypic and genotypic data for the hybrids evaluated at 45 locations from 2014 to 2022, along with soil, weather, and environmental covariate data and metadata for all environments (an environment being a combination of year and location). Competitors also had access to ReadMe files describing all the files provided. The Maize GxE project is collaborative, and all the data it generates become publicly available [2]. The dataset used in the 2022 Prediction Competition was curated and lightly filtered for quality and to ensure naming uniformity across years.
Keywords: Grain yield, Maize, Root mean squared error
Objective
The Maize GxE project is a collaborative effort that involves researchers from diverse areas of study. The datasets collected by the project are among the largest public datasets of their kind and are therefore of broad interest to communities from genetics to agronomy to computer science and beyond. The competition was organized to connect these communities and others with an interest in dissecting and exploring genotypic, environmental, and GxE information to predict hybrid maize performance in different environments across the US. The competition started on November 15, 2022, and ended on January 15, 2023. All participants had access to the same curated dataset, containing information collected on over 180,000 maize field plots and involving 4,683 hybrids. Participants were asked to create predictive models for maize grain yield for the 2022 Maize GxE project field trials, using the existing Maize GxE project dataset and any other publicly available data. The trait of interest was grain yield, and competitors were asked to submit absolute grain yield (Mg ha−1) adjusted to 15.5% moisture for each hybrid in each location where data had been collected during the 2022 field season. The winner of the competition was the model with the lowest average root mean squared error (RMSE) across locations when compared with the actual yield data obtained in 2022.
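The scoring rule described above (RMSE computed within each location, then averaged across locations) can be sketched as follows. This is a minimal illustration and not the organizers' scoring script: the function name, the tuple layout, the location labels, and the yield values are all hypothetical, and skipping records with a missing value is an assumption rather than a documented rule.

```python
import math
from collections import defaultdict

def mean_rmse_by_location(records):
    """records: iterable of (location, observed, predicted) tuples.

    Returns (mean of per-location RMSE, dict of RMSE per location).
    """
    squared_errors = defaultdict(list)
    for location, observed, predicted in records:
        # Assumption: records with a missing value are skipped.
        if observed is None or predicted is None:
            continue
        squared_errors[location].append((observed - predicted) ** 2)
    # RMSE is computed separately inside each location...
    per_location = {
        loc: math.sqrt(sum(errs) / len(errs))
        for loc, errs in squared_errors.items()
    }
    # ...and the final score is the unweighted mean over locations.
    score = sum(per_location.values()) / len(per_location)
    return score, per_location

# Hypothetical yields in Mg/ha, for illustration only.
example = [
    ("LOC_A", 10.0, 11.0),
    ("LOC_A", 12.0, 11.0),
    ("LOC_B", 8.0, 8.0),
    ("LOC_B", 9.0, 12.0),
]
score, per_location = mean_rmse_by_location(example)
```

Because the mean is taken over locations rather than over plots, a location with few hybrids contributes as much to the score as a location with many.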
Data description
The Prediction Competition data are publicly available via CyVerse/iPlant. This dataset contains training and testing set data and has been structured according to the specifications outlined in Table 1.
Table 1.
Overview of Genomes to Fields 2022 Maize Genotype by Environment Prediction Competition data files
Label | Name of data file | File types (Extension) | Data repository and identifier |
---|---|---|---|
Data file 1 | readme.txt | .txt | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 2 | COMPETITION_DATA_README.docx | .docx | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 3 | 1_Training_Trait_Data_2014_2021.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 4 | 2_Training_Meta_Data_2014_2021.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 5 | 3_Training_Soil_Data_2015_2021.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 6 | 4_Training_Weather_Data_2014_2021.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 7 | 5_Genotype_Data_All_Years.vcf.zip | .vcf | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 8 | 6_Training_EC_Data_2014_2021.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 9 | All_hybrid_names_info.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 10 | GenoDataSources.txt | .txt | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 11 | GenoDataSourcesWithUpdatedBioProject.txt | .txt | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 12 | 1_Submission_Template_2022.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 13 | 2_Testing_Meta_Data_2022.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 14 | 3_Testing_Soil_Data_2022.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 15 | 4_Testing_Weather_Data_2022.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 16 | 6_Testing_EC_Data_2022.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Data file 17 | Test_Set_Observed_Values_ANSWER.csv | .csv | CyVerse (10.25739/tq5e-ak26) [3] |
Training_data: includes phenotypic, genotypic, soil, weather (downloaded from https://power.larc.nasa.gov), environmental covariate data, and metadata information from 2014 to 2021 for use in developing and training models.
Testing_data: includes genotypic, soil, weather, environmental covariate data, and metadata information for 2022 locations, together with a submission template listing the environments and hybrids for which participants submitted yield predictions.
Maize is cultivated as a hybrid crop, typically resulting from the cross of two inbred parents. Consequently, the phenotypic data in both the training and testing sets are recorded at the hybrid level. The genotypic data likewise describe hybrids, with hybrid genotypes generated in silico from the genotypes of the inbred parents.
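To illustrate what generating a hybrid genotype in silico can mean, the sketch below combines two inbred parents into a hybrid at each biallelic marker. The source does not describe the procedure actually used, so this is an assumption: parents are coded as alternate-allele dosages (0, 1, or 2), each parent passes on half of its dosage, and a missing parental call makes the hybrid call missing. The function name and the example genotypes are hypothetical.

```python
def hybrid_dosage(parent1, parent2):
    """Expected alternate-allele dosage of a hybrid at each marker.

    parent1, parent2: equal-length lists of dosages (0, 1, 2) or None.
    A fully inbred parent is homozygous, so it is coded 0 or 2.
    """
    if len(parent1) != len(parent2):
        raise ValueError("parents must be genotyped at the same markers")
    hybrid = []
    for a, b in zip(parent1, parent2):
        if a is None or b is None:
            # Assumption: a missing parental call is propagated, not imputed.
            hybrid.append(None)
        else:
            # Each parent transmits one allele, i.e. half of its dosage.
            hybrid.append((a + b) / 2)
    return hybrid

# Hypothetical inbred genotypes at four markers.
inbred_1 = [0, 2, 2, None]
inbred_2 = [2, 2, 0, 0]
f1 = hybrid_dosage(inbred_1, inbred_2)
```

Under this coding, a cross between parents that are homozygous for different alleles yields a heterozygous hybrid (dosage 1.0), which is why hybrid genotypes can be inferred without genotyping the hybrids themselves.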
Limitations
These datasets contain missing data. When working with large agricultural datasets, missing data is a common occurrence due to various factors such as data collection limitations, measurement errors, plot losses, and environmental events. The genotypic data provided contains hybrid information derived from inbred genotypic data, a common practice. However, depending on the study goals, this may pose limitations for specific types of analysis. In instances where precise GPS coordinates were not available for certain environments (i.e., a location in a particular year), field coordinates were estimated. Depending on the research objective, the unavailability of accurate GPS coordinates could impact the reliability of the results.
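Before modeling, it can help to quantify how much of each variable is missing. The sketch below reports the fraction of missing values per column of a CSV table using only the Python standard library. The column names, the sample rows, and the set of strings treated as missing are hypothetical and should be adapted after consulting the ReadMe files.

```python
import csv
import io

# Assumption: these strings mark a missing value in the file.
MISSING_TOKENS = {"", "NA", "NaN"}

def missing_fraction(csv_text):
    """Return {column name: fraction of rows where the value is missing}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in counts:
            value = row[name]
            # DictReader fills short rows with None.
            if value is None or value.strip() in MISSING_TOKENS:
                counts[name] += 1
    return {
        name: (count / n_rows if n_rows else 0.0)
        for name, count in counts.items()
    }

# Hypothetical table, for illustration only.
sample = """Env,Hybrid,Yield_Mg_ha
ENV_A,H1,10.2
ENV_A,H2,NA
ENV_B,H1,
ENV_B,H2,9.8
"""
fractions = missing_fraction(sample)
```

For a real file, the same function can be applied to the text returned by reading the file from disk.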
Acknowledgements
We gratefully acknowledge contributions from National Corn Growers Association, Iowa Corn Promotion Board, and USDA-ARS. The weather data was obtained from the National Aeronautics and Space Administration (NASA) Langley Research Center (LaRC) Prediction of Worldwide Energy Resource (POWER) Project funded through the NASA Earth Science/Applied Science Program.
Abbreviations
- G2F: Genomes to Fields
- GxE: Genotype by Environment
Author contributions
DCL, JDW, JIV, QC, JLG, MCR, JH, DE, MLC, FMA, GDLC, SK, TB, MB, EB, JE, SFG, MAG, CNH, JEK, JM, RM, SCM, OAO, JCS, RSS, MPS, EES, AT, MT, JW, TW, WX, NDL were responsible for advising on data collection methods, collecting the data, reviewing data collection and curation methods, and the resulting datasets for the 2022 season. DCL, JDW, JIV, QC, JLG, MCR, JH, DE, NDL organized the Genomes to Fields (G2F) 2022 Maize Genotype by Environment Prediction Competition.
Funding
We gratefully acknowledge support from: National Corn Growers Association, Iowa Corn Promotion Board, and USDA-ARS.
Data Availability
The data described in this Data note can be freely and openly accessed on CyVerse under 10.25739/tq5e-ak26 [3]. Please see Table 1 for details and links to the data.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Dayane Cristina Lima, Email: dclima@wisc.edu
Jacob D. Washburn, Email: jacob.washburn@usda.gov
José Ignacio Varela, Email: jvarela@wisc.edu
Qiuyue Chen, Email: qchen29@ncsu.edu
Joseph L. Gage, Email: jlgage@ncsu.edu
Maria Cinta Romay, Email: mcr72@cornell.edu
James Holland, Email: Jim.Holland@usda.gov
David Ertl, Email: dertl@iowacorn.org
Marco Lopez-Cruz, Email: lopezcru@msu.edu
Fernando M. Aguate, Email: fmaguate@gmail.com
Gustavo de los Campos, Email: gustavoc@msu.edu
Shawn Kaeppler, Email: smkaeppl@wisc.edu
Timothy Beissinger, Email: beissinger@gwdg.de
Martin Bohn, Email: mbohn@illinois.edu
Edward Buckler, Email: esb33@cornell.edu
Jode Edwards, Email: Jode.edwards@usda.gov
Sherry Flint-Garcia, Email: sherry.flint-garcia@usda.gov
Michael A. Gore, Email: mag87@cornell.edu
Candice N. Hirsch, Email: cnhirsch@umn.edu
Joseph E. Knoll, Email: Joe.Knoll@usda.gov
John McKay, Email: john.mckay@colostate.edu
Richard Minyo, Email: minyo.1@osu.edu
Seth C. Murray, Email: sethmurray@tamu.edu
Osler A. Ortez, Email: ortez.5@osu.edu
James C. Schnable, Email: schnable@unl.edu
Rajandeep S. Sekhon, Email: sekhon@clemson.edu
Maninder P. Singh, Email: msingh@msu.edu
Erin E. Sparks, Email: esparks@udel.edu
Addie Thompson, Email: thom1718@msu.edu
Mitchell Tuinstra, Email: mtuinstr@purdue.edu
Jason Wallace, Email: jason.wallace@uga.edu
Teclemariam Weldekidan, Email: tecle@udel.edu
Wenwei Xu, Email: wxu@ag.tamu.edu
Natalia de Leon, Email: ndeleongatti@wisc.edu
References
- 1. Genomes to Fields. 2023. https://www.genomes2fields.org.
- 2. Genomes to Fields resources. 2023. https://www.genomes2fields.org/resources.
- 3. G2F Consortium. Genomes to Fields 2022 Maize Genotype by Environment Prediction Competition. CyVerse Data Commons. 2023. 10.25739/tq5e-ak26.