PLOS ONE. 2020 Jun 19;15(6):e0233951. doi: 10.1371/journal.pone.0233951

Predicting biomass of rice with intermediate traits: Modeling method combining crop growth models and genomic prediction models

Yusuke Toda 1, Hitomi Wakatsuki 2, Toru Aoike 1, Hiromi Kajiya-Kanegae 3, Masanori Yamasaki 4, Takuma Yoshioka 4, Kaworu Ebana 5, Takeshi Hayashi 6, Hiroshi Nakagawa 3, Toshihiro Hasegawa 7, Hiroyoshi Iwata 1,*
Editor: Lewis Lukens 8
PMCID: PMC7304626  PMID: 32559220

Abstract

Genomic prediction (GP) is expected to become a powerful technology for accelerating the genetic improvement of complex crop traits. Several GP models have been proposed to enhance their applications in plant breeding, including environmental effects and genotype-by-environment interactions (G×E). In this study, we proposed a two-step model for plant biomass prediction wherein environmental information and growth-related traits were considered. First, the growth-related traits were predicted by GP. Second, the biomass was predicted from the GP-predicted values and environmental data using machine learning or crop growth modeling. We applied the model to a 2-year field trial dataset of recombinant inbred lines of japonica rice and evaluated the prediction accuracy with training and testing data by cross-validation performed over two years. The proposed model achieved an equivalent or a higher correlation between the observed and predicted values (0.53 and 0.65 for each year, respectively) than the model in which biomass was directly predicted by GP (0.40 and 0.65 for each year, respectively). This result indicated that including growth-related traits enhanced the accuracy of biomass prediction. Our findings are expected to contribute to the spread of the use of GP in crop breeding by enabling more precise prediction of environmental effects on crop traits.

Introduction

Genomic selection (GS) [1] is a novel method increasingly being used in plant and animal breeding. Meuwissen et al. proposed the use of genomic prediction (GP) to predict genotypic values (or breeding values) of selection candidates from whole-genome marker genotypes and a statistical model [1]. GP enables the prediction of genotypic values of a target trait without information about its causal genes, even when the target trait is controlled by a number of genes with complex interactions. Recent falls in the cost of genotyping high-density genome-wide markers have inspired the increased use of GP in animal breeding [2] and plant breeding [3–5]. Because phenotypic values predicted by GP can be used as alternatives to phenotypic values observed in field trials, GP can accelerate breeding by skipping field experiments for selection, and thus is expected to increase selection gains per unit time [6].

Because environmental effects, i.e., the main effects of the environment and of the genotype-by-environment interaction (G×E), are generally not trivial in plant breeding, the use of GP models without consideration of these effects can cause difficulties in the application of GP to yield-related traits, which can be strongly influenced by these effects [7]. Several methods have been proposed to consider environmental effects, including the modeling of covariance between genotype and environment [8,9], consideration of marker-by-environment interactions [10], and inclusion of environmental covariates [11]. Moreover, a GP model that can take environmental effects into account will benefit the application of GS in plant breeding because it will lead to more accurate predictions of genetic values for yield-related traits under a target environment and thus to a higher genetic gain per cycle [6].

Crop growth models (CGMs) are expected to be an important tool for plant breeding because they incorporate environmental effects into the GP framework [12,13]. For example, Heslot et al. [14] used a CGM to select the environmental covariates that were included in a GP model. Technow et al. [15] proposed a method for integrating a CGM and a GP model with approximate Bayesian computation, and Cooper et al. [16] applied the method to maize data. However, the models in these studies attained only a small improvement in accuracy when applied to real data. One of the reasons for the small improvement may be the difficulty in parameter estimation of CGMs. The accurate estimation of CGM parameters is difficult when it is based only on observations of a target trait. In other words, the accuracy may be improved when observations of traits related to the target trait are included in the parameter estimation of the CGM.

Growth-related traits may be good candidates for improving the prediction accuracy of target traits. Several studies have used growth-related traits with multi-trait GP models to improve the prediction accuracy of target traits [17,18], suggesting that growth-related traits convey precise growth details and provide useful information for target trait prediction. To date, no study has used growth-related traits for CGM and GP integration. In this study, we proposed a method to use the phenotypic data of growth-related traits in the integrated models of GP and CGM. This method has two steps. First, the growth-related traits are treated as “intermediate traits” and are predicted by GP. Second, the target traits are predicted from the predicted values of the “intermediate traits” and environmental data using a CGM. By dividing the model into two steps that correspond to GP and CGM, the “intermediate traits” can be naturally included in the model without complex statistical modeling of the relation between GP and CGM.

To validate this integrated model, rice is a suitable research species because there have been previous studies of the application of GP [19–24] and CGMs [25–27], such as SIMRIW [28] and CERES-rice [29]. However, attempts to integrate these methods to predict phenotypic variations in rice have been lacking, with some exceptions [30]. Biomass is also a suitable trait for validation. Biomass is a direct target of breeding for biofuel rice [31,32] and is an important component of grain yield [33,34].

In this study, we developed models to predict the biomass of rice, in which the observed phenotypic data of growth-related traits, whole-genome marker genotype, and environmental data were used. The model comprised two steps wherein the intermediate traits were predicted with GP in the first step and biomass was predicted from the predicted values of the intermediate traits in the second step. Among the intermediate traits, heading date was the exception: it was predicted using a development rate (DVR) model based on the data obtained from multi-environmental trials (METs) and the genotypes of heading-date-related markers. Additionally, in the second step, we evaluated the potential of a “black box”-type machine-learning model, in which a detailed model structure was not defined a priori, as a substitute for the CGM.

These models were validated with a recombinant inbred line (RIL) population of japonica rice for biomass prediction. We conducted 2-year field experiments of the population. The experiments were conducted with different timings of sowing (and planting) between both years to evaluate the potential of the models under different environments. The difference in sowing (and planting) dates was about one month, and this caused different phenological developments of the plants between those two years. Finally, the models were evaluated for their accuracy of biomass prediction within the experiments (using the same-year experiment for training and validation) and between the experiments (using one year’s experiment for training and the other year’s experiment for validation).

Materials and methods

Plant materials

We evaluated 123 RILs derived from a cross between two japonica cultivars, Koshihikari and Kinmaze, together with both parental lines. The RILs were in the F8 generation in 2014 and in the F9 generation in 2015. Because Kinmaze and Koshihikari have different growth patterns and plant structure, these RILs were expected to be suitable for analyzing genetic variations observed in growth differences. In 2014 and 2015, experiments were conducted in an experimental paddy field of the National Agriculture and Food Research Organization, Tsukuba, Ibaraki, Japan (36°01′ N, 140°06′ E, 22 m above sea level). Sowing and transplanting were performed in different months between years to produce results under different conditions of day length and temperature; we sowed seeds on 22 April 2014 and 19 May 2015 and transplanted seedlings into the field on 20 May 2014 and 18 June 2015. Because of different cultivation periods during 2014 and 2015, the 2-year experiments were not simply yearly replications but were expected to induce different growth patterns under different environmental conditions. Plants were transplanted 15 cm apart in rows 30 cm apart in plots. We transplanted two seedlings per hill. The area for each line per replicate was 60 cm × 105 cm (2 rows × 7 hills). Inorganic fertilizer (80–100–100 kg of N-P2O5-K2O ha−1) was applied to the field. Aboveground plant organs were harvested to determine biomass at physiological maturity, which spanned from 29 August to 10 October in 2014 and from 17 September to 5 November in 2015 depending on variation among lines. Aboveground dry matter weight was used as biomass.

We recorded leaf age and number of tillers on each of several dates to evaluate variations in the growth pattern of the RILs (Table 1). The leaf age is calculated using the following formula [35]:

$$\text{Leaf age} = \text{Number of developed leaves} + \frac{\text{Length of the developing leaf}}{\text{Final length of the developing leaf}}$$

Table 1. Dates of observation of leaf age and number of tillers.

| Year | Sowing date | Trait | Dates of observation |
|------|-------------|-------|----------------------|
| 2014 | 22-Apr | Leaf age | 5/19, 6/2, 6/9, 6/16, 6/23, 6/30, 7/7, 7/14, 7/22, 7/28, 8/4 |
| 2014 | 22-Apr | Number of tillers | 6/9, 6/16, 6/23 |
| 2015 | 19-May | Leaf age | 6/15, 6/25, 7/2, 7/9, 7/16, 7/23, 7/30, 8/5, 8/10, 8/17, 8/24, 8/31 |
| 2015 | 19-May | Number of tillers | 6/25, 7/2, 7/9, 7/16, 7/23 |

We used leaf age instead of leaf number to treat the development of leaves as continuous values. The maximum tiller number was determined on the basis of measurements of the tiller number observed at three and five different time points in 2014 and 2015, respectively. The measurements were continued until the leaf number on the main culm reached 11 or more. This was because our preliminary experiments with nine diverse cultivars suggested that the tiller number reached its maximum before 11 leaves were observed.

Length of the fully expanded leaf blades was measured for the 5th leaf, 11th leaf, flag leaf and 2 leaves below the flag leaf. According to our preliminary study, the final length of the leaf blade on the main culm increased almost linearly with the leaf age from 5 to 11. The increment in the final length per leaf age (ΔLL) was derived from the 5th and 11th leaves. Leaf age, number of tillers and leaf blade length on the main culm were recorded for two plants per entry for each replicate. Heading date and biomass were recorded on 6 plants per entry.

We used a method similar to [36] for the genotyping of RILs by extracting DNA from bulked seedlings of each F7 line (corresponding to the F6 generation) via a CTAB-based extraction method [37]. We used single-nucleotide polymorphism (SNP) markers for the linkage map construction, and a total of 703 SNPs were selected from genome-wide SNP data [38,39] and analyzed using a BeadStation 500G system (Illumina, CA, USA) according to the manufacturer’s instructions. Using R software [40] and the R/qtl package [41], we removed SNPs with identical genotypes with the findDupMarkers function. Finally, a total of 315 SNPs were used for the genotyping of RILs (S1 Fig).

Air temperature and solar radiation were recorded on-site (available at http://www.naro.affrc.go.jp/org/niaes/aws/). Photosynthetically active radiation (PAR) was estimated from the solar radiation assuming that the proportion of PAR to the global solar radiation is 0.5 [42]. Daily mean temperatures are shown in Fig 1.

Fig 1. Environmental data during growing season.

Daily mean temperature, theoretical day length and photosynthetically active radiation (PAR) of Tsukuba under field trial of RILs are shown. Data in both 2014 and 2015 are expressed as solid and dotted lines, respectively.

Because the RIL population was cultivated in only one field, it was difficult to estimate model parameters for heading date in CGM. To obtain the model parameters, we used heading dates recorded in METs that tested 112 cultivars, including Kinmaze and Koshihikari, most of which were developed in Japan. METs were conducted in six locations in several years (33 trials, Table 2).

Table 2. Location, year, and number of replications of field experiments to record heading date.

Location 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Daisen, Akita 1 1 1
Tsukuba, Ibaraki 1 1 2 3 3 1 1
Tsukubamirai, Ibaraki 2 2
Kasai, Hyogo 1 1 1 1 1 1 1 1 1
Fukuyama, Hiroshima 1 1 1 1 2 2 2 2 2
Fukuoka, Fukuoka 1 1 1

Genetic analysis of observed traits

All statistical analyses were conducted in R software [40]. The arithmetic means of observed values were used as phenotypic values for each RIL in the following analysis. The number of replications for each trait was described in the previous section. Analysis of variance (ANOVA) was conducted to evaluate the significance of genotype and environmental effects and their interaction.

We evaluated the accuracy of GP of all traits with 10-fold cross-validation. For building the GP models, we employed four methods. Two of them were regularized regression: ridge regression (RR) and LASSO, and the other two were Gaussian process regression (reviewed by [43]): one based on an additive relationship matrix (GBLUP) and the other on a Gaussian kernel matrix (RKHS) as a representative of covariance matrix. We used the “glmnet” package [44] for RR and LASSO, and the “rrBLUP” package [45] for GBLUP and RKHS. The narrow-sense heritability of each trait was estimated using a mixed model based on an additive relationship matrix in GBLUP.

Growth process analysis

We analyzed the change in leaf age and the number of tillers during growth as a simple function of heat units (accumulated daily mean temperature). The leaf age and the number of tillers on the ith day from sowing (Leafi, Tilli, dimensionless) were represented as:

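$$\mathrm{Leaf}_i = \min(\Delta\mathrm{Leaf} \times \mathrm{HU}_i,\ \mathrm{Leaf}_{\mathrm{MAX}}) \tag{1}$$
$$\mathrm{Till}_i = \begin{cases} 1 & (\mathrm{HU}_i \le 800) \\ \min(\Delta\mathrm{Till} \times \mathrm{HU}_i,\ \mathrm{Till}_{\mathrm{MAX}}) & (\mathrm{HU}_i > 800) \end{cases} \tag{2}$$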

where HUi (°C) represents the heat unit (∑ daily mean temperature from emergence to the ith date); ΔLeaf (°C−1) and ΔTill (°C−1) represent the rate of change per HU; and LeafMAX and TillMAX represent maximum values. Because we observed the growth of each line, ΔLeaf and ΔTill were estimated as slopes of linear regression of phenotypic data during the study period, whereas LeafMAX and TillMAX were measured at the end of the growth period. Although leaf age and number of tillers are generally not considered linear in HU, we assumed that their growth could be approximated by a combination of linear functions.
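As an illustration of Eqs (1) and (2), the piecewise-linear growth functions could be written as below. This is a minimal sketch, not the authors' code (the analyses in the paper were run in R); the function names and the numerical parameter values in the usage lines are hypothetical and chosen only to show the behavior.

```python
def leaf_age(hu, d_leaf, leaf_max):
    """Eq (1): leaf age grows linearly with heat units until it reaches leaf_max."""
    return min(d_leaf * hu, leaf_max)


def tiller_number(hu, d_till, till_max, hu_threshold=800.0):
    """Eq (2): a single culm up to 800 heat units, then linear growth capped at till_max."""
    if hu <= hu_threshold:
        return 1.0
    return min(d_till * hu, till_max)


# Hypothetical parameter values for one line (not taken from the paper).
D_LEAF, LEAF_MAX = 0.008, 14.0   # per degree C, dimensionless
D_TILL, TILL_MAX = 0.01, 12.0    # per degree C, dimensionless

for hu in (500.0, 1000.0, 2000.0):
    print(hu, leaf_age(hu, D_LEAF, LEAF_MAX), tiller_number(hu, D_TILL, TILL_MAX))
```

In this form each line is described by only four numbers (ΔLeaf, LeafMAX, ΔTill, TillMAX), which is what makes them usable as GP targets in the first step of the model.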

Generally, the growth of rice does not proceed when the daily temperature is low. To take this into consideration, we also developed growth models of leaf age and number of tillers based on a heat unit that accounts for the base temperature of rice growth (∑max(0, daily mean temperature − 8°C)) instead of the simple heat unit. The lower bound of temperature was obtained from [42]. However, the prediction accuracy of these models was similar to, or even lower than, that of the models based on the simple heat unit. Thus, we present only the results based on the simple heat unit in this paper.
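The two heat-unit definitions compared above differ only in the daily increment. The sketch below shows both; the function name and the four temperatures are made up for illustration, and only the 8°C base temperature comes from the text.

```python
from itertools import accumulate


def heat_units(daily_mean_temps, base_temp=None):
    """Cumulative heat units (degree C) for each day of a temperature series.

    base_temp=None gives the simple heat unit (plain sum of daily means);
    a numeric base_temp gives sum(max(0, T - base_temp)).
    """
    if base_temp is None:
        increments = list(daily_mean_temps)
    else:
        increments = [max(0.0, t - base_temp) for t in daily_mean_temps]
    return list(accumulate(increments))


temps = [6.0, 12.0, 20.0, 25.0]              # made-up daily mean temperatures
print(heat_units(temps))                      # simple heat unit
print(heat_units(temps, base_temp=8.0))       # heat unit above the 8 degree C base
```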

Prediction of heading date by DVR model

To predict heading date in a target environment, we used Yin et al.’s model [46] modified by Nakagawa et al. [47], which describes daily developmental rate (DVR) as a function of environmental factors (DVR model, hereafter). In the DVR model, daily progress of a developmental stage is expressed as a continuous value representing developmental stage (DVS), ranging from 0 (emergence) to 1 (heading). The DVS at the nth day after emergence is the sum of the daily development rates (DVRi):

$$\mathrm{DVS}_n = \sum_{i=1}^{n} \mathrm{DVR}_i \tag{3}$$

where DVRi is given by daily mean temperature (Ti, °C) and day length (Pi, h):

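$$\mathrm{DVR}_i = \begin{cases} \dfrac{f(T_i)^{\alpha}\, g(P_i)^{\beta}}{G} & (\text{if } 0.145 + 0.005G \le \mathrm{DVS} \le 0.345 + 0.005G) \\ \dfrac{f(T_i)^{\alpha}}{G} & (\text{if } \mathrm{DVS} < 0.145 + 0.005G,\ 0.345 + 0.005G < \mathrm{DVS}) \end{cases} \tag{4}$$
$$f(T_i) = \begin{cases} \dfrac{T_i - T_b}{T_o - T_b} \left( \dfrac{T_c - T_i}{T_c - T_o} \right)^{\frac{T_c - T_o}{T_o - T_b}} & (\text{if } T_b \le T_i \le T_c) \\ 0 & (\text{if } T_i < T_b,\ T_c < T_i) \end{cases} \tag{5}$$
$$g(P_i) = \begin{cases} \dfrac{P_i - P_b}{P_o - P_b} \left( \dfrac{P_c - P_i}{P_c - P_o} \right)^{\frac{P_c - P_o}{P_o - P_b}} & (\text{if } P_i \ge P_o) \\ 1 & (\text{if } P_i < P_o) \end{cases} \tag{6}$$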

Six parameters were fixed (Tb = 8°C, To = 30°C, Tc = 42°C, Pb = 0 h, Po = 10 h, Pc = 24 h) among lines as in [46]. The parameters α, β, and G represent sensitivity to temperature, sensitivity to day length, and growth period, respectively, and are assumed to have specific values for each line. We estimated the values from the MET data using particle swarm optimization [48], which is used to optimize non-linear functions (Experiment A in Fig 2).
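To make the DVR model concrete, the following sketch accumulates the daily development rate until DVS reaches 1. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the photoperiod-sensitive window is tested against the DVS accumulated up to the previous day, and the parameter values in the usage lines (α = β = 1, G = 64) are arbitrary. Only the six fixed constants are taken from the text.

```python
T_B, T_O, T_C = 8.0, 30.0, 42.0    # base, optimum, ceiling temperature (degree C)
P_B, P_O, P_C = 0.0, 10.0, 24.0    # base, optimum, ceiling day length (h)


def beta_response(x, x_b, x_o, x_c):
    """Shared functional form of Eqs (5) and (6); equals 1 at the optimum x_o."""
    shape = (x_c - x_o) / (x_o - x_b)
    return ((x - x_b) / (x_o - x_b)) * ((x_c - x) / (x_c - x_o)) ** shape


def f_temp(t):
    """Eq (5): temperature response, zero outside [T_B, T_C]."""
    if t < T_B or t > T_C:
        return 0.0
    return beta_response(t, T_B, T_O, T_C)


def g_photo(p):
    """Eq (6): day-length response, no delay below the optimum day length."""
    if p < P_O:
        return 1.0
    return beta_response(p, P_B, P_O, P_C)


def dvr(t, p, dvs, alpha, beta, g):
    """Eq (4): daily development rate; day length matters only in the sensitive phase."""
    rate = f_temp(t) ** alpha
    if 0.145 + 0.005 * g <= dvs <= 0.345 + 0.005 * g:
        rate *= g_photo(p) ** beta
    return rate / g


def days_to_heading(temps, day_lengths, alpha, beta, g):
    """Eq (3): first day after emergence on which DVS reaches 1 (None if never)."""
    dvs = 0.0
    for day, (t, p) in enumerate(zip(temps, day_lengths), start=1):
        dvs += dvr(t, p, dvs, alpha, beta, g)   # assumption: window uses previous-day DVS
        if dvs >= 1.0:
            return day
    return None


# Arbitrary illustration: constant optimum temperature, short vs. long days.
print(days_to_heading([30.0] * 200, [9.0] * 200, alpha=1.0, beta=1.0, g=64.0))
print(days_to_heading([30.0] * 200, [14.0] * 200, alpha=1.0, beta=1.0, g=64.0))
```

Under optimum temperature and short days the rate is 1/G per day, so G acts as the minimum number of days to heading; longer days slow development only while DVS is inside the sensitive window.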

Fig 2. Flow chart of model structures.

Experiment A: Process for estimating values of three parameters (α, β, G) related to heading date. Multi-environment trial data of heading date of 112 lines were used to model the relationship between parameter values and marker genotypes of heading-date-related genes using Extreme Learning Machine (ELM). Experiment B: Structure of conventional genomic prediction (GP), integrated CGM (IntCGM), and integrated machine-learning (IntML) models.

To calculate the values of α, β, and G of the target RILs, we constructed models to predict them from marker genotypes (Experiment A in Fig 2) of six heading-date–related genes (Hd1, Hd3a, Hd6, Hd16, Hd17, and Ghd7) [49–54] of 112 lines. We used Extreme Learning Machine (ELM) [55], which is a machine learning method based on a neural network with advantages in generalization performance and learning speed, to model the relationships between the parameter values and the marker genotypes. After modeling these relationships, we estimated the values of α, β, and G of the RILs by using the ELM prediction model (Experiment B in Fig 2). The marker genotypes of the heading-date–related genes of the RILs were assumed to be the same as those of the SNP nearest to the genes, and were used as inputs of the ELM model.

Integrated GP–CGM model

We included environmental effects in the model of yield-related traits by integrating the GP models and a CGM proposed by [42], with modifications, to create an integrated CGM (IntCGM).

IntCGM has two steps (Experiment B in Fig 2). First, the GP and DVR models predict “intermediate traits” related to biomass. LASSO was selected as a representative GP model because it showed the highest accuracy among all the GP models in 10 of 14 traits (i.e., six intermediate traits and biomass in two years). Second, the CGM simulates the daily change in biomass from the “intermediate traits”.

Total biomass (BM, g m−2) was estimated as the product of the total biomass at the day of termination of seed growth (BMTSG, g m−2) and a technical coefficient τ (dimensionless):

$$\mathrm{BM} = \tau\, \mathrm{BM}_{\mathrm{TSG}} \tag{7}$$

where τ represents the influence of factors that are not included in the model (e.g., precipitation, nutrient condition, disease) [27]. The parameter τ was estimated as an average of the ratio of BMTSG and observed BM when the prediction was conducted. The day of termination of seed growth was presumed to be the day when the accumulation of daily mean temperature after heading date reached 630°C [42]. BMTSG was calculated as the sum of daily increases of biomass:

$$\mathrm{BM}_{\mathrm{TSG}} = \sum_{i=1}^{\mathrm{TSG}} \mathrm{RUE}_i \times \mathrm{FINT}_i \times \mathrm{PAR}_i \tag{8}$$

where TSG is the day of termination of seed growth, FINTi is the fraction of PAR intercepted by the canopy on the ith day (dimensionless), RUEi is the radiation use efficiency (g MJ−1), and PARi is the photosynthetically active radiation (MJ m−2). RUEi is the product of the maximum RUE (IRUE = 2.2 g MJ−1) and the ratio of actual daily RUE to IRUE (TRFRUEi, dimensionless):

$$\mathrm{RUE}_i = \mathrm{IRUE} \times \mathrm{TRFRUE}_i \tag{9}$$

where TRFRUEi is a function of daily mean temperature (Ti) (Soltani and Sinclair, 2012):

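$$\mathrm{TRFRUE}_i = \begin{cases} \dfrac{T_i - 10}{15} & (10 < T_i \le 25) \\ 1 & (25 < T_i \le 32) \\ \dfrac{42 - T_i}{10} & (32 < T_i \le 42) \\ 0 & (\text{otherwise}) \end{cases} \tag{10}$$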

FINTi is estimated from the leaf area index, LAIi (dimensionless), and the extinction coefficient (k = 0.6):

$$\mathrm{FINT}_i = 1 - \exp(-k\, \mathrm{LAI}_i) \tag{11}$$

Although IRUE and k are known to have variation among lines and environments [42], they are assumed to be constant in this study because of the difficulty in the estimation of IRUE and k for each line and environment. LAIi is expressed as:

$$\mathrm{LAI}_i = \left\{ \beta\, \mathrm{Till}_i \sum_{l=1}^{\mathrm{Leaf}_i} (l \times \Delta\mathrm{LL})^2 \right\} / S \tag{12}$$

where ΔLL (m) is the increase of leaf length per unit increase of leaf age, β = 0.003 is a technical coefficient explaining the shape of leaves, and S = 225 cm² is the ground area of one plant. Thus, l × ΔLL represents the length of the leaf at the lth node in order of emergence, and $\sum_{l=1}^{\mathrm{Leaf}_i} (l \times \Delta\mathrm{LL})^2$ is expected to be proportional to the leaf area of one tiller.
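The chain from intermediate traits to biomass in Eqs (7) to (12) can be sketched as follows. This is an illustrative reading of the equations, not the authors' code. Assumptions are flagged in the comments: the sum over leaves in Eq (12) is truncated at the integer part of leaf age, S is converted to m² so that it matches ΔLL in metres, the heat sum for the termination of seed growth starts on the day after heading, and all function names and input values are hypothetical.

```python
import math

IRUE = 2.2             # maximum radiation use efficiency (g MJ^-1), from the text
K_EXT = 0.6            # extinction coefficient k, from the text
SHAPE_COEF = 0.003     # technical coefficient beta of Eq (12), from the text
GROUND_AREA = 0.0225   # S = 225 cm^2 expressed in m^2 (assumed unit conversion)
TSG_HEAT_SUM = 630.0   # degree C accumulated after heading, from the text


def trf_rue(t):
    """Eq (10): temperature response of radiation use efficiency."""
    if 10.0 < t <= 25.0:
        return (t - 10.0) / 15.0
    if 25.0 < t <= 32.0:
        return 1.0
    if 32.0 < t <= 42.0:
        return (42.0 - t) / 10.0
    return 0.0


def lai(leaf_age, tillers, delta_ll):
    """Eq (12); assumption: only fully counted leaves (integer part of leaf age) are summed."""
    leaf_sum = sum((l * delta_ll) ** 2 for l in range(1, int(leaf_age) + 1))
    return SHAPE_COEF * tillers * leaf_sum / GROUND_AREA


def fint(lai_value):
    """Eq (11): fraction of PAR intercepted by the canopy."""
    return 1.0 - math.exp(-K_EXT * lai_value)


def termination_day(temps, heading_day):
    """First day on which the post-heading temperature sum reaches 630 degree C."""
    acc = 0.0
    for day in range(heading_day + 1, len(temps) + 1):   # assumption: start after heading
        acc += temps[day - 1]
        if acc >= TSG_HEAT_SUM:
            return day
    return None


def biomass(temps, pars, leaf_ages, tillers, delta_ll, heading_day, tau=1.0):
    """Eqs (7)-(9): tau times the daily biomass increments summed up to day TSG."""
    tsg = termination_day(temps, heading_day)
    if tsg is None:
        raise ValueError("temperature series ends before seed growth terminates")
    days = zip(temps[:tsg], pars[:tsg], leaf_ages[:tsg], tillers[:tsg])
    bm_tsg = sum(IRUE * trf_rue(t) * fint(lai(la, ti, delta_ll)) * par
                 for t, par, la, ti in days)
    return tau * bm_tsg


# Made-up constant season, only to show the call signature.
n = 120
print(biomass([25.0] * n, [8.0] * n, [14.0] * n, [10.0] * n,
              delta_ll=0.04, heading_day=80))
```

In practice the daily leaf age and tiller number series would come from the growth functions of Eqs (1) and (2) driven by GP-predicted parameters, and heading_day from the DVR model.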

GP model integrated with machine learning

We also constructed a model replacing the CGM with a machine learning method. This integrated machine-learning model (IntML) has the same two-step structure as IntCGM, but the second step uses machine learning methods. In the second step, we built machine learning models that use intermediate traits as explanatory variables to predict biomass. We chose a multiple regression model as a linear machine-learning method (IntML1) and the Random Forest [56] model as a non-linear method (IntML2). The R package “randomForest” [57] was used to build the Random Forest prediction models. When building the model, the parameter “mtry” was set as 2 and the other parameters were set as default.

Model validation

To evaluate the ability of the models to predict biomass, we used 10-fold cross-validation among genotypes. We also predicted tested (i.e., training) and untested (i.e., validation) environments. In the prediction of the tested environment, the data from the same year were used as both training and validation data; that is, biomass of a fold in one year was predicted from the data of the remaining folds and environmental data in the same year. This assumption corresponds to the situation in which we want to predict the biomass of untested lines in tested environments. In the prediction of the untested environment, data from different years were chosen as training and validation data; that is, biomass of a fold in one year was predicted from the data of the remaining folds and environmental data in the other year. This assumption corresponds to the situation in which we want to predict the biomass of untested lines in untested environments.
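The two validation schemes differ only in which year supplies the training data. The sketch below generates the corresponding splits; it is a hypothetical illustration (the helper names, the line labels, and the random seed are not from the paper), with each record identified by a (line, year) pair.

```python
import random


def make_folds(line_ids, n_folds=10, seed=0):
    """Randomly partition genotypes into n_folds groups of near-equal size."""
    ids = list(line_ids)
    random.Random(seed).shuffle(ids)
    return [ids[k::n_folds] for k in range(n_folds)]


def cv_splits(line_ids, train_year, test_year, n_folds=10, seed=0):
    """Yield (train, test) lists of (line, year) keys.

    train_year == test_year  -> untested lines in a tested environment
    train_year != test_year  -> untested lines in an untested environment
    """
    line_ids = list(line_ids)
    for held_out in make_folds(line_ids, n_folds, seed):
        held = set(held_out)
        train = [(line, train_year) for line in line_ids if line not in held]
        test = [(line, test_year) for line in held_out]
        yield train, test


lines = [f"RIL{i:03d}" for i in range(123)]        # hypothetical line labels
tested = list(cv_splits(lines, 2014, 2014))         # same-year scheme
untested = list(cv_splits(lines, 2014, 2015))       # cross-year scheme
print(len(tested), len(untested[0][0]), len(untested[0][1]))
```

In both schemes the held-out lines never contribute phenotypes to training, so the comparison isolates the effect of the environment.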

We calculated three statistics to measure prediction accuracy. The correlation coefficient of observed versus predicted values (r) is a measure of the strength of the relative relation between both values. The root mean squared error (RMSE) expresses the discrepancy between predicted and observed values. The regression coefficient of observed versus predicted values (slope) is a measure of shrinkage in the predicted values relative to the observed values. Observed and predicted values were used as dependent and independent variables, respectively. When predicted values approach observed values, r and slope approach 1 and RMSE decreases. We repeated the cross-validation 100 times for each combination of models and prediction schemes to estimate the standard deviation of the indices (r and slope) of prediction accuracy. The Steel–Dwass test, a nonparametric multiple comparison test, was performed to examine significant differences in prediction accuracy.
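The three statistics can be computed directly from paired observed and predicted values. The sketch below uses a hypothetical function name and made-up numbers in which the predictions have the correct ranking but only half the spread of the observations, a case that r alone cannot distinguish from a perfect prediction.

```python
import math


def accuracy_stats(observed, predicted):
    """Return (r, rmse, slope), with observed regressed on predicted."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    s_po = sum((p - mean_p) * (o - mean_o) for o, p in zip(observed, predicted))
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    r = s_po / math.sqrt(s_pp * s_oo)
    slope = s_po / s_pp                      # predicted is the independent variable
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return r, rmse, slope


observed = [2.0, 4.0, 6.0, 8.0]     # made-up values
predicted = [1.0, 2.0, 3.0, 4.0]    # same ranking, half the spread
r, rmse, slope = accuracy_stats(observed, predicted)
print(r, rmse, slope)
```

Here r is 1 while the slope is 2, which is the kind of shrinkage in predicted values that the slope statistic is meant to expose.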

Results

Growth patterns and correlations among traits

Growth curves and fitted models of leaf age and number of tillers are shown in Fig 3. The results indicated that the models could express the growth of each trait despite their simplicity.

Fig 3. Growth curves and growth models of leaf age and number of tillers.

The line means of both traits in 2014 and 2015 are plotted in the four panels on the left. The parents, Koshihikari and Kinmaze, and the RILs are shown as blue, red, and gray lines, respectively. The growth models of those traits are shown in the two panels on the right, together with the observed values of the parents in 2015. The horizontal axes show the heat unit.

The comparison of phenotypic values between the two years of experiment is shown in Fig 4. Among estimated parameters of the growth models, strong correlations between the years were observed in LeafMAX and heading date whereas weak correlations were observed in TillMAX (Fig 4). However, the distributions of ΔLeaf and ΔTill differed between the years. The ranges of phenotypic values of the heading date (e.g., minimum values were ca. 90 and 80 days in 2014 and 2015, respectively) and biomass also differed between the years, despite their high correlations. The G×E effect was found to be significant (p < 0.01) for all traits using ANOVA. The correlation coefficients between growth-related traits and biomass were higher in 2015 than in 2014.

Fig 4. Comparison of observed traits between 2014 and 2015.

Estimates of correlation coefficients between phenotypes of two years are shown in the top-left of each box. Abbreviations: ΔLeaf, growth rate of leaf age; LeafMAX, final leaf age; ΔTill, growth rate of number of tillers; TillMAX, maximum number of tillers; ΔLL, growth rate of leaf length per leaf age; HD, heading date; HI, harvest index; BM, biomass.

Genomic prediction of growth-related traits

We assessed the prediction accuracy of the GP models (Fig 5) in growth-related traits, which corresponded to the first step of the integrated models (IntCGM and IntML, Fig 2B). Accuracy was higher in 2015 than in 2014. Traits that showed higher correlation between years in Fig 4 tended to have higher values both in heritability and prediction accuracy. In ΔTill and TillMAX, the accuracy was lower than in biomass. In the following analyses, we chose LASSO as a representative GP model because it showed the highest accuracy among the models in 10 of 14 traits (six intermediate traits and biomass for two years). For heading date, we compared five models: the DVR model, which used weather data and marker genotypes of the heading-date-related genes as explanatory variables, and four GP models, which used only genome-wide marker data. The prediction accuracy was slightly lower in the DVR model than in GP.

Fig 5. Comparison of prediction accuracy of GP and heritability in growth-related traits.

Estimated correlation coefficients of observed values and values predicted using the five models for seven growth-related traits are shown as bars. The five models included four methods of whole-genome prediction (for all traits) and a DVR model with marker genotypes of the heading-date–related genes (for heading dates). The square roots of heritability of the seven traits are shown as crosses. Error bars represent ± 1 s.d.

Prediction of biomass

In the tested environment, IntCGM, IntML, or both were more accurate at biomass prediction than GP with LASSO by all three statistics (Fig 6A), especially when the 2014 dataset was used as validation data: that is, IntCGM and IntML gave higher r values and regression slopes closer to one than GP, and IntML gave lower RMSE than GP. This tendency was supported by the fact that differences between r and slope of our models and those of GP were all statistically significant (p < 0.01).

Fig 6. Comparison of prediction accuracy of biomass.

Result of prediction of tested environment (A) and untested environment (B) are shown. LASSO was chosen as a representative GP model. Three indices are used: Correlation coefficient (r), RMSE (root mean squared error), and slope of the regression line for predicted and observed values. Error bars represent ± 1 s.d. Letters above the bars indicate a significant difference as determined by the Steel–Dwass test (p < 0.01).

IntCGM, IntML, or both performed better than or the same as GP in the untested environment (Fig 6B); both models gave significantly higher r and slope than GP except when IntML2 was tested with 2014 dataset as validation. IntCGM had a lower RMSE than that of GP using the 2015 dataset for validation but had a higher RMSE than that of GP using the 2014 dataset for validation.

We attempted to predict the panicle weight with IntCGM, wherein the panicle weight was expressed as the multiplication of biomass and harvest index and the harvest index was predicted using GP. However, the prediction accuracy of IntCGM was worse than GP because the harvest index itself was largely affected by the environment (S2 Fig).

Discussion

Accuracy of prediction of biomass

The r in our new models was the same as, or higher than, that of the conventional GP in the prediction of biomass (Fig 6). There was a substantial difference in the r of GP between 2014 and 2015 in the prediction of the tested environment, indicating that it was difficult to explain the variation of biomass in 2014 through the direct linear regression on the marker genotypes. In contrast, the integrated models showed a significant increase in r compared with that of GP in the 2014 prediction. These results indicate that the use of the intermediate traits was beneficial for improving the accuracy of biomass prediction. Heading date prediction, which showed high heritability in both years, contributed most to the improved prediction accuracy.

Focusing on the GP trained with biomass of 2014, the accuracy was higher in biomass prediction of 2015 than in that of 2014. This intuitively unexpected result might be owing to two reasons. One is the low heritability of biomass in 2014, which led to lower prediction accuracy in the models [58,59]. To reduce the influence of the heritability level on the index of the prediction accuracy (i.e., a correlation coefficient between observed and predicted phenotypes), the value of r was adjusted by dividing it by the square root of genomic heritability. The adjusted values of r became 0.652 and 0.746 for the biomass in 2014 and 2015, respectively, and differed less than the unadjusted values of r. Another reason for the higher biomass prediction accuracy in 2015 is that the GP model was built with LASSO. In Fig 5, the biomass prediction accuracy was lower in LASSO than in other models in 2014, whereas the result was the opposite in 2015. Polygenic effects seemed to be more dominant for biomass in 2014 than in 2015, given that LASSO is not good at capturing the small effects of a large number of variables. In contrast, the estimation of genomic heritability effectively reflects polygenic effects. The differences in the characteristics of each estimation method subsequently caused the difference in the adjusted values of r for the biomass in 2014 and 2015.

Although heading date was predicted by ELM and DVR models in our models, the prediction accuracy was worse than that by GP. One possible reason is that the heading date of the RILs that we used could not be completely explained by the heading-date-related genes (i.e., Hd1, Hd3a, Hd6, Hd16, Hd17, and Ghd7) considered in the ELM and DVR models. However, we employed the DVR model in our models because it can be used to predict the heading date in a new environment.

Comparison with models in other studies

An advantage of our new approach over previous studies of integrated models of GP and CGM is the inclusion of observed growth data in the model as “intermediate traits”. This enables us to treat parameters in the model as representations of actual crop conditions. Two studies designed to integrate a genomic prediction model with a crop model [15, 19] tried to estimate growth parameters by using only phenotypic values of target traits. Technow et al. integrated GP and CGM to predict the yield of maize using parameter estimation with approximate Bayesian computation [15]. Onogi et al. also constructed an integrated model to predict the heading date of rice [19]. However, this approach is difficult to apply to a complex trait, such as yield, and did not improve the prediction accuracy when it was applied to real yield data [16]. It is also difficult to validate the accuracy of the estimated growth parameters. The use of the intermediate traits was beneficial for improving prediction accuracy and for further understanding how the parameters influence the target traits.

A multi-trait GP is another approach to predict target traits with intermediate traits (or secondary traits). In this model, the covariance structure among target and intermediate traits is considered to improve prediction accuracy [60,61]. For example, there are studies in which longitudinal traits measured by remote sensing were used as intermediate (or secondary) traits and modeled with a multi-trait GP model to predict wheat grain yield [17,18]. In the study of [18], grain yield was predicted for an untested environment in which phenotypic data of the target population were not available. The prediction accuracy, however, was not improved with a multi-trait GP model [18]. Compared with the multi-trait GP approach, our two-step approach offers flexibility in modeling nonlinear relationships among target and intermediate traits by applying a nonlinear model at the second step (e.g., CGM as in IntCGM or Random Forest as in IntML2).
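The structure of the two-step approach can be sketched as a simple composition of functions: per-trait GP models map marker genotypes to intermediate traits, and a second-stage model maps those traits plus environmental data to the target trait. The Python sketch below shows only this data flow. All names (`two_step_predict`, `toy_gp_models`, `toy_second_stage`), the trait names, and the toy formulas are hypothetical stand-ins, not the models fitted in this study; in practice the first step would be a trained GP model and the second a CGM or Random Forest.

```python
from typing import Callable, Dict, Sequence

Markers = Sequence[int]      # marker genotypes of one line (toy 0/1 coding)
Traits = Dict[str, float]    # intermediate traits predicted for one line
Env = Dict[str, float]       # environmental covariates of one environment


def two_step_predict(
    markers_by_line: Dict[str, Markers],
    environment: Env,
    gp_models: Dict[str, Callable[[Markers], float]],
    second_stage: Callable[[Traits, Env], float],
) -> Dict[str, float]:
    """Predict a target trait per line via GP-predicted intermediate traits."""
    predictions = {}
    for line, markers in markers_by_line.items():
        # Step 1: one GP model per intermediate trait.
        traits = {name: model(markers) for name, model in gp_models.items()}
        # Step 2: a (possibly nonlinear) model of traits and environment.
        predictions[line] = second_stage(traits, environment)
    return predictions


# Hypothetical stand-ins: linear "GP" predictors and a multiplicative second stage.
toy_gp_models = {
    "tiller_number": lambda m: 1.0 + 0.5 * sum(m),
    "days_to_heading": lambda m: 90.0 - 2.0 * m[0],
}


def toy_second_stage(traits: Traits, env: Env) -> float:
    # The product of traits makes the trait-to-target relationship nonlinear.
    return env["radiation"] * traits["tiller_number"] * traits["days_to_heading"] / 100.0


predicted = two_step_predict(
    {"line_A": [1, 0, 1], "line_B": [0, 1, 0]},
    {"radiation": 15.0},
    toy_gp_models,
    toy_second_stage,
)
print(predicted)
```

Because the second stage is an arbitrary callable, a deterministic CGM and a machine learning model can be swapped in without changing the first step, which is the flexibility the paragraph above refers to.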

Another benefit of IntCGM was that the range of predicted among-line variation [i.e., the regression coefficient (slope) of observed versus predicted values of IntCGM] was closer to 1 compared with that of GP (Fig 6). This would be important in breeding programs [30, 62], although it has not been evaluated in recent studies of the prediction of G×E by GP [8,9,14,16]. In those studies, the accuracy of prediction models was assessed mainly by the correlation between predicted and observed (or estimated) values. Although correlation is a good measure of the ordinal accuracy of the prediction (i.e., the accuracy of predicting the order of genotypic values), it does not necessarily reflect the range of genetic variation [63]. In some cases, the accurate prediction of phenotypic values is important for breeding; for example, we may need to maintain the flowering date within a certain range for ease of field management or limit plant height to prevent lodging. When aiming at the application of GP to actual breeding, the accurate prediction of the size of genetic variation in a population is as important as the ordinal relationship among genotypes in the population.
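The distinction between ordinal accuracy and the predicted range of variation can be illustrated numerically. The following Python sketch computes both the correlation and the slope of the regression of observed on predicted values; the function name and the data are hypothetical and serve only to show that a prediction can rank lines perfectly while understating their spread.

```python
from math import sqrt
from typing import Sequence, Tuple


def correlation_and_slope(
    observed: Sequence[float], predicted: Sequence[float]
) -> Tuple[float, float]:
    """Return Pearson r and the slope of observed regressed on predicted."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    sxy = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    sxx = sum((p - mean_p) ** 2 for p in predicted)
    syy = sum((o - mean_o) ** 2 for o in observed)
    return sxy / sqrt(sxx * syy), sxy / sxx


# Hypothetical data: predictions shrunk toward the mean keep the order of lines.
observed = [10.0, 12.0, 14.0, 16.0, 18.0]
shrunk = [13.0, 13.5, 14.0, 14.5, 15.0]

r_shrunk, slope_shrunk = correlation_and_slope(observed, shrunk)
r_exact, slope_exact = correlation_and_slope(observed, observed)
print(r_shrunk, slope_shrunk)
print(r_exact, slope_exact)
```

Both predictions have the same correlation with the observations, but only the second has a slope of 1. A slope well above 1, as for the shrunk predictions, indicates that the predicted among-line variation is narrower than the observed variation.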

Further improvement of the prediction model

The prediction accuracy of the models was validated using 2-year experiments, which had a 1-month difference in the timing of sowing and planting; one year was used for training, whereas the other year was used for testing, as in previous studies [15,16]. Although the experiments in 2014 and 2015 were performed in one location, they were conducted under different environmental conditions (e.g., temperature, day length, and radiation) by employing different cropping seasons. However, other environmental factors, such as soil condition, were fixed in these experiments. To apply our models to a dataset with multiple locations and years, we should take into account other environmental factors, such as soil condition, water supply, and cultivation management, in the models.

In this study, biomass was selected as the target trait for prediction, but the prediction of yield was more challenging. A possible method of implementing accurate prediction of yield is the use of sophisticated CGMs. The potential of several CGMs, such as APSIM [64], has already been demonstrated in practical applications. However, their complexity may create problems. One problem is the accumulation of errors: the errors of parameter estimation would be large if the model includes many parameters. Therefore, models must be simplified, for example through the use of machine learning (IntML) or variable selection. A sensitivity analysis would be effective for selecting the modules of a model, because it distinguishes the variables that have little influence on the target traits.
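One simple form of sensitivity analysis is a one-at-a-time perturbation, in which each parameter is varied by a fixed relative step while the others are held at baseline. The Python sketch below applies this to a toy light-interception model. The model, its parameter names (`rue`, `radiation`, `k`, `lai`), and the baseline values are hypothetical and are not the CGM or parameters used in this study; the sketch is meant only to show how weakly influential inputs could be identified.

```python
import math
from typing import Callable, Dict

Params = Dict[str, float]


def toy_biomass(p: Params) -> float:
    """Toy model: radiation use efficiency x radiation x intercepted fraction."""
    fraction_intercepted = 1.0 - math.exp(-p["k"] * p["lai"])
    return p["rue"] * p["radiation"] * fraction_intercepted


def one_at_a_time_sensitivity(
    model: Callable[[Params], float], baseline: Params, rel_step: float = 0.10
) -> Dict[str, float]:
    """Relative output change per relative input change, by central difference."""
    base_output = model(baseline)
    sensitivity = {}
    for name, value in baseline.items():
        up = dict(baseline, **{name: value * (1.0 + rel_step)})
        down = dict(baseline, **{name: value * (1.0 - rel_step)})
        sensitivity[name] = (model(up) - model(down)) / (2.0 * rel_step * base_output)
    return sensitivity


# Hypothetical baseline values, for illustration only.
baseline = {"rue": 1.4, "radiation": 1500.0, "k": 0.5, "lai": 4.0}
sens = one_at_a_time_sensitivity(toy_biomass, baseline)
for name, value in sorted(sens.items(), key=lambda kv: abs(kv[1])):
    print(f"{name}: {value:.3f}")
```

In this toy case the output responds proportionally to `rue` and `radiation` but much less to `k` and `lai`, because the canopy already intercepts most of the light at the baseline. One-at-a-time analysis ignores interactions among parameters, so a variance-based method may be preferable for a full CGM.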

Another problem is the increased effort required for measuring plant growth if a model requires a large number of growth parameters. Parameter estimation is one effective solution [15,19,27]. Through these methods, we may be able to omit the measurement of some growth-related traits and to estimate them as parameters in a CGM while measuring the remaining traits in the field. The use of high-throughput phenotyping is another way to enable plant growth to be measured in detail. For example, LAI [65,66] and biomass [67,68] can be measured in a non-destructive way by remote sensing with unmanned aerial vehicles. Such techniques would enable us to measure various kinds of growth-related traits continuously during growth. GP and high-throughput phenotyping technologies could revolutionize breeding [69].

Moreover, the use of a deterministic model in IntCGM may reduce phenotyping costs for the target traits. In IntCGM, the phenotypic values of biomass in the training data were used only for scaling the model’s prediction values onto the phenotypic values with τ as the scaling parameter. Using τ, the RMSE of biomass in known environments decreased by 45% and 68% in 2014 and 2015, respectively. However, the scaling procedure (i.e., the training of the model with the phenotypic values of biomass) was not necessary when the prediction values were used for selecting superior genotypes, because the correlation between the predicted and genotypic values of biomass did not change with scaling. This is because the CGM used in this study was deterministic and did not include any parameters to be estimated other than τ. This is another great advantage of IntCGM because the model does not require the phenotypic data of biomass, which in turn require laborious destructive measurements of plants.
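The point that scaling changes RMSE but not selection decisions can be checked with a small example. The Python sketch below assumes that τ acts as a multiplicative factor and estimates it by least squares through the origin; both the estimator and the data are assumptions made for illustration, not the procedure or values of this study.

```python
from math import sqrt
from typing import List, Sequence


def fit_tau(raw_pred: Sequence[float], observed: Sequence[float]) -> float:
    """Least-squares multiplicative scaling factor (an assumed estimator)."""
    return sum(p * o for p, o in zip(raw_pred, observed)) / sum(p * p for p in raw_pred)


def rmse(pred: Sequence[float], observed: Sequence[float]) -> float:
    return sqrt(sum((p - o) ** 2 for p, o in zip(pred, observed)) / len(observed))


def ranking(values: Sequence[float]) -> List[int]:
    """Indices of lines ordered from the largest to the smallest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Hypothetical unscaled model outputs and observed biomass for four lines.
raw = [2.0, 3.0, 2.5, 4.0]
obs = [410.0, 590.0, 520.0, 780.0]

tau = fit_tau(raw, obs)
scaled = [tau * p for p in raw]

print("RMSE before scaling:", rmse(raw, obs))
print("RMSE after scaling:", rmse(scaled, obs))
print("Same ranking:", ranking(raw) == ranking(scaled))
```

Multiplying by any positive τ leaves the order of the lines, and therefore the set of selected genotypes, unchanged, while bringing the predictions onto the scale of the observed biomass.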

Toward application for breeding

In this study, we validated our method with the dataset of the 2-year experiments, which had a 1-month difference in their timings of sowing and planting to simulate different environmental conditions. Although this validation is insufficient to evaluate the full potential of the method, our models may be applicable to multi-location, multi-year datasets because a CGM is expected to describe G×E when it has an appropriate model structure and the necessary environmental factors. Thus, IntCGM may enable accurate prediction of phenotypes in each target environment and accelerate the development of varieties having excellent viability in the target environments.

Our models may also help to explain the mechanisms causing G×E effects on yield-related traits because they can predict the effects physiologically through CGMs. The predicted values of growth-related “intermediate traits”, as well as of yield-related traits, allow us to understand how environmental factors affect growth and have a large impact on yield. This understanding will benefit the mechanistic evaluation of the environmental characteristics of locations and the appropriate choice of locations used in multi-environment trials (METs).

Supporting information

S1 Fig. Genetic map of SNP markers of RILs.

(TIF)

S2 Fig. Comparison of prediction accuracy of panicle weight.

The results of prediction for tested (left) and untested (right) environments are shown. LASSO was chosen as a representative GP model. Three indices were used: correlation coefficient (r), root mean squared error (RMSE), and slope of the regression line for predicted and observed values. Error bars represent ± 1 s.d. Letters above the bars indicate significant differences determined using the Steel–Dwass test (p < 0.01).

(TIFF)

S1 Table. List of information of genetic markers.

(CSV)

S2 Table. List of names of 112 Japanese cultivars used to estimate growth parameters of heading date.

(TXT)

S3 Table. Results of ANOVA test of observed traits to detect the effect of G×E.

(CSV)

S4 Table. Calculation of wide-sense heritability of observed traits using replicates.

(CSV)

Acknowledgments

We thank Hironori Wakabayashi and Koji Watanabe for managing the field experiments. We also thank Takashi Harigae, Teruyo Omura, Miyuki Ishibashi, Noriko Kimoto and Chie Muto for assisting the field measurements. The authors thank Maya Watanabe for organizing the MET data and for conducting the initial analysis of the CGM.

Data Availability

All relevant data are available from the GitHub repository (https://github.com/YT100100/ReferenceData_2018_PLoSONE).

Funding Statement

Author HI received funding from the Japan Society for the Promotion of Science (https://www.jsps.go.jp/index.html), KAKENHI, grants JP25252002 and JP16H0458. Author HI received funding from the Japan Science and Technology Agency (https://www.jst.go.jp), CREST, grant JPMJCR16O2. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Meuwissen THE, Hayes BJ, Goddard ME. Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps. Genetics. 2001;157(4):1819–1829.
  • 2. García-Ruiz A, Cole JB, VanRaden PM, Wiggans GR, Ruiz-López FJ, Tassell CP. Changes in genetic selection differentials and generation intervals in US Holstein dairy cattle as a result of genomic selection. Proceedings of the National Academy of Sciences. 2016;113:E3995–E4004. doi:10.1073/pnas.1519061113
  • 3. Asoro G, Newell A, Beavis D, Scott M, Tinker A, Jannink. Genomic, Marker-Assisted, and Pedigree-BLUP Selection Methods for β-Glucan Concentration in Elite Oat. Crop Sci. 2013;53:1894–1906. doi:10.2135/cropsci2012.09.0526
  • 4. Rutkoski J, Singh R, Huerta-Espino J, Bhavani S, Poland J, Jannink J, et al. Genetic Gain from Phenotypic and Genomic Selection for Quantitative Resistance to Stem Rust of Wheat. Plant Genome. 8: 0. doi:10.3835/plantgenome2014.10.0074
  • 5. Yabe S, Hara T, Ueno M, Enoki H, Kimura T, Nishimura S, et al. Potential of Genomic Selection in Mass Selection Breeding of an Allogamous Crop: An Empirical Study to Increase Yield of Common Buckwheat. Front Plant Sci. 2018;9:276. doi:10.3389/fpls.2018.00276
  • 6. Heffner EL, Sorrells ME, Jannink JL. Genomic Selection for Crop Improvement. Crop Sci. 2009;49(1):1–12. doi:10.2135/cropsci2008.08.0512
  • 7. Kang MS. Genotype-environment interaction: progress and prospect. In: Kang MS, editor. Quantitative Genetics, Genomics and Plant Breeding. Oxon, UK: CABI Publishing; 2001. p. 221–243.
  • 8. Burgueño J, de los Campos G, Weigel K, Crossa J. Genomic Prediction of Breeding Values When Modeling Genotype × Environment Interaction Using Pedigree and Dense Molecular Markers. Crop Sci. 2012;52(2):707–719. doi:10.2135/cropsci2011.06.0299
  • 9. Jarquín D, Crossa J, Lacaze X, Cheyron PD, Daucourt J, Lorgeou J, et al. A Reaction Norm Model for Genomic Selection Using High-Dimensional Genomic and Environmental Data. Theor Appl Genet. 2013;127(3):595–607. doi:10.1007/s00122-013-2243-1
  • 10. Schulz-Streeck T, Ogutu JO, Gordillo A, Karaman Z, Knaak C, Piepho HP. Genomic Selection Allowing for Marker-by-Environment Interaction. Plant Breed. 2013;132(6):532–538. doi:10.1111/pbr.12105
  • 11. Saint Pierre C, Burgueño J, Crossa J, Dávila GF, López PF, Moya ES, et al. Genomic prediction models for grain yield of spring bread wheat in diverse agro-ecological zones. Sci Rep. 2016;6:1–11. doi:10.1038/s41598-016-0001-8
  • 12. Ramirez-Villegas J, Watson J, Challinor AJ. Identifying Traits for Genotypic Adaptation Using Crop Models. Journal of Experimental Botany. 2015;66(12):3451–3462. doi:10.1093/jxb/erv014
  • 13. Bustos-Korts D, Eeuwijk FA, Malosetti M. Modelling of genotype by environment interaction and prediction of complex traits across multiple environments as a synthesis of crop growth modelling, genetics and statistics. In: Yin X, Struik PC, editors. Crop Systems Biology. Wageningen: Wageningen University; 2017. p. 55–82.
  • 14. Heslot N, Akdemir D, Sorrells ME, Jannink JL. Integrating Environmental Covariates and Crop Modeling into the Genomic Selection Framework to Predict Genotype by Environment Interactions. Theor Appl Genet. 2014;127(2):463–480. doi:10.1007/s00122-013-2231-5
  • 15. Technow F, Messina CD, Totir LR, Cooper M. Integrating Crop Growth Models with Whole Genome Prediction through Approximate Bayesian Computation. PLoS One. 2015;10(6):e0130855. doi:10.1371/journal.pone.0130855
  • 16. Cooper M, Technow F, Messina C, Gho C, Radu TL. Use of Crop Growth Models with Whole-Genome Prediction: Application to a Maize Multienvironment Trial. Crop Sci. 2016;56(5):2141–2156. doi:10.2135/cropsci2015.08.0512
  • 17. Rutkoski J, Poland J, Mondal S, Autrique E, Pérez L, Crossa J, et al. Canopy Temperature and Vegetation Indices from High-Throughput Phenotyping Improve Accuracy of Pedigree and Genomic Selection for Grain Yield in Wheat. G3-Genes Genom Genet. 2016;6:2799–2808. doi:10.1534/g3.116.032888
  • 18. Sun J, Rutkoski J, Poland J, Crossa J, Jannink J-L, Sorrells ME. Multitrait, Random Regression, or Simple Repeatability Model in High-Throughput Phenotyping Data Improve Genomic Prediction for Wheat Grain Yield. Plant Genom. 2017;10. doi:10.3835/plantgenome2016.11.0111
  • 19. Onogi A, Ideta O, Inoshita Y, Ebana K, Yoshioka T, Yamasaki M, et al. Exploring the Areas of Applicability of Whole-Genome Prediction Methods for Asian Rice (Oryza sativa L.). Theor Appl Genet. 2014;128(1):41–53. doi:10.1007/s00122-014-2411-y
  • 20. Xu S, Zhu D, Zhang Q. Predicting Hybrid Performance in Rice Using Genomic Best Linear Unbiased Prediction. Proceedings of the National Academy of Sciences. 2014;111(34):12456–12461. doi:10.1073/pnas.1413750111
  • 21. Grenier C, Cao TV, Ospina Y, Quintero C, Châtel M, Tohme J, et al. Accuracy of Genomic Selection in a Rice Synthetic Population Developed for Recurrent Selection Breeding. PLoS One. 2015;10(8):e0136594. doi:10.1371/journal.pone.0136594
  • 22. Spindel JE, Begum H, Akdemir D, Collard B, Redoña E, Jannink JL, et al. Genome-wide Prediction Models that Incorporate de novo GWAS are a Powerful New Tool for Tropical Rice Improvement. Heredity. 2016;116(4):395–408. doi:10.1038/hdy.2015.113
  • 23. Spindel J, Begum H, Akdemir D, Virk P, Collard B, Redoña E, et al. Genomic Selection and Association Mapping in Rice (Oryza sativa): Effect of Trait Genetic Architecture, Training Population Composition, Marker Number and Statistical Model on Accuracy of Rice Genomic Selection in Elite, Tropical Rice Breeding Lines. PLoS Genet. 2015;11(2):e1004982. doi:10.1371/journal.pgen.1004982
  • 24. Wang X, Li L, Yang Z, Zheng X, Yu S, Xu C, et al. Predicting rice hybrid performance using univariate and multivariate GBLUP models based on North Carolina mating design II. Heredity. 2016;118(3):302–310. doi:10.1038/hdy.2016.87
  • 25. Pinnschmidt HO, Batchelor WD, Teng PS. Simulation of Multiple Species Pest Damage in Rice Using CERES-Rice. Agr Syst. 1995;48(2):193–222. doi:10.1016/0308-521X(94)00012-G
  • 26. Timsina J, Humphreys E. Performance of CERES-Rice and CERES-Wheat Models in Rice–Wheat Systems: A Review. Agricultural Systems. 2006;90(1–3):5–31. doi:10.1016/j.agsy.2005.11.007
  • 27. Iizumi T, Yokozawa M, Nishimori M. Parameter Estimation and Uncertainty Analysis of a Large-Scale Crop Model for Paddy Rice: Application of a Bayesian Approach. Agric For Meteorol. 2009;149(2):333–348. doi:10.1016/j.agrformet.2008.08.015
  • 28. Horie T. A Model for Evaluating Climatic Productivity and Water Balance of Irrigated Rice and Its Application to Southeast Asia. Southeast Asian Studies. 1987;25(1):62–74. doi:10.20495/tak.25.1_62
  • 29. Singh U, Ritchie JT, Godwin DC. A User’s Guide to CERES-Rice—V2.10. Muscle Shoals, Ala: International Fertilizer Development Center; 1993.
  • 30. Onogi A, Watanabe M, Mochizuki T, Hayashi T, Nakagawa H, Hasegawa T, et al. Toward Integration of Genomic Selection with Crop Modelling: The Development of an Integrated Approach to Predicting Rice Heading Dates. Theor Appl Genet. 2016;129(4):805–817. doi:10.1007/s00122-016-2667-5
  • 31. Oraby H, Venkatesh B, Dale B, Ahmad R, Ransom C, Oehmke J, et al. Enhanced conversion of plant biomass into glucose using transgenic rice-produced endoglucanase for cellulosic ethanol. Transgenic Res. 2007;16:739–749. doi:10.1007/s11248-006-9064-9
  • 32. Jahn CE, Mckay JK, Mauleon R, Stephens J, Mcnally KL, Bush DR, et al. Genetic Variation in Biomass Traits among 20 Diverse Rice Varieties. Plant Physiol. 2010;155:157–168. doi:10.1104/pp.110.165654
  • 33. Zhang ZH, Li P, Wang LX, Hu ZL, Zhu LH, Zhu YG. Genetic dissection of the relationships of biomass production and partitioning with yield and yield related traits in rice. Plant Sci. 2004;167:1–8. doi:10.1016/j.plantsci.2004.01.007
  • 34. Khush GS. Strategies for increasing the yield potential of cereals: case of rice as an example. Plant Breed. 2013;132: n/a-n/a. doi:10.1111/pbr.1991
  • 35. Zhou Y, Li W, Wu W, Chen Q, Mao D, Worland AJ. Genetic dissection of Heading Time and its Components in Rice. Theor Appl Genet. 2001;102(8):1236–1242. doi:10.1007/s001220100539
  • 36. Okada S, Suehiro M, Ebana K, Hori K, Onogi A, Iwata H, et al. Genetic Dissection of Grain Traits in Yamadanishiki, an Excellent Sake-Brewing Rice Cultivar. Theor Appl Genet. 2017;130(12):2567–2585. doi:10.1007/s00122-017-2977-2
  • 37. Murray M, Thompson W. Rapid Isolation of High Molecular Weight Plant DNA. Nucleic Acids Research. 1980;8(19):4321–4326. doi:10.1093/nar/8.19.4321
  • 38. Nagasaki H, Ebana K, Shibaya T, Yonemaru JI, Yano M. Core Single-Nucleotide Polymorphisms—A Tool for Genetic Analysis of the Japanese Rice Population. Breed Sci. 2010;60(5):648–655. doi:10.1270/jsbbs.60.648
  • 39. Yamamoto T, Nagasaki H, Yonemaru JI, Ebana K, Nakajima M, Shibaya T, et al. Fine Definition of the Pedigree Haplotypes of Closely Related Rice Cultivars by means of Genome-Wide Discovery of Single-Nucleotide Polymorphisms. BMC Genomics. 2010;11(1):267. doi:10.1186/1471-2164-11-267
  • 40. R Core Team. R: A language and environment for statistical computing. Version 3.4.3 [software]. R Foundation for Statistical Computing, Vienna, Austria. Available from: https://www.R-project.org/.
  • 41. Broman KW, Wu H, Sen S, Churchill GA. R/qtl: QTL Mapping in Experimental Crosses. Bioinformatics. 2003;19(7):889–890. doi:10.1093/bioinformatics/btg112
  • 42. Soltani A, Sinclair TR. Modeling Physiology of Crop Development, Growth and Yield. Wallingford, Oxfordshire: CABI; 2002.
  • 43. Xavier A, Muir W, Craig B, Rainey K. Walking through the statistical black boxes of plant breeding. Theor Appl Genet. 2016;129(10):1933–1949. doi:10.1007/s00122-016-2750-y
  • 44. Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw. 2010;33(1):1–22. doi:10.18637/jss.v033.i01
  • 45. Endelman JB. Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. Plant Genome J. 2011;4(3):250–255. doi:10.3835/plantgenome2011.08.0024
  • 46. Yin X, Kropff MJ, Horie T, Nakagawa H, Centeno HGS, Zhu D, et al. A Model for Photothermal Responses of Flowering in Rice. I. Model Description and Parameterization. F Crop Res. 1997;51(3):189–200. doi:10.1016/S0378-4290(96)03456-9
  • 47. Nakagawa H, Yamagishi J, Miyamoto N, Motoyama M, Yano M, Nemoto K. Flowering Response of Rice to Photoperiod and Temperature: A QTL Analysis Using a Phenological Model. Theor Appl Genet. 2005;110(4):778–786. doi:10.1007/s00122-004-1905-4
  • 48. Eberhart R, Kennedy J. A New Optimizer Using Particle Swarm Theory. MHS95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science. 1995; p. 39–43. doi:10.1109/MHS.1995.494215
  • 49. Yano M, Katayose Y, Ashikari M, Yamanouchi U, Monna L, Fuse T, et al. Hd1, a Major Photoperiod Sensitivity Quantitative Trait Locus in Rice, Is Closely Related to the Arabidopsis Flowering Time Gene CONSTANS. The Plant Cell Online. 2000;12(12):2473–2483. doi:10.1105/tpc.12.12.2473
  • 50. Takahashi Y, Shomura A, Sasaki T, Yano M. Hd6, a Rice Quantitative Trait Locus Involved in Photoperiod Sensitivity, Encodes the Alpha Subunit of Protein Kinase CK2. Proc Natl Acad Sci USA. 2001;98(14):7922–7927. doi:10.1073/pnas.111136798
  • 51. Kojima S. Hd3a, a Rice Ortholog of the Arabidopsis FT Gene, Promotes Transition to Flowering Downstream of Hd1 under Short-Day Conditions. Plant Cell Physiol. 2002;43(10):1096–1105. doi:10.1093/pcp/pcf156
  • 52. Xue W, Xing Y, Weng X, Zhao Y, Tang W, Wang L, et al. Natural Variation in Ghd7 Is an Important Regulator of Heading Date and Yield Potential in Rice. Nat Genet. 2008;40(6):761–767. doi:10.1038/ng.143
  • 53. Matsubara K, Ogiso-Tanaka E, Hori K, Ebana K, Ando T, Yano M. Natural Variation in Hd17, a Homolog of Arabidopsis ELF3 That Is Involved in Rice Photoperiodic Flowering. Plant Cell Physiol. 2012;53(4):709–716. doi:10.1093/pcp/pcs028
  • 54. Hori K, Ogiso-Tanaka E, Matsubara K, Yamanouchi U, Ebana K, Yano M. Hd16, a Gene for Casein Kinase I, Is Involved in the Control of Rice Flowering Time by Modulating the Day-Length Response. Plant J. 2013;76(1):36–46. doi:10.1111/tpj.12268
  • 55. Huang GB, Zhu QY, Siew CK. Extreme Learning Machine: Theory and Applications. Neurocomputing. 2006;70:489–501. doi:10.1016/j.neucom.2005.12.126
  • 56. Breiman L. Random Forests. Mach Learn. 2001;45(1):5–32. doi:10.1023/A:1010933404324
  • 57. Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2002;2(3):18–22.
  • 58. Daetwyler HD, Villanueva B, Woolliams JA. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One. 2008;3:e3395. doi:10.1371/journal.pone.0003395
  • 59. Meuwissen THE. Accuracy of breeding values of “unrelated” individuals predicted by dense SNP genotyping. Genetics Sel Evol Gse. 2009;41:35. doi:10.1186/1297-9686-41-35
  • 60. Calus MP, Veerkamp RF. Accuracy of multi-trait genomic selection using different methods. Genetics Selection Evolution. 2011;43. doi:10.1186/1297-9686-43-26
  • 61. Jia Y, Jannink J-L. Multiple-Trait Genomic Selection Methods Increase Genetic Value Prediction Accuracy. Genetics. 2012;192:1513–1522. doi:10.1534/genetics.112.144246
  • 62. Moser G, Tier B, Crump RE, Khatkar MS, Raadsma HW. A comparison of five methods to predict genomic breeding values of dairy bulls from genome-wide SNP markers. Genetics Selection Evolution. 2009;41:56. doi:10.1186/1297-9686-41-56
  • 63. González-Recio O, Rosa GJ, Gianola D. Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Livestock Science. 2014;166:217–231. doi:10.1016/j.livsci.2014.05.036
  • 64. Holzworth DP, Huth NI, deVoil PG, Zurcher EJ, Herrmann NI, McLean G, et al. APSIM—Evolution towards a New Generation of Agricultural Systems Simulation. Environ Model Softw. 2014;62:327–350. doi:10.1016/j.envsoft.2014.07.009
  • 65. Córcoles JI, Ortega JF, Hernández D, Moreno MA. Estimation of Leaf Area Index in Onion (Allium Cepa L.) Using an Unmanned Aerial Vehicle. Biosyst Eng. 2013;115(1):31–42. doi:10.1016/j.biosystemseng.2013.02.002
  • 66. Duan SB, Li ZL, Wu H, Tang BH, Ma L, Zhao E, et al. Inversion of the PROSAIL Model to Estimate Leaf Area Index of Maize, Potato, and Sunflower Fields from Unmanned Aerial Vehicle Hyperspectral Data. Int J Appl Earth Obs Geoinf. 2014;26:12–20. doi:10.1016/j.jag.2013.05.007
  • 67. Montes JM, Technow F, Dhillon BS, Mauch F, Melchinger AE. High-Throughput Non-Destructive Biomass Determination during Early Plant Development in Maize under Field Conditions. F Crop Res. 2011;121(2):268–273. doi:10.1016/j.fcr.2010.12.017
  • 68. Watanabe K, Guo W, Arai K, Takanashi H, Kajiya-Kanegae H, Kobayashi M, et al. High-Throughput Phenotyping of Sorghum Plant Height Using an Unmanned Aerial Vehicle and Its Application to Genomic Prediction Modeling. Frontiers Plant Sci. 2017;8:421. doi:10.3389/fpls.2017.00421
  • 69. Cabrera‐Bosquet L, Crossa J, Zitzewitz J, Serret M, Araus J. High‐throughput Phenotyping and Genomic Selection: The Frontiers of Crop Breeding Converge. Journal of Integrative Plant Biology. 2012;54(5):312–320. doi:10.1111/j.1744-7909.2012.01116.x

Decision Letter 0

Lewis Lukens

11 Dec 2019

PONE-D-19-23130

Predicting G×E in biomass of rice: modeling method combining crop growth models and genomic prediction models

PLOS ONE

Dear Dr. Iwata,

Thank you for submitting your revised manuscript to PLOS ONE. I apologize for the delay in returning this manuscript to you. I was able to secure only one external review, and I reviewed the article myself. We invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please address both the reviewer's and my concerns.

My major concern was with the clarity of the work. The major contribution is the combination of genomic selection and a crop growth model to predict biomass. The approach is quite complex (Figure 2), but the logic is straightforward. Biomass is very strongly correlated with leaf area accumulated prior to flowering time. Both growth rate and flowering time have genetic components. Thus, genomic prediction can predict differences in biomass through these "intermediate traits." The key point seems to be that direct genomic prediction of biomass is worse (though not that much worse) than the integration of the crop growth models. The discussion would benefit from expanding on why using the growth models usually leads to better predictions. I would also include the analysis of yield mentioned in the discussion in the results.

I was surprised that r of the predicted values vs untested environment values were so high (Figure 6). It seems the model trains on the means/ lsmeans from two very different replicates in each year. Perhaps this approach is leading to high correlations. How does the model perform if predictions are tested across the single replicates in the following year? This work would be more likely used in this context it seems.

The rationale for the other work reported in the manuscript was unclear to me. The discussion first describes how rice growth differed between 2014 and 2015. I'd omit this. The discussion then discusses Figure 6, which has the results of the growth model/genomic prediction. If the other reported data made a scientific contribution, I would discuss them more fully or remove them.

We would appreciate receiving your revised manuscript by Jan 25 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Lewis Lukens

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements:

  1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf

  2. Please ensure that you refer to Figure 5 in your text as, if accepted, production will need this reference to link the reader to the figure.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 2019-11-19

I have read the revised manuscript and authors’ responses to reviewer’s comments, and think that the manuscript is well prepared and the responses are almost adequate. I have two questions that I would like the authors to address in their Discussion section.

The first is related to reviewer 4’s question on how to deal with non-repeatable genotype by environment interaction (GE). This is a very good question for any study to be useful for breeding. The authors’ response was to use “historical environmental data or simulated (assumed) future environmental data.” I think this is a very important point and needs to be emphasized if their models are to be used in assisting breeding. Indeed, GE has always been a challenge to plant breeding, and it is important to differentiate repeatable vs. non-repeatable GE and deal with them differently. In plant breeding, non-repeatable GE is dealt with by testing at multiple locations for multiple years so as to represent the population of environments in the target environment (Yan, 2016 Crop Science). For reliable genomic prediction (GP), genomic models must also be developed based on phenotypic data from such multi-location multi-year trials (Yan et al., 2019, Crop Breeding, genetics, and Genomics). This also applies to prediction models that integrate GP and physiological models. Furthermore, models integrating multi-location multi-year data have to be validated by similarly obtained phenotypic data. It remains a question whether an “integrated” model actually predicted better than GP alone.

My second question/comment is how far does the crop simulation models have to go? In the current study, leaf number, tiller number, and leaf size were simulated using daily temperature and day length to simulate biomass. However, the complexity of the physiological and biological process leading to biomass or grain yield is unlimited. For example, should we go by hourly temperature? Should we also consider CO2, water, and mineral nutrient availability in the models? Is there an optimum level of complexity to target in integrated models?

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Weikai Yan

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Jun 19;15(6):e0233951. doi: 10.1371/journal.pone.0233951.r003

Author response to Decision Letter 0


2 Feb 2020

> My major concern was with the clarity of the work. The major contribution is the combination of genomic selection and a crop growth model to predict biomass. The approach is quite complex (Figure 2), but the logic is straightforward. Biomass is very strongly correlated with leaf area accumulated prior to flowering time. Both growth rate and flowering time have genetic components. Thus, genomic prediction can predict differences in biomass through these "intermediate traits." The key point seems to be that direct genomic prediction of biomass is worse (though not that much worse) than the integration the crop growth models. The discussion would benefit from expanding on why using the growth models usually leads to better predictions.

The improvement in the prediction accuracy of our model was partly caused by the prediction of heading date, which was included in both IntCGM and IntML. Because we considered the interaction between genotypes on the one hand and environmental factors on the other, the predicted values of heading date as an intermediate trait can help explain the pattern of G×E in biomass. We added related descriptions to explain why the new models (IntCGM and IntML) performed better than GP (L388-393). Another benefit of using IntCGM is described in a later part (L425-450).

> I would also include the analysis of yield mentioned in the discussion in results.

As you suggested, we moved the description of the prediction of panicle weight (L470-L474 in the old manuscript) to the results section (L370-373). We also provide S2 Fig to show the prediction accuracy of panicle weight (the figure caption was added in L733-738).

> I was surprised that r of the predicted values vs untested environment values were so high (Figure 6). It seems the model trains on the means/ lsmeans from two very different replicates in each year. Perhaps this approach is leading to high correlations. How does the model perform if predictions are tested across the single replicates in the following year? This work would be more likely used in this context it seems.

We added a table of narrow-sense heritability calculated by replicate (S4 Table). We did not, however, test the performance of the models when they are trained with a single replicate, but the prediction accuracy of all the models would clearly decrease to some extent. Because the replicates were cultivated in the same environment (at least the same in the environmental variables used in this study), the differences in phenotypic values between the replicates cannot be explained by either genotype data or environmental data. Thus, modelling each replicate separately would only amplify noise, because fewer replicates would be used for training, and would therefore decrease the prediction accuracy of all the models.

> The rationale for the other work reported in the manuscript was unclear to me. The discussion first describes how rice growth differed between 2014 and 2015. I'd omit this.

As you suggested, we removed the paragraph “Comparison of crop traits between different temperature patterns” (L378-385 in original manuscript).

> The discussion then discusses Figure 6 which has the results of the growth model/genomic prediction. If the other reported data made a scientific contribution, I would discuss them more fully or remove them.

We described three indices (r, RMSE, and slope) in the paragraph (L388-400 in the original manuscript). We removed the remarks on the slope because they duplicated a later paragraph (L436-450). We also removed the remarks on RMSE (L396-398 in the original manuscript). The RMSE of IntCGM showed a different tendency in the prediction under the untested environment (Fig 6), but this was due to the specific structure of the crop growth model we used. Because explaining the reason would take considerable space but have little generality for other studies, we decided to remove the description. Finally, the description of r was shortened and moved to L384-388, before a newly added part explaining why our integrated models could improve prediction accuracy (L388-393). We believe that this revision makes the aim of this paragraph clearer.

> Reviewer #1: 2019-11-19

> I have read the revised manuscript and authors’ responses to reviewer’s comments, and think that the manuscript is well prepared and the responses are almost adequate. I have two questions that I would like the authors to address in their Discussion section.

> The first is related to reviewer 4’s question on how to deal with non-repeatable genotype by environment interaction (GE). This is a very good question for any study to be useful for breeding. The authors’ response was to use “historical environmental data or simulated (assumed) future environmental data.” I think this is a very important point and needs to be emphasized if their models are to be used in assisting breeding.

> Indeed, GE has always been a challenge to plant breeding, and it is important to differentiate repeatable vs. non-repeatable GE and deal with them differently. In plant breeding, non-repeatable GE is dealt with by testing at multiple locations for multiple years so as to represent the population of environments in the target environment (Yan, 2016 Crop Science). For reliable genomic prediction (GP), genomic models must also be developed based on phenotypic data from such multi-location multi-year trials (Yan et al., 2019, Crop Breeding, genetics, and Genomics). This also applies to prediction models that integrate GP and physiological models.

As you suggested, we viewed our results in terms of the decomposition of G×E. We assume that non-repeatable G×E can be decomposed into variation explained by environmental factors and variation not explained by them; we refer to the former and the latter as explainable and non-explainable G×E, respectively. The explainable G×E arose from differences in environments between years, whereas the non-explainable G×E was not explained by any obvious environmental factor and was therefore treated as noise. The improvement in prediction accuracy in the tested environment indicates that the integrated models can predict the explainable G×E variation. Furthermore, the improvement in accuracy in the untested environment indicates that the integrated models can predict the repeatable G×E variation, because the difference between the field trials in the two years, in which the patterns of temperature and day length differed, was as large as that between locations.

We added a discussion of this topic (L486-503) with the reference you noted [64] (L728-729).

> Furthermore, models integrating multi-location multi-year data have to be validated by similarly obtained phenotypic data. It remains a question whether an “integrated” model actually predicted better than GP alone.

Indeed, the performance of our integrated models was not tested with multi-location data. However, as emphasized in the methods section (L103-109), sowing and transplanting were performed in different months in the two years to produce results under different conditions of day length and temperature. Because the cultivation periods differed between 2014 and 2015, the 2-year experiments were not simply yearly replications but were expected to induce different growth patterns under different environmental conditions. Thus, as shown by the higher prediction accuracy than GP in the untested environment (Fig 6), it is certain that our models predicted better under different patterns of day length and temperature.

> My second question/comment is how far does the crop simulation models have to go? In the current study, leaf number, tiller number, and leaf size were simulated using daily temperature and day length to simulate biomass. However, the complexity of the physiological and biological process leading to biomass or grain yield is unlimited. For example, should we go by hourly temperature? Should we also consider CO2, water, and mineral nutrient availability in the models? Is there an optimum level of complexity to target in integrated models?

As we wrote in the discussion (L466-468), a crop growth model with too much complexity should be avoided. However, finding an optimum model with an appropriate number of parameters is an important problem. Unnecessary variables should be omitted depending on the target dataset. A sensitivity analysis, which identifies variables with little influence on the target traits, will be effective for selecting the modules of the model. We added a description of this point (L468-471).
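The kind of sensitivity analysis mentioned above can be illustrated with a minimal one-at-a-time sketch. Everything here is an assumption made for illustration: `toy_biomass`, its parameter names, and the 1% perturbation step are hypothetical and are not the crop growth model or the procedure used in the manuscript.

```python
def relative_sensitivity(model, params, name, step=0.01):
    """Relative change in model output per relative change in one parameter."""
    base = model(params)
    bumped = dict(params)
    bumped[name] = params[name] * (1 + step)  # perturb a single parameter by `step`
    return ((model(bumped) - base) / base) / step


def toy_biomass(p):
    # Hypothetical toy model (NOT the model from the study): biomass is driven
    # by leaf area and growth duration, with a nearly irrelevant third input.
    return p["leaf_area"] * p["growth_days"] + 0.001 * p["co2_offset"]


params = {"leaf_area": 2.0, "growth_days": 50.0, "co2_offset": 1.0}
sensitivities = {
    name: relative_sensitivity(toy_biomass, params, name) for name in params
}

# Parameters whose sensitivity falls below a chosen cutoff are candidates for omission.
cutoff = 0.01  # hypothetical cutoff
negligible = [name for name, s in sensitivities.items() if abs(s) < cutoff]
print(negligible)
```

In this toy setting only `co2_offset` falls below the cutoff, which is the pattern one would look for when deciding which module of a larger model can be dropped for a given dataset.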

Attachment

Submitted filename: ResponseToReviewers_20200202.docx

Decision Letter 1

Lewis Lukens

20 Feb 2020

PONE-D-19-23130R1

Predicting G×E in biomass of rice: modeling method combining crop growth models and genomic prediction models

PLOS ONE

Dear Dr. Iwata,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Apr 05 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Lewis Lukens

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

Thank you for the revision. The manuscript is interesting and has improved. My major concern remains with the clarity of the work. For example, a reader should be able to move from introduction to results and get a sense of the findings. The questions posed in the introduction should be specifically addressed in results. Now, results is mostly composed of technical statements. Re-writing the section to contain paragraphs with topic sentences would help improve clarity. The article will have greater impact if it can be readily understood. Specific lines to improve are listed below but please edit broadly.

A central point of the article is that the inclusion of growth-related traits into models enabled G x E effects to be predicted. G X E indicates variation in genotypic responses to environment variation. In this study, it seems genotypes were analyzed in single environments, so there is no variation due to the environment except random error. Within this single environment, some genotypes grow faster and others slower, so genotypes will have different growth rates relative to environmental factors such as temperature, but I would refer to these genotypic differences as genetic effects not G x E. G x E effects could explain differences in genotypes between the two years of the study, but this aspect does not seem to be the focus. Specific text includes lines 38-41; 388-390; 391-393, and elsewhere such as 483. Please clarify.
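The distinction drawn above between genetic effects and G×E can be made concrete with a small sketch that computes interaction residuals from a genotype-by-environment table of means. The function name and the numbers are hypothetical and chosen only for illustration; they are not data from the study. The residual for each cell is the cell mean minus the genotype mean and the environment mean, plus the grand mean.

```python
def interaction_effects(table):
    """table[i][j] is the mean phenotype of genotype i in environment j."""
    n_g, n_e = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_g * n_e)
    g_mean = [sum(row) / n_e for row in table]
    e_mean = [sum(table[i][j] for i in range(n_g)) / n_g for j in range(n_e)]
    # Interaction residual: what is left after removing additive G and E effects.
    return [
        [table[i][j] - g_mean[i] - e_mean[j] + grand for j in range(n_e)]
        for i in range(n_g)
    ]


# Hypothetical means for two lines in two environments.
parallel = [[10.0, 12.0], [14.0, 16.0]]   # both lines respond equally: no G×E
crossover = [[10.0, 16.0], [14.0, 12.0]]  # ranks change between environments: G×E

print(interaction_effects(parallel))
print(interaction_effects(crossover))
```

With a single environment the table has one column, the residuals are all zero by construction, and genotypic differences cannot be separated from G×E, which is the point raised in the comment above.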

In previous comments, I wrote, “I was surprised that r of the predicted values vs untested environment values were so high (Figure 6).” My understanding is that genomic prediction accuracies estimated for 2014 lines using 2014 data (“tested”) were less accurate than genomic prediction models trained with 2014 data and used to predict 2015 (“untested”) (Figure 6). This odd result, if correct, still does not seem to be explained.

Specific points to potentially clarify:

Line 32: “environmental data was predicted.” Is this in results?

Line 248. Fourteen traits are mentioned as subjected to genomic prediction, but I did not see a clear summary of these traits.

Line 322 “Tiller length” but number of tillers in the figure. Please define axes in Fig 3.

Line 337 Define what is plotted in this figure.. e.g. adjusted line means?

Line 402 “An advantage” over what other option? Why specifically is treating parameters as representations of actual crop conditions beneficial?

Line 434. The text notes the authors’ approach could save time because biomass is destructive and hard to obtain. This point seems overstated. The intermediate traits themselves do not seem easy to measure. Please clarify. In addition, is biomass a useful trait for rice breeders/ others? If so, please state. This point about the usefulness of the approach was also brought up by the reviewer in terms of validating the model. How much confidence can one have from an analysis of two locations of data?

Line 436 “The prediction accuracy of our models was better…” At times, they are equal (Fig 6).


PLoS One. 2020 Jun 19;15(6):e0233951. doi: 10.1371/journal.pone.0233951.r005

Author response to Decision Letter 1


19 Apr 2020

> My major concern remains with the clarity of the work. For example, a reader should be able to move from introduction to results and get a sense of the findings. The questions posed in the introduction should be specifically addressed in results. Now, results is mostly composed of technical statements. Re-writing the section to contain paragraphs with topic sentences would help improve clarity. The article will have greater impact if it can be readily understood. Specific lines to improve are listed below but please edit broadly.

As you pointed out, we revised our manuscript at the following points.

L71-79: We clarified what remained to be improved in recent studies of integrated models of GP and CGM.

L80-90, L97-116: We stated the originality of our research following the description of the use of growth-related traits (L80-90). This revision helps readers understand why we chose the two-step approach. We also modified later paragraphs to keep the manuscript consistent (L97-116).

L196: We moved the description of 10-fold cross-validation (L344 in the original manuscript) here, because it belongs in the materials and methods section.

L349-351, L352-L353, and L374-376: We added explanations of the topic of each paragraph for readers who move directly from the introduction to the results.

L378-379: We added a necessary explanation of the result.

L381: We changed the expression to make it easier to understand.

L400-404: We rewrote the paragraph to make the result clearer.

L470-471: We added a description of recent studies of multi-trait GP to clarify the improvement offered by our study.

> A central point of the article is that the inclusion of growth-related traits into models enabled G x E effects to be predicted. G X E indicates variation in genotypic responses to environment variation. In this study, it seems genotypes were analyzed in single environments, so there is no variation due to the environment except random error. Within this single environment, some genotypes grow faster and others slower, so genotypes will have different growth rates relative to environmental factors such as temperature, but I would refer to these genotypic differences as genetic effects not G x E. G x E effects could explain differences in genotypes between the two years of the study, but this aspect does not seem to be the focus. Specific text includes lines 38-41; 388-390; 391-393, and elsewhere such as 483. Please clarify.

As you pointed out, we edited our manuscript at the following points.

L26-30, L38-39: We changed expressions in the abstract to be consistent with the following revisions.

L55-56, L59-60, L62-63, and L66-67: We changed the descriptions of recent research in the introduction so that they correspond to the revisions in the discussion.

L420-426: We discussed the improvement in the prediction accuracy of IntCGM and IntML in terms of the use of the intermediate traits, not in terms of the prediction of G×E.

L536-543: We replaced the paragraph in which we argued that IntCGM succeeded in predicting G×E with a paragraph discussing the future possibility of predicting G×E with IntCGM.

We also propose changing the title to “Predicting biomass of rice with intermediate traits: modeling method combining crop growth models and genomic prediction models” and the running title to “Predicting biomass of rice with intermediate traits” (L1-4). We removed the expression “prediction of G×E” from both.

> In previous comments, I wrote, “I was surprised that r of the predicted values vs untested environment values were so high (Figure 6).” My understanding is that genomic prediction accuracies estimated for 2014 lines using 2014 data (“tested”) were less accurate than genomic prediction models trained with 2014 data and used to predict 2015 (“untested”) (Figure 6). This odd result, if correct, still does not seem to be explained.

The result is correct. We added a description of this result (L427-442) and two references (L751-756). We also modified the labels of Fig 6 to make it easier to understand which data were used as training or test data. The corresponding revisions in the manuscript are in L395, L401, and L410.

This intuitively unexpected result may be due to two reasons. One is the low heritability of biomass in 2014, which led to lower prediction accuracy in the models [58-59]. To reduce the influence of the heritability level on the index of prediction accuracy (i.e., the correlation coefficient between observed and predicted phenotypes), the value of r was adjusted by dividing it by the square root of genomic heritability. The adjusted values of r were 0.652 and 0.746 for biomass in 2014 and 2015, respectively, and differed less than the unadjusted values of r.
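The adjustment described above, dividing r by the square root of genomic heritability, can be sketched in a few lines. The function name is hypothetical, and the input values are illustrative only; they are not the r or heritability estimates from the study.

```python
import math


def adjusted_accuracy(r: float, h2: float) -> float:
    """Divide the prediction correlation r by the square root of heritability h2."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must lie in (0, 1]")
    return r / math.sqrt(h2)


# Illustrative inputs only (not values reported in the manuscript).
print(round(adjusted_accuracy(0.40, 0.36), 3))
```

Because the divisor is smaller for a low-heritability trait, the adjustment raises its accuracy more, which is why the two years move closer together after adjustment.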

Another reason for the higher biomass prediction accuracy in 2015 is the GS model built with LASSO. In Fig 5, the biomass prediction accuracy was lower in LASSO than in other models in 2014, whereas the result was the opposite in 2015. Polygenic marker effects seemed more dominant in biomass in 2014 than in 2015 because LASSO is not good at capturing the small effects of a large number of variables. In contrast, the estimation of genomic heritability effectively reflects polygene effects. The differences in the characteristics of each estimation method subsequently caused the difference in the adjusted values of r for the biomass in 2014 and 2015.
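Why LASSO tends to miss many small effects can be shown with a simplified sketch. It assumes an orthonormal marker design, in which case the LASSO estimate of each coefficient is the soft-thresholded least-squares estimate; real marker matrices are not orthonormal, and the effect sizes and penalty below are hypothetical.

```python
def soft_threshold(b: float, lam: float) -> float:
    """LASSO estimate of one coefficient under an orthonormal design,
    for the objective 0.5 * ||y - Xb||^2 + lam * ||b||_1."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


# Hypothetical least-squares marker effects: one large effect and twenty small ones.
ols_effects = [1.0] + [0.05] * 20
lam = 0.1  # hypothetical penalty

lasso_effects = [soft_threshold(b, lam) for b in ols_effects]
kept = sum(1 for b in lasso_effects if b != 0.0)

# Total small-effect signal before and after shrinkage.
signal_before = sum(abs(b) for b in ols_effects[1:])
signal_after = sum(abs(b) for b in lasso_effects[1:])
print(kept, signal_before, signal_after)
```

In this toy case the twenty small effects together carry as much signal as the single large one, yet all of them are set to zero, which mirrors the argument that a polygenic trait is predicted less well by LASSO.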

> Specific points to potentially clarify:

> Line 32: “environmental data was predicted.” Is this in results?

No, environmental data was not predicted in this study. Because the sentence was misleading, we changed the sentence to “Second, the biomass was predicted from these GP-predicted values and the environmental data using machine learning or crop growth modeling.” (L31-32)

> Line 248. Fourteen traits are mentioned as subjected to genomic prediction, but I did not see a clear summary of these traits.

The fourteen traits consisted of six intermediate traits and the biomass, each for two years. The results of the prediction of these traits are shown in Fig 5. We added an explanation of the traits (L275).

> Line 322 “Tiller length” but number of tillers in the figure. Please define axes in Fig 3.

As you mentioned, the trait should be the number of tillers. We corrected the trait name (L349) and apologize for the mistake.

> Line 337 Define what is plotted in this figure.. e.g. adjusted line means?

As you mentioned, we clearly described that the plotted values are “The adjusted mean values of each line” (L361-362).

> Line 402 “An advantage” over what other option?

It is an advantage over conventional studies of integrated models of GP and CGM. We made the description clearer (L451-452).

> Why specifically is treating parameters as representations of actual crop conditions beneficial?

One benefit is that the influence of the parameters on the target traits is easy to understand because phenotypic data of the intermediate traits are available, as written in the original manuscript (L410-412). Another benefit is the improvement in prediction accuracy on real datasets: predicting complex traits such as yield with an integrated model of GP and CGM has been so difficult that no improvement in prediction accuracy had been found in real datasets [16]. We added descriptions in the introduction (L71-77) and the discussion (L458-463).

> Line 434. The text notes the authors’ approach could save time because biomass is destructive and hard to obtain. This point seems overstated. The intermediate traits themselves do not seem easy to measure. Please clarify.

As you mentioned, the time-series observation of the intermediate traits is also laborious. We moved the paragraph about the scaling parameter τ (L425-435 in the original manuscript) after a paragraph about the measurement of the intermediate traits using high-throughput phenotyping (L522-533). Following the paragraph describing the reduction of the phenotyping cost of the intermediate traits, it is natural to discuss the possibility of reducing the phenotyping cost of destructive sampling.

> In addition, is biomass a useful trait for rice breeders/ others? If so, please state.

Rice biomass is an important trait for rice breeding, not only because biomass itself is needed as biofuel [31-32], but also because it is an essential determinant of grain yield together with harvest index [33-34]. We added this description to the introduction (L91-92, L95-96) with four references (L668-679).

> This point about the usefulness of the approach was also brought up by the reviewer in terms of validating the model. How much confidence can one have from an analysis of two locations of data?

It is true that the results would be more convincing if we tested the model in a larger number of environments. However, we think two environments are sufficient to demonstrate the ability of the models, as previous studies of integrated models of GP and CGM were also tested in two environments [15-16]. We added a description of this point (L492-494).

> Line 436 “The prediction accuracy of our models was better…” At times, they are equal (Fig 6).

You are correct. Because this statement was repeated in the results and discussion sections, we deleted the sentence.

Attachment

Submitted filename: ResponseToReviewers_20200416.docx

Decision Letter 2

Lewis Lukens

18 May 2020

Predicting biomass of rice with intermediate traits: modeling method combining crop growth models and genomic prediction models

PONE-D-19-23130R2

Dear Dr. Iwata,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Lewis Lukens

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

I apologize for my slow response. I reviewed the article. Thank you for addressing previous concerns.

Reviewers' comments:

Acceptance letter

Lewis Lukens

26 May 2020

PONE-D-19-23130R2

Predicting biomass of rice with intermediate traits: modeling method combining crop growth models and genomic prediction models

Dear Dr. Iwata:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Lewis Lukens

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Genetic map of SNP markers of RILs.

    (TIF)

    S2 Fig. Comparison of prediction accuracy of panicle weight.

    The result of prediction of tested (left) and untested (right) environments are shown. LASSO was chosen as a representative GP model. Three indices were used: Correlation coefficient (r), RMSE (root mean squared error), and slope of the regression line for predicted and observed values. Error bars represent ± 1 s.d. Letters above the bars indicate significant differences determined using the Steel–Dwass test (p < 0.01).

    (TIFF)

    S1 Table. List of information of genetic markers.

    (CSV)

    S2 Table. List of names of 112 Japanese cultivars used to estimate growth parameters of heading date.

    (TXT)

    S3 Table. Results of ANOVA test of observed traits to detect the effect of G×E.

    (CSV)

    S4 Table. Calculation of wide-sense heritability of observed traits using replicates.

    (CSV)

    Attachment

    Submitted filename: ResponseToReviewers.docx

    Attachment

    Submitted filename: ResponseToReviewers_20200202.docx

    Attachment

    Submitted filename: ResponseToReviewers_20200416.docx

    Data Availability Statement

    All relevant data are available from the GitHub repository (https://github.com/YT100100/ReferenceData_2018_PLoSONE).


    Articles from PLoS ONE are provided here courtesy of PLOS
