Genomic prediction of rice mesocotyl length indicative of directing seeding suitability using a half-sib hybrid population

Liang Chen; Jindong Liu; Sang He; Liyong Cao; Guoyou Ye

doi:10.1371/journal.pone.0283989

PLoS One. 2023 Apr 5;18(4):e0283989. doi: 10.1371/journal.pone.0283989


Editor: Muhammad Abdul Rehman Rashid⁴

PMCID: PMC10075464 PMID: 37018326

Abstract

Direct seeding has been widely adopted as an economical and labor-saving technique in rice production, although problems such as low seedling emergence rate, irregular emergence and poor lodging resistance persist. These problems are currently partially overcome by increasing the seeding rate; however, this is not acceptable for hybrid rice because of the high seed cost. Improving direct seeding suitability by breeding is seen as the ultimate solution. In hybrid breeding, identifying superior hybrids by phenotypic evaluation among the massive number of possible crosses between male and female parental populations is tedious and costly. In contrast, genomic selection/prediction (GS/GP) can efficiently identify superior hybrids from genomic data and therefore holds great potential in plant hybrid breeding. In this study, we used 402 rice inbred varieties and 401 hybrids to investigate the effectiveness of GS for rice mesocotyl length, a representative indicator trait of direct seeding suitability. Several GP methods and training set designs were compared to find the optimal scenario for hybrid prediction. Using half-sib hybrids as the training set, with the phenotypes of all parental lines fitted as a covariate, predicted mesocotyl length best. Partitioning the molecular markers into trait-associated and trait-unassociated groups, based on a genome-wide association study using all parental lines and hybrids, further improved the prediction accuracy. This study indicates that GS could be an effective and efficient method in hybrid breeding for rice direct seeding.

1. Introduction

Rice, an essential food crop, feeds more than half of the world's population. To meet this huge demand, modern agricultural technologies have been used to improve rice production. Mechanized direct seeding markedly improves planting efficiency and has been widely adopted in rice production [1,2]. However, direct seeding of rice also faces difficulties such as low emergence rate, irregular emergence and seedling lodging [3]. Increasing the seeding rate may resolve these problems for inbred lines, but it is not an option for hybrids because of the high seed cost. Hybrids allow heterosis to be exploited in rice breeding; for example, Jumin et al. [4] reported that F₁ hybrids had an approximately 20% higher grain yield than inbred lines. Developing hybrid varieties suited to direct seeding is therefore of great significance and has become the focus of many rice breeding programs. Several traits that indicate the ease of direct seeding have been identified. One representative trait is mesocotyl length, as a long mesocotyl can markedly improve emergence rate, early vigor and lodging tolerance [5]. However, modern varieties developed for transplanting in well-irrigated ecosystems normally have a short mesocotyl (≤1.0 cm) [6]. It is therefore crucial to breed hybrid varieties with a long mesocotyl for direct seeding.

For hybrid development, identifying excellent hybrid combinations is pivotal. Since the number of hybrids that can be produced far exceeds the number of parental lines, choosing which combinations to produce and test is difficult for breeders. Accurately predicting hybrid performance, so that only promising combinations are field-tested, has long been a research hotspot. Mid-parent performance, general and specific combining abilities, and genetic distance between parental lines estimated from traits or markers have all been tested, but their usefulness is limited and depends on the traits and parental populations [7–10]. Currently, genomic selection (GS) is widely used to predict hybrid performance in various crops. In GS, the performance of untested but genotyped individuals is predicted from the genomic relationship between them and a well-composed training set with both phenotypic and genotypic data. Riedelsheimer et al. [11] test-crossed 285 maize inbred lines with two maize varieties to obtain 570 hybrids. In cross-validation within the hybrids, the prediction accuracies of seven traits ranged from 0.72 to 0.81, with heritabilities ranging from 0.82 to 0.98. Xu et al. [12] predicted the yield of all 21,945 possible hybrid progenies using 278 hybrids generated from random crosses among 210 recombinant inbred lines. If the top 10 hybrid combinations were selected for hybrid breeding, the yield would be increased by 16%. Including non-additive effects, i.e., dominance and epistatic effects, in genomic prediction brought no benefit in real data but was useful in simulations in which non-additive effects were simulated [12]. Therefore, accommodating non-additive effects is potentially profitable even though additive effects predominate [12]. In addition to genomic information, other omics information can also assist genomic prediction. Xu et al. [13] reported that combining parental phenotypes with other predictors can significantly improve the predictability of yield-related traits in rice. Fu et al. [14] used four methods, namely multiple linear regression, PLS, SVM and transcriptome distance, to predict the phenotypes of maize hybrids and found that prediction based on transcriptome distance was the most accurate. Xu et al. [15] used metabolic data of 210 inbred lines to predict the yield of their hybrids and found that the prediction ability was almost twice that of genomic markers. Westhues et al. [16] revealed the advantages of combining transcriptome data with genomic data measured on the parents for predicting untested hybrids. For hybrid rice, Wang et al. [17] compared the predictability of combinations of multi-omics data (genomic, transcriptomic and metabolomic) and eight GS methods, and found that the GBLUP approach integrating genomic and metabolomic data performed best overall.

The abovementioned studies have shown that GS has the potential to effectively predict yield and yield-related traits in hybrid rice, but no study has investigated the potential of GS for mesocotyl length in hybrid rice, a trait indicative of direct seeding suitability. In this study, we measured the mesocotyl length of 402 rice inbred lines, including the well-known male sterile line Taifeng A, and of 401 hybrid progenies produced by test-crossing the other 401 lines with Taifeng A as the female parent. We examined several genomic prediction scenarios including mid-parental value prediction, marker-assisted selection (MAS) and genome-wide association study (GWAS). Our major aim was to find the hybrid rice prediction scenario with the highest prediction accuracy for mesocotyl length, and thereby to assess the potential of genomic selection to accelerate the breeding of hybrid rice suited to direct seeding.

2. Materials and methods

2.1 Rice materials

The 402 rice varieties used in this study are mainly from South Asia, Southeast Asia and South China, and are conserved at the International Rice Research Institute. Detailed variety information is given in S1 Table. The 401 F₁ hybrids were produced by test-crossing 401 rice varieties (as male parents) with the widely used male sterile line Taifeng A (as female parent).

2.2 Phenotypic data

A randomized complete block design with three replicates was used to lay out the mesocotyl length test for all parental lines and hybrids. To minimize the impact of the environment on the phenotypic performance of a hybrid and its parent, each hybrid was planted next to its male parent. In each block, a total of 15 full seeds per variety were sown at a depth of 6 cm. Plastic cavity trays with 50 holes were used for sowing. The hole depth, upper diameter and bottom diameter of the tray were 9.5 cm, 4.5 cm, and 2.1 cm, respectively. After sowing, each plastic cavity tray was placed in a corresponding plastic pallet whose bottom was covered with nutrient soil to a depth of 3 cm, and all pallets were then moved into a large-volume oven and cultured at 30 °C in the dark. The soil in the trays and pallets was kept moist until the seedlings emerged. The emergence rate was recorded every day until it reached 100% for all varieties. All seedlings were then taken out of the holes and washed with clean water, and 10 seedlings with uniform growth per variety were randomly selected and photographed. Mesocotyl length was measured using ImageJ (https://imagej.en.softonic.com/). The phenotypic values were adjusted to derive the best linear unbiased estimates (BLUE) using the model y = Xb + Zu + e, where y is the vector of observed phenotypic values of mesocotyl length for all lines and hybrids, b is the block effect, u is the genetic effect, X and Z are the design matrices for b and u, and $e \sim N (0, I σ_{e}^{2})$ is the random residual, where I is an identity matrix and $σ_{e}^{2}$ is the residual variance component. Both b and u were treated as fixed effects. The phenotypic adjustment was implemented in R [18] using the package sommer [19].
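The adjustment model above, with both block and genotype treated as fixed, can be illustrated with ordinary least squares. The following is a minimal Python/NumPy sketch under stated assumptions, not the sommer-based R analysis used in the study: the function name `blue_genotype_means` and the toy data are hypothetical, and genotype estimates are reported averaged over blocks.

```python
import numpy as np

def blue_genotype_means(y, genotype, block):
    """Least-squares genotype estimates from y = Xb + Zu + e (b, u fixed).

    Hypothetical helper for illustration; estimates are averaged over blocks.
    """
    y = np.asarray(y, dtype=float)
    g_levels, g_idx = np.unique(genotype, return_inverse=True)
    b_levels, b_idx = np.unique(block, return_inverse=True)
    n_g = len(g_levels)
    Z = np.eye(n_g)[g_idx]                     # genotype incidence matrix
    X = np.eye(len(b_levels))[b_idx][:, 1:]    # block incidence, first block as reference
    design = np.hstack([Z, X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Block effects relative to the reference block (whose effect is 0)
    block_effects = np.concatenate([[0.0], coef[n_g:]])
    estimates = coef[:n_g] + block_effects.mean()
    return dict(zip(g_levels.tolist(), estimates))

# Toy example: two genotypes in three blocks, no residual noise
y = [1.5, 1.2, 0.9, 2.5, 2.2, 1.9]
geno = ["A", "A", "A", "B", "B", "B"]
blk = [1, 2, 3, 1, 2, 3]
est = blue_genotype_means(y, geno, blk)
```

In a balanced design such as this toy example, the estimates coincide with the raw genotype means; the model only matters when observations are missing or blocks are unbalanced.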

2.3 Heterosis analysis

The heterosis performance (Hp) was calculated using the formula Hp = 2 × (F₁ − MP) / | P1 − P2 |, where F₁ is the performance of the hybrid, MP is the average performance of the two parents, and P1 and P2 are the performances of the male parent and the female parent, respectively. According to the value of Hp, high-parent heterosis (HPH), mid-parent heterosis (MPH), low-parent heterosis (LPH), and hybrid inferiority (HI) were defined as Hp > 1, 0 < Hp ≤ 1, -1 ≤ Hp < 0, and Hp < -1, respectively [20].
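The Hp formula and the four classes can be written as a short function. This is an illustrative Python sketch; the function name `heterosis_class` is hypothetical, and the handling of the two cases the definition leaves open (identical parental values, and Hp exactly 0) is an assumption of this sketch.

```python
def heterosis_class(f1, p1, p2):
    """Return (Hp, class label) for one hybrid from F1 and parental performances."""
    if p1 == p2:
        # Hp is undefined when both parents perform identically (assumption: raise)
        raise ValueError("Hp is undefined when P1 equals P2")
    mp = (p1 + p2) / 2.0
    hp = 2.0 * (f1 - mp) / abs(p1 - p2)
    if hp > 1:
        label = "HPH"   # high-parent heterosis
    elif 0 < hp <= 1:
        label = "MPH"   # mid-parent heterosis
    elif -1 <= hp < 0:
        label = "LPH"   # low-parent heterosis
    elif hp < -1:
        label = "HI"    # hybrid inferiority
    else:
        label = None    # Hp == 0 is not covered by the four classes
    return hp, label

# Parents with mesocotyl lengths 1 and 3 (mid-parent value 2)
examples = [heterosis_class(f1, 1.0, 3.0) for f1 in (4.0, 2.5, 1.5, 0.0)]
```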

2.4 Genomic data

The DNA of the 402 inbred rice samples was extracted by the CTAB method. Sequencing was performed on the Illumina HiSeq 2000 platform (PE 150) by Beirui Gene Company (https://www.berrygenomics.com/) at a sequencing depth of 50×. The sequencing reads were aligned against the Japonica reference genome (IRGSP-1.0) (http://rice.plantbiology.msu.edu/index.shtml) using BWA-MEM V0.7.10 (http://bio-bwa.sourceforge.net/bwa.shtml). Duplicate reads were identified using the Picard tool (http://broadinstitute.github.io/picard/). Variant sites such as high-quality SNPs and INDELs were called per variety using GATK V3.2.2 (https://gatk.broadinstitute.org/hc/en-us) with the filter settings QUAL < 30.0, QD < 10.0, FS > 200.0, MQRankSum < -12.5 and ReadPosRankSum < -8.0.

A total of 7,882,841 bi-allelic SNPs were identified for the 402 lines. Quality control of the SNPs followed three criteria: 1) remove SNPs with a minor allele frequency (MAF) below 0.05; 2) remove SNPs with a genotyping call rate below 90%; 3) exclude SNPs with a heterozygote rate above 10%. As a result, 196,640 high-quality SNPs were retained. Missing genotypes of the 196,640 high-quality SNPs were imputed with the IMPUTE2 software [21]. All heterozygous genotypes were set to missing and imputed. After imputation, the SNPs were pruned for linkage disequilibrium (LD) to keep independent SNPs. The software PLINK [22] was used with the window size, shifting step, and r² threshold set to 50 SNPs, 5 SNPs, and 0.1, respectively. Finally, 10,547 independent SNPs were available for the 402 lines. The genotypic data of the 402 lines are provided in S2 Table.
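The three SNP quality-control criteria can be expressed as a boolean mask over a genotype matrix. The sketch below is illustrative only and is not the pipeline used in the study; it assumes genotypes coded as copies of the alternative allele (0, 1, 2) with NaN for missing calls, and the function name `qc_mask` is hypothetical.

```python
import numpy as np

def qc_mask(geno, maf_min=0.05, call_rate_min=0.90, het_max=0.10):
    """Return a boolean mask of SNPs (columns) passing the three QC criteria."""
    geno = np.asarray(geno, dtype=float)
    called = ~np.isnan(geno)
    n_called = called.sum(axis=0)
    call_rate = n_called / geno.shape[0]
    safe_n = np.maximum(n_called, 1)                 # avoid division by zero
    alt_freq = np.nansum(geno, axis=0) / (2.0 * safe_n)
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    het_rate = (geno == 1).sum(axis=0) / safe_n      # NaN never equals 1
    return (maf >= maf_min) & (call_rate >= call_rate_min) & (het_rate <= het_max)

nan = np.nan
toy = np.array([
    # SNP0 ok, SNP1 monomorphic, SNP2 low call rate, SNP3 too many heterozygotes
    [0, 0, 0,   1],
    [0, 0, 0,   1],
    [0, 0, 0,   1],
    [0, 0, 0,   0],
    [0, 0, 2,   0],
    [2, 0, 2,   0],
    [2, 0, 2,   0],
    [2, 0, 2,   2],
    [2, 0, nan, 2],
    [2, 0, nan, 2],
])
keep = qc_mask(toy)
```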

The genotypes of the hybrids were deduced from the genotypes of their parents. Specifically, for a particular SNP, the two homozygous genotypes in the parental lines were numerically coded as 0 or 2, indicating the number of copies of the alternative allele. The genotype of a hybrid was the mean of the genotypes of its two parents, i.e., 0, 1, or 2.
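As a small illustration of this coding, hybrid genotypes can be computed as the element-wise mean of the parental genotypes. The sketch assumes fully homozygous parents coded 0 or 2 and a single common female parent, as in this study; the function name `hybrid_genotypes` and the toy genotypes are hypothetical.

```python
import numpy as np

def hybrid_genotypes(male_geno, female_geno):
    """Mean of parental genotypes; one row per male parent, one column per SNP."""
    males = np.asarray(male_geno, dtype=float)
    female = np.asarray(female_geno, dtype=float)   # broadcast over all males
    return (males + female) / 2.0

males = [[0, 2, 2],
         [2, 0, 0]]
female = [0, 2, 0]          # the common female parent
hybrids = hybrid_genotypes(males, female)
```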

To investigate the population structure underlying the lines, a cluster analysis based on the SNP genotypic data was performed.

2.5 Mid-parental value prediction

The mid-parental values of the hybrids in the test sets of each cross-validation scenario (details can be found in section 2.8) were used as the phenotypically predicted genetic values of the hybrids.

2.6 Marker-assisted selection

Marker-assisted selection comprised two steps. In the first step, GWAS was performed in the training set of each cross-validation scenario using the mixed linear model $y_{r} = 1_{r} μ + \sum_{k = 1}^{3} PC_{k} + X_{r_{j}} b_{j} + Z_{r} g_{r} + e$, where y_r is an r-dimensional vector of adjusted phenotypic values of mesocotyl length, r is the number of genotypes in the training set, 1_r is an r-dimensional vector of ones, μ is the intercept, PC_k is the k-th principal component vector derived from the genomic data, b_j is the additive genetic effect of the j-th SNP, $X_{r_{j}}$ is an r-dimensional vector containing the genotypic profiles of the j-th SNP, g_r is an r-dimensional vector of additive genetic effects of genotypes following $g_{r} \sim N (0, A_{r} σ_{a_{r}}^{2})$, A_r is an r×r-dimensional additive genomic relationship matrix estimated following Yang et al. [23], $σ_{a_{r}}^{2}$ is the corresponding variance component, Z_r is the design matrix of g_r, and e is the vector of random residuals following $e \sim N (0, I σ_{e}^{2})$, where I is an identity matrix and $σ_{e}^{2}$ is the residual variance component. The significance thresholds for filtering SNPs ranged from 5×10⁻⁵ to 0.01. Once GWAS was done, for each significance threshold a linear model using the identified trait-associated SNPs (TA-SNPs) was fitted as $y_{r} = 1_{r} μ + \sum_{j = 1}^{m} X_{j} b_{j} + e$ in cross-validation scenarios 1–5 (details can be found in section 2.8), where m is the number of TA-SNPs. The estimated effects of the TA-SNPs, $\hat{b_{j}}$, were derived accordingly. The effective number of TA-SNPs was calculated following Jiang et al. [24]. Briefly, a principal component analysis was performed using the genotypic profiles of all TA-SNPs, and the number of principal components that together explained 95% of the variation was taken as the effective number of TA-SNPs. In the second step, the phenotypic values of hybrids in the test set, $\hat{y_{s}}$, were predicted using the formula $\hat{y_{s}} = 1_{s} \hat{μ} + \sum_{j = 1}^{m} X_{s_{j}} \hat{b_{j}}$ in cross-validation scenarios 1–5, where 1_s is an s-dimensional vector of ones, s is the number of hybrids in the test set, $\hat{μ}$ and $\hat{b_{j}}$ are the estimated values of the intercept and of the effect of the j-th TA-SNP from the linear model in the first step, respectively, and $X_{s_{j}}$ is an s-dimensional vector of genotypic profiles of the j-th SNP in the test set.
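The calibration and prediction with TA-SNPs described above amount to a multiple linear regression followed by a matrix product. The following Python/NumPy sketch illustrates that part only and assumes the TA-SNPs have already been selected by GWAS; it is not the R code used in the study, and the function names and toy data are hypothetical.

```python
import numpy as np

def fit_mas(X_train, y_train):
    """Estimate the intercept and TA-SNP effects by ordinary least squares."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    design = np.column_stack([np.ones(len(y_train)), X_train])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return coef[0], coef[1:]          # mu_hat, b_hat

def predict_mas(X_test, mu_hat, b_hat):
    """Predict test-set phenotypes as mu_hat + sum_j X_sj * b_hat_j."""
    return mu_hat + np.asarray(X_test, dtype=float) @ b_hat

# Toy data: two TA-SNPs, phenotype generated without noise
X_train = np.array([[0, 0], [2, 0], [0, 2], [2, 2], [1, 0]])
y_train = 1.0 + 0.5 * X_train[:, 0] - 0.25 * X_train[:, 1]
mu_hat, b_hat = fit_mas(X_train, y_train)
y_pred = predict_mas([[1, 1], [2, 1]], mu_hat, b_hat)
```

With many TA-SNPs relative to the training set size, such an unpenalized regression becomes unstable, which is one reason the number of fitted markers matters in this kind of model.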

The GWAS analyses were implemented in GCTA software [25] using the option "--mlma". The calibration and prediction linear models were fitted in R [18].

2.7 Genomic prediction methods

Two BLUP models, GBLUP and EGBLUP, and two Bayesian approaches, BayesB and BayesR, were used in genomic prediction. The two BLUP models can be uniformly formulated as y = 1_nμ + Zg + ε in cross-validation scenarios 1–5 and y_h = 1_hμ_h + Wβ + Z_hg_h + ε_h in the cross-validation scenario incorporating the mid-parental value as a covariate (details can be found in section 2.8), where y is an n-dimensional vector of adjusted phenotypic values of mesocotyl length, n is the number of genotypes in both training and test sets, y_h is an h-dimensional vector of adjusted phenotypic values of mesocotyl length for all hybrids, h is the number of all hybrids, 1_n and 1_h are n- and h-dimensional vectors of ones, μ and μ_h are the intercepts, W is an h-dimensional covariate vector of mid-parental values, β is the covariate effect, g and g_h are the n- and h-dimensional vectors of genetic effects of genotyped individuals and genotyped hybrids, Z and Z_h are the design matrices for g and g_h, and ε and ε_h are the random residuals following $ε \sim N (0, I_{n} σ_{ε}^{2})$ and $ε_{h} \sim N (0, I_{h} σ_{ε_{h}}^{2})$, where I_n and I_h are identity matrices and $σ_{ε}^{2}$ and $σ_{ε_{h}}^{2}$ are the residual variance components. For GBLUP, the genetic effects g and g_h were the additive effects following $g = a \sim N (0, A σ_{a}^{2})$ and $g_{h} = a_{h} \sim N (0, A_{h} σ_{a_{h}}^{2})$, where a and a_h are n- and h-dimensional vectors of additive genetic effects of genotyped individuals and genotyped hybrids, $σ_{a}^{2}$ and $σ_{a_{h}}^{2}$ are the corresponding variance components, and A and A_h are the additive genomic relationship matrices [26].

For EGBLUP, the genetic effects g and g_h contain both additive and additive-by-additive epistatic effects, assuming $g = \begin{pmatrix} a \\ p \end{pmatrix} \sim MN \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} A σ_{a}^{2} & 0 \\ 0 & A \# A σ_{p}^{2} \end{pmatrix} \right)$ and $g_{h} = \begin{pmatrix} a_{h} \\ p_{h} \end{pmatrix} \sim MN \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} A_{h} σ_{a_{h}}^{2} & 0 \\ 0 & A_{h} \# A_{h} σ_{p_{h}}^{2} \end{pmatrix} \right)$, where # denotes the Hadamard product, p and p_h are n- and h-dimensional vectors of epistatic genetic effects of genotyped individuals and genotyped hybrids, and $σ_{p}^{2}$ and $σ_{p_{h}}^{2}$ are the corresponding variance components. The genomic heritability was calculated based on the GBLUP model using the formula $h_{g}^{2} = \frac{σ_{a}^{2}}{σ_{a}^{2} + σ_{ε}^{2}}$, where the variance components were estimated separately in the populations of lines and hybrids.

The two Bayesian approaches can be uniformly formulated as y_r = 1_rμ + X_rγ + e in cross-validation scenarios 1–5 and $y_{h_{r}} = 1_{h_{r}} μ_{h} + W_{h_{r}} β + X_{h_{r}} γ_{h} + e_{h}$ in the cross-validation scenario incorporating the mid-parental value as a covariate, where y_r is an r-dimensional vector of adjusted phenotypic values of mesocotyl length, r is the number of genotypes in the training set, $y_{h_{r}}$ is an h_r-dimensional vector of adjusted phenotypic values of mesocotyl length, h_r is the number of genotypes in the training set of hybrids, 1_r and $1_{h_{r}}$ are r- and h_r-dimensional vectors of ones, μ and μ_h are the intercepts, γ and γ_h are m-dimensional vectors of additive genetic effects of each SNP predicted in the training set and in the training set of hybrids, respectively, m is the number of SNPs, X_r and $X_{h_{r}}$ are r × m- and h_r × m-dimensional matrices with elements 0, 1, and 2 representing the number of copies of the alternative allele of each SNP, $W_{h_{r}}$ is an h_r-dimensional covariate vector of mid-parental values, and e and e_h are the random residuals following $e \sim N (0, I_{r} σ_{e}^{2})$ and $e_{h} \sim N (0, I_{h_{r}} σ_{e_{h}}^{2})$, where I_r and $I_{h_{r}}$ are identity matrices and $σ_{e}^{2}$ and $σ_{e_{h}}^{2}$ are the residual variance components. In BayesB, the prior distribution of the marker effect is assumed to be a mixture of a t distribution with a fixed probability π and a point mass at zero with probability 1 − π [27]. In BayesR, the marker effect is assumed to follow a mixture of four normal distributions with zero mean and different variances. The proportions of the four normal distributions, π = (π₁, π₂, π₃, π₄), are constrained to sum to unity [28].

The phenotypic values of hybrids in the test set, $\hat{y_{s}}$, were predicted using the formula $\hat{y_{s}} = 1_{s} \hat{μ} + X_{s} \hat{γ}$ in cross-validation scenarios 1–5 and $\hat{y_{s}} = 1_{s} \hat{μ_{h}} + W_{s} \hat{β} + X_{s} \hat{γ_{h}}$ in the cross-validation scenario incorporating the mid-parental value as a covariate, where 1_s is an s-dimensional vector of ones, s is the number of hybrids in the test set, W_s is an s-dimensional covariate vector of mid-parental values, X_s is an s × m-dimensional matrix of genotypic profiles of hybrids in the test set, $\hat{μ}$, $\hat{μ_{h}}$ and $\hat{β}$ are the estimated values of the intercepts and the covariate effect, respectively, and $\hat{γ}$ and $\hat{γ_{h}}$ are m-dimensional vectors of SNP effects predicted in the training set and in the training set of hybrids, respectively.
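The two covariance kernels used by GBLUP and EGBLUP can be illustrated numerically. The sketch below assumes a VanRaden-type additive genomic relationship matrix, which may differ from the estimator in reference [26], and builds the additive-by-additive kernel as the Hadamard (element-wise) product A # A. Function names and the toy marker matrix are hypothetical.

```python
import numpy as np

def additive_grm(M):
    """Additive genomic relationship matrix (VanRaden-type; an assumption here)."""
    M = np.asarray(M, dtype=float)       # individuals x SNPs, coded 0/1/2
    p = M.mean(axis=0) / 2.0             # alternative allele frequencies
    W = M - 2.0 * p                      # centre each SNP column
    return W @ W.T / (2.0 * np.sum(p * (1.0 - p)))

def epistatic_kernel(A):
    """Additive-by-additive kernel as the Hadamard product A # A."""
    return A * A

M = np.array([[0, 2, 0],
              [2, 0, 2],
              [0, 0, 2],
              [2, 2, 0]])
A = additive_grm(M)
AA = epistatic_kernel(A)
```

Both matrices are symmetric with one row and column per individual, so they can be supplied as separate random-effect kernels to a mixed-model or Bayesian kernel regression solver.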

In theory, the calibration models using hybrids (scenarios 2–5 and the scenario incorporating the mid-parental value as a covariate) could predict both additive and dominance genetic effects and separate them. However, because only one female parent was used in our study, the additive genotypic profiles of the hybrids were completely collinear with their dominance genotypic profiles. Therefore, the additive and dominance genetic effects could not in practice be partitioned. Nevertheless, because of this collinearity the dominance effect was confounded with the additive effect, so the genetic merit due to dominance was still captured and used in the scenarios with hybrids (scenarios 2–5).
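The collinearity argument can be checked on a toy example. With a single homozygous female parent, the hybrid at each SNP can only take two genotypes, so the dominance code (1 for heterozygotes, 0 otherwise) is an exact linear function of the additive code at that SNP. The sketch below is an illustration under that coding assumption; the function name and the toy genotypes are hypothetical.

```python
import numpy as np

def additive_dominance_codes(male_geno, female_geno):
    """Return additive (0/1/2) and dominance (1 = heterozygous) codes of hybrids."""
    males = np.asarray(male_geno, dtype=float)
    female = np.asarray(female_geno, dtype=float)
    additive = (males + female) / 2.0
    dominance = (additive == 1.0).astype(float)
    return additive, dominance

males = np.array([[0, 2], [2, 0], [2, 2]])   # three homozygous male parents
female = np.array([0, 2])                    # the single common female parent
additive, dominance = additive_dominance_codes(males, female)

# If the female carries 0, dominance equals the additive code;
# if she carries 2, dominance equals 2 minus the additive code.
linear_in_additive = np.where(female == 0, additive, 2.0 - additive)
```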

The BLUP models were fitted in R [18] using the package BGLR [29]. The Bayesian approaches were fitted using GCTB software [30]. The number of iterations, burn-in, and thinning interval of all models were set to 30,000, 5,000, and 5, respectively.

2.8 Cross-validation scenarios

The 401 hybrids were randomly and evenly divided into five folds. One fold formed the test set. Five scenarios for composing the training set were considered: Scenario 1) reference hybrids' parents: the parental lines of the hybrids not in the test set formed the training set; Scenario 2) reference hybrids: the four folds of hybrids outside the test set constituted the training set; Scenario 3) reference hybrids and their parents: the four folds of hybrids outside the test set and their parents together comprised the training set; Scenario 4) reference hybrids and all lines: the four folds of hybrids outside the test set and all lines were combined as the training set; Scenario 5) reference hybrids and parents of the test set: the four folds of hybrids outside the test set and the parental lines of the test set together comprised the training set. A further scenario, using the mid-parental values of all hybrids as a covariate in the prediction models, was also considered. Its training set consisted of the four folds of hybrids outside the test set. This scenario exploited the phenotypic data of all parents and of the reference hybrids, and therefore contained reference information comparable to that of scenario 5. Each cross-validation scenario was repeated 20 times, yielding in total 100 random partitions into training and test sets for each scenario. The prediction accuracy of GS and MAS was evaluated by combining the five test sets in each repeat of the cross-validation. Specifically, the genomic predicted genetic values of the five test sets in each repeat were combined, and the Pearson correlation coefficient between the combined predicted values and the corresponding adjusted phenotypic values was calculated as the genomic prediction accuracy. Thus, 20 prediction accuracies from the 20 repeats of cross-validation were obtained for each training set composition scenario.
In the mid-parental value prediction scenario, no model was trained and the predicted genetic values of hybrids in the test set were always the mid-parental values of their parents. Therefore, when the five hybrid test sets in each repeat of cross-validation were combined to measure the prediction accuracy, the combined predicted values did not depend on how the training and test sets were sampled: they were simply the mid-parental values of the whole hybrid population. Consequently, there was only one prediction accuracy value for the mid-parental value prediction scenario. The scenarios of different training set compositions are illustrated in S1 Fig. All prediction accuracies of GS and MAS were z-transformed for statistical tests and analysis of variance (ANOVA).
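The accuracy measure described above (combine the predictions of the five test folds, correlate them with the adjusted phenotypes, then z-transform) can be sketched as follows. This is an illustrative Python/NumPy outline for one repeat of the scenario in which only reference hybrids are used for training; `fit_predict` is a hypothetical user-supplied callable, and the simple regression used in the toy example stands in for any of the prediction models.

```python
import numpy as np

def five_fold_accuracy(y, fit_predict, n_folds=5, seed=0):
    """One cross-validation repeat: return (Pearson r, Fisher z) over combined folds."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    pred = np.empty_like(y)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        pred[test_idx] = fit_predict(train_idx, test_idx)
    r = np.corrcoef(pred, y)[0, 1]        # accuracy over the combined test sets
    return r, np.arctanh(r)               # Fisher z-transformation

# Toy stand-in for a prediction model: simple linear regression on one predictor
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 0.01 * np.sin(37.0 * x)

def fit_predict(train_idx, test_idx):
    slope_intercept = np.polyfit(x[train_idx], y[train_idx], 1)
    return np.polyval(slope_intercept, x[test_idx])

r, z = five_fold_accuracy(y, fit_predict)
fold_sizes = [len(f) for f in np.array_split(np.arange(401), 5)]
```

Repeating the call with 20 different seeds would give the 20 accuracies per scenario that enter the statistical tests; with 401 hybrids the five folds contain 81, 80, 80, 80 and 80 hybrids.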

To investigate the impact of training set size on genomic prediction accuracy, 5% to 80% of the reference hybrids in each training set composition scenario were randomly sampled to form training subsets of different sizes. The sampling of training subsets was repeated 20 times for each sampling ratio, yielding in total 2000 (20×100) calibrations and predictions in marker-assisted selection and genomic prediction.

2.9 Classification of SNPs in genomic prediction

The SNP markers were classified into two groups by GWAS. One group consisted of the TA-SNPs identified in GWAS and the other group contained the remaining genome-wide SNPs. GWAS was performed once using all lines and once using the total population of all lines and hybrids. The GWAS model was identical to that used in MAS. The significance threshold defining the TA-SNPs was taken from the best-performing MAS model, i.e., the one with the highest overall prediction accuracy. The GBLUP model was used to evaluate the effectiveness of classifying markers in genomic prediction. The group of TA-SNPs was fitted either as a fixed or as a random effect in the model. When fitted as a fixed effect, a principal component analysis was applied because the number of TA-SNPs could be large, and the principal components accounting for 95% of the variation were used in place of the TA-SNPs in the model. When the TA-SNPs were fitted as a random effect, two separate kernels, built from the TA-SNPs and from the remaining genome-wide SNPs, respectively, were fitted in the GBLUP model. The GBLUP model was implemented in the R package BGLR [29] with 30,000 iterations, a burn-in of 5,000, and a thinning interval of 5.
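Replacing a large set of TA-SNPs by the principal components that explain 95% of their variation can be sketched with a singular value decomposition. This is an illustrative Python/NumPy example, not the implementation used in the study; the function name `pcs_explaining` and the toy matrices are hypothetical.

```python
import numpy as np

def pcs_explaining(X, threshold=0.95):
    """Return PC scores and the number of PCs needed to reach the variance threshold."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                          # centre each SNP column
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)   # cumulative variance explained
    k = int(np.searchsorted(explained, threshold) + 1)
    return U[:, :k] * s[:k], k

# Two fully redundant SNPs plus a monomorphic one: one PC suffices
redundant = [[0, 0, 2], [2, 2, 2], [0, 0, 2], [2, 2, 2]]
scores_1, k_1 = pcs_explaining(redundant)

# Two uncorrelated SNPs with equal variance: both PCs are needed
independent = [[2, 0], [0, 2], [2, 2], [0, 0]]
scores_2, k_2 = pcs_explaining(independent)
```

The returned scores can then be used as fixed-effect covariates in place of the original TA-SNP columns.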

3. Results

3.1 Phenotypic analysis statistics and population diversity

Results of the phenotypic analysis are summarized in Table 1 and shown in detail in S3 Table. The BLUE of mesocotyl length ranged from -0.14 to 5.87 for parental lines and from -0.10 to 5.61 for hybrids. Heterosis analysis indicated that, among the 401 hybrids, 41% showed high-parent heterosis, 19% showed mid-parent heterosis, 26% showed low-parent heterosis and the remaining 14% showed hybrid inferiority (S4 Table and S2 Fig). High-parent heterosis was thus the predominant type of heterosis for mesocotyl length. The additive variance component was 1.22 for parental lines and 1.70 for hybrids. The heritability estimate was 0.80 for parental lines and 0.58 for hybrids. The genetic diversity of the parental lines was high overall, as indicated by the wide range of genetic similarities between parental lines (Fig 1).

Table 1. Range and coefficient of variation (CV) of best linear unbiased estimates (BLUE) of genetic effect of genotypes, and variance components of additive genetic effect ( $σ_{a}^{2}$ ) and random residual ( $σ_{ε}^{2}$ ), and genomic heritability ( $h_{g}^{2}$ ) of mesocotyl length, separately estimated from the parental lines and hybrids populations.

Population	Size	Range	CV	$σ_{a}^{2}$	$σ_{ε}^{2}$	$h_{g}^{2}$
Lines	402	[-0.14, 5.87]	0.74	1.22	0.30	0.80
Hybrids	401	[-0.10, 5.61]	0.78	1.70	1.21	0.58
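As a quick arithmetic check, the genomic heritabilities in Table 1 follow from the tabulated variance components via the formula in section 2.7. The short Python snippet below is an illustration only; the function name is hypothetical.

```python
def genomic_heritability(var_a, var_e):
    """h_g^2 = sigma_a^2 / (sigma_a^2 + sigma_eps^2)."""
    return var_a / (var_a + var_e)

h2_lines = genomic_heritability(1.22, 0.30)     # parental lines
h2_hybrids = genomic_heritability(1.70, 1.21)   # hybrids
```

Rounded to two decimals, both values agree with the last column of Table 1.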


Fig 1 — The average clustering method was used to order the lines.

3.2 Predictability of marker-assisted selection

The prediction accuracies of MAS in the different training set composition scenarios are shown in Fig 2 and S5 Table. Using mid-parental values to predict hybrid performance (mid-parental value prediction) gave an accuracy of 0.59, which served as the reference for the other prediction scenarios and is marked with a red dashed line in Fig 2. When the training population contained only the parental lines of the reference hybrids (scenario 1), which assumes that no hybrid data are available, the prediction accuracy ranged from 0 to 0.39. It reached its maximum as the significance threshold (P value) increased to 0.001 and remained clearly lower than that of mid-parental value prediction. The effect of the P value on MAS prediction was significant. When only the reference hybrids were used in the training set (scenario 2), the typical scheme in other GS studies of rice hybrid prediction, the prediction accuracy ranged from 0 to 0.48 and reached its maximum as the P value increased to 0.0025, which was still lower than that of mid-parental value prediction. Surprisingly, when the P value was 0.0075 or 0.01, the prediction accuracy dropped dramatically to less than 0.15. This might indicate that the additional markers admitted by a liberal P value bring more noise than signal into the multiple linear regression models we applied. When the reference hybrids and their parental lines were used in the training set (scenario 3), which models the situation in which some hybrids and their parental lines have been phenotyped, the prediction accuracy varied from 0.01 to 0.51, with the maximum achieved at a P value of 0.001, which was again lower than that of mid-parental value prediction.
When the training set contained the reference hybrids and their parental lines as well as the parental lines of the untested hybrids (scenario 4), which models the situation in which the parental lines of untested hybrids have been tested, the prediction accuracy ranged from 0.39 to 0.61 and reached its maximum as the P value increased to 0.005. At P values of 0.0025, 0.005 and 0.0075, the prediction accuracies in scenario 4 were significantly higher than that of mid-parental value prediction. When the training set contained the reference hybrids and the parental lines of the test set (scenario 5), which assumes that the paternal lines of the reference hybrids do not help to predict the test hybrids because of their genetic distance, the prediction accuracy ranged from 0.18 to 0.61 and reached its maximum as the P value increased to 0.005. At P values of 0.005 and 0.0075, the prediction accuracy in scenario 5 was also significantly higher than that of mid-parental value prediction. Scenario 4 achieved high prediction accuracy over a wider range of P values and was therefore the best scenario.

Two-factor analysis of variance showed that the mean square of the scenario was more than threefold that of the P value, indicating that the training set composition had a much larger impact on prediction accuracy than the P value (S6 Table). The interaction between scenario and P value was also significant but relatively less important (S6 Table). Although no single P value was best for all training set compositions, 0.0025 was a good choice when all compositions were considered (Fig 2).

3.3 Predictability of genomic prediction

The average prediction accuracies obtained with the different GP models are shown in Fig 3. With GBLUP, the prediction accuracies of scenarios 1 to 5 were 0.54, 0.63, 0.6, 0.63 and 0.67, respectively; all except that of scenario 1 were significantly higher than that of mid-parental value prediction. The prediction accuracy of scenario 2 was significantly higher than that of scenario 1, indicating that reference hybrids performed better as a training set than the parents of the reference hybrids. The prediction accuracy of scenario 3 was significantly higher than that of scenario 1 but significantly lower than that of scenario 2, implying that combining the parents of the reference hybrids with the reference hybrids was better than using only the parents but inferior to using only the reference hybrids. However, the prediction accuracy of scenario 4, the best-performing scenario in MAS, was significantly higher than that of scenario 3 and equal to that of scenario 2, demonstrating that adding all parents to a training set of reference hybrids improved the predictability only marginally. Scenario 5 had the highest prediction accuracy of all scenarios, indicating that adding the parents of the test set to a training set of reference hybrids significantly improved the predictability in GS.

When the genomic prediction models were compared, their prediction accuracies were quite similar within each scenario, except for scenario 4, in which the EGBLUP method performed conspicuously better than the other approaches (Fig 3).

In the analysis of variance, the mean square for scenario was more than 30-fold that for prediction model (S7 Table), indicating that training set composition had a much greater effect on prediction accuracy than the prediction model. The interaction between scenario and prediction model was also significant but less important (S7 Table).

3.4 Using parental performance as covariates in genomic prediction

Previous studies have concluded that incorporating parental information into the model can improve prediction accuracy [13]. We therefore also incorporated the mid-parental value into the model as a covariate. Surprisingly, the resulting prediction accuracy was markedly and significantly higher than that of scenario 2 (using only the reference hybrids as the training set), demonstrating the large advantage of integrating the mid-parent value as a covariate into the model (Fig 4).
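As a rough illustration of how a mid-parental covariate can enter a GBLUP-type model, the sketch below fits y = μ + βx + g + e with a marker-based kernel for g. It is a simplified stand-in, not the authors' implementation: the variance ratio `ratio` is fixed rather than estimated, the kernel follows the common VanRaden form, and all names and simulated values are hypothetical.

```python
import numpy as np

def vanraden_kernel(markers):
    """Additive genomic relationship matrix from 0/1/2 marker codes."""
    markers = np.asarray(markers, dtype=float)
    freq = markers.mean(axis=0) / 2.0
    centered = markers - 2.0 * freq
    scale = 2.0 * np.sum(freq * (1.0 - freq))
    return centered @ centered.T / scale

def gblup_with_covariate(kernel, y_train, covariate, train, test, ratio=1.0):
    """Predict test hybrids from y = mu + covariate * beta + g + e.

    ratio is sigma_e^2 / sigma_g^2, assumed known here instead of being
    estimated (e.g. by REML), which a real analysis would do.
    """
    x_train = np.column_stack([np.ones(len(train)), covariate[train]])
    x_test = np.column_stack([np.ones(len(test)), covariate[test]])
    v = kernel[np.ix_(train, train)] + ratio * np.eye(len(train))
    v_inv_x = np.linalg.solve(v, x_train)
    v_inv_y = np.linalg.solve(v, y_train)
    # Generalised least squares estimate of the fixed effects (mu, beta).
    beta = np.linalg.solve(x_train.T @ v_inv_x, x_train.T @ v_inv_y)
    # BLUP of the genetic values of the test hybrids.
    resid = y_train - x_train @ beta
    g_test = kernel[np.ix_(test, train)] @ np.linalg.solve(v, resid)
    return x_test @ beta + g_test

# Simulated toy data (not from the study).
rng = np.random.default_rng(1)
n, m = 60, 200
markers = rng.integers(0, 3, size=(n, m))
kernel = vanraden_kernel(markers)
mid_parent = rng.normal(3.0, 0.5, n)      # made-up mid-parent values
y = 0.5 + 0.8 * mid_parent + rng.normal(0, 0.1, n)
train, test = np.arange(0, 48), np.arange(48, 60)
pred = gblup_with_covariate(kernel, y[train], mid_parent, train, test)
print(np.corrcoef(pred, y[test])[0, 1])
```

Treating the covariate as a fixed effect keeps the parental phenotype outside the genomic kernel, so it can carry information that markers alone do not capture.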

Fig 4 — Different letters above the bars indicated the genomic prediction accuracies after Fisher’s z-transformation were significantly different (p < 0.05, t-test) between the two scenarios.
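The significance tests in the figure legends compare prediction accuracies after Fisher's z-transformation. A minimal sketch of that procedure is given below; the use of Welch's variant of the t statistic and the example accuracies are assumptions, since the text does not specify the exact form of the t-test.

```python
import math
from statistics import mean, variance

def fisher_z(r):
    """Fisher's z-transformation of a correlation coefficient."""
    return math.atanh(r)

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Made-up cross-validation accuracies for two scenarios.
acc_a = [0.66, 0.70, 0.68, 0.71, 0.65]
acc_b = [0.61, 0.64, 0.60, 0.66, 0.62]
t_stat, dof = welch_t([fisher_z(r) for r in acc_a],
                      [fisher_z(r) for r in acc_b])
print(round(t_stat, 2), round(dof, 1))
```

The p-value would then be read from a t distribution with `dof` degrees of freedom; the transformation is applied first because raw correlations are not approximately normally distributed.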

3.5 Different training set sizes with subsets of reference hybrids

Next, we investigated the effect of training set size on genomic prediction under the different training set compositions. For scenarios 2 to 5, a proportion n of the 321 reference hybrids in the cross-validation training set was randomly sampled to form a reference hybrid subset, which was combined with the lines specified for each scenario to constitute training sets of different sizes; n ranged from 5% to 100%. In scenario 1, the parents of the sampled reference hybrids formed the training set. The results are given in Fig 5. For all methods, the prediction accuracies in scenarios 1 to 5 increased as the sampling rate n rose from 5% to 100% and reached a plateau at n = 40%. The specific sample sizes are shown in S8 Table. The increase in prediction accuracy was similar across prediction methods in all scenarios except scenario 2, in which the two Bayesian methods showed no apparent improvement when n increased from 5% to 10%. In scenario 2, the prediction accuracies of the two BLUP methods were significantly higher than those of the two Bayesian approaches when n < 20% but similar when n ≥ 20%. Overall, the BLUP methods outperformed the Bayesian approaches when the training set was small (n < 20%) and the male parents of the test hybrids were not used, i.e., in scenarios 1–3.
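The subsampling design can be sketched as follows. This is one reading of the scenario definitions, not the authors' code: the rounding rule, the choice to include only the parents of the sampled reference hybrids in scenarios 3 and 4, the split of 321 reference and 80 test hybrids, and all identifiers are assumptions for illustration.

```python
import random

FEMALE = "TaifengA"  # shared female parent of every hybrid

def training_set(scenario, reference, test, proportion=1.0, seed=0):
    """Return the set of line/hybrid IDs forming the training set.

    reference and test map hybrid ID -> male parent ID. proportion is the
    fraction of reference hybrids sampled (rounding rule is an assumption).
    """
    rng = random.Random(seed)
    k = max(1, round(proportion * len(reference)))
    sampled = rng.sample(sorted(reference), k)
    ref_parents = {reference[h] for h in sampled} | {FEMALE}
    test_parents = set(test.values()) | {FEMALE}
    if scenario == 1:      # parents of the (sampled) reference hybrids
        return ref_parents
    if scenario == 2:      # reference hybrids only
        return set(sampled)
    if scenario == 3:      # reference hybrids and their parents
        return set(sampled) | ref_parents
    if scenario == 4:      # scenario 3 plus parents of the test hybrids
        return set(sampled) | ref_parents | test_parents
    if scenario == 5:      # reference hybrids plus parents of the test hybrids
        return set(sampled) | test_parents
    raise ValueError("scenario must be 1-5")

# Hypothetical IDs: 321 reference hybrids and 80 test hybrids.
reference = {f"H{i}": f"M{i}" for i in range(1, 322)}
test = {f"H{i}": f"M{i}" for i in range(322, 402)}
for n in (0.05, 0.2, 0.4, 1.0):
    print(n, [len(training_set(s, reference, test, n)) for s in range(1, 6)])
```

Fixing the seed makes each subset reproducible; in a full analysis the sampling would be repeated within every cross-validation fold.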

Fig 5 — Training set size was varied by changing the number of reference hybrids included in the training set.

3.6 GBLUP separately fitting trait-associated and -unassociated markers

Because GWAS on genome-wide markers can identify trait-associated SNPs (TA-SNPs), fitting TA-SNPs and trait-unassociated markers separately might improve GS. The TA-SNPs can be fitted separately either as fixed effects or as a random effect (see Methods for details). We first performed GWAS using all parental lines and selected, for prediction, the TA-SNPs at the P value threshold that gave the overall highest prediction accuracies in MAS, i.e., P < 0.005. The results are shown in Fig 6A: with the TA-SNPs incorporated in GBLUP as fixed effects, the prediction accuracies of scenarios 1 to 5 were 0.566, 0.639, 0.613, 0.616 and 0.655, respectively. In contrast, with the TA-SNPs incorporated as a random effect, the prediction accuracies of scenarios 1 to 5 were 0.573, 0.665, 0.628, 0.625 and 0.666, respectively, all significantly higher than those obtained with TA-SNPs as fixed effects. Whether the TA-SNPs were incorporated as fixed or random effects, scenario 5 had the highest prediction accuracy (0.655 and 0.666, respectively).

Fig 6 — The trait-associated SNPs (TA-SNPs) were respectively used as fixed covariates (fixed effect) and independent random kernel (random effect) in the GBLUP model. Undifferentiated use of SNPs was using all SNPs in the GBLUP model. Different letters above the bars indicated the genomic prediction accuracies achieved by varying treatments of SNPs in the model were significantly different (p < 0.05, t-test) after a Fisher’s z-transformation.

Next, we conducted GWAS using all parental lines and hybrids and again selected the TA-SNPs with P < 0.005 for prediction. The results are shown in Fig 6B: with the TA-SNPs incorporated in GBLUP as fixed effects, the prediction accuracies of scenarios 1 to 5 were 0.622, 0.705, 0.678, 0.671 and 0.691, respectively. In contrast, with the TA-SNPs incorporated as a random effect, the prediction accuracies of scenarios 1 to 5 were 0.616, 0.703, 0.669, 0.664 and 0.687, respectively. Interestingly, in all scenarios the prediction accuracies with TA-SNPs as fixed effects were higher than those with TA-SNPs as a random effect, and the differences were significant in scenarios 1, 3 and 4. Whether the TA-SNPs were incorporated as fixed or random effects, scenario 2 had the highest prediction accuracy (0.705 and 0.703, respectively).

We also compared the prediction accuracies obtained with TA-SNPs incorporated in GBLUP as fixed or random effects against those obtained with undifferentiated use of all SNPs in GBLUP. When the TA-SNPs came from GWAS based on all lines, fitting them as fixed effects gave some significant increases in prediction accuracy over undifferentiated use of all SNPs in scenarios 1 to 3, and some significant reductions in scenarios 4 and 5 (Fig 6A).

Meanwhile, with TA-SNPs incorporated as a random effect, the prediction accuracies of scenarios 1 to 3 were significantly higher than those with undifferentiated use of all SNPs, whereas the prediction accuracy of scenario 4 was significantly lower (Fig 6A). In scenario 5, no difference was found between fitting TA-SNPs as a random effect and undifferentiated use of all SNPs (Fig 6A). In contrast, when the TA-SNPs were identified by GWAS based on all lines and hybrids, the prediction accuracies of scenarios 1 to 5 were significantly higher than those with undifferentiated use of all SNPs, regardless of whether the TA-SNPs were incorporated as fixed or random effects (Fig 6B).

In summary, the best modelling choice was to fit the TA-SNPs identified by GWAS on all lines and hybrids as fixed effects in GBLUP under scenario 2.
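The two treatments of TA-SNPs compared in this section can be sketched by showing how the model inputs differ. The snippet below only builds the design matrix and kernels; it does not fit the models. The simple linear kernel, the function names and the simulated genotypes and P values are assumptions for illustration and may differ from the kernel used in the study.

```python
import numpy as np

def split_markers(markers, gwas_p, threshold=0.005):
    """Split a marker matrix into TA-SNPs and the remaining SNPs."""
    markers = np.asarray(markers, dtype=float)
    is_ta = np.asarray(gwas_p) < threshold
    return markers[:, is_ta], markers[:, ~is_ta]

def linear_kernel(markers):
    """Centered, scaled marker cross-product used as a relationship kernel."""
    centered = markers - markers.mean(axis=0)
    return centered @ centered.T / markers.shape[1]

rng = np.random.default_rng(2)
markers = rng.integers(0, 3, size=(30, 500))   # made-up genotypes
gwas_p = rng.uniform(0, 1, 500)                # made-up GWAS P values
gwas_p[:5] = 1e-4                              # force a few TA-SNPs
ta, rest = split_markers(markers, gwas_p)

# Fixed-effect treatment: TA-SNPs enter the design matrix as covariates,
# and the remaining SNPs form a single random-effect kernel.
x_fixed = np.column_stack([np.ones(len(ta)), ta])
k_rest = linear_kernel(rest)

# Random-effect treatment: TA-SNPs get their own kernel alongside k_rest.
k_ta = linear_kernel(ta)
print(x_fixed.shape, k_ta.shape, k_rest.shape)
```

With fixed effects each TA-SNP receives an unshrunken coefficient, whereas a separate kernel shrinks the TA-SNP effects jointly, which is one way to understand why the preferred treatment depends on how reliable the GWAS hits are.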

4. Discussion

This study demonstrated the potential of GS for breeding hybrid rice varieties suited to direct seeding, using mesocotyl length as an indicative trait. Our prediction of mesocotyl length in hybrid rice was based on 401 hybrid combinations obtained by test-crossing 401 sequenced rice varieties from Southeast Asia, Guangdong and South Asia with the sequenced variety Taifeng A. The inbred lines used as male parents are the ancestral parents of many elite varieties and are genetically diverse. Taifeng A, the sterile line used as the female parent, has excellent agronomic characters and is widely used in developing hybrid varieties. Therefore, the results of our study have high practical value.

4.1 Relatedness driving the prediction accuracy in MAS

Previous studies have demonstrated that the relatedness between the training and test sets can affect prediction accuracy in MAS [31,32]. Our study confirms this finding in rice. Training set composition scenarios 4 and 5 included the parents of the test hybrids, and the prediction accuracies in these two scenarios were remarkably higher than those in the scenarios without the parents of the test hybrids, especially when the P value threshold for declaring SNPs significant was relatively liberal, which may introduce redundant predictors and impair the predictability of the MAS model (Fig 2). The relatively high relatedness between the training and test sets in scenarios 4 and 5 could compensate for this nuisance. Because relatedness is described by SNP genotypes and its reliable estimation requires a certain number of markers, when the P value threshold was strict and only a few significant SNPs were available, the impact of relatedness was negligible and the prediction accuracies were driven by the number of available significant SNPs (Fig 2).

The impact of relatedness can also be seen in scenarios 1–3: the prediction accuracies in scenario 2 were conspicuously higher than those in scenarios 1 and 3 when the P value thresholds were liberal (≥ 0.0025), a range in which the three scenarios had a comparable number of significant SNPs. Theoretically, the relatedness between the training and test sets is higher in scenario 2 than in scenarios 1 and 3, because the male parents of the reference hybrids are genetically distant from the test hybrids and including them in the training set therefore reduces relatedness.

4.2 GS is superior to MAS in rice mesocotyl length prediction

In the absence of genotypic data, hybrid performance can be predicted from mid-parental values. We used the mid-parental value of each hybrid as a benchmark for the genomics-enabled predictions. Interestingly, the accuracy of mid-parental prediction was overall comparable to that of MAS but significantly lower than that of GS, indicating that GS is an efficient genomics-enabled approach for breeding hybrid rice for mesocotyl length.
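A mid-parental benchmark of this kind can be computed in a few lines. The sketch below assumes that prediction accuracy is measured as the Pearson correlation between predicted and observed hybrid values, which is common practice but is an assumption here; the phenotype values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mid_parent_accuracy(hybrids):
    """hybrids: list of (female_value, male_value, hybrid_value) tuples."""
    mid = [(f + m) / 2.0 for f, m, _ in hybrids]
    obs = [h for _, _, h in hybrids]
    return pearson(mid, obs)

# Made-up phenotypes; the female value is constant because all hybrids
# share one female parent, so the mid-parent value varies with the male only.
toy_hybrids = [(1.0, 2.0, 1.8), (1.0, 3.0, 2.1), (1.0, 1.5, 1.4), (1.0, 4.0, 2.9)]
print(mid_parent_accuracy(toy_hybrids))
```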

The conspicuous advantage of scenarios 2–5 over scenario 1 could be attributed to the accommodation of dominance effects in addition to additive effects, and also to the relatedness exploited, because the male parents of the reference hybrids are genetically distant from the test hybrids. The superiority of scenario 5 over scenario 4 substantiates the importance of relatedness (Fig 3). The advantage of EGBLUP over the other genomic prediction approaches in scenarios 3 and 4 indicates that using more inbred lines helps to capture epistatic effects (Fig 3).

4.3 Incorporating parental performance as covariates improves prediction accuracy

Compared with genomic prediction based solely on genomic data, including parental phenotypes in the model can significantly improve prediction accuracy [13]. Our study supports this finding (Fig 4). It is worth noting that the amount of reference information in training set composition scenario 4 was identical to that in the scenario using mid-parental values as phenotypic data in the model; nevertheless, the prediction accuracies in the former were significantly lower than those in the latter (Figs 3 and 4), which indicates that using mid-parental values as phenotypic data in GS is a more efficient way to exploit parental information. Xu et al. [13] noted that using the parental phenotypes as a covariate (predictor) in the model might intrinsically capture environmental effects and genotype-by-environment interactions. Breeders can take advantage of this because the phenotypes of parental lines are often available before crossing, so no additional expenditure is needed.

4.4 Separately modelling the trait-associated and -unassociated markers significantly improved the genomic prediction accuracy

Previous studies have demonstrated that predictability in GS can be significantly enhanced by integrating trait-associated markers into the model [33–35]. We found similar results. For all scenarios, the prediction accuracies obtained with associated markers identified by GWAS using all parental lines and hybrids, fitted either as fixed or random effects, were significantly higher than those obtained by using all SNPs without discrimination (Fig 6B). By comparison, the advantage of treating the SNPs differently in genomic prediction was smaller when the GWAS used only the lines (Fig 6A). A possible explanation is that expanding the GWAS population enhances the power to identify trait-associated markers and thereby improves genomic predictability.

As for the relative effectiveness of fitting TA-SNPs as fixed or random effects, when GWAS was conducted using all the parental lines and hybrids, fitting the TA-SNPs as fixed effects in the genomic prediction model was overall better than fitting them as random effects. However, the ranking was reversed when GWAS was based only on the parental lines (Fig 6). This could be because GWAS using all lines and hybrids is more powerful and identifies more reliable TA-SNPs. Because a marker fitted as a fixed effect in the linear model generally has a stronger influence than one fitted as a random effect, a more reliable identification of trait-marker associations in GWAS supports the fixed-effect treatment. If the GWAS is less powerful, the more conservative option of treating the trait-associated markers as random effects would be more appropriate.

Overall, before implementing GS, we suggest using GWAS to identify trait-associated markers and then modelling the trait-associated and -unassociated markers separately in the GS models.

5. Conclusion

Based on a population of 402 rice lines and their 401 hybrid combinations, we demonstrated that using half-sib hybrids as the training set, together with the mid-parental phenotypic values of all hybrids fitted as a covariate in the genomic models, could achieve an optimal prediction of mesocotyl length, which is indicative of suitability for rice direct seeding. A training set of approximately 60 hybrids (20% of the total hybrids) achieved prediction accuracy comparable to that obtained with all hybrids. Dividing the SNPs into trait-associated and -unassociated groups using GWAS based on the entire population could further improve the prediction accuracy. In practice, we suggest first implementing GWAS on all observations to differentiate the trait-associated and -unassociated markers, and then using the phenotyped hybrids as the training set to predict the untested hybrids, with their parents’ phenotypes fitted as a covariate in GP models that accommodate the trait-associated and -unassociated markers separately.

Supporting information

S1 Fig. Illustration of different training set composition scenarios.

(TIF)


S2 Fig. Heterosis analysis of 401 hybrids.

(TIF)


S1 Table. The information of 402 rice accessions.

(DOCX)


S2 Table. The genotypic data of the 402 lines.

(XLSX)


S3 Table. The best linear unbiased estimates (BLUEs) of mesocotyl length of 402 rice accessions and 401 hybrids.

(DOCX)


S4 Table. Heterosis analysis of 401 hybrids.

The mid-parent value is the average of the best linear unbiased estimates (BLUEs) of mesocotyl length of the two parents of a hybrid. The d-value is the difference in BLUE between the hybrid and its mid-parent value. The a-value is the absolute value of the difference in BLUE between the parents. Hp is d-value / a-value. High-parent heterosis (HPH) means that the Hp of the hybrid was above 1. Mid-parent heterosis (MPH) means that Hp was above 0 and at most 1. Low-parent heterosis (LPH) means that Hp was at least -1 and below 0. Hybrid inferiority (HI) means that Hp was below -1.

(DOCX)
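The heterosis classes defined in the S4 Table legend can be sketched as a small function. The function name and example values are hypothetical; the cases Hp = 0 and equal parental values are not covered by the legend, so the sketch returns no class for them.

```python
def heterosis_class(hybrid, parent_1, parent_2):
    """Return (Hp, class label) following the S4 Table definitions."""
    mid_parent = (parent_1 + parent_2) / 2.0
    d_value = hybrid - mid_parent          # hybrid minus mid-parent value
    a_value = abs(parent_1 - parent_2)     # absolute difference between parents
    if a_value == 0:
        return None, None                  # Hp undefined for equal parents
    hp = d_value / a_value
    if hp > 1:
        label = "HPH"                      # high-parent heterosis
    elif 0 < hp <= 1:
        label = "MPH"                      # mid-parent heterosis
    elif -1 <= hp < 0:
        label = "LPH"                      # low-parent heterosis
    elif hp < -1:
        label = "HI"                       # hybrid inferiority
    else:
        label = None                       # hp == 0 is not covered by the legend
    return hp, label

# Made-up BLUE values for illustration.
for hybrid_value in (6.0, 5.0, 2.0, 0.0):
    print(hybrid_value, heterosis_class(hybrid_value, 2.0, 4.0))
```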


S5 Table. The prediction accuracies of scenario 1–5 in MAS.

The asterisks indicate the Fisher’s z-transformed prediction accuracies in scenarios 1–5 were significantly higher (p < 0.05, t-test) than the Fisher’s z-transformed mid-parental value prediction accuracy.

(DOCX)


S6 Table. Two-Way ANOVA in MAS prediction accuracies.

Scenario and P value are the two factors. Scenario:P value represents the interaction effect between scenario and P value. Df represents degree of freedom. SS represents sum of squares. MS represents mean squares. F value is MS / MS_Error. P (F) is the P value of F-test. All prediction accuracies were Fisher’s z-transformed.

(DOCX)


S7 Table. Two-Way ANOVA in GS prediction accuracies.

Scenario and prediction model are the two factors. Scenario:prediction model represents the interaction effect between scenario and prediction model. Df represents degree of freedom. SS represents sum of squares. MS represents mean squares. F value is MS / MS_Error. P (F) is the P value of F-test. All prediction accuracies were Fisher’s z-transformed.

(DOCX)


S8 Table. The specific sample size of training set in all scenarios.

(DOCX)


Acknowledgments

The authors thank Sanwen Huang, Agricultural Genomics Institute in Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China for providing all the necessary test facilities.

Data Availability

All relevant data are within the paper and its Supporting information files.

Funding Statement

This work was supported by National Key R&D Program of China (2020YFE0202300), the Young Elite Scientists Sponsorship Program by CAST (YESS, 2020QNRC001) and the Agricultural Science and Technology Innovation Program (ASTIP). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Farooq M, Siddique KHM, Rehman H, Aziz T, Lee DJ, Wahid A. Rice direct seeding: experiences, challenges and opportunities. Soil Till Res. 2011; 111:87–98. doi: 10.1016/j.still.2010.10.008
2. Zhou W, Guo Z, Chen J, Jiang J, Hui D, Wang X, et al. Direct seeding for rice production increased soil erosion and phosphorus runoff losses in subtropical China. Sci. Total. Environ. 2019; 695:133845. doi: 10.1016/j.scitotenv.2019.133845
3. Mahender A, Anandan A, Pradhan SK. Early seedling vigour, an imperative trait for direct-seeded rice: an overview on physio-morphological parameters and molecular markers. Planta. 2015; 241:1027–1050. doi: 10.1007/s00425-015-2273-9
4. Tu J, Zhang G, Datta K, Xu C, He Y, Zhang Q, et al. Field performance of transgenic elite commercial hybrid rice expressing Bacillus thuringiensis δ-endotoxin. Nat. Biotechnol. 2000; 18:1101–1104. doi: 10.1038/80310
5. Zhan J, Lu X, Liu H, Zhao Q, Ye G. Mesocotyl elongation, an essential trait for dry-seeded rice (Oryza sativa L.): a review of physiological and genetic basis. Planta. 2020; 251:27. doi: 10.1007/s00425-019-03322-z
6. Wu J, Feng F, Lian X, Teng X, Wei H, Yu H, et al. Genome-wide Association Study (GWAS) of mesocotyl elongation based on re-sequencing approach in rice. BMC Plant Biol. 2015; 15:218. doi: 10.1186/s12870-015-0608-0
7. Singh VK, Upadhyay P, Sinha P, Mall AK, Ellur RK, Singh A, et al. Prediction of hybrid performance based on the genetic distance of parental lines in two-line rice (Oryza sativa L.) hybrids. Journal of Crop Science and Biotechnology. 2011; 14(1):1–10. doi: 10.1007/s12892-010-0111-y
8. Tiwari DK, Pandey P, Giri SP, Dwivedi JL. Prediction of gene action, heterosis and combining ability to identify superior rice hybrids. International Journal of Botany. 2011; 7:126–144. doi: 10.3923/ijb.2011.126.144
9. Widyastuti Y, Kartina N, Rumanti IA. Prediction of Combining Ability and Heterosis in the Selected Parents and Hybrids in Rice (Oryza Sativa. L.). Informatika Pertanian. 2017; 26:31–40. doi: 10.21082/ip.v26n1.2017.p31-40
10. Sreewongchai T, Sripichitt P, Matthayatthaworn W. Parental genetic distance and combining ability analyses in relation to heterosis in various rice origins. Journal of Crop Science and Biotechnology. 2021; 24:327–336. doi: 10.1007/s12892-020-00081-2
11. Riedelsheimer C, Czedik-Eysenberg A, Grieder C, Lisec J, Melchinger AE. Genomic and metabolic prediction of complex heterotic traits in hybrid maize. Nat. Genet. 2012; 44:217–220. doi: 10.1038/ng.1033
12. Xu S, Zhu D, Zhang Q. Predicting hybrid performance in rice using genomic best linear unbiased prediction. Proc. Natl. Acad. Sci. 2014; 111:12456–12461. doi: 10.1073/pnas.1413750111
13. Xu Y, Zhao Y, Wang X, Ma Y, Li P, Yang Z, et al. Incorporation of parental phenotypic data into multi-omic models improves prediction of yield-related traits in hybrid rice. Plant Biotechnol J. 2021; 19:261–272. doi: 10.1111/pbi.13458
14. Fu J, Falke KC, Thiemann A, Schrag TA, Melchinger AE, Scholten S, et al. Partial least squares regression, support vector machine regression, and transcriptome-based distances for prediction of maize hybrid performance with gene expression data. Theor. Appl. Genet. 2012; 124:825–833. doi: 10.1007/s00122-011-1747-9
15. Xu S, Xu Y, Gong L, Zhang Q. Metabolomic prediction of yield in hybrid rice. Plant J. 2016; 88:219–227. doi: 10.1111/tpj.13242
16. Westhues M, Schrag TA. Omics-based hybrid prediction in maize. Theor. Appl. Genet. 2017; 130:1927–1939. doi: 10.1007/s00122-017-2934-0
17. Wang S, Wei J, Li R, Qu H, Chater JM, Ma RY, et al. Identification of optimal prediction models using multi-omic data for selecting hybrid rice. Heredity. 2019; 123:395–406. doi: 10.1038/s41437-019-0210-6
18. R Core Team. R: a language and environment for statistical computing. Vienna, Austria. 2016; https://www.R-project.org/.
19. Covarrubias-Pazaran G. Genome-Assisted Prediction of Quantitative Traits Using the R Package sommer. PLoS One. 2016; 11(6):e0156744. doi: 10.1371/journal.pone.0156744
20. Griffing B. Use of a controlled-nutrient experiment to test heterosis hypotheses. Genetics. 1990; 126:753–767. doi: 10.1093/genetics/126.3.753
21. Howie BN, Donnelly P, Marchini J. A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genetics. 2009; 5(6):e1000529. doi: 10.1371/journal.pgen.1000529
22. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007; 81:559–575. doi: 10.1086/519795
23. Yang J, Benyamin B, McEvoy B, Gordon S, Henders A, Nyholt D, et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010; 42:565–569. doi: 10.1038/ng.608
24. Jiang Y, Zhao Y, Rodemann B, Plieske J, Kollers S, Korzun V, et al. Potential and limits to unravel the genetic architecture and predict the variation of Fusarium head blight resistance in European winter wheat (Triticum aestivum L.). Heredity. 2015; 114:318–326. doi: 10.1038/hdy.2014.104
25. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. American Journal of Human Genetics. 2011; 88:76–82. doi: 10.1016/j.ajhg.2010.11.011
26. Endelman JB, Jannink JL. Shrinkage estimation of the realized relationship matrix. G3 (Bethesda). 2012; 2(11):1405–13. doi: 10.1534/g3.112.004259
27. Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001; 157(4):1819–1829. doi: 10.1093/genetics/157.4.1819
28. Moser G, Lee SH, Hayes BJ, Goddard ME, Wray NR, Visscher PM. Simultaneous Discovery, Estimation and Prediction Analysis of Complex Traits Using a Bayesian Mixture Model. PLoS Genetics. 2015; 11(4):e1004969. doi: 10.1371/journal.pgen.1004969
29. Pérez P, de los Campos G. Genome-Wide Regression and Prediction with the BGLR Statistical Package. Genetics. 2014; 198:483–95. doi: 10.1534/genetics.114.164442
30. Zeng J, Vlaming RD, Wu Y, Robinson MR, Lloyd-Jones LR, Yengo L, et al. Signatures of negative selection in the genetic architecture of human complex traits. Nat. Genetics. 2018; 50:746–753. doi: 10.1038/s41588-018-0101-4
31. Gowda M, Zhao Y, Würschum T, Longin CFH, Miedaner T, Ebmeyer E, et al. Relatedness severely impacts accuracy of marker-assisted selection for disease resistance in hybrid wheat. Heredity. 2014; 112:552–561. doi: 10.1038/hdy.2013.139
32. Jiang Y, Schulthess AW, Rodemann B, Ling J, Plieske J, Kollers S, et al. Validating the prediction accuracies of marker-assisted and genomic selection of fusarium head blight resistance in wheat using an independent sample. Theor. Appl. Genet. 2017; 130:1–12. doi: 10.1007/s00122-016-2827-7
33. Zhang Z, Ober U, Erbe M, Zhang H, Gao N, He J, et al. Improving the accuracy of whole genome prediction for complex traits using the results of genome wide association studies. PLoS ONE. 2014; 9:e93017. doi: 10.1371/journal.pone.0093017
34. Bernardo R. Genomewide selection when major genes are known. Crop Sci. 2014; 54:68–75. doi: 10.2135/cropsci2013.05.0315
35. Spindel JE, Begum H, Akdemir D, Collard B, Redoña E, Jannink JL, et al. Genome-wide prediction models that incorporate de novo GWAS are a powerful new tool for tropical rice improvement. Heredity. 2016; 116:395–408. doi: 10.1038/hdy.2015.113

PLoS One. doi: 10.1371/journal.pone.0283989.r001

Decision Letter 0

Muhammad Abdul Rehman Rashid

9 Feb 2023

PONE-D-22-25635

Genomic prediction of rice mesocotyl length indicative of directing seeding suitability using a half-sib hybrid population

PLOS ONE

Dear Dr. He,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Mar 26 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Muhammad Abdul Rehman Rashid, PhD

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf

and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Thank you for stating the following financial disclosure:

“This work was supported by National Key R&D Program of China (2020YFE0202300), and the Agricultural Science and Technology Innovation Program (ASTIP).”

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“The authors acknowledge the support of Director Sanwen Huang, Agricultural Genomics Institute in Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China for providing all the necessary facilities including the funding for conducting the experiment.”

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“This work was supported by National Key R&D Program of China (2020YFE0202300), and the Agricultural Science and Technology Innovation Program (ASTIP).”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have presented a meaningful study to predict rice mesocotyl length, which is an indicative trait associated with emergence rate, early vigor, and lodging tolerance. The authors have used half-sib hybrids to investigate the accuracy of several prediction models. The authors have demonstrated that a two-step linear mixed model with parental information can be advantageous in prediction.

But before acceptance of publication, there are still some parts unclear to me.

First, why filter out the SNPs with a heterozygote rate of more than 10%? Including heterozygotes might help distinguish additive effects from dominance effects.

Second, why examine TA-SNPs as random effects? In a mixed linear model, the expectation of random effects should be 0, but the contribution of TA-SNPs to predicting the dependent variable would be strong.

Third, what is the protocol for selecting the p-value threshold for significant SNPs? It seems arbitrary.

The caption of Figure 3 is hard to interpret. What do the different letters above the bars mean? How do the letters show "the genomic prediction accuracies of varied scenarios were significantly different"? How can one tell from the letters what is significant?

Overall, the manuscript is well-written and promising. The methodology is clear for reproducibility.

Reviewer #2: It is a problem with many papers on prediction models that, instead of explaining the tangible working principle of their prediction methods (BLUP etc.), they tend to write too many matrix equations. In the end, this is a paper intended to improve the plant breeding process, so you should focus on how the SNP data are actually collected, how they are applied to phenotypic data (comparative application), and how SNP variances are related to trait variances. Your paper focuses on applications of different strategies to predict g values etc., but it does not tell how these prediction strategies actually work. In a nutshell, there is a lot of ambiguity about the process of mathematical application in the experiment. To a breeder, this paper does not seem helpful, and it takes a learner further away from modern available education. The paper should focus not on the names of the models and software involved but on tangible explanations of the processes involved.

Reviewer #3: This manuscript has scientific merit and could contribute to the literature, but unfortunately in its present form it is full of grammatical errors. These errors make the manuscript difficult to read and effectively evaluate.

Major concerns center around the clarity of the methods section. The experimental design is poorly described, and it is not clear how parental information was incorporated into the models for prediction scenarios including parental information. I assume the parents’ genotypes were included in the training of the Bayesian approaches and in the calculation of the genomic relationship matrices for the GBLUP models; however, I did not see this explicitly stated and there were comments about the collinearity of the additive and dominance relationship matrices that make me question whether this was the case.

Specific comments related to grammar.

Initially I highlighted issues with grammar and typos but gave up once I got to the methods section given the many errors. Below are the issues I highlighted before abandoning the effort:

Typo line 53: normally have short mesocotyl (≤1.0 cm) [6]. Thereby, it is of crucial to breed long mesocotyl hybrid varieties for direct seeding. – need to reword

Line 74: Thereby, it is recommended to incorporate non-additive effects despite the additive effects are determinant [12]. – need to reword.

Line 87: metabolome data and eight GS methods, founding that the GBLUP approach integrating genomics and metabolome data performed overall the best. – Should be finding?

Line 91: traits of hybrid rice, but no study ever reported the potential of GS on mesocotyl length in hybrid rice, which is indicative to direct seeding – need to reword

Line 95: We experimented several genomic prediction scenarios combining with mid-parental value prediction, marker-assisted selection (MAS) and genome-wide association study (GWAS). – Should be examined several … ?

Line 98: with highest prediction accuracy of mesocotyl length whereby disclose the potential of using genomic selection to improve hybrid rice direct seeding efficiency. – need to reword

Line 144: In consequence, 196,640 high quality SNPs retained. – Should be As a result, ...?

line 178: validation scenarios 1-5 (detailedly introduced in section 2.8) – details can be found in section 2.8. - detailedly is rarely, if ever, used in English.

Line 210: could be uniformed as = + + – do you mean represented as?

Specific comments related to methods:

You never justify why you are using mesocotyl length as a proxy for improving emergence as opposed to measuring emergence directly. Is it more cost-effective? Is it more heritable? Is it too challenging to generate enough seed?

Experimental design – There needs to be a better description of the experimental design. Was there no replication for any of the hybrids? What is meant by partially randomized (is this in reference to parental lines being planted next to hybrids)? Was there some type of blocking? At one point the manuscript mentions corrected phenotypes, which implies some sort of correction based on the experimental design, but none of the models have any experimental factors included.

However, as only one female was used in our study, deductively, the additive genomic profiles of the hybrids were completely collinear with their dominant genomic profiles.

- The dominance relationship of the hybrids would be collinear with the additive relationship of the male inbreds. The additive relationship of the hybrids would be different (higher) as they all share a common female parent. You should be clear in the methods how you calculated the relationship matrix and modify this statement accordingly.

The iteration times of all models were uniformly set to 30,000 and first 5,000 times were set as burn-in.

- Did you confirm burn-in and good mixing of the Markov chains? No thinning was needed?

For cross-validation scenarios 1 and 5

- For GBLUP were the parents included in the genomic relationship matrix? If that is the case, there would be a large difference in the dominance and additive relationship matrices.

Line 259 …the corresponding adjusted phenotypic values was calculated to measure the genomic prediction accuracy.

- How were the phenotypic values adjusted?

In the mid-parental value prediction trial, as the five hybrid test sets in each repeat of cross-validation were collectively the total hybrid population disregarding to the randomness of sampling, thus there was just one prediction accuracy value in the scenario using the mid-parental values as a covariate.

- This statement is confusing. Perhaps some type of figure to illustrate the various cross-validation schemes would help make this clearer.

- It is confusing how parental information is included for prediction in the models.

Line 296 The additive effect variance component was 1.22 for parent lines and 1.7 for hybrids. The heritability estimate was 0.8 for parent lines and 0.58 for hybrids. The genetic diversity of the parental lines was overall high, as indicated by the wider range of genetic similarities between parental lines

- Which model was used to calculate the additive variance components?

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Weihao Ge

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Apr 5;18(4):e0283989. doi: 10.1371/journal.pone.0283989.r002

Author response to Decision Letter 0

14 Mar 2023

Dear editor:

We have revised the manuscript according to the constructive comments from the reviewers and you. A point-by-point response is given to each comment. All revisions in the manuscript are marked in red text. Thank you for all the comments.

Reviewers’ comments:

Reviewer #1:

The authors have presented a meaningful study to predict rice mesocotyl length, which is an indicative trait associated with emergence rate, early vigor, and lodging tolerance. The authors have used half-sib hybrids to investigate the accuracy of several prediction models. The authors have demonstrated that a two-step linear mixed model with parental information can be advantageous in prediction. But before acceptance of publication, there are still some parts unclear to me.

1. Why filter out the SNPs with a heterozygote rate of more than 10%? Including heterozygotes might help distinguish additive effects from dominance effects.

Response: Thanks for the comment. We agree that including more heterozygotes can help distinguish additive effects from dominance effects. However, the SNP genotypes of the hybrids were deduced from the genotypes of their parental lines, and heterozygotes in the parents would introduce uncertainty into that deduction. Ideally, all SNPs would be completely genotyped with no missing values, but in reality this is often not the case. As an alternative, we could eliminate the heterozygotes by setting them as missing values and then imputing them. If the heterozygote rate is high, however, arbitrarily setting heterozygotes as missing values is too manipulative and unreasonable. Therefore, we excluded the SNPs with a heterozygote rate of more than 10% to keep the heterozygote rate relatively low.
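The filtering rule described in this response can be sketched as follows. This is only a minimal illustration and not the authors' pipeline: the 0/1/2 genotype coding, the function name `filter_by_heterozygosity`, and the toy matrix are assumptions, and the imputation step that would follow is not shown.

```python
import numpy as np

def filter_by_heterozygosity(genotypes, max_het_rate=0.10):
    """Drop SNPs whose heterozygote rate exceeds max_het_rate, then mask remaining hets.

    genotypes: (lines x SNPs) array coded 0/1/2, where 1 is a heterozygote
    and np.nan is a missing call (coding assumed for illustration).
    """
    geno = np.asarray(genotypes, dtype=float)
    n_called = (~np.isnan(geno)).sum(axis=0)
    n_het = (geno == 1).sum(axis=0)
    # Heterozygote rate per SNP among called genotypes (0 if nothing was called).
    het_rate = np.divide(n_het, n_called,
                         out=np.zeros(geno.shape[1]), where=n_called > 0)
    keep = het_rate <= max_het_rate
    kept = geno[:, keep].copy()
    # Remaining heterozygous calls become missing values, to be imputed later.
    kept[kept == 1] = np.nan
    return kept, keep, het_rate

# Toy data: 10 lines x 3 SNPs (hypothetical values).
demo = np.zeros((10, 3))
demo[5:, 0] = 2      # SNP 0: no heterozygotes
demo[0, 1] = 1       # SNP 1: 1 of 10 heterozygous (rate 0.1, retained)
demo[:3, 2] = 1      # SNP 2: 3 of 10 heterozygous (rate 0.3, excluded)

kept, keep, het_rate = filter_by_heterozygosity(demo)
print(keep)
print(het_rate)
```

In this sketch a SNP at exactly 10% is retained, since the response excludes only SNPs with a rate of more than 10%.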

2. Why examine TA-SNPs as random effects? In a mixed linear model, the expectation of random effects should be 0, but the contribution of TA-SNPs to predicting the dependent variable would be strong.

Response: Thanks for the remark. If the trait of interest is complex, a large number of TA-SNPs may be present (also depending on the significance threshold of the P value). Fitting them as fixed effects could raise the “large p, small n” problem and cause convergence issues in linear models. Therefore, fitting them as random effects is more appropriate. Other studies have also examined TA-SNPs as random effects, for instance the BayesRC approach proposed by MacLeod et al. (2016)

(MacLeod I M, Bowman P J, Vander Jagt C J, et al. Exploiting biological priors and sequence variants enhances QTL discovery and genomic prediction of complex traits. BMC Genomics, 2016, 17: 1-21.).

3. What is the protocol for selecting the p-value threshold for significant SNPs? It seems arbitrary. The caption of Figure 3 is hard to interpret. What do the different letters above the bars mean? How do the letters show "the genomic prediction accuracies of varied scenarios were significantly different"? How can one tell from the letters what is significant?

Response: Thanks for the comment. 1) The p-value threshold was chosen based on the number of significant SNPs available. If the threshold was too strict, no significant SNPs were available in some scenarios, which would cause imbalance between the scenarios. The threshold also did not need to be too liberal, because the number of effective significant SNPs did not markedly increase until the threshold rose to 0.01.

2) The letters in Figure 3 indicate whether, within the same model algorithm, the prediction accuracies in different scenarios were significantly different at the level of p < 0.05. The letters a-e refer to the scenarios 1-5 (from left to right). If two scenarios share a letter, their prediction accuracies were not significantly different. If two scenarios have no letter in common, their prediction accuracies were significantly different. We have expanded the Figure 3 caption accordingly. This can be seen in lines 413-415.
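The letter-sharing rule stated in this response can be written as a small check. The sketch below is illustrative only: the function name `significantly_different` and the labels are hypothetical and are not taken from Figure 3.

```python
def significantly_different(letters_a: str, letters_b: str) -> bool:
    """Two bars differ significantly only if their letter labels share no letter."""
    return set(letters_a).isdisjoint(letters_b)

# Hypothetical labels above three bars, not the values shown in Figure 3.
labels = {"scenario 1": "a", "scenario 2": "ab", "scenario 3": "b"}

for first, second in [("scenario 1", "scenario 2"), ("scenario 1", "scenario 3")]:
    verdict = significantly_different(labels[first], labels[second])
    print(first, "vs", second, "->", "different" if verdict else "not different")
```

With these made-up labels, a bar marked "ab" is not significantly different from either neighbour, while the bars marked "a" and "b" differ from each other.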

Overall, the manuscript is well-written and promising. The methodology is clear for reproducibility.

Reviewer #2:

It is a problem with many papers on prediction models that, instead of explaining the tangible working principle of their prediction methods (BLUP etc.), they tend to write too many matrix equations. In the end, this is a paper intended to improve the plant breeding process, so you should focus on how the SNP data are actually collected, how they are applied to phenotypic data (comparative application), and how SNP variances are related to trait variances. Your paper focuses on applications of different strategies to predict g values etc., but it does not tell how these prediction strategies actually work. In a nutshell, there is a lot of ambiguity about the process of mathematical application in the experiment. To a breeder, this paper does not seem helpful, and it takes a learner further away from modern available education. The paper should focus not on the names of the models and software involved but on tangible explanations of the processes involved.

Response: Thanks for the comment. Our intention is only to provide an optimal way to predict hybrid performance by capitalizing on genomic data; this is not a study that systematically guides the application of genomic prediction in breeding. We agree that the application is important, and we added one concluding sentence to the conclusion section in lines 610-615 in red font. We also expanded the explanation of the different scenarios to make them clearer and more tangible. Please see the red text in the Materials and methods section.

Reviewer #3:

This manuscript has scientific merit and could contribute to the literature, but unfortunately in its present form it is full of grammatical errors. These errors make the manuscript difficult to read and effectively evaluate.

1. Major concerns center around the clarity of the methods section. The experimental design is poorly described, and it is not clear how parental information was incorporated into the models for prediction scenarios including parental information. I assume the parents’ genotypes were included in the training of the Bayesian approaches and in the calculation of the genomic relationship matrices for the GBLUP models; however, I did not see this explicitly stated and there were comments about the collinearity of the additive and dominance relationship matrices that make me question whether this was the case.

Response: Thanks for the remark. 1) The description of the experimental design has been expanded in lines 110-113 in red text. 2) Your assumption is right: the parents’ genotypes together with the hybrids’ genotypes were all included in one genomic relationship matrix for scenarios 1 and 3-5. For the scenario incorporating the mid-parental value, however, we fitted the mid-parental value as a covariate in the model. We have elaborated the models to clarify the use of parental information. Please see the red text in Materials and methods section 2.7, lines 203-254.

2. Specific comments related to grammar.

Initially I highlighted issues with grammar and typos but gave up once I got to the methods section given the many errors. Below are the issues I highlighted before abandoning the effort:

2.1 Typo line 53: normally have short mesocotyl (≤1.0 cm) [6]. Thereby, it is of crucial to breed long mesocotyl hybrid varieties for direct seeding. – need to reword.

Response: We have revised this sentence. Please see line 53-54 with red text.

2.2 Line 74: Thereby, it is recommended to incorporate non-additive effects despite the additive effects are determinant [12]. – need to reword.

Response: We have reworded this sentence. Please see line 75-76 with red text.

2.3 Line 87: metabolome data and eight GS methods, founding that the GBLUP approach integrating genomics and metabolome data performed overall the best. – Should be finding?

Response: Yes, we have revised and please see line 88-90 with red text.

2.4 Line 91: traits of hybrid rice, but no study ever reported the potential of GS on mesocotyl length in hybrid rice, which is indicative to direct seeding – need to reword.

Response: This sentence has been revised. Please see line 92-93 with red text.

2.5 Line 95: We experimented several genomic prediction scenarios combining with mid-parental value prediction, marker-assisted selection (MAS) and genome-wide association study (GWAS). – Should be examined several … ?

Response: We have revised this sentence. Please see line 96-98 with red text.

2.6 Line 98: with highest prediction accuracy of mesocotyl length whereby disclose the potential of using genomic selection to improve hybrid rice direct seeding efficiency. – need to reword.

Response: We have reworded. This can be seen in line 99-101 with red text.

2.7 Line 144: In consequence, 196,640 high quality SNPs retained. – Should be As a result, ...?

Response: We have revised. Please see line 152-153 with red text.

2.8 Line 178: validation scenarios 1-5 (detailedly introduced in section 2.8) – details can be found in section 2.8. - detailedly is rarely, if ever, used in English.

Response: We have revised. Please see line 188 with red text.

2.9 Line 210: could be uniformed as = + + – do you mean represented as?

Response: We have revised. Please see line 228 with red text.

3. Specific comments related to methods:

3.1 You never justify why you are using mesocotyl length as a proxy for improving emergence as opposed to measuring emergence directly. Is it more cost-effective? Is it more heritable? Is it too challenging to generate enough seed?

Response: Thanks for the comment. Mesocotyl length is significantly positively correlated with the emergence rate and is commonly used in practice to evaluate the effect of direct seeding. Emergence is also important for direct seeding; however, measuring emergence requires more seeds, which is not cost-effective, and generating enough hybrid seeds is too challenging. Therefore, we chose mesocotyl length as a proxy for emergence.

3.2 Experimental design – There needs to be a better description of the experimental design. Was there no replication for any of the hybrids? What is meant by partially randomized (is this in reference to parental lines being planted next to hybrids)? Was there some type of blocking? At one point the manuscript mentions corrected phenotypes, which implies some sort of correction based on the experimental design, but none of the models have any experimental factors included.

Response: Thanks for the comment. We have added the details of the experimental design, including the number of replicates for the mesocotyl length measurement, to the Materials and methods section. This can be seen in lines 110-113 in red text. 1) We set three replicates for the hybrids. 2) To minimize the impact of environment on the phenotypic performance of the parents and their hybrids, each hybrid was planted next to its male parent. The planting position of each hybrid therefore depended entirely on its male parent, which is why we called the design partially randomized. Because “partially randomized” could cause confusion, we have removed it from the manuscript. 3) We have elaborated the phenotype correction process in lines 124-130 in red text.

3.3 However, as only one female was used in our study, deductively, the additive genomic profiles of the hybrids were completely collinear with their dominant genomic profiles. - The dominance relationship of the hybrids would be collinear with the additive relationship of the male inbreds. The additive relationship of the hybrids would be different (higher) as they all share a common female parent. You should be clear in the methods how you calculated the relationship matrix and modify this statement accordingly.

Response: Thanks for the comment. Let us illustrate with an example. For a given locus, assume the genotypic profiles of five male parents are (0 2 0 0 2). As all hybrids share one female, the genotypic profiles of the five hybrids are (0 1 0 0 1) if the genotype of the female is 0, and (1 2 1 1 2) if the genotype of the female is 2. The corresponding dominance genotypic profiles are (0 1 0 0 1) and (1 0 1 1 0), because only heterozygotes have a dominance effect. It is clear that (0 1 0 0 1) is collinear with (0 1 0 0 1), and (1 2 1 1 2) is collinear with (1 0 1 1 0). The additive and dominance relationship matrices can both be written as WW^T/c, where W is either the additive or the dominance genotypic matrix and c is a constant. The standardization of W applied in many studies is just a linear transformation of W, which does not change the rank of W. Theoretically, the rank of WW^T is the same as that of W. So, if the additive profile matrix is collinear with the dominance profile matrix, the additive relationship matrix must be collinear with the dominance relationship matrix.
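The single-locus example in this response can be reproduced numerically. The sketch below assumes fully homozygous parents coded 0 or 2; the helper name `hybrid_profiles` and the use of a Pearson correlation as a collinearity check are illustrative choices that do not come from the manuscript.

```python
import numpy as np

# Male parent genotypes at one locus, as in the worked example (0 or 2 = homozygous).
males = np.array([0, 2, 0, 0, 2])

def hybrid_profiles(female, males):
    """Return additive (0/1/2) and dominance (1 = heterozygote) codes of the hybrids."""
    # Each homozygous parent passes on one copy of its allele.
    additive = (female + males) // 2
    dominance = (additive == 1).astype(int)
    return additive, dominance

add0, dom0 = hybrid_profiles(0, males)   # common female genotype 0
add2, dom2 = hybrid_profiles(2, males)   # common female genotype 2

# After centering, the two codings are perfectly correlated (the sign may flip).
r0 = np.corrcoef(add0, dom0)[0, 1]
r2 = np.corrcoef(add2, dom2)[0, 1]
print(add0, dom0, r0)
print(add2, dom2, r2)
```

With a female genotype of 2 the dominance code equals 2 minus the additive code, so the two profiles are collinear once an intercept is fitted, which is why the correlation has magnitude one with a negative sign.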

3.4 The iteration times of all models were uniformly set to 30,000 and first 5,000 times were set as burn-in. - Did you confirm burn-in and good mixing of the Markov chains? No thinning was needed?

Response: Thanks for the remark. The thinning interval was 5 for the BLUP models (the default setting in the BGLR package) and 10 for the Bayesian models (the default setting in the gctb software). To make the thinning uniform, we reset the thinning of the Bayesian models to 5 and reran all the scenarios using Bayesian models in our study. There was no observable difference between the results with thinning set to 5 and to 10. We randomly selected some scenarios using GBLUP and BayesB (after resetting thinning = 5) to show the convergence of the estimates of mu and of the variance components of genotype (var_a) and residual (var_e) in the MCMC as:

As shown, all the estimates converged well for both models.
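For orientation, the chain settings quoted in this exchange (30,000 iterations, 5,000 burn-in, thinning of 5) imply the number of retained draws computed below. This is a sketch under the assumption that every fifth iteration after burn-in is kept; the exact indices that BGLR or gctb store may differ, and the helper `retained_draws` is hypothetical.

```python
import numpy as np

N_ITER, BURN_IN, THIN = 30_000, 5_000, 5

def retained_draws(chain, burn_in, thin):
    """Discard the burn-in and keep every `thin`-th draw of what remains."""
    return np.asarray(chain)[burn_in::thin]

# Use the iteration index itself as a stand-in for a sampled parameter.
iteration_index = np.arange(N_ITER)
kept_idx = retained_draws(iteration_index, BURN_IN, THIN)
print(len(kept_idx), kept_idx[0], kept_idx[-1])
```

Under that assumption, each posterior summary rests on (30,000 - 5,000) / 5 = 5,000 draws.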

3.5 For cross-validation scenarios 1 and 5 - For GBLUP were the parents included in the genomic relationship matrix? If that is the case, there would be a large difference in the dominance and additive relationship matrices.

Response: Thanks for the comment. The parents were included in the relationship matrix in scenarios 1 and 3-5. As the parents are all pure lines, there are no heterozygotes among them. So even if a dominance relationship matrix were constructed, all the elements corresponding to the lines per se (diagonal) or to the relatedness between a line and a hybrid (off-diagonal) would be zero. The elements corresponding to the hybrids per se and to the relatedness between hybrids are not zero, but this submatrix is collinear with the additive relationship matrix, as explained in the response to comment 3.3. So, there is no need to construct the dominance relationship matrix.
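The zero pattern described in this response can be checked on a toy data set. The sketch below uses an unscaled WW^T (that is, c = 1) with a 0/1 heterozygote coding; the genotypes are made up, and the manuscript's actual centering and scaling are not reproduced.

```python
import numpy as np

# Hypothetical genotypes at three loci: one common female and three male inbreds.
female = np.array([0, 2, 2])
males = np.array([[0, 0, 2],
                  [2, 0, 2],
                  [2, 2, 0]])

# Hybrids receive one allele from each homozygous parent.
hybrids = (female + males) // 2

# Stack parents first, then hybrids, and code dominance as 1 for heterozygotes.
geno = np.vstack([female, males, hybrids])
W_d = (geno == 1).astype(float)
D = W_d @ W_d.T          # unscaled dominance relationship matrix (c = 1)

n_parents = 1 + len(males)
print(D[:n_parents])                 # rows for inbred parents are all zero
print(D[n_parents:, n_parents:])     # hybrid-by-hybrid block is non-zero
```

Because the inbred parents carry no heterozygous calls, their rows and columns in D vanish, and only the hybrid-by-hybrid block holds information.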

3.6 Line 259 …the corresponding adjusted phenotypic values was calculated to measure the genomic prediction accuracy. - How were the phenotypic values adjusted?

Response: Thanks for the remark. We have elaborated the phenotype correction process in line 124-130 with red text. Please check it.

3.7 In the mid-parental value prediction trial, as the five hybrid test sets in each repeat of cross-validation were collectively the total hybrid population disregarding to the randomness of sampling, thus there was just one prediction accuracy value in the scenario using the mid-parental values as a covariate. - This statement is confusing. Perhaps some type of figure to illustrate the various cross-validation schemes would help make this clearer. - It is confusing how parental information is included for prediction in the models.

Response: Thanks for the comment. 1) We have elaborated the calculation of prediction accuracy in the mid-parental value prediction scenario. Please see lines 290-299 in red font. One supplementary figure (S1 Fig) was also added to illustrate the different cross-validation scenarios. 2) For scenarios 1 and 3-5, the parents’ genotypes together with the hybrids’ genotypes were all included in one genomic relationship matrix. For the scenario incorporating the mid-parental value, however, we fitted the mid-parental value as a covariate in the model. We have elaborated the models to clarify the use of the parental information. Please see the red text in Materials and methods section 2.7, lines 203-254.
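One way to read the point about a single accuracy value is sketched below: when a prediction that needs no training (here, the mid-parent value itself) is scored on the five test folds pooled together, the result does not depend on how the hybrids were split. All numbers are simulated, the pooling of folds is an assumption, and `pooled_accuracy` is a hypothetical helper rather than the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hybrids = 20

# Simulated phenotypes: one common female and one male parent per hybrid.
female_pheno = 1.0
male_pheno = rng.normal(2.0, 0.5, n_hybrids)
mid_parent = 0.5 * (female_pheno + male_pheno)
hybrid_pheno = mid_parent + rng.normal(0.0, 0.2, n_hybrids)

def pooled_accuracy(pred, obs, seed, n_folds=5):
    """Pearson correlation over all test folds pooled, for one random partition."""
    order = np.random.default_rng(seed).permutation(len(obs))
    folds = np.array_split(order, n_folds)
    pooled = np.concatenate(folds)      # the folds together cover every hybrid once
    return np.corrcoef(pred[pooled], obs[pooled])[0, 1]

acc_a = pooled_accuracy(mid_parent, hybrid_pheno, seed=1)
acc_b = pooled_accuracy(mid_parent, hybrid_pheno, seed=2)
print(acc_a, acc_b)   # same value for both partitions
```

Prediction accuracy is taken here as the Pearson correlation between predicted and observed values, and the two partitions give the same value because a correlation is unchanged when the same pairs are reordered.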

3.8 Line 296 The additive effect variance component was 1.22 for parent lines and 1.7 for hybrids. The heritability estimate was 0.8 for parent lines and 0.58 for hybrids. The genetic diversity of the parental lines was overall high, as indicated by the wider range of genetic similarities between parental lines - Which model was used to calculate the additive variance components?

Response: Thanks for the remark. The additive and other variance components used to calculate the heritability were estimated with the GBLUP model. We have added the relevant text in lines 225-227 in red font.

Attachment

Submitted filename: Response to Reviewers.docx


PLoS One. doi: 10.1371/journal.pone.0283989.r003

Decision Letter 1

Muhammad Abdul Rehman Rashid

21 Mar 2023

Genomic prediction of rice mesocotyl length indicative of directing seeding suitability using a half-sib hybrid population

PONE-D-22-25635R1

Dear Dr. He,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Muhammad Abdul Rehman Rashid, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

PLoS One. doi: 10.1371/journal.pone.0283989.r004

Acceptance letter

Muhammad Abdul Rehman Rashid

27 Mar 2023

PONE-D-22-25635R1

Genomic prediction of rice mesocotyl length indicative of directing seeding suitability using a half-sib hybrid population

Dear Dr. He:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Muhammad Abdul Rehman Rashid

Academic Editor

PLOS ONE

Associated Data


Supplementary Materials

S1 Fig. Illustration of different training set composition scenarios. (TIF)

S2 Fig. Heterosis analysis of 401 hybrids. (TIF)

S1 Table. The information of 402 rice accessions. (DOCX)

S2 Table. The genotypic data of the 402 lines. (XLSX)

S3 Table. The best unbiased estimated values (BLUE) of mesocotyl length of 402 rice accessions and 401 hybrids. (DOCX)

S4 Table. Heterosis analysis of 401 hybrids. (DOCX)

S5 Table. The prediction accuracies of scenario 1–5 in MAS. (DOCX)

S6 Table. Two-Way ANOVA in MAS prediction accuracies. (DOCX)

S7 Table. Two-Way ANOVA in GS prediction accuracies. (DOCX)

S8 Table. The specific sample size of training set in all scenarios. (DOCX)

Attachment

Submitted filename: Response to Reviewers.docx (DOCX, 466.6 KB)

Data Availability Statement

All relevant data are within the paper and its Supporting information files.
