Assessment of genomic prediction capabilities of transcriptome data in a barley multi-parent RIL population

Christopher Arlt; Delphine van Inghelandt; Jinquan Li; Benjamin Stich

doi:10.1007/s00122-025-05029-0

2025 Sep 10;138(10):247. doi: 10.1007/s00122-025-05029-0


PMCID: PMC12423136 PMID: 40931219

Abstract

Key message

Low-cost and high-throughput RNA sequencing data for barley RILs achieved GP performance comparable to or better than traditional SNP array datasets when combined with parental whole-genome sequencing SNP data.

Abstract

The field of genomic selection (GS) is advancing rapidly on many fronts, including the utilization of multi-omics datasets, with the goal of increasing prediction ability and becoming an integral part of a growing number of breeding programs to ensure future food security. In this study, we used RNA sequencing (RNA-Seq) data to perform genomic prediction (GP) on three related barley RIL populations. We investigated the potential of increasing prediction ability by combining genomic and transcriptomic datasets, adding whole-genome sequencing (WGS) SNP data, functional annotation-based filtering, and empirical quality filtering. Our RNA-Seq data were generated cost-efficiently using small-footprint plant cultivation, high-throughput RNA extraction, and miniaturized library preparation. We also examined sequencing depth reduction as an additional cost-saving measure. We used fivefold cross-validation to evaluate the prediction ability of the gene expression dataset, the RNA-Seq SNP dataset, and the consensus SNP dataset between the RNA-Seq and parental WGS data, resulting in prediction abilities between 0.73 and 0.78. The consensus SNP dataset performed best, with five out of eight traits predicted significantly better than with a 50K SNP array, which served as a benchmark. The advantage of the consensus SNP dataset was most prominent in the inter-population predictions, in which the training and validation sets originated from different RIL sub-populations. We were therefore able to show not only that RNA-Seq data alone can predict various complex traits in barley using RILs, but also that the performance can be further increased with WGS data, for which public availability will steadily increase.

Supplementary Information

The online version contains supplementary material available at 10.1007/s00122-025-05029-0.

Introduction

Due to the continuous increase in the global population and in the per capita crop demand, there is a need for agricultural expansion and advancements in plant breeding (Lenaerts et al. 2019; Tilman et al. 2011). The ecological impact of land clearing due to intensified agriculture is already substantial and will increase further (Zabel et al. 2019; Burney et al. 2010). This makes improving crop yield one of the most important tasks (Burgess et al. 2023).

For thousands of years, agriculturally important quantitative traits were improved through plant and animal breeding without knowledge of the principles of genetics. Instead, improvement was based simply on human selection of phenotypes (Purugganan and Fuller 2009; Pourkheirandish and Komatsuda 2007; Wright 2005). This has changed in recent decades with the increase in knowledge and capabilities in genetics, which in turn was made possible by dramatic progress in genotyping and sequencing approaches. This enabled the development and application of marker-assisted selection (MAS) (Dekkers and Hospital 2002; Lande and Thompson 1990; Fernando and Grossman 1989). The challenge is that most agriculturally relevant traits are quantitative traits controlled by many genes, each of which has only a small effect (Glazier et al. 2002; Mackay 2001), while MAS is most effective when large-effect loci contribute to the trait of interest (Heffner et al. 2009). With the increasing availability of high-density genome-wide marker data, genomic selection (GS) was introduced as a method for estimating breeding values using all available markers (Meuwissen et al. 2001) rather than only those that were significantly associated with the trait. GS alleviated the downside of classical MAS when trying to predict complex quantitative traits with many small-effect loci (Zhao et al. 2014) and increased the rate of genetic gain. While GS was first studied in the context of animal breeding, it was later adopted by plant breeders (Heffner et al. 2009; Zhong et al. 2009; Bernardo and Yu 2007). The reduction in costs to produce marker data using next-generation sequencing techniques such as genotyping by sequencing (GBS) further increased the popularity of GS (Bhat et al. 2016).

Over the last decade, advances in GS methods have led to increased prediction abilities. For example, utilizing high-throughput phenotyping data as predictors increased the performance of multivariate GS models (Rutkoski et al. 2016). Additionally, traditional GS models were expanded to multi-trait GS models (Tsai et al. 2020; Lyra et al. 2017; Jia and Jannink 2012). Furthermore, multi-environmental GS models increased the predictability in multiple studies (Hu et al. 2023; Li et al. 2019). The most recent advances are the inclusion of deep learning, machine learning, and artificial intelligence in the GS workflow (Sandhu et al. 2021; Montesinos-Lopez et al. 2021; Bayer et al. 2021; Washburn et al. 2020; Harfouche et al. 2019). This is an ongoing field of research and has not yet been fully explored, with studies showing limitations or at least the need for adjustments to current implementations (Ubbens et al. 2021).

While the core of most GS models is a relationship matrix derived from genome-wide marker data, additional sources of information like transcriptome and metabolome data can be used as predictors in the field of multi-omic GS. The transcriptome is a promising predictor, bridging the gap between the genome and the trait (Azodi et al. 2020). Quantified gene expression can be captured using a microarray or mRNA sequencing (RNA-Seq). RNA-Seq data are more versatile than microarrays because not only gene expression information but also sequence variants in the portion of the genome covered by the sequenced RNA can be extracted from such data (Azodi et al. 2020; Weisweiler et al. 2019). Because both aspects are rarely considered together and, to our knowledge, no research is available that evaluates functional or quality-based subsetting of sequence variants from RNA-Seq data, the full potential of RNA-Seq datasets for GS has not yet been sufficiently studied.

Within the field of plant breeding, several studies have previously shown that including transcriptome and metabolome data (multi-omics) in GS models has the potential to increase prediction capabilities. For example, Michel et al. (2021) added incomplete RNA-Seq data to complete genome-wide marker data to assess disease resistance phenotypes in wheat. In addition, a multi-omics prediction study in oat used transcriptomic and metabolomic data to compare single-environmental trials and multi-environmental trials (Hu et al. 2021). Guo et al. (2016) successfully combined transcriptomic and metabolomic data with genomic markers from diverse maize inbred lines to increase predictability in GS. Studies of multi-omics GS in barley, on the other hand, are extremely limited and have only recently started to emerge (Wu et al. 2022).

Almost all previous studies focused on diversity panels to validate the capabilities of their GS model (Hu et al. 2021; Michel et al. 2021; Westhues et al. 2019; Schrag et al. 2018; Westhues et al. 2017; Guo et al. 2016). However, in plant breeding programs, half-sib or full-sib families are typically used, from which the most appropriate progenies are selected. To the best of our knowledge, no earlier study has evaluated the potential of multi-omic GS models in this context.

In this study, we explore the capabilities of low-cost RNA-Seq data to perform genomic prediction (GP) on three connected spring barley RIL populations with 237 individual lines for eight agriculturally important traits, each measured in up to seven environments. Based on this dataset, we evaluate the potential for increasing the performance of the GP model by combining genomic and transcriptomic data, functional annotation-based filtering, and empirical quality filtering. Lastly, we examine multiple optimization parameters that could lead to cost and time savings without sacrificing prediction ability.

Materials and methods

Genetic material

The HvDRR population was developed from pairwise crosses among 23 diverse parental inbreds (Weisweiler et al. 2019) using the double round-robin (DRR) mating design (Stich 2009). Our study was based on 237 recombinant inbred lines (RIL) from three HvDRR sub-populations (Casale et al. 2022) that were derived from pairwise crosses among parental inbreds Spratt-Archer, HOR8160, and Unumli-Arpa (Fig. 1). In the following, the three sub-populations are referred to as HvDRR13 (65 RILs), HvDRR27 (92 RILs), and HvDRR28 (80 RILs).

Fig. 1 — Genetic material and workflow overview. The crossing scheme shows the three homozygous parental inbreds that were used to create the recombinant inbred line populations HvDRR13, HvDRR27, and HvDRR28 (F7/F8). The workflow is shown by connecting the major steps in consecutive order

Plant cultivation for RNA extraction

All RILs were cultivated in a randomized augmented incomplete block design. A block consisted of 24 samples, including 21 RILs and all three parents as controls (Fig. 1). The cultivation workflow was identical to that described previously (Arlt et al. 2023). In summary, for each RIL, 15 seedlings were cultivated in vertically stacked square Petri dishes for seven days in a reach-in growth chamber under the following conditions: 70% relative humidity, 16 h of light (6:00–22:00), 22 °C (day) / 20 °C (night), and a light intensity of about 400 $\mu\mathrm{mol}\,\mathrm{m}^{-2}\,\mathrm{s}^{-1}$ (Fig. S1). The times of day for planting and harvesting were similar (within two hours) for all samples. All plants in the same block were processed simultaneously.

RNA extraction

The seedlings were harvested as a whole, immediately frozen, and ground. From 50 mg of plant material, total RNA was extracted using TRIzol reagent (Thermo Fisher, USA). The manufacturer’s protocol was adapted as described below to fit a 96-well format and to use less reagent (Arlt et al. 2023). The input plant material and all the reagents for extraction were halved. The final washing step in 75% ethanol was repeated one additional time to ensure that any remaining phenol was removed. All other steps were performed as proposed by the manufacturer. The total RNA concentration was quantified using a NanoPhotometer NP 80 from Implen (Germany). A total of 33 extractions were randomly selected for evaluation using the Fragment Analyzer (Agilent, USA).

Library preparation

The mRNA was selected based on a poly-A tail mRNA capture method (Vazyme, China) using $1\ \mu g$ of total RNA as input. The full-length mRNA library was constructed using the VAHTS Universal V6 RNA-seq Library Prep Kit for Illumina from Vazyme (China). We miniaturized the procedure by reducing the volume of reagents to 25% of the original amount (Arlt et al. 2023). Size selection and cleanup were performed using magnetic DNA Clean Beads (Vazyme, China). Apart from the miniaturization, the manufacturer’s protocol was followed, aiming for 250–450 bp long inserts.

Sequencing and read processing

Sequencing was performed by BGI on the DNBSEQ-G400 platform. All 284 samples were pooled, and a total of 3.70 billion 150-bp paired-end reads were sequenced with an average of 13.0 million read pairs per sample (Arlt and Stich 2022). Various quality statistics of the raw sequencing reads were evaluated using FastQC (Andrews 2019). The reads were afterward trimmed with Trimmomatic (ILLUMINACLIP:TruSeq3-PE:2:30:10:1:TRUE SLIDINGWINDOW:4:15 LEADING:3 TRAILING:3 MINLEN:36) (Bolger et al. 2014). The trimmed reads were then aligned to the Morex V3 reference sequence (Mascher 2019) using Hisat2 (--no-softclip --max-seeds 1000) (Kim et al. 2019).

Sequence variant calling

Variant calling was performed using the following three datasets: RNA sequencing, SNP array, and whole-genome sequencing (WGS) data. (1) For all RNA-Seq datasets, variant calling was performed using the bcftools mpileup (filter: -q 20 -Q 20) and call function (Li et al. 2009). All duplicated genotypes were united using the major allele as consensus. All variants with missing parental genotype information were excluded. Afterward, the data were cleaned by setting all heterozygous alleles to NA, due to the near complete homozygosity of all inbred lines used in this study. Additionally, all RIL alleles inconsistent with parental alleles were set to NA. The missing data were median imputed for each RIL population independently. In the following, the resulting dataset is referred to as $S N P_{RNAseq}^{Total}$ . To test the impact of quality filtering on a less strictly cleaned dataset, $S N P_{RNAseq}^{Raw}$ was created, which did not set heterozygous / inconsistent allele calls to NA and did not remove variants with missing parental data. (2) Already existing SNP array data for the same RILs were included in this study ( $S N P_{Array}^{Total}$ ). The data were generated using the Illumina 50K iSelect SNP array for barley (Bayer et al. 2017). The SNP array dataset was filtered as described by Casale et al. (2022). (3) Additionally, all available RNA-Seq SNPs were intersected with WGS SNPs from the parental inbreds (Weisweiler et al. 2022), selecting only the variants that were present in both datasets. Next, the missing SNP data were imputed using parental WGS SNP data with Beagle (Browning et al. 2021), creating the consensus subset $S N P_{WGS}^{Total}$ . For all datasets described above, genetic differentiation $G_{st}$ was calculated according to Nei (1973).
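The cleaning and imputation steps described for the RNA-Seq SNP calls can be illustrated with a minimal sketch. It assumes genotypes are coded as alternative-allele counts (0, 1, 2) with `None` for missing data, and it handles one marker within one RIL population at a time. The function names `clean_marker` and `median_impute` are hypothetical and are not taken from the authors' pipeline.

```python
from statistics import median

def clean_marker(ril_calls, parent_calls):
    """Clean the calls of one marker within one RIL population.

    ril_calls: list of 0, 1, 2 or None (alt-allele count; None = missing).
    parent_calls: genotypes of the two parents of this population.
    Returns None when a parental genotype is missing (marker excluded).
    """
    if any(p is None for p in parent_calls):
        return None
    allowed = set(parent_calls)
    cleaned = []
    for g in ril_calls:
        # Heterozygous calls (1) and alleles not seen in either parent
        # are treated as missing, mirroring the description in the text.
        if g is None or g == 1 or g not in allowed:
            cleaned.append(None)
        else:
            cleaned.append(g)
    return cleaned

def median_impute(calls):
    """Replace missing calls by the median of the observed calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return list(calls)
    fill = median(observed)
    return [fill if g is None else g for g in calls]

calls = clean_marker([0, 1, 2, None, 2], parent_calls=[0, 2])
imputed = median_impute(calls)
```

Note that the median of an even number of observed calls can fall between the two homozygous classes; how such ties are resolved is not stated in the text and is left open here.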

RNA-Seq SNP data: functional annotation-based filtering

The annotation-based functional prediction of SNPs in the dataset $S N P_{RNAseq}^{Total}$ was performed using SnpSift and SnpEff (Cingolani et al. 2012a, b). We used the functional information of the SNPs to filter the total dataset and create two new subsets. The first functional subset contained only SNPs within the 5’UTR and 3’UTR within a 5 kb distance to the coding region (SnpSift.jar filter “(ANN[*].EFFECT has 'upstream_gene_variant') || (ANN[*].EFFECT has 'downstream_gene_variant')”) and is hereafter referred to as $S N P_{RNAseq}^{R e g .}$ . The second functional subset excluded all synonymous SNPs, only selecting missense variant SNPs in the coding region (SnpSift.jar filter “ANN[*].EFFECT has 'missense_variant'”) and is referred to in the following as $S N P_{RNAseq}^{CDS}$ . Missing data were median imputed and all monomorphic markers were excluded.

RNA-Seq SNP data: empirical quality filtering

The following quality filtering parameters were used to filter $S N P_{RNAseq}^{Total}$ : read depth (DP), minor allele frequency (MAF), quality score (QUAL), and number of samples without data (NS). The filtering was based on the information from the vcf file. After determining the minimum and maximum values of $S N P_{RNAseq}^{Total}$ for the four criteria, the full spectrum was divided into 21 segments based on relative filtering strength (from 0 to 100% in 5% steps). For each step, the results of the genomic prediction cross-validation were analyzed to determine the relative filter strength that performs best for each criterion.
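As an illustration of the 21 filtering levels, the sketch below spaces the thresholds linearly between the observed minimum and maximum of a criterion. The linear spacing, the "keep if at least the threshold" direction, and the function names are assumptions made for this example. The text does not specify how relative filtering strength maps to absolute thresholds, and for a missing-data criterion the comparison would be reversed.

```python
def filter_thresholds(min_value, max_value, step_percent=5):
    """Thresholds for relative filter strengths 0%, 5%, ..., 100%."""
    n_steps = 100 // step_percent
    span = max_value - min_value
    return [min_value + span * i / n_steps for i in range(n_steps + 1)]

def keep_variants(values, threshold):
    """Indices of variants whose metric is at least the threshold."""
    return [i for i, v in enumerate(values) if v >= threshold]

# Toy quality scores for four variants (illustrative values only).
qual = [10, 50, 100, 200]
thresholds = filter_thresholds(min(qual), max(qual))
subsets = [keep_variants(qual, t) for t in thresholds]
```

Each entry of `subsets` would then be passed through the same cross-validation to find the best-performing filter strength per criterion.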

In a last step, the subsets with the best filtering performances were combined, creating inclusive marker intersections that were evaluated for their prediction ability. The best-performing combination was selected, and the dataset was cleaned and imputed analogously to the procedure of $S N P_{RNAseq}^{Total}$ . The resulting quality-filtered RNA-Seq SNP dataset is referred to in the following as $S N P_{RNAseq}^{QC}$ . The results were compared to $S N P_{RNAseq}^{S t d .}$ , which was created using a standard SNP filtering procedure (missing rate < 20%, MAF > 0.05) (Atanda et al. 2022; Wen et al. 2018). The prediction abilities of $S N P_{RNAseq}^{S t d .}$ and $S N P_{RNAseq}^{QC}$ were compared using pairwise t-tests to determine significant differences.

Read count calculation

The expression per transcript was determined for all samples using htseq-count (--mode union) (Anders et al. 2015). All transcripts in which at least 2% of the samples had nonzero read counts were included in the total expression dataset. The original read counts ( $G E_{RNAseq}^{Total}$ ) were filtered using a counts per million (cpm) threshold that maximizes the number of differentially expressed genes (DEG), resulting in the filtered expression dataset ( $G E_{RNAseq}^{F i l t e r .}$ ). The method used is similar to the DESeq2 filtering approach (Love et al. 2014). DEG were identified by a likelihood ratio test (LRT) with the “glmLRT” function in edgeR. The LRT was used to test the goodness of fit between two models. The full model included the genotype of all samples as the only fixed effect. In the reduced model, the genotype effect was removed, so that the model represents the general read count mean. The dispersion was estimated by edgeR based on the replicated parental inbreds. We created an additional dataset that included only the DEG ( $G E_{RNAseq}^{DEG}$ ). For all gene expression datasets mentioned above, the un-normalized cpm counts are used unless otherwise stated.
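The cpm-based filtering can be sketched as follows. This is a simplified illustration in plain Python, not the edgeR implementation: cpm is computed as the raw count scaled by the library size of each sample, and transcripts are kept when their average cpm exceeds a threshold. The function names and the toy counts are hypothetical.

```python
def counts_per_million(counts):
    """counts: dict mapping transcript -> list of raw counts per sample."""
    n_samples = len(next(iter(counts.values())))
    # Library size = total number of counted reads per sample.
    lib_sizes = [sum(c[s] for c in counts.values()) for s in range(n_samples)]
    return {
        t: [1e6 * c[s] / lib_sizes[s] for s in range(n_samples)]
        for t, c in counts.items()
    }

def filter_by_mean_cpm(cpm, threshold):
    """Keep transcripts whose average cpm across samples exceeds threshold."""
    return {t: v for t, v in cpm.items() if sum(v) / len(v) > threshold}

# Two samples with a library size of one million reads each (toy data).
raw = {"tx_high": [999_999, 999_999], "tx_low": [1, 1]}
cpm = counts_per_million(raw)
kept = filter_by_mean_cpm(cpm, threshold=2.0)
```

In the study itself the threshold was chosen to maximize the number of detected DEG; the sketch only shows the filtering step for a given threshold and assumes every sample has a nonzero library size.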

For normalization testing, the Trimmed Mean of the M-values (TMM) method was used to apply a between-sample normalization using the R package edgeR (Robinson and Oshlack 2010). Additionally, a per transcript normalization was applied to correct for any block effects on the expression levels. We used the following linear mixed model for each transcript: $y_{ijk} = μ + G_{i} + B_{j} + ε_{ijk}$ , with the genotype effect ( $G_{i}$ , fixed), block effect ( $B_{j}$ , random), and an error term ( $ε_{ijk}$ ). Afterward, the adjusted entry means were calculated using the emmeans package (Searle et al. 1980). This normalization method is referred to in the following as EMM normalization. The GP performance was evaluated using only EMM normalization, only TMM normalization, both normalization methods together (EMM first, TMM first), and with un-normalized read counts.

Phenotypic datasets

In this study, eight phenotypic traits were considered (Table 1). All phenotypic information was collected in field experiments using an augmented row-column design in which the RILs were planted with a single replicate. The parental inbreds were used as checks with multiple replicates. The awn and spike length data were collected between 2019 and 2021 in five different environments. Flowering time and plant height were part of field experiments from 2017 to 2019 in seven environments (Cosenza et al. 2024). Grain length, grain width, grain area, and thousand grain weight were measured between 2017 and 2019 in four different environments (Shrestha et al. 2022). The adjusted entry means for the phenotypic values were calculated using the following mixed linear model: $y_{ijk} = μ + G_{i} + E_{j} + {(G \times E)}_{ij} + ε_{ijk}$ , with the genotype effect ( $G_{i}$ , fixed), environment effect ( $E_{j}$ , random), and the genotype-environment interaction effect ( ${(G \times E)}_{ij}$ , random), as well as an error term ( $ε_{ijk}$ ). The heritability for all phenotypic traits was calculated as $H^{2} = σ_{G}^{2} / (σ_{G}^{2} + \bar{ν} / 2)$ , where $\bar{ν}$ was the mean variance of difference between two adjusted entry means (Piepho and Möhring 2007).
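The heritability formula can be made concrete with a short worked example. The variance values below are invented for illustration only and do not come from the study; the function name is hypothetical.

```python
def broad_sense_heritability(var_g, mean_var_diff):
    """H^2 = var_g / (var_g + mean_var_diff / 2).

    var_g: genotypic variance.
    mean_var_diff: mean variance of a difference between two
                   adjusted entry means.
    """
    return var_g / (var_g + mean_var_diff / 2)

# Invented values: genotypic variance 4.0, mean variance of a difference 2.0.
h2 = broad_sense_heritability(4.0, 2.0)  # 4 / (4 + 1)
```

A smaller mean variance of a difference, for example from more environments or more precise trials, moves the estimate toward 1.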

Table 1.

Overview of all studied phenotypic traits

Traits	Environments	$H^{2}$	Min	Max	Median	Mean	SD
Ear/spike length [cm]	5	0.88	6.04	13.64	8.96	8.92	1.28
Awn length [cm]	5	0.74	10.07	17.60	13.45	13.52	1.39
Flowering time [d]	7	0.86	55.73	87.74	70.36	69.62	6.10
Plant height [cm]	5	0.76	38.90	99.28	69.60	69.67	9.34
Grain length [mm]	4	0.84	7.35	13.92	10.69	10.84	1.02
Grain width [mm]	4	0.85	2.80	3.76	3.41	3.40	0.18
Grain area [ ${mm}^{2}$ ]	4	0.89	16.48	30.83	25.59	25.28	2.66
Thousand grain weight [g]	4	0.84	25.10	55.09	37.76	37.96	5.73

Number of environments in which the trait was assessed, broad sense heritability ( $H^{2}$ ), minimum (Min.), maximum (Max.), median, mean, and standard deviation (SD) of adjusted entry means of all traits included in this study

Genomic prediction

We used all eleven datasets described above (Table 2) to predict the adjusted entry means for all eight traits. The values of all expression and sequence variant datasets were converted into z-scores ( $μ = 0,$ $σ = 1$ ) for error normalization. Afterward, additive relationship matrices were calculated for each of the datasets: $G = \frac{W^{*} W^{*^{T}}}{m}$ , where $W^{*}$ was the z-score matrix of the feature measurements W, $W^{*^{T}}$ was the transposed z-score matrix and m the number of features per dataset. We used the following genomic best linear unbiased prediction (GBLUP) model (VanRaden 2008): $y = μ + Z u + ε$ , where y was the vector of adjusted entry means of the examined trait, $μ$ the general mean, Z the incidence matrix of genotypic effects, and u the vector of genotypic effects that are assumed to be normally distributed with $u \sim N (0, G σ_{u}^{2})$ , in which G denotes the relationship matrix between inbreds and $σ_{u}^{2}$ the genetic variance. In addition, $ε$ is the vector of residuals following a normal distribution $ε \sim N (0, I σ_{e}^{2})$ . This approach was used across all three segregating populations. Multiple predictor datasets were included in the model by calculating the weighted average between the relationship matrices G (Wu et al. 2022). The weight of each matrix was equal, unless otherwise stated. We used all available samples and randomly separated training and validation populations using a fivefold cross-validation scheme with 50 repetitions.
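A minimal NumPy sketch of the relationship matrix and a GBLUP-style prediction is given below. It makes several simplifying assumptions that are not taken from the paper: the variance ratio `lam` ($σ_{e}^{2} / σ_{u}^{2}$) is supplied as a fixed value instead of being estimated (for example by REML), the general mean is approximated by the training-set mean, constant features are dropped before computing z-scores, and prediction ability is taken to be the Pearson correlation between predicted and observed values. All function names are hypothetical.

```python
import numpy as np

def relationship_matrix(W):
    """G = W* W*^T / m, with W* the column-wise z-scores of W."""
    W = np.asarray(W, dtype=float)
    sd = W.std(axis=0)
    keep = sd > 0  # constant features cannot be z-scored
    Z = (W[:, keep] - W[:, keep].mean(axis=0)) / sd[keep]
    return Z @ Z.T / Z.shape[1]

def gblup_predict(G, y, train, test, lam=1.0):
    """Predict test lines; lam = sigma_e^2 / sigma_u^2 (assumed known)."""
    y = np.asarray(y, dtype=float)
    mu = y[train].mean()  # simplification of the general mean
    G_tt = G[np.ix_(train, train)]
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(train)), y[train] - mu)
    return mu + G[np.ix_(test, train)] @ alpha

def cross_validated_ability(G, y, n_folds=5, seed=0):
    """Mean correlation between predicted and observed values over folds."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    abilities = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = gblup_predict(G, y, train, test)
        abilities.append(np.corrcoef(pred, y[test])[0, 1])
    return float(np.mean(abilities))
```

Repeating `cross_validated_ability` with different seeds would correspond to the repeated cross-validation described in the text.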

Table 2.

Overview of all predictor datasets

Name	Abbreviation	Origin	No. of features
Total transcript expression	$G E_{RNAseq}^{Total}$	RNA-Seq	42.6K
Filtered transcript expression	$G E_{RNAseq}^{F i l t e r .}$	RNA-Seq	37.7K
Differentially expressed transcript expression	$G E_{RNAseq}^{DEG}$	RNA-Seq	7.3K
Unfiltered transcriptome sequence variants	$S N P_{RNAseq}^{Total}$	RNA-Seq	147.5K
Functional transcriptome sequence regulatory variants	$S N P_{RNAseq}^{R e g .}$	RNA-Seq	81.7K
Functional transcriptome sequence non-synonymous variants	$S N P_{RNAseq}^{CDS}$	RNA-Seq	25.8K
Standard filtered transcriptome sequence variants	$S N P_{RNAseq}^{S t d .}$	RNA-Seq	52.2K
Quality-filtered transcriptome sequence variants	$S N P_{RNAseq}^{QC}$	RNA-Seq	42.2K
RNA-Seq / WGS consensus sequence variants	$S N P_{WGS}^{Total}$	Hybrid	426.4K
RNA-Seq / WGS consensus sequence non-synonymous variants	$S N P_{WGS}^{CDS}$	Hybrid	52.4K
RNA-Seq / WGS consensus sequence regulatory variants	$S N P_{WGS}^{R e g .}$	Hybrid	245.6K
50k SNP array data	$S N P_{Array}^{Total}$	Array	17.3K

Name, abbreviation, origin, and number of features (after filtering) for all genomic and transcriptomic datasets included in this study

Genomic prediction: reduced sequencing depth

To evaluate the impact of reduced sequencing depth on GP performance, we tested 13 sequencing depth subsets. The number of reads per sequencing depth subset ranged from 10K to 7M. The reads to include were randomly selected from all uniquely mapped reads of the alignment data. Samples with fewer uniquely mapped reads than required were excluded. To ensure comparability, the same samples were included across all datasets for which GP was performed. All other steps of the data processing workflow remained unchanged. We tested the impact on the RNA-Seq sequence variant datasets for three different filtering methods, creating the reduced sequencing depth equivalents of $S N P_{RNAseq}^{Total}$ , $S N P_{RNAseq}^{S t d .}$ , and $S N P_{RNAseq}^{QC}$ for each of the 13 sequencing depths. Similarly, the expression datasets were evaluated including all transcripts ( $G E_{RNAseq}^{Total}$ ), filtered transcripts ( $G E_{RNAseq}^{F i l t e r .}$ ), and DEG ( $G E_{RNAseq}^{DEG}$ ).
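The read subsampling logic can be sketched in a few lines. The example assumes that the uniquely mapped reads of each sample are available as a list of read identifiers; in practice this step would operate on alignment files with a dedicated tool. The function name and the toy data are hypothetical.

```python
import random

def subsample_reads(read_ids_by_sample, depth, seed=0):
    """Draw `depth` reads per sample without replacement.

    Samples with fewer uniquely mapped reads than `depth` are excluded.
    """
    rng = random.Random(seed)
    subsets = {}
    for sample, read_ids in read_ids_by_sample.items():
        if len(read_ids) < depth:
            continue
        subsets[sample] = rng.sample(read_ids, depth)
    return subsets

# Toy data: sample "a" has 100 uniquely mapped reads, sample "b" only 10.
reads = {"a": list(range(100)), "b": list(range(10))}
subset_50 = subsample_reads(reads, depth=50)
```

To keep the same samples across all depths, as described above, the set of samples retained at the largest depth could be used to restrict every smaller subset.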

Genomic prediction: intra- and inter-population analysis

In addition to the prediction across the three RIL populations (cross-population GP), various intra- and inter-population analyses were performed. For the intra- and inter-population genomic prediction, fixed-size randomized samples were used as training and validation sets with 200 repetitions. The training set size varied between 20 and 170 (step size: 15) for the cross-population GP and was fixed at 50 for the intra- and inter-population GP. The validation set size was fixed at 65 for the cross- and inter-population GP and at 15 for the intra-population GP. The relationship matrices and GP model were identical to those of the fivefold cross-validation described above.

For the intra-population GP, the training and validation sets were drawn from the same RIL population, and the number of individuals in each set was the same for all RIL populations. For the inter-population GP, one of the three RIL populations was used as the training set and a second as the validation set, with the number of individuals in both sets kept constant. Intra- and inter-population results were compared to the GP performance using cross-population training and validation sets.
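The sampling scheme for the intra- and inter-population analyses can be sketched as below, using the set sizes given in the text (training: 50; validation: 15 within a population, 65 between populations). The line identifiers and the function name are placeholders, and the sketch covers only the drawing of the sets, not the prediction itself.

```python
import random

def draw_sets(lines_by_pop, train_pop, valid_pop, n_train, n_valid, rng):
    """Draw disjoint training and validation sets of fixed size."""
    if train_pop == valid_pop:
        # Intra-population: both sets come from one population.
        drawn = rng.sample(lines_by_pop[train_pop], n_train + n_valid)
        return drawn[:n_train], drawn[n_train:]
    # Inter-population: the two sets come from different populations.
    train = rng.sample(lines_by_pop[train_pop], n_train)
    valid = rng.sample(lines_by_pop[valid_pop], n_valid)
    return train, valid

# Placeholder line names using the population sizes reported above.
lines_by_pop = {
    "HvDRR13": [f"HvDRR13_{i}" for i in range(65)],
    "HvDRR27": [f"HvDRR27_{i}" for i in range(92)],
    "HvDRR28": [f"HvDRR28_{i}" for i in range(80)],
}
rng = random.Random(0)
intra_train, intra_valid = draw_sets(lines_by_pop, "HvDRR13", "HvDRR13", 50, 15, rng)
inter_train, inter_valid = draw_sets(lines_by_pop, "HvDRR27", "HvDRR13", 50, 65, rng)
```

Repeating the draw 200 times with fresh random samples would match the number of repetitions stated in the text.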

Results

Characterization of gene expression data and their GP performance

RNA-Seq data were used to characterize the expression pattern of 237 RILs from the HvDRR13, HvDRR27, and HvDRR28 populations and their parental inbreds Spratt-Archer, HOR8160, and Unumli-Arpa (Fig. 1). The number of detected transcripts per sample ranged from 22.1K to 30.9K (mean 28K) and did not show significant differences (p > 0.05) between the genetic material groups (Fig. 2A). Different segregating populations were clearly separated in the PCA based on the un-normalized cpm of the DEG (Fig. 2B). The RILs within each population were arranged according to their phenotypic values (Fig. S2).

Fig. 2 — Characteristics and genomic prediction (GP) performance of the gene expression datasets. A The number of detected transcripts is grouped by genetic material with mean (black square) and standard deviation (black line). Populations marked by the same letter are not significantly ( $α = 0.05$ ) different from each other. B Principal component analysis using the differentially expressed genes (DEG) subset ( $G E_{RNAseq}^{DEG}$ ). PC 1 and PC 2 are the first and second principal component, respectively, and the number in parentheses refers to the proportion of variance explained by the principal components in percent. C Number of transcripts included in all data subsets used for genomic prediction (GP). D GP results are compared between all gene expression subsets and all traits included in this study. The red line shows the mean prediction ability per subset

The total gene expression dataset ( $G E_{RNAseq}^{Total}$ ), which included 42.6K transcripts, was divided into two subsets: (1) transcripts with an average cpm > 0.39 ( $G E_{RNAseq}^{F i l t e r .}$ ; 37.7K transcripts) and (2) a subset only including DEG ( $G E_{RNAseq}^{DEG}$ ; 7.3K transcripts, Fig. 2C). We determined the cpm threshold for the $G E_{RNAseq}^{F i l t e r .}$ dataset based on the maximum number of significant DEG detected (Fig. S3). This resulted in a cpm threshold lower than the standard edgeR threshold of about 1.7. The standard filtering approach of edgeR resulted in < 6K DEG.

We measured gene expression per transcript as well as per gene and detected a slight but significant increase (p = 0.009) in prediction ability for the former in $G E_{RNAseq}^{Total}$ (Fig. S4A). We also compared different normalization methods. For the un-normalized cpm and the TMM normalized cpm, the GP performance did not differ significantly (p > 0.05), while the EMM normalized GP results were significantly lower ( $p = 0.003$ ) compared to the prediction ability of the un-normalized results (Fig. S4B–C).

The prediction ability of the $G E_{RNAseq}^{Total}$ set was 0.68 on average across all traits, the lowest of all gene expression subsets. The GP performance of $G E_{RNAseq}^{F i l t e r .}$ was slightly higher (0.69), but both were outperformed by $G E_{RNAseq}^{DEG}$ which had an average prediction ability of 0.73 (Fig. 2D). The prediction ability differed among the traits. Flowering time was the most difficult trait to predict in all subsets, resulting in an average prediction ability between 0.59 and 0.67. Ear length was the trait most accurately predicted, with an average prediction ability between 0.78 and 0.82.

Prediction ability of sequence variant datasets

In addition to the expression data, sequence variant calling was performed for the RNA-Seq dataset comprising 148K variants after cleaning and imputation ( $S N P_{RNAseq}^{Total}$ ). The SNP data were further subdivided based on variant function annotation, creating a set that only included variants in the regulatory regions of genes ( $S N P_{RNAseq}^{R e g .}$ ) and a set that only included missense variants in gene coding sequences ( $S N P_{RNAseq}^{CDS}$ ). The latter led to an increase in prediction ability of 0.013 compared to $S N P_{RNAseq}^{Total}$ (Fig. 3). A comparable increase in prediction ability was achieved by filtering the RNA-Seq variants using quality criteria.

Fig. 3 — Prediction ability of four dataset groups: gene expression data, RNA-Seq SNP data, WGS consensus SNP data, and SNP array data across all eight examined traits. For each dataset, the mean prediction ability is shown as a number at the bottom

In this study, two quality-filtered subsets were evaluated: first, our own empirical quality filtering subset, hereafter referred to as $S N P_{RNAseq}^{QC}$ , and second, the subset $S N P_{RNAseq}^{S t d .}$ , which is based on previously published filtering criteria (Atanda et al. 2022; Wen et al. 2018). We used the following quality criteria for the empirical quality filtering workflow: read depth (DP), minor allele frequency (MAF), missing data (NS), and quality score (QUAL). More than 20 quality filtering subsets of $S N P_{RNAseq}^{Total}$ , ranging from no filtering to maximum filtering strength, were created, and their GP performance was evaluated (Fig. S5). For each of the quality filtering criteria, the best-performing $S N P_{RNAseq}^{Total}$ subset was selected and combined with one or more of the remaining subsets (Fig. 4A). Most combinations showed a slight increase in prediction ability, and the prediction ability of the best-performing combination ( $S N P_{RNAseq}^{QC}$ ) was 0.002 higher than that of the best-performing single-criterion subset of the combination (not significant: p > 0.05). $S N P_{RNAseq}^{QC}$ included 42.2K variants, utilized MAF and QUAL as quality filter criteria, and performed considerably better than $S N P_{RNAseq}^{Total}$ with a 0.015 prediction ability increase. A slightly smaller increase was achieved by $S N P_{RNAseq}^{S t d .}$ . However, when comparing the GP results of $S N P_{RNAseq}^{S t d .}$ and $S N P_{RNAseq}^{QC}$ for each of the traits, $S N P_{RNAseq}^{QC}$ performed significantly better ( $p \leq 0.025$ ) for half of them (Fig. 4B). A PCA was performed using the $S N P_{RNAseq}^{QC}$ subset, and a separation between the three parental inbreds and RIL populations was observed as expected (Fig. S6A). Each of the eight traits showed clear differences among the parental inbreds that caused phenotypic variance in the corresponding segregating populations (Fig. S6B–I).

Fig. 4 — A Combination of quality filtering subsets of genomic variants and their relative genomic prediction (GP) performance. For each of the combinations, the number of remaining variants (top) is shown after selecting the inclusive intersect between all included filtering subsets (middle). The four quality filtering criteria are minor allele frequency (MAF), variant quality score (QUAL), missing rate (NS), and read depth (DP). The subsets are the best-performing subsets that are marked in Fig. 4 with “T.” The relative GP performance (bottom) shows the difference in prediction ability in percent between the combined subsets and the top performing single quality filtered subset (DP). The best-performing subset is marked (red). B GP performance comparison between a standard quality filtering subset (white) (NA < 20%, MAF > 0.05) and the best-performing combination (red). The GP results were averaged across all eight traits. Significant differences: p-value < 0.01 = **, p-value < 0.001 = ***

Lastly, WGS data from the parental inbreds were used to create an optimized WGS imputed RNA-Seq consensus variant dataset ( $S N P_{WGS}^{Total}$ ) achieving a prediction ability of 0.776. This constitutes a 0.033 increase compared to $S N P_{RNAseq}^{QC}$ and 0.009 compared to the SNP array dataset. $S N P_{WGS}^{Total}$ showed the highest prediction ability averaged across all traits. Based on $S N P_{WGS}^{Total}$ , two function-based subsets ( $S N P_{WGS}^{R e g .}$ and $S N P_{WGS}^{CDS}$ ) were created, but none of them had a positive impact on the GP results.

GP performance of combined datasets

We selected the best-performing datasets from these four main groups: gene expression ( $G E_{RNAseq}^{DEG}$ ), RNA-Seq SNP data ( $S N P_{RNAseq}^{QC}$ ), RNA-Seq/WGS consensus SNP data ( $S N P_{WGS}^{Total}$ ), and array SNP data ( $S N P_{Array}^{Total}$ ). We tested the combined GP performance by merging two or more of these datasets (Fig. 5). For that, the additive relationship matrices calculated from each of the individual datasets were averaged, and the resulting matrix was used for GP. The combination of two datasets resulted in an average increase in prediction ability of 0.01. Three of the six combined matrices were able to outperform the best dataset included in the combination (Fig. 5A). The average performance of all triple dataset combinations (0.77) was increased by 0.02 compared to the single datasets and by 0.01 compared to the double dataset combinations. Combining all four datasets resulted in an average prediction ability of 0.78, that is, 0.02 higher compared to the average single dataset GP and slightly better than the triple dataset combination average. However, none of the combinations showed a significantly higher prediction ability (p > 0.05) than the best individual dataset included in the combination.
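The averaging step can be sketched as follows. This is a minimal illustration under stated assumptions: the two toy matrices and their names are hypothetical, and the sketch presumes that all matrices were computed for the same individuals in the same order. How the individual relationship matrices are calculated is not covered here.

```python
def average_matrices(matrices):
    """Element-wise mean of equally sized square matrices (lists of lists)."""
    n = len(matrices[0])
    for m in matrices:
        if len(m) != n or any(len(row) != n for row in m):
            raise ValueError("all matrices must be square and of equal size")
    k = len(matrices)
    return [[sum(m[i][j] for m in matrices) / k for j in range(n)]
            for i in range(n)]


# Toy relationship matrices for three individuals (hypothetical values).
k_snp = [[1.0, 0.5, 0.0],
         [0.5, 1.0, 0.25],
         [0.0, 0.25, 1.0]]
k_ge = [[1.0, 0.0, 0.5],
        [0.0, 1.0, 0.75],
        [0.5, 0.75, 1.0]]

# Combined matrix that would be passed on to the prediction model.
k_combined = average_matrices([k_snp, k_ge])
print(k_combined)
```

Because the mean gives every dataset the same weight, a dataset with lower prediction ability can pull the combined matrix away from the best single predictor.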

Fig. 5 — Overview of genomic prediction abilities using multiple datasets. A The four different single datasets: $S N P_{RNAseq}^{QC}$ , $S N P_{WGS}^{Total}$ , $G E_{RNAseq}^{DEG}$ , and $S N P_{Array}^{Total}$ were combined and prediction ability was calculated (top). The combinations are indicated as a combination of white (not included) and black (included) dots (bottom). The red line shows the mean prediction ability within each combination group. B The number of significantly (p < 0.05) increased (blue), decreased (red), and not significantly changed (gray) phenotypic traits compared to the value of the single best dataset for each combination. C The mean prediction ability is plotted for the weighted combination of $S N P_{WGS}^{Total}$ and $G E_{RNAseq}^{DEG}$ (2.4). The weight is gradually changed from 100% $G E_{RNAseq}^{DEG}$ (left) to 100% $S N P_{WGS}^{Total}$ (right) with a step size of 10%. The results are shown for each weighted combination and all traits

Examining each trait separately revealed a significant (p < 0.05) increase in GP performance for at least one of the traits in multiple combined datasets (Fig. 5B). The best-performing combination consisted of $S N P_{RNAseq}^{QC}$ and $S N P_{Array}^{Total}$ , as it significantly increased the prediction ability for three traits ( $p \leq$ 0.016) and only one trait was significantly decreased (p < 0.001). Combining $S N P_{WGS}^{Total}$ and $G E_{RNAseq}^{DEG}$ with equal weight resulted in a significant decrease in prediction ability for all eight traits. We also tested for significant differences compared to $S N P_{Array}^{Total}$ (Table S1). $S N P_{WGS}^{Total}$ significantly outperformed $S N P_{Array}^{Total}$ in five of eight traits. Combining $S N P_{WGS}^{Total}$ , $S N P_{RNAseq}^{QC}$ and $G E_{RNAseq}^{DEG}$ significantly increased the prediction ability for half of the traits ( $p \leq$ 0.025), which made it the best-performing combination without $S N P_{Array}^{Total}$ . We further evaluated the combination including $G E_{RNAseq}^{DEG}$ and $S N P_{WGS}^{Total}$ by testing a range of weighted combinations (Fig. 5C). We started with only the $G E_{RNAseq}^{DEG}$ dataset and gradually increased the weight contribution of $S N P_{WGS}^{Total}$ until only the $S N P_{WGS}^{Total}$ data remained. The results of the trait flowering time showed a significant increase (p = 0.007) in GP performance for an unbalanced combination with a 90% weight contribution by $S N P_{WGS}^{Total}$ . For all other traits, the combinations resulted in a lower prediction ability.
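The weighted combinations can be pictured as a convex combination of the two relationship matrices. The sketch below assumes that the weights of both datasets sum to one, which the text implies but does not state explicitly; the toy matrices and function name are hypothetical.

```python
def weighted_combination(k_ge, k_snp, w_snp):
    """Return (1 - w_snp) * k_ge + w_snp * k_snp, element by element."""
    if not 0.0 <= w_snp <= 1.0:
        raise ValueError("w_snp must lie between 0 and 1")
    n = len(k_ge)
    return [[(1.0 - w_snp) * k_ge[i][j] + w_snp * k_snp[i][j]
             for j in range(n)] for i in range(n)]


# Weight grid from 100% gene expression (0.0) to 100% SNP data (1.0), 10% steps.
weights = [step / 10 for step in range(11)]

# Toy 2 x 2 relationship matrices (hypothetical values).
k_ge = [[1.0, 0.2], [0.2, 1.0]]
k_snp = [[1.0, 0.6], [0.6, 1.0]]

# One combined matrix per weight; each would be evaluated separately for GP.
grid = {w: weighted_combination(k_ge, k_snp, w) for w in weights}
print(grid[0.9][0][1])  # off-diagonal element at 90% SNP weight
```

The equal-weight average corresponds to the single grid point with a weight of 0.5.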

We quantified the differences of the same four datasets by calculating Pearson’s correlation coefficient of the additive relationship matrices (Fig. S7). The correlation between $S N P_{RNAseq}^{QC}$ and $S N P_{WGS}^{Total}$ was the highest with a correlation coefficient of 0.98. The correlation of $S N P_{Array}^{Total}$ with $S N P_{WGS}^{Total}$ ( $r = 0.91$ ) and $S N P_{RNAseq}^{QC}$ ( $r = 0.89$ ) showed a slightly lower level of similarity. The gene expression matrix ( $G E_{RNAseq}^{DEG}$ ) was more dissimilar compared to the remaining datasets ( $r = 0.59 - 0.61$ ). We split the additive relationship matrices into segments based on the populations to calculate the correlation for all populations and inter-population combinations separately. We focused on the correlation between $S N P_{Array}^{Total}$ and the remaining three datasets (Fig. S8) and showed for all three matrix combinations that the correlation coefficient was lower for the intra-population covariances than for the inter-population covariances. Additionally, large differences in similarity were shown for intra-population covariances.
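A block-wise comparison of two relationship matrices could be sketched as below. The matrices are random toy data, the population labels are placeholders, and the decision to exclude the diagonal (self-relationships) is an assumption of this sketch rather than a detail taken from the study.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def block_values(k, labels, row_pop, col_pop):
    """Off-diagonal entries of k whose row and column belong to the given populations."""
    n = len(k)
    return [k[i][j] for i in range(n) for j in range(n)
            if i != j and labels[i] == row_pop and labels[j] == col_pop]


def block_correlation(k_a, k_b, labels, row_pop, col_pop):
    """Correlation between two matrices restricted to one population block."""
    return pearson(block_values(k_a, labels, row_pop, col_pop),
                   block_values(k_b, labels, row_pop, col_pop))


# Toy data: six individuals from two placeholder populations.
rng = random.Random(42)
labels = ["pop1"] * 3 + ["pop2"] * 3
n = len(labels)
k_a = [[0.0] * n for _ in range(n)]
k_b = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        value = 1.0 if i == j else rng.uniform(-0.5, 0.5)
        noise = 0.0 if i == j else rng.uniform(-0.1, 0.1)
        k_a[i][j] = k_a[j][i] = value          # first symmetric matrix
        k_b[i][j] = k_b[j][i] = value + noise  # noisy copy of the first

r_intra = block_correlation(k_a, k_b, labels, "pop1", "pop1")
r_inter = block_correlation(k_a, k_b, labels, "pop1", "pop2")
print(round(r_intra, 3), round(r_inter, 3))
```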

The effect of sequencing depth reduction on GP performance

We tested the GP performance of 13 sub-sampled datasets with reduced sequencing depths from 7 M to 10K reads using RNA-Seq SNP and gene expression data including 155 samples with sufficient reads (Fig. S10) and from 3 M to 10K reads including 229 samples (Fig. 6). The GP performance of the three RNA-Seq SNP datasets was significantly (p < 0.001) affected by a reduction in sequencing depth (Fig. 6A). For $S N P_{RNAseq}^{Total}$ , the prediction ability did not decrease significantly (p > 0.05) for any subset that started with more than 400K uniquely mapped reads (200K for $S N P_{RNAseq}^{S t d .}$ and $S N P_{RNAseq}^{QC}$ ). Reducing the sequencing depth below 100K reads had a considerably larger impact on the GP performance. $S N P_{RNAseq}^{Total}$ showed higher prediction ability values than $S N P_{RNAseq}^{S t d .}$ below 50K reads, while the quality filtering dataset $S N P_{RNAseq}^{QC}$ increased the performance in all subsets. For the three gene expression datasets, performance started to decline earlier, with only the 2 M and 3 M subsets showing no significant differences between each other (p > 0.05) (Fig. 6B). $G E_{RNAseq}^{DEG}$ performed worst in the 10K subset and only exceeded the prediction ability values of the other two datasets at sequencing depths over 100K.
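Random read sub-sampling to fixed depths, including the exclusion of samples with too few reads, could look like the following sketch. The sample names, read counts, and target depths are scaled-down placeholders, and the use of one seed per depth is a choice made for this illustration only.

```python
import random


def subsample_reads(read_ids, depth, seed=0):
    """Draw `depth` reads without replacement; return None if too few reads exist."""
    if depth > len(read_ids):
        return None
    return random.Random(seed).sample(read_ids, depth)


# Hypothetical per-sample read counts (toy scale, not the depths of the study).
read_counts = {"s1": 5000, "s2": 500, "s3": 50}
depths = [1000, 100, 10]

subsets = {}
for depth in depths:
    subsets[depth] = {}
    for sample, n_reads in read_counts.items():
        drawn = subsample_reads(range(n_reads), depth, seed=depth)
        if drawn is not None:  # samples with insufficient reads are dropped
            subsets[depth][sample] = drawn

for depth in depths:
    print(depth, sorted(subsets[depth]))
```

The higher the target depth, the fewer samples qualify, which mirrors the difference in sample numbers between the two sub-sampling ranges reported above.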

Fig. 6 — Genomic prediction (GP) performance of artificially reduced sequencing depth subsets created by random read sub-sampling of A RNA-Seq genomic variant datasets ( $S N P_{RNAseq}^{Total}$ , $S N P_{RNAseq}^{QC}$ ) and B gene expression datasets ( $G E_{RNAseq}^{Total}$ , $G E_{RNAseq}^{F i l t e r .}$ , $G E_{RNAseq}^{DEG}$ ) including 229 of the 240 samples. The sub-sampling ranged in 11 steps between 10 thousand to 3 million reads

Comparison of intra- vs. inter-population GP

To compare the GP performance using population-specific combinations of training sets (TS) and validation sets (VS), we had to adjust the validation procedure to account for the variable population sizes. Therefore, we switched from a 5-fold cross-validation scheme to a random subset validation with fixed TS and VS sizes. We established a baseline by performing cross-population GP using samples from multiple populations in TS and VS (Fig. 7A). The prediction ability with a low TS size (20) varied between 0.47 ( $G E_{RNAseq}^{DEG}$ ) and 0.58 ( $S N P_{Array}^{Total}$ ). These results could be improved by 0.25 for $G E_{RNAseq}^{DEG}$ and by 0.18 for $S N P_{Array}^{Total}$ when increasing TS size to 170.
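A random subset validation with fixed TS and VS sizes could be set up as in the sketch below. The sample identifiers, the VS size, and the number of repetitions are hypothetical; the TS sizes follow the range of 20 to 170 individuals in steps of 15 used for Fig. 7A.

```python
import random


def random_subset_split(sample_ids, ts_size, vs_size, rng):
    """Draw disjoint training and validation sets of fixed sizes."""
    if ts_size + vs_size > len(sample_ids):
        raise ValueError("not enough samples for the requested TS and VS sizes")
    drawn = rng.sample(sample_ids, ts_size + vs_size)
    return drawn[:ts_size], drawn[ts_size:]


sample_ids = [f"RIL_{i:03d}" for i in range(240)]  # placeholder identifiers
ts_sizes = list(range(20, 171, 15))                # 20, 35, ..., 170
VS_SIZE = 40     # assumed; not stated in the text
N_REPEATS = 5    # assumed; not stated in the text

rng = random.Random(1)
splits = {
    ts_size: [random_subset_split(sample_ids, ts_size, VS_SIZE, rng)
              for _ in range(N_REPEATS)]
    for ts_size in ts_sizes
}
print(len(ts_sizes), len(splits[170][0][0]), len(splits[170][0][1]))
```

Keeping the VS size constant across all TS sizes means that changes in prediction ability reflect the amount of training data rather than the precision of the validation estimate.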

Fig. 7 — A Genomic prediction (GP) results of the best-performing datasets from each data origin group using randomly selected cross-population training sets. The TS size was gradually increased from 20 to 170 individuals (step size: 15). B GP results for intra-population (diagonal) or inter-population (off diagonal) comparisons. The validation set populations are shown on the x-axis, and the training set populations are shown on the y-axis. The training set sizes were set to 50 for (B)

We then separated the populations and tested intra-population GP performance (TS $=$ VS population) and inter-population GP performance (TS $\neq$ VS population) with a fixed TS size of 50 (Fig. 7B). For most datasets and comparisons, the intra-population GP results were slightly lower than cross-population results, with only $S N P_{RNAseq}^{QC}$ and $S N P_{WGS}^{Total}$ showing slightly higher performance (0.02) for HvDRR28. The prediction abilities of the inter-population GP results were lower compared to the intra-population GP results. GP performance depended on the combination of populations tested and resulted in prediction abilities ranging from 0 to 0.29. The worst-performing combination between HvDRR28 (TS) and HvDRR13 (VS) was best predicted by $G E_{RNAseq}^{DEG}$ .
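The intra- and inter-population design can be enumerated as a grid of TS and VS populations. In this sketch, the population names, population sizes, and VS size are hypothetical; only the TS size of 50 is taken from the text.

```python
import random
from itertools import product


def population_split(pop_of, ts_pop, vs_pop, ts_size, vs_size, rng):
    """Sample a TS from ts_pop and a VS from vs_pop; keep them disjoint if both match."""
    ts_pool = sorted(s for s, p in pop_of.items() if p == ts_pop)
    if ts_pop == vs_pop:
        # Intra-population: one draw, split into two non-overlapping parts.
        drawn = rng.sample(ts_pool, ts_size + vs_size)
        return drawn[:ts_size], drawn[ts_size:]
    # Inter-population: the pools cannot overlap, so sample independently.
    vs_pool = sorted(s for s, p in pop_of.items() if p == vs_pop)
    return rng.sample(ts_pool, ts_size), rng.sample(vs_pool, vs_size)


populations = ["popA", "popB", "popC"]  # placeholder names
pop_of = {f"{p}_{i:02d}": p for p in populations for i in range(80)}
TS_SIZE, VS_SIZE = 50, 20               # VS size is an assumption

rng = random.Random(7)
design = {
    (ts_pop, vs_pop): population_split(pop_of, ts_pop, vs_pop, TS_SIZE, VS_SIZE, rng)
    for ts_pop, vs_pop in product(populations, repeat=2)
}
intra = [pair for pair in design if pair[0] == pair[1]]  # diagonal cells
inter = [pair for pair in design if pair[0] != pair[1]]  # off-diagonal cells
print(len(intra), len(inter))
```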

Although the overall ranking between the datasets did not change between intra- and inter-population GP (Fig. S9), the relative differences between them varied. $S N P_{WGS}^{Total}$ performed 5% better than $S N P_{RNAseq}^{QC}$ in intra-population GP and the difference increased to 25% in inter-population GP. The difference in prediction ability between $S N P_{WGS}^{Total}$ and $S N P_{Array}^{Total}$ increased from < 1% in intra-population GP to almost 3% better performance of $S N P_{WGS}^{Total}$ in inter-population GP. The difference in performance between $S N P_{RNAseq}^{QC}$ and $G E_{RNAseq}^{DEG}$ was largest in intra-population GP (13%) but decreased to 3% in inter-population GP aligning the performance of both datasets.

Discussion

Filtering of RNA-Seq datasets to harness the full GP potential

We started by independently evaluating the GP performance of sequence variants and gene expression datasets derived from RNA-Seq data in a cross-population scenario. For most applications, the accuracy of RNA-Seq gene expression data is increased by the inclusion of replicates to counteract the technical errors introduced by experimental factors (Schurch et al. 2016). However, we expect that RNA-Seq will only find its way into routine plant breeding programs if it can be applied without replications. We therefore examined it without replicates in our study. We normalized the gene expression values per transcript based on a linear model including a block effect to adjust the read counts (EMM) (Fig. S4B). Although in 88% of all transcripts in $G E_{RNAseq}^{DEG}$ the block effect was significant (p < 0.05), adjusted gene expression values negatively affected GP performance and led to a significant decrease in prediction ability ( $p < 0.001$ ). Further research is necessary to learn whether a block normalization can be helpful in normalizing gene expression data in a controlled environmental setting similar to this study.

In addition to the EMM normalization, we tried to account for differences in transcriptome composition or sequencing lane effects by applying the per-sample normalization TMM implemented in edgeR (Robinson and Oshlack 2010; Dillies et al. 2013), but it did not significantly (p > 0.05) change the prediction ability of $G E_{RNAseq}^{DEG}$ (Fig. S4C). We assume that this is because all examined genotypes were partially related, thus limiting differences in transcriptome composition. Furthermore, the library pool was evenly distributed across all lanes.

Although normalization did not lead to an increase in GP performance, we were able to increase the prediction ability by filtering for DEG. The resulting dataset $G E_{RNAseq}^{DEG}$ was the best-performing gene expression dataset across all traits with an average prediction ability of 0.73 (Fig. 2). Filtering also increased GP performance in the sequence variant datasets extracted from the RNA-Seq data (Fig. 3). In the end, the prediction based on sequence variation exceeded the prediction based on gene expression when averaged across all traits. This finding is in agreement with previous results (Azodi et al. 2020). This indicates a limited prediction potential for gene expression data, which could have multiple reasons. Similarly to Azodi et al. (2020), we were using only seedling material for GP. Therefore, most spatiotemporal specific DEGs were not captured in our data (Klepikova et al. 2016), which potentially limits the performance of GP. Conversely, creating trait-specific transcriptome data by focusing on the most relevant time point, developmental stage, and tissue could increase the ability to predict specific traits. However, this is unlikely to be realistic in the context of commercial plant breeding programs. The total number of transcripts detected was limited by the sequencing depth (Conesa et al. 2016). However, the sequencing depth in our study was chosen so that the related costs are comparable to those of genotyping using a SNP array. Ultimately, the prediction ability potential of transcriptome datasets can only be fully exploited by grouping and weighting individual genes, using prior knowledge of the trait and associated gene networks. Similar methods exist but have never been assessed in the context of plant genetics (Zarringhalam et al. 2018). We are convinced that future research can further increase the GP performance of transcriptome data, while the costs of creating such datasets will remain comparable to the costs of an SNP array.

We evaluated the impact of function filtering of the unfiltered RNA-Seq-based variant data ( $S N P_{RNAseq}^{Total}$ ) based on two criteria: SNPs in the 3’UTR and 5’UTR and non-synonymous SNPs in the CDS. In our study, both function filtering criteria achieved comparable GP performance while relying on only a fraction of the total variants, which was also previously observed (Li et al. 2022; Cappetta et al. 2021; Tan et al. 2017). While the 3’UTR and 5’UTR are known to be regulatory regions in which SNPs can be associated with gene expression variation (Dossa et al. 2021; Zhang et al. 2019), in our study filtering missense SNPs in the CDS resulted in a greater increase in GP performance.

Quality filtering using the four quality criteria (read depth, minor allele frequency, missing rate, and quality score) was able to increase prediction ability compared to $S N P_{RNAseq}^{Total}$ . The combination of multiple quality filter criteria further increased the GP results and led to the best-performing RNA-Seq sequence variant dataset $S N P_{RNAseq}^{QC}$ (Fig. 4). Although our empirical quality filtering workflow did not result in a much larger average prediction ability compared to the commonly used filter settings (Atanda et al. 2022; Wen et al. 2018), for half of the traits evaluated in this study, prediction ability was significantly increased ( $p \leq$ 0.025) by empirical quality filtering. We leveraged the information of the homozygosity of the RILs, the population structure, and the parental information to strictly clean and correct the sequence variant data. All of that reduced the impact of quality filtering on the GP performance. However, we observed that without strict cleaning using the aforementioned prior knowledge, the impact of quality filtering was much greater (Fig. S11). We applied the empirical quality filtering workflow to $S N P_{RNAseq}^{Raw}$ , which resulted in a prediction ability of 0.75, which is comparable to that of $S N P_{RNAseq}^{QC}$ . This observation suggests that our empirical quality filtering workflow is also able to provide good prediction abilities in situations with limited prior knowledge and, thus, might be relevant for other studies.

The last approach to improve the prediction ability that we examined was the exploitation of SNPs from parental inbred lines. In our study, the consensus dataset from the RNA-Seq and WGS variants ( $S N P_{WGS}^{Total}$ ) was the overall best-performing dataset for GP. $S N P_{WGS}^{Total}$ significantly outperformed $S N P_{Array}^{Total}$ for five out of eight traits (p < 0.05). Adding WGS data to improve GP performance was previously only examined in combination with SNP array data (Weber et al. 2024; Brøndum et al. 2015). Weber et al. (2024) showed insignificant improvements to the prediction accuracy and concluded that the WGS data were unable to add important information due to the relatedness of the sample groups and linkage disequilibrium between many WGS markers. In contrast, in our study the advantage of including WGS data is apparent and illustrates that for some predictor types, for instance, RNA-Seq and potentially also GBS datasets, the benefit from WGS data is greater than for the already highly curated markers included in a SNP array.

Cost optimization of GP using RNA-Seq

In addition to the filtering steps discussed above to improve the prediction ability, the selection of the optimal TS size is also important. In our study, increasing the TS size from 20 to 80 led to a drastically increased prediction ability (Fig. 7A). Increasing the TS to 170 only slightly increased the prediction ability further. The diminishing returns of TS size depend on the population structure and have been studied extensively before and therefore are not discussed further (Zhu et al. 2022; Bustos-Korts et al. 2016; Isidro et al. 2015; Rincent et al. 2012; Lorenz et al. 2012).

One further aspect of optimizing the balance between prediction ability and cost is sequencing depth. Our resampling simulations illustrated that the sequencing depth of RNA-Seq can be significantly reduced without sacrificing GP performance (Fig. 6). For $S N P_{RNAseq}^{QC}$ , the reduction in GP performance was only marginal until the number of reads fell below 200K. This finding indicated that many sequence variants do not contain additional information, as they are part of the same haplotype block, and therefore reducing the number of reads and thereby sequence variants does not necessarily lead to a decreased trait prediction.

However, when reducing the sequencing depth further, we observed a considerable reduction in prediction ability for the datasets that underwent strong quality filtering. The $S N P_{RNAseq}^{S t d .}$ filtering resulted in < 100 sequence variants in the 10K read depth subset compared to the 6.9K markers in $S N P_{RNAseq}^{Total}$ . Therefore, GP performance decreased as a consequence of an insufficient number of polymorphic variants, which in turn led to an imprecise estimation of genetic relatedness. $S N P_{RNAseq}^{S t d .}$ and $S N P_{RNAseq}^{Total}$ were outperformed by $S N P_{RNAseq}^{QC}$ with 1.5K sequence variants at the same sequencing depth. These results illustrate the balance between the number of variants needed to achieve the necessary LD between variants and trait-coding polymorphisms and the selection of informative variants.

In comparison with the influence of a reduced sequencing depth on the prediction ability of sequence variant datasets outlined above, we observed a reduction in the prediction ability already at a higher number of reads (around 3 M) for the gene expression datasets. Even more pronounced was the reduction in statistical power to detect DEG caused by the reduced number of reads. For example, 400K reads were able to detect 76% of the transcripts of the total dataset, but only 22% of the DEG. This ultimately resulted in fewer than 100 DEG in sequencing depth subsets below 20K, which led to dramatically reduced prediction abilities.

With reduced sequencing depth, sequencing is most likely no longer the most expensive step of the project, and the remaining cost saving potential can be realized by adjusting the library preparation workflow. In this project, we used a thoroughly evaluated miniaturization workflow that reduced the costs of the library preparation by a factor of four (Arlt et al. 2023).

Combining multiple datasets maximized GP performance

Besides the independent evaluations of the sequence variant and gene expression datasets described above, we also examined various scenarios for combining them. In detail, the overall best-performing dataset $S N P_{WGS}^{Total}$ , the gene expression dataset $G E_{RNAseq}^{DEG}$ , the RNA-Seq sequence variant dataset $S N P_{RNAseq}^{QC}$ , and the 50k SNP array dataset $S N P_{Array}^{Total}$ were selected to be examined for their joint prediction ability. We observed a clear positive trend in prediction ability when combining these datasets (Fig. 5). This was expected because the different information content of the datasets is complementary.

However, not all combinations of individual datasets outperformed the best single predictor. The GP performance of all combinations was significantly decreased (p < 0.05) for at least one trait. This highly trait-specific GP performance was in accordance with previous reports (Zhu et al. 2022; Michel et al. 2021) and can be explained by differences in heritability and the number of small- and large-effect loci contributing to the trait. The high similarities between the sequence variant datasets $S N P_{RNAseq}^{QC}$ , $S N P_{WGS}^{Total}$ , and $S N P_{Array}^{Total}$ (Fig. S7) could explain the limited potential of their combinations to increase GP performance. The correlation coefficient indicated more differences between $G E_{RNAseq}^{DEG}$ and $S N P_{Array}^{Total}$ or $S N P_{RNAseq}^{QC}$ , but combining them only resulted in significantly increased prediction ability for two traits (p $\leq$ 0.014). No clear increase in GP performance was observed by adding gene expression as an additional information layer to sequence variants, which was consistent with previous studies (Azodi et al. 2020; Guo et al. 2016).

In the first section of the discussion, we listed potential reasons for the limited prediction ability potential of gene expression data and ways to improve it. It is not clear whether such an improved transcriptome dataset would better complement or replace sequence variant data for the purpose of genomic prediction analyses. Additionally, we see further potential for datasets like ours to be improved. For example, using logarithmic scaling of read counts increased the correlation coefficient between $G E_{RNAseq}^{DEG}$ and the remaining datasets by 2–3% and slightly increased GP performance. This indicates that GP, similar to differential expression analysis or co-expression analysis, profits from the increased homoscedasticity of log-scaled expression values (Johnson and Krishnan 2022; Love et al. 2014). Log-scaling could be especially beneficial when gene expression data are later combined with SNP data. Further research is needed to test this hypothesis.
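The effect of log-scaling on read counts can be illustrated with a toy example. The base-2 logarithm and the pseudocount of 1 are assumptions of this sketch, as the text does not specify the exact transformation; the counts are invented so that both genes differ twofold (after adding the pseudocount) between the two samples.

```python
from math import log2


def log_scale(counts, pseudocount=1.0):
    """Apply log2(count + pseudocount) to a samples x genes count matrix."""
    return [[log2(c + pseudocount) for c in row] for row in counts]


# Two samples (rows), one lowly and one highly expressed gene (columns).
counts = [[7, 1023],
          [15, 2047]]
scaled = log_scale(counts)

# Difference between the two samples for each gene, before and after scaling.
raw_gaps = [counts[1][g] - counts[0][g] for g in range(2)]
log_gaps = [scaled[1][g] - scaled[0][g] for g in range(2)]
print(raw_gaps, log_gaps)
```

On the raw scale, the highly expressed gene dominates any distance or covariance between the samples, whereas after log-scaling both genes contribute equally.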

Genetic diversity vs. relatedness: cross-, inter-, and intra-population prediction

Our multi-parent recombinant inbred line population allowed us to test multiple GP population designs and evaluate their performance using different input datasets (Fig. 7). Combining multiple populations in the TS (cross-population) increased the GP performance in our study. This finding is in accordance with the results of Berro et al. (2019) but contrasts with those of Lorenz et al. (2012), which illustrates that the relationship between population structure and diversity dictates whether the creation of a cross-population TS is beneficial.

The relative difference in prediction ability between $S N P_{WGS}^{Total}$ and the remaining datasets was higher for the inter-population GP than for the intra-population GP (Fig. 7B). This could be explained by the increased marker count of $S N P_{WGS}^{Total}$ , which could have led to a higher linkage between marker and QTL, which was shown to be more relevant when the genetic distance between TS and VS increased (De Roos et al. 2009). Notably, $G E_{RNAseq}^{DEG}$ and $S N P_{RNAseq}^{QC}$ performed similarly in inter-population GP, while in intra-population GP $S N P_{RNAseq}^{QC}$ was clearly superior. This shift was mainly caused by a single inter-population combination that could be predicted by $G E_{RNAseq}^{DEG}$ but not by $S N P_{RNAseq}^{QC}$ .

Summary

In this study, we evaluated various approaches to optimize the prediction potential of RNA-Seq datasets. Our results show that the RNA-Seq sequence data in combination with parental sequence data outperformed the SNP array data in five out of eight traits, showing a higher prediction ability overall. That approach has the potential to become even more important in the future due to the increasing number of breeding-relevant barley lines for which sequencing data are available. The integration of publicly accessible data without additional costs could increase the overall importance of GS in barley. The same is also true for many other crop species.

When relying solely on RNA-Seq data, we showed that focusing on differentially expressed genes noticeably increased the prediction ability of the gene expression dataset, although it did not reach the prediction ability level of the sequence variant datasets. The GP performance of the RNA-Seq sequence variant data was substantially increased by strict data cleaning and quality filtering, but it only exceeded the prediction performance of the SNP array for one trait. Combining predictor datasets did not significantly (p > 0.05) increase the prediction ability averaged across all traits, but it significantly increased the prediction ability for individual traits in half of the combinations tested. Therefore, the expected benefit of combining the RNA-Seq sequence variant and gene expression data did not materialize for our genetic material.

In this study, we showed that low-cost and high-throughput RNA-Seq data can achieve comparable or better GP performance than traditional SNP array datasets. As the RNA-Seq datasets are more flexible, do not rely on a pre-selected set of genomic variants, and allow for further decreased costs by adjusting the read depth and library preparation protocols, we see high application potential for RNA-Seq in plant breeding programs.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 3372 KB)

Acknowledgements

The computational infrastructure and support were provided by the Centre for Information and Media Technology at Heinrich-Heine-University Duesseldorf.

Author contributions

The study was conceptualized and designed by C.A. and B.S.; the RIL population was designed by B.S. and J.L.; the genetic material was propagated and maintained by D.I. and B.S.; the data were analyzed and interpreted by C.A. and B.S.; the manuscript was written by C.A. and edited by C.A. and B.S.; all authors read and approved the final manuscript.

Funding

Open Access funding enabled and organized by Projekt DEAL. This research is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy (EXC 2048/1, Project ID: 390686111).

Data availability

The raw read RNA-Seq dataset analyzed in this study is available in the NCBI Sequence Read Archive (SRA) under BioProject ID PRJNA1088431 (https://www.ncbi.nlm.nih.gov/bioproject/1088431; https://www.ncbi.nlm.nih.gov/sra/PRJNA1088431).

Declarations

Conflict of interest

Benjamin Stich is a member of the editorial board of Theoretical and Applied Genetics. The authors have no other conflicts of interest to declare that are relevant to the content of this article.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Anders S, Pyl PT, Huber W (2015) Htseq-a python framework to work with high-throughput sequencing data. Bioinformatics 31(2):166–169. 10.1093/bioinformatics/btu638
Andrews S (2019) Fastqc. URL https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Arlt C, Wachtmeister T, Köhrer K, Stich B (2023) Affordable, accurate and unbiased RNA sequencing by manual library miniaturization: a case study in barley. Plant Biotechnol J 21(11):2241–2253. 10.1111/pbi.14126
Arlt C, Stich B (2022) Raw read RNA-Seq data: HvDRR population 13, pp 27–28
Atanda SA, Govindan V, Singh R, Robbins KR, Crossa J, Bentley AR (2022) Sparse testing using genomic prediction improves selection for breeding targets in elite spring wheat. Theor Appl Genet 135(6):1939–1950. 10.1007/s00122-022-04085-0
Azodi CB, Pardo J, VanBuren R, Campos G, Shiu SH (2020) Transcriptome-based prediction of complex traits in maize. Plant Cell 32(1):139–151. 10.1105/tpc.19.00332
Bayer MM, Rapazote-Flores P, Ganal M, Hedley PE, Macaulay M, Plieske J, Ramsay L, Russell J, Shaw PD, Thomas W, Waugh R (2017) Development and evaluation of a barley 50k iselect snp array. Front Plant Sci 8
Bayer PE, Petereit J, Danilevicz MF, Anderson R, Batley J, Edwards D (2021) The application of pangenomics and machine learning in genomic selection in plants. Plant Genome, 14(3).
Bernardo R, Yu JM (2007) Prospects for genomewide selection for quantitative traits in maize. Crop Sci 47(3):1082–1090. 10.2135/cropsci2006.11.0690
Berro I, Lado B, Nalin RS, Quincke M, Gutiérrez L (2019) Training population optimization for genomic selection. Plant Genome 12(3):190028
Bhat JA, Ali S, Salgotra RK, Mir ZA, Dutta S, Jadon V, Tyagi A, Mushtaq M, Jain N, Singh PK, Singh GP, Prabhu KV (2016) Genomic selection in the era of next generation sequencing for complex traits in plant breeding. Front Genet 7:221. 10.3389/fgene.2016.00221
Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120. 10.1093/bioinformatics/btu170 [DOI] [PMC free article] [PubMed] [Google Scholar]
Browning BL, Tian XW, Zhou Y, Browning SR (2021) Fast two-stage phasing of large-scale sequence data. Am J Hum Genet 108(10):1880–1890. 10.1016/j.ajhg.2021.08.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
Brøndum RF, Su G, Janss L, Sahana G, Guldbrandtsen B, Boichard D, Lund MS (2015) Quantitative trait loci markers derived from whole genome sequence data increases the reliability of genomic prediction. J Dairy Sci 98(6):4107–4116. 10.3168/jds.2014-9005 [DOI] [PubMed] [Google Scholar]
Burgess AJ, Masclaux-Daubresse C, Strittmatter G, Weber APM, Taylor SH, Harbinson J, Yin XY, Long S, Paul MJ, Westhoff P, Loreto F, Ceriotti A, Saltenis VLR, Pribil M, Nacry P, Scharff LB, Jensen PE, Muller B, Cohan JP, Foulkes J, Rogowsky P, Debaeke P, Meyer C, Nelissen H, Inzé D, Lankhorst RK, Parry MAJ, Murchie EH, Baekelandt A (2023) Improving crop yield potential: underlying biological processes and future prospects. Food Energy Sec 12(1)
Burney JA, Davis SJ, Lobell DB (2010) Greenhouse gas mitigation by agricultural intensification. Proc Natl Acad Sci USA 107(26):12052–12057. 10.1073/pnas.0914216107
Bustos-Korts D, Malosetti M, Chapman S, Biddulph B, van Eeuwijk F (2016) Improvement of predictive ability by uniform coverage of the target genetic space. G3-Genes Genomes Genet 6(11):3733–3747. 10.1534/g3.116.035410
Cappetta E, Andolfo G, Guadagno A, Di Matteo A, Barone A, Frusciante L, Ercolano MR (2021) Tomato genomic prediction for good performance under high-temperature and identification of loci involved in thermotolerance response. Horticult Res 8(1). 10.1038/s41438-021-00647-3
Casale F, van Inghelandt D, Weisweiler M, Li JQ, Stich B (2022) Genomic prediction of the recombination rate variation in barley—a route to highly recombinogenic genotypes. Plant Biotechnol J 20(4):676–690. 10.1111/pbi.13746
Cingolani P, Patel VM, Coon M, Nguyen T, Land SJ, Ruden DM, Lu X (2012a) Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Front Genet 3:35. 10.3389/fgene.2012.00035
Cingolani P, Platts A, Wang L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM (2012b) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain. Fly 6(2):80–92. 10.4161/fly.19695
Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szcześniak MW, Gaffney DJ, Elo LL, Zhang X, Mortazavi A (2016) A survey of best practices for RNA-seq data analysis. Genome Biol 17(1). 10.1186/s13059-016-0881-8
Cosenza F, Shrestha A, van Inghelandt D, Casale FA, Wu PY, Weisweiler M, Li J, Wespel F, Stich B (2024) Genetic mapping reveals new loci and alleles for flowering time and plant height using the HvDRR population of barley. J Exp Bot. 10.1093/jxb/erae010
de Roos APW, Hayes BJ, Goddard ME (2009) Reliability of genomic predictions across multiple populations. Genetics 183(4):1545–1553. 10.1534/genetics.109.104935
Dekkers JC, Hospital F (2002) The use of molecular genetics in the improvement of agricultural populations. Nat Rev Genet 3(1):22–32. 10.1038/nrg701
Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Le Gall C, Schaëffer B, Le Crom S, Guedj M, Jaffrézic F, and French StatOmique Consortium (2013) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 14(6):671–683. 10.1093/bib/bbs046
Dossa K, Zhou R, Li D, Aili Liu L, Qin MA, Mmadi RS, Zhang Y, Wang J, Gao Y, Zhang X, You J (2021) A novel motif in the 5’-UTR of an orphan gene ‘big root biomass’ modulates root biomass in sesame. Plant Biotechnol J 19(5):1065–1079. 10.1111/pbi.13531
Fernando RL, Grossman M (1989) Marker assisted selection using best linear unbiased prediction. Genet Sel Evol 21(4):467–477. 10.1051/gse:19890407
Glazier AM, Nadeau JH, Aitman TJ (2002) Finding genes that underlie complex traits. Science 298(5602):2345–9. 10.1126/science.1076641
Guo ZG, Magwire MM, Basten CJ, Xu ZY, Wang DL (2016) Evaluation of the utility of gene expression and metabolic information for genomic prediction in maize. Theor Appl Genet 129(12):2413–2427. 10.1007/s00122-016-2780-5
Harfouche AL, Jacobson DA, Kainer D, Romero JC, Harfouche AH, Mugnozza GS, Moshelion M, Tuskan GA, Keurentjes JJB, Altman A (2019) Accelerating climate resilient plant breeding by applying next-generation artificial intelligence. Trends Biotechnol 37(11):1217–1235. 10.1016/j.tibtech.2019.05.007
Heffner EL, Sorrells ME, Jannink JL (2009) Genomic selection for crop improvement. Crop Sci 49(1):1–12. 10.2135/cropsci2008.08.0512
Hu H, Campbell MT, Yeats TH, Zheng X, Runcie DE, Covarrubias-Pazaran G, Broeckling C, Yao L, Caffe-Treml M, Gutierrez LA, Smith KP, Tanaka J, Hoekenga OA, Sorrells ME, Gore MA, Jannink JL (2021) Multi-omics prediction of oat agronomic and seed nutritional traits across environments and in distantly related populations. Theor Appl Genet 134(12):4043–4054. 10.1007/s00122-021-03946-4
Hu X, Carver BF, El-Kassaby YA, Zhu L, Chen C (2023) Weighted kernels improve multi-environment genomic prediction. Heredity 130(2):82–91. 10.1038/s41437-022-00582-6
Isidro J, Jannink JL, Akdemir D, Poland J, Heslot N, Sorrells ME (2015) Training set optimization under population structure in genomic selection. Theor Appl Genet 128(1):145–58. 10.1007/s00122-014-2418-4
Jia Y, Jannink JL (2012) Multiple-trait genomic selection methods increase genetic value prediction accuracy. Genetics 192(4):1513–22. 10.1534/genetics.112.144246
Johnson KA, Krishnan A (2022) Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data. Genome Biol. 10.1186/s13059-021-02568-9
Kim D, Paggi JM, Park C, Bennett C, Salzberg SL (2019) Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37(8):907–915. 10.1038/s41587-019-0201-4
Klepikova AV, Kasianov AS, Gerasimov ES, Logacheva MD, Penin AA (2016) A high resolution map of the Arabidopsis thaliana developmental transcriptome based on RNA-seq profiling. Plant J 88(6):1058–1070. 10.1111/tpj.13312
Lande R, Thompson R (1990) Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124(3):743–56. 10.1093/genetics/124.3.743
Lenaerts B, Collard BCY, Demont M (2019) Review: improving global food security through accelerated plant breeding. Plant Sci. 10.1016/j.plantsci.2019.110207
Li D, Xu Z, Gu R, Wang P, Lyle D, Xu J, Zhang H, Wang G (2019) Enhancing genomic selection by fitting large-effect SNPs as fixed effects and a genotype-by-environment effect using a maize BC1F3:4 population. PLoS ONE 14(10):e0223898. 10.1371/journal.pone.0223898
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16):2078–9. 10.1093/bioinformatics/btp352
Li Y, Ruperao P, Batley J, Edwards D, Martin W, Hobson K, Sutton T (2022) Genomic prediction of preliminary yield trials in chickpea: effect of functional annotation of SNPs and environment. Plant Genome. 10.1002/tpg2.20166
Lorenz AJ, Smith KP, Jannink J-L (2012) Potential and optimization of genomic selection for Fusarium head blight resistance in six-row barley. Crop Sci 52(4):1609–1621. 10.2135/cropsci2011.09.0503
Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 10.1186/s13059-014-0550-8
Lyra DH, Mendonça LD, Galli G, Alves FC, Granato ISC, Fritsche-Neto R (2017) Multi-trait genomic prediction for nitrogen response indices in tropical maize hybrids. Mol Breeding. 10.1007/s11032-017-0681-1
Mackay TF (2001) The genetic architecture of quantitative traits. Annu Rev Genet 35:303–39. 10.1146/annurev.genet.35.102401.090633
Mascher M (2019) Pseudomolecules and annotation of the second version of the reference genome sequence assembly of barley cv. Morex [Morex v2]. URL https://doi.ipk-gatersleben.de:443/DOI/83e8e186-dc4b-47f7-a820-28ad37cb176b/d1067eba-1d08-42e2-85ec-66bfd5112cd8/2
Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157(4):1819–1829
Michel S, Wagner C, Nosenko T, Steiner B, Samad-Zamini M, Buerstmayr M, Mayer K, Buerstmayr H (2021) Merging genomics and transcriptomics for predicting Fusarium head blight resistance in wheat. Genes. 10.3390/genes12010114
Montesinos-Lopez OA, Montesinos-Lopez JC, Salazar E, Barron JA, Montesinos-Lopez A, Buenrostro-Mariscal R, Crossa J (2021) Application of a Poisson deep neural network model for the prediction of count data in genome-based prediction. Plant Genome. 10.1002/tpg2.20118
Nei M (1973) Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci 70(12):3321–3323. 10.1073/pnas.70.12.3321
Piepho HP, Möhring J (2007) Computing heritability and selection response from unbalanced plant breeding trials. Genetics 177(3):1881–1888. 10.1534/genetics.107.074229
Pourkheirandish M, Komatsuda T (2007) The importance of barley genetics and domestication in a global perspective. Ann Bot 100(5):999–1008. 10.1093/aob/mcm139
Purugganan MD, Fuller DQ (2009) The nature of selection during plant domestication. Nature 457(7231):843–848. 10.1038/nature07895
Rincent R, Laloë D, Nicolas S, Altmann T, Brunel D, Revilla P, Rodríguez VM, Moreno-Gonzalez J, Melchinger A, Bauer E, Schoen CC, Meyer N, Giauffret C, Bauland C, Jamin P, Laborde J, Monod H, Flament P, Charcosset A, Moreau L (2012) Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192(2):715. 10.1534/genetics.112.141473
Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 10.1186/gb-2010-11-3-r25
Rutkoski J, Poland J, Mondal S, Autrique E, Pérez LG, Crossa J, Reynolds M, Singh R (2016) Canopy temperature and vegetation indices from high-throughput phenotyping improve accuracy of pedigree and genomic selection for grain yield in wheat. G3-Genes Genomes Genet 6(9):2799–2808. 10.1534/g3.116.032888
Sandhu K, Patil SS, Pumphrey M, Carter A (2021) Multitrait machine- and deep-learning models for genomic selection using spectral information in a wheat breeding program. Plant Genome 14(3):e20119. 10.1002/tpg2.20119
Schrag TA, Westhues M, Schipprack W, Seifert F, Thiemann A, Scholten S, Melchinger AE (2018) Beyond genomic prediction: combining different types of data can improve prediction of hybrid performance in maize. Genetics 208(4):1373–1385. 10.1534/genetics.117.300374
Schurch NJ, Schofield P, Gierlinski M, Cole C, Sherstnev A, Singh V, Wrobel N, Gharbi K, Simpson GG, Owen-Hughes T, Blaxter M, Barton GJ (2016) How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA 22(6):839–51. 10.1261/rna.053959.115
Searle SR, Speed FM, Milliken GA (1980) Population marginal means in the linear-model—an alternative to least-squares means. Am Stat 34(4):216–221. 10.2307/2684063
Shrestha A, Cosenza F, van Inghelandt D, Wu PY, Li JQ, Casale FA, Weisweiler M, Stich B (2022) The double round-robin population unravels the genetic architecture of grain size in barley. J Exp Bot 73(22):7344–7361. 10.1093/jxb/erac369
Stich B (2009) Comparison of mating designs for establishing nested association mapping populations in maize and Arabidopsis thaliana. Genetics 183(4):1525–1534. 10.1534/genetics.109.108449
Tan B, Grattapaglia D, Martins GS, Ferreira KZ, Sundberg B, Ingvarsson PK (2017) Evaluating the accuracy of genomic prediction of growth and wood traits in two Eucalyptus species and their F1 hybrids. BMC Plant Biol 17(1). 10.1186/s12870-017-1059-6
Tilman D, Balzer C, Hill J, Befort BL (2011) Global food demand and the sustainable intensification of agriculture. Proc Natl Acad Sci USA 108(50):20260–20264. 10.1073/pnas.1116437108
Tsai HY, Cericola F, Edriss V, Andersen JR, Orabi J, Jensen JD, Jahoor A, Janss L, Jensen J (2020) Use of multiple traits genomic prediction, genotype by environment interactions and spatial effect to improve prediction accuracy in yield data. PLoS ONE 15(5):e0232665. 10.1371/journal.pone.0232665
Ubbens J, Parkin I, Eynck C, Stavness I, Sharpe AG (2021) Deep neural networks for genomic prediction do not estimate marker effects. Plant Genome 14(3):e20147. 10.1002/tpg2.20147
VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91(11):4414–4423. 10.3168/jds.2007-0980
Washburn JD, Burch MB, Franco JAV (2020) Predictive breeding for maize: making use of molecular phenotypes, machine learning, and physiological crop models. Crop Sci 60(2):622–638. 10.1002/csc2.20052
Weber SE, Roscher-Ehrig L, Kox T, Abbadi A, Stahl A, Snowdon RJ (2024) Genomic prediction in Brassica napus: evaluating the benefit of imputed whole-genome sequencing data. Genome 67(7):210–222. 10.1139/gen-2023-0126
Weisweiler M, de Montaigu A, Ries D, Pfeifer M, Stich B (2019) Transcriptomic and presence/absence variation in the barley genome assessed from multi-tissue mRNA sequencing and their power to predict phenotypic traits. BMC Genomics. 10.1186/s12864-019-6174-3
Weisweiler M, Arlt C, Wu PY, van Inghelandt D, Hartwig T, Stich B (2022) Structural variants in the barley gene pool: precision and sensitivity to detect them using short-read sequencing and their association with gene expression and phenotypic variation. Theor Appl Genet 135(10):3511–3529. 10.1007/s00122-022-04197-7
Wen ZX, Tan RJ, Zhang SC, Collins PJ, Yuan JZ, Du WY, Gu CH, Ou SJ, Song QJ, An YQC, Boyse JF, Chilvers MI, Wang DC (2018) Integrating GWAS and gene expression data for functional characterization of resistance to white mould in soya bean. Plant Biotechnol J 16(11):1825–1835. 10.1111/pbi.12918
Westhues M, Schrag TA, Heuer C, Thaller G, Utz HF, Schipprack W, Thiemann A, Seifert F, Ehret A, Schlereth A, Stitt M, Nikoloski Z, Willmitzer L, Schön CC, Scholten S, Melchinger AE (2017) Omics-based hybrid prediction in maize. Theor Appl Genet 130(9):1927–1939. 10.1007/s00122-017-2934-0
Westhues M, Heuer C, Thaller G, Fernando R, Melchinger AE (2019) Efficient genetic value prediction using incomplete omics data. Theor Appl Genet 132(4):1211–1222. 10.1007/s00122-018-03273-1
Wright SI (2005) The effects of artificial selection on the maize genome (vol 308, pg 1310, 2005). Science 310(5745):54
Wu PY, Stich B, Weisweiler M, Shrestha A, Erban A, Westhoff P, van Inghelandt D (2022) Improvement of prediction ability by integrating multi-omic datasets in barley. BMC Genomics. 10.1186/s12864-022-08337-7
Zabel F, Delzeit R, Schneider JM, Seppelt R, Mauser W, Václavík T (2019) Global impacts of future cropland expansion and intensification on agricultural markets and biodiversity. Nat Commun. 10.1038/s41467-019-10775-z
Zarringhalam K, Degras D, Brockel C, Ziemek D (2018) Robust phenotype prediction from gene expression data using differential shrinkage of co-regulated genes. Sci Rep 8(1):1237. 10.1038/s41598-018-19635-0
Zhang C, Huang L, Zhang H, Hao Q, Lyu B, Wang M, Epstein L, Liu M, Kou C, Qi J, Chen F, Li M, Gao G, Ni F, Zhang L, Hao M, Wang J, Chen X, Luo M-C, Zheng Y, Wu J, Liu D, Fu D (2019) An ancestral NB-LRR with duplicated 3’UTRs confers stripe rust resistance in wheat and barley. Nat Commun. 10.1038/s41467-019-11872-9
Zhao Y, Mette MF, Gowda M, Longin CFH, Reif JC (2014) Bridging the gap between marker-assisted and genomic selection of heading time and plant height in hybrid wheat. Heredity 112(6):638–645. 10.1038/hdy.2014.1
Zhong SQ, Dekkers JCM, Fernando RL, Jannink JL (2009) Factors affecting accuracy from genomic selection in populations derived from multiple inbred lines: a barley case study. Genetics 182(1):355–364. 10.1534/genetics.108.098277
Zhu XT, Maurer HP, Jenz M, Hahn V, Ruckelshausen A, Leiser WL, Würschum T (2022) The performance of phenomic selection depends on the genetic architecture of the target trait. Theor Appl Genet 135(2):653–665. 10.1007/s00122-021-03997-7

Associated Data

Supplementary Materials

Supplementary file 1 (pdf 3372 KB)

Data Availability Statement

[CR1] Anders S, Pyl PT, Huber W (2015) Htseq-a python framework to work with high-throughput sequencing data. Bioinformatics 31(2):166–169. 10.1093/bioinformatics/btu638 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] Andrews S (2019) Fastqc. URL https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

[CR3] Arlt C, Wachtmeister T, Köhrer K, Stich B (2023) Affordable, accurate and unbiased RNA sequencing by manual library miniaturization: a case study in barley. Plant Biotechnol J 21(11):2241–2253. 10.1111/pbi.14126 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] Arlt C, Stich B (2022) Raw read RNA-Seq data: HvDRR population 13, pp 27–28

[CR5] Atanda SA, Govindan V, Singh R, Robbins KR, Crossa J, Bentley AR (2022) Sparse testing using genomic prediction improves selection for breeding targets in elite spring wheat. Theor Appl Genet 135(6):1939–1950. 10.1007/s00122-022-04085-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] Azodi CB, Pardo J, VanBuren R, Campos G, Shiu SH (2020) Transcriptome-based prediction of complex traits in maize. Plant Cell 32(1):139–151. 10.1105/tpc.19.00332 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] Bayer MM, Rapazote-Flores P, Ganal M, Hedley PE, Macaulay M, Plieske J, Ramsay L, Russell J, Shaw PD, Thomas W, Waugh R (2017) Development and evaluation of a barley 50k iselect snp array. Front Plant Sci 8

[CR8] Bayer PE, Petereit J, Danilevicz MF, Anderson R, Batley J, Edwards D (2021) The application of pangenomics and machine learning in genomic selection in plants. Plant Genome, 14(3).

[CR9] Bernardo R, Yu JM (2007) Prospects for genomewide selection for quantitative traits in maize. Crop Sci 47(3):1082–1090. 10.2135/cropsci2006.11.0690 [Google Scholar]

[CR10] Berro I, Lado B, Nalin RS, Quincke M, Gutiérrez L (2019) Training population optimization for genomic selection. Plant Genome 12(3):190028 [Google Scholar]

[CR11] Bhat JA, Ali S, Salgotra RK, Mir ZA, Dutta S, Jadon V, Tyagi A, Mushtaq M, Jain N, Singh PK, Singh GP, Prabhu KV (2016) Genomic selection in the of next generation sequencing for complex traits in plant breeding. Front Genet 7:221. 10.3389/fgene.2016.00221 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120. 10.1093/bioinformatics/btu170 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] Browning BL, Tian XW, Zhou Y, Browning SR (2021) Fast two-stage phasing of large-scale sequence data. Am J Hum Genet 108(10):1880–1890. 10.1016/j.ajhg.2021.08.005 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] Brøndum RF, Su G, Janss L, Sahana G, Guldbrandtsen B, Boichard D, Lund MS (2015) Quantitative trait loci markers derived from whole genome sequence data increases the reliability of genomic prediction. J Dairy Sci 98(6):4107–4116. 10.3168/jds.2014-9005 [DOI] [PubMed] [Google Scholar]

[CR15] Burgess AJ, Masclaux-Daubresse C, Strittmatter G, Weber APM, Taylor SH, Harbinson J, Yin XY, Long S, Paul MJ, Westhoff P, Loreto F, Ceriotti A, Saltenis VLR, Pribil M, Nacry P, Scharff LB, Jensen PE, Muller B, Cohan JP, Foulkes J, Rogowsky P, Debaeke P, Meyer C, Nelissen H, Inzé D, Lankhorst RK, Parry MAJ, Murchie EH, Baekelandt A (2023) Improving crop yield potential: underlying biological processes and future prospects. Food Energy Sec 12(1):2048–3694 [Google Scholar]

[CR16] Burney JA, Davis SJ, Lobell DB (2010) Greenhouse gas mitigation by agricultural intensification. Proc Natl Acad Sci USA 107(26):12052–12057. 10.1073/pnas.0914216107 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR17] Bustos-Korts D, Malosetti M, Chapman S, Biddulph B, Eeuwijk F (2016) Improvement of predictive ability by uniform coverage of the target genetic space. G3-Genes Genomes Genet 6(11):3733–3747. 10.1534/g3.116.035410 [Google Scholar]

[CR18] Cappetta E, Andolfo G, Guadagno A, Matteo A, Barone A, Frusciante L, Ercolano MR (2021) Tomato genomic prediction for good performance under high-temperature and identification of loci involved in thermotolerance response. Horticult Res 8(1):2662–6810. 10.1038/s41438-021-00647-3 [Google Scholar]

[CR19] Casale F, Inghelandt D, Weisweiler M, Li JQ, Stich B (2022) Genomic prediction of the recombination rate variation in barley—a route to highly recombinogenic genotypes. Plant Biotechnol J 20(4):676–690. 10.1111/pbi.13746 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] Cingolani P, Patel VM, Coon M, Nguyen T, Land SJ, Ruden DM, Lu X (2012a) Using drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Front Genet 3:35. 10.3389/fgene.2012.00035 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR21] Cingolani P, Platts A, Wang L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM (2012b) A program for annotating and predicting the effects of single nucleotide polymorphisms, Snpeff: Snps in the genome of drosophila melanogaster strain. Fly 6(2):80–92. 10.4161/fly.19695 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR22] Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szcześniak MW, Gaffney DJ, Elo LL, Zhang X, Mortazavi A (2016) A survey of best practices for Rna-seq data analysis. Genome Biol 17(1):1474–760X. 10.1186/s13059-016-0881-8 [Google Scholar]

[CR23] Cosenza F, Shrestha A, Inghelandt D, Casale FA, Wu PY, Weisweiler M, Li J, Wespel F, Stich B (2024) Genetic mapping reveals new loci and alleles for flowering time and plant height using the HvDRR population of barley. J Exp Bot. 10.1093/jxb/erae010 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR24] Roos APW, Hayes BJ, Goddard ME (2009) Reliability of genomic predictions across multiple populations. Genetics 183(4):1545–1553. 10.1534/genetics.109.104935 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR25] Dekkers JC, Hospital F (2002) The use of molecular genetics in the improvement of agricultural populations. Nat Rev Genet 3(1):22–32. 10.1038/nrg701 [DOI] [PubMed] [Google Scholar]

[CR26] Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Le Gall C, Schaëffer B, Le Crom S, Guedj M, Jaffrézic F, and French StatOmique Consortium (2013) A comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis. Briefings in Bioinformatics, 14(6):671–683. 10.1093/bib/bbs046

[CR27] Dossa K, Zhou R, Li D, Aili Liu L, Qin MA, Mmadi RS, Zhang Y, Wang J, Gao Y, Zhang X, You J (2021) A novel motif in the 5’-UTR of an orphan gene ‘big root biomass’ modulates root biomass in sesame. Plant Biotechnol J 19(5):1065–1079. 10.1111/pbi.13531 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR28] Fernando RL, Grossman M (1989) Marker assisted selection using best linear unbiased prediction. Genet Sel Evol 21(4):467–477. 10.1051/gse:19890407 [Google Scholar]

[CR29] Glazier AM, Nadeau JH, Aitman TJ (2002) Finding genes that underlie complex traits. Science 298(5602):2345–9. 10.1126/science.1076641 [DOI] [PubMed] [Google Scholar]

[CR30] Guo ZG, Magwire MM, Basten CJ, Xu ZY, Wang DL (2016) Evaluation of the utility of gene expression and metabolic information for genomic prediction in maize. Theor Appl Genet 129(12):2413–2427. 10.1007/s00122-016-2780-5 [DOI] [PubMed] [Google Scholar]

[CR31] Harfouche AL, Jacobson DA, Kainer D, Romero JC, Harfouche AH, Mugnozza GS, Moshelion M, Tuskan GA, Keurentjes JJB, Altman A (2019) Accelerating climate resilient plant breeding by applying next-generation artificial intelligence. Trends Biotechnol 37(11):1217–1235. 10.1016/j.tibtech.2019.05.007 [DOI] [PubMed] [Google Scholar]

[CR32] Heffner EL, Sorrells ME, Jannink JL (2009) Genomic selection for crop improvement. Crop Sci 49(1):1–12. 10.2135/cropsci2008.08.0512 [Google Scholar]

[CR33] Hu H, Campbell MT, Yeats TH, Zheng X, Runcie DE, Covarrubias-Pazaran G, Broeckling C, Yao L, Caffe-Treml M, Gutierrez LA, Smith KP, Tanaka J, Hoekenga OA, Sorrells ME, Gore MA, Jannink JL (2021) Multi-omics prediction of oat agronomic and seed nutritional traits across environments and in distantly related populations. Theor Appl Genet 134(12):4043–4054. 10.1007/s00122-021-03946-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR34] Hu X, Carver BF, El-Kassaby YA, Zhu L, Chen C (2023) Weighted kernels improve multi-environment genomic prediction. Heredity 130(2):82–91. 10.1038/s41437-022-00582-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR35] Isidro J, Jannink JL, Akdemir D, Poland J, Heslot N, Sorrells ME (2015) Training set optimization under population structure in genomic selection. Theor Appl Genet 128(1):145–58. 10.1007/s00122-014-2418-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR36] Jia Y, Jannink JL (2012) Multiple-trait genomic selection methods increase genetic value prediction accuracy. Genetics 192(4):1513–22. 10.1534/genetics.112.144246 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR37] Johnson KA, Krishnan A (2022) Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data. Genome Biol. 10.1186/s13059-021-02568-9 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR38] Kim D, Paggi JM, Park C, Bennett C, Salzberg SL (2019) Graph-based genome alignment and genotyping with hisat2 and hisat-genotype. Nat Biotechnol 37(8):907–915. 10.1038/s41587-019-0201-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR39] Klepikova AV, Kasianov AS, Gerasimov ES, Logacheva MD, Penin AA (2016) A high resolution map of the arabidopsis thaliana developmental transcriptome based on scprna/scp-seq profiling. Plant J 88(6):1058–1070. 10.1111/tpj.13312 [DOI] [PubMed] [Google Scholar]

[CR40] Lande R, Thompson R (1990) Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124(3):743–56. 10.1093/genetics/124.3.743 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR41] Lenaerts B, Collard BCY, Demont M (2019) Review: improving global food security through accelerated plant breeding. Plant Sci. 10.1016/j.plantsci.2019.110207 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR42] Li D, Xu Z, Gu R, Wang P, Lyle D, Xu J, Zhang H, Wang G (2019) Enhancing genomic selection by fitting large-effect SNPs as fixed effects and a genotype-by-environment effect using a maize bc1f3:4 population. PLoS ONE 14(10):e0223898. 10.1371/journal.pone.0223898 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR43] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and Subgroup Genome Project Data Processing (2009) The sequence alignment/map format and samtools. Bioinformatics 25(16):2078–9. 10.1093/bioinformatics/btp352 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR44] Li Y, Ruperao P, Batley J, Edwards D, Martin W, Hobson K, Sutton T (2022) Genomic prediction of preliminary yield trials in chickpea: Effect of functional annotation of SNPs and environment. The Plant Genome. 10.1002/tpg2.20166 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR45] Lorenz AJ, Smith KP, Jannink J-L (2012) Potential and optimization of genomic selection for fusarium head blight resistance in six-row barley. Crop Sci 52(4):1609–1621. 10.2135/cropsci2011.09.0503 [Google Scholar]

[CR46] Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with deseq2. Genome Biol. 10.1186/s13059-014-0550-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR47] Lyra DH, Mendonça LD, Galli G, Alves FC, Granato ISC, Fritsche-Neto R (2017) Multi-trait genomic prediction for nitrogen response indices in tropical maize hybrids. Mol Breeding. 10.1007/s11032-017-0681-1 [Google Scholar]

[CR48] Mackay TF (2001) The genetic architecture of quantitative traits. Annu Rev Genet 35:303–39. 10.1146/annurev.genet.35.102401.090633 [DOI] [PubMed] [Google Scholar]

[CR49] Mascher M (2019) Pseudomolecules and annotation of the second version of the reference genome sequence assembly of barley cv. morex [morex v2]. URL https://doi.ipk-gatersleben.de:443/DOI/83e8e186-dc4b-47f7-a820-28ad37cb176b/d1067eba-1d08-42e2-85ec-66bfd5112cd8/2

[CR50] Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157(4):1819–1829 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR51] Michel S, Wagner C, Nosenko T, Steiner B, Samad-Zamini M, Buerstmayr M, Mayer K, Buerstmayr H (2021) Merging genomics and transcriptomics for predicting fusarium head blight resistance in wheat. Genes. 10.3390/genes12010114 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR52] Montesinos-Lopez OA, Montesinos-Lopez JC, Salazar E, Barron JA, Montesinos-Lopez A, Buenrostro-Mariscal R, Crossa J (2021) Application of a Poisson deep neural network model for the prediction of count data in genome-based prediction. Plant Genome. 10.1002/tpg2.20118 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR53] Nei M (1973) Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci 70(12):3321–3323. 10.1073/pnas.70.12.3321 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR54] Piepho HP, Möhring J (2007) Computing heritability and selection response from unbalanced plant breeding trials. Genetics 177(3):1881–1888. 10.1534/genetics.107.074229 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR55] Pourkheirandish M, Komatsuda T (2007) The importance of barley genetics and domestication in a global perspective. Ann Bot 100(5):999–1008. 10.1093/aob/mcm139 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR56] Purugganan MD, Fuller DQ (2009) The nature of selection during plant domestication. Nature 457(7231):843–848. 10.1038/nature07895

[CR57] Rincent R, Laloë D, Nicolas S, Altmann T, Brunel D, Revilla P, Rodríguez VM, Moreno-Gonzalez J, Melchinger A, Bauer E, Schoen CC, Meyer N, Giauffret C, Bauland C, Jamin P, Laborde J, Monod H, Flament P, Charcosset A, Moreau L (2012) Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192(2):715. 10.1534/genetics.112.141473

[CR58] Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 10.1186/gb-2010-11-3-r25

[CR59] Rutkoski J, Poland J, Mondal S, Autrique E, Pérez LG, Crossa J, Reynolds M, Singh R (2016) Canopy temperature and vegetation indices from high-throughput phenotyping improve accuracy of pedigree and genomic selection for grain yield in wheat. G3-Genes Genomes Genetics 6(9):2799–2808. 10.1534/g3.116.032888

[CR60] Sandhu K, Patil SS, Pumphrey M, Carter A (2021) Multitrait machine- and deep-learning models for genomic selection using spectral information in a wheat breeding program. Plant Genome 14(3):e20119. 10.1002/tpg2.20119

[CR61] Schrag TA, Westhues M, Schipprack W, Seifert F, Thiemann A, Scholten S, Melchinger AE (2018) Beyond genomic prediction: combining different types of data can improve prediction of hybrid performance in maize. Genetics 208(4):1373–1385. 10.1534/genetics.117.300374

[CR62] Schurch NJ, Schofield P, Gierlinski M, Cole C, Sherstnev A, Singh V, Wrobel N, Gharbi K, Simpson GG, Owen-Hughes T, Blaxter M, Barton GJ (2016) How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA 22(6):839–51. 10.1261/rna.053959.115

[CR63] Searle SR, Speed FM, Milliken GA (1980) Population marginal means in the linear model: an alternative to least squares means. Am Stat 34(4):216–221. 10.2307/2684063

[CR64] Shrestha A, Cosenza F, Inghelandt D, Wu PY, Li JQ, Casale FA, Weisweiler M, Stich B (2022) The double round-robin population unravels the genetic architecture of grain size in barley. J Exp Bot 73(22):7344–7361. 10.1093/jxb/erac369

[CR65] Stich B (2009) Comparison of mating designs for establishing nested association mapping populations in maize and Arabidopsis thaliana. Genetics 183(4):1525–1534. 10.1534/genetics.109.108449

[CR66] Tan B, Grattapaglia D, Martins GS, Ferreira KZ, Sundberg B, Ingvarsson PK (2017) Evaluating the accuracy of genomic prediction of growth and wood traits in two Eucalyptus species and their F1 hybrids. BMC Plant Biol 17(1):1471–2229. 10.1186/s12870-017-1059-6

[CR67] Tilman D, Balzer C, Hill J, Befort BL (2011) Global food demand and the sustainable intensification of agriculture. Proc Natl Acad Sci USA 108(50):20260–20264. 10.1073/pnas.1116437108

[CR68] Tsai HY, Cericola F, Edriss V, Andersen JR, Orabi J, Jensen JD, Jahoor A, Janss L, Jensen J (2020) Use of multiple traits genomic prediction, genotype by environment interactions and spatial effect to improve prediction accuracy in yield data. PLoS ONE 15(5):e0232665. 10.1371/journal.pone.0232665

[CR69] Ubbens J, Parkin I, Eynck C, Stavness I, Sharpe AG (2021) Deep neural networks for genomic prediction do not estimate marker effects. Plant Genome 14(3):e20147. 10.1002/tpg2.20147

[CR70] VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91(11):4414–4423. 10.3168/jds.2007-0980

[CR71] Washburn JD, Burch MB, Franco JAV (2020) Predictive breeding for maize: making use of molecular phenotypes, machine learning, and physiological crop models. Crop Sci 60(2):622–638. 10.1002/csc2.20052

[CR72] Weber SE, Roscher-Ehrig L, Kox T, Abbadi A, Stahl A, Snowdon RJ (2024) Genomic prediction in Brassica napus: evaluating the benefit of imputed whole-genome sequencing data. Genome 67(7):210–222. 10.1139/gen-2023-0126

[CR73] Weisweiler M, Montaigu A, Ries D, Pfeifer M, Stich B (2019) Transcriptomic and presence/absence variation in the barley genome assessed from multi-tissue mRNA sequencing and their power to predict phenotypic traits. BMC Genomics. 10.1186/s12864-019-6174-3

[CR74] Weisweiler M, Arlt C, Wu PY, Inghelandt D, Hartwig T, Stich B (2022) Structural variants in the barley gene pool: precision and sensitivity to detect them using short-read sequencing and their association with gene expression and phenotypic variation. Theor Appl Genet 135(10):3511–3529. 10.1007/s00122-022-04197-7

[CR75] Wen ZX, Tan RJ, Zhang SC, Collins PJ, Yuan JZ, Du WY, Gu CH, Ou SJ, Song QJ, An YQC, Boyse JF, Chilvers MI, Wang DC (2018) Integrating GWAS and gene expression data for functional characterization of resistance to white mould in soya bean. Plant Biotechnol J 16(11):1825–1835. 10.1111/pbi.12918

[CR76] Westhues M, Schrag TA, Heuer C, Thaller G, Utz HF, Schipprack W, Thiemann A, Seifert F, Ehret A, Schlereth A, Stitt M, Nikoloski Z, Willmitzer L, Schön CC, Scholten S, Melchinger AE (2017) Omics-based hybrid prediction in maize. Theor Appl Genet 130(9):1927–1939. 10.1007/s00122-017-2934-0

[CR77] Westhues M, Heuer C, Thaller G, Fernando R, Melchinger AE (2019) Efficient genetic value prediction using incomplete omics data. Theor Appl Genet 132(4):1211–1222. 10.1007/s00122-018-03273-1

[CR78] Wright SI (2005) The effects of artificial selection on the maize genome (vol 308, pg 1310, 2005). Science 310(5745):54–54. ISSN 0036-8075. WOS:000232477000025

[CR79] Wu PY, Stich B, Weisweiler M, Shrestha A, Erban A, Westhoff P, Inghelandt D (2022) Improvement of prediction ability by integrating multi-omic datasets in barley. BMC Genomics. 10.1186/s12864-022-08337-7

[CR80] Zabel F, Delzeit R, Schneider JM, Seppelt R, Mauser W, Václavík T (2019) Global impacts of future cropland expansion and intensification on agricultural markets and biodiversity. Nat Commun. 10.1038/s41467-019-10775-z

[CR81] Zarringhalam K, Degras D, Brockel C, Ziemek D (2018) Robust phenotype prediction from gene expression data using differential shrinkage of co-regulated genes. Sci Rep 8(1):1237. 10.1038/s41598-018-19635-0

[CR82] Zhang C, Huang L, Zhang H, Hao Q, Lyu B, Wang M, Epstein L, Liu M, Kou C, Qi J, Chen F, Li M, Gao G, Ni F, Zhang L, Hao M, Wang J, Chen X, Luo M-C, Zheng Y, Wu J, Liu D, Fu D (2019) An ancestral NB-LRR with duplicated 3′UTRs confers stripe rust resistance in wheat and barley. Nat Commun. 10.1038/s41467-019-11872-9

[CR83] Zhao Y, Mette MF, Gowda M, Longin CFH, Reif JC (2014) Bridging the gap between marker-assisted and genomic selection of heading time and plant height in hybrid wheat. Heredity 112(6):638–645. 10.1038/hdy.2014.1

[CR84] Zhong SQ, Dekkers JCM, Fernando RL, Jannink JL (2009) Factors affecting accuracy from genomic selection in populations derived from multiple inbred lines: a barley case study. Genetics 182(1):355–364. 10.1534/genetics.108.098277

[CR85] Zhu XT, Maurer HP, Jenz M, Hahn V, Ruckelshausen A, Leiser WL, Würschum T (2022) The performance of phenomic selection depends on the genetic architecture of the target trait. Theor Appl Genet 135(2):653–665. 10.1007/s00122-021-03997-7
