Abstract
Background
Whole-genome genotyping techniques like Genotyping-by-sequencing (GBS) are being used for genetic studies such as Genome-Wide Association (GWAS) and Genomewide Selection (GS), where different strategies for imputation have been developed. Nevertheless, imputation error may lead to poor performance (i.e. smaller power or higher false positive rate) when complete data is not required as it is for GWAS, and each marker is taken at a time. The aim of this study was to compare the performance of GWAS analysis for Quantitative Trait Loci (QTL) of major and minor effect using different imputation methods when no reference panel is available in a wheat GBS panel.
Results
In this study, we compared the power and false positive rate of dissecting quantitative traits for imputed and not-imputed marker score matrices in: (1) a complete molecular marker barley panel array, and (2) a GBS wheat panel with missing data. We found that there is an ascertainment bias in imputation method comparisons. Simulating over a complete matrix and creating missing data at random proved that imputation methods have a poorer performance. Furthermore, we found that when QTL were simulated with imputed data, the imputation methods performed better than the not-imputed ones. On the other hand, when QTL were simulated with not-imputed data, the not-imputed method and one of the imputation methods performed better for dissecting quantitative traits. Moreover, larger differences between imputation methods were detected for QTL of major effect than QTL of minor effect. We also compared the different marker score matrices for GWAS analysis in a real wheat phenotype dataset, and we found minimal differences indicating that imputation did not improve the GWAS performance when a reference panel was not available.
Conclusions
Poorer performance was found in GWAS analysis when an imputed marker score matrix was used, no reference panel is available, in a wheat GBS panel.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-016-3120-5) contains supplementary material, which is available to authorized users.
Keywords: GBS, QTL, GWAS, Power, False positive
Background
Genetic markers are nowadays an essential part of plant and animal breeding programs. Next-generation sequencing (NGS) techniques allow discovering, sequencing, and genotyping thousands of Single Nucleotide Polymorphism (SNPs) covering the whole genome [1]. These SNPs are being used in analyses like transcriptome assembly [2], generation of high-quality draft genomes even for complex genomes [3], understanding plant growth [4], evaluating the effect of epigenetics in plant development [5], isolation of mutant genes [6],species evolution and economic insight [7], genetic diversity [8], GWAS [9], and GS [10]. The GBS technique is one of the most used NGS approaches [8–11]. It was developed originally for barley and maize, and later extended to other complex genomes species like wheat [8–11]. GBS that relies on methylation-sensitive restriction enzymes is highly efficient [12]. However, GBS generates a large proportion of missing data when alleles are obtained due to the use of short reads and when low sequencing depth are used [12]. Therefore, different strategies to impute missing data have been developed and used for genetic analyses [9]. Some imputation methods use reference panels and are based on Linkage Disequilibrium (LD), while other methods do not require reference panels. In the first group, the most common methods are known as MACH [13], IMPUTE [14], fastPHASE [15], PLINK [16], and Beagle [17]. All of them use haplotype segments from a reference panel densely genotyped to impute missing markers [18–20]. MACH uses a Markov Chain based algorithm to infer pairs of haplotypes for each individual’s genotypes [13]. IMPUTE considers the sequence of pairs of known haplotypes as hidden states, then models the sequence of hidden states based on a recombination map estimated from the reference data, and finally it predicts unknown genotypes [14]. The fastPHASE algorithm is a haplotype clustering algorithm that samples missing genotypes based on allele frequencies estimated from reference haplotypes, and then uses an Expectation- Maximization (EM) algorithm to estimate parameter values to infer missing genotypes [15]. PLINK predicts missing data by the local haplotypic background and by the haplotype formed by the two or more flanking SNPs [16]. Finally, Beagle is a haplotype clustering based algorithm that uses the localized haplotype cluster model to group haplotypes at each marker and then finds the most likely haplotype pairs based on the individual’s known genotypes [17]. Therefore, strong LD among markers and low minor allele frequency (MAF) is required for effective LD imputation methods [21]. Additionally, more markers with an even genome coverage and therefore smaller distance among markers, and markers with larger subpopulation differentiation are also desirable to ensure imputation accuracy [22]. The second group of methods do not require a reference panel and include imputation by the mean, the MVN-EM algorithm, and random forests [10]. In mean imputation, the most common allele at a particular marker in the population is used to impute missing data. MVN-EM, on the other hand, considers the realized additive relationship matrix between the lines and an EM approach assuming that marker genotypes follow a multivariate normal distribution designed for use with GBS. Finally, random forest methods use an algorithm with multiple decision trees to determine a prediction value for each missing data point. For an overview of the imputation methods see [10].
Several studies found that imputation can improve QTL power detection [23, 24], but other studies found that large power is accompanied by either larger false positive rates or an increase in the multiple-testing penalty [20, 25]. Unless a ‘one-hit’ procedure is used (i.e. the uncertainty of genotypic probability distributions due to the imputation is incorporated in the GWAS analysis), large imputation error can be generated [26]. Other studies found that imputation should be carefully evaluated because quality control of the data is an important source of loss of power [27]. To carry on GWAS analysis, where one marker at a time is being tested, marker-trait associations can be estimated without marker imputation using the available information at each marker.
The aim of this study was to compare the performance of imputation methods for GWAS analysis when no reference panel is available in a wheat GBS panel. Specifically, our objectives were: (1) to evaluate the effect of imputation using a golden standard (i.e. simulation over a complete marker score matrix), to determine whether ascertainment bias is responsible for imputation success; (2) to evaluate whether the outcome of the imputation performance is affected by the marker score matrix used to simulate the QTL; and (3) to compare the effect of imputation in a real phenotype wheat panel using GBS data with different missing rates (25 %, 35 % and 50 %) and four phenotypic traits.
Results
The strategies we pursued are explained in the Methods section, and the general procedure presented in Fig. 1. We used different number of QTL and heritabilities to simulate the QTL, along with different thresholds for calling the QTL. We summarized the results with power (PO) and false positive rate (FPR).
Ascertainment bias in imputation performance comparison (golden standard)
When we used a golden standard matrix of barley for simulating the QTL (i.e. a complete dataset, for general approach see Fig. 1A1), we found that for major QTL effects, larger power was obtained without imputing the genotypic matrix. Furthermore, for minor QTL effects, larger power was detected without imputing the genotypic matrix or imputing it with the MVN-EM method (GNImp, GMVN-EM Fig. 2). The smallest false positive rate was obtained for the genotypic matrix imputed by the RF method (GRF), and the largest false positive rate was obtained with the MVN-EM imputation method (GMVN-EM). False positive rates were still really small (i.e. 0.015, Fig. 2). Power was also small in general (i.e. 0.3, Fig. 2). The same pattern was found when using different threshold levels for the dissection of quantitative traits (i.e. Bonferroni corrected by the effective number of independent markers, Fig. 2; Bonferroni correction, Additional file 1; and an arbitrary threshold set at α = 0.01, Additional file 2).
When we simulated QTL over an imputed matrix (for the general approach see Fig. 1A.2), we found that larger power was obtained with the imputed genotypic matrices (GMean, GMVN-EM or GRF), while the largest false positive rate was obtained with the MVN-EM imputation method (GMVN-EM) (Fig. 3). However, when QTL were simulated over a not-imputed matrix, the largest power was obtained when a not-imputed or imputed by the MVN-EM genotypic matrices were used (GMVN-EM or GNImp). This pattern was consistent across number of QTL (i.e. 25 and 50, data not shown) and heritabilities (i.e. 0.2, 0.4, 0.6, 0.7, 0.9, Fig. 3). The same pattern was found when using different threshold levels for the dissection of quantitative traits (i.e. Bonferroni corrected by the effective number of independent markers, Fig. 3; Bonferroni correction, Additional file 3; and an arbitrary threshold set at α = 0.01, Additional file 4).
Imputation effect for real GBS data with 25 %, 35 % or 50 % missing information
By using naturally sparse genotypic matrices like GBS in wheat with 25 %, 35 % or 50 % missing data information (for the general approach see Fig. 1b), we detected that larger power was obtained when a not-imputed or imputed by the MVN-EM genotypic matrices were used (Fig. 4, Additional files 5 and 6). However, when simulating over a matrix with imputed data, larger power was obtained by recover QTL with an imputed matrix (Fig. 4, Additional files 5 and 6). This was true for the different number of QTL (i.e. 25 and 50, data not shown) and heritabilities (i.e. 0.2, 0.4, 0.6, 0.7, 0.9, Fig. 4, Additional files 5 and 6). Differences between power were more evident for major QTL, resulting in a reasonable increase of power for high heritabilities (Fig. 4). The largest values of false positive rate were found when simulating with the Ysim-NImp and GMVN-EM or the Ysim-RF and GMVN-EM (Fig. 4, Additional files 5 and 6). Additionally, the same pattern was found using different threshold levels (i.e. Bonferroni corrected by the effective number of independent markers, Fig. 4; Bonferroni correction, Additional file 7; and an arbitrary threshold set at α = 0.01, Additional file 8).
Imputation effect on GWAS for real phenotypes
We compared the QTL obtained for GWAS analysis using real phenotypic data from wheat, between the not-imputed matrix (GNImp) with different missing rates (25 %, 35 % and 50 % of missing data), and the genotypic data imputed with the mean, MVN-EM or RF method (GMean, GMVN-EM or GRF). The performance of GWAS analysis was similar across imputation methods (Fig. 5, Additional files 9 and 10), but not all QTL were detected across methods. For the 4 traits, plant height (PH, cm), days to heading (DH, days), thousand kernel weight (TKW, g) and spikes per square meter (SPM, number, Fig. 6, Additional files 11 and 12), we detected different putative QTL when using imputed or not-imputed matrices. In general, the MVN-EM imputation method performed similarly to non imputation, having some QTL being detected by both methods (Fig. 6, Additional files 11 and 12). However, each approach found also unique QTL (Fig. 6, Additional files 11 and 12).
Differences between methods for false positive rate
When we performed FPR boxplots with the replications for analyzing if the differences between the methods are significantly different or due to random errors (Additional files 13, 14, 15, 16, 17), we found that FPR rates were larger for: (i) the imputed genotypic matrices by the MVN-EM method for the golden standard, (ii) the imputed genotypic matrix by the MVN-EM method (GMVN-EM) for the ascertainment bias, (iii) the imputed genotypic matrices by the MVN-EM or RF methods (GMVN-EM, GRF) for the GBS data with 35 % or 50 % missing data, (iv) and the imputed genotypic matrices by the RF method (GRF) for the GBS data with 25 % missing data.
Discussion
New whole-genome genotyping techniques are constantly being developed and used for genetic analyses like GWAS [9]. Although GBS is a powerful tool for genotyping hundreds of individuals with thousands of SNPs, it generates large amounts of missing information, and therefore, researchers have applied several strategies to impute these missing [14–17]. However, when retained a considerable amount of missing information using GBS data in wheat or artificially removing genotypic data from complete panels in barley, we found that imputation does not improve the dissection of quantitative traits performance in several situations. Our results should be restricted to our panels that have a specific LD (barley and wheat) and SNP quality, due to the continuous improvement of the sequencing technologies that allows the decrease of costs and therefore the increase of sequencing depth and quality, leading to a lower missing rate.
Ascertainment bias in imputation performance comparison (golden standard)
When we used the “golden standard” marker score matrix, the not-imputed marker score matrix outperformed the imputation methods for all the combinations of parameters (Fig. 2, Additional files 1 and 2). The higher values of false positive rate found with the MVN-EM matrix and lower values of power found with the RF matrix for all thresholds could be a consequence of an imputation error affecting the signal of the QTL.
The fact that we also found that the not-imputed marker score matrix outperformed the imputation methods comparing both, power and false positive rate simultaneously, when we used real GBS data (i.e. data with missing points, Fig. 4), suggests that using an imputed matrix for GWAS analysis could introduce an ascertainment bias. This could be caused when there is no reference panel, and the uncertainty of genotypic probability distributions due to the imputation is not considered, as methods based on LD have found that if some restrictions are taken into account (i.e. strong LD among markers, low minor MAF, short distances between not-imputed markers, and markers with higher subpopulation differentiation), the imputation accuracy and then the GWAS is improved [22, 28].
Although the low power found to detect QTL for the barley marker score matrix could theoretically be due to low LD between markers in the same LD blocks, we do not expect this to be the reason of low power in our study. When there are unlinked QTL controlling a trait, the power is moderate even with large populations and high heritabilities [29]. However, we do not expect unlinked QTL within the LD blocks due to the cluster of markers within those blocks [30], and because the genome coverage of the markers was very high, having 50 % of its SNPs, at a distance smaller than 0.625 cM (Table 1). The small population (122 lines) used for barley dataset could be the reason affecting the low values of power detected, as the power is a function of the population size [31]. However, this should not differently affect the imputation methods. Additionally, the great differences found in power and false positive rate between major and minor QTL, could indicate that major QTL are the QTL mostly detected by any of the imputation methods. Other LD structures in different populations could make our results to vary, therefore, this results are restricted to the populations used in this analysis. Further analyses considering different population structure should be tested.
Table 1.
Chr | SNPs number | Length (cM) | Largest gap (cM) | Median (cM) |
---|---|---|---|---|
1 | 125 | 139.78 | 10.74 | 0.63 |
2 | 187 | 150.27 | 8.21 | 0.58 |
3 | 178 | 170.88 | 6.59 | 0.58 |
4 | 131 | 121.65 | 7.50 | 0.60 |
5 | 201 | 194.03 | 8.05 | 0.57 |
6 | 147 | 129.38 | 8.62 | 0.47 |
7 | 127 | 166.56 | 10.53 | 0.49 |
Imputation effect for real GBS data with 25 %, 35 % or 50 % missing information
The differences found when we simulated QTL on top of imputed or not-imputed marker score matrices (Fig. 4, Additional files 5 and 6) were probably due to the imputation method used and the simulation. Therefore, we found that not-imputing was the best option for evaluating one marker at a time in GWAS analysis using GBS data with 25 %, 35 % or 50 % missing information, especially for detecting major QTL.
Imputation effect on GWAS for real phenotypes
As no significant differences were detected in the real wheat datasets in terms of the type 1 error inflation imputation (Fig. 5, Additional files 9 and 10), we consider that imputation does not improved the GWAS performance and therefore is not needed.
The traits evaluated in this paper were selected for having high heritability values and being related or a component of grain yield. The high heritability values may have reduced the differences between the QTL found with GNImp or GMVN-EM.
We found QTL where previous QTL were reported. The QTL found for TKW (chromosome 1B, bin 224 and 242) with the GNImp, GMVN-EM and GMean matrices for 50 % missing data, and with the GMean matrix for 25 % and 35 % missing data, are partially coincident with a QTL reported for green leaf area [32], a QTL reported for Near Differential Vegetative Index [33] and a QTL reported for yield, anthesis and plant height [34]. A QTL found for TKW (chromosome 1D, bin 205) with the GNImp, GMVN-EM and GMean matrices for 25 %, 35 % and 50 % missing data, is coincident with a QTL reported for grain yield and plant height [34]. The QTL found for TKW (chromosome 2D, bin 167) with 3 marker score matrices (GNImp, GMVN-EM and GMean) for 25 %, 35 % and 50 % missing data, SPM (chromosome 2D, bin 167) with the GNImp matrix for 25 % and 50 % missing data, and with 3 marker score matrices (GNImp, GMVN-EM and GMean) for 35 % missing data, are coincident with a QTL reported for kernel weight, Near Differential Vegetative Index and flag leaf [33]. A QTL found for DH (chromosome 3B, bin 282) with GNImp and GMVN-EM for 50 % missing data is coincident with a QTL reported for grain filling duration [32]. A QTL found for SPM (chromosome 4A, bin 179) with the GNImp and GMVN-EM matrices for 25 %, 35 % and 50 % missing data, is coincident with a QTL reported for anthesis and plant height [34]. The QTL found for DH (chromosome 4B, bin 106) with the GNImp matrix for 50 % missing data, is coincident with a QTL reported for yield and plant height [34]. A QTL found for DH (chromosome 6B, bin 116) with the GNImp matrix for 35 % and 50 % missing data, and with the GNImp and GMean matrices for the 25 % missing data, is coincident with a QTL for yield and plant height [28]. A QTL found for PH (chromosome 7A, bin 225) with the GNImp and GMean matrices for 50 % missing data, is coincident with yield and anthesis [34]. These positions are based on bins and should be regarded as an approximation. These could be improved after the draft of the genome is available [35].
As we found that QTL detected by the GNImp and GMVN-EM matrices were similar, we believe that imputation do not improve GWAS analysis.
Conclusions
Imputation can introduce an ascertainment bias to GWAS analysis using GBS within crops when a reference panel is not available. Comparing the GWAS performance by the power and false positive rate with imputed or not-imputed marker score matrices, poorer performance was found when an imputed marker score matrix was used. Additionally, the power and false positive rate changed in a clear way between major and minor QTL, showing that differences among imputation methods were more evident for major QTL and that the detection of minor QTL is negligible. Our results are restricted to the wheat panel used, as with different LD they could vary, and as well with different GBS quality data, which is affected by different SNP identification algorithms.
Methods
Dataset
We used two datasets: (1) a complete SNPs barley panel array (i.e. 99 % coverage), and (2) a GBS wheat marker score matrix with an average of 25 %, 35 % or 50 % missing points and phenotypic data (for general approach see Fig. 1).
The complete barley SNP marker score array dataset (Additional file 18), consisted in a panel of 122 barley advanced inbred lines from a population of 360 described in [36]. Briefly, 1,096 SNPs from the Barley Oligonucleotide Pool Assay-1 (BOPA 1, Additional file 19) were selected [37, 38]. A total of 122 lines were chosen to form 2 complete datasets without missing information (Table 1).
The wheat GBS dataset (Additional file 20), consisted on a panel of 384 advanced inbred lines from breeding programs: 186 genotypes from the National Wheat Breeding Program from Uruguay (INIA-Uruguay, National Institute of Agricultural Research), 55 genotypes from the National Wheat Breeding Program from Chile (INIA-Chile), and 143 genotypes from the International Breeding Center of Maize and Wheat (CIMMYT, International Maize and Wheat Improvement Center), published in [39]. The CIMMYT genotypes share common ancestors with the INIA-Chile genotypes (see [39] for more details). DNA was extracted by the DNeasy Plant Maxi Kit (QIAGEN). Library construction was conducted at Kansas State University (Manhattan, Kansas) using a PstI-MspI GBS protocol [10]. The sequencing was performed on an Illumina Hi-Sequation 2000 at the DNA core facility at the University of Missouri, Columbia, Missouri, and the McGill Univesity-Génome Quebec Innovation Centre (Montreal, Canada) for each set of libraries. SNPs were obtained using the Tassel-GBS Pipeline [40]. The base quality and distribution of sequences was studied with the Galaxy (https://galaxyproject.org/) software. SNPs with more than 25 %, 35 % or 50 % missing points and with minor allele frequency (MAF) smaller than 10 % were excluded. Sequences were blasted to the SyntheticxOpata map (synop) using the blastn function from NCBI-BLAST+ package using the number of descriptions and the number of threads set to one. Therefore, SNPs were placed into recombination bins [11] (Additional file 21). A final matrix set of 18,337 SNPs was obtained for 50 % missing data (Table 2), a final matrix set of 8,227 SNPs was obtained for 25 % missing data (Additional files 22 and 23), and a final matrix set of 11,858 SNPs was obtained for 35 % missing data (Additional files 24 and 25).
Table 2.
Chr | SNPs number | Length (cM) | Largest gap (cM) | Median (cM) |
---|---|---|---|---|
1A | 821 | 266 | 33 | 0 |
1B | 1282 | 294 | 22 | 0 |
1D | 255 | 242 | 25 | 0 |
2A | 900 | 242 | 22 | 0 |
2B | 1746 | 266 | 38 | 0 |
2D | 327 | 182 | 27 | 0 |
3A | 929 | 329 | 28 | 0 |
3B | 1912 | 290 | 30 | 0 |
3D | 270 | 287 | 29 | 0 |
4A | 907 | 234 | 28 | 0 |
4B | 610 | 177 | 31 | 0 |
4D | 74 | 130 | 45 | 0 |
5A | 1023 | 232 | 26 | 0 |
5B | 1270 | 316 | 22 | 0 |
5D | 197 | 306 | 29 | 0 |
6A | 883 | 237 | 34 | 0 |
6B | 1302 | 232 | 25 | 0 |
6D | 243 | 276 | 28 | 0 |
7A | 1456 | 323 | 24 | 0 |
7B | 1660 | 263 | 40 | 0 |
7D | 270 | 337 | 45 | 0 |
The phenotypic data for the wheat panel was obtained from an evaluation in a Mediterranean environment in Santa Rosa-Chile in 2011 (36° 329’ S, 71° 559’ W; 217 m.a.s.l.). The field was irrigated with 50 mm m-2 at each of four moments: tillering, flag leaf emergence, heading date, and grain filling (see [33] for further details). The experimental design was an alpha-lattice with 20 replications and 20 incomplete blocks. The traits evaluated were: plant height (PH, cm) evaluated from the base of the plant to the flower insertion (Additional file 26), days to heading (DH, days) was recorded when 50 % of the culms showed emerged ears (Additional file 27), thousands kernel weight (TKW, g, Additional file 28), and spikes per square meter (SPM, number, Additional file 29). We obtained the best linear unbiased predictors (BLUPs) for each genotype using the following model for each trait: y ijk = µ + a i + β j + δ k(j) + ε ijk where y ijk is the value for the phenotypic trait corresponding to the i-th genotype, j-th replication, and k-th incomplete block, μ is the overall mean, a i is the random effect of the i-th genotype with a i ~ N(0, σ 2g), β j is the effect of the j-th replication, δ k(j) is the random effect of the k-th incomplete block within the j-th replication with δ k(j) ~ N(0, σ 2B), ε ijk is the experimental error corresponding to the i-th genotype, j-th replication and k-th incomplete block with ε ijk ~ N(0, σ 2e). The genotypic breeding values were estimated with the function lmer (lme4 package) in R statistical software [41]. Broad sense heritabilities were estimated in R statistical software [35] using the above model (Table 3).
Table 3.
Trait | Santa Rosa- Chile 2011 |
---|---|
Plant height (cm) | 0.78 |
Days to heading (days) | 0.97 |
Thousand kernel weight (g) | 0.93 |
Spikes per square meter (number) | 0.76 |
Imputation methods
For the barley SNP array panel, we started with a genotype by marker score matrix with 122 genotypes (rows) and 1,096 markers (columns) without missing values Markers were scored as {1, -1}. Then, we randomly generated missing values in order to have the same coverage as the GBS panel (50 %). Finally, three methods were used to fill in those missing values, MVN-EM, which considers the realized additive relationship matrix between the lines and an EM approach assuming that marker genotypes follow a multivariate normal distribution [10], Random Forest (RF), which uses an algorithm with multiple decision trees to predict a value for each missing point, and the Mean, which uses the average value score per marker (i.e. the expected allele value at the particular marker). Imputation was conducted in R statistical software [41] with the A.mat function (rrBLUP package) [42].
For the wheat GBS panel, we started with: (i) a genotype by marker score matrix with 384 genotypes (rows) and 18,337 markers (columns) with 50 % of missing values, (ii) a genotype by marker score matrix with 384 genotypes (rows) and 8,227 markers (columns) with 25 % of missing values, and (iii) a genotype by marker score matrix with 384 genotypes (rows) and 11,858 markers (columns) with 35 % of missing values. Markers were scored as the number of alleles {NA, 1, -1}. We used the same methods as the previous sections to impute by the MVN-EM and the Mean.
Simulation procedure
To evaluate the effect of imputation using a golden standard with the barley SNP array, we created phenotypic vectors simulating QTL on top of the complete barley marker score matrix (Ysim-NoNA). The phenotypic vectors were the sum of the effects of genotypic and residual terms, Ysim = g + e. The genotypic effect was calculated as the sum of the markers (selected as QTL) effects and markers effects were obtained from a Beta(2, 6) distribution. Markers selected as QTL were obtained from the LD blocks defined from a single linkage agglomerative procedure [30] with euclidean distances between markers and a minimum of 1.5 cM to consider independent groups. QTL with major effects were defined as the QTL with effects larger than the 75 % of the maximum, and QTL with minor effect were defined as the remaining QTL. The residual term was obtained by sampling from a normal distribu tion, N(0, σ2 e), where σ2 e = (1- h 2)σ2 g/ h 2 and σ2 g was the variance of the realized g. One vector for the combinations of number of QTL (i.e. 25 and 50), different heritabilities (i.e. 0.2, 0.4, 0.6, 0.7, 0.9), and for each one of 500 iterations was created. Then, we created missing data at random, imputed (i.e. GNImp, GMVN-EM, GMean and GRF) and pursued the GWAS analysis with each combination of genotypic matrix, evaluating power and false positive rate (for the general approach see Fig. 1A.1).
For the ascertainment bias evaluation, we first created the missing data and then simulated the QTL on top of each matrix: not-imputed marker score (Ysim-NImp), imputed with MVN-EM [10] marker score (Ysim-MVN-EM), imputed by the mean marker score (Ysim-Mean) and imputed with RF [10] marker score (Ysim-RF). Finally, we performed the GWAS analysis with each genotypic marker score (i.e. GNImp, GMVN-EM, GMean and GRF) and for each phenotypic vector (i.e. Ysim-NImp, Ysim-MVN-EM, Ysim-Mean and Ysim-RF, for the general approach see Fig. 1A.2). We then compared the power and false positive rate.
For evaluating GWAS performance based on simulated phenotypes with the wheat GBS panel (Fig 1b) data we first created vectors of phenotypic values (i.e. Ysim-NImp, Ysim-MVN-EM, Ysim-Mean and Ysim-RF). Each phenotypic vector was simulated for different number of QTL (i.e. 25 and 50), different heritabilities (i.e. 0.2, 0.4, 0.6, 0.7, 0.9) as in the previous section. In order to avoid collinearity, LD blocks were defined as the bins in each chromosome and a marker chosen at random within each LD block was considered a QTL. One vector for each combination of the parameters and for each one of 500 iterations was created. We performed the simulations in R statistical software [41].
GWAS analysis
For the GWAS analysis, the mixed model described by [43] was used: y = Xβ + Qv + Zu + e, where y is the phenotypic vector (n x 1) with n the total number of lines, X is a (n x m) SNPs matrix with m the number of SNPs coded as described before {NA, 1, -1}, β is a (m x 1) vector of allelic effects to be estimated, Q is a (n x q) incidence matrix with q origin’s groups, v is a (n x 1) populations fixed effect vector, Z is the genotypic incidence matrix, u is the vector of random background polygenic effects, u ~ N(0, Aσ2 g), where A is the realized additive relationship matrix obtained with the A.mat function from package rrBLUP [36] in R statistical software [35] and e is the residual error, e ~ N(0, σ2 e). For each Y sim , we used the 4 genotypic marker score to recover the QTL (i.e. GNImp, GMVN-EM, GMean and GRF). We performed the analysis for three different thresholds (threshold) to define markers as significant: (1) Bonferroni correction, (2) Bonferroni correction using the effective number of markers, Li&Ji method [38], and (3) a liberal threshold of α = 0.01. GWAS analysis was accomplished with GWAS function from rrBLUP package [42] in R statistical software [41–45]. We defined as true positives (TP) the number of bins with a QTL and at least one significant marker; false positives (FP) the number of bins with no QTL and at least one significant marker; true negatives (TN) the number of bins with no QTL and no significant markers, and false negatives (FN) the number of bins with QTL and no significant markers. We evaluated power (PO = TP/(TP + FN)) and false positive rate (FPR = FP/ (FP + TN)) [39] for QTL detection. We evaluated performance for QTL of major and minor effect.
Acknowledgements
The research that gives rise to the results presented in this publication was funded by the National Agency for Research and Innovation under the code POS_NAC_2014_1_102695 and FONTAGRO (Project FTG-8038-08 and Project FTG 0617-06). The authors wish to express their appreciation for the effort of the technical personnel of all the involved institutions.
Availability of data and materials
All data is included as supporting files for the manuscript.
Authors’ contributions
AJC was in charge of the barley project. IM, MQ, JvZ, MC and AdP created the wheat project. IM and AdP conducted phenotyping in wheat. JvZ and BL conducted the genotyping in wheat. SB, AGR, MM, AAFG and LG designed the hypothesis, methodology and conducted statistical analysis. SB, AGR, BL and LG wrote the paper. All authors read and approved the final paper.
Competing interests
The authors declare that they have no competing interest.
Abbreviations
- BLUPs
Best linear unbiased predictors
- BOPA
Barley oligonucleotide pool assay-1
- CIMMYT
International maize and wheat improvement center
- DH
Days to heading
- FN
False negatives
- FP
False positives
- FPR
False positive rate
- GBS
Genotype-by-sequencing
- GS
Genome-wide selection
- GWAS
Genome-wide analysis
- ICARDA
International center for agricultural research in the dry areas
- INIA
National institute of agricultural research
- LD
Linkage disequilibrium
- MAF
Minor allele frequency
- MVN-EM
Multivariate normal expectation maximization
- NGS
Next-generation sequencing
- PH
Plant height
- PO
Power
- QTL
Quantitative trait loci
- RF
Random forest
- SNPs
Single-nucleotide polymorphism
- SPM
Spikes per square meter
- SR2011
Santa rosa 2011
- TKW
Thousands kernel weight
- TN
True negatives
- TP
True positives
- USDA
United States department of agriculture
Additional files
Contributor Information
Sofía P. Brandariz, Email: sbrandariz@fagro.edu.uy
Agustín González Reymúndez, Email: agonzalez@epi.msu.edu.
Bettina Lado, Email: blado@fagro.edu.uy.
Marcos Malosetti, Email: marcos.malosetti@wur.nl.
Antonio Augusto Franco Garcia, Email: aafgarci@esalq.usp.br.
Martín Quincke, Email: mquincke@inia.org.uy.
Jarislav von Zitzewitz, Email: J.Zitzewitz@secobra.de.
Marina Castro, Email: mcastro@inia.org.uy.
Iván Matus, Email: imatus@inia.cl.
Alejandro del Pozo, Email: adelpozo@utalca.cl.
Ariel J. Castro, Email: vontruch@fagro.edu.uy
Lucía Gutiérrez, Email: gutierrezcha@wisc.edu.
References
- 1.Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat Rev Genet. 2011;12:499–510. doi: 10.1038/nrg3012. [DOI] [PubMed] [Google Scholar]
- 2.Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Publ Gr. 2011;2:671–682. doi: 10.1038/nrg3068. [DOI] [PubMed] [Google Scholar]
- 3.Hirsch CN, Buell CR. Tapping the Promise of Genomics in Species with Complex, Nonmodel Genomes. Annu Rev Plant Biol. 2013;64:89–110. doi: 10.1146/annurev-arplant-050312-120237. [DOI] [PubMed] [Google Scholar]
- 4.Bedewitz MA, Góngora-Castillo E, Uebler JB, Gonzales-Vigil E, Wiegert-Rininger KE, Childs KL, Hamilton JP, Vaillancourt B, Yeo YS, Chappell J, DellaPenna D, Jones AD, Buell CR, Barry CS. A Root-Expressed L -Phenylalanine: 4-Hydroxyphenylpyruvate Aminotransferase Is Required for Tropane Alkaloid Biosynthesis in Atropa belladonna. Plant Cell. 2014;9:3745–3762. doi: 10.1105/tpc.114.130534. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Zhong S, Fei Z, Chen Y, Zheng Y, Huang M, Vrebalov J, Mcquinn R, Gapper N, Liu B, Xiang J, Shao Y, Giovannoni JJ. Single-base resolution methylomes of tomato fruit development reveal epigenome modifications associated with ripening. Nat Biotechnol. 2013;31:154–159. doi: 10.1038/nbt.2462. [DOI] [PubMed] [Google Scholar]
- 6.Schneeberger K. Using next-generation sequencing to isolate mutant genes from forward genetic screens. Nat Publ Gr. 2014;15:662–676. doi: 10.1038/nrg3745. [DOI] [PubMed] [Google Scholar]
- 7.Zhang G, Liu X, Quan Z, Cheng S, Xu X, Pan S, Xie M, Zeng P, Yue Z, Wang W, Tao Y, Bian C, Han C, Xia Q, Peng X, Cao R, Yang X, Zhan D, Hu J, Zhang Y, Li H, Li H, Li N, Wang J, Wang C, Wang R, Guo T, Cai Y, Liu C, Xiang H, Shi Q, Huang P, Chen Q, Li Y, Wang J, Zhao Z, Wang J. Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential. Nat Biotechnol. 2012;30:549–554. doi: 10.1038/nbt.2195. [DOI] [PubMed] [Google Scholar]
- 8.Heslot N, Rutkoski J, Poland J, Jannink J-L, Sorrells ME. Impact of marker ascertainment bias on genomic selection accuracy and estimates of genetic diversity. PLoS ONE. 2013;8:e74612. doi: 10.1371/journal.pone.0074612. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Romay MC, Millard MJ, Glaubitz JC, Peiffer JA, Swarts KL, Casstevens TM, Elshire RJ, Acharya CB, Mitchell SE, Flint-Garcia SA, McMullen MD, Holland JB, Buckler ES, Gardner CA. Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol. 2013;14:R55. doi: 10.1186/gb-2013-14-6-r55. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Poland J, Endelman J, Dawson J, Rutkoski J, Wu S, Manes Y, Dreisigacker S, Crossa J, Sánchez-Villeda H, Sorrells M, Jannink J-L. Genomic Selection in Wheat Breeding using Genotyping-by-Sequencing. Plant Genome J. 2012;5:103. doi: 10.3835/plantgenome2012.06.0006. [DOI] [Google Scholar]
- 11.Poland JA, Brown PJ, Sorrells ME, Jannink J-L. Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach. PLoS ONE. 2012;7:e32253. doi: 10.1371/journal.pone.0032253. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE. 2011;6:e19379. doi: 10.1371/journal.pone.0019379. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34:816–34. doi: 10.1002/gepi.20533. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007;39:906–13. doi: 10.1038/ng2088. [DOI] [PubMed] [Google Scholar]
- 15.Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78:629–44. doi: 10.1086/502802. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007;81:1084–97. doi: 10.1086/521987. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Browning SR. Missing data imputation and haplotype phase inference for genome-wide association studies. Hum Genet. 2008;124:439–50. doi: 10.1007/s00439-008-0568-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Jannink J-L, Iwata H, Bhat PR, Chao S, Wenzl P, Muehlbauer GJ. Marker Imputation in Barley Association Studies. Plant Genome J. 2009;2:11. doi: 10.3835/plantgenome2008.09.0006. [DOI] [Google Scholar]
- 20.Hao K, Chudin E, McElwee J, Schadt EE. Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies. BMC Genet. 2009;10:27. doi: 10.1186/1471-2156-10-27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Pei Y-F, Li J, Zhang L, Papasian CJ, Deng HW. Analyses and comparison of accuracy of different genotype imputation methods. PLoS ONE. 2008;3:e3551. doi: 10.1371/journal.pone.0003551. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Iwata H, Jannink J-L. Marker genotype imputation in a low-marker-density panel with a high-marker-density reference panel. Accuracy evaluation in barley breeding lines. Crop Sci. 2010;50:1269. doi: 10.2135/cropsci2009.08.0434. [DOI] [Google Scholar]
- 23.Guan Y, Stephens M. Practical issues in imputation-based association mapping. PLoS Genet. 2008;4:e1000279. doi: 10.1371/journal.pgen.1000279. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Pasaniuc B, Rohland N, McLaren PJ, Garimella K, Zaitlen N, Li H, Gupta N, Neale BM, Daly MJ, Sklar P, Sullivan PF, Bergen S, Moran JL, Hultman CM, Lichtenstein P, Magnusson P, Purcell SM, Haas DW, Liang L, Sunyaev S, Patterson N, de Bakker PIW, Reich D, Price AL. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet. 2012;44:631–5. doi: 10.1038/ng.2283. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Almeida MAA, Oliveira PSL, Pereira TV, Krieger JE, Pereira AC. An empirical evaluation of imputation accuracy for association statistics reveals increased type-I error rates in genome-wide associations. BMC Genet. 2011;12:10. doi: 10.1186/1471-2156-12-10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Aulchenko YS, Struchalin MV, van Duijn CM. ProbABEL package for genome-wide association analysis of imputed data. BMC Bioinformatics. 2010;11:134. doi: 10.1186/1471-2105-11-134. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.De Bakker PIW, Ferreira MAR, Jia X, Neale BM, Raychaudhuri S, Voight BF. Practical aspects of imputation-driven meta-analysis of genome-wide association studies. Hum Mol Genet. 2008;17:122–28. doi: 10.1093/hmg/ddn288. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Bernardo R. Breeding for quantitative traits in plants. 2. Minnesota: Stemma Press; 2010. [Google Scholar]
- 29.Sibson R. SLINK: an optimally efficient algorithm for the single-link cluster method. Comput J. 1973;30–34.
- 30.Lande R, Thompson R. Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics. 1990;124:743–756. doi: 10.1093/genetics/124.3.743. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.He S, Zhao Y, Mette MF, Bothe R, Ebmeyer E, Sharbel TF, Reif JC, Jiang Y. Prospects and limits of marker imputation in quantitative genetic studies in European elite wheat (Triticum aestivum L.) BMC Genomics. 2015;16:1–12. doi: 10.1186/1471-2164-16-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Edae EA, Byrne PF, Haley SD, Lopes MS, Reynolds MP. Genome-wide association mapping of yield and yield components of spring wheat under contrasting moisture regimes. Theor Appl Genet. 2014;127:791–807. doi: 10.1007/s00122-013-2257-8. [DOI] [PubMed] [Google Scholar]
- 33.Bennett D, Reynolds M, Mullan D, Izanloo A, Kuchel H, Langridge P, Schnurbusch T. Detection of two major grain yield QTL in bread wheat (Triticum aestivum L.) under heat, drought and high yield potential environments. Theor Appl Genet. 2012;125:1473–85. doi: 10.1007/s00122-012-1927-2. [DOI] [PubMed] [Google Scholar]
- 34.Mathews KL, Malosetti M, Chapman S, McIntyre L, Reynolds M, Shorter R, van Eeuwijk F. Multi-environment QTL mixed models for drought stress adaptation in wheat. Theor Appl Genet. 2008;117:1077–91. doi: 10.1007/s00122-008-0846-8. [DOI] [PubMed] [Google Scholar]
- 35.Mayer KFX, Rogers J, el Dole J, Pozniak C, Eversole K, Feuillet C, Gill B, Friebe B, Lukaszewski AJ, Sourdille P, Endo TR, Kubalakova M, Ihalikova J, Dubska Z, Vrana J, Perkova R, Imkova H, Febrer M, Clissold L, McLay K, Singh K, Chhuneja P, Singh NK, Khurana J, Akhunov E, Choulet F, Alberti A, Barbe V, Wincker P, Kanamori H, et al. A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome. Science. 2014;345(6194):1251788. doi: 10.1126/science.1251788. [DOI] [PubMed] [Google Scholar]
- 36.Gutiérrez L, Germán S, Pereyra S, Hayes PM, Pérez CA, Capettini F, Locatelli A, Berberian NM, Falconi EE, Estrada R, Fros D, Gonza V, Altamirano H, Huerta-Espino J, Neyra E, Orjeda G, Sandoval-Islas S, Singh R, Turkington K, Castro AJ. Multi-environment multi-QTL association mapping identifies disease resistance QTL in barley germplasm from Latin America. Theor Appl Genet. 2015;128:501–16. doi: 10.1007/s00122-014-2448-y. [DOI] [PubMed] [Google Scholar]
- 37.Close TJ, Bhat PR, Lonardi S, Wu Y, Rostoks N, Ramsay L, Druka A, Stein N, Svensson JT, Wanamaker S, Bozdag S, Roose ML, Moscou MJ, Chao S, Varshney RK, Sz P, Sato K, Hayes PM, Matthews DE, Kleinhofs A, Muehlbauer GJ, Deyoung J, Marshall DF, Madishetty K, Fenton RD, Condamine P, Graner A, Waugh R. Development and implementation of high-throughput SNP genotyping in barley. BMC Genomics. 2009;13:1–13. doi: 10.1186/1471-2164-10-582. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Szűcs P, Blake VC, Bhat PR, Chao S, Close TJ, Cuesta-Marcos A, Muehlbauer GJ, Ramsay L, Waugh R, Hayes PM. An Integrated Resource for Barley Linkage Map and Malting Quality QTL Alignment. Plant Genome J. 2009;2:134. doi: 10.3835/plantgenome2008.01.0005. [DOI] [Google Scholar]
- 39.Lado B, Matus I, Rodríguez A, Inostroza L, Poland JA, Belzile F, del Pozo A, Quincke M, Castro M, von Zitzewitz J. Increased genomic prediction accuracy in wheat breeding through spatial adjustment of field trial data. G3 (Bethesda) 2013;3:2105–14. doi: 10.1534/g3.113.007807. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Glaubitz JC, Casstevens TN, Lu F, Harriman J, Elshire RJ, Sun Q, Buckler ES. TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. PLoS ONE. 2014;9(2):e90346. doi: 10.1371/journal.pone.0090346. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.R Core Team: R. A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, https://www.r-project.org/ 2015.
- 42.Endelman JB. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome J. 2011;4:250–255. doi: 10.3835/plantgenome2011.08.0024. [DOI] [Google Scholar]
- 43.Yu J, Pressoir G, Briggs WH, Bi IV, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006;38:203–208. doi: 10.1038/ng1702. [DOI] [PubMed] [Google Scholar]
- 44.Li J, Ji L. Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity. 2005;95:221–27. doi: 10.1038/sj.hdy.6800717. [DOI] [PubMed] [Google Scholar]
- 45.Chengsong Z, Jianming Y. Nonmetric multidimensional scaling corrects for population structure in association mapping with different sample types. Genetics. 2009;182:875–88. doi: 10.1534/genetics.108.098863. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
All data is included as supporting files for the manuscript.