PLOS Computational Biology. 2020 Feb 14;16(2):e1007663. doi: 10.1371/journal.pcbi.1007663

RAINBOW: Haplotype-based genome-wide association study using a novel SNP-set method

Kosuke Hamazaki, Hiroyoshi Iwata*
Editor: Mihaela Pertea
PMCID: PMC7046296  PMID: 32059004

Abstract

Difficulty in detecting rare variants is one of the problems in conventional genome-wide association studies (GWAS). The problem is closely related to the complex gene compositions comprising multiple alleles, such as haplotypes. Several single nucleotide polymorphism (SNP) set approaches have been proposed to solve this problem. These methods, however, have rarely been discussed in connection with haplotypes. In this study, we developed a novel SNP-set method named “RAINBOW” and applied the method to haplotype-based GWAS by regarding a haplotype block as a SNP-set. By combining haplotype block estimation and SNP-set GWAS, haplotype-based GWAS can be conducted without prior information on haplotypes. We prepared 100 datasets of simulated phenotypic data and real marker genotype data of Oryza sativa subsp. indica, and performed GWAS of the datasets. We compared the power of our method, the conventional single-SNP GWAS, the conventional haplotype-based GWAS, and the conventional SNP-set GWAS. Our proposed method was shown to be superior to these in three aspects: (1) controlling false positives; (2) detecting causal variants without relying on linkage disequilibrium when the causal variants were genotyped in the dataset; and (3) showing greater power than the other methods, i.e., it was able to detect causal variants that were not detected by the others, primarily when the causal variants were located very close to each other and the directions of their effects were opposite. By using the SNP-set approach as in this study, we expect that detecting not only rare variants but also genes with complex mechanisms, such as genes with multiple causal variants, can be realized. RAINBOW was implemented as an R package named “RAINBOWR” and is available from CRAN (https://cran.r-project.org/web/packages/RAINBOWR/index.html) and GitHub (https://github.com/KosukeHamazaki/RAINBOWR).

Author summary

Detecting rare variants has been one of the most difficult problems in GWAS. Here, we proposed a novel SNP-set GWAS approach, which is superior in controlling false positives and detecting rare variants compared with conventional approaches, and implemented this method as an R package named “RAINBOWR” (Reliable Association INference By Optimizing Weights with R). In this article, we introduce the application of RAINBOW to haplotype-based GWAS by regarding a haplotype block as a SNP-set, which enables one to perform haplotype-based GWAS without prior haplotype information. We showed that the haplotype-based GWAS with the RAINBOW package succeeded in detecting causal variants with complex mechanisms that were not detected by any other conventional methods. RAINBOW also offers a fast single-SNP GWAS method. In addition, RAINBOW offers not only a SNP-set GWAS that can be applied in general situations but also a faster one for the restricted case in which a linear kernel is used to construct the Gram matrix of the SNP-set of interest. We also used Rcpp (functions for using C++ in R) for the RAINBOW implementation to achieve faster computation. We believe that our package will lead to the detection of novel genes associated with biologically and agronomically essential traits.


This is a PLOS Computational Biology Software paper.

Introduction

With the decreasing cost and increasing throughput of next-generation sequencing, the number of accessions that can be used for genome-wide association study (GWAS) is increasing [1–3]. Using such large sequencing data, GWAS is now widely used not only in human but also in plant and animal genetics and breeding, and has identified novel genes related to important agronomic traits [4–6]. One example of large next-generation sequencing data is that of the “3,000 rice genomes project” as used in this study [7, 8], data from which are available in the “Rice SNP-Seek Database” [9–11]. GWAS results using these data have already been reported [12].

Despite the enhancement of such public data, the conventional GWAS method still faces obstacles in the detection of unknown candidate genes. One common example is its difficulty in detecting rare alleles or rare variants. One problem caused by rare variants is that the non-causal markers that have a strong linkage disequilibrium (LD) with one causal rare variant indicate a higher detection power than the true causal rare variant, which may interfere with the detection of the true causal variant. This phenomenon is known as “synthetic association”, and often happens when the minor allele frequency (MAF) of the non-causal marker is higher than that of the true rare variant [13]. This problem is closely related to the complex gene compositions comprising multiple alleles such as haplotypes because genes related to important agronomic traits often consist of multiple rare alleles, and this is why haplotypes are hard to detect using GWAS [14].

Several methods have been proposed to solve this problem. The sequence kernel association test (SKAT) is one of the methods used to detect rare variants, and has been used mainly in human genomics [15]. The SKAT employs a single nucleotide polymorphism (SNP) set approach, which tests multiple SNPs in each SNP-set at the same time. The SKAT evaluates the significance of the variance explained by a SNP-set of interest as a random effect using a mixed effect model approach [16, 17]. The fatal drawback of the original SKAT is that the model does not take the effects of family relatedness into account as a random effect, which results in false positives for GWAS in materials with a strong population structure or family relatedness, such as in the world collection of rice germplasm used in this study. Several methods were also proposed to overcome another SKAT drawback: a weighting scheme of the SKAT for rare and common variants can lead to loss of power of common variants, but their models also do not include the term for correcting the confounding effects of family relatedness [18, 19].

To solve the fatal drawbacks of the original SKAT, several methods whose models include the term of family relatedness as random effects to control false positives have been previously proposed [20–22]. From a statistical point of view, these methods usually perform the score test [23], which is a computationally efficient method since it requires variance component estimation only for the null model. In terms of the detection power, however, the score test is not necessarily the best method for testing the random effects in the mixed effects model [24]. The likelihood-ratio (LR) test [25, 26] is another candidate used to test the variance of a SNP-set of interest, and several methods have been proposed that use the LR test for SNP-set GWAS in family samples [24, 27]. In particular, Lippert et al. implemented a computationally efficient SNP-set GWAS method using the LR test, and reported that the LR test showed greater power than the score test [24]. Although their method is efficient, Lippert et al. mainly used a linear kernel for constructing the Gram matrix from each SNP-set, and therefore other kernels, such as a Gaussian kernel or an exponential kernel, cannot be used for constructing the Gram matrix in their method.

Haplotype-based approaches, which try to improve the detection power of causal haplotypes, make sense from the point of view that a gene functions as one gene set, not as each SNP in the gene set. These haplotype-based approaches are expected to control false positives better than the single-SNP method because the haplotype-based methods focus on the entire haplotype block, not on each SNP in the haplotype block. These methods are also expected to reveal the complex mechanism of causal haplotypes that cannot be detected when focusing on one SNP, such as repulsion states between two causal quantitative trait loci (QTL) located close to each other. However, only a few methods for haplotype-based GWAS have so far been proposed. In plant genomics, Yano et al. performed a haplotype-based GWAS by testing the effects of haplotypes while regarding dummy variables of haplotype groups as fixed effects, and found new candidate genes related to heading date for rice [28]. Other approaches have been proposed in animal genomics, which estimated ancestral haplotype effects by regarding them as random effects [29, 30]. In their methods, each pairwise element of a covariance matrix for the random effects was determined as 1 if individuals belong to the same ancestral haplotype, and 0 if otherwise. However, these conventional haplotype-based GWAS methods require haplotype information a priori, and it is not so easy to apply these methods at the genome-wide level.

In this study, we extended the multi-kernel mixed effects model more generally to take family relatedness into account, while enabling computational speed-up for some limited cases, and developed a novel SNP-set GWAS approach named RAINBOW (Reliable Association INference By Optimizing Weights). We also estimated haplotype blocks from genome-wide marker genotype data, and used them as SNP-sets for analysis with RAINBOW to enable haplotype-based GWAS without prior haplotype information.

Materials and methods

All statistical analyses in this study were conducted using R version 3.6.0 [31], and figures were produced using the R package ggplot2 version 3.2.1 [32]. Our R package, RAINBOWR, was implemented using the R packages Rcpp version 1.0.2 [33–35] and RcppEigen version 0.3.3.5.0 [36] to reduce the computational time required for solving the multi-kernel mixed-effects model described below. The overall simulation framework in this study is shown in S1 Fig as a flow chart.

Methods for RAINBOW

In this subsection, we describe the basic idea of RAINBOW.

RAINBOW model

The RAINBOW model can be written as

$$y = X\beta + Z_c u_c + Z_{r_i} u_{r_i} + \epsilon, \tag{1}$$

where $y$ is an $n \times 1$ vector of phenotypic values, $X\beta$ is an $n \times 1$ vector of fixed effects including an intercept, a term to correct for the population structure and other covariates, $Z_c u_c$ and $Z_{r_i} u_{r_i}$ are $n \times 1$ vectors of random effects, and $\epsilon$ is an $n \times 1$ vector of residual errors. Here $\beta$ is a $p \times 1$ vector of fixed effects, where $p$ is the number of fixed effects. $u_c$ and $u_{r_i}$ are $m_c \times 1$ and $m_{r_i} \times 1$ vectors of genotypic values, respectively, where $m_c$ is the number of genotypes for additive polygenetic effects and $m_{r_i}$ is the number of genotypes for the $i$-th SNP-set of interest. $X$, $Z_c$ and $Z_{r_i}$ are $n \times p$, $n \times m_c$ and $n \times m_{r_i}$ design matrices that correspond to $\beta$, $u_c$ and $u_{r_i}$, respectively. As shown in Eq 2, we assume that the polygenetic effect $u_c$ follows a multivariate normal distribution whose variance-covariance matrix is proportional to the additive numerator relationship matrix $K_c$.

$$u_c \sim \mathrm{MVN}(0, K_c \sigma_c^2), \tag{2}$$

where $\sigma_c^2$ is the additive genetic variance to be estimated in the “Estimation of variance components” section, and here the $m_c \times m_c$ matrix $K_c = A$, where $A$ is the known additive genetic relationship matrix estimated from marker genotype data $W_c$ [37].

We also assume that the random effects from the $i$-th SNP-set of interest, $u_{r_i}$, follow a multivariate normal distribution whose variance-covariance matrix is proportional to the Gram matrix $K_{r_i}$.

$$u_{r_i} \sim \mathrm{MVN}(0, K_{r_i} \sigma_{r_i}^2), \tag{3}$$

where $\sigma_{r_i}^2$ is the genetic variance for the $i$-th SNP-set to be estimated in the “Estimation of variance components” section, and $K_{r_i}$ is the known $m_{r_i} \times m_{r_i}$ Gram matrix estimated from marker genotype data $W_{r_i}$ belonging to the $i$-th SNP-set. We offer a linear, an exponential and a Gaussian kernel for the Gram matrix $K_{r_i}$, and faster computation can be realized for the linear kernel case (Supplementary Note in S1 Appendix) [24].
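As a rough illustration of how a Gram matrix can be built from the genotypes of one SNP-set, the following Python sketch computes a linear, a Gaussian and an exponential kernel from a small genotype matrix coded as -1, 0 and 1. The scaling by the number of SNPs, the bandwidth parameter `h` and all function names are assumptions made for this illustration; the definitions actually used in RAINBOWR are given in S1 Appendix and may differ.

```python
import math

def linear_kernel(W):
    """Linear kernel K[i][j] = sum_k W[i][k] * W[j][k] / m (scaling by m is an assumption)."""
    m = len(W[0])
    return [[sum(a * b for a, b in zip(wi, wj)) / m for wj in W] for wi in W]

def squared_distance(W):
    """Squared Euclidean distance between genotypes, averaged over the m SNPs."""
    m = len(W[0])
    return [[sum((a - b) ** 2 for a, b in zip(wi, wj)) / m for wj in W] for wi in W]

def gaussian_kernel(W, h=1.0):
    """Gaussian kernel exp(-h * d^2); the bandwidth h is a hypothetical parameter."""
    return [[math.exp(-h * d2) for d2 in row] for row in squared_distance(W)]

def exponential_kernel(W, h=1.0):
    """Exponential kernel exp(-h * d), with d the square root of the averaged squared distance."""
    return [[math.exp(-h * math.sqrt(d2)) for d2 in row] for row in squared_distance(W)]

# Toy SNP-set: four genotypes (rows) scored at three SNPs (columns).
W_ri = [
    [-1, -1, 1],
    [-1, -1, 1],
    [1, 1, -1],
    [1, 0, -1],
]
K_lin = linear_kernel(W_ri)
K_gau = gaussian_kernel(W_ri)
K_exp = exponential_kernel(W_ri)
```

In this toy example the first two genotypes are identical, so every kernel assigns them the maximum similarity, while the distance-based kernels decay smoothly as genotypes diverge.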

Finally, the residual term is assumed to identically and independently follow a normal distribution as shown in the following equation.

$$\epsilon \sim \mathrm{MVN}(0, I_n \sigma_e^2), \tag{4}$$

where $I_n$ is an $n \times n$ identity matrix and $\sigma_e^2$ is estimated in the “Estimation of variance components” section.

Estimation of variance components

The variance components were estimated by maximum-likelihood (ML) [26, 38] and restricted maximum-likelihood (REML) [39]. Here we explain how to obtain ML and REML estimates of Eq 1 for the general Kri.

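First we estimated the weights (which we denote by $w_c$ and $w_{r_i}$) between the genetic variances ($\sigma_c^2$ and $\sigma_{r_i}^2$) by the following algorithm.

  1. Setting initial parameters for $w_c$ and $w_{r_i}$:
    $$w_c = w_{r_i} = \frac{1}{2}. \tag{5}$$
  2. Computing the following $n \times n$ matrix $K_s$:
    $$K_s = Z_c K_c Z_c^T w_c + Z_{r_i} K_{r_i} Z_{r_i}^T w_{r_i}. \tag{6}$$
  3. Solving the following single-kernel linear mixed model (LMM) by using EMMA (efficient mixed model association) or GEMMA (genome-wide efficient mixed model association) [40, 41].
    $$y = X\beta + u_s + \epsilon, \tag{7}$$
    where
    $$u_s \sim \mathrm{MVN}(0, K_s \sigma_s^2). \tag{8}$$
  4. Computing the full log likelihood ($l_F$) or the restricted log likelihood ($l_R$) of Eq 7 by using the estimated parameters $\hat{\beta}$, $\hat{\sigma}_s^2$ and $\hat{\sigma}_e^2$:
    $$l_F(y; \hat{\beta}, \hat{\sigma}_s, \hat{\delta}) = \frac{1}{2}\left[-n \log(2\pi\hat{\sigma}_s^2) - \log|\hat{H}| - \frac{1}{\hat{\sigma}_s^2}(y - X\hat{\beta})^T \hat{H}^{-1} (y - X\hat{\beta})\right], \tag{9}$$
    $$l_R(y; \hat{\sigma}_s, \hat{\delta}) = l_F(y; \hat{\beta}, \hat{\sigma}_s, \hat{\delta}) + \frac{1}{2}\left[p \log(2\pi\hat{\sigma}_s^2) + \log|X^T X| - \log|X^T \hat{H}^{-1} X|\right]. \tag{10}$$
    Here $\hat{H}$ is
    $$\hat{H} = \frac{\hat{V}}{\hat{\sigma}_s^2} = K_s + \hat{\delta} I_n, \tag{11}$$
    where $\hat{V}$ is the phenotypic variance-covariance matrix and $\hat{\delta} = \hat{\sigma}_e^2 / \hat{\sigma}_s^2$.
  5. Optimizing $w_c$ and $w_{r_i}$ by maximizing the full/restricted log likelihood with the L-BFGS optimization method, repeating steps 2–4 [42].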
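The weight search in the steps above can be sketched on a toy problem. The Python code below forms $K_s$ (Eq 6) and $\hat{H}$ (Eq 11), evaluates the full log likelihood (Eq 9), and picks the weight with the highest likelihood. It makes several simplifying assumptions that are not taken from the text: the weights are constrained to sum to one ($w_{r_i} = 1 - w_c$), $\sigma_s^2$ and $\delta$ are held fixed instead of being re-estimated by EMMA/GEMMA at each step, and a coarse grid replaces L-BFGS. All input values are hypothetical.

```python
import math

def combine_kernels(G_c, G_r, w_c, w_r):
    """Eq 6, with G_c = Z_c K_c Z_c^T and G_r = Z_ri K_ri Z_ri^T precomputed."""
    n = len(G_c)
    return [[w_c * G_c[i][j] + w_r * G_r[i][j] for j in range(n)] for i in range(n)]

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def full_log_likelihood(resid, K_s, sigma_s2, delta):
    """Eq 9, where resid = y - X beta_hat and H = K_s + delta * I_n (Eq 11)."""
    n = len(resid)
    H = [[K_s[i][j] + (delta if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(H)
    log_det_H = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    z = []  # forward substitution L z = resid, so resid^T H^-1 resid = z^T z
    for i in range(n):
        z.append((resid[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in z)
    return 0.5 * (-n * math.log(2.0 * math.pi * sigma_s2) - log_det_H - quad / sigma_s2)

# Toy inputs for four individuals (hypothetical values).
G_c = [[1.0, 0.2, 0.0, 0.0], [0.2, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.2], [0.0, 0.0, 0.2, 1.0]]
G_r = [[1.0, 1.0, -1.0, -1.0], [1.0, 1.0, -1.0, -1.0],
       [-1.0, -1.0, 1.0, 1.0], [-1.0, -1.0, 1.0, 1.0]]
resid = [1.1, 0.9, -1.0, -1.2]

# Grid search over w_c in [0, 1] with w_ri = 1 - w_c, sigma_s2 = 1 and delta = 0.5 held fixed.
grid = [k / 20 for k in range(21)]
best_w_c = max(
    grid,
    key=lambda w: full_log_likelihood(resid, combine_kernels(G_c, G_r, w, 1.0 - w), 1.0, 0.5),
)
```

Because the search is confined to a bounded interval, every candidate weight yields a valid covariance matrix, which is the practical appeal of weighting the kernels before fitting a single-kernel model.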

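After estimating the weights $w_c$ and $w_{r_i}$, we estimated the variance components ($\sigma_s^2$ and $\sigma_e^2$) of the model in Eqs 7 and 8 by EMMA/GEMMA using $\hat{w}_c$ and $\hat{w}_{r_i}$. Then we obtained $\hat{\sigma}_c^2 = \hat{w}_c \hat{\sigma}_s^2$ and $\hat{\sigma}_{r_i}^2 = \hat{w}_{r_i} \hat{\sigma}_s^2$.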

Our fitting method, as described above, is a two-step approach, which first estimates the weights of genetic variances, and then estimates the variance components of the model shown in Eqs 7 and 8 by EMMA/GEMMA with the estimated weights. On the other hand, some fitting methods that directly estimate the variance components for Eq 1 via AIREML (average information REML) [43] have also been proposed and implemented in some packages/software [44, 45]. The advantage of our two-step approach compared with the direct estimation approach via AIREML is that the search space of the weights is limited to the interval [0, 1], and the convergence is relatively warranted [46] even when the heritability is too low/high.

Likelihood ratio test for GWAS

To test the significance of each SNP-set, we performed the LR test of whether $\sigma_{r_i}^2 = 0$ or not. As the null hypothesis, we assumed the following model, which does not include the term for SNP-set effects.

$$y = X\beta + Z_c u_c + \epsilon. \tag{12}$$

In contrast, as the alternative hypothesis, the multi-kernel linear mixed model (MKLMM) of Eq 1 was assumed. Therefore, we computed the following deviance after the estimation of variance components for each SNP-set.

$$D = 2 \times (\hat{l}_{R,\mathrm{model}} - \hat{l}_{R,\mathrm{null}}), \tag{13}$$

where $\hat{l}_{R,\mathrm{model}}$ is the maximum of the restricted log likelihood for the model of Eq 1 and $\hat{l}_{R,\mathrm{null}}$ is the maximum of the restricted log likelihood for the model of Eq 12.

Finally, we tested the significance of $\sigma_{r_i}^2$ and calculated the p-value by assuming that the deviance in Eq 13 followed a mixture of two chi-square distributions with different degrees of freedom [47, 48].

$$D \sim \pi_0 \chi_0^2 + (1 - \pi_0) \chi_1^2, \tag{14}$$

where $\pi_0$ is the mixture parameter, and here we used $\pi_0 = 1/2$.
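A p-value under the mixture in Eq 14 can be computed with the standard library alone, as in the following Python sketch. It uses the facts that a chi-square distribution with zero degrees of freedom is a point mass at zero and that the upper tail of a chi-square distribution with one degree of freedom equals `erfc(sqrt(x / 2))`. The function names are hypothetical, and clamping a slightly negative deviance to zero is an assumption of this sketch, not something stated in the text.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def lr_test_pvalue(l_model, l_null, pi0=0.5):
    """p-value for the deviance D = 2 * (l_model - l_null) under Eq 14."""
    deviance = max(2.0 * (l_model - l_null), 0.0)  # clamp numerical noise (assumption)
    if deviance == 0.0:
        return 1.0  # both mixture components place all their mass at or above zero
    # The point mass at zero contributes nothing to the tail beyond a positive deviance.
    return (1.0 - pi0) * chi2_sf_df1(deviance)

# Hypothetical restricted log likelihoods for one SNP-set.
p_value = lr_test_pvalue(l_model=-100.0, l_null=-104.0)
```

With $\pi_0 = 1/2$, the p-value for any positive deviance is half of the usual one-degree-of-freedom chi-square p-value, reflecting that the variance is tested on the boundary of its parameter space.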

Materials and simulations

Genotype data

In this study, 414 accessions of Oryza sativa subsp. indica were collected from “the 3,000 rice genomes project” (S1 Table) [7]. We used a marker genotype consisting of core SNPs defined by the Rice SNP-Seek Database as “404k CoreSNP Dataset”. Missing genotypes were imputed using Beagle version 5.0 [49, 50]. Using VCFtools version 0.1.15 [51], we retained only sites that were bi-allelic over all accessions and had a MAF ≥ 0.025. In the following analysis, genotypes are represented as -1 (homozygous for the reference allele), 1 (homozygous for the alternative allele) or 0 (heterozygous for the reference and alternative alleles). As a result of this data processing, marker genotypes with 112,630 SNPs were used for the following simulation study.
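To make the genotype coding and the MAF filter concrete, the Python sketch below computes the minor allele frequency from -1/0/1 codes and keeps markers at or above a threshold. In the study this filtering was done with VCFtools; the code only illustrates the arithmetic, and the marker names and function names are hypothetical.

```python
def minor_allele_frequency(codes):
    """MAF from genotypes coded -1 (ref/ref), 0 (ref/alt) and 1 (alt/alt)."""
    # Each code c carries c + 1 copies of the alternative allele (0, 1 or 2).
    alt_freq = sum(c + 1 for c in codes) / (2 * len(codes))
    return min(alt_freq, 1.0 - alt_freq)

def filter_by_maf(genotypes, threshold=0.025):
    """Keep markers whose MAF is at least the threshold."""
    return {
        name: codes
        for name, codes in genotypes.items()
        if minor_allele_frequency(codes) >= threshold
    }

# Toy marker data (hypothetical): marker name -> codes over accessions.
genotypes = {
    "snp_common": [-1] * 9 + [1],   # alternative allele frequency 0.1
    "snp_rare": [-1] * 99 + [0],    # alternative allele frequency 0.005
}
kept = filter_by_maf(genotypes)
```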

Estimation of haplotype block

To perform haplotype-based GWAS by regarding each haplotype block as a SNP-set, haplotype blocks were estimated from marker genotype data by using PLINK 1.9 [52–54]. As a result of estimation, we obtained 15,275 haplotype blocks consisting of 78,237 SNPs.

Simulation of phenotype data

We considered two scenarios to validate our novel haplotype-based GWAS approach. In both scenarios, phenotypic values were simulated as follows.

y=X1β1+X2β2+X3β3+Zu+e, (15)

where y is the vector of simulated phenotypic values of 414 accessions, X1, X2 and X3 correspond to three quantitative trait nucleotides (QTNs) scored as -1, 0 or 1 (hereinafter, referred to as “QTN1”, “QTN2” and “QTN3” respectively), β1, β2 and β3 are scalars representing the effects of the three QTNs, u is the vector of polygenetic effects, and e is the vector of the residuals.

Here, QTN1 and QTN2 were randomly selected from all genome-wide SNPs to satisfy that they belonged to the same haplotype block that harbored more than 4 SNPs. QTN3 was randomly selected from all the SNPs. We assumed that the effects of QTN1 and QTN2 had a variance 4 times greater than that of the effects of QTN3 to mainly check the detection power for the haplotype block. More details about the other terms are described in S1 Appendix.

The difference between two scenarios is based on the directions of the two QTN effects β1 and β2. Scenario 1 assumed that the directions of two effects were identical. That is,

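$$\beta_1 = \begin{cases} \beta_2 & (\rho_{12} \geq 0) \\ -\beta_2 & (\rho_{12} < 0) \end{cases}, \tag{16}$$

where $\rho_{12}$ is Pearson’s correlation coefficient between $X_1$ and $X_2$. We call this scenario “coupling”.

Conversely, scenario 2 assumed that the directions of the two effects were opposite. That is,

$$\beta_1 = \begin{cases} -\beta_2 & (\rho_{12} \geq 0) \\ \beta_2 & (\rho_{12} < 0) \end{cases}. \tag{17}$$

We call scenario 2 “repulsion”.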
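The sign rule that distinguishes the two scenarios can be written compactly. The following Python sketch computes Pearson’s correlation between the two QTN genotype vectors and returns the effect of QTN1 for a given effect of QTN2 under each scenario. The function names and toy genotypes are hypothetical, and the sketch follows the reading of Eqs 16 and 17 in which coupling means that the two effects act in the same direction once the sign of the correlation is taken into account.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def qtn1_effect(beta2, rho12, scenario):
    """Effect of QTN1 given the effect of QTN2 and the correlation between the QTNs."""
    positively_linked = rho12 >= 0
    if scenario == "coupling":
        return beta2 if positively_linked else -beta2
    if scenario == "repulsion":
        return -beta2 if positively_linked else beta2
    raise ValueError("scenario must be 'coupling' or 'repulsion'")

# Toy genotypes of QTN1 and QTN2 (hypothetical); here they are negatively correlated.
X1 = [-1, -1, 1, 1, 0]
X2 = [1, 1, -1, -1, 0]
rho12 = pearson(X1, X2)
beta1_coupling = qtn1_effect(2.0, rho12, "coupling")
beta1_repulsion = qtn1_effect(2.0, rho12, "repulsion")
```

Under repulsion the two contributions partly cancel in the phenotype, which is why this scenario is harder for methods that test each SNP separately.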

Evaluation of RAINBOW

Comparison of four methods

To validate our novel approach, we compared the following four methods: a single-SNP GWAS [55], a haplotype-based GWAS introduced by Yano et al. (hereinafter, referred to as “HGF”) [28], the SKAT [15] as a SNP-set approach, and our novel approach, RAINBOW. For all methods, to account for the population structure, the two eigen vectors (which correspond to the top two eigen values) of the additive genetic relationship matrix were included in the model as fixed effects. The details of these four methods are described in S1 Appendix.

Evaluation of the simulation results

The value of −log10(p) of each marker or haplotype block was calculated by the four GWAS methods 100 times for the two simulated scenarios, coupling and repulsion. In this study, the following summary statistics were used to evaluate the simulation results.

−log10(p) and −log10(pa). The first summary statistic is −log10(p) of each causal SNP or haplotype block itself. For haplotype-based GWAS methods, HGF, SKAT and RAINBOW, the significance of β1 and β2 was represented by −log10(p) of the causal haplotype block to which X1 and X2 belong. In the single-SNP GWAS method, the −log10(p) of β1 and β2 were calculated separately, even though these SNPs were in the same haplotype. To compare the single-SNP GWAS method with the haplotype-based GWAS methods, the −log10(p) values were averaged over β1 and β2.

As some of these methods showed the results of inflated −log10(p), we defined the following summary statistic to evaluate the degree of inflation.

$$\mathrm{inflator} = \frac{1}{L} \sum_{l=1}^{L} \left(-\log_{10}(p_{\mathrm{false},l})\right), \tag{18}$$

where $p_{\mathrm{false},l}$ is the $l$-th p-value among the false positives when they are arranged in increasing order. In this study, $L$ was set to 10. Then we adjusted −log10(p) of the causal by using the inflator (Eq 18) as follows.

$$-\log_{10}(p_a) = -\log_{10}(p) - \mathrm{inflator}, \tag{19}$$

where $p_a$ is the p-value adjusted by the inflator.
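The adjustment in Eqs 18 and 19 amounts to subtracting the average of the strongest non-causal signals from the causal signal. A minimal Python sketch follows; the function names and the toy p-values are hypothetical, and it assumes that the “false positives” are simply the non-causal markers or blocks.

```python
import math

def inflator(false_pvalues, L=10):
    """Eq 18: mean of -log10(p) over the L smallest non-causal p-values."""
    smallest = sorted(false_pvalues)[:L]
    return sum(-math.log10(p) for p in smallest) / len(smallest)

def adjusted_neg_log10_p(p_causal, false_pvalues, L=10):
    """Eq 19: -log10(p) of the causal minus the inflator."""
    return -math.log10(p_causal) - inflator(false_pvalues, L)

# Toy example (hypothetical p-values), using L = 2 for brevity.
false_pvalues = [0.5, 1e-3, 0.9, 1e-2]
inflation = inflator(false_pvalues, L=2)
adjusted = adjusted_neg_log10_p(1e-6, false_pvalues, L=2)
```

A causal signal that only matches the level of the strongest false signals ends up near zero after adjustment, so inflated methods no longer appear more powerful than they are.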

Here, we calculated each summary statistic in two ways. The first method is to calculate each summary statistic by directly using −log10(p) of each causal SNP / haplotype block. The other method is to calculate the summary statistics by regarding multiple SNPs or haplotype blocks within the extent of the LD as one set. In this study, SNPs or haplotype blocks were treated as one set when they were within 300 kb of the focal SNP or haplotype block and their squared correlation coefficient with the focal SNP or haplotype block was 0.35 or more. The highest value of −log10(p) in the LD region was assumed to represent the values of the SNPs or haplotype blocks within the extent of the LD.
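The two conditions that define an LD set can be illustrated for single SNPs as follows. This Python sketch collects all markers on the same chromosome that lie within 300 kb of a focal marker and have a squared Pearson correlation of at least 0.35 with it. The data layout and function names are hypothetical, and how the correlation is defined for haplotype blocks is not covered here.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def ld_set(focal, markers, max_dist=300_000, min_r2=0.35):
    """Names of markers regarded as one set with the focal marker."""
    chrom, pos, codes = markers[focal]
    members = []
    for name, (c, p, g) in markers.items():
        if c == chrom and abs(p - pos) <= max_dist and pearson_r2(codes, g) >= min_r2:
            members.append(name)
    return members

# Toy markers (hypothetical): name -> (chromosome, position in bp, genotype codes).
markers = {
    "s1": (1, 100_000, [-1, -1, 1, 1, 0, 1]),
    "s2": (1, 250_000, [-1, -1, 1, 1, 0, 1]),   # close and in complete LD
    "s3": (1, 500_000, [-1, -1, 1, 1, 0, 1]),   # in LD but 400 kb away
    "s4": (1, 120_000, [1, -1, 1, -1, 0, 0]),   # close but uncorrelated
    "s5": (2, 100_000, [-1, -1, 1, 1, 0, 1]),   # different chromosome
}
members = ld_set("s1", markers)
```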

Recall, precision and F-measure

We calculated the recall, precision and F-measure means as other summary statistics to evaluate the GWAS results. These summary statistics were calculated from the numbers of SNPs or haplotype blocks that were true positives, false positives, false negatives and true negatives. Here, we regarded a SNP or haplotype block as “positive” when that SNP or haplotype block exceeded the threshold. In this study, the value of −log10(p) so that the FDR (false discovery rate) was 0.01 was set as the threshold by using the Benjamini-Hochberg method [56, 57]. In addition, these three summary statistics, recall, precision and F-measure, were calculated by assuming that the highest value of −log10(p) in the LD region represented the values of the SNPs or haplotype blocks within the extent of the LD.

Therefore, recall represents the proportion of causals detected by GWAS. In contrast, precision represents the ratio of the detected SNPs or haplotype blocks that were causals. Finally, F-measure was calculated as the harmonic mean of the recall and the precision, which evaluates the GWAS results comprehensively. The greater these three summary statistics, the better the results of GWAS are. Here we simply took the average of each summary statistic from all the 100 simulation results.
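For a single simulation result, the threshold and the three summary statistics can be sketched as below. The Python code applies the Benjamini-Hochberg step-up rule at a given FDR and then counts true positives, false positives and false negatives. It is a simplified illustration with hypothetical names and data; in particular, it omits the step in which the highest −log10(p) within an LD region represents that region.

```python
def bh_cutoff(pvalues, fdr=0.01):
    """Largest p-value rejected by the Benjamini-Hochberg procedure, or None."""
    m = len(pvalues)
    cutoff = None
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= fdr * rank / m:
            cutoff = p
    return cutoff

def recall_precision_f(pvalues, is_causal, fdr=0.01):
    """Recall, precision and F-measure for one GWAS result."""
    cutoff = bh_cutoff(pvalues, fdr)
    positive = [cutoff is not None and p <= cutoff for p in pvalues]
    tp = sum(1 for pos, c in zip(positive, is_causal) if pos and c)
    fp = sum(1 for pos, c in zip(positive, is_causal) if pos and not c)
    fn = sum(1 for pos, c in zip(positive, is_causal) if not pos and c)
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    total = recall + precision
    f_measure = 2.0 * recall * precision / total if total > 0 else 0.0
    return recall, precision, f_measure

# Toy result (hypothetical): five markers, the first two of which are causal.
pvalues = [1e-6, 1e-5, 0.002, 0.5, 0.8]
is_causal = [True, True, False, False, False]
recall, precision, f_measure = recall_precision_f(pvalues, is_causal)
```

In this toy case both causal markers and one non-causal marker pass the threshold, so recall is 1, precision is 2/3 and the F-measure is 0.8.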

AUC for regions around causals

We calculated the mean of the AUC (area under the curve) for regions around the causals as a summary statistic. AUC refers to the area under the ROC (receiver operating characteristic) curve obtained by plotting the false positive rate on the horizontal axis and the true positive rate on the vertical axis when the threshold is varied. In this study, the AUC was calculated for the SNPs or haplotype blocks near the causal SNP / haplotype block (QTN1 and QTN2). In other words, the non-causal markers that had a strong LD with the causal SNP / haplotype block were regarded as false positives under this summary statistic. Therefore, this summary statistic indicates the extent to which the causal itself can be detected by GWAS without relying on the LD. Here, when taking the average of the AUCs obtained from the simulation results, two methods were used, either using all the 100 results or only using the results whose QTN1 and QTN2 were “detected”. Here, QTN1 and QTN2 were regarded as “detected” if −log10(pa) ≥ 1.5 for each method.
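The AUC over the markers around a causal can be computed without drawing the ROC curve, because the area equals the probability that a causal marker receives a higher score than a non-causal one, with ties counted as one half. The Python sketch below applies this rank-based identity to −log10(p) values; the function name and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count 0.5)."""
    pos = [s for s, is_causal in zip(scores, labels) if is_causal]
    neg = [s for s, is_causal in zip(scores, labels) if not is_causal]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))

# Toy region (hypothetical): -log10(p) of one causal block and three neighbours in LD.
labels = [True, False, False, False]
auc_clean = auc([5.0, 1.0, 2.0, 0.5], labels)     # causal has the top score
auc_shadowed = auc([1.5, 1.0, 2.0, 0.5], labels)  # a neighbour outscores the causal
```

An AUC close to 1 in such a region means the causal itself, and not a neighbouring marker in LD, carries the strongest signal.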

Availability of data and material

RAINBOW was implemented as an R package named “RAINBOWR”, which offers the single-SNP GWAS method [41, 55] and a novel SNP-set method that includes faster computation for the linear kernel [24]. A stable version of RAINBOWR is available from CRAN (Comprehensive R Archive Network), https://cran.r-project.org/web/packages/RAINBOWR/index.html. The latest version of RAINBOWR is also available from the “KosukeHamazaki/RAINBOWR” repository on GitHub, https://github.com/KosukeHamazaki/RAINBOWR. Source codes for the R package RAINBOWR are deposited in S1 File. The datasets generated and analyzed during the current study and their source codes are also available from the “KosukeHamazaki/HGRAINBOW” repository on GitHub, https://github.com/KosukeHamazaki/HGRAINBOW.

Results

The detection power of four methods

The detection power of the four methods was evaluated by the value of −log10(p) and −log10(pa) of QTN1 and QTN2 for the two models, coupling and repulsion (Fig 1). RAINBOW outperformed the other methods when the significance was evaluated by the causal itself (Fig 1a, 1c, 1e and 1g). However, when the significance was evaluated by the highest values of SNPs or haplotypes within the extent of the LD, other methods, e.g., HGF (k = 2, k-medoids method), showed a greater detection power than RAINBOW (Fig 1b). When the detection power was evaluated by taking the extent of inflation into account, RAINBOW showed as great a power as HGF (k = 2, 3) even if the significance was evaluated by the unit of the LD block (Fig 1d). Moreover, although the detection power of all the GWAS methods for the repulsion scenario was less than that for the coupling scenario, the tendency for RAINBOW to outperform the other methods was clearer for the repulsion scenario than the coupling scenario (Fig 1). Finally, as compared with the other haplotype-based GWAS methods, RAINBOW showed smaller variation among iterations, indicating that the causal variants can be stably detected (Fig 1).

Fig 1. The detection power of each GWAS method.

Boxplot of the detection power evaluated by −log10(p) and −log10(pa). a-d: The results for the “coupling” scenario. e-h: The results for the “repulsion” scenario. a,b,e,f: The results evaluated by −log10(p) with the scale on the vertical axis aligned in these four figures. c,d,g,h: The results evaluated by −log10(pa) with the scale on the vertical axis aligned in these four figures. a,c,e,g: The results evaluated by the unit of the causal SNP or haplotype block itself. b,d,f,h: The results evaluated by the unit of the regions within the extent of LD. The abbreviation of each method is as follows. R: RAINBOW. SS: Single-SNP GWAS. H2k-H4p: HGF methods. The numbers in the method names correspond to the numbers of the groups they assume. The last letters of the methods are “k” or “p”. “k” corresponds to the k-medoids method and “p” corresponds to UPGMA method for the grouping method. SK: SKAT.

The detection power for QTN3 was also evaluated. The single-SNP GWAS method showed a greater power than RAINBOW when evaluated by −log10(p) (a,b,e,f in S2 Fig). However, if the detection power was evaluated by −log10(pa), RAINBOW showed as great a power as single-SNP GWAS (c,d,g,h in S2 Fig). Contrary to the results for QTN1 and QTN2, the detection power of all the GWAS methods for the repulsion scenario was greater than for the coupling scenario (S2 Fig).

Recall, precision and F-measure

The characteristics of each GWAS method were evaluated by the recall, precision and F-measure means (Fig 2). For the mean of recall, the HGF methods and SKAT showed higher values than RAINBOW and single-SNP GWAS for both scenarios (Fig 2). However, the haplotype-based GWAS methods other than RAINBOW showed low precision. That is, these methods may cause too many false positives. In contrast, RAINBOW and single-SNP GWAS showed higher precision than the remaining methods, and RAINBOW showed the highest precision in both scenarios. From the results for the three summary statistics, RAINBOW also showed the highest value for F-measure among the methods. In particular, for the repulsion scenario, the recall of RAINBOW was also higher than that of single-SNP GWAS, which resulted in the large difference of F-measure between these two methods (Fig 2b). A similar tendency was confirmed when the threshold was instead determined by Bonferroni’s correction [58] at the significance level α = 0.01 (S3 Fig).

Fig 2. Recall, precision and F-measure of each GWAS method.

Bar plot of the mean of each summary statistic for 100 simulation results. The red bars show the results for recall, the green bars show the results for precision, and the blue bars show the results for F-measure. a: Results for the coupling scenario. b: Results for the repulsion scenario. The abbreviations of each method are the same as those of Fig 1.

To compare the three summary statistics of the two scenarios in more detail, these values for each QTN were also calculated (S4 Fig). For both scenarios, RAINBOW showed the highest recall for QTN1 and QTN2 among the methods (a, b in S4 Fig). In particular, RAINBOW outperformed the other methods in all summary statistics for QTN1 and QTN2 for the repulsion scenario. However, it showed lower recall for QTN3 than the other methods (c, d in S4 Fig). In particular, the recall of RAINBOW for QTN3 was 0 for the coupling scenario (c in S4 Fig). In addition, the three summary statistics of QTN3 for the repulsion scenario were greater than those for the coupling scenario in almost all the methods (c, d in S4 Fig). A similar trend was confirmed even when the threshold was instead determined by Bonferroni’s correction at the significance level α = 0.01 (S5 Fig).

AUC for regions around causals

To evaluate how well the causal itself can be detected by GWAS without relying on the LD, the AUC means for regions around the causals (QTN1 and QTN2) were compared (Fig 3). The mean of AUC was almost the same when using all simulation results or using only the cases in which QTN1 and QTN2 were detected, although the value of the latter was slightly larger than that of the former in some methods. The results show that RAINBOW outperformed the other methods in both scenarios (Fig 3). In particular, the AUC mean of the single-SNP GWAS method in the repulsion scenario was much smaller than that in the coupling scenario, while RAINBOW was able to maintain a high AUC even in the repulsion scenario (Fig 3b).

Fig 3. AUC for regions around causals.

Bar plot of the mean of AUC for regions around causals. This summary statistic indicates the extent to which the causal itself can be detected by GWAS without relying on the LD. The red bars show the results for the means of 100 simulation results and the blue bars show the results for the means of the simulation results whose QTN1 and QTN2 were detected. a: Results for the coupling scenario. b: Results for the repulsion scenario. The abbreviations of each method are the same as those of Fig 1.

Examples in the repulsion scenario

Of the 100 simulations for the repulsion scenario, there were 7 cases in which QTN1 and QTN2 were detected only by RAINBOW. These cases were selected to satisfy three conditions: −log10(pa,R) ≥ 1.5, −log10(pa,O) ≤ 1.2, and a recall for QTN1 and QTN2 equal to 1. Here, pa,R represents the adjusted p-value of RAINBOW and pa,O represents the adjusted p-value of all the other methods. Although the same analysis was done for the other methods, no method satisfied the three conditions described above. One example of these cases (iteration 48) is shown by comparing the four GWAS methods: RAINBOW, single-SNP GWAS, HGF (the number of groups is 2, the grouping method is UPGMA) and SKAT (Fig 4). The Manhattan plot shows that RAINBOW succeeded in detecting the causal haplotype block (of QTN1 and QTN2) that was not detected by the other methods. Although both QTN1 and QTN2 were also detected by the single-SNP method in one case (iteration 85), the same trend as the results for iteration 48 was seen for the remaining five results (S6 Fig).

Fig 4. An example of GWAS results for the repulsion scenario.

Manhattan plots of 4 GWAS methods (RAINBOW, Single-SNP GWAS, HGF (the number of groups is 2, the grouping method is UPGMA), and SKAT) for one simulation result of the Repulsion model. The black horizontal dashed lines represent the thresholds determined by the Benjamini-Hochberg method (FDR = 0.01) for each result of the Repulsion model. The red vertical dashed lines show the positions of QTN1, QTN2, and the purple ones show the position of QTN3. The red points show −log10(p) of causal SNPs or haplotypes including QTN1 and QTN2, and the purple ones show −log10(p) of QTN3 or haplotypes including QTN3.

Discussion

As shown in the Results section, when −log10(p) was evaluated by the LD block unit for the coupling scenario, RAINBOW did not necessarily outperform the other methods. However, when we took the inflation level of each result into account and evaluated the results with −log10(pa), RAINBOW showed detection power as great as that of the other methods (Fig 1d), which means that RAINBOW controlled false positives better than the other haplotype-based GWAS methods. This can also be seen from the fact that the precision of RAINBOW was much higher than that of the other GWAS methods, including single-SNP GWAS (Fig 2).
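As a reminder of how the summary statistics in Fig 2 relate to false positives, the sketch below computes precision, recall, and F-measure from sets of detected and causal blocks. It uses the textbook definitions and assumes detection is counted per block; the counting rules actually used in the study are given in its Methods, and the block labels here are hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f(detected: Set[str], causal: Set[str]) -> Tuple[float, float, float]:
    """Textbook precision, recall and F-measure for sets of detected vs. causal blocks."""
    true_positives = len(detected & causal)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(causal) if causal else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical block labels for illustration only.
causal_blocks = {"block_1", "block_2", "block_3"}
detected_blocks = {"block_1", "block_2", "block_7", "block_9"}
precision, recall, f_measure = precision_recall_f(detected_blocks, causal_blocks)
print(precision, recall, f_measure)
```

A method with many false positives detects more blocks overall, which lowers precision even when recall is unchanged.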

Moreover, −log10(p) of RAINBOW was the highest when evaluated at the causal SNP/haplotype block itself, which implies that RAINBOW can detect causal haplotype blocks themselves without relying on LD beyond the scope of the haplotype block. This is also confirmed by the AUC for the regions around the causal variants, which was larger for RAINBOW than for any other method (Fig 3).

In addition, for the repulsion scenario, RAINBOW succeeded in detecting causal haplotype blocks that could not be detected by any other method, including single-SNP GWAS (Fig 4). This contributed to RAINBOW outperforming the other methods in the repulsion scenario, especially in terms of detection power, recall, precision, and F-measure. This suggests that RAINBOW is well suited to detecting causal haplotype blocks that contain multiple causal variants. For example, RAINBOW can be applied to the detection of genes that have more than one causal variant. Therefore, in future analyses, RAINBOW can be used for gene-set GWAS (which regards one gene as one SNP-set) by using gene annotation information.
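The intuition for why a block-level test helps in the repulsion scenario can be seen in a noise-free toy example. The Python sketch below is an illustration only and is not RAINBOW's test statistic (RAINBOW uses a mixed-model SNP-set test); the haplotype counts and effect sizes are hypothetical. Two tightly linked SNPs have opposite effects, so each SNP alone explains little of the phenotype, whereas the haplotype classes explain all of it.

```python
from statistics import mean

# Hypothetical inbred lines: haplotype (SNP1, SNP2) -> number of lines.
# The two SNPs are in strong LD, so recombinant haplotypes are rare.
haplotype_counts = {(0, 0): 45, (1, 1): 45, (1, 0): 5, (0, 1): 5}
effects = (1.0, -1.0)  # opposite (repulsion) effects, illustrative values

lines = [hap for hap, n in haplotype_counts.items() for _ in range(n)]
phenotype = [effects[0] * a + effects[1] * b for a, b in lines]


def variance_explained(groups, y):
    """Fraction of the variance of y explained by group means (one-way ANOVA R^2)."""
    grand_mean = mean(y)
    total = sum((v - grand_mean) ** 2 for v in y)
    by_group = {}
    for g, v in zip(groups, y):
        by_group.setdefault(g, []).append(v)
    between = sum(len(vs) * (mean(vs) - grand_mean) ** 2 for vs in by_group.values())
    return between / total


r2_snp1 = variance_explained([a for a, _ in lines], phenotype)
r2_snp2 = variance_explained([b for _, b in lines], phenotype)
r2_block = variance_explained(lines, phenotype)
print(r2_snp1, r2_snp2, r2_block)
```

In this toy setting the effects of the two SNPs cancel on the common haplotypes, so a single-SNP scan sees only the weak signal carried by the rare recombinants, while treating the block as one unit recovers the full signal.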

The only drawback of RAINBOW is that its detection power for the causal variant with a small effect (QTN3) was not very high (c,d in S4 Fig). This drawback may be related to the fact that RAINBOW detected QTN1 and QTN2 well. In other words, RAINBOW does not account for loci of large effect when testing other loci, so loci of relatively small effect may be concealed by the loci of large effect. This drawback, however, can be resolved by using methods that condition on the loci of large effect, such as composite interval mapping for QTL analysis [59, 60] or a multi-locus mixed model for GWAS [61]. In future work, we will implement a function that conditions on the loci of large effect when testing other loci of small effect.
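The idea of conditioning on a large-effect locus can be illustrated with ordinary regression residuals. The Python sketch below is a simplified stand-in, not composite interval mapping, not the multi-locus mixed model of [61], and not a RAINBOWR function: the design, effect sizes, and noise level are all hypothetical. It compares the correlation between a small-effect locus and the phenotype before and after the large-effect locus has been regressed out.

```python
import random
from statistics import mean


def residuals(y, x):
    """Residuals of a simple linear regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


def corr(u, v):
    """Pearson correlation coefficient."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


rng = random.Random(0)
# Balanced toy design: 50 lines for each combination of the two loci,
# so the large- and small-effect loci are exactly uncorrelated.
design = [(a, b) for a in (0, 1) for b in (0, 1) for _ in range(50)]
x_large = [a for a, _ in design]
x_small = [b for _, b in design]
y = [2.0 * a + 0.3 * b + rng.gauss(0.0, 0.5) for a, b in design]

r_marginal = corr(x_small, y)
# Partial correlation: remove the large-effect locus from both variables first.
r_conditioned = corr(residuals(x_small, x_large), residuals(y, x_large))
print(abs(r_marginal), abs(r_conditioned))
```

Regressing out the large-effect locus removes the variance it contributes to the phenotype, so the small-effect locus stands out more clearly against the remaining noise.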

Supporting information

S1 Appendix. Supplementary Note for additional RAINBOW methods.

A faster computational method for the linear kernel and effective testing method for dominance and epistatic effects are mainly described.

(PDF)

S1 Table. Supplementary table for accession information used in this study.

(CSV)

S1 File. Source codes for the R package RAINBOWR.

Includes the source code and license files for the R package RAINBOWR. Please see the “Readme.md” file to get started with RAINBOW.

(RAR)

S1 Fig. Supplementary figure for the flow chart of the simulation framework in this study.

(PDF)

S2 Fig. Supplementary figure for −log10(p) and −log10(pa) of QTN3 for each method.

The legends and abbreviations are the same as those of Fig 1.

(PDF)

S3 Fig. Supplementary figure for recall, precision and F-measure determined by the threshold criterion of Bonferroni correction with a significance level of 0.01.

The legends and abbreviations are the same as those of Fig 2.

(PDF)

S4 Fig. Supplementary figure for recall, precision and F-measure of each QTN.

The legends and abbreviations are the same as those of Fig 2.

(PDF)

S5 Fig. Supplementary figure for recall, precision and F-measure of each QTN determined by the threshold criterion of Bonferroni correction with a significance level of 0.01.

The legends and abbreviations are the same as those of Fig 2.

(PDF)

S6 Fig. Supplementary figures (6 pages) for the examples of the cases where only RAINBOW succeeded in detecting causals, for the repulsion scenario.

The legends and abbreviations are the same as those of Fig 4.

(PDF)

Acknowledgments

We are grateful to Dr. Ryokei Tanaka and Dr. Shiori Yabe for fruitful discussions, Dr. Motoyuki Ishimori and Mr. Goshi Sasaki for debugging the package, and Mr. Ryusuke Hamazaki for naming the package, RAINBOW.

Data Availability

We implemented the method as an R package named RAINBOWR. RAINBOWR is deposited in CRAN (Comprehensive R Archive Network), https://cran.r-project.org/web/packages/RAINBOWR/index.html, and in the “KosukeHamazaki/RAINBOWR” repository on GitHub, https://github.com/KosukeHamazaki/RAINBOWR. The datasets and scripts generated and analyzed during the current study are available from the “KosukeHamazaki/HGRAINBOW” repository on GitHub, https://github.com/KosukeHamazaki/HGRAINBOW.

Funding Statement

This work was supported by JST CREST (https://www.jst.go.jp/kisoken/crest/en/index.html) Grant Number JPMJCR16O2, Japan. The funders had no role in study design, data collection, and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11(1):31–46. doi: 10.1038/nrg2626
  • 2. Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The next-generation sequencing revolution and its impact on genomics. Cell. 2013;155(1):27–38. doi: 10.1016/j.cell.2013.09.006
  • 3. Ott J, Wang J, Leal SM. Genetic linkage analysis in the age of whole-genome sequencing. Nat Rev Genet. 2015;16(5):275–284. doi: 10.1038/nrg3908
  • 4. Atwell S, Huang YS, Vilhjálmsson BJ, Willems G, Horton M, Li Y, et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature. 2010;465(7298):627–631. doi: 10.1038/nature08800
  • 5. Huang X, Wei X, Sang T, Zhao Q, Feng Q, Zhao Y, et al. Genome-wide association studies of 14 agronomic traits in rice landraces. Nat Genet. 2010;42(11):961–967. doi: 10.1038/ng.695
  • 6. Korte A, Farlow A. The advantages and limitations of trait analysis with GWAS: a review. Plant Methods. 2013;9(1):29. doi: 10.1186/1746-4811-9-29
  • 7. Li JY, Wang J, Zeigler RS. The 3,000 rice genomes project: New opportunities and challenges for future rice research. GigaScience. 2014;3(1):1–3. doi: 10.1186/2047-217X-3-8
  • 8. Wang W, Mauleon R, Hu Z, Chebotarov D, Tai S, Wu Z, et al. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature. 2018;557(7703):43–49. doi: 10.1038/s41586-018-0063-9
  • 9. Alexandrov N, Tai S, Wang W, Mansueto L, Palis K, Fuentes RR, et al. SNP-Seek database of SNPs derived from 3000 rice genomes. Nucleic Acids Res. 2015;43(D1):D1023–D1027. doi: 10.1093/nar/gku1039
  • 10. Mansueto L, Fuentes RR, Chebotarov D, Borja FN, Detras J, Abriol-Santos JM, et al. SNP-Seek II: A resource for allele mining and analysis of big genomic data in Oryza sativa. Curr Plant Biol. 2016;7-8:16–25. doi: 10.1016/j.cpb.2016.12.003
  • 11. Mansueto L, Fuentes RR, Borja FN, Detras J, Abriol-Santos JM, Chebotarov D, et al. Rice SNP-seek database update: New SNPs, indels, and queries. Nucleic Acids Res. 2017;45(D1):D1075–D1081. doi: 10.1093/nar/gkw1135
  • 12. Misra G, Badoni S, Anacleto R, Graner A, Alexandrov N, Sreenivasulu N. Whole genome sequencing-based association study to unravel genetic architecture of cooked grain width and length traits in rice. Nat Sci Reports. 2017;7(1):12478.
  • 13. Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB. Rare Variants Create Synthetic Genome-Wide Associations. PLoS Biol. 2010;8(1). doi: 10.1371/journal.pbio.1000294
  • 14. Stram D. Design, Analysis, and Interpretation of Genome-Wide Association Scans. Heidelberg, New York: Springer Science+Business Media; 2014.
  • 15. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029
  • 16. Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics. 2007;63(4):1079–1088. doi: 10.1111/j.1541-0420.2007.00799.x
  • 17. Liu D, Ghosh D, Lin X. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics. 2008;9:1–11. doi: 10.1186/1471-2105-9-292
  • 18. Sha Q, Wang X, Wang X, Zhang S. Detecting Association of Rare and Common Variants by Testing an Optimally Weighted Combination of Variants. Genet Epidemiol. 2012;36(6):561–571. doi: 10.1002/gepi.21649
  • 19. Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. Sequence kernel association tests for the combined effect of rare and common variants. Am J Hum Genet. 2013;92(6):841–853. doi: 10.1016/j.ajhg.2013.04.015
  • 20. Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SLR, Peyser PA, et al. SNP Set Association Analysis for Familial Data. Genet Epidemiol. 2012;36(8):797–810. doi: 10.1002/gepi.21676
  • 21. Chen H, Meigs JB, Dupuis J. Sequence Kernel Association Test for Quantitative Traits in Family Samples. Genet Epidemiol. 2013;37(2):196–204. doi: 10.1002/gepi.21703
  • 22. Oualkacha K, Dastani Z, Li R, Cingolani PE, Spector TD, Hammond CJ, et al. Adjusted Sequence Kernel Association Test for Rare Variants Controlling for Cryptic and Family Relatedness. Genet Epidemiol. 2013;37(4):366–376. doi: 10.1002/gepi.21725
  • 23. Rao CR. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Math Proc Cambridge Philos Soc. 1948;44(1):50–57. doi: 10.1017/S0305004100023987
  • 24. Lippert C, Xiang J, Horta D, Widmer C, Kadie C, Heckerman D, et al. Greater power and computational efficiency for kernel-based association testing of sets of genetic variants. Bioinformatics. 2014;30(22):3206–3214. doi: 10.1093/bioinformatics/btu504
  • 25. Neyman J, Pearson ES. On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Biometrika. 1928;20A(1-2):175–240. doi: 10.1093/biomet/20A.1-2.175
  • 26. Wilks SS. The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. Ann Math Stat. 1938;9(1):60–62. doi: 10.1214/aoms/1177732360
  • 27. Listgarten J, Lippert C, Kang EY, Xiang J, Kadie CM, Heckerman D. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics. 2013;29(12):1526–1533. doi: 10.1093/bioinformatics/btt177
  • 28. Yano K, Yamamoto E, Aya K, Takeuchi H, Lo PC, Hu L, et al. Genome-wide association study using whole-genome sequencing rapidly identifies new genes influencing agronomic traits in rice. Nat Genet. 2016;48(8):927–934. doi: 10.1038/ng.3596
  • 29. Druet T, Georges M. A hidden Markov model combining linkage and linkage disequilibrium information for haplotype reconstruction and quantitative trait locus fine mapping. Genetics. 2010;184(3):789–798. doi: 10.1534/genetics.109.108431
  • 30. Zhang Z, Guillaume F, Sartelet A, Charlier C, Georges M, Farnir F, et al. Ancestral haplotype-based association mapping with generalized linear mixed models accounting for stratification. Bioinformatics. 2012;28(19):2467–2473. doi: 10.1093/bioinformatics/bts348
  • 31. R Core Team. R: A Language and Environment for Statistical Computing; 2019. Available from: https://www.R-project.org/.
  • 32. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag; New York; 2016. Available from: https://ggplot2.tidyverse.org.
  • 33. Eddelbuettel D, François R. Rcpp: Seamless R and C++ Integration. Journal of Statistical Software. 2011;40(8):1–18. doi: 10.18637/jss.v040.i08
  • 34. Eddelbuettel D. Seamless R and C++ Integration with Rcpp. New York: Springer; 2013.
  • 35. Eddelbuettel D, Balamuta JJ. Extending R with C++: A Brief Introduction to Rcpp. PeerJ Preprints. 2017;5:e3188v1.
  • 36. Bates D, Eddelbuettel D. Fast and Elegant Numerical Linear Algebra Using the RcppEigen Package. Journal of Statistical Software. 2013;52(5):1–24. doi: 10.18637/jss.v052.i05
  • 37. Endelman JB, Jannink JL. Shrinkage Estimation of the Realized Relationship Matrix. G3 (Bethesda). 2012;2(11):1405–1413. doi: 10.1534/g3.112.004259
  • 38. Edgeworth FY. On the Probable Errors of Frequency-Constants (Contd.). J R Stat Soc. 1908;71(3):499–512. doi: 10.2307/2339293
  • 39. Patterson HD, Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58(3):545–554. doi: 10.1093/biomet/58.3.545
  • 40. Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, et al. Efficient Control of Population Structure in Model Organism Association Mapping. Genetics. 2008;178(3):1709–1723. doi: 10.1534/genetics.107.080101
  • 41. Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 2012;44(7):821–824. doi: 10.1038/ng.2310
  • 42. Byrd R, Lu P, Nocedal J, Zhu C. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific Computing. 1995;16(5):1190–1208. doi: 10.1137/0916069
  • 43. Gilmour AR, Thompson R, Cullis BR. Average Information REML: An Efficient Algorithm for Variance Parameter Estimation in Linear Mixed Models. Biometrics. 1995;51(4):1440–1450. doi: 10.2307/2533274
  • 44. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82. doi: 10.1016/j.ajhg.2010.11.011
  • 45. Covarrubias-Pazaran G. Genome-Assisted prediction of quantitative traits using the r package sommer. PLoS One. 2016;11(6):1–15. doi: 10.1371/journal.pone.0156744
  • 46. Wang J, Do H, Woznica A, Kalousis A. Metric learning with multiple kernels. Adv Neural Inf Process Syst. 2011; p. 1170–1178.
  • 47. Self SG, Liang KY. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J Am Stat Assoc. 1987;82(398):605–610. doi: 10.1080/01621459.1987.10478472
  • 48. Stram DO, Lee JW. Variance Components Testing in the Longitudinal Mixed Effects Model. Biometrics. 1994;50(4):1171–1177. doi: 10.2307/2533455
  • 49. Browning SR, Browning BL. Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering. Am J Hum Genet. 2007;81(5):1084–1097. doi: 10.1086/521987
  • 50. Browning BL, Zhou Y, Browning SR. A One-Penny Imputed Genome from Next-Generation Reference Panels. Am J Hum Genet. 2018;103(3):338–348. doi: 10.1016/j.ajhg.2018.07.015
  • 51. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–2158. doi: 10.1093/bioinformatics/btr330
  • 52. Purcell S, Chang C. PLINK 1.9; 2018. Available from: https://www.cog-genomics.org/plink/1.9/.
  • 53. Gaunt TR, Rodríguez S, Day INM. Cubic exact solutions for the estimation of pairwise haplotype frequencies: Implications for linkage disequilibrium analyses and a web tool 'CubeX'. BMC Bioinformatics. 2007;8:1–9. doi: 10.1186/1471-2105-8-428
  • 54. Taliun D, Gamper J, Pattaro C. Efficient haplotype block recognition of very long and dense genetic sequences. BMC Bioinformatics. 2014;15(1):1–18. doi: 10.1186/1471-2105-15-10
  • 55. Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006;38(2):203–208. doi: 10.1038/ng1702
  • 56. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc Ser B. 1995;57(1):289–300.
  • 57. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci. 2003;100(16):9440–9445. doi: 10.1073/pnas.1530509100
  • 58. Bland JM, Altman DG. Multiple significance tests: The Bonferroni correction. BMJ. 1995;310:170.
  • 59. Zeng ZB. Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proc Natl Acad Sci. 1993;90(23):10972–10976. doi: 10.1073/pnas.90.23.10972
  • 60. Zeng ZB. Precision Mapping of Quantitative Trait Loci. Genetics. 1994;136:1457–1468.
  • 61. Segura V, Vilhjálmsson BJ, Platt A, Korte A, Seren Ü, Long Q, et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat Genet. 2012;44(7):825–830. doi: 10.1038/ng.2314
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007663.r001

Decision Letter 0

Mihaela Pertea

11 Nov 2019

Dear Dr Iwata,

Thank you very much for submitting your manuscript 'RAINBOW: Haplotype-based genome-wide association study using a novel SNP-set method' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.

In addition, when you are ready to resubmit, please be prepared to provide the following:

(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.

(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible to show clearly where changes have been made to their manuscript e.g. by highlighting text.

(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.

Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:

- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).

- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.

- Funding information in the 'Financial Disclosure' box in the online system.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future.

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Mihaela Pertea

Software Editor

PLOS Computational Biology

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This is very interesting research that attempts to improve the rare-variant detection of conventional SNP-based GWAS models via haplotypes. The proposed method performs as well as the conventional methods in controlling false positives; however, it outperforms the other models in detecting induced causal variants that were not identified with the conventional models. Also, one of the advantages of the proposed model is that it does not rely on LD when causal variants are also genotyped.

In general, the Materials and Methods, Results, and Discussion sections are well written; however, the abstract and introduction sections need some improvement, especially in better describing the scope and implications of the results of the proposed method.

Here are a few minor points.

Page 2, lines 6-9. sequencing data is the "3000 rice genomes project" [].

such public data, the conventional GWAS

Page 2, line 29. in false

as in the world collection of rice germplasm used in this

drawback: a weighting

Line 39. which is a computationally

method since it requires

Page 3, lines 47-49. Please rephrase.

Derivations of the equations and model development is ok

Adding a diagram for explaining the proposed simulation scheme would help to understand better the results.

Page 8, line 217. data an material

Discussion was well conducted. Perhaps a conclusions section would be desirable if that is allowed in the journal format.

Reviewer #2: Please see attachment for review.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Diego Jarquin

Reviewer #2: No

Attachment

Submitted filename: Review.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007663.r003

Decision Letter 1

Mihaela Pertea

18 Jan 2020

Dear Dr. Iwata,

We are pleased to inform you that your manuscript 'RAINBOW: Haplotype-based genome-wide association study using a novel SNP-set method' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch within two working days with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Mihaela Pertea

Software Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I have no further comments on the version of the manuscript. All my questions were correctly addressed by the authors.

Reviewer #2: The authors have substantially improved their writing.

Their contribution has tried to address GWAS, an important but difficult problem. The method requires estimating variance components of models from a multi-step approach, which includes estimating weights to scale the estimated variance components from a model with a single random effect.

I am not convinced such an approach is optimal relative to estimating the variance components directly, but acknowledge that the contribution and results are worthy of dissemination. A weakness of the method is that it is not easily extendible (e.g. if there are three random effects then the algorithm needs modification), perhaps something that the authors may like to think about in future developments of their software.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007663.r004

Acceptance letter

Mihaela Pertea

6 Feb 2020

PCOMPBIOL-D-19-01767R1

RAINBOW: Haplotype-based genome-wide association study using a novel SNP-set method

Dear Dr Iwata,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Sarah Hammond

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Supplementary Note for additional RAINBOW methods.

    A faster computational method for the linear kernel and effective testing method for dominance and epistatic effects are mainly described.

    (PDF)

    S1 Table. Supplementary table for accession information used in this study.

    (CSV)

    S1 File. Source codes for the R package RAINBOWR.

    Including source code and license files for the R package RAINBOWR. Please see “Readme.md” file to start the RAINBOW.

    (RAR)

    S1 Fig. Supplementary figure for the flow chart of the simulation framework in tis study.

    (PDF)

    S2 Fig. Supplementary figure for −log10(p) and −log10(pa) of QTN3 for each method.

    How to view this figure (including legends and abbreviations) is the same as that of Fig 1.

    (PDF)

    S3 Fig. Supplementary figure for recall, precision and F-measure determined by the threshold criterion of Bonferroni correction whose significance level equals to 0.01.

    How to view this figure (including legends and abbreviations) is the same as that of Fig 2.

    (PDF)

    S4 Fig. Supplementary figure for recall, precision and F-measure of each QTN.

    How to view this figure (including legends and abbreviations) is the same as that of Fig 2.

    (PDF)

    S5 Fig. Supplementary figure for recall, precision and F-measure of each QTN determined by the Bonferroni-corrected threshold at a significance level of 0.01.

    How to view this figure (including legends and abbreviations) is the same as that of Fig 2.

    (PDF)

    S6 Fig. Supplementary figures (6 pages) showing examples of cases in which only RAINBOW succeeded in detecting the causal variants, for the repulsion scenario.

    How to view this figure (including legends and abbreviations) is the same as that of Fig 4.

    (PDF)

    Attachment

    Submitted filename: Review.pdf

    Attachment

    Submitted filename: Response_to_Reviewers.docx

    Data Availability Statement

    We implemented the method as an R package named RAINBOWR. RAINBOWR is deposited in CRAN (the Comprehensive R Archive Network), https://cran.r-project.org/web/packages/RAINBOWR/index.html, and in the “KosukeHamazaki/RAINBOWR” repository on GitHub, https://github.com/KosukeHamazaki/RAINBOWR. The datasets and scripts generated and analyzed during the current study are available from the “KosukeHamazaki/HGRAINBOW” repository on GitHub, https://github.com/KosukeHamazaki/HGRAINBOW.

    RAINBOW was implemented as an R package named “RAINBOWR”, which offers the single-SNP GWAS method [41, 55] and a novel SNP-set method that includes faster computation for the linear kernel [24]. A stable version of RAINBOWR is available from CRAN (the Comprehensive R Archive Network), https://cran.r-project.org/web/packages/RAINBOWR/index.html. The latest version of RAINBOWR is also available from the “KosukeHamazaki/RAINBOWR” repository on GitHub, https://github.com/KosukeHamazaki/RAINBOWR. The source code for the R package RAINBOWR is deposited in S1 File. The datasets generated and analyzed during the current study and their source code are also available from the “KosukeHamazaki/HGRAINBOW” repository on GitHub, https://github.com/KosukeHamazaki/HGRAINBOW.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
