PLOS Computational Biology. 2011 Nov 3;7(11):e1002255. doi: 10.1371/journal.pcbi.1002255

The Statistics of Bulk Segregant Analysis Using Next Generation Sequencing

Paul M Magwene 1,*, John H Willis 2, John K Kelly 3
Editor: Adam Siepel 4
PMCID: PMC3207950  PMID: 22072954

Abstract

We describe a statistical framework for QTL mapping using bulk segregant analysis (BSA) based on high throughput, short-read sequencing. Our proposed approach is based on a smoothed version of the standard Inline graphic statistic, and takes into account variation in allele frequency estimates due to sampling of segregants to form bulks as well as variation introduced during the sequencing of bulks. Using simulation, we explore the impact of key experimental variables such as bulk size and sequencing coverage on the ability to detect QTLs. Counterintuitively, we find that relatively large bulks maximize the power to detect QTLs even though this implies weaker selection and less extreme allele frequency differences. Our simulation studies suggest that with large bulks and sufficient sequencing depth, the methods we propose can be used to detect even weak effect QTLs and we demonstrate the utility of this framework by application to a BSA experiment in the budding yeast Saccharomyces cerevisiae.

Author Summary

Quantitative or complex phenotypes are traits that are under the control of multiple genes and environmental factors. Identifying the parts of the genome that contribute to variation in complex traits (Quantitative Trait Loci or QTLs), and ultimately the genes and alleles that are mechanistically responsible for trait variation, is a primary challenge in animal and plant breeding, population studies of human health and disease, and evolutionary genetics. In this study we describe an analytical framework that allows investigators to marry a QTL mapping approach called “bulk segregant analysis” (BSA) with high-throughput genome sequencing methodologies in order to map traits quickly, efficiently, and in a relatively inexpensive manner. This framework provides a statistical basis for analyzing BSA experiments that use next-generation sequencing and will help to accelerate the identification of QTLs in both model and non-model organisms.

Introduction

Bulk segregant analysis (BSA; [1]) is a QTL mapping technique for identifying genomic regions containing genetic loci affecting a trait of interest. Starting with a segregating population from a genetic cross, individuals are assayed for the focal trait and two pools (bulks) of segregants are created by selecting individuals from the tails of the phenotypic distribution (other sampling designs can also be used, as discussed below). Genotype frequencies are estimated for the two bulks, either via genotyping of individuals or via the creation of pooled DNA samples from which allele frequencies are estimated. Allele frequencies should be approximately equal between the two bulks in genomic regions without loci affecting the trait. Regions of the genome containing causal loci should exhibit allele frequency differences between bulks. BSA is most effective with high marker density and accurate allele frequency estimation within bulks [2]. The former was effectively addressed with the application of microarray-based genotyping to BSA [3]–[8]. More recently, investigators have begun to use massively parallel sequencing methods to estimate allele frequencies for BSA studies [9]–[11], which has a number of advantages. For organisms with moderately sized genomes, next generation sequencing can provide essentially single base-pair resolution. In such cases, rather than simply observing markers in linkage with causal loci, the BSA-sequencing approach should allow one to observe allelic biases at the causal loci themselves. For larger genomes where high coverage of the entire genome is less practical, BSA-sequencing still has many potential advantages. For example, it does not require the design of new genotyping arrays for new crosses and may provide greater resolution than array-based genotyping. Furthermore, sequencing data yields counts of alleles at polymorphic loci and thus provides a simple and intuitive way of estimating allele frequencies.

In bulk segregant studies based on high-throughput sequencing there are two sources of variation that affect allele frequency estimates. The first is variation due to the sampling of segregants that constitute the bulks themselves. This source of variation can be minimized by increasing both the size of the segregant population and the size of the bulk samples. The second source of variation is a consequence of the measurement technique used to estimate allele frequencies in the bulks. In the case of sequencing of pooled DNA samples, the sources of variation of this second type include, but are not limited to, library preparation, sequencing chemistry, sequencing coverage, post-sequencing alignment of reads, and base/allele calling algorithms. Here again, some of these sources of variation can be minimized by standardization of experimental protocols and analysis pipelines. However, some of them, particularly stochasticity in sequencing coverage, are an inherent property of short-read sequencing methods.

In this paper, we develop explicit statistical models to describe the sources of variation that should be considered in the analysis of BSA-sequencing data. We first develop test statistics based on the classic Inline graphic -statistic accounting for the two-phase sampling inherent to BSA. We then propose an analysis pipeline for whole-genome studies and present a proof-of-concept example with data from yeast. A combination of simulation and empirical application demonstrates the utility of this analytical framework.

Results

Theory and Analytical Framework

Expected distribution of Inline graphic for BSA-sequencing data

Consider the experimental design with an FInline graphic population consisting of Inline graphic individuals, each of which is measured for a phenotype of interest. A set of Inline graphic individuals from each of the tails of the distribution (low and high) is collected. DNA bulks are prepared by combining equal amounts of tissue/cells from individuals within each bulk followed by DNA extraction, or by extracting DNA from each individual and combining equal amounts. Following preparation of DNA bulks, genomic libraries are prepared and sequenced at average coverage Inline graphic per SNP. Thus for each SNP the data consist of four allele counts that can be summarized in a 2 × 2 table, where Inline graphic is the allele from the high parent (Table 1). The Inline graphic-values in the table are counts of alleles, not individuals. The observed allele frequency of Inline graphic in the low bulk is Inline graphic; that in the high bulk is Inline graphic. If the SNP is close to a QTL with effects in the expected direction (i.e. the ‘high allele’ increases trait values), then we expect Inline graphic.

Table 1. The summary of data from a single variable site.
Low bulk High bulk Total
Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic
Total Inline graphic Inline graphic

The Inline graphic represent counts of alleles Inline graphic and Inline graphic generated from sequencing of the segregant bulks.

The counts in Table 1 are determined by two levels of hierarchical sampling. The first sample is the Inline graphic chromosomes that constitute each bulk (assuming diploid inheritance). Second, there is random variation in the number of reads per allele within each bulk due to the stochastic nature of next-generation sequencing. Let Inline graphic and Inline graphic be the expected (‘true’) frequency of the high allele in each bulk. The realized frequencies (Inline graphic, Inline graphic) differ from Inline graphic and Inline graphic in each bulk due to binomial sampling:

graphic file with name pcbi.1002255.e035.jpg (1)
graphic file with name pcbi.1002255.e036.jpg (2)

If we assume that sequencing coverage is approximately Poisson, then the conditional distributions of the observed allele counts are:

graphic file with name pcbi.1002255.e037.jpg (3)
graphic file with name pcbi.1002255.e038.jpg (4)
graphic file with name pcbi.1002255.e039.jpg (5)
graphic file with name pcbi.1002255.e040.jpg (6)

A natural statistic to characterize the data at each SNP is the standard Inline graphic -statistic:

graphic file with name pcbi.1002255.e042.jpg (7)

where Inline graphic is the ‘expected value’ for count Inline graphic. The null hypothesis is that there is no QTL close to the focal SNP. This implies the standard expected counts for a 2 × 2 contingency table, e.g. Inline graphic. If the null hypothesis is correct, Inline graphic and Inline graphic. If we further assume no segregation distortion and equal (average) sequencing coverage of each bulk, then Inline graphic. See the supplementary materials (Text S1) for a generalization that includes segregation distortion.
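To make the per-SNP calculation concrete, the following is a minimal sketch that assumes the statistic is the standard likelihood-ratio G statistic for a 2 × 2 table, with each expected count taken as row total × column total / grand total. The function name `g_statistic` and the row/column layout of the input are hypothetical choices for illustration, not part of the original analysis.

```python
import numpy as np

def g_statistic(counts):
    """Likelihood-ratio G statistic for a 2 x 2 table of allele counts.

    counts: array-like of shape (2, 2); rows are alleles, columns are bulks
    (an assumed layout). Expected counts use the usual contingency-table
    rule: row total * column total / grand total.
    """
    n = np.asarray(counts, dtype=float)
    expected = np.outer(n.sum(axis=1), n.sum(axis=0)) / n.sum()
    # A zero count contributes 0 to the sum (the limit of n*ln(n) as n -> 0).
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(n > 0, n * np.log(n / expected), 0.0)
    return 2.0 * terms.sum()
```

A table with identical allele frequencies in the two bulks gives a value of zero, and the value grows as the two columns diverge.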

However, due to the hierarchical sampling scheme, the usual expectation that Inline graphic follows a Inline graphic distribution (chi-square with 1 d.f.; [12]) does not hold in the present situation. The mean and variance of Inline graphic are inflated relative to the Inline graphic even when the null hypothesis is true (i.e. there is no QTL). Based on the arguments in Text S1 we approximate the mean and variance of Inline graphic as:

graphic file with name pcbi.1002255.e055.jpg (8)
graphic file with name pcbi.1002255.e056.jpg (9)

These equations predict convergence on Inline graphic under certain parameter sets. In particular, if Inline graphic, then Inline graphic and Inline graphic, as expected from Inline graphic.

A simulation model was used to test the accuracy of approximate equations (8) and (9). We simulated genetic data for a chromosomal region of 10 cM in recombinational length. Informative markers were uniformly distributed along this chromosome with Inline graphic SNPs per cM. The causal locus (QTL) was located at the center of the chromosome and was thus flanked by Inline graphic SNPs on each side. Alternative homozygotes at the QTL differ by Inline graphic phenotypic units on average (additive gene action) and simulations of the null hypothesis (no QTL) were done with Inline graphic. In each simulation run, we first established the genotypes and phenotypes of the Inline graphic distinct FInline graphic segregants. Each individual was assigned a QTL genotype according to Mendelian probabilities (0.25, 0.5, 0.25) and the phenotype was assigned as the genotypic value plus a normal deviate. Individuals were then ranked by phenotype and Inline graphic were selected from each tail. The full haplotype of these individuals was then established by working out from each allele at the QTL and allowing recombination to occur probabilistically according to the linkage map. Given the haplotypes in each bulk, we simulated an independent Poisson number for each count of Table 1 for each SNP. These data were used to calculate Inline graphic at each SNP, and also Inline graphic as described below, within windows around each SNP. For the latter we needed to specify a window size in centimorgans. For each parameter set, this entire procedure was repeated 10,000 times. Table 1 in Text S1 reports simulation results for the null hypothesis (Inline graphic) for a range of reasonable combinations of Inline graphic and Inline graphic. There is a close correspondence of observed means and variances of Inline graphic with the values predicted by equations (8) and (9). As expected, in these simulations the distribution of Inline graphic is right skewed with a mean and variance exceeding the Inline graphic expectations.
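The inflation under the null hypothesis can be illustrated with a much reduced version of this simulation. The sketch below is not the authors' simulation: it drops linkage and phenotypes entirely and treats a single unlinked SNP, assuming a true allele frequency of 0.5 in both bulks, binomial sampling of `2 * n_s` chromosomes per bulk, Poisson read counts with means `coverage * p_hat` and `coverage * (1 - p_hat)`, and the likelihood-ratio G statistic. All function and parameter names are hypothetical.

```python
import numpy as np

def simulate_null_g(n_s, coverage, n_sims=20000, seed=0):
    """Simulate the per-SNP statistic under the null with two-stage sampling."""
    rng = np.random.default_rng(seed)
    # Stage 1: sample 2*n_s chromosomes per bulk; true frequency is 0.5 (no QTL).
    p_hat = rng.binomial(2 * n_s, 0.5, size=(n_sims, 2)) / (2 * n_s)
    # Stage 2: independent Poisson read counts for each allele in each bulk.
    high = rng.poisson(coverage * p_hat)
    low = rng.poisson(coverage * (1.0 - p_hat))
    counts = np.stack([high, low], axis=1).astype(float)  # (sims, allele, bulk)
    row = counts.sum(axis=2, keepdims=True)
    col = counts.sum(axis=1, keepdims=True)
    total = counts.sum(axis=(1, 2), keepdims=True)
    expected = row * col / total
    # Zero counts contribute 0 to the statistic.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(counts > 0, counts * np.log(counts / expected), 0.0)
    return 2.0 * terms.sum(axis=(1, 2))

if __name__ == "__main__":
    small_bulk = simulate_null_g(n_s=50, coverage=100)
    huge_bulk = simulate_null_g(n_s=100_000, coverage=100)
    print("small bulk: mean", small_bulk.mean(), "variance", small_bulk.var())
    print("huge bulk:  mean", huge_bulk.mean(), "variance", huge_bulk.var())
```

With a small bulk relative to coverage, the simulated mean sits well above 1; with a very large bulk, the first sampling stage contributes almost nothing and the mean returns to roughly 1.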

The full distribution of Inline graphic values is depicted for one parameter set (Inline graphic, Inline graphic) in Figure 1a. The gray histogram shows the distribution of Inline graphic under the null hypothesis (Inline graphic) while the overlapping red histogram shows the corresponding distribution in the case of a weak QTL (Inline graphic). Focusing first on the null distribution, because the distribution is right skewed (mean = 1.19, variance = 2.93), if we compare this distribution to critical values of Inline graphic the observed false positive rate is somewhat elevated (6.98% at Inline graphic; 1.98% at Inline graphic). However when Inline graphic approaches Inline graphic the mean and variance of Inline graphic far exceed the Inline graphic expectation and type I error rates increase dramatically. Perhaps even more problematic is the inability of Inline graphic to detect a QTL based on the naïve Inline graphic expectation. For the weak QTL case, where the QTL explains 2% of the phenotypic variance, the causal SNP is significant at a Inline graphic in only 34.9% of the simulations, and in only 16.8% of simulations at Inline graphic. The application of the naïve Inline graphic thus suffers from a lack of power.

Figure 1. The distribution of Inline graphic (A) and Inline graphic values (B) from 10,000 simulations.

The gray histograms depict the observed distributions of Inline graphic and Inline graphic for the null case (no QTL), while the red distributions depict the distributions in the case of a weak QTL that explains 2% of the phenotypic variance.

Inline graphic, A Smoothed Version of Inline graphic

A substantial source of variation in Inline graphic is the random margin in Table 1, Inline graphic. To deal with this variation we propose the use of a weighted average of Inline graphic across neighboring SNPs. Averaging Inline graphic values across SNPs is sensible because the real signal of divergence in allele frequency between bulks is conserved between closely linked sites but random noise due to variable sequencing read coverage is not. We suggest the following average test statistic for each SNP:

graphic file with name pcbi.1002255.e105.jpg (10)

where the sum includes all SNPs within the window Inline graphic bracketing the SNP. This type of weighted moving average, where the weights are given by a kernel function, Inline graphic, is also known as Nadaraya-Watson kernel regression [13], [14]. Nadaraya-Watson kernel regression acts as a smoothing function, with the amount of smoothing increasing with larger window size Inline graphic [15]. The simplest scheme for Inline graphic would be to give equal weight to all SNPs within Inline graphic (a rectangular kernel). We opt instead to apply the tri-cube kernel function:

graphic file with name pcbi.1002255.e111.jpg (11)

where Inline graphic is standardized distance, with value 0 at the focal position and value 1 at the edge of the window. Inline graphic is the sum of Inline graphic for all SNPs in Inline graphic. The tri-cube kernel is commonly used in local polynomial regression methods like LOESS [16] and gives greater weight to observations that are close to the focal SNP. Any other weighting kernel that decreases smoothly to 0 as Inline graphic goes to 1 could be used as well. We discuss the choice of the kernel window size, Inline graphic, below.
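As an illustration of this kind of smoothing, here is a minimal sketch of a tri-cube weighted moving average. It assumes the usual tri-cube form, (1 − D³)³ for standardized distance D between 0 and 1, with weights normalized to sum to one within the window. The function names and the `window` argument (full window width, in the same units as the positions) are hypothetical, and the quadratic-time loop is chosen for clarity rather than speed.

```python
import numpy as np

def tricube_weights(positions, focal, window):
    """Normalized tri-cube weights for SNPs in a window centered on `focal`."""
    half = window / 2.0
    # Standardized distance: 0 at the focal position, 1 at the window edge.
    d = np.abs(np.asarray(positions, dtype=float) - focal) / half
    w = np.where(d < 1.0, (1.0 - d ** 3) ** 3, 0.0)
    # The focal SNP itself always has weight 1, so the sum is never zero.
    return w / w.sum()

def smooth_statistic(positions, values, window):
    """Kernel-weighted average of `values` around every position."""
    positions = np.asarray(positions, dtype=float)
    values = np.asarray(values, dtype=float)
    return np.array([
        np.dot(tricube_weights(positions, x, window), values)
        for x in positions
    ])
```

Because the weights are normalized, a constant input series is returned unchanged, which is a quick sanity check on any implementation.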

A methodological issue arises when kernel smoothing is used: at the beginning or end of a data series it can produce a biased estimate because the data included in the kernel bandwidth is asymmetric. The simplest way to deal with this is to append a reflected version of the values that fall within the right half-bandwidth (at the beginning of the series) and left half-bandwidth (at the end of the series), run the kernel smoother as normal, and then trim the appended values from the output.
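The reflection step can be sketched as follows. This is one possible reading of the procedure, assuming positions are sorted in ascending order and that values are mirrored about the first and last positions; the helper name `reflect_pad` and its return convention are hypothetical.

```python
import numpy as np

def reflect_pad(positions, values, half_bandwidth):
    """Mirror edge values about the first and last positions.

    Returns padded positions, padded values, and a slice that recovers
    the original span after smoothing. Assumes `positions` is sorted.
    """
    positions = np.asarray(positions, dtype=float)
    values = np.asarray(values, dtype=float)
    start, end = positions[0], positions[-1]
    # Points within one half-bandwidth of each end, excluding the end itself.
    left_mask = (positions > start) & (positions <= start + half_bandwidth)
    right_mask = (positions < end) & (positions >= end - half_bandwidth)
    # Reflect and reverse so the padded positions stay in ascending order.
    left_pos = (2 * start - positions[left_mask])[::-1]
    left_val = values[left_mask][::-1]
    right_pos = (2 * end - positions[right_mask])[::-1]
    right_val = values[right_mask][::-1]
    padded_pos = np.concatenate([left_pos, positions, right_pos])
    padded_val = np.concatenate([left_val, values, right_val])
    keep = slice(left_pos.size, left_pos.size + positions.size)
    return padded_pos, padded_val, keep
```

A smoother would then be run on the padded arrays and its output indexed with `keep` to discard the appended values.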

Expected distribution of Inline graphic for BSA-sequencing data

The null expectation of Inline graphic is given by equation (8). The variance of Inline graphic depends on the variance of individual Inline graphic values (equation 9) and the covariance between SNPs within a window. In Text S1 we show that Inline graphic can be approximated as:

graphic file with name pcbi.1002255.e123.jpg (12)

where Inline graphic indexes all SNPs other than Inline graphic contained within the window.

Figure 1b illustrates the distribution of Inline graphic for the same parameters as Figure 1a (plus window size Inline graphic cM and SNP density Inline graphic per cM). The difference between the null distributions in Figure 1a and 1b is due to the normalizing effect of averaging. The predicted mean and variance of Inline graphic (1.17 and 0.066) are reasonably close to the observed moments (1.18 and 0.056). The distribution of Inline graphic is still right skewed but the right tail can be reasonably predicted from log-normal densities with parameters derived from Inline graphic and Inline graphic (Figure S1 and Text S2). The observed false-positive rates (using a log-normal density estimation) are 5.14% at Inline graphic and 1.86% at Inline graphic. Unlike the use of the naive Inline graphic -test based on Inline graphic, the type I error does not increase dramatically as Inline graphic approaches Inline graphic. Furthermore, Inline graphic has good power to detect QTLs. For the example illustrated in Figure 1b the causal SNP is significant in 94.3% of the simulations at Inline graphic, and in 88.0% and 77.2% of simulations at Inline graphic and Inline graphic respectively.

Non-parametric estimation of the null distribution of Inline graphic

In addition to the theoretical expectations discussed above, an empirical estimate of the null distribution of Inline graphic can be derived from the observed data itself. We assume that the observed data, Inline graphic, is a mixture of the null distribution (non-QTL regions) and several contaminating distributions (QTLs). As discussed above, the null distribution of Inline graphic (Inline graphic) is right-skewed with a tail density reasonably predicted from a log-normal distribution, Inline graphic. We also assume the contaminating distributions have higher means than the null distribution. Our goal is to estimate Inline graphic and Inline graphic in a manner that is not unduly influenced by the contaminating distributions.

Recall that for a log-normal distribution: Inline graphic and Inline graphic [17]. Thus if we can estimate the median and mode of Inline graphic we can use those to estimate Inline graphic and Inline graphic. To do so we propose the following steps:

  1. Let Inline graphic

  2. Let Inline graphic, the left median absolute deviation (MAD) of Inline graphic where Inline graphic is defined as
    graphic file with name pcbi.1002255.e160.jpg
  3. Use Hampel's rule [18] to identify outliers, Inline graphic, as all Inline graphic in Inline graphic that satisfy:
    graphic file with name pcbi.1002255.e164.jpg
    where Inline graphic defines the limits of the outlier regions [18] and is usually taken to be 5.2 for normally distributed data.
  4. Construct a trimmed data set Inline graphic for all Inline graphic such that Inline graphic

  5. Let Inline graphic and Inline graphic where Inline graphic is a robust estimator of the mode for continuous variables (see [19] for several such estimators)
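The steps above can be sketched in code, with several assumptions stated plainly because the exact expressions are not reproduced here: the procedure is applied to the natural log of the smoothed statistic, the left MAD is the median absolute deviation of the values at or below the median, outliers are flagged one-sidedly with a cutoff factor of 5.2, and the log-normal parameters follow from the identities median = exp(μ) and mode = exp(μ − σ²). The mode estimator used here (midpoint of the fullest histogram bin) is a crude stand-in for the robust estimators cited in step 5, and all function names are hypothetical.

```python
import math
import numpy as np

def estimate_null_lognormal(stat, g=5.2, bins=30):
    """Robustly estimate log-normal parameters (mu, sigma^2) of the null."""
    stat = np.asarray(stat, dtype=float)
    log_stat = np.log(stat)
    med = np.median(log_stat)
    # Left MAD: spread of the values at or below the median only.
    left_mad = np.median(np.abs(log_stat[log_stat <= med] - med))
    # One-sided Hampel-style rule: drop values far above the median.
    trimmed = stat[log_stat - med <= g * left_mad]
    # Crude mode estimate (assumption): midpoint of the fullest histogram bin.
    hist, edges = np.histogram(trimmed, bins=bins)
    i = int(np.argmax(hist))
    mode = 0.5 * (edges[i] + edges[i + 1])
    mu = math.log(np.median(trimmed))
    # abs() guards against a noisy mode estimate landing above the median.
    sigma2 = abs(mu - math.log(mode))
    return mu, sigma2

def lognormal_pvalues(stat, mu, sigma2):
    """Upper-tail p-values under a log-normal null with the given parameters."""
    z = (np.log(np.asarray(stat, dtype=float)) - mu) / math.sqrt(sigma2)
    # Normal survival function via the complementary error function.
    return np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])
```

The histogram-based mode is sensitive to the number of bins, so the variance estimate from this sketch should be treated as rough; a dedicated mode estimator would be the natural replacement.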

The logic of this procedure is as follows. The median and MAD are robust estimators of location and spread respectively [20]. In the absence of contaminating distributions Inline graphic should be approximately normally distributed, and hence the median and MAD of Inline graphic can be used as robust estimates of the mean and spread of Inline graphic (Inline graphic for a symmetric distribution). Hampel's rule is a commonly used procedure to identify likely outliers in a set of data based on the median and MAD; if the underlying distribution is normally distributed and Inline graphic this is approximately equivalent to identifying outliers as those observations with Inline graphic-values Inline graphic (we use a one-sided test in the procedure above). When contaminating distributions (QTLs) are present, Inline graphic lies to the right of the true mean of the null distribution. Thus, Inline graphic and Inline graphic are conservative estimators of Inline graphic and Inline graphic. We then use Hampel's procedure to identify observations likely to be drawn from the contaminating distributions and create a trimmed data set, Inline graphic, with those outlying observations removed. From the trimmed data set we estimate Inline graphic and Inline graphic.

For the null simulations in Figure 1b the observed false-positive rates estimated using this non-parametric approach are 3.18% at Inline graphic and 0.76% at Inline graphic. In general, the non-parametric procedure tends to be slightly more conservative than our proposed parametric estimators but not greatly so. Because this non-parametric approach makes few distributional assumptions (other than approximate log-normality of the null distribution) it might be preferred in cases where one suspects the sampling (either of segregants or alleles) grossly violates the hierarchical model described above.

Choosing Inline graphic

A weighted moving average is a type of low-pass filter; the larger the window size the lower the frequency of signals that are rejected by the filter. The choice of smoothing width, Inline graphic, is therefore a tradeoff between filtering out high-frequency deviations in Inline graphic due to variable sequence coverage and SNP density and attenuating the signal of real QTLs. We want to pick a Inline graphic that minimizes noise while maximizing the underlying signal. The matched filter theorem [21] suggests that the filter that maximizes the signal-to-noise ratio of a symmetric signal is one which matches the shape of the signal. A simple measure of the shape of a symmetric signal is the full-width at half maximum (FWHM). The ratio of the width of the kernel to the peak FWHM (‘smoothing ratio’) is a useful metric for quantifying the effects of smoothing [22]. As a rule of thumb, using a smoothing kernel with a smoothing ratio of approximately two provides a good signal-to-noise ratio [22]. However, the matched filter may fail to distinguish multiple peaks when there are two or more signals in the input [23], as we would expect in cases of multiple QTLs with overlapping regions of elevated Inline graphic. Specifically, peaks separated by less than twice the FWHM of the filter will be merged [24]. Therefore, distinguishing overlapping signals requires filters with significantly smaller smoothing ratios, perhaps as small as 0.7.

In Text S2 we derive the expected shape of Inline graphic around a single causal SNP. For the case in which the causal allele is fixed in one bulk and has a frequency of 0.5 in the other bulk, the half-bandwidth (Inline graphic) at half-maximum corresponds to Inline graphic12.42 cM (Inline graphic). More extreme allelic biases between the bulks favor slightly smaller bandwidths, while less extreme differences favor larger bandwidths. SNP density also affects the optimal kernel bandwidth, with higher SNP density favoring narrower bandwidths. In simulations and in applications to real data we have found that kernels with smoothing ratios in the range 1–1.5 produce smoothed estimators with good signal-to-noise ratios and which are neither strongly over- nor undersmoothed. In terms of mapping distances this corresponds to kernels with Inline graphic in the range Inline graphic24.8–37.25 cM.

Since recombination rates vary across genomes, a given genetic distance will correspond to a range of physical distances. In terms of the choice of smoothing width, higher recombination rates favor smaller window sizes (in physical distance). If regional recombination rates are known this can be incorporated into the analysis; however the use of average chromosomal or genomic recombination rates to choose a single physical size for the smoothing window should not be problematic unless recombination rates vary widely. In such cases, one can calculate Inline graphic using a range of smoothing widths to explore whether peak estimates are strongly affected by over- or undersmoothing.
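The arithmetic behind these window sizes is simple enough to sketch. The example below assumes that the kernel width equals the smoothing ratio times the peak FWHM, that the FWHM is twice the 12.42 cM half-width quoted above, and that a single average recombination rate (in cM per Mb) applies across the region. Both function names and the rate used in the example are hypothetical.

```python
def kernel_width_cm(smoothing_ratio, half_width_at_half_max_cm=12.42):
    """Kernel width in cM for a given smoothing ratio (width / peak FWHM)."""
    fwhm_cm = 2.0 * half_width_at_half_max_cm
    return smoothing_ratio * fwhm_cm

def window_in_bp(window_cm, rate_cm_per_mb):
    """Convert a genetic window to a physical one under a uniform rate."""
    return window_cm / rate_cm_per_mb * 1_000_000

if __name__ == "__main__":
    for ratio in (1.0, 1.5):
        width = kernel_width_cm(ratio)
        # The rate of 3 cM/Mb is an arbitrary illustrative value.
        print(ratio, width, window_in_bp(width, rate_cm_per_mb=3.0))
```

Smoothing ratios of 1 and 1.5 give widths of roughly 25 and 37 cM, in line with the range stated above; a higher recombination rate shrinks the corresponding physical window.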

Proposed Analytical Pipeline

Based on the arguments developed above, we propose the following analytical pipeline for the analysis of BSA-sequencing data sets. We assume that sequencing reads have been aligned to a reference genome where physical distances between polymorphic sites and (approximate) rates of recombination are known. We assume that all sites are biallelic. Following alignment of reads to a reference genome, per site counts of each allele are generated from the reads. Our recommended analysis pipeline for estimating QTLs is as follows:

  1. For each variable site, calculate Inline graphic based on the observed number of reads for each allele in each of the two pools

  2. At each site calculate Inline graphic using a smoothing kernel with bandwidth Inline graphic bases where Inline graphic is chosen based on known or estimated rates of recombination. Bandwidths should typically correspond to genetic map distances in the range 25–40 cM.

  3. Estimate parameters of the log-normal null-distribution (i.e. no QTL) of Inline graphic, Inline graphic, based on either theoretical expectations (equations (8) and (12) and Text S2) or using the robust empirical estimator of the null distribution inferred from the observed Inline graphic.

  4. Using Inline graphic estimate Inline graphic-values directly using the log-normal CDF. Alternatively, log-transform Inline graphic and calculate Inline graphic scores Inline graphic and corresponding Inline graphic-values at each site.

  5. Use a false discovery rate approach (FDR; [25], [26]) to account for multiple comparisons and estimate an appropriate p-value threshold (or the corresponding Inline graphic threshold) to determine sites that deviate significantly from the background null distribution

  6. Define candidate QTL regions as continuous runs of significant sites
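The last two steps of the pipeline can be sketched as follows. The example uses the Benjamini-Hochberg step-up rule as one common FDR procedure; whether this is the exact variant intended by the cited references is an assumption. The function names are hypothetical, and the input is assumed to be per-site p-values ordered by genomic position.

```python
import numpy as np

def bh_threshold(pvalues, fdr=0.01):
    """Largest p-value passing the Benjamini-Hochberg step-up rule, or None."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    m = p.size
    passed = p <= fdr * np.arange(1, m + 1) / m
    if not passed.any():
        return None
    return float(p[np.nonzero(passed)[0].max()])

def significant_runs(significant):
    """(start, end) index pairs, end inclusive, for each run of True values."""
    flags = np.concatenate([[False], np.asarray(significant, dtype=bool), [False]])
    # Positions where the flag switches mark run starts and (exclusive) ends.
    change = np.flatnonzero(flags[1:] != flags[:-1])
    return [(int(s), int(e) - 1) for s, e in zip(change[::2], change[1::2])]

if __name__ == "__main__":
    pvals = np.array([0.4, 0.001, 0.002, 0.6, 0.003, 0.9])
    cutoff = bh_threshold(pvals, fdr=0.05)
    if cutoff is not None:
        print(significant_runs(pvals <= cutoff))
```

In practice runs would be reported per chromosome, so the run-finding step would be applied separately to each chromosome's sites.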

Power Analysis

We used simulations to conduct a simple power analysis of our proposed methodology. In this analysis we used the mean Inline graphic at a causal site as a measure of power for given values of Inline graphic, Inline graphic, Inline graphic, window size (Inline graphic), SNP density, and for different magnitudes of QTL effect on phenotype. Figure 2 summarizes results for two different values of Inline graphic, corresponding to large (Inline graphic) and very large (Inline graphic) FInline graphic populations. We find that increasing coverage, Inline graphic, is advantageous until Inline graphic, but has minimal effect beyond that. A somewhat counterintuitive result is that larger bulk size, Inline graphic, is generally beneficial as long as sequencing coverage is modest to high. This is despite the fact that larger bulks imply weaker selection for a given Inline graphic (and hence a smaller allele frequency divergence among bulks). Based on these findings we recommend bulks consisting of at least 10% and perhaps as high as 20% of the FInline graphic segregant population in order to maximize power to detect QTLs.

Figure 2. Power analysis.

Average Inline graphic at a causal site as a function of sequencing coverage, Inline graphic, and bulk size, Inline graphic, for two different FInline graphic population sizes (left, Inline graphic; right, Inline graphic). Note the difference in scales between the two figures.

An Application to Yeast

To demonstrate the correspondence between theory and data we here draw on a BSA-sequencing data set generated to identify loci that contribute to variation in colony morphology in the budding yeast Saccharomyces cerevisiae [27]. A full description and analysis of these data will appear elsewhere (Granek et al., in prep). Here, these data serve to illustrate the utility of both our theoretical framework and the associated robust estimators for data analysis.

The yeast data consist of a low and high bulk, each composed of 288 homozygous diploid segregants drawn from an FInline graphic population of size Inline graphic generated by sporulating a naturally heterozygous diploid strain [28]. The low bulk consists of segregants with simple colony morphology, while the high bulk consists of segregants with complex colony morphology (see [27] for a description of morphology scoring). Creation of DNA pools, sequencing, and mapping of reads is described in the Methods section. Because each segregant is homozygous, the effective number of alleles sampled for each bulk is Inline graphic instead of 2 Inline graphic. In total 44,066 polymorphic sites were analyzed with a mean interval between sites of approximately 280 bp. Below we refer to the two sequencing runs for the low bulks as Inline graphic and Inline graphic, and those for the high bulks as Inline graphic and Inline graphic. The coverage per SNP (Inline graphic) for each sequencing run was as follows: Inline graphic, Inline graphic, Inline graphic, and Inline graphic. For each of the analyses below, we used a smoothing window width of Inline graphic (Inline graphic30 cM), and took the average coverage of each bulk being compared as the estimate of coverage, Inline graphic.

Because there are two sequencing runs per DNA pool, variation in allele frequency estimates between sequencing runs from the same segregant bulk should be exclusively due to stochastic aspects of the sequencing reaction and primary bioinformatics analyses (base calling, read alignment). The structure of this data set is thus useful for dissecting the impact of sequencing variation on estimates of Inline graphic and Inline graphic, and the subsequent impact of this variability on the inference of QTL regions and peaks. We use these data to explore both the null model (no QTL; by analyzing the low-vs-low and high-vs-high comparisons) as well as the case where QTLs are expected (comparing low-vs-high bulks). In the null case, the differences in allele frequencies are subject to only one source of variation because the bulks are fixed but sequencing is variable. The non-null analyses are individually affected by both sources of variation (bulking and sequencing), but when comparing the results from comparable analyses (e.g. comparing QTL peak locations between the Inline graphic-vs-Inline graphic and Inline graphic-vs-Inline graphic analyses), the differences are again simply a function of sequencing variation.

Null comparisons: Variation in Inline graphic and Inline graphic due to sequencing

The two low samples (Inline graphic and Inline graphic) and the two high samples (Inline graphic and Inline graphic) represent independent sequencing runs of the same low and high segregant bulks respectively. Using Inline graphic and Inline graphic from a comparison of Inline graphic vs. Inline graphic and Inline graphic vs. Inline graphic we can estimate the impact of sequencing on the variation of these statistics. When the two bulks differ only due to read number variation, there is only one source of variation, and the statistics of Inline graphic should be approximately Inline graphic with Inline graphic and Inline graphic. By invoking a weighted version of the central limit theorem [29], we find the distribution of Inline graphic should be approximately normal with Inline graphic and Inline graphic where Inline graphic, the sum of the Inline graphic squared kernel weights in the smoothing window (Inline graphic converges to Inline graphic in the case of a square kernel). As illustrated in Table 2 the observed data for the null-comparisons conform well to the asymptotic expectations.
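The dependence of the smoothed variance on the squared kernel weights can be sketched numerically. The example assumes that the per-SNP values within a window are independent with a common variance of 2 (the chi-square value with 1 d.f.) and that the weights are normalized to sum to one, so that the variance of the weighted average is the single-SNP variance times the sum of squared weights. The function name and inputs are hypothetical.

```python
import numpy as np

def smoothed_null_variance(weights, var_single=2.0):
    """Variance of a weighted average of independent, equal-variance values."""
    k = np.asarray(weights, dtype=float)
    k = k / k.sum()  # normalize so the weights sum to one
    return var_single * np.sum(k ** 2)

if __name__ == "__main__":
    n = 11
    square = np.ones(n)
    d = np.linspace(-1.0, 1.0, n)
    tricube = (1.0 - np.abs(d) ** 3) ** 3
    print("square kernel: ", smoothed_null_variance(square))
    print("tri-cube kernel:", smoothed_null_variance(tricube))
```

For equal weights over n SNPs the sum of squared weights is 1/n, the smallest possible value; a tri-cube kernel over the same SNPs concentrates weight near the center and therefore gives a somewhat larger variance.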

Table 2. Null comparisons for the yeast data set.
Comparison | Theoretical Inline graphic, Inline graphic | Observed Inline graphic, Inline graphic | Theoretical Inline graphic, Inline graphic | Observed Inline graphic, Inline graphic
Inline graphic-vs-Inline graphic | 1.000, 2.000 | 1.018, 2.050 | 1.000, 0.0124 | 1.020, 0.0115
Inline graphic-vs-Inline graphic | 1.000, 2.000 | 1.015, 2.077 | 1.000, 0.0124 | 1.014, 0.0117

Theoretical and observed means and variances of Inline graphic and Inline graphic for the null comparisons in the yeast data set.
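The theoretical moments for the smoothed statistic can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that the per-SNP values under the null are independent chi-square variates with one degree of freedom (mean 1, variance 2, matching the theoretical column of Table 2) and that the smoothed value is a weighted average whose weights are normalized to sum to one. The function name `smoothed_null_moments` and the window of 160 SNPs are illustrative assumptions, not values taken from the paper.

```python
def smoothed_null_moments(weights):
    """Mean and variance of a weighted average of independent chi-square(1) values.

    Independence between neighbouring SNPs is an idealization made for this sketch.
    """
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    norm = [w / total for w in weights]        # normalize so the weights sum to 1
    mean = 1.0                                 # E[chi-square(1)] = 1
    variance = 2.0 * sum(w * w for w in norm)  # Var[chi-square(1)] = 2
    return mean, variance


# Square (uniform) kernel over n SNPs: the sum of squared weights is 1/n,
# so the variance of the smoothed statistic is 2/n.
n = 160  # hypothetical number of SNPs in the smoothing window
mean, var = smoothed_null_moments([1.0] * n)
print(mean, var)
```

Under these assumptions a square kernel spanning roughly 160 SNPs gives a variance of 0.0125, which is the same order as the theoretical 0.0124 in Table 2. The kernel and window actually used are not restated in this section, so this comparison is only indicative.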

Between replicate comparisons of Inline graphic and Inline graphic in the presence of a QTL

In addition to tests of the null model, the design of the yeast experiment facilitates a between-replicate comparison of Inline graphic and Inline graphic in the presence of QTLs. There are four possible low-vs-high comparisons; here we focus on two of those, Inline graphic-vs-Inline graphic and Inline graphic-vs-Inline graphic. Figure 3 illustrates the relationships for Inline graphic and Inline graphic at each SNP for Inline graphic-vs-Inline graphic and Inline graphic-vs-Inline graphic. The between-replicate correlation for Inline graphic is Inline graphic0.677, while that for Inline graphic is Inline graphic0.996. This illustrates the ability of the smoothing kernel to act as a low-pass filter on the Inline graphic -statistic, filtering out the high-frequency noise associated with variation in read counts while preserving the underlying signal of QTLs and increasing the repeatability of the analysis.

Figure 3. Comparison of Inline graphic and Inline graphic between technical replicates.

The correspondence of raw Inline graphic (black) and smoothed Inline graphic values (red) for different sequencing runs of the same low-vs-high bulks from the yeast data set.
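The low-pass behaviour described above can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the authors' analysis: it uses a Nadaraya-Watson weighted average [13], [14] with a tricube kernel (the kernel choice here is an assumption), a made-up bell-shaped "QTL" signal, and Gaussian noise standing in for read-count variation. All names (`tricube_smooth`, `pearson`) and all numeric settings are hypothetical.

```python
import math
import random


def tricube_smooth(positions, values, half_width):
    """Kernel-weighted average of `values` around each position (tricube weights)."""
    smoothed = []
    for x0 in positions:
        num = 0.0
        den = 0.0
        for x, v in zip(positions, values):
            d = abs(x - x0) / half_width
            if d < 1.0:
                w = (1.0 - d ** 3) ** 3   # tricube kernel weight
                num += w * v
                den += w
        smoothed.append(num / den)        # den > 0 because x0 itself has weight 1
    return smoothed


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


# Synthetic example: one shared signal, two independent noisy "sequencing runs".
rng = random.Random(0)
positions = list(range(400))
signal = [1.0 + 20.0 * math.exp(-((x - 200) / 40.0) ** 2) for x in positions]
rep1 = [s + rng.gauss(0.0, 5.0) for s in signal]
rep2 = [s + rng.gauss(0.0, 5.0) for s in signal]

r_raw = pearson(rep1, rep2)
r_smooth = pearson(tricube_smooth(positions, rep1, 25.0),
                   tricube_smooth(positions, rep2, 25.0))
print(f"raw r = {r_raw:.3f}, smoothed r = {r_smooth:.3f}")
```

In this toy setting the smoothed replicates agree much more closely than the raw ones, which is the qualitative pattern reported for the yeast data. The magnitudes depend entirely on the assumed signal, noise level, and window width.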

Using the false discovery rate approach outlined above, we estimated cutoff values for Inline graphic using a FDR of 0.01 based on both our theoretical results (equations 8 and 12) and the corresponding non-parametric estimators. For the parametric estimate we used the following parameter values: Inline graphic, Inline graphic, Inline graphic. The estimated Inline graphic cutoff values are as follows: Inline graphic-vs-Inline graphic : 2.59 [parametric], 3.51 [non-parametric]; Inline graphic-vs-Inline graphic : 2.58 [parametric], 3.91 [non-parametric].
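The general shape of such a cutoff calculation can be sketched as follows. This is a hedged illustration, assuming (i) a log-normal null whose parameters are obtained from a given mean and variance by moment matching, as mentioned for Text S2, and (ii) the Benjamini-Hochberg step-up rule [25] as the FDR procedure. The function names are hypothetical, and the code does not reproduce equations 8 and 12 or the non-parametric estimators.

```python
import math


def lognormal_params(mean, variance):
    """Log-normal (mu, sigma) matching a given mean and variance."""
    sigma2 = math.log(1.0 + variance / mean ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)


def lognormal_upper_p(x, mu, sigma):
    """Upper-tail probability P(X >= x) for a log-normal variable, x > 0."""
    z = (math.log(x) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def bh_threshold(p_values, fdr):
    """Largest p-value declared significant by Benjamini-Hochberg, or None."""
    m = len(p_values)
    cutoff = None
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= fdr * rank / m:
            cutoff = p      # keep the largest p that passes its rank threshold
    return cutoff


# Example with the theoretical null moments listed in Table 2 (mean 1, variance 0.0124).
mu, sigma = lognormal_params(1.0, 0.0124)

# Hypothetical p-values, used only to show the step-up rule.
example_cutoff = bh_threshold([0.001, 0.008, 0.039, 0.041, 0.9], fdr=0.05)
print(mu, sigma, example_cutoff)
```

A cutoff on the smoothed statistic would then be the smallest observed value whose p-value is at or below the returned threshold. Note that the parameter values used for the parametric estimate in the text are not recoverable from this section, so the moments above are only an example input.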

Using the theoretical Inline graphic cutoff of 2.59 we find 7,845 SNPs have significant Inline graphic values for the Inline graphic-vs-Inline graphic comparison, and 8,011 significant SNPs for the Inline graphic-vs-Inline graphic comparison, representing approximately 17% of the polymorphic sites. Nearly 38% of the significant sites are on chromosome XIII, which appears to have multiple overlapping peaks leading to elevated Inline graphic values across much of the chromosome. The number of significant sites shared between the replicates is 7,330. We identified 12 significant regions (QTLs) in the two replicates (Figure 4). The QTLs are nearly identical between the replicates except for a marginal QTL on chromosome VII, where one of the replicates is significant but the other is just short of significance. To assess the variability in QTL location we compared the distance between peaks (using the single largest peak in cases of multiple peaks per chromosome). The mean and median absolute distances between nine comparable QTL peaks from the two comparisons are 5.08 Kb and 4.97 Kb respectively. The root mean square deviation (RMSD) between comparable QTL peaks is 6.7 Kb. Using the RMSD as a measure of spread and applying the 3Inline graphic rule of thumb, a conservative confidence interval for a QTL peak is Inline graphic20 Kb (Inline graphic7.4 cM) around the observed peak. The size of this confidence interval is a function of read depth and SNP density, and is a measure of variability in peak estimation due to sequencing only. This confidence interval does not include variation that would arise from the bulking of segregants.
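The summary statistics used for peak variability are simple to compute. The sketch below shows one way to obtain the mean and median absolute distance, the RMSD, and a three-times-RMSD half-width from paired peak positions. The peak positions are hypothetical placeholders (the individual yeast peak locations are not listed in this section), and the function name `peak_spread` is an assumption.

```python
import math
import statistics


def peak_spread(peaks_a, peaks_b):
    """Spread of paired QTL peak positions (same units in, same units out)."""
    diffs = [a - b for a, b in zip(peaks_a, peaks_b)]
    abs_diffs = [abs(d) for d in diffs]
    rmsd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return {
        "mean_abs": statistics.mean(abs_diffs),
        "median_abs": statistics.median(abs_diffs),
        "rmsd": rmsd,
        "three_sigma": 3.0 * rmsd,   # rule-of-thumb half-width
    }


# Hypothetical peak positions in Kb for two replicates (NOT the yeast data).
rep1 = [100.0, 250.0, 400.0]
rep2 = [103.0, 246.0, 400.0]
summary = peak_spread(rep1, rep2)
print(summary)

# Arithmetic from the text: an RMSD of 6.7 Kb gives a half-width of about 20 Kb.
reported_half_width = 3.0 * 6.7
print(reported_half_width)
```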

Figure 4. Yeast QTL Peaks.

Chromosomal distributions of Inline graphic for the Inline graphic-vs-Inline graphic (dark blue) and Inline graphic-vs-Inline graphic (light blue) data sets. The dashed red line indicates the estimated Inline graphic threshold corresponding to a FDR of 0.01. Regions above the red line are QTL regions; the highest point in each QTL region was called as the QTL peak.

As will be described elsewhere, candidate genes corresponding to several of the major peaks in this analysis have been functionally validated to affect yeast colony morphology (J. Granek and P. Magwene, unpublished data).

Discussion

The use of a test based on the Inline graphic -statistic provides a straightforward framework for analyzing BSA-sequencing data. The Inline graphic -statistic has several advantages over the use of allele frequency differences as the basis for QTL estimation (e.g. [11]). For example, as shown in the supporting information (Text S2), Inline graphic is expected to decrease much more rapidly around the causal site than bias in allele frequencies, implying narrower intervals of support around QTLs. Also in contrast to statistics based on the divergence of allele frequencies, Inline graphic takes into account the strength of evidence related to sample size. This feature of the Inline graphic -statistic can also potentially complicate analyses, as variance in read depth contributes to variance in Inline graphic over relatively small spatial scales. However, as we show above, weighted averaging of Inline graphic effectively smooths out ‘high frequency’ noise associated with sequencing variation.
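The point about sample size can be made concrete with a small example. The sketch below assumes the statistic is the standard G-test (likelihood-ratio) statistic of Sokal and Rohlf [12], computed on a 2×2 table of allele counts (rows are the low and high bulks, columns are the two alleles). The function name `g_statistic` and the read counts are hypothetical.

```python
import math


def g_statistic(low_counts, high_counts):
    """G-test statistic, 2 * sum(O * ln(O / E)), for a 2x2 table of allele counts."""
    table = [list(low_counts), list(high_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    grand = sum(row_totals)
    if grand == 0:
        raise ValueError("table has no reads")
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = row_totals[i] * col_totals[j] / grand
            if observed > 0:            # 0 * ln(0) is taken as 0
                g += observed * math.log(observed / expected)
    return 2.0 * g


# Same allele frequencies (0.6 vs 0.4) at two read depths.
g_50x = g_statistic((30, 20), (20, 30))
g_100x = g_statistic((60, 40), (40, 60))
print(g_50x, g_100x)
```

With the allele frequency difference held fixed, doubling the read depth doubles the statistic. A plain frequency difference would be identical in both cases, which is the contrast drawn in the paragraph above.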

Bulk Size and Sequencing Considerations

Our simulations suggest that, for the experimental design considered here, using bulk sizes as large as 15–20% of the phenotyped segregant population increases power to detect causal QTLs, despite the fact that this means relatively smaller allele frequency differences between bulks. This is due to tradeoffs between bulk-size, selection intensity, and the variance of allele frequencies under the hierarchical sampling. Consider, for example, a single locus with alleles Inline graphic and Inline graphic, where the effect of Inline graphic is additive and the two homozygotes differ by Inline graphic units on average. Assuming no segregation distortion, and an Inline graphic population generated from inbred lines, the change in the allele frequency of Inline graphic in the high bulk after truncation selection is approximately Inline graphic [30], [31] where Inline graphic is the intensity of selection, and Inline graphic is the ‘standardized effect of the locus’ (these quantities can be related to the selection coefficient, Inline graphic, by Inline graphic). Given truncation selection on a normal distribution, the intensity of selection is given by Inline graphic where Inline graphic is the proportion of selected individuals and Inline graphic is the probability density function at the truncation point [31]. Since the intensity of selection increases at a rate much less than Inline graphic (e.g. see [31], Fig. 11.3), an Inline graphic-fold decrease in Inline graphic results in a much less than Inline graphic-fold change in the intensity of selection. For example, let Inline graphic and consider truncation on the upper 20%, 10%, and 1% of the phenotypic distribution. The increase in the frequency of Inline graphic in the high bulk given these truncation points is approximately 3.5%, 4.4%, and 6.7% respectively (translating to allele frequency differences of 7%, 8.8%, and 13.4% in the two-bulk case).
On the other hand, the variance of the realized frequencies of the alleles in each bulk is inversely proportional to bulk size (Inline graphic). Thus, a twenty-fold decrease in bulk size translates to less than a two-fold increase in allele frequency divergence, but a twenty-fold increase in the variance of allele frequencies. As long as average coverage, Inline graphic, is moderate to large, the benefit of increasing Inline graphic offsets the relatively smaller penalty resulting from a decrease in selection intensity. However, there is little benefit to increasing sequencing coverage beyond the size of the bulks.
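The tradeoff can be explored numerically. The sketch below rests on two stated assumptions: the selection intensity under truncation of a standard normal trait is the density at the truncation point divided by the selected proportion [31], and each bulk is modelled as binomial sampling of individuals followed by binomial sampling of reads, with every segregant contributing one chromosome copy and equal DNA. The function names, the symbol names `p`, `q`, `bulk_size`, and `coverage`, and the example values are illustrative and not taken from the paper.

```python
from statistics import NormalDist


def selection_intensity(p):
    """Intensity of truncation selection when the top fraction p is selected."""
    std_normal = NormalDist()
    x = std_normal.inv_cdf(1.0 - p)   # truncation point on the standard normal
    z = std_normal.pdf(x)             # density at the truncation point
    return z / p


def two_stage_variance(q, bulk_size, coverage):
    """Variance of a read-based allele frequency estimate.

    Stage 1: bulk_size chromosomes drawn with allele frequency q.
    Stage 2: coverage reads drawn from the realized bulk frequency.
    """
    return q * (1.0 - q) * (1.0 / bulk_size + 1.0 / coverage
                                - 1.0 / (bulk_size * coverage))


for p in (0.20, 0.10, 0.01):
    print(p, round(selection_intensity(p), 3))

# A twenty-fold decrease in the selected fraction (20% to 1%) raises the
# intensity by less than two-fold.
ratio = selection_intensity(0.01) / selection_intensity(0.20)

# Coverage beyond the bulk size buys little: compare coverage equal to the
# bulk size with effectively unlimited coverage.
var_equal = two_stage_variance(0.5, 100, 100)
var_deep = two_stage_variance(0.5, 100, 10**9)
print(ratio, var_equal, var_deep)
```

Dividing the allele frequency shifts quoted in the text (3.5%, 4.4%, 6.7%) by the corresponding intensities gives a nearly constant ratio of about 0.025, consistent with the shift being proportional to the intensity. Under the sampling model assumed here, raising coverage from the bulk size to an arbitrarily large value reduces the variance by at most a factor of two, whereas the variance scales inversely with bulk size.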

Sequencing can introduce complications such as biases toward particular nucleotide calls; however, in general this should affect both segregant bulks in the same direction. Due to the averaging effect of Inline graphic, unless such biased sites are common over very large map distances they are unlikely to have substantial effects on results derived under our proposed framework. Similarly, a low percentage of mismapped reads or miscalled SNPs is unlikely to be problematic for our framework, again because of the averaging effect of Inline graphic. However, caution should be exercised in genomic regions that are particularly problematic in this regard, such as repeat-rich regions.

Other Experimental Designs

In this paper we have focused on QTL mapping with an FInline graphic experimental design, but clearly our framework can be extended to other designs. Common alternatives include mapping populations produced by imposing one or more generations of inbreeding on an FInline graphic, such as Recombinant Inbred Lines (RILs). The increased homozygosity of such populations should also be taken into consideration, as it increases the expected change in allele frequency due to selection but also decreases the number of independent chromosomes that are sampled for a given number of selected individuals. Chromosomes in such RILs experience up to twice as many crossovers as those in FInline graphic populations, so the physical size of the smoothing window Inline graphic should be reduced to take this reduced linkage disequilibrium into account. Even greater reductions of linkage disequilibrium can be accomplished by an alternative design that imposes additional generations of random mating, rather than inbreeding, on an FInline graphic, resulting in more precise localization of QTLs. Additional generations of outcrossing (beyond the FInline graphic) will likely magnify deviations of the null allele frequency from 0.5 owing to segregation distortion and/or inadvertent selection. This can be accommodated by application of formulas in Text S1 with Inline graphic estimated from all sites within a genomic window.

Other experimental designs, such as backcrosses, will not have allele frequencies of 0.5. For these situations the null expected distributions of Inline graphic and Inline graphic can be approximated using the equations presented in Text S1, although in this case it will be necessary to know the parental origin of the SNP alleles. Similarly, since Inline graphic can be generalized to an arbitrary number of classes [12], one-tailed scenarios (e.g. [9]) involving comparison to either a theoretical population or a random sampling of segregants can be addressed in this framework.

Methods

Sequencing of Yeast Bulks

To create the bulked DNA pools, each segregant was grown overnight in liquid medium to saturation (Inline graphic cells/ml) and equal volumes of each culture were mixed to form cell bulks. Genomic DNA was isolated from the cell bulks and a single Illumina DNA sequencing library was prepared from each bulk, using standard protocols as described in [28]. Each bulk DNA pool was sequenced twice using 50 bp reads on an Illumina GAII sequencing instrument. Approximately 15 M reads were generated in each sequencing run. Reads were aligned to the yeast reference genome (obtained from the Saccharomyces Genome Database, January 2010) using the program BWA [32] and polymorphic sites were called using SAMtools [33]. For each sequencing run, SAMtools was used to create a pileup file giving the alleles at each polymorphic site, from which allele counts were derived using scripts written in Python.
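The authors' scripts are not shown, but the step of turning a pileup record into allele counts can be sketched. The following is a minimal illustration that assumes the standard SAMtools pileup encoding of the read-bases column (`.` and `,` for reference matches, letters for mismatches, `^` plus one character for a read start, `$` for a read end, `+n`/`-n` followed by n bases for indels, `*` for a deleted base). The function name `count_pileup_bases` is hypothetical, and column layout, quality filtering, and SNP calling are deliberately left out.

```python
def count_pileup_bases(ref_base, read_bases):
    """Count A/C/G/T observations encoded in one pileup read-bases string."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    ref = ref_base.upper()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            i += 2                      # skip read-start marker and mapping quality
            continue
        if ch in "+-":
            j = i + 1
            while j < len(read_bases) and read_bases[j].isdigit():
                j += 1
            length = int(read_bases[i + 1:j]) if j > i + 1 else 0
            i = j + length              # skip the inserted or deleted sequence
            continue
        if ch in ".,":
            if ref in counts:
                counts[ref] += 1        # match to the reference base
        elif ch.upper() in counts:
            counts[ch.upper()] += 1     # mismatch on either strand
        # '$', '*', 'N' and any other symbols are ignored
        i += 1
    return counts


# Hypothetical read-bases string for a site whose reference base is A.
example = count_pileup_bases("A", "..,,G^]g$-2AT.,+1Cg")
print(example)
```

Counts from the low and high bulks at each polymorphic site would then form the table of allele counts from which the test statistic is computed.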

Supporting Information

Figure S1

Simulation results for the null distribution of Inline graphic based on 10,000 simulations with (Inline graphic, Inline graphic, Inline graphic). The gray histogram represents the observed distribution of Inline graphic, corresponding to Figure 1b. The dashed lines represent log-normal distributions estimated from theoretical expectation (red line) or via the non-parametric approach described in the text (black line). Both the parametric and non-parametric approaches provide good control of type I error (right tail of the distribution).

(PDF)

Text S1

Generalization of theoretical results to include segregation distortion.

(PDF)

Text S2

Miscellaneous information. This file includes information on: 1) estimation of the parameters of a log-normal distribution from the expected mean and variance of a variable of interest; 2) the expected shape of the Inline graphic around a QTL; and 3) a summary table of expected and observed means and variances of Inline graphic based on simulations of the null hypothesis (no QTL).

(PDF)

Acknowledgments

We thank Joshua Granek and Debra Murray who helped to generate the yeast BSA data set. We thank Stuart McDonald for conversations and feedback. We thank the Duke University Institute for Genome Sciences & Policy Sequencing Facility for the sequencing of genomic libraries.

Footnotes

The authors have declared that no competing interests exist.

This research was supported by NIH grant P50GM081883-04 (to PMM), NIH grant R01-GM073990 (to JKK and JHW), NSF grant DEB-10-19753 (to PMM) and NSF grant IOS-10-24966 (to JHW). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Michelmore RW, Paran I, Kesseli RV. Identification of markers linked to disease-resistance genes by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using segregating populations. Proc Natl Acad Sci U S A. 1991;88:9828–9832. doi: 10.1073/pnas.88.21.9828.
  • 2. Ehrenreich IM, Gerke JP, Kruglyak L. Genetic dissection of complex traits in yeast: insights from studies of gene expression and other phenotypes in the BYxRM cross. Cold Spring Harb Symp Quant Biol. 2009;74:145–153. doi: 10.1101/sqb.2009.74.013.
  • 3. Winzeler EA, Richards DR, Conway AR, Goldstein AL, Kalman S, et al. Direct allelic variation scanning of the yeast genome. Science. 1998;281:1194–1197. doi: 10.1126/science.281.5380.1194.
  • 4. Borevitz JO, Liang D, Plouffe D, Chang HS, Zhu T, et al. Large-scale identification of single-feature polymorphisms in complex genomes. Genome Res. 2003;13:513–523. doi: 10.1101/gr.541303.
  • 5. Brauer MJ, Christianson CM, Pai DA, Dunham MJ. Mapping novel traits by array-assisted bulk segregant analysis in Saccharomyces cerevisiae. Genetics. 2006;173:1813–1816. doi: 10.1534/genetics.106.057927.
  • 6. Segrè AV, Murray AW, Leu JY. High-resolution mutation mapping reveals parallel experimental evolution in yeast. PLoS Biol. 2006;4:e256. doi: 10.1371/journal.pbio.0040256.
  • 7. Boer VM, Amini S, Botstein D. Influence of genotype and nutrition on survival and metabolism of starving yeast. Proc Natl Acad Sci U S A. 2008;105:6930–6935. doi: 10.1073/pnas.0802601105.
  • 8. Demogines A, Smith E, Kruglyak L, Alani E. Identification and dissection of a complex DNA repair sensitivity phenotype in baker's yeast. PLoS Genet. 2008;4:e1000123. doi: 10.1371/journal.pgen.1000123.
  • 9. Ehrenreich IM, Torabi N, Jia Y, Kent J, Martis S, et al. Dissection of genetically complex traits with extremely large pools of yeast segregants. Nature. 2010;464:1039–1042. doi: 10.1038/nature08923.
  • 10. Wenger JW, Schwartz K, Sherlock G. Bulk segregant analysis by high-throughput sequencing reveals a novel xylose utilization gene from Saccharomyces cerevisiae. PLoS Genet. 2010;6:e1000942. doi: 10.1371/journal.pgen.1000942.
  • 11. Parts L, Cubillos FA, Warringer J, Jain K, Salinas F, et al. Revealing the genetic structure of a trait by sequencing a population under selection. Genome Res. 2011;21:1131–1138. doi: 10.1101/gr.116731.110.
  • 12. Sokal RR, Rohlf FJ. Biometry. W. H. Freeman; 1994.
  • 13. Nadaraya EA. On estimating regression. Theor Probab Appl. 1964;9:141–142.
  • 14. Watson GS. Smooth regression analysis. Sankhyā. 1964;26:175–184.
  • 15. Schucany WR. Kernel smoothers: An overview of curve estimators for the first graduate course in nonparametric statistics. Statist Sci. 2004;19:663–675.
  • 16. Cleveland WS. Robust locally weighted regression and smoothing scatterplots. J Amer Stat Assoc. 1979;74:829–836.
  • 17. Mohn E. Confidence estimation of measures of location in the log normal distribution. Biometrika. 1979;66:567–575.
  • 18. Davies L, Gather U. The identification of multiple outliers. J Amer Stat Assoc. 1993;88:782–792.
  • 19. Bickel DR, Frühwirth R. On a fast, robust estimator of the mode: Comparisons to other robust estimators with applications. Comput Stat Data An. 2006;50:3500–3530.
  • 20. Rousseeuw PJ, Croux C. Alternatives to the median absolute deviation. J Amer Stat Assoc. 1993;88:1273–1283.
  • 21. Turin GL. An introduction to matched filters. IEEE Trans Inform Theory. 1960;6:311–329.
  • 22. Enke CG, Nieman TA. Signal-to-noise ratio enhancement by least-squares polynomial smoothing. Anal Chem. 1976;48:705–712A.
  • 23. Gu H, Gao R. Resolution of overlapping echoes and constrained matched filter. IEEE Trans Signal Proc. 1997;45:1854–1857.
  • 24. Mikl M, Mareček R, Hluštík P, Pavlicová M, Drastich A, et al. Effects of spatial smoothing on fMRI group inferences. Magn Reson Imaging. 2008;26:490–503. doi: 10.1016/j.mri.2007.08.006.
  • 25. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc B. 1995;57:289–300.
  • 26. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001;29:1165–1188.
  • 27. Granek JA, Magwene PM. Environmental and genetic determinants of colony morphology in yeast. PLoS Genet. 2010;6:e1000823. doi: 10.1371/journal.pgen.1000823.
  • 28. Magwene PM, Kayıkçı Ömür, Granek JA, Reininga JM, Scholl Z, et al. Outcrossing, mitotic recombination, and life-history trade-offs shape genome evolution in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A. 2011;108:1987–1992. doi: 10.1073/pnas.1012544108.
  • 29. Weber M. A weighted central limit theorem. Stat Probabil Lett. 2006;76:1482–1487.
  • 30. Kimura M, Crow JF. Effect of overall phenotypic selection on genetic change at individual loci. Proc Natl Acad Sci U S A. 1978;75:6168–6171. doi: 10.1073/pnas.75.12.6168.
  • 31. Falconer DS, Mackay TFC. Introduction to quantitative genetics, 4th edition. Longman; 1996.
  • 32. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
  • 33. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
