Abstract
Pooled sequencing can be a cost-effective approach to disease variant discovery, but its applicability in association studies remains unclear. We compare sequence enrichment methods coupled to next-generation sequencing in non-indexed pools of 1, 2, 10, 20 and 50 individuals and assess their ability to discover variants and to estimate their allele frequencies. We find that pooled resequencing is most usefully applied as a variant discovery tool due to limitations in estimating allele frequency with high enough accuracy for association studies, and that in-solution hybrid-capture performs best among the enrichment methods examined regardless of pool size.
Introduction
Genome-wide association studies (GWAS) have precipitated a dramatic rise in the discovery of novel, robustly-associated complex trait loci. As the majority of these signals involve common alleles with modest or small effect sizes, a large proportion of genetic variance remains unexplained. Low frequency (minor allele frequency [MAF] <0.05) and rare (MAF <0.01) variants may be associated with complex traits and help account for the ‘missing’ heritability [1], [2] (for example as recently shown for hypertriglyceridemia [3]). A comprehensive catalogue of rare variants does not yet exist, although large-scale resequencing efforts such as the 1000 Genomes (1KG) [4] and UK10K (www.uk10k.org) Projects are enhancing our understanding of human sequence variation.
Experimental costs associated with variant discovery have been drastically reduced through the advent of next-generation sequencing technologies; however, whole-genome deep sequencing of individual samples in large disease association studies remains prohibitively expensive and likely will for some time. Pooling DNA samples could empower cost-efficient sequence variant identification and allele frequency estimation. This can in theory enable comparisons between disease cases and controls, bypassing the need for exhaustive genotyping, and allowing the identification of promising novel association signals, for example as applied to the discovery of the type 1 diabetes IFIH1 locus [5].
Non-indexed, or non-barcoded, pools (which form the focus of this study) do not enable the assignment of variants to individuals, but have lower associated costs. Even though targeted resequencing in pools has attractive attributes that may facilitate disease association studies, technical and analytical parameters central to this study design have not been empirically evaluated yet. Pooling studies are sensitive to DNA quantification and pool construction. The choice of target enrichment method is important. PCR is difficult to multiplex, optimize and normalize, but can be highly effective. The ability of PCR-enriched targeted resequencing to sensitively identify low frequency and rare variants and estimate their frequency in non-indexed pools has been established, but evaluations have been restricted to small-scale experiments investigating up to 300 kb [5]–[12]; however, most post-GWAS sequencing efforts target several megabases. Hybrid-capture methods (array-based [aHC] and in-solution hybrid-capture [sHC]) are easy to multiplex and enable large-scale experiments [13]–[17]. A recent investigation illustrated that they can be successfully applied to the targeted resequencing of 2.6 Mb in individual samples [18], but their effectiveness in pooled samples is not clear.
Here, we assess variant detection and frequency estimation of different sequence enrichment methods (long-range PCR, aHC and sHC) in non-indexed pools of 1, 2, 10, 20 and 50 samples (Tables S1, S2) across six genomic regions encompassing coding and non-coding sequence (1.6 Mb in total, Table S3), and evaluate the feasibility of these approaches in the context of complex disease association studies. Specifically, we evaluate the uniformity of target coverage, the sensitivity and specificity of variant detection and the accuracy of frequency estimation in non-indexed pools of different sizes and across different enrichment methods for the first time.
Results
Alignment of Reads to Target
Enrichment specificity can be assessed by comparing the proportion of sequencing reads that map to the target regions. The lower the specificity, the higher the sequencing capacity required to achieve the desired target coverage. We observed large variability in the total number of reads produced by each of the three enrichment methods (Table 1, Table S4). This variability is also evident for the PCR and aHC technical replicates we conducted (for the Pool of 20; Tables S5, S6). It is common practice in whole genome sequencing (WGS) to remove potential duplicate reads to avoid biases in coverage analyses as well as downstream analyses, but applying this practice in pooled targeted sequencing of a relatively small target region with a high depth of coverage is still a matter of debate. Therefore we calculated alignment statistics both before and after removing potential duplicate reads. PCR showed the highest percentage of sequencing reads that map to the target region both before and after duplicate read removal (Table 1, Table S4). Conversely, for both aHC and sHC a higher proportion of the on-target reads had good mapping quality scores (≥20), both before and after duplicate read removal (Table 1, Table S4). The mapping quality score of reads is an important factor in accurate variant detection, and the specificity of target enrichment impacts directly on target coverage.
Table 1. Target sequence enrichment success before duplicate removal.
Pool | Number of Lanes | Total Number of Reads | % Reads Mapped to Reference^a | % Reads Mapped to Target^a | % Reads Mapped to Target w/ Q≥20^b |
1 PCR | 1 | 44,232,852 | 48.97 | 46.05 | 44.27 |
1 aPD | 1 | 61,487,334 | 95.80 | 21.82 | 21.58 |
1 sPD | 1 | 35,813,898 | 97.90 | 46.55 | 45.95 |
2 PCR | 1 | 30,843,770 | 97.92 | 85.97 | 79.61 |
2 aPD | 1 | 58,352,664 | 92.19 | 13.07 | 12.91 |
2 sPD | 1 | 29,554,192 | 97.50 | 46.96 | 46.36 |
10 PCR | 2 | 55,278,922 | 84.51 | 73.44 | 67.02 |
10 aPD | 2 | 90,319,688 | 96.44 | 18.62 | 18.15 |
10 sPD | 2 | 85,783,964 | 97.83 | 48.13 | 47.48 |
20 PCR | 3 | 121,378,560 | 89.33 | 80.88 | 75.37 |
20 aPD | 3 | 103,231,280 | 97.24 | 34.05 | 33.44 |
20 sPD | 3 | 111,444,476 | 97.11 | 45.91 | 45.31 |
50 PCR | 7 | 132,547,082 | 99.74 | 70.90 | 67.42 |
50 aPD | 7 | 251,257,124 | 96.02 | 22.62 | 22.27 |
50 sPD | 7 | 295,115,044 | 97.52 | 49.97 | 49.30 |
For each pool and sequence enrichment method this table details the total number of reads generated for the pool, the percentage of total reads mapped to the reference genome, the percentage of total reads mapped to the target regions, and the percentage of total reads that mapped to the target regions with mapping quality ≥20. The total number of reads for a pool is calculated from the fastq file(s) generated for each lane of sequencing. The percentage of reads mapped to the reference is calculated from the BAM file generated from merging all the Maq map files for each lane for a pool. The percentage of reads mapped to the target regions is calculated as the number of reads with at least one base overlapping a target region divided by the total number of reads. The percentage of reads mapped to the target regions with a mapping quality score ≥Q20 is calculated as the number of reads with at least one base overlapping a target region with mapping quality ≥Q20 divided by the total number of reads.
a: Calculated by samtools view -c.
b: Calculated by samtools view -c -q 20.
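The on-target percentages defined in the table legend can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (which relied on samtools); the tuple layouts, function names and toy reads are assumptions made for illustration. A read counts as on-target when at least one base overlaps a target interval, and the denominator is the total number of reads, as in the legend.

```python
from typing import Iterable, List, Tuple

Target = Tuple[str, int, int]     # (chromosome, start, end), 0-based, end-exclusive
Read = Tuple[str, int, int, int]  # (chromosome, start, end, mapping quality)


def overlaps_any_target(chrom: str, start: int, end: int,
                        targets: List[Target]) -> bool:
    """True if the read shares at least one base with any target interval."""
    return any(
        chrom == t_chrom and start < t_end and t_start < end
        for t_chrom, t_start, t_end in targets
    )


def percent_on_target(reads: Iterable[Read], targets: List[Target],
                      total_reads: int, min_mapq: int = 0) -> float:
    """Percentage of all reads that overlap a target with MAPQ >= min_mapq."""
    on_target = sum(
        1
        for chrom, start, end, mapq in reads
        if mapq >= min_mapq and overlaps_any_target(chrom, start, end, targets)
    )
    return 100.0 * on_target / total_reads


# Hypothetical toy data, not taken from the experiment.
toy_targets = [("chr1", 100, 200)]
toy_reads = [
    ("chr1", 50, 101, 30),   # overlaps the target by one base
    ("chr1", 200, 250, 30),  # starts just past the target, no overlap
    ("chr1", 150, 180, 10),  # on target but low mapping quality
    ("chr2", 100, 200, 40),  # different chromosome
]
print(percent_on_target(toy_reads, toy_targets, total_reads=4))
print(percent_on_target(toy_reads, toy_targets, total_reads=4, min_mapq=20))
```

The linear scan over targets keeps the sketch short; for millions of reads an interval index would be the more practical choice.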
Target Coverage Depth and Uniformity
Target coverage depth directly affects the ability to detect variants, and depth is affected by the removal of potential duplicate reads. The higher enrichment specificity of PCR resulted in a higher overall mean read depth for target bases as compared to aHC and sHC, taking pool size and number of lanes sequenced into account regardless of duplicate read removal (Figure 1; Figures S1, S2, S3; Tables S7, S8). PCR yielded a higher percentage of target bases covered at ≥20× per individual across all pool sizes (Figure 1; Figure S1). However, target regions were not covered in a uniform way. For example, we found different coverage of protein coding versus non-coding target regions, with duplicate read removal affecting the depth by approximately 100–200 reads but not the overall trend (Tables S9, S10; Figures S4, S5, S6, S7). Both aHC and sHC preferentially covered protein coding regions over non-coding regions across all pool sizes, whereas PCR demonstrated a bias in the opposite direction (Tables S9, S10; Figures S4, S5, S6, S7, S8, S9, S10, S11; t-test p-value <0.05 in all pools, for all methods). The same trends were observed in the technical replicates conducted (Figures S8, S9, S10, S11). An analysis of %GC, repeat and low complexity regions in the protein coding and non-coding target regions (Table S11) showed that non-coding DNA contains a higher proportion of repeat elements, thereby making it difficult to design highly specific oligonucleotide probes, affecting coverage for the hybrid capture methods. PCR experiments tended to favour the overall lower GC content of non-coding regions (Figures S12, S13, S14).
Variant Detection Sensitivity and Specificity
Variant discovery is linked with coverage depth, but study design power importantly also depends on a balance between false positive and false negative variant discovery rates. A major reason for the removal of duplicate reads is to remove biases in variant detection and calling. To address issues related to removing duplicate reads in variant detection and frequency estimation in pooled targeted resequencing we analyzed all pools with the removal of duplicate reads before variant calling, and pools of 1, 10 and 50 individuals for the PCR and sHC enrichment without the removal of duplicate reads. We found the total number of called variants to increase with pool size, in keeping with the variants known to be present in each pool (Tables S12, S13). The removal of potential duplicate reads reduces the total number of variants called, with the effect being largest for PCR enrichment and for larger pools (Tables S12, S13). As the number of sequence-identified variants increased, the proportion present in dbSNP129 decreased regardless of duplicate read removal (Tables S14, S15). This trend could either be due to a higher false positive rate in larger pools, or to the fact that deep sequencing identified variants not present in dbSNP. We utilized HapMap, Illumina chip and 1KG data available for the pooled individuals to directly address questions of false positive and false negative rates (Table S1). sHC demonstrated the highest sensitivity to detect HapMap variants across all pool sizes and for both removing and not removing duplicate reads, except in the case of enriching a single individual after duplicate read removal (in which case aHC performed best; Table 2, Table S16). The removal of duplicate reads has a dramatic effect on the sensitivity in the pool of 1 enriched by PCR. Although the pre-duplicate read removal sensitivity is higher overall the difference in sensitivity is only approximately 1–3%. 
The same trend was observed when considering 1KG variants and the union of all known variants (Tables S19, S20, S21, S22).
Table 2. HapMap variation detection sensitivity after duplicate removal.
Method | Pool of 1 (1089)^a | Pool of 2 (1459)^a | Pool of 10 (1999)^a | Pool of 20 (2067)^a | Pool of 50 (2145)^a |
PCR | 26.26 | 87.46 | 92.35 | 96.27 | 95.80 |
aHC | 97.15 | 85.33 | 96.60 | 97.82 | 94.41 |
sHC | 94.12 | 95.07 | 98.30 | 98.16 | 96.88 |
This table contains the percentage of the known HapMap variants with at least one non-reference allele in the pool that each pool and enrichment method discovered (true positives). The false negative rate is 100 minus this value.
a: number of non-reference HapMap variants in pool.
We found that PCR had overall lower sensitivity to detect known singleton HapMap variants compared to HC methods (Table S23). Similarly, HC methods showed higher sensitivity to detect the variants identified in the single-individual pool, particularly after duplicate read removal (Tables S24, S25), and sHC generally performed better than aHC. The ability to accurately call variants depends on sequence coverage, and the depth is affected by duplicate read removal. The read depth of false negative HapMap variants was significantly different to that of true positives for both HC methods across pools of 2–50 individuals (Figures S15, S16, S17; data not shown for pools of 2 and 50) (t-test p-value <0.05 in all cases). A similar trend was observed for PCR (Figures S15, S16, S17). For both hybrid capture methods there was a trend towards a lower GC content in 200 base-pair regions around false negative HapMap variants compared to true positive variants, and the pattern was similar before and after duplicate read removal (Figures S18, S19, S20). This trend was not as prominent for the PCR experiments. The ability to call variants is also tied to the frequency of the variant in the pool. The false negative HapMap variants tended to have lower allele frequencies in the pools compared to true positives, and this trend was accentuated before duplicate read removal (Figures S21, S22, S23, S24, S25). This is in keeping with the fact that false negatives have lower depth of coverage, making low frequency variant detection more difficult.
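The link between pool size, read depth and the detection of low frequency variants can be made concrete with a simple binomial sketch in Python. A singleton in a pool of N diploid individuals has an expected allele fraction of 1/(2N). The model below assumes error-free reads and equal representation of every individual in the pool, and the minimum number of supporting reads is a hypothetical threshold, not the calling rule used by Syzygy.

```python
from math import comb


def singleton_fraction(pool_size: int) -> float:
    """Expected allele fraction of a variant carried on one chromosome in the pool."""
    return 1.0 / (2 * pool_size)


def prob_at_least(depth: int, allele_fraction: float, min_alt_reads: int) -> float:
    """P(X >= min_alt_reads) for X ~ Binomial(depth, allele_fraction)."""
    below = sum(
        comb(depth, k) * allele_fraction ** k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - below


# Hypothetical scenario: singleton in a pool of 50, requiring at least 3 alternate reads.
p = singleton_fraction(50)
for depth in (100, 500, 1000):
    print(depth, round(prob_at_least(depth, p, min_alt_reads=3), 3))
```

Under these assumptions the probability of seeing enough supporting reads rises with depth, which is in line with the lower coverage observed at false negative sites.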
We found specificity (true negative rate), calculated on the basis of HapMap loci monomorphic in the pooled samples, to decrease as the complexity of the pool increased, and for a given pool the specificity was higher after duplicate read removal (Table 3, Table S17). False positives could be ascribed to genotype misclassification in HapMap or to sequencing error in our experiment. To resolve this, we examined data across 22 of the pooled samples present in both HapMap and 1KG. 1KG data corroborate the pooled sequencing findings across over 92% of overlapping loci for pools of more than one sample after duplicate reads are removed. For sHC, the concordance is 100% regardless of pool size when duplicate reads are removed, but is reduced to 95% when duplicates are included for the pool of 1 individual (Table 4, Table S18). The inclusion of duplicate reads uniformly increases the proportion of calls corroborated by 1KG for PCR. We examined the rate of genotype discordance between HapMap and 1KG at all sites in the regions examined for the 22 samples and found it to be 1.8%. Given the deep coverage of target bases in our experiment and concordance with 1KG we infer that the calculated false positive rates are likely to be overestimates.
Table 3. HapMap variation detection specificity after duplicate removal.
Method | Pool of 1 (1722)^a | Pool of 2 (1353)^a | Pool of 10 (683)^a | Pool of 20 (590)^a |
PCR | 99.88 | 98.97 | 97.66 | 96.95 |
aHC | 98.84 | 98.67 | 97.22 | 96.61 |
sHC | 99.07 | 98.74 | 97.22 | 96.95 |
This table contains the percentage of the known HapMap variants with no non-reference alleles and no missing genotypes in the pool that each pool and enrichment method correctly did not call as a variant (true negatives). The false positive rate is 100 minus this value.
a: number of reference HapMap variants in pool.
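The sensitivity and specificity reported in Tables 2 and 3 follow from a comparison between the set of called sites and the known genotypes. The Python sketch below shows the arithmetic under a simplifying assumption: every site in the input has complete genotype data for all pooled individuals. The function name, the site keys and the toy values are hypothetical.

```python
from typing import Dict, Hashable, Set, Tuple


def sensitivity_specificity(known_alt_counts: Dict[Hashable, int],
                            called_sites: Set[Hashable]) -> Tuple[float, float]:
    """Return (sensitivity %, specificity %) for one pool.

    known_alt_counts maps each genotyped site to the number of non-reference
    alleles carried by the pooled individuals; called_sites holds the sites
    reported as variant by the sequencing experiment.
    """
    variant_sites = {s for s, n in known_alt_counts.items() if n > 0}
    monomorphic_sites = {s for s, n in known_alt_counts.items() if n == 0}

    true_positives = len(variant_sites & called_sites)
    true_negatives = len(monomorphic_sites - called_sites)

    sensitivity = (100.0 * true_positives / len(variant_sites)
                   if variant_sites else float("nan"))
    specificity = (100.0 * true_negatives / len(monomorphic_sites)
                   if monomorphic_sites else float("nan"))
    return sensitivity, specificity


# Hypothetical toy data: two variant sites and two monomorphic sites.
known = {("chr1", 10): 2, ("chr1", 20): 0, ("chr1", 30): 1, ("chr1", 40): 0}
called = {("chr1", 10), ("chr1", 40)}
print(sensitivity_specificity(known, called))
```

The false negative and false positive rates are then 100 minus the first and second value respectively, as stated in the table legends.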
Table 4. 1KG support for HapMap false positive loci after duplicate removal.
Method | Pool of 1 | Pool of 2 | Pool of 10 | Pool of 20 |
PCR | 2(50%) | 14(100%) | 15(93.33%) | 14(92.86%) |
aHC | 19(94.74%) | 17(94.12%) | 16(100%) | 15(100%) |
sHC | 16(100%) | 16(100%) | 16(100%) | 16(100%) |
This table contains the number of loci considered false positives based on HapMap data that are present in 1KG, and the percentage of these overlapping loci for which the 1KG data support the presence of non-reference alleles in the pool.
Variant Frequency Estimation
The usefulness of pooled sequencing approaches in complex trait studies is primarily encapsulated by the ability to perform association tests through allele frequency estimate comparisons between pools of disease cases and controls. We compared estimated allele frequencies from the resequenced pools with those from HapMap and 58BC data and found that the sHC designs achieve the highest accuracy (Figures 2–3, Figures S26, S27, S28). The accuracy of frequency estimates improved with increasing pool size and was higher after duplicate read removal. The correlation between estimated allele frequency from sequencing the pool of 50 and from known genotypes was 95.8%, 97.9%, and 99.0% for PCR, aHC, and sHC respectively when duplicate reads were removed (Figure 2). However, when duplicate reads were included in the analysis the correlation in the same pool increased slightly for the PCR enrichment and dropped slightly for the HC methods (Figure 3). The decrease in correlation between true and estimated allele frequency pre-duplicate read removal was also seen for the pool of 10 individuals (Figures S26, S27). The allele frequency estimates appear to be stable and robust. For example, frequency estimates from the technical replicates of the Pool of 20 have a correlation of 98.59% for PCR and 99.31% for aHC (Figures S29, S30). Overall, pooled sequencing resulted in under-estimates of the true allele frequency regardless of duplicate read removal (Tables S26, S27).
We found the per-individual read depth at called variants to be weakly correlated with frequency estimate accuracy, and to vary across enrichment methods (Figures S31, S32, S33, S34). The inclusion of potential duplicate reads before the analysis increased this correlation (Figures S32, S35). There was a stronger correlation between the number of variant alleles in the pool and the accuracy of the allele frequency estimates (Figures S36, S37, S38, S39, S40). This correlation was also higher when potential duplicate reads were included in the analysis (Figures S37, S40). Interestingly, the higher the number of variant alleles in the pool, the worse the allele frequency estimates, a trend consistently observed across all enrichment methods and pool sizes. Specifically, we observed that low frequency variants tended to be more accurately estimated (Figures 2–3; Figures S26, S27, S28, S36, S37, S38, S39, S40).
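The comparison between estimated and known allele frequencies can be sketched in Python as follows. The known frequency is the number of non-reference alleles divided by twice the pool size. The read-count ratio used here for the pooled estimate is a naive stand-in, not the Syzygy estimator, and all function names and inputs are assumptions for illustration.

```python
from math import sqrt
from typing import Sequence


def true_allele_frequency(alt_alleles_per_individual: Sequence[int]) -> float:
    """Non-reference allele frequency from diploid genotypes coded as 0, 1 or 2."""
    return sum(alt_alleles_per_individual) / (2 * len(alt_alleles_per_individual))


def naive_pool_frequency(alt_reads: int, depth: int) -> float:
    """Naive estimate: fraction of reads at the site carrying the non-reference allele."""
    return alt_reads / depth


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def mean_signed_error(estimated: Sequence[float], truth: Sequence[float]) -> float:
    """Mean of (estimate - truth); a negative value indicates under-estimation."""
    return sum(e - t for e, t in zip(estimated, truth)) / len(truth)


# Hypothetical toy values for three sites.
truth = [true_allele_frequency(g) for g in ([0, 1, 0, 0], [1, 1, 2, 0], [2, 2, 1, 2])]
estimates = [naive_pool_frequency(a, d) for a, d in ((10, 100), (45, 100), (80, 100))]
print(pearson(estimates, truth), mean_signed_error(estimates, truth))
```

Reporting the signed error next to the correlation matters because a high correlation can coexist with a systematic under-estimate of the kind described above.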
Reproducibility of Results
Reproducibility was assessed by performing technical replicates for PCR and aHC for the Pool of 20 individuals as a representative example. The HC replicates yielded more consistent results in terms of the number of reads produced and median coverage of target bases (Tables S5, S6). The sensitivity of HapMap variant detection varied by 4% between PCR replicates, and 2% between aHC replicates (Table S28). We next considered the number of variants that overlap between replicates as a function of the total number of unique variants called across replicates. The overlap rates of called variants across pairs of replicates were low (59%) for both PCR and aHC (Table S29). For variants called in both technical replicates the correlation between estimated allele frequencies was found to be high (98.6% and 99.3% for PCR and aHC respectively) (Figures S29, S30). When comparing allele frequencies for these overlapping variants (i.e. expecting identical estimates under an ideal experimental scenario), we found an average absolute allele frequency difference of 2.7% for PCR (across 7,233 overlapping variants) and 2.1% for aHC (6,713 variants) (Table S29).
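The two replicate metrics described above, the overlap of called variants as a fraction of all unique variants and the mean absolute difference in allele frequency at shared variants, can be computed as in this Python sketch. The dictionary layout (site mapped to estimated allele frequency) and the toy values are assumptions for illustration.

```python
from typing import Dict, Hashable, Tuple


def replicate_concordance(rep1: Dict[Hashable, float],
                          rep2: Dict[Hashable, float]) -> Tuple[float, float]:
    """Return (overlap rate, mean absolute allele frequency difference).

    The overlap rate is the number of variants called in both replicates
    divided by the number of unique variants called in either replicate.
    """
    shared = rep1.keys() & rep2.keys()
    union = rep1.keys() | rep2.keys()
    overlap_rate = len(shared) / len(union) if union else float("nan")
    mean_abs_diff = (sum(abs(rep1[s] - rep2[s]) for s in shared) / len(shared)
                     if shared else float("nan"))
    return overlap_rate, mean_abs_diff


# Hypothetical toy replicates: two shared variants, one private to each replicate.
replicate_a = {"site1": 0.10, "site2": 0.50, "site3": 0.20}
replicate_b = {"site1": 0.12, "site2": 0.46, "site4": 0.30}
overlap, diff = replicate_concordance(replicate_a, replicate_b)
print(overlap, round(diff, 4))
```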
Cost
We compared the relative cost implications of the different study designs considered here. Considering the results after duplicate reads were removed, the Pool of 10 individuals had the highest sensitivity and specificity among pools of more than 1 individual, but these were only 2% higher than for the Pool of 50, which provided better allele frequency estimates and was more cost-effective. For example, for a pooling experiment involving 1000 cases and 1000 controls, the Pool of 50 would be associated with 30% lower costs than the Pool of 10, based on the number of sequencing lanes required, and 86% lower costs than sequencing each individual on a single lane. Within each pool size, PCR was 3-fold more expensive than either of the hybrid-capture enrichment methods.
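The lane-based comparison can be reproduced with a short Python calculation. It assumes that cost scales only with the number of sequencing lanes and uses the lanes per pool listed in Table 1 (2 lanes for a pool of 10, 7 lanes for a pool of 50); enrichment and library preparation costs are ignored in this sketch, and the function names are hypothetical.

```python
from math import ceil


def lanes_required(n_individuals: int, pool_size: int, lanes_per_pool: int) -> int:
    """Total sequencing lanes needed when every pool is run on lanes_per_pool lanes."""
    n_pools = ceil(n_individuals / pool_size)
    return n_pools * lanes_per_pool


def percent_saving(lanes: int, reference_lanes: int) -> float:
    """Percentage reduction in lanes relative to a reference design."""
    return round(100.0 * (1 - lanes / reference_lanes), 1)


n = 1000 + 1000  # cases plus controls
pool_50 = lanes_required(n, pool_size=50, lanes_per_pool=7)
pool_10 = lanes_required(n, pool_size=10, lanes_per_pool=2)
individual = lanes_required(n, pool_size=1, lanes_per_pool=1)

print(pool_50, pool_10, individual)
print(percent_saving(pool_50, pool_10), percent_saving(pool_50, individual))
```

With these inputs the Pool of 50 needs 280 lanes against 400 for the Pool of 10 and 2000 for individual sequencing, which matches the 30% and 86% reductions quoted above.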
Discussion
The field of human genetics is entering a new era of next-generation association studies. However, the cost of large-scale sequencing experiments of individual samples or indexed pools can be prohibitive, whilst the ability to accurately and inexpensively enrich and sequence targeted regions remains important to the research community. We have evaluated three enrichment methods in four non-indexed pool sizes to determine the best performing and most cost-effective strategy in the context of disease association studies.
The proportion of reads mapping to the target region, the uniformity of coverage of the target, and the read depth at targeted bases represent important measures of enrichment success. PCR yields 20–30% more on-target sequence reads than either aHC or sHC, resulting in a higher mean read depth for targeted bases. The hybrid capture methods show a bias for enrichment of protein coding versus non-coding target regions, and this difference can be explained by the high repeat content of non-coding regions. PCR shows the opposite bias, with non-coding regions covered at higher depth than coding regions, potentially ascribed to the lower GC content of non-coding regions.
The most relevant factors for disease association studies are variant detection sensitivity and specificity, and accuracy of allele frequency estimates. sHC shows the highest dbSNP129 overlap, and demonstrates the highest sensitivity and specificity for discovering HapMap and 1KG variants across all pool sizes. Similarly, sHC produces the best estimation of allele frequencies across the board. Allele frequency estimation appears to improve with increasing pool size, therefore arguing for pooling larger numbers of samples. Interestingly, low frequency variants appear to be better-estimated, potentially because of higher resolution to correctly call a smaller number of alternate alleles. The 2% average difference between allele frequencies across technical replicates indicates that estimates are not robust. A MAF difference at this scale could lead to false positive or false negative signals, particularly for variants at the lower end of the frequency spectrum, which are typically the focus of resequencing studies. Reviewed together, the results of our experiment indicate that in-solution hybrid capture in pools of 50 individuals has clear advantages over the alternative strategies considered here. Advances in sequencing and multiplexing protocols may have an effect on pool efficiency. We also conclude that non-indexed pooled resequencing studies are well-powered for variant discovery, but produce unreliable allele frequency estimates, particularly within the context of complex disease association studies.
Materials and Methods
Ethics Statement
This study has been approved by the ethics committee of the Wellcome Trust Sanger Institute (WTSI). This study only used extracted DNA from cell-lines, which falls outside of the UK Human Tissue Act. The use of the 1958BC samples is covered by a material transfer agreement (MTA) with the ALSPAC Laboratory, University of Bristol (the 1958BC sample custodian), which stated that the 1958BC had been collected under UK NHS Research Ethics Committee approval from SouthEast MREC, in Aug. 2002. REC Ref. MREC 01/1/44. The HapMap Populations/ELSI Group made recommendations for the HapMap project during the initial planning phase, and developed an informed consent form template (http://hapmap.ncbi.nlm.nih.gov/consent.html). The use of the HapMap CEU DNA is governed by these individually signed informed consent forms that grant permission for the use of the DNA in future studies approved by relevant ethics committees. The use of the HapMap DNAs was approved by the HapMap Repository (Coriell).
DNA Samples
The samples sequenced consisted of 31 HapMap CEU individuals and 19 individuals from the 1958 British Birth Cohort (58BC). The HapMap DNA samples were obtained from Coriell Repositories and the sample IDs are: NA12249, NA12156, NA12004, NA11831, NA12716, NA11832, NA11993, NA12057, NA11995, NA12006, NA12144, NA12802, NA12146, NA12005, NA12003, NA07000, NA12043, NA12044, NA11992, NA11881, NA11994, NA07345, NA12154, NA06994, NA06985, NA12239, NA07022, NA07034, NA12155, NA07056, NA06993. Individuals with a 1KG superscript were sequenced as part of pilot 1 of the 1,000 Genomes Project [4].
Region Selection
The genomic regions selected for sequencing (Table S3) had shown suggestive evidence for association with type 2 diabetes following cumulative analysis of low frequency/rare variants directly typed on GWAS chips using a collapsing method [19]. Association in these regions did not replicate when further sample sets were tested. The targets for enrichment span 1.6 Mb in total and include entire genic regions that encompass 3′ and 5′ UTRs, introns, and exons, and have been defined as 50 Kb either side of the transcriptional start and stop sites.
Array and Solution Oligonucleotide pool design
Genomic coordinates for the regions of interest were submitted to Nimblegen for the design of custom 385K arrays covering the target regions. Oligonucleotide pools for hybridization in solution phase were generated by Nimblegen to cover the same target regions. To cover real-estate on the array, three further regions were added on the hybrid-capture arm of the experiment (for a total of 1.96 Mb). These additional regions were excluded from the analysis presented here. This exclusion results in an under-estimation of the percentage of reads mapping back to target for the aHC and sHC experiments in Table 1.
Preparation of the pools
Each DNA sample was quantified using standard picogreen protocols and normalized to 50 ng/µl. The pools were generated by mixing the required volumes of the appropriate number of samples to give a final concentration of each pool of 50 ng/µl. The concentration of the resulting pool was checked using picogreen. Aliquots of the same pool were used for both PCR and hybrid-capture.
PCR
Primers were designed automatically using Primer 3 to achieve a 5-fold depth of 5- and 10 kb amplicons across the target regions. Where necessary, manual primer design of 5 kb amplicons using Primer 3 was used to fill any gaps in the coverage following the automatic design. In total 462×10 kb STSs and 737×5 kb STSs were designed automatically and 88×5 kb STSs manually. All primers were pre-screened on a set of four genomic DNAs. Products were separated on a 0.8% agarose gel, visualised with ethidium bromide staining and scored as pass/weak/fail. Based on the prescreening results a final set of STSs was chosen to give 3-fold coverage over the target regions, which consisted of 256×10 kb STSs and 256×5 kb STSs. Aliquots of the same DNA pools used for hybrid capture were used as template for PCR amplification with each STS. 5 kb amplicons were amplified as follows: Primers were pre-aliquoted at a concentration of 10 ng/µl, 4 µl per well into 384-well PCR plates. A premix was made consisting of 2 µl of 10× Buffer (as supplied with the enzyme), 0.4 µl 10 mM dNTPs, 0.8 µl 50 mM MgSO4 (as supplied with the enzyme), 0.12 µl Platinum Hi-Fi Taq, 11.8 µl DDW and 30 ng of pooled DNA per reaction and added to the pre-aliquoted primers. PCR cycling conditions were as follows: 98°C for 3 minutes, followed by 15 cycles of 94°C for 30 seconds, 68°C for 30 seconds, with the annealing temperature decreasing by 1°C per cycle, 68°C for 5 minutes followed by 19 cycles of 94°C for 30 seconds, 58°C for 30 seconds, 68°C for 5 minutes followed by 68°C for 10 minutes. 10 kb amplicons were amplified as follows: Primers were pre-aliquoted at a concentration of 10 ng/µl, 4 µl per well into 384-well PCR plates. A premix was made consisting of 2 µl of 10× Buffer (as supplied with the enzyme), 0.4 µl 10 mM dNTPs, 0.8 µl 50 mM MgSO4 (as supplied with the enzyme), 0.16 µl Platinum Hi-Fi Taq, 11.14 µl DDW and 90 ng of pooled DNA per reaction and added to the pre-aliquoted primers. PCR cycling conditions were as follows: 98°C for 3 minutes, followed by 15 cycles of 94°C for 30 seconds, 68°C for 30 seconds, with the annealing temperature decreasing by 1°C per cycle, 68°C for 10 minutes followed by 19 cycles of 94°C for 30 seconds, 58°C for 30 seconds, 68°C for 10 minutes followed by 68°C for 10 minutes. Products were visualised using ethidium bromide staining. PCR products from each DNA pool for all STSs were pooled together in equimolar amounts and used to construct an Illumina library prior to sequencing as described below.
Illumina Library Construction
20 µg of DNA were sheared to 100–400 bp using a Covaris S2 following manufacturer's protocols and the settings Duty Cycle, 20%; Intensity, 5.0; Cycles/burst, 200; Duration, 90; Mode, Freq Sweeping. Sheared samples were quantitated on a Bioanalyzer (Agilent, Santa Clara, USA). 10–15 µg of sheared DNA were end-repaired, A-tailed and Illumina sequencing adapters ligated to the resulting fragments using the Illumina Paired-End DNA Sample Prep protocol with the slight modification that the gel size selection step was replaced with a SPRI bead purification (following manufacturer's protocol).
Array Hybridization
5 µg of each library were hybridized to a custom Nimblegen 385-K array following manufacturer's protocols (Roche/Nimblegen) with the modification that no pre-hybridization PCR was performed. Captured samples were washed and eluted in 50 µl of PCR-Grade water following manufacturer's protocols. Eluted samples were amplified using a master-mix containing 2 mM MgCl2, 0.2 mM dNTPs, 0.5 µM PE.1, 0.5 µM PE.2 and 3 units of Platinum® Pfx DNA Polymerase per sample. Samples were aliquoted into 3 individual wells of a plate and amplified using the following conditions: 94°C for 5 minutes followed by 20 cycles of 94°C for 15 seconds, 58°C for 30 seconds, 72°C for 30 seconds and a final extension of 72°C for 5 minutes. PCR products were purified using SPRI beads prior to sequencing.
Solution Hybridization
1 µg of each library was hybridized to an oligo pool following manufacturer's protocols with the modifications that 14 cycles of pre-hybridization PCR were performed and 50× COT1 DNA was used in the hybridization. Following hybridization the captured samples bound to the Streptavidin beads were washed following manufacturer's protocols. Post-capture PCR was performed on the captured samples bound to the beads as described above.
Sequencing
Captured libraries were sequenced on the Illumina Genome Analyzer II (GAII) platform as paired-end 37-bp or 54-bp reads, following manufacturer's protocols. The raw sequencing reads are available through the European Genome-Phenome Archive (http://www.ebi.ac.uk/ega, accession EGAS00001000134) and the European Nucleotide Archive (http://www.ebi.ac.uk/ena, accession ERP000770).
Read Mapping and Sequence Analysis
The reference human genome used in these analyses was UCSC assembly hg18 (NCBI Build 36), including unordered sequence. Each lane of sequencing was mapped to the reference genome using Maq (v0.7.1) with default parameters [20]. For pools with multiple lanes of sequencing, the individual lane mappings were merged with the Maq utility mapmerge. The phred-scaled base quality scores from the GAII were recalibrated using the Quality Score Recalibration tool in the Genome Analysis Toolkit (v1.0 build January 21, 2010) [21]. Duplicate reads were identified and marked using Picard (v1.17; http://picard.sourceforge.net/), and for a subset of the analyses duplicates were removed with SamTools (v0.1.7) [22]. The number of reads mapped, and the number mapped to target regions, were calculated using the view utility in SamTools. The %GC versus coverage analysis was performed using the CollectGcBiasMetrics utility in Picard. The analysis of the repeat and low-complexity content of the coding and non-coding target regions was performed with the RepeatMasker software (v. open-3.2.9) [23].
Variant Calling and Frequency Estimation
Variants were called on the merged BAM file from all lanes for a pool. The BAM file used to call variants had recalibrated base quality scores, reads mapping off the end of the reference soft-clipped, and duplicate reads either marked or removed. Variant calling and frequency estimation were performed by Syzygy (v0.9.5.39; http://www.broadinstitute.org/software/syzygy/) using the default parameters. Syzygy calls single nucleotide variants and single-base insertions/deletions [7]. This analysis only considered Syzygy single nucleotide variant calls. A variant is defined as a locus with at least one non-reference allele (an allele different from the reference genome used for mapping) present in the pool. Syzygy assigns a confidence score to all variant calls (high, medium and low). We analyzed all the called variants regardless of confidence.
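To make the variant definition concrete, the Python sketch below estimates a non-reference allele frequency from read counts at one locus and converts it to a whole number of alleles among the 2N chromosomes in a pool of N individuals. This is a naive read-fraction estimator written for illustration only; it is not Syzygy's model, and the function names and example counts are hypothetical.

```python
def naive_pool_frequency(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads carrying the non-reference allele at one locus."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this locus")
    return alt_reads / depth


def expected_allele_count(frequency: float, n_individuals: int) -> int:
    """Round a frequency to the nearest whole allele among 2N chromosomes."""
    return round(frequency * 2 * n_individuals)


def is_variant(frequency: float, n_individuals: int) -> bool:
    """Assumed reading of the definition: at least one non-reference allele."""
    return expected_allele_count(frequency, n_individuals) >= 1


# Hypothetical locus in a pool of 20 individuals (40 chromosomes):
# 12 of 400 reads carry the non-reference allele.
freq = naive_pool_frequency(ref_reads=388, alt_reads=12)
print(freq, expected_allele_count(freq, 20), is_variant(freq, 20))
```

In a pool of 50 individuals a singleton has an expected read fraction of only 1/100, so a read-fraction estimator of this kind is sensitive to depth and sequencing error at the low-frequency end.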
Comparison Genotypes
The sensitivity, specificity and frequency estimation analyses were conducted by comparing the variants and frequency estimates from the Syzygy calls to the known variant content in the pool using existing genotype data for each pooled individual. We used the non-redundant release 27 HapMap genotypes for the 31 HapMap individuals used in the pooling experiments. The genotypes were mapped to the forward strand of Build 36 of the reference genome. The sensitivity analysis included all loci where the HapMap genotypes indicated that there was at least one non-reference base in the pool, whereas the specificity and allele frequency estimation analyses only included loci where all individuals in the pool had non-missing genotype data. Twenty-two of the HapMap individuals used in our pooling experiments were sequenced in Pilot 1 of the 1,000 Genomes Project. We used 1KG genotypes [4] for these individuals from the final pilot 1 call set released March 28, 2010. Because no pool consisted solely of individuals sequenced in 1KG, we were unable to perform specificity analysis for the 1KG loci. The 1958 Birth Cohort (58BC) genotypes came from 2 sources. Sixteen of the pooled individuals were genotyped as part of the Wellcome Trust Case Control Consortium 2 (WTCCC2; Illumina 1.2 M Duo platform) [24] and 6 individuals were genotyped as part of this project at the Wellcome Trust Sanger Institute (Illumina 670K platform). The WTCCC2 genotypes were downloaded from the European Genotype Archive (http://www.ebi.ac.uk/ega/). The frequency estimation and variant discovery specificity analyses for the pool of 50 were based on the intersection of variants that occurred in both the HapMap and 58BC genotype sets. The variant discovery sensitivity analysis for the pool of 50 was carried out by taking the union of variants in the 1KG, HapMap and 58BC genotype sets.
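The locus-selection rules described above can be sketched in Python as follows. Genotypes are assumed to be coded as the number of non-reference alleles per individual (0, 1 or 2, with None for a missing call); the coding, the function names and the example loci are hypothetical and are not taken from the study's own scripts.

```python
from typing import Dict, List, Optional, Set

# locus -> non-reference allele count per pooled individual (None = missing)
Genotypes = Dict[str, List[Optional[int]]]


def known_variant_loci(genotypes: Genotypes) -> Set[str]:
    """Loci where at least one pooled individual carries a non-reference allele."""
    return {
        locus
        for locus, calls in genotypes.items()
        if any(c is not None and c > 0 for c in calls)
    }


def pool_frequency(calls: List[Optional[int]]) -> Optional[float]:
    """Known pool allele frequency; None if any genotype is missing,
    mirroring the restriction to loci with complete genotype data."""
    if any(c is None for c in calls):
        return None
    return sum(calls) / (2 * len(calls))


def sensitivity(called: Set[str], genotypes: Genotypes) -> float:
    """Fraction of known variant loci that were also called in the pool."""
    known = known_variant_loci(genotypes)
    if not known:
        raise ValueError("no known variant loci in this pool")
    return len(known & called) / len(known)


# Hypothetical pool of 4 individuals at 3 genotyped loci.
genotypes = {
    "chr1:100": [0, 1, 0, 0],
    "chr1:200": [0, 0, None, 2],
    "chr1:300": [0, 0, 0, 0],
}
called = {"chr1:100", "chr1:400"}
print(sensitivity(called, genotypes))         # 0.5
print(pool_frequency(genotypes["chr1:100"]))  # 0.125
print(pool_frequency(genotypes["chr1:200"]))  # None
```

Combining genotype sources for the pool of 50 then reduces to set operations on locus identifiers: an intersection (`&`) for the specificity and frequency analyses and a union (`|`) for the sensitivity analysis.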
The dbSNP variants used were dbSNP129 variants downloaded from the UCSC genome browser, with all rsIDs that mapped to two or more locations in the genome removed (referred to as the non-redundant dbSNP129). The coding/non-coding analysis was performed by defining coding intervals for each gene as per the March 27, 2009 release of the consensus coding sequence (CCDS) project [25].
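A minimal Python sketch of the non-redundancy filter follows, assuming that "non-redundant" means each retained rsID maps to exactly one genomic location. The record layout (rsID, chromosome, position) and the identifiers in the example are hypothetical; the real UCSC table has additional columns.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str, int]  # (rsID, chromosome, position)


def non_redundant(records: Iterable[Record]) -> Dict[str, Tuple[str, int]]:
    """Keep only rsIDs that map to exactly one (chromosome, position)."""
    locations: Dict[str, Set[Tuple[str, int]]] = defaultdict(set)
    for rsid, chrom, pos in records:
        locations[rsid].add((chrom, pos))
    return {
        rsid: next(iter(locs))
        for rsid, locs in locations.items()
        if len(locs) == 1
    }


# Hypothetical records: rsB maps to two locations and is dropped.
records = [
    ("rsA", "chr1", 1000),
    ("rsB", "chr2", 2000),
    ("rsB", "chr7", 5000),
    ("rsA", "chr1", 1000),  # exact duplicate row, still one location
]
print(non_redundant(records))  # {'rsA': ('chr1', 1000)}
```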
Statistical Sequence and Variant Analysis
All statistical analyses were performed with the R statistical software package [26]. The target regions and called variants were separated into different subsets, and two-sided, two-sample t-tests with unequal variances were performed to assess differences in the means of the distributions. A reported t-test p-value of 0 indicates that the p-value was smaller than R could represent (the highest exponent on the machine used for calculation is 1024). The correlation coefficients reported in Figure 2 and Figures S15, S16, S17, S18, S19, S20, S21, S22, S23, S24 are Pearson's correlation coefficients. Figures S19, S20, S21 further investigate the relationship between individual read depth and allele frequency accuracy, defined as the HapMap frequency minus the Syzygy estimated frequency, by least squares fitting of a linear model of accuracy on read depth; the red lines in these figures show the resulting estimates of the intercept and slope. Figures S22, S23, S24 further investigate the relationship between allele count and allele frequency accuracy, as defined above, by least squares fitting of a linear model of accuracy on allele count; the red lines in these figures show the resulting estimates of the intercept and slope.
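For readers who want to see the quantities involved, the Python sketch below computes the unequal-variance (Welch) t statistic with its Welch-Satterthwaite degrees of freedom, and a least squares straight-line fit. The analyses in this study were run in R, where `t.test` uses unequal variances by default; this sketch is an independent illustration with hypothetical function names. It stops at the test statistic because the Python standard library has no t-distribution from which to obtain a p-value.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def least_squares_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Intercept and slope minimising the squared residuals of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical data: frequency error (known minus estimated) against read depth.
depth = [100.0, 200.0, 300.0, 400.0]
error = [0.04, 0.03, 0.02, 0.01]
intercept, slope = least_squares_line(depth, error)
print(intercept, slope)
```

A slope near zero in such a fit would indicate that frequency accuracy does not change systematically with the predictor, while a non-zero intercept would indicate a constant bias.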
Supporting Information
Acknowledgments
The authors are grateful to Manuel Rivas and Mark Daly for making their variant calling Syzygy software available to us, and to Karen McLaren (project management), Cordelia Langford (genotyping), Sarah Hunt (informatics), Carol Scott (informatics), Sanjeev Bhaskar (informatics), Mike Quail and the Library Construction team, John Burton and the Illumina Sequencing team and David Jackson and the New Technology Pipeline, for their support in the project.
Footnotes
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work is supported by the Wellcome Trust (WT088885/Z/09/Z). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Eichler E, Flint J, Gibson G, Kong A, Leal S, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010;11:446–450. doi: 10.1038/nrg2809.
- 2. Manolio T, Collins F, Cox N, Goldstein D, Hindorff L, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494.
- 3. Johansen C, Wang J, Lanktree M, Cao H, McIntyre A, et al. Excess of rare variants in genes identified by genome-wide association study of hypertriglyceridemia. Nat Genet. 2010;42:684–687. doi: 10.1038/ng.628.
- 4. Durbin R, Abecasis G, Altshuler D, Auton A, Brooks L, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534.
- 5. Nejentsev S, Walker N, Riches D, Egholm M, Todd JA. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science. 2009;324:387–389. doi: 10.1126/science.1167728.
- 6. Bansal V. A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics. 2010;26:i318–324. doi: 10.1093/bioinformatics/btq214.
- 7. Calvo S, Tucker E, Compton A, Kirby D, Crawford G, et al. High-throughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency. Nat Genet. 2010;42:851–858. doi: 10.1038/ng.659.
- 8. Druley T, Vallania F, Wegner D, Varley K, Knowles O, et al. Quantification of rare allelic variants from pooled genomic DNA. Nat Methods. 2009;6:263–265. doi: 10.1038/nmeth.1307.
- 9. Ingman M, Gyllensten U. SNP frequency estimation using massively parallel sequencing of pooled DNA. Eur J Hum Genet. 2009;17:383–386. doi: 10.1038/ejhg.2008.182.
- 10. Koboldt D, Chen K, Wylie T, Larson D, McLellan M, et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics. 2009;25:2283–2285. doi: 10.1093/bioinformatics/btp373.
- 11. Out A, van Minderhout I, Goeman J, Ariyurek Y, Ossowski S, et al. Deep sequencing to reveal new variants in pooled DNA samples. Hum Mutat. 2009;30:1703–1712. doi: 10.1002/humu.21122.
- 12. Vallania F, Druley T, Ramos E, Wang J, Borecki I, et al. High-throughput discovery of rare insertions and deletions in large cohorts. Genome Res. 2010;20:1711–1718. doi: 10.1101/gr.109157.110.
- 13. Albert T, Molla M, Muzny D, Nazareth L, Wheeler D, et al. Direct selection of human genomic loci by microarray hybridization. Nat Methods. 2007;4:903–905. doi: 10.1038/nmeth1111.
- 14. Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust E, et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol. 2009;27:182–189. doi: 10.1038/nbt.1523.
- 15. Okou D, Steinberg K, Middle C, Cutler D, Albert T, et al. Microarray-based genomic selection for high-throughput resequencing. Nat Methods. 2007;4:907–909. doi: 10.1038/nmeth1109.
- 16. Tewhey R, Nakano M, Wang X, Pabon-Pena C, Novak B, et al. Enrichment of sequencing targets from the human genome by solution hybridization. Genome Biol. 2009;10:R116. doi: 10.1186/gb-2009-10-10-r116.
- 17. Mamanova L, Coffey A, Scott C, Kozarewa I, Turner E, et al. Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010;7:111–118. doi: 10.1038/nmeth.1419.
- 18. Teer J, Bonnycastle L, Chines P, Hansen N, Aoyama N, et al. Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Res. 2010;20:1420–1431. doi: 10.1101/gr.106716.110.
- 19. Lawrence R, Day-Williams A, Elliott K, Morris A, Zeggini E. CCRaVAT and QuTie – enabling analysis of rare variants in large-scale case control and quantitative trait association studies. BMC Bioinformatics. 2010;11:527. doi: 10.1186/1471-2105-11-527.
- 20. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851–1858. doi: 10.1101/gr.078212.108.
- 21. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110.
- 22. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 23. Smit A, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2010. URL http://www.repeatmasker.org.
- 24. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
- 25. Pruitt K, Harrow J, Harte R, Wallin C, Diekhans M, et al. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009;19:1316–1323. doi: 10.1101/gr.080531.108.
- 26. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. 2010. URL http://www.R-project.org. ISBN 3-900051-07-0.