Abstract
Advances in both high-throughput sequencing and whole-genome amplification (WGA) protocols have allowed genomes to be sequenced from femtograms of DNA, for example from individual cells or from precious clinical and archived samples. Using the highly curated Caenorhabditis elegans genome as a reference, we have sequenced and identified errors and biases associated with Illumina library construction, library insert size, different WGA methods and genome features such as GC bias and simple repeat content. Detailed analysis of the reads from amplified libraries revealed characteristics suggesting that majority of amplified fragment ends are identical but inverted versions of each other. Read coverage in amplified libraries is correlated with both tandem and inverted repeat content, while GC content only influences sequencing in long-insert libraries. Nevertheless, single nucleotide polymorphism (SNP) calls and assembly metrics from reads in amplified libraries show comparable results with unamplified libraries. To utilize the full potential of WGA to reveal the real biological interest, this article highlights the importance of recognizing additional sources of errors from amplified sequence reads and discusses the potential implications in downstream analyses.
Keywords: whole-genome amplification, Illumina, SNPs, genome assembly, chimeric DNA
1. Introduction
The use of genomic data generated by so-called ‘next generation sequencing’ (NGS) has become commonplace in many fields of biological research, with sequencing-by-synthesis from Illumina currently the most popular. A standard Illumina paired-end (PE) library is made from DNA templates of about 500 bp in length, and a sequencing run can generate billions of paired reads of length 37–250 bp from either ends of these fragments.1 Reads from longer fragments of DNA can also be produced to aid the deconvolution of repetitive regions and for identifying large structural variations in genomes. A specialized mate-pair (MP) library, constructed by introducing a circularization step at the start of the library preparation, allows end sequencing from fragments of at least 2 kb.2–4 This powerful technology can be applied to address a wide range of biological questions, such as variant calling and resolving haplotypes between individuals of a population or de novo assembly of complex genomes.
Advancement in library preparation also permits their creation from just a few nanograms of DNA.5 Nevertheless, obtaining even nanograms of starting material can be challenging in certain applications. One solution is to pool many samples to obtain sufficient DNA to construct one library. However, this approach is often not applicable to rare clinical or archived samples,6 and increases the complexity of downstream analysis. Within an assembly of pooled DNA samples, it can be particularly challenging to distinguish variants of a sequence that is repeated in the genome of one individual from allelic differences between multiple individuals. This problem increases with the levels of intraspecies variation, for instance, C. brenneri has 14.1% of polymorphic synonymous sites between individuals, comparable with hyperdiverse bacteria.7 A potential solution is to use whole-genome amplification (WGA) techniques to reduce the amount of DNA required to make a sequencing library. Several WGA protocols have been proposed8 and can be divided into those based on polymerase change reaction (PCR) or those involving multiple strand displacement amplification (MDA). PCR-based techniques such as degenerate oligonucleotide primed PCR9 and primer extension PCR10 can produce non-specific amplification artefacts and typically short-amplification products.11 MDA uses the strand-displacing DNA polymerase from Phi29 bacteriophage and has several advantages over PCR-based protocols. For example, MDA can generate long-amplified fragments up to 100 kb, which are of feasible size to serve as templates from which MP libraries can be synthesized.8 An alternative method of amplification called pWGA (primase-based WGA)12 is based on in vitro reconstruction of the naturally existing cellular DNA replication machinery found in bacteriophage T7, and comparable performance with MDA in terms of efficiency and unbiased amplification has been reported.12 WGA techniques have already been regularly used in genotyping bacteria,13 other small organisms14 and human cells.15 Though still in its infancy, sequencing from single cells is now feasible using WGA and this opens up a new field of exploring heterogeneity within populations, or populations of cells in an organism or tissue, an ultimate goal for many geneticists and molecular biologists.
Despite the advances in WGA methods, preferential amplification of certain DNA fragments still exists,16 resulting in uneven sequencing between different parts of a given genome and chimera formation.17 In this article we describe the effect of amplifying nanograms of genomic DNA, which is usually the available starting material of a de novo small eukaryote genome project, with three different protocols before constructing Illumina short- and long-insert libraries. Using the high quality genome sequence of C. elegans as a reference, we show that GC content and composition of inverted and tandem repeats play a major role in the variation of read coverage. We demonstrate that reads sequenced from amplified DNA can generate de novo assemblies that are of comparable quality with those from unamplified DNA. In addition, we have also examined capillary reads from libraries cloned from amplified high molecular weight DNA of the potato cyst nematode Globodera pallida genome project. As the research focus is turning to more unexplored organisms and single cells, WGA protocols will become the standard method of choice. Understanding the caveats and biases generated with these protocols is a paramount requirement for interpretation of results.
2. Materials and methods
2.1. Whole-genome amplification
Genomic DNA was extracted from ∼10 000 C. elegans N2 nematodes using Genomic tip G20 (Qiagen) according to the manufacturer’s instructions. Ten nanograms of template DNA was used in each of three WGA reactions using: (i) GenomiPhi v2 (GE life science), hereafter termed Phi; (ii) Phi29 MDA plus trehalose, hereafter termed Tre; and (iii) Rapisome (BioHelix), hereafter termed Rap. The Phi WGA reactions were carried out in the 20 µl reaction mixture at 30°C for 90 min followed by heat denaturation at 95°C for 3 min according to the manufacturer's instructions. For Rap, DNA was mixed directly with the 25 µl reaction mixture and incubated at 37°C for 60 min according to the manufacturer's instructions. For Tre, Phi29 polymerase (Qiagen), 0.7 M trehalose and 16 h reaction time were used according to the method described in Pan et al.18 Amplified products were purified using a QIAAmp DNA mini kit (Qiagen) and DNA concentrations were measured using Qubit (Life technologies).
2.2. Illumina library construction and sequencing
One microgram of DNA was used to construct standard 450 bp libraries using a TruSeq DNA Sample Preparation Kit with the standard protocol (Illumina), after fragmentation on the Covaris, 3 kb mate-pair libraries were constructed following the protocol described in Park et al.19 with the following exceptions. Size selection was carried out with Agencourt AMPure XP beads (Beckman Coulter) in a buffer of 5% PEG and 0.95 M NaCl, aiming to remove most fragments of length <1.5 kb. Nick translation was carried out for 11 min (libraries in Replicate 1) or 14 min (libraries in Replicate 2). Libraries were sequenced using the TruSeq SBS Kit v3-HS kit according to the manufacturer's recommended protocol (https://icom.illumina.com/) in either Illumina MiSeq 150 cycles (for 3 kb MP libraries in Replicate 2) or the Illumina HiSeq 100 cycles (for all other libraries). In situ, the linearization, blocking and hybridization step was repeated to regenerate clusters, release the second strand for sequencing and to hybridize the R2 sequencing primer followed by another 100 or 150 cycles of sequencing to produce PE reads.
2.3. Data analysis
Reads were trimmed based on base quality (an average phred score of 15 for every four bases) and the presence of adaptor sequences. For 450 bp fragment short-insert libraries Trimmomatic20 was used with options: ‘ILLUMINACLIP 2:40:15 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36’. For 3 kb fragment long-insert libraries an in-house developed algorithm was used to trim biotin adaptor sequence which may present at either ends of the long-insert fragments (Supplementary Fig. S1). Three to 15% of reads were removed in short- and long-insert libraries, respectively, before undertaking the alignment stage (Supplementary Table S1).
Illumina reads were aligned to the C. elegans reference genome (WS236 from Wormbase ftp://ftp.wormbase.org/) using SMALT (http://www.sanger.ac.uk/resources/software/smalt/). Repetitive mappings were allowed, and only alignments with at least 80% or 50% of the sequence read aligned to reference were considered in the short- and long-insert libraries, respectively. Additionally, a maximum of 600 bp and 100 000 bp insert size were considered for paired mappings in the short- and long-insert libraries, respectively. Duplicates were called using GATK.21 SNPs were called using Varscan222 with options ‘−min-coverage 5—strand-filter 1’ using alignment files (in the form of bam format) as the input on each of the short-insert replicates. Tandem and inverted repeat content of the C. elegans genome was calculated using program trf23 and irf24 with default parameters, respectively. GC content and repeat content analyses for 10 kb windows in the C. elegans genome was calculated using a combination of BEDTools25 and custom Perl and R26 scripts. Capillary reads from the G. pallida genome project were first trimmed to increase bases having at least phred quality score of 40 and the vector contamination was removed. The resulting subset of reads was mapped against the G. pallida v1 assembly (ftp://ftp.sanger.ac.uk/pub/pathogens/Globodera/pallida/) using SSAHA2.27
Assemblies were constructed from short-insert libraries using SGA v.0.10.928 with authors' recommended parameters (https://github.com/jts/sga/blob/master/src/examples/sga-celegans.sh). Scaffolding was performed using SSPACE basic version 2.029 with default settings, with library sizes set to the median insert values as shown in Table 1 with a standard deviation of 0.5. These assemblies were then compared with the C. elegans genome (WS236) using GAGE.30 Because of the differences in number of reads between replicates, we only presented Replicate 1, which has more sequencing depth (Table 4). A separate Supplementary Table S2 summarizes assemblies where every library is normalized to the library with least number of reads.
Table 1.
Method | Replicate | Platform | Library type | Total reads | Mapped (%) | Duplicatesa (%) | Properb (%) | Both mapped (%) | Median insert (bp) | Median coverage |
---|---|---|---|---|---|---|---|---|---|---|
Unamplified | 1 | HiSeq | Short | 41 688 676 | 99.7 | 0.8 | 98.1 | 99.5 | 285 | 38 |
2 | HiSeq | Short | 16 333 732 | 89.4 | 0.6 | 88 | 89.3 | 357 | 13 | |
Phi | 1 | HiSeq | Short | 48 501 916 | 99.4 | 0.7 | 98 | 98.8 | 224 | 46 |
2 | HiSeq | Short | 20 656 296 | 95.2 | 0.8 | 88.6 | 93.7 | 349 | 18 | |
Tre | 1 | HiSeq | Short | 44 481 270 | 99.3 | 1.1 | 97.6 | 98.7 | 237 | 40 |
2 | HiSeq | Short | 25 188 788 | 96.2 | 0.7 | 92 | 95.3 | 328 | 22 | |
Rap | 1 | HiSeq | Short | 26 277 398 | 89.3 | 2.8 | 71.6 | 80.1 | 248 | 15 |
2 | HiSeq | Short | 22 278 134 | 82.5 | 35.7 | 60 | 73.7 | 308 | 3 | |
Unamplified | 1 | HiSeq | Long | 60 856 860 | 99.5 | 9 | 97.2 | 99.1 | 2631 | 41 |
2 | MiSeq | Long | 2 551 720 | 86.4 | 1 | 79.1 | 85.4 | 2136 | 2 | |
Phi | 1 | HiSeq | Long | 61 735 210 | 99.5 | 3.5 | 81.5 | 99.2 | 2576 | 39 |
2 | MiSeq | Long | 2 760 576 | 99.3 | 1.2 | 80.2 | 98.9 | 2025 | 2 | |
Tre | 1 | HiSeq | Long | 55 842 586 | 99.4 | 3.4 | 82.6 | 99 | 2285 | 29 |
2 | MiSeq | Long | 2 999 856 | 99.4 | 1 | 73 | 98.9 | 2094 | 2 | |
Rap | 1 | HiSeq | Long | 58 443 656 | 99.4 | 7.6 | 92.1 | 99 | 2591 | 35 |
2 | MiSeq | Long | 3 914 622 | 99 | 6.6 | 91.8 | 98.7 | 2121 | 2 |
All percentages are relative to total number of reads in each replicate.
aReads that are identical copies of other reads and have exact mapped coordinates on the genome.
bReads mapped in the correct orientation and at a distance corresponding to that predicted by the fragment library size.
Table 4.
Protocol | Contig assembly |
Scaffolding |
||||||
---|---|---|---|---|---|---|---|---|
Unamplified | Phi | Tre | Rap | Unamplified | Phi | Tre | Rap | |
Assembly size (bp) | 94 641 187 | 94 028 877 | 94 541 978 | 88 411 985 | 96 571 590 | 96 849 620 | 96 382 191 | 96 869 541 |
contig number | 13 386 | 14 661 | 21 721 | 34 073 | 9415 | 8744 | 9068 | 9247 |
contig average (kb) | 7.1 | 6.4 | 4.4 | 2.6 | 10.3 | 11.1 | 10.6 | 10.5 |
largest contig (kb) | 167.7 | 147.8 | 116.8 | 41.8 | 167.7 | 187.5 | 167.7 | 167.7 |
N50 (kb) | 16.6 | 15.7 | 10.8 | 4.2 | 17.6 | 24.1 | 20.7 | 18.4 |
N50 (number) | 1525 | 1597 | 2109 | 5897 | 1533 | 1067 | 1258 | 1482 |
GAGE assessment | ||||||||
Corrected N50 (kb) | 15.1 | 14.1 | 9.5 | 3.5 | 16.8 | 22.7 | 19.7 | 17.7 |
Corrected N50 (number) | 1721 | 1825 | 2431 | 7493 | 1642 | 1141 | 1354 | 1577 |
Missing reference (%) | 0.09 | 0.09 | 4.43 | 0.14 | 0.09 | 0.09 | 0.09 | 0.09 |
Inversion | 13 | 21 | 38 | 50 | 12 (−1) | 17 (+4) | 13 | 15 (+2) |
Relocation | 7 | 7 | 11 | 22 | 17 (+10) | 13 (+6) | 19 (+12) | 11 (+4) |
Translocation | 12 | 16 | 37 | 30 | 12 | 12 | 12 | 11 (−1) |
3. Results
3.1. Genome amplification
Caenorhabditis elegans genomic DNA was amplified using three different protocols that utilized Phi29 polymerase alone (Phi); Phi29 polymerase supplemented with trehalose (Tre), which reportedly produces a more homogenous and unbiased amplification18; and Rapisome (Rap)—a commercially available version of pWGA, which does not utilize random primers and initial template denaturation that are possible factors in chimera formation in MDA reactions. After amplification, we observed DNA fragments of length>10 kb resulting from all amplification methods. The average yield of amplified DNA from 10 ng template DNA was 46 ng/µl of reaction mixture (Phi), 31 ng/µl (Rap) and 95 ng/µl (Tre). These fragments were further fragmented and size selected to generate Illumina libraries (Supplementary Fig. S1).
3.2. Illumina read quality
A total of ∼495 million of both standard 450 bp PE (short-insert libraries) and 3 kb MP (long-insert libraries) Illumina reads were generated from amplified C. elegans genomic DNA fragments. Two technical replicates were generated from each amplification protocol and unamplified DNA. The availability of a high quality C. elegans reference genome allowed the quality of the libraries to be assessed by checking various mapping metrics such as number of reads mapped, orientation of mates in a pair. In all cases, at least 82% of the reads were aligned to the reference genome (Table 1) and the insert size distribution, i.e. the distance between two mates of a read pair (including themselves) mapped to the genome in the correct orientation, was close to our expectation for selecting desired size fragments (Supplementary Fig. S2).
Various errors can occur at different stages of Illumina library preparation (Supplementary Fig. S1). In both short- and long-insert library construction, duplicates can arise during PCR amplification, resulting in perfect copies of the original DNA templates being sequenced many times.31 Of the mapped reads in short-insert libraries, the number of PCR duplicates was ∼1%, except in Rap libraries, where Replicate 1 had 2.8% duplicates and Replicate 2 had 35.7% duplicates suggesting that the starting amplified products had lower molecular sizes with only a fraction of DNA desirable as DNA templates. We therefore considered Replicate 2 as a ‘failed’ library and excluded it from subsequent analysis. The number of PCR duplicates in long-insert libraries was consistently higher than in short-insert libraries, as observed by others.4 The number of PCR duplicates also appeared to be affected by individual runs; a consistently fewer duplicates were observed in the second replicate. Again, we found that PCR duplicates in Rap amplified long-insert libraries were high in both replicates.
Problematic reads can arise from several stages of long-insert library construction. First, fragments can be produced from circularized templates where nick translation has occurred in a segment that does not contain a biotinylated adaptor (Supplementary Fig. S1) presumably the selection stages are sometimes ineffective. We found out that 0.5–4.8% of such reads present in long-insert libraries and they appear to be higher in Replicate 2 and not influenced by WGA protocols (Table 2). Second, templates can still be circularized if size selection is not efficient (Supplementary Fig. S1). Paired reads sequenced from these fragments should have short-insert sizes and be aligned in the correct (outward-facing in long-insert libraries) orientation. We found these reads to be present at much higher frequency in Replicate 2 which were sequenced as a batch (Supplementary Fig. S2).
Table 2.
Method | Replicate | Library type | Singletons (%) | Interchromosomala (%) | Outies/inniesb (%) | Wrong orientationc (%) | Incorrect insert size (%) |
---|---|---|---|---|---|---|---|
Unamplified | 1 | Short | 0.2 | 0.5 | 0.5 | 0.2 | 0.2 |
2 | Short | 0.2 | 0.6 | 0.3 | 0.1 | 0.3 | |
Phi | 1 | Short | 0.6 | 0.15 | 0.1 | 0.55 | 0.0 |
2 | Short | 1.5 | 0.4 | 0.3 | 4.1 | 0.3 | |
Tre | 1 | Short | 0.6 | 0.2 | 0.1 | 0.8 | 0.0 |
2 | Short | 1 | 0.4 | 0.3 | 2.5 | 0.1 | |
Rap | 1 | Short | 9.1 | 4.2 | 0.3 | 3.7 | 0.3 |
2 | Short | 8.8 | 0.5 | 1.7 | 3.1 | 8.4 | |
Unamplified | 1 | Long | 0.4 | 1 | 0.5 | 0.3 | 0.1 |
2 | Long | 1 | 3.9 | 1.4 | 0.8 | 0.2 | |
Phi | 1 | Long | 0.3 | 2 | 1.7 | 13.3 | 0.7 |
2 | Long | 0.4 | 5.1 | 3.1 | 10.0 | 0.5 | |
Tre | 1 | Long | 0.4 | 1.9 | 1.0 | 12.9 | 0.6 |
2 | Long | 0.4 | 8.1 | 4.8 | 12.4 | 0.6 | |
Rap | 1 | Long | 0.4 | 1.3 | 0.5 | 4.5 | 0.6 |
2 | Long | 0.3 | 3 | 1.3 | 2.2 | 0.4 |
All percentages are relative to total number of reads in each replicate shown in Table 1.
aReads with mates mapped to different chromosomes.
bReads with mates mapped to the same chromosome that show incorrect orientation of facing either outwards (‘←→’; outies for short-insert libraries) or inwards (‘→←’ innies for long-insert libraries).
cReads with mates mapped to the same chromosome but shows the same orientation, i.e. ‘←←’ or ‘→→’. In the case of long-insert libraries, chimera formation is one of the causes of the formation of these reads.
One of the main concerns in using amplification methods is their tendency to form chimeric DNA fragments, which seems to arise from a mechanism involving priming from displaced 3′ termini.17 Based on this mechanism, the majority of chimeric DNA fragments will be sequences where a segment, which is partially deleted, illustrated by Segments a and b in Fig. 1A, joins to another Sequence c from the same chromosome which is inverted. Thus we could measure the tendency of amplified libraries forming chimeras by counting the number of read pairs that were mapped with both mates in the wrong orientation (i.e. forward–forward or reverse–reverse). All amplified libraries show an increased proportion of reads in the wrong orientation when compared with the unamplified counterparts (Table 2). This pattern is more prevalent in long-insert reads, where long-insert libraries prepared using Phi and Tre show 10–12.9% of reads with the wrong orientation compared with 0.55–4.1% in short-insert libraries. The majority of these reads are evenly distributed across the chromosomes in all amplified samples (Supplementary Fig. S3), suggesting that wrongly amplified fragments occur infrequently and randomly. As mentioned before, another characteristic of the chimeras is that part of the sequence will be deleted, and as a result the two segments where they map in the genome will appear to have been brought closer together as a consequence of the deletion (Fig. 1B). Hence, when insert size is calculated based on the mapping positions, we expect to see a much broader insert size distribution than of non-chimeric reads. Indeed, this is what we observed in all WGA long-insert libraries (Replicate 1 of Phi shown in Fig. 1C and rest in Supplementary Fig. S4) with distances between mates mapped in wrong orientation sometimes even >10 kb. By further looking at sequence reads that can be uniquely mapped into two different positions on the same chromosome, we found two cases that confirm the presence of chimeras in Phi and Tre amplified fragments (Supplementary Fig. S5).
To our surprise, the insert size distribution of these chimeric reads all peaked close to 0 bp (Fig. 1C, Supplementary Fig. S4), suggesting that both mates in read pairs were overlapped or placed very close to each other. This observation cannot be solely explained by the priming mechanism where two different sequences on the chromosome were joined, and can only be explained if the same sequence is present on both ends of the sequenced fragment with one version inverted (Fig. 1D). To investigate this phenomenon further, we looked at Sanger sequence data that were produced from potato cyst nematode (Globodera pallida) genomic DNA, amplified using Genomiphi (Phi) and cloned into plasmid or fosmid vectors. Various insert sizes, from 2 kb to 40 kb (fosmid) were sequenced from either end with read lengths of 200–600 bp. Indeed, we also found that the majority of mates of wrongly oriented reads overlapped with each other, concurring that the same region was sequenced twice with one version inverted. Interestingly, we found that the fraction of wrongly oriented reads was correlated with fragment size, and in extreme cases 85% of MPs derived from fosmids mapped in the wrong orientation and overlapped each other (Supplementary Fig. S6).
3.3. Uniform read coverage across the genome
One of the most important criteria for accurate variant calling and assemblies from Illumina reads is an even coverage of sequence data genome-wide. We first evaluated the variability in the depth of coverage of short-insert reads32 by plotting the cumulative fraction of normalized depth of correctly paired read coverage that covers a given cumulative fraction of genome (Fig. 2). Normalization of read coverage depth allows libraries of different coverage depths to be compared with each other. The theoretical line (Fig. 2) indicates a perfectly uniform distribution of reads where 100% of the genome is covered by reads with a normalized and consistent depth of 1. Figure 2 shows that both replicates of the unamplified short-insert library have the closest fit to the theoretical line, suggesting the most uniform distribution of reads. The remaining samples show some level of deviation, suggesting non-uniform distribution across the genome. Distribution plots of the long-insert libraries show more deviation away from the theoretical distribution than short-insert libraries. This effect is more evident in the lower tail of the distribution, indicating a greater proportion of the genome has lower coverage. By inspecting regions of lower coverage across all libraries, the most evident patterns are regions enriched in G homopolymers tracts and GGC motifs33 (Supplementary Fig. S7).
Next, we grouped the Illumina short-insert read coverage at each base into 10 kb non-overlapping windows across the C. elegans genome. Using Chromosome I as an example, plotting the median coverage of bases in each window immediately revealed the difference in coverage between libraries (Fig. 3). Some differences exist, but two technical replicates in each library show remarkably similar patterns with each other. Libraries with unamplified DNA show more uniform coverage across Chromosome I compared with amplified libraries. Interestingly, we observed that read coverage only seems to be more even towards the middle of Chromosome I in all WGA protocols, but is even throughout in Chromosome X (Supplementary Fig. S8). The unevenness in coverage of data from Illumina short-insert libraries at C. elegans autosome arms mirrors the distribution of inverted and tandem repetitive sequences, which also cluster more frequently at the autosome arms34 (Supplementary Fig. S9). To investigate this phenomenon, we re-annotated the tandem and inverted repeat contents of C. elegans genome and compared them with read coverage in 10 kb windows. Strikingly, we found that both tandem and inverted repeat content are significantly correlated with read coverage across all samples including unamplified libraries (P < 0.001, Spearman's test). Read coverage in libraries treated with Phi and Tre decrease with increasing content of tandem and inverted repeats (Fig. 4A and B). Libraries created with Rap also show an inverse relationship between read coverage and tandem repeat content, but positively correlated with inverted repeat content, which explains contrasting patterns of coverage on autosome arms against Phi and Tre.
GC bias of Illumina reads has been widely reported in the literature.32,35 To analyse the effect of GC composition on uniformity of read coverage we calculated the distribution of GC of mapped reads to the reference, normalized by the average coverage across each set of replicate, and compared them against the theoretical distribution. Short-insert libraries prepared under all protocols showed a good fit to the theoretical distribution (Fig. 5A); however, we see a strong positive bias towards higher GC in all long-insert libraries (Fig. 5B). To characterize this observation further, we obtained the normalized coverage difference between short- and long-insert libraries in 10 kb windows and correlated with the GC content of the window. Compared with short-insert libraries, the majority of long-insert libraries show more amplification generated from higher extremes of GC regions and less amplification from lower extremes of GC regions (Supplementary Fig. S10).
3.4. Application 1: variant callings in WGA libraries
To investigate the effect on variant calling caused by the decreased uniformity in amplified libraries, SNPs were called using Varscan222 on each of the short-insert replicates. Since the same starting material was used for all samples, and considering the technical variations between different sequencing runs, we first inferred 643 homozygous and 2117 heterozygous SNPs that were called in both non-amplified replicates. Only one replicate of the Rap amplification was analysed, which therefore showed the least robust SNP calling accuracy. For samples amplified with either Phi or Tre, we found ∼80% homozygous SNP calls were also called in at least one replicate (Table 3A). The majority of miss calls in libraries prepared using these two protocols from sites with a coverage depth ≤2, consistent with the expectation that some regions of the genome failed to amplify during the process. The effect of non-uniform read coverage of amplified libraries is further reflected in their heterozygous calls, where only 61–65% of SNPs from the non-amplified library were also called in at least one of the replicates from the two amplification protocols. An exhaustive investigation of the miss-called SNPs revealed that most of the missed heterozygotes were incorrectly called as homozygous (Table 3B).
Table 3.
Protocols | Phi | Tre | Rap |
---|---|---|---|
(A) | |||
Homozygous SNPs (643) | |||
No/low coverage | 105 | 91 | 192 |
Not called the same | 28 | 30 | 28 |
Also called in one replicate | 150 | 154 | 423 |
Called in both replicates | 360 | 368 | NA |
Heterozygous SNPs (2117) | |||
No/low coverage | 132 | 85 | 291 |
Not called the same | 692 | 650 | 832 |
Also called in one replicate | 705 | 813 | 994 |
Called in both replicates | 588 | 569 | NA |
(B) | |||
Homozygous SNPs | |||
Called differently in both unamplified replicates | 37 | 36 | 107 |
Called in one replicate | 14 | 28 | 134 |
Heterozygous SNPs | |||
Called differently in both unamplified replicates | 105 | 158 | 528 |
Called in one replicate | 44 | 46 | 1465 |
(A) Fate of 643 homozygous and 2117 heterozygous SNP calls from both unamplified replicates; (B) fate of additional homozygous and heterozygous SNP calls from amplified replicates.
3.5. Application 2: assembly generated by WGA prepared libraries
In order to evaluate the effect that different WGA protocols may have in generating a genome assembly, we used the SGA assembler28 to assemble reads from each of the libraries. These assemblies were then assessed using GAGE30 and summarized in Table 4. We only presented Replicate 1 for this analysis because of the much lower coverage in Replicate 2 observed in all libraries, with the smaller numbers of reads producing significantly worse assemblies (Supplementary Table S2). Having a genome reference allowed us to compute corrected N50 where contigs were broken at miss-assembled regions prior to the calculation of N50. The first observation is that the Rap amplified libraries produce the worst assembly, and as expected the most contiguous and accurate assembly is produced from unamplified library. The second observation is that the assemblies produced from Phi and Tre amplified libraries are almost identical to unamplified libraries, with only an additional 0–4.3% of the assembly missing compared with the reference and similar corrected N50 values. The variation in assembly quality between replicates was greater than between non-amplified and amplified libraries prepared by following either Phi or Tre protocols (Supplementary Table S2).
To assess the effect of the uneven sequencing coverage and the presence of erroneous orientation in the long-insert libraries, we ran the SSPACE29 scaffolder on the best assembly (non-amplified Replicate 1) using reads from each long-insert library. We only used Replicate 1 for this analysis because of the much lower coverage in Replicate 2 observed in all samples. A total of 10 additional miss-assemblies were identified by GAGE after scaffolding with the unamplified long-insert library (Table 4) and, comparably, the Phi amplified library created the same number of miss-assemblies but actually had the largest corrected N50, as well as containing the largest scaffold of all the assemblies. The assembly scaffolded with the Rap amplified library, which contains the fewest wrongly oriented reads in among all amplified libraries but has the most PCR duplicates, shows the fewest additional miss-assemblies and with both the N50 and longest scaffold length very similar to the unamplified sample.
4. Discussion
4.1. Bias due to WGA protocols and genome features
There are two main types of biases present in Illumina read sequenced from amplified DNA: those that arise from the Illumina library construction and sequencing process, and those arising from the nature of DNA fragments generated from WGA. Reads of different insert size also exert influences on the extent of these biases. First, inverted repeats are known to generate chimeras in WGA protocols,17 the extent of which we assessed by recording the number of wrongly oriented reads in the samples. In fact, the distribution of wrongly oriented reads reveals that previously proposed mechanism17 only constitute a part of these, and instead the majority of these reads arise from fragments containing the same sequence on both ends with one end being an inverted version of another. The same trend was found in the G. pallida capillary reads, and a positive correlation was observed between numbers of chimeras and length of sequenced fragments. An explanation could be that the duplicated sequences are more likely to be present in larger fragments. Hence selection of such fragments in larger insert libraries after shearing will also result in preferentially selection for artefacts. In shorter insert Illumina libraries, the wrongly amplified sequences were broken down into smaller fragments and hence a decreased number of wrongly orientated reads was found. It would be interesting to sequence the whole wrongly amplified fragment to reveal the dominant nature of chimera formation in WGA.
Secondly, we obtained the quality statistics of read libraries such as read coverage aligned across the genome, proportions of wrongly oriented mapped reads and searched for potential biases due to GC content, repeat composition of the C. elegans genome and the different insert size libraries. Coverage biases in regions of genomes as a result of amplification have been previously observed in bacteria16 and in humans,15 and we found that the uneven coverage between chromosome arms and centres are correlated with tandem and inverted repeat content, both of which have been previously described to bias the coverage of WGA reads.16,17 We show that in repeat regions Phi and Tre displayed lower read coverage, while Rap showed different biases depending on the repeat type, suggesting alternative mechanisms affecting the amplification process. It also may not be the repeat sequence per se, as repetitive regions in the genome are also associated with the global structural features such as histone modifications. For example, chromosomal arms of C. elegans show enrichment of H3K9 methylation,36 which are also zones of elevated meiotic recombination.34
Thirdly, we found the presence of GC bias in the reads from long-insert libraries including those without WGA, despite the C. elegans genome possessing an almost uniform GC content (36%) across all chromosomes.34 The current proposed mechanisms behind WGA do not explain a bias with GC content, and indeed we found no obvious effect of WGA libraries against GC content. Hence, the overall bias is mainly caused during the library construction process of the long-insert libraries and will be even more prevalent in genomes that possess more extreme GC content, such as Plasmodium falciparum (17% GC).
Finally, in this study we have analysed a total of three WGA protocols. We found that the addition of trehalose (Tre) does not seem to improve the coverage evenness against the libraries amplified with Genomiphi only (Phi), as all the metrics and biases are almost identical between two protocols. A possible explanation is that Genomiphi might have already contained reagents that work similarly to trehalose. On the other hand, libraries amplified using Rapisome (Rap) showed different patterns to Genomiphi based protocols. Constructing a good Illumina library from Rap amplified products seemed to be more difficult than Genomiphi based protocols, considering the lower number of mapped reads and higher number of PCR duplicates. However, Rap amplified libraries have shown lower number of wrongly orientated reads, suggesting an alternative mechanism that attributes to biases to different genome features. For instance, we show regions associated with inverted repeat actually increase read coverage in Rapisome libraries. It will be of interest to find out whether the lower percentage of wrongly orientated reads in Rapisome amplified libraries are also noticeable in longer insert libraries. If that was indeed the case, then Illumina libraries can be constructed from much larger amplified DNA fragments with still lower proportions of wrongly oriented reads.
4.2. Performance of WGA reads
Sequence reads from amplified fragments can potentially give rise to false SNP calls, particularly in genomic regions that are difficult to amplify. We found that SNPs were under-called from reads generated from amplified DNA compared with unamplified samples. When reads were sequenced at good coverage, we found that ∼80% of homozygous SNP candidates from unamplified samples could also be identified correctly from amplified samples, while the majority of miss calls were due to insufficient read coverage. The limitation seems to be more prevalent in heterozygous SNPs, where only up to 65% of candidates could be accurately identified. Most of these SNPs were called as homozygous in one of the alleles, and this may be due to the fact that most SNP callers make heterozygous calls confidently if allele frequencies were close to 50%, for which non-uniform coverage results in departure from this assumption. Hence, although most of the homozygous SNPs can be accurately identified in WGA samples, it is essential to investigate the allele frequencies of variant sites in order to not miss calls that were otherwise heterozygous in the sample.
There have been advances in assembly algorithms that assemble genomes with fluctuating coverage in the samples, but we wished to emphasize the effect that reads sequenced from amplified fragments have on a typical assembly process, hence we applied commonly used methods to assemble the C. elegans genome using short-insert reads with the SGA assembler,28 and scaffolded using long-insert reads with the SSPACE scaffolder.29 Contig and scaffold statistics generated from amplified libraries were comparable with those from their unamplified counterparts. In fact, we found that variability between technical replicates influenced assembly metrics more than different WGA protocols. We also found that the wrongly oriented reads did not cause more miss-assemblies than expected. An explanation is that our long-insert libraries were sequenced at adequate coverage and problematic reads were present in minority. Occasions can arise where correct reads suggest joining of two scaffolds but problematic reads suggest otherwise, in which case the SSPACE scaffolder will scaffold correctly because the number of correct reads outweighs problematic reads. However, we expect that reads from larger insert WGA libraries would result in significantly more scaffolding errors because the majority of reads are problematic, with the error rate increasing with fragment length.
4.3. Conclusion
WGA will remain an essential method in genomics for many years to come. The mechanisms of chimera formation and other sources of bias have been investigated and there are many new protocols that are currently being developed to minimize such effects. In the meantime, it is also important to evaluate the nature of biases in order to correctly interpret assembly results. Here, we have presented the effects of WGA on Illumina reads. We recognize the biases present from the reads, and show that the DNA amplified from all protocols investigated in this article is able to call the majority of SNPs and produce accurate assemblies comparable with those produced from unamplified DNA. Reads sequenced from longer fragments are still useful in scaffolding and problematic reads can be detected and filtered out as long as they remain a minority. We envisage also using WGA in other eukaryotic species (including nematodes) that possess similar GC content and base composition to C. elegans. Fundamentally, as biologists turn to sequencing more unexplored species or samples from which only very low amounts of starting material are available, our study provides a first initial assessment on Illumina sequencing from WGA DNA.
5. Availability
Illumina data are released to the European Nucleotide Archive (http://www.ebi.ac.uk/ena/) under accession number ERP000964.
Supplementary data
Supplementary data are available at www.dnaresearch.oxfordjournals.org.
Funding
This work was supported by JSPS KAKENHI (grants numbers 20353659 and 23248024) and the Wellcome Trust (grant no. WT 098051). I.J.T. was supported by JSPS Postdoctoral Fellowship Program for Foreign Researchers.
Supplementary Material
Acknowledgements
We thank Anna Protasio for her commenting on the manuscript.
Footnotes
Edited by Prof. Masahira Hattori
References
- 1.Illumina. HiSeq Sequencing Systems http://res.illumina.com/documents/systems/hiseq/datasheet_hiseq_systems.pdf .
- 2.Van Nieuwerburgh F., Thompson R.C., Ledesma J., Deforce D., Gaasterland T., Ordoukhanian P., Head S.R. Illumina mate-paired DNA sequencing-library preparation using Cre-Lox recombination. Nucleic Acids Res. 2012;40:e24. doi: 10.1093/nar/gkr1000. doi:10.1093/nar/gkr1000. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Illumina. 2009. Mate Pair Library V2 Sample Preparation Guide.
- 4.Peng Z., Zhao Z., Nath N., et al. Generation of long insert pairs using a Cre-LoxP Inverse PCR approach. PloS One. 2012;7:e29437. doi: 10.1371/journal.pone.0029437. doi:10.1371/journal.pone.0029437. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Grunenwald H., Baas B., Caruccio N., Syed F. Rapid, high-throughput library preparation for next-generation sequencing. Nat. Methods. 2010;7:iii–iv. [Google Scholar]
- 6.Seth-Smith H.M., Harris S.R., Skilton R.J., et al. Whole-genome sequences of Chlamydia trachomatis directly from clinical samples without culture. Genome Res. 2013;23:855–66. doi: 10.1101/gr.150037.112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Dey A., Chan C.K., Thomas C.G., Cutter A.D. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc. Natl. Acad. Sci. USA. 2013;110:11056–60. doi: 10.1073/pnas.1303057110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Hawkins T.L., Detter J.C., Richardson P.M. Whole genome amplification—applications and advances. Curr. Opin. Biotechnol. 2002;13:65–7. doi: 10.1016/s0958-1669(02)00286-0. doi:10.1016/S0958-1669(02)00286-0. [DOI] [PubMed] [Google Scholar]
- 9.Telenius H., Carter N.P., Bebb C.E., Nordenskjold M., Ponder B.A., Tunnacliffe A. Degenerate oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer. Genomics. 1992;13:718–25. doi: 10.1016/0888-7543(92)90147-k. doi:10.1016/0888-7543(92)90147-K. [DOI] [PubMed] [Google Scholar]
- 10.Zhang L., Cui X., Schmitt K., Hubert R., Navidi W., Arnheim N. Whole genome amplification from a single cell: implications for genetic analysis. Proc. Natl. Acad. Sci. USA. 1992;89:5847–51. doi: 10.1073/pnas.89.13.5847. doi:10.1073/pnas.89.13.5847. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Binga E.K., Lasken R.S., Neufeld J.D. Something from (almost) nothing: the impact of multiple displacement amplification on microbial ecology. ISME J. 2008;2:233–41. doi: 10.1038/ismej.2008.10. doi:10.1038/ismej.2008.10. [DOI] [PubMed] [Google Scholar]
- 12.Li Y., Kim H.-J., Zheng C., et al. Primase-based whole genome amplification. Nucleic Acids Res. 2008;36:e79. doi: 10.1093/nar/gkn377. doi:10.1093/nar/gkn377. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Lasken R.S. Genomic sequencing of uncultured microorganisms from single cells. Nat. Rev. Microbiol. 2012;10:631–40. doi: 10.1038/nrmicro2857. doi:10.1038/nrmicro2857. [DOI] [PubMed] [Google Scholar]
- 14.Dharia N.V., Bright a.T., Westenberger S.J., et al. Whole-genome sequencing and microarray analysis of ex vivo Plasmodium vivax reveal selective pressure on putative drug resistance genes. Proc. Natl. Acad. Sci. USA. 2010;107:20045–50. doi: 10.1073/pnas.1003776107. doi:10.1073/pnas.1003776107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Wang J., Fan H.C., Behr B., Quake S.R. Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm. Cell. 2012;150:402–12. doi: 10.1016/j.cell.2012.06.030. doi:10.1016/j.cell.2012.06.030. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Pinard R., de Winter A., Sarkis G.J., et al. Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics. 2006;7:216. doi: 10.1186/1471-2164-7-216. doi:10.1186/1471-2164-7-216. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Lasken R.S., Stockwell T.B. Mechanism of chimera formation during the Multiple Displacement Amplification reaction. BMC Biotechnol. 2007;7:19. doi: 10.1186/1472-6750-7-19. doi:10.1186/1472-6750-7-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Pan X., Urban A.E., Palejev D., et al. A procedure for highly specific, sensitive, and unbiased whole-genome amplification. Proc. Natl. Acad. Sci. USA. 2008;105:15499–504. doi: 10.1073/pnas.0808028105. doi:10.1073/pnas.0808028105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Park N., Shirley L., Gu Y., Keane T.M., Swerdlow H., Quail M.A. An improved approach to mate-paired library preparation for Illumina sequencing. Methods Next Gener. Seq. 2013;1:10–20. [Google Scholar]
- 20.Lohse M., Bolger A.M., Nagel A., Fernie A.R., Lunn J.E., Stitt M., Usadel B. RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res. 2012;40:W622–7. doi: 10.1093/nar/gks540. doi:10.1093/nar/gks540. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.McKenna A., Hanna M., Banks E., et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303. doi: 10.1101/gr.107524.110. doi:10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Koboldt D.C., Zhang Q., Larson D.E., et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22:568–76. doi: 10.1101/gr.129684.111. doi:10.1101/gr.129684.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–80. doi: 10.1093/nar/27.2.573. doi:10.1093/nar/27.2.573. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Warburton P.E., Giordano J., Cheung F., Gelfand Y., Benson G. Inverted repeat structure of the human genome: the X-chromosome contains a preponderance of large, highly homologous inverted repeats that contain testes genes. Genome Res. 2004;14:1861–9. doi: 10.1101/gr.2542904. doi:10.1101/gr.2542904. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics (Oxford, England) 2010;26:841–2. doi: 10.1093/bioinformatics/btq033. doi:10.1093/bioinformatics/btq033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Team R.C. 2013. R: A Language and Environment for Statistical Computing.
- 27.Ning Z., Cox A.J., Mullikin J.C. SSAHA: a fast search method for large DNA databases. Genome Res. 2001;11:1725–9. doi: 10.1101/gr.194201. doi:10.1101/gr.194201. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Simpson J.T., Durbin R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 2012;22:549–56. doi: 10.1101/gr.126953.111. doi:10.1101/gr.126953.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Boetzer M., Henkel C.V., Jansen H.J., Butler D., Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics (Oxford, England) 2011;27:578–9. doi: 10.1093/bioinformatics/btq683. doi:10.1093/bioinformatics/btq683. [DOI] [PubMed] [Google Scholar]
- 30.Salzberg S.L., Phillippy A.M., Zimin A., et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012;22:557–67. doi: 10.1101/gr.131383.111. doi:10.1101/gr.131383.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Kozarewa I., Ning Z., Quail M.A., Sanders M.J., Berriman M., Turner D.J. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat. Methods. 2009;6:291–5. doi: 10.1038/nmeth.1311. doi:10.1038/nmeth.1311. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Oyola S.O., Otto T.D., Gu Y., et al. Optimizing Illumina next-generation sequencing library preparation for extremely AT-biased genomes. BMC Genomics. 2012;13:1. doi: 10.1186/1471-2164-13-1. doi:10.1186/1471-2164-13-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Nakamura K., Oshima T., Morimoto T., et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 2011;39:e90. doi: 10.1093/nar/gkr344. doi:10.1093/nar/gkr344. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Press H., York N., Nw A. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998;282:2012–8. doi: 10.1126/science.282.5396.2012. doi:10.1126/science.282.5396.2012. [DOI] [PubMed] [Google Scholar]
- 35.Benjamini Y., Speed T.P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012;40:e72. doi: 10.1093/nar/gks001. doi:10.1093/nar/gks001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Liu T., Rechtsteiner A., Egelhofer T.A., et al. Broad chromosomal domains of histone modification patterns in C. elegans. Genome Res. 2011;21:227–36. doi: 10.1101/gr.115519.110. doi:10.1101/gr.115519.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.