Abstract
Background
Short oligonucleotides can be used as markers to tag and track DNA sequences. For example, barcoding techniques (i.e. Multiplex Identifiers or Indexing) use short oligonucleotides to distinguish between reads from different DNA samples pooled for high-throughput sequencing. A similar technique called molecule tagging uses the same principles but is applied to individual DNA template molecules. Each template molecule is tagged with a unique oligonucleotide prior to polymerase chain reaction. The resulting amplicon sequences can be traced back to their original templates by their oligonucleotide tag. Consensus building from sequences sharing the same tag enables inference of original template molecules thereby reducing effects of sequencing error and polymerase chain reaction bias. Several independent groups have developed similar protocols for molecule tagging; however, user-friendly software for build consensus sequences from molecule tagged reads is not readily available or is highly specific for a particular protocol.
Results
MT-Toolbox recognizes oligonucleotide tags in amplicons and infers the correct template sequence. On a set of molecule tagged test reads, MT-Toolbox generates sequences having on average 0.00047 errors per base. MT-Toolbox includes a graphical user interface, command line interface, and options for speed and accuracy maximization. It can be run in serial on a standard personal computer or in parallel on a Load Sharing Facility based cluster system. An optional plugin provides features for common 16S metagenome profiling analysis such as chimera filtering, building operational taxonomic units, contaminant removal, and taxonomy assignments.
Conclusions
MT-Toolbox provides an accessible, user-friendly environment for analysis of molecule tagged reads thereby reducing technical errors and polymerase chain reaction bias. These improvements reduce noise and allow for greater precision in single amplicon sequencing experiments.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-284) contains supplementary material, which is available to authorized users.
Keywords: Amplicon, Sequencing, Molecule tagging, Sequencing error, PCR Bias
Background
High-throughput sequencing has revolutionized biological science and biomedical research. However, erroneous base calls reduce the information value of each sequence, and polymerase chain reaction (PCR) bias leads to inaccurate quantification of sequences. To address these limitations several methods have been developed where randomly generated oligonucleotides are used as a molecule tag (MT). Molecule tagging should not be confused with barcoding (i.e. Multiplex Identifiers or Indexing) where short oligonucleotides are used to tag individual samples, which are then pooled and simultaneously sequenced. The resulting reads are then informatically sorted by the sample barcode. Molecule tagging is a similar idea where unique tags are attached to individual DNA template molecules within a sample prior to exponential PCR amplification (Additional file 1: Figure S1.A). After PCR and sequencing, reads sharing the same MT likely originated from the same template molecule, meaning that discrepancies among these reads can be attributed to technical error. Forming consensus sequences (ConSeqs) from reads with the same MT corrects these errors. Additionally, any preferential PCR amplification biases are mitigated because ConSeqs represent the original population of templates [1–3].
Molecule tagging is useful for a variety of applications. For instance, Kinde et al. [4] used molecule tagging to test polymerase fidelity, accuracy of in vitro synthesized oligonucleotides, and prevalence of mutations in nuclear and mitochondrial genomes of normal cells. Jabara et al. [5] used molecule tagging to detect and quantify single nucleotide polymorphisms (SNPs) in the HIV—1 protease gene in complex viral populations. Kivioja et al. [1] showed how molecule tagging improves quantification of mRNA sequencing experiments. Faith et al. [6] used a molecule tagging method called Low-Error Amplicon Sequencing (LEA-Seq) for metagenomic 16S gut profiling and observed a substantial reduction in the observed microbial community complexity due to the elimination of spurious sequences. Lundberg et al. [7] saw a similar reduction in 16S microbial complexity when profiling microbially diverse bulk soil samples. In each of these studies, molecule tagging allowed greater confidence in the amplicon sequences and their quantification.
Despite extensive efforts developing and using these error-reducing protocols, software for building ConSeqs in the previously cited projects [4–6] is not readily available or is highly specific for a particular application (e.g. [5]). For example, LEA-Seq scripts can only be run on a small number of 16S amplicons sequenced using paired-end 108 bp Illumina reads with a single 12-20 bp molecule tag. This specificity makes LEA-Seq scripts less practical for most amplicon experiments that could benefit from molecule tagging. Consequently, we developed MT-Toolbox (Molecule Tag Toolbox), a flexible and user-friendly software package to generate ConSeqs from molecule tagged reads produced from several different MT protocols.
Implementation
The primary purpose of MT-Toolbox is to categorize reads by MT and build ConSeqs (Figure 1). MT-Toolbox can categorize and correct single-end (SE), overlapping paired-end (PE), and non-overlapping PE reads. With overlapping PE reads, a preprocessing step runs FLASH [8] to merge corresponding PE reads into a single sequence. Regular expressions, a common pattern matching technique, are used to identify the expected regions (e.g. MT, primer, amplicon) of each read (Additional file 1: Figure S1.B-D). Reads matching the regular expression are then categorized by their MT.
The first step in identifying and correcting errors among reads having the same MT is building a square alignment matrix, M (Additional file 1: Note S.1). This matrix is described as each row, r, representing a read and each column, c, containing a single base from read r at position c. The number of rows (i.e. number of reads) in M is referred to as the MT depth and is an important parameter in evaluating the accuracy of final ConSeqs. Multiple sequence alignment (MSA) programs such as ClustalW [9] or MUSCLE [10] can be used to generate such a matrix. However, computational overhead from operations like file input/output associated with these programs requires a substantial amount of time (Additional file 1: Figure S2). Alternatively, M can be created without using an MSA program by simply stacking reads. Because reads in an MT category are likely to originate from the same template molecule, they are likely to have uniform lengths (Additional file 1: Figure S3). Furthermore, Illumina sequences rarely incorporate insertions or deletions into sequenced reads (Additional file 1: Figure S4). In the rare case where reads in a single MT category differ in lengths, reads are clustered by length and only reads from the largest cluster are used to build the ConSeq. If multiple clusters are equally represented as the largest cluster, one of them is arbitrarily chosen to build the ConSeq. Comparisons of ConSeqs generated by ClustalW, MUSCLE, and the read stacking method show that ConSeqs derived from stacked reads are only slightly less accurate (Additional file 1: Figure S5), and reduce runtime by ~54%. While options for using either ClustalW or MUSCLE are available in MT-Toolbox, the default is to stack reads.
From M a consensus sequence can be built by choosing the mode base in each column (Additional file 1: Figure S6). The quality score of the consensus base is set to be the mean of the original quality values of the mode base. Ties are resolved by choosing the base with the highest average quality score. If a tie cannot be resolved using quality scores, an IUPAC encoding is used as the consensus base. Using quality score information provides a major advantage because ConSeqs can be generated from MTs represented by only two reads thereby keeping a larger proportion of reads. This is especially important for samples with high amplicon population diversity because it captures a larger fraction of the population. This is an improvement over LEA-Seq, which cannot build ConSeqs from MTs having a depth of two where the reads are not identical. Furthermore, other ConSeq building software (e.g. [5]) use only sequence information to build consensus sequence and thus are only able to generate ConSeqs from MTs having a depth greater than two reads.
The primary output file contains ConSeqs and corresponding quality scores in FASTQ format. A second FASTQ file contains single read categories (SRCs; MTs with only one raw read) that can optionally be included in downstream analysis (Additional file 1: Note S.2). However, SRCs retain all technical errors associated with sequencing and PCR because no consensus sequence can be generated from a single read. Quality control parameters (Additional file 1: Note S.3) allow filtering of low quality ConSeqs and SRCs, ConSeqs with low depth, and ConSeqs where a single MT tags two different templates by chance (i.e. the ‘birthday paradox’) [11, 12] (Additional file 1: Note S.4, Figures S7 and S8).
MT-Toolbox also includes the following features: 1) jobs can be started via a graphical user interface (GUI) or command line interface (Additional file 1: Figure S9), 2) an additional plugin provides features for 16S microbial profiling, namely—building operational taxonomic units (OTUs), assigning OTU taxonomy, and removing contaminant OTUs (Figure 1, Additional file 1: Note S.5) using the MeTagenomics plugin (MT-MT-Toolbox, Additional file 2), 3) the BioUtils library (Additional file 1: Note S.6, Figure S10; Additional file 3), digital normalization [13] parameters (Additional file 1: Note S.7), and optional Load Sharing Facility (LSF) based cluster parallelization (Additional file 1: Note S.8) reduce runtime and memory requirements, and 4) an MT-Toolbox website provides descriptions, tutorials, installation instructions, updates, and other important documentation [14].
MT-Toolbox is implemented as a suite of object-oriented Perl modules and scripts (Additional file 4). It has been successfully tested on Perl versions 5.8.8, 5.8.9, and 5.12.3. Several external Perl modules are required, and can be easily downloaded and installed via a simple build command before building and installing MT-Toolbox. The GUI was built using the Perl/Tk library and requires an X Window System. MT-Toolbox also uses gnuplot 4.4 for generating simple summary graphs. The optional MT-MT-Toolbox plugin allows for standard 16S microbial profiling analysis. MT-MT-Toolbox requires USEARCH v7.0.1090 [15] or greater for OTU clustering and chimera filtering, the RDP Classifier [16] as implemented in QIIME [17] for OTU taxonomy classification, and BLAST + 2.2.25 [18] for contaminant sequence removal.
Results
To show the utility of ConSeqs generated by MT-Toolbox, we used data from [7] which consists of a clonal plasmid containing a known 16S gene. From this single clonal plasmid, separate DNA samples were created by performing two replicate dilutions of 1x, 50x, or 100x, for a total of six samples. Each sample was molecule tagged and PCR amplified. It is important to note that each sample should contain just one “real” amplicon matching the original 16S amplicon in the clonal plasmid. Samples were barcoded, pooled, and sequenced on an Illumina MiSeq platform using standard 2 × 250 bp protocols. After demultiplexing samples, ConSeqs were generated using MT-Toolbox (Figure 2).
Diluted samples result in a greater number of MTs having high depth (Figure 3); the diversity of the original amplicon population, however, is reduced. In theory, molecule tags having high depth should generate the most accurate ConSeqs by overcoming the effects of sequencing error. This creates a trade-off between creating highly accurate ConSeqs and capturing the diversity of the amplicon population. Low-complexity samples benefit from dilution because a large number of accurate ConSeqs can be created without sacrificing information about the diversity of the amplicon population. Alternatively, for high complexity samples like soil microbial communities it may be better to sacrifice ConSeq accuracy to observe a larger portion of the amplicon population.
The most accurate ConSeqs were generated from merged PE reads. Average errors per base (EPB) for ConSeqs built from ClustalW or Muscle alignments of merged PE reads was measured at ~0.00047. Without using an MSA (i.e. the read stacking method), ConSeqs had ~0.00112 EPB. Removing ConSeqs with a c-score ≤35 reduced the EPB to ~0.00089 (Additional file 1: Note S.4, Figures S7 and S8). In general, ConSeqs derived from any type of molecule tagged read were more accurate than any type of raw read (Figure 4; Additional file 1: Note S.9). Also, as depth of coverage for each MT increased, consensus sequence EPB decreased. However, for depths >15, EPB increases slightly. This trend is primarily driven by a single sample (Additional file 1: Figure S11). Why this particular sample has higher mean EPB is unclear, however it is unlikely to be caused by sample contamination or sequencing error (Additional file 1: Figure S12). In general, this outlying sample appears more error prone even at depths where other samples have very few errors (Additional file 1: Figure S11). Two examples of errors in high depth ConSeqs from this sample suggest that nucleotide misincorporation during early PCR cycles contributes to increased EPB (Additional file 1: Figure S12). Additionally, this sample has relatively fewer ConSeqs at these high depths (Figure 3) indicating that outlier ConSeqs may be inflating the mean EPB. In any case, nearly all ConSeqs in this sample still have fewer EPB than raw reads.
In general, MT-Toolbox outperforms LEA-Seq in terms of accuracy (Figure 4) and data retention. For MT depths of 2 (21% of the data), LEA-Seq failed to generate ConSeqs because it is unable resolve difference between only two reads. Alternatively, MT-Toolbox uses read quality scores to resolve such difference thereby retaining MTs of depth 2. For MT depths between 3 and 10 (67% of the data), mean EPB of MT-Toolbox ConSeqs generated from merged PE reads is lower than LEA-Seq ConSeqs. For the remaining MT depths (12% of the data) MT-Toolbox ConSeqs were either on par or slightly less accurate than LEA-Seq ConSeqs. Furthermore, because LEA-Seq was implemented to recognize only specific amplicons, a substantial number of changes to the source code were required to run LEA-Seq on these reads. This highlights the utility of MT-Toolbox where users can easily adjust parameters to build ConSeqs from virtually any amplicon or sequencing technology.
Conclusions
Molecule tagging is a practical technique applicable to a variety of amplicon sequencing experiments, however generalizable and easily accessible software for processing custom MT data is not readily available. MT-Toolbox fills this niche by enabling efficient processing of MT data generated from a variety of protocols (Additional file 1: Note S.10). By forming ConSeqs, MT-Toolbox reduces technical errors and biases associated with PCR and sequencing thereby enabling precise measurements of DNA template abundance in mixed amplicon samples.
Availability and requirements
Project Name: MT-Toolbox.
Project Home Page:https://sites.google.com/site/moleculetagtoolbox/.
Operation System(s): Unix-based system.
Programming Language: Perl.
Other Requirements: Perl 5.8.8 or higher, select Perl libraries, ClustalW 2.1, MUSCLE 3.8.31, FLASH 1.2.5, gnuplot 4.4, an X Window System, select Unix system commands.
License: FreeBSD.
Any restrictions to use by non-academics: None.
Electronic supplementary material
Acknowledgements
We thank Surojit Biswas, Sur Herrera Paredes, Natalie Breakfield, and Sarah Lebeis for essential discussions and comments on the manuscript. We acknowledge support from the UNC BCB training grant [T32 GM067553-06] (SY), the Howard Hughes Medical Institute (JD), the Gordon and Betty Moore Foundation (JD), the National Science Foundation (IOS-0958245) (JD), and the North Carolina Biotechnology Center [2013-MRG-1110] (CJ).
Abbreviations
- MT
Molecule tag
- PE
Paired-end
- SE
Single-end
- MSA
Multiple sequence alignment
- SRC
Single read category
- GUI
Graphical user interface
- OTU
Operational taxonomic unit
- EPB
Errors per base
- SNP
Single nucleotide polymorphism
- RDP
Ribosomal Database Project
- LSF
Load Sharing Facility
- IUPAC
International Union of Pure and Applied Chemistry
- LEA-Seq
Low Error Amplicon Sequencing.
Footnotes
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
SY designed and implemented the software and wrote the manuscript. DL gave valuable feedback on the design, did the lab work associated with the sequencing tests, and revised the manuscript. DL, JD, and CJ were involved in the project design and manuscript revisions. All authors read and approved the final manuscript.
Contributor Information
Scott M Yourstone, Email: scott.yourstone81@gmail.com.
Derek S Lundberg, Email: derek.lundberg@gmail.com.
Jeffery L Dangl, dangl@email.unc.edu.
Corbin D Jones, cdjones@email.unc.edu.
References
- 1.Kivioja T, Vähärautio A, Karlsson K, Bonke M, Enge M, Linnarsson S, Taipale J. Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods. 2012;9:72–74. doi: 10.1038/nmeth.1778. [DOI] [PubMed] [Google Scholar]
- 2.Fu GK, Hu J, Wang P-H, Fodor SPA. Counting individual DNA molecules by the stochastic attachment of diverse labels. Proc Natl Acad Sci U S A. 2011;108:9026–9031. doi: 10.1073/pnas.1017621108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Casbon JA, Osborne RJ, Brenner S, Lichtenstein CP. A method for counting PCR template molecules with application to next-generation sequencing. Nucleic Acids Res. 2011;39:e81. doi: 10.1093/nar/gkr217. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Kinde I, Wu J, Papadopoulos N, Kinzler KW, Vogelstein B. Detection and quantification of rare mutations with massively parallel sequencing. Proc Natl Acad Sci U S A. 2011;108:9530–9535. doi: 10.1073/pnas.1105422108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Jabara CB, Jones CD, Roach J, Anderson JA, Swanstrom R. Accurate sampling and deep sequencing of the HIV-1 protease gene using a Primer ID. Proc Natl Acad Sci U S A. 2011;108:20166–20171. doi: 10.1073/pnas.1110064108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Faith JJ, Guruge JL, Charbonneau M, Subramanian S, Seedorf H, Goodman AL, Clemente JC, Knight R, Heath AC, Leibel RL, Rosenbaum M, Gordon JI. The long-term stability of the human gut microbiota. Science. 2013;341:1237439. doi: 10.1126/science.1237439. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Lundberg DS, Yourstone S, Mieczkowski P, Jones CD, Dangl JL. Practical innovations for high-throughput amplicon sequencing. Nat Methods. 2013;10:999–1002. doi: 10.1038/nmeth.2634. [DOI] [PubMed] [Google Scholar]
- 8.Magoč T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27:2957–2963. doi: 10.1093/bioinformatics/btr507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23:2947–2948. doi: 10.1093/bioinformatics/btm404. [DOI] [PubMed] [Google Scholar]
- 10.Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Liang RH, Mo T, Dong W, Lee GQ, Swenson LC, McCloskey RM, Woods CK, Brumme CJ, Ho CKY, Schinkel J, Joy JB, Harrigan PR, Poon AFY. Nucleic Acids Res. 2014. Theoretical and experimental assessment of degenerate primer tagging in ultra-deep applications of next-generation sequencing; pp. 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Sheward DJ, Murrell B, Williamson C. Degenerate Primer IDs and the birthday problem. Proc Natl Acad Sci U S A. 2012;109:E1330. doi: 10.1073/pnas.1203613109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Brown CT, Howe A, Zhang Q, Pyrkosz AB, Brom TH. A reference-free algorithm for computational normalization of shotgun sequencing data. ArXiv. 2012;1203:4802v2. [Google Scholar]
- 14.MT-Toolbox. https://sites.google.com/site/moleculetagtoolbox/
- 15.Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–2461. doi: 10.1093/bioinformatics/btq461. [DOI] [PubMed] [Google Scholar]
- 16.Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73:5261–5267. doi: 10.1128/AEM.00062-07. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, Mcdonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R. QIIME allows analysis of high-throughput community sequencing data. Nat Publ Gr. 2010;7:335–336. doi: 10.1038/nmeth.f.303. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi: 10.1186/1471-2105-10-421. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.