Significance
The detection of rare mutations in clinical samples is essential to the screening, diagnosis, and treatment of cancer. Although next-generation sequencing has greatly enhanced the sensitivity of detecting mutations, the relatively high error rate of these platforms limits their overall clinical utility. The elimination of sequencing artifacts could facilitate the detection of early-stage cancers and provide improved treatment recommendations tailored to the genetic profile of a tumor. Here, we report the development of BiSeqS, a bisulfite conversion-based sequencing approach that allows for the strand-specific detection and quantification of rare mutations. We demonstrate that BiSeqS eliminates nearly all sequencing artifacts in three common types of mutations and thereby considerably increases the signal-to-noise ratio for diagnostic analyses.
Keywords: next-generation sequencing, bisulfite sequencing, strand-specificity, polymerase chain reaction, mutation
Abstract
The identification of mutations that are present at low frequencies in clinical samples is an essential component of precision medicine. The development of molecular barcoding for next-generation sequencing has greatly enhanced the sensitivity of detecting such mutations by massively parallel sequencing. However, further improvements in specificity would be useful for a variety of applications. We herein describe a technology (BiSeqS) that can increase the specificity of sequencing by at least two orders of magnitude over and above that achieved with molecular barcoding and can be applied to any massively parallel sequencing instrument. BiSeqS employs bisulfite treatment to distinguish the two strands of molecularly barcoded DNA; its specificity arises from the requirement for the same mutation to be identified in both strands. Because no library preparation is required, the technology permits very efficient use of the template DNA as well as sequence reads, which are nearly all confined to the amplicons of interest. Such efficiency is critical for clinical samples, such as plasma, in which only tiny amounts of DNA are often available. We show here that BiSeqS can be applied to evaluate transversions, as well as small insertions or deletions, and can reliably detect one mutation among >10,000 wild-type molecules.
Extensive knowledge of the genetic alterations that underlie cancer is now available, opening new opportunities for the management of patients (1–3). Some of the most important of these opportunities involve “liquid biopsies”—that is, the evaluation of blood and other bodily fluids for mutant DNA template molecules that are released from tumor cells into such fluids. Although the potential value of liquid biopsies was recognized more than two decades ago (4–6), more recent advances in sequencing technology have made this approach practical. For example, it has recently been shown that liquid biopsies of blood can detect minimal amounts of disease in patients with early-stage colorectal cancers, thereby providing evidence that could substantially affect their survival (7). Other studies have shown that circulating tumor DNA (ctDNA) can be detected in the blood of patients with other malignancies, as well as in other bodily fluids such as pancreatic cysts, Pap smears, and saliva (8–16).
The vast majority of current technologies for detecting rare mutations use digital approaches, where each template molecule is assessed, one by one, to determine whether it is wild type or mutant (17). The digitalization can be performed in wells (17), in tiny droplets formed by emulsification or microfluidics (18, 19), or in clusters (20). The most comprehensive of these approaches employs massively parallel sequencing to simultaneously analyze the entire sequences of hundreds of millions of individually amplified template molecules (21). However, all of the currently available sequencing instruments have relatively high error rates, limiting sensitivity at many nucleotide positions to one mutant among 100 wild-type (WT) template molecules, even with DNA templates that are of optimal quality (21). The DNA quality of clinical samples is often far less than optimal, compounding the problem. Sensitivity can be increased by pretreating the DNA to remove damaged bases before sequencing (22, 23) and by bioinformatics and statistical methods to enhance base calls after sequencing (24, 25). Although useful for a variety of purposes, the sensitivity obtainable with these improvements is generally not sufficiently high for the most challenging applications, such as liquid biopsies, which can require detection of one mutant molecule among thousands of WT molecules (9).
Another important way to improve sensitivity is to use “molecular barcodes,” in which each template is covalently linked to unique identifying sequences (UIDs). Molecular barcodes were originally used to count individual template molecules (26) but were subsequently incorporated into a powerful approach, termed SafeSeqS, for error reduction (27). After incorporation of the UIDs, subsequent amplification steps produce multiple copies of each UID-linked template. Each of the daughter molecules produced by amplification contains the same UID, forming a UID family. For a mutation to be considered bona fide (a supermutant), every member of the UID family must have the identical sequence at each queried position (27).
There are two general ways to assign molecular barcodes to template DNA molecules. One uses a set of locus-specific primers to PCR-amplify genomic loci, while the other uses adapters ligated before amplification to create a whole genome library. The PCR method uses primers containing a stretch of random (N) bases to distinguish each individual template molecule (exogenous barcodes) (27, 28). The advantage of this approach is that it is applicable to very small amounts of DNA, and virtually the only sequences amplified are the desired ones, reducing the amount of sequencing needed to evaluate a specific mutation. The disadvantage is that errors introduced into one strand during the UID-incorporation cycles will create supermutants. This method will therefore still eliminate errors made during sequencing but not errors made during the initial cycles of PCR. The ligation method either employs random sequences in the adapters used for ligation (27–29) or uses the ends of the randomly sheared template DNA to which the adapters are ligated as “endogenous UIDs” (27, 30). Although errors are still introduced during the PCR steps with the ligation approach, its advantage is that both strands can be identified from the sequencing data (duplex sequencing). The probability that the identical, complementary mutation is introduced into both strands is low (the square of the probability of the mutation appearing in only one strand). The disadvantage of this approach is that it requires library preparation and capture of the sequences to be queried, neither of which is highly efficient.
We here describe an approach that incorporates advantages of both the PCR- and ligation-based approaches described above. This approach takes advantage of the fact that bisulfite treatment can efficiently convert dC bases in DNA to U bases. This conversion makes the two strands of DNA distinguishable and was previously used to distinguish RNA transcripts copied from each of the two possible template strands of DNA (31). Bisulfite conversion has also been extensively used to distinguish methylated C-residues, which do not get converted to T bases, from unmethylated C bases, thereby illuminating epigenetic changes (32). It has also been shown that dC bases can be partially converted to T bases so that each individual template DNA molecule can be distinguished from others by its unique pattern of C to T changes, thereby creating an intrinsic barcode similar to what can be achieved with externally added UIDs (33). In this work, we show that DNA in which all C bases have been fully converted to T bases can be used as PCR templates with specially designed primers linked to exogenous barcodes. This allows individual mutations to be assessed on both strands (duplex sequencing) in a reliable manner, without creation of libraries and with a relatively small number of sequencing reads.
Results
BiSeqS Workflow.
The principal feature of BiSeqS is the simultaneous detection of a mutation on both the plus and minus strands of DNA templates that were bisulfite treated and molecularly barcoded. We refer to the reference sequence as defined by UCSC as the plus (+) strand and its reverse complement as the minus (–) strand. Three simple experimental steps (bisulfite conversion, molecular barcoding, and sample barcoding) are required before a specialized bioinformatics analysis of the sequencing data, as described below (Fig. 1 and Fig. S1 A and B).
Step i: Bisulfite conversion.
Incubation of DNA with sodium bisulfite at elevated temperatures and low pH deaminates cytosine to form 5,6-dihydrocytosine-6-sulfonate (34). Subsequent hydrolytic deamination at high pH removes the sulfonate, resulting in uracil (35). Many modifications of this basic reaction have been described and used largely to differentiate between cytosine and 5-methylcytosine (5-mC), the latter of which is not susceptible to bisulfite conversion. In addition to converting C to U, bisulfite treatment denatures DNA and can degrade it. Although this degradation is not limiting for standard applications of bisulfite treatment, it is critical for applications involving mutation detection in clinical samples that are already degraded before conversion (36–38). In the current study, we evaluated many ways to convert DNA and purify the converted strands. The best results were obtained with the reagents, conditions, and incubation times described in Materials and Methods. As shown in Fig. S2, treatment under these conditions did not inhibit the amplification of PCR products up to 285 bp in size. Sequencing of these products revealed that, on average, >99.8% of the C bases were converted to T bases on both strands (excluding C bases at 5′-CpG sites, which can be resistant to bisulfite conversion because they are either methylated or hydroxymethylated).
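The effect of conversion on the two strands can be illustrated in silico. The following is a minimal sketch, not the conversion or alignment code used in this study: the sequence is hypothetical, every non-CpG C is assumed to be fully converted, and every C in a 5′-CpG is assumed to be methylated and therefore protected (a simplification). Converted U bases are written as T, as they appear after PCR.

```python
# Minimal in silico bisulfite conversion of both strands (illustrative only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a 5'->3' DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def bisulfite_convert(strand, protect_cpg=True):
    """Convert C to T along one 5'->3' strand.

    If protect_cpg is True, a C followed by G is left unconverted,
    modeling a methylated CpG (an assumption of this sketch).
    """
    out = []
    for i, base in enumerate(strand):
        is_cpg = base == "C" and i + 1 < len(strand) and strand[i + 1] == "G"
        if base == "C" and not (protect_cpg and is_cpg):
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


plus = "ACGTCCAGT"                 # hypothetical plus-strand sequence
minus = reverse_complement(plus)   # minus strand, written 5'->3'

plus_conv = bisulfite_convert(plus)    # "ACGTTTAGT"
minus_conv = bisulfite_convert(minus)  # "ATTGGACGT"

# After conversion the strands are no longer complementary to each other,
# which is what makes them distinguishable in the sequencing data.
strands_distinguishable = reverse_complement(minus_conv) != plus_conv
```

Because the converted strands are no longer complementary, each requires its own reference sequence and its own primer pair.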
Step ii: Molecular barcoding.
The goal of bisulfite treatment is to create a code for distinguishing the two strands of DNA. This doubles the number of templates that need to be molecularly barcoded and requires specialized steps compared with those used for standard DNA amplification. First, four primers must be designed to amplify each region of interest, two primers for each strand. Second, the primers must be complementary to the converted form of the DNA, accentuating the importance of full conversion; otherwise, some template molecules will not be amplified because they will not be perfectly complementary to the primers. Third, bisulfite treatment under the conditions we used converts virtually all nonmodified C residues to T, lowering the melting temperature of both the primer annealing sites and the amplicon in general. Because both strands must be amplified equivalently and in the same reaction, the primers must be chosen so that the same PCR cycling conditions can be used for amplifying both strands in a highly specific manner. For regions in which there is already a low C:G base pair content, the primers have to be long enough to allow specific amplification under relatively high-temperature annealing conditions. This proved difficult to achieve without generating large amounts of primer dimers, so several primer designs were evaluated. Eventually, variations in primer length, position, composition, and C:G content allowed for specific and robust amplification of both strands of every target region attempted.
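The drop in melting temperature caused by conversion can be approximated with the Wallace rule for short oligonucleotides, Tm ≈ 2 °C × (A + T) + 4 °C × (G + C). The sketch below is only a rough illustration with a hypothetical 18-nt annealing site; it is not the primer design procedure used in this study, and it ignores salt, primer concentration, and protected CpG sites.

```python
# Rough Wallace-rule estimate of how C-to-T conversion lowers Tm.
def wallace_tm(seq):
    """Estimate Tm in degrees C as 2*(A+T) + 4*(G+C)."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


native_site = "AGCTCCAGGTACCTGATC"              # hypothetical annealing site
converted_site = native_site.replace("C", "T")  # assumes full conversion

tm_native = wallace_tm(native_site)        # 56
tm_converted = wallace_tm(converted_site)  # 44
tm_drop = tm_native - tm_converted         # 2 degrees per converted C
```

Under this approximation each converted C costs about 2 °C, which is why primers for converted templates generally need to be longer than primers for the native sequence.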
Another issue confronting amplification of bisulfite converted DNA is that many polymerases will not efficiently copy DNA that contains uracil bases. We tested seven commercially available polymerases and various reaction conditions to optimize efficiency of template use and uniformity of amplification of both strands when four primers were used (Dataset S1). Although a combination of AMPIGene Hot Start Taq Polymerase and iTAQ Polymerase amplified the greatest number of template molecules, their lack of 3′ → 5′ exonuclease activity proved limiting for specificity in that the number of errors during PCR was unacceptably high. Ultimately, we chose Phusion U Hot Start Polymerase, a polymerase that exhibits 3′ → 5′ exonuclease activity, as the enzyme to amplify uracil-containing templates with the highest specificity while maintaining sensitivity.
Step iii: Sample barcoding.
Part of the power of massively parallel sequencing instruments is that they can be used to analyze many samples at once. To enable this capacity for BiSeqS, we incorporated a sample barcode PCR step following the purification of the molecularly barcoded PCR products (Fig. S1, step iii). Moreover, the converted sample DNA was divided into two to six wells of the PCR plate before the molecular barcoding step. Each well was then assigned a different sample barcode. This distribution served two purposes. First, with concentrated DNA templates, it provides independent replication of mutations with small mutant allele fractions. Second, with dilute DNA templates, as are often present in clinical samples such as plasma (9), urine (39), and CSF (12), it provides the opportunity to test more template molecules, increasing the chance of identifying mutant templates.
BiSeqS Data Processing Pipeline.
High-quality base calls were aligned to the bisulfite-converted reference sequence, and the aligned data were organized into tables for each sample, where each observed mutation in each strand of each well was listed in a separate row. The columns in this table included the number of reads, UIDs, and supermutants for each mutation (see Datasets S2 and S3 for examples). Supermutants were defined as mutations in a UID family in which >90% of the family members contained that mutation. For example, if all three members of a UID family contained the same mutation, it was considered a supermutant. The supermutant allele fraction was defined as the number of supermutants divided by the number of UIDs in an individual well.
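The supermutant definition above can be sketched as follows. This is a simplified illustration for a single queried position in a single well, not the BiSeqS pipeline itself; the function name, the read representation as (UID, base) pairs, and the example reads are all hypothetical.

```python
# Supermutant calling at one position: >90% of a UID family must agree.
from collections import Counter, defaultdict


def supermutant_calls(reads, wild_type, min_fraction=0.9):
    """Count supermutants per mutant base and return the number of UIDs.

    reads: iterable of (uid, base) pairs observed at the queried position.
    A UID family is a supermutant if more than min_fraction of its
    members carry the same non-wild-type base.
    """
    families = defaultdict(list)
    for uid, base in reads:
        families[uid].append(base)

    supermutants = Counter()
    for bases in families.values():
        base, count = Counter(bases).most_common(1)[0]
        if base != wild_type and count / len(bases) > min_fraction:
            supermutants[base] += 1
    return supermutants, len(families)


reads = (
    [("uid1", "T")] * 3                               # 3 of 3 mutant: supermutant
    + [("uid2", "G"), ("uid2", "G"), ("uid2", "T")]   # mostly wild type
    + [("uid3", "G")] * 2                             # wild type
    + [("uid4", "T")] * 9 + [("uid4", "G")]           # 90% is not >90%
)

supers, n_uids = supermutant_calls(reads, wild_type="G")
supermutant_allele_fraction = supers["T"] / n_uids    # 1 of 4 UIDs
```

Only the first family qualifies, so the supermutant allele fraction in this toy well is one supermutant among four UIDs.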
Individual mutations in the plus and minus strands were compared to determine whether the identical supermutant was found in both strands. If the mutation was found in both strands, the supermutant allele fractions in each strand were compared. The supermutant allele fractions on each strand provide an additional level of specificity because these fractions are expected to be similar if a mutant base pair existed in the template DNA before conversion and amplification. Given that mutations arising during PCR are relatively rare, it would be even rarer for the same mutation to arise at the identical position in both strands. This is especially true after conversion, when the two strands contain markedly different nucleotide contexts. If the supermutant allele fractions in the two strands differed by <10-fold, then the mutation was considered to be a superduper mutant (SDM). The SDM allelic fraction was defined as the number of SDMs divided by the number of UIDs in the strand that contained the fewest UIDs. For example, if the number of SDMs was 10, and the numbers of UIDs in the plus and minus strands were 10,000 and 20,000, respectively, then the SDM allelic fraction would be 0.1% (i.e., 10 of 10,000).
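The two rules in this paragraph, the <10-fold concordance test and the SDM allelic fraction, can be written compactly. The sketch below is illustrative: the function names are hypothetical, and treating a supermutant allele fraction of zero on either strand as a failed test is an assumption that follows from the requirement that the mutation be found in both strands.

```python
# SDM concordance test and SDM allelic fraction (illustrative sketch).
def is_sdm(smaf_plus, smaf_minus, max_fold=10):
    """True if both strands carry the supermutant and their supermutant
    allele fractions differ by less than max_fold."""
    if smaf_plus <= 0 or smaf_minus <= 0:
        return False  # mutation must be present on both strands
    fold = max(smaf_plus, smaf_minus) / min(smaf_plus, smaf_minus)
    return fold < max_fold


def sdm_allelic_fraction(n_sdm, uids_plus, uids_minus):
    """SDMs divided by the UID count of the strand with the fewest UIDs."""
    return n_sdm / min(uids_plus, uids_minus)


# Worked example from the text: 10 SDMs, 10,000 and 20,000 UIDs.
example_fraction = sdm_allelic_fraction(10, 10_000, 20_000)  # 0.001, i.e., 0.1%

concordant = is_sdm(0.0010, 0.0005)    # 2-fold difference
discordant = is_sdm(0.0010, 0.00005)   # 20-fold difference
```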
Special features of the analysis of mutations in converted DNA include the following. A transition from C > T noted in the sequencing could have resulted from a single base substitution mutation that changed a C:G base pair to a T:A base pair or from bisulfite conversion of a C to a T on one strand. In light of this ambiguity, C to T mutations cannot be considered supermutants in the strand containing the C, although a supermutant would still be evident at that position in the strand containing the G. There are a total of six possible single base substitutions in duplex DNA: A C:G base pair can be mutated to an A:T, G:C, or T:A base pair, and an A:T base pair can be mutated to a C:G, G:C, or T:A base pair. Of these six single base pair substitutions, all result in supermutants on at least one strand and four result in supermutants on both strands (i.e., SDMs). In addition, transitions that create a CpG dinucleotide in which the C is methylated can be assessed on both strands. All insertions or deletions within the amplified sequences can form SDMs. Methylation also introduces complexity, as methylated or hydroxymethylated C bases are not converted to U bases by bisulfite treatment. The BiSeqS pipeline takes this into account when it analyzes the data by not assuming that any particular C is methylated or unmethylated (or that every unmethylated C is converted to T by bisulfite treatment). Instead, it considers the possible effects of conversion and methylation and only labels a mutation as a supermutant or SDM if there is no ambiguity. A list of all possible single base substitutions on either strand, within a triplet context and with the mutated base in the middle, is provided in Dataset S4. For each single base substitution, the capacity of BiSeqS to identify SDMs is also provided in this table. In general terms, all transversions, all insertions and deletions, and a small subset of transitions can be unambiguously scored as SDMs (Dataset S4).
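The strand logic described above can be checked by enumeration. The following sketch is a deliberately simplified model, not the rule set of Dataset S4: it assumes complete conversion of every C, ignores methylation and triplet context, and asks only whether a substitution remains visible after conversion on each strand.

```python
# Which single base substitutions remain visible on each converted strand?
from itertools import permutations

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def converted(base):
    """Base as read after full bisulfite conversion and PCR (C reads as T)."""
    return "T" if base == "C" else base


def scorable_strands(ref, alt):
    """Return (plus_scorable, minus_scorable) for a plus-strand ref>alt change."""
    plus = converted(ref) != converted(alt)
    minus = converted(COMPLEMENT[ref]) != converted(COMPLEMENT[alt])
    return plus, minus


table = {
    (ref, alt): scorable_strands(ref, alt)
    for ref, alt in permutations("ACGT", 2)   # 12 plus-strand substitutions
}
both_strands = sorted(k for k, v in table.items() if all(v))
one_strand = sorted(k for k, v in table.items() if any(v) and not all(v))
```

In this model the eight transversions (four base pair classes) are visible on both strands, whereas the four transitions A>G, C>T, G>A, and T>C (two base pair classes) are visible on one strand only, in agreement with the counts given in the text.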
Because the power of BiSeqS lies in SDMs, only mutations that are interpretable in both strands are considered below.
BiSeqS Increases the Specificity of Mutation Calling.
We selected eight amplicons within prototypic cancer driver genes to assess BiSeqS performance. For each of the eight amplicons, two forward primers and two reverse primers for each strand were synthesized and tested using the principles described above and in Materials and Methods. For all amplicons, at least one primer pair for each strand was found capable of specifically amplifying the intended strand with high efficiency, as judged by polyacrylamide gel analysis (Fig. S2). The sequences of these primers are listed in Dataset S5.
For each of the eight amplicons, we compared the specificity of BiSeqS to that of conventional next-generation sequencing (NGS) and molecular barcode-assisted sequencing (i.e., SafeSeqS). We considered only those potential mutations that could be discerned in both strands, as described above. There were a total of 608 bp within these amplicons, yielding a total of 1,550 possible single base substitutions (SBSs). Of these 1,550 potential SBSs, 1,252 (80.8%) were scorable as SDMs; the remainder were transitions that were not scorable for the reasons noted above. There were also many possible indels at each position that could have been observed in the sequencing data, all scorable as SDMs.
In the actual experiment, we could distinguish the strand used as template in the sequencing instrument because of the bisulfite conversion. In light of this, there were actually 2,504 mutations (2× the number of mutations noted in the previous paragraph) that could be scored for both conventional and molecular-barcode assisted sequencing. Of these 2,504 potential SBSs, 1,865 (74.5% of the total possible mutations) were actually observed upon conventional sequencing (25), highlighting the relatively large number of errors observed unless error correction by SafeSeqS or BiSeqS is applied (Dataset S6). There was no discernible difference between the two strands with respect to the number of mutations observed, with 907 and 958 mutations observed on the plus and minus strands, respectively. There were also 298 small insertions or deletions observed by conventional NGS.
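The counts in the two preceding paragraphs are related by simple arithmetic, which can be verified directly. The snippet below uses only figures stated in the text; the variable names are illustrative.

```python
# Consistency check of the substitution counts quoted in the text.
total_sbs = 1550        # possible SBSs within the eight amplicons
scorable_sbs = 1252     # SBSs scorable as SDMs

# Each scorable SBS can be observed on either strand.
per_strand_scorable = 2 * scorable_sbs                # 2,504

observed_plus, observed_minus = 907, 958
observed_total = observed_plus + observed_minus       # 1,865

pct_scorable = round(100 * scorable_sbs / total_sbs, 1)            # 80.8
pct_observed = round(100 * observed_total / per_strand_scorable, 1)  # 74.5
```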
Application of the molecular barcoding approach to these data considerably reduced the number of mutations, as is evident from a comparison of Fig. S3 A and B (note that the y axis scale was reduced by two orders of magnitude in Fig. S3B). The most relevant measure of this reduction is the comparison of the mutant allele frequencies (MAFs) before and after molecular barcoding was applied. Before molecular barcoding was applied, the median MAF of the SBS in the plus strand was 0.0233% (average, 0.0720%; 95% CI, 0.0627–0.0813%; Fig. 2 A–C and Dataset S6). It was similar in the minus strand: median of 0.0185%, average of 0.0751%, 95% CI 0.0643–0.0859%. As shown in Fig. 2B, after molecular barcoding, the MAF in the plus strand was reduced eightfold, to a median of 0.0000%, average of 0.0091% (95% CI of 0.0062–0.0119%; P < 10^−12, paired two-tailed Student’s t test). Note that the MAF after molecular barcoding is a measure of supermutant allele frequency (SMAF) but is labeled MAF in Fig. 2B for simplicity. The MAF of the minus strand was reduced ninefold by molecular barcoding (median of 0.0000%, average of 0.0080%, 95% CI of 0.0047–0.0113%; P < 10^−12, paired two-tailed Student’s t test). The magnitude of the reductions achieved by SafeSeqS was in accordance with expectations from experiments on native DNA that had not been treated with bisulfite (27).
Application of BiSeqS to these data resulted in a further striking reduction in errors. Only four SDMs were observed over all eight amplicons sequenced, as opposed to 1,865 and 163 mutations without and with molecular barcoding, respectively (Fig. S3; note that the y axis of Fig. S3C has been reduced by another order of magnitude compared with Fig. S3B). This was reflected in the MAFs, as shown in Fig. 2C, which were reduced by 1,217-fold through BiSeqS compared with NGS and 141-fold compared with molecular barcoding (median of 0.0000%, average of 0.0001%, 95% CI of 0.0000–0.0001%; P < 10^−12, paired two-tailed Student’s t test).
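The fold reductions quoted for molecular barcoding follow from the average MAFs reported for each strand, and they agree with the ratio of the two BiSeqS fold reductions. The check below uses only values printed in the text. The average BiSeqS MAF is given only as 0.0001% after rounding, which is too coarse to reproduce the 1,217-fold figure, so that value is taken as stated rather than recomputed.

```python
# Fold reductions recomputed from the average MAFs (in percent) in the text.
ngs_avg = {"plus": 0.0720, "minus": 0.0751}
barcoded_avg = {"plus": 0.0091, "minus": 0.0080}

fold_by_barcoding = {
    strand: ngs_avg[strand] / barcoded_avg[strand] for strand in ngs_avg
}
# plus: about 7.9 (reported as eightfold); minus: about 9.4 (ninefold)

# Stated BiSeqS reductions: 1,217-fold vs. NGS, 141-fold vs. barcoding.
# Their ratio is the implied NGS-to-barcoding reduction.
implied_barcoding_fold = 1217 / 141   # about 8.6
```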
BiSeqS also reduced errors at indels; there were 364 mutants, 11 supermutants, and zero SDMs observed in the eight amplicons (Dataset S6 and Figs. S4 and S5). The MAFs were thereby reduced from an average of 0.0041% with NGS to 0.0011% with molecular barcoding to 0.0000% with BiSeqS (P < 1.2 × 10^−6 for NGS compared with molecular barcoding for the plus strand, P < 7.5 × 10^−4 for NGS compared with molecular barcoding for the minus strand, and P < 1.3 × 10^−2 for molecular barcoding compared with BiSeqS).
Sensitivity of BiSeqS.
Massively parallel sequencing allows billions of amplicons to be assessed simultaneously, resulting in theoretical sensitivities of one mutation among >1 billion WT templates for any base within an amplicon. The actual sensitivities in clinical samples are limited only by the amount of input DNA and the specificity. In many types of liquid biopsies, such as those from plasma, pancreatic cysts, CSF, and urine, the total DNA available is often <33 ng (7, 9, 12, 39). A sensitivity of 0.01% is therefore adequate for detecting the one or two mutant molecules that may exist among the ∼10,000 templates contained in 33 ng of human DNA in such samples. The reliability of this detection is limited by the biological and technical specificities, where the queried mutation must be found at far lower frequencies in the normal control samples used for comparison with the tumor. Although the biological issues that might lead to mutations in normal samples cannot be circumvented (40), technical issues can be addressed and overcome through methodological advances such as BiSeqS.
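The relationship between DNA mass, template number, and the required sensitivity can be made explicit. The sketch below assumes a haploid human genome mass of approximately 3.3 pg, a commonly used approximation that is not stated in the text.

```python
# Template molecules in a given mass of human DNA (illustrative estimate).
HAPLOID_GENOME_PG = 3.3   # assumed mass of one haploid genome, in picograms


def template_count(dna_ng):
    """Approximate number of haploid genome equivalents in dna_ng nanograms."""
    return round(dna_ng * 1000 / HAPLOID_GENOME_PG)


templates_in_33_ng = template_count(33)             # about 10,000
single_mutant_fraction = 1 / templates_in_33_ng     # 0.0001, i.e., 0.01%
```

A single mutant template in 33 ng of DNA therefore corresponds to a mutant allele fraction of about 0.01%, which sets the sensitivity target discussed above.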
To address the sensitivity of BiSeqS, we evaluated tumor samples containing 10 double-stranded mutations (20 mutations if each strand is counted separately) within the eight amplicons described above (Dataset S7). The proportion of mutations in each of the tumor samples was defined through NGS. We used the DNA from these tumors to create the scenario characteristic of liquid biopsies, wherein a small amount of DNA from neoplastic cells is mixed with a much larger amount of DNA from normal cells in the patient. More specifically, we diluted this tumor DNA with DNA from normal leukocytes to achieve minor allele fractions of 0.02% and 0.20% and then used bisulfite treatment to convert the mixtures. We determined the mutant allele fractions of each of the tumor-derived mutations when analyzed with standard NGS, with molecular barcodes, or with BiSeqS, in all cases holding the input DNA to 5,000 template molecules per well, and performing each experiment in six wells. We found that each of the three methods of analysis yielded mutant allele fractions that were similar to those expected from the dilutions (examples in Fig. 3 and all data in Dataset S7). This experiment demonstrated that the efficiency of each of the steps in BiSeqS, from bisulfite conversion through the amplification and sequencing steps, was high.
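The number of mutant templates expected in this design can be estimated from the stated input and dilutions. The sketch below additionally assumes that mutant templates are distributed among wells by Poisson sampling and that every template is amplified; both are assumptions made for illustration and are not claims from the text.

```python
# Expected mutant templates per well and chance of sampling at least one.
import math

TEMPLATES_PER_WELL = 5000
WELLS = 6


def expected_mutants(allele_fraction, wells=1):
    """Expected number of mutant templates across the given number of wells."""
    return TEMPLATES_PER_WELL * allele_fraction * wells


def prob_at_least_one(expected):
    """Poisson probability of sampling one or more mutant templates."""
    return 1 - math.exp(-expected)


per_well_low = expected_mutants(0.0002)           # 0.02% dilution: about 1
total_low = expected_mutants(0.0002, WELLS)       # about 6 across six wells
p_detect_low = prob_at_least_one(total_low)       # about 0.998

per_well_high = expected_mutants(0.002)           # 0.20% dilution: about 10
```

At the 0.02% dilution a single well is expected to contain only about one mutant template, which illustrates why distributing the sample across several wells increases the chance of observing the mutation.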
Although the efficiency of amplification was therefore always high enough to detect the mutant templates, the MAFs of the normal controls limited the interpretation of the sequencing data. We considered a mutant call a true mutation when the signal-to-noise ratio (SNR), defined as the MAF in the tumor specimen divided by the MAF in normal cells, was >10. We averaged the MAF in both strands for this calculation when considering standard NGS or molecular barcode-assisted NGS. Fig. 3 and Fig. S6 show the detected MAFs for dilutions of 0.20% and 0.02%. Standard NGS yielded SNRs > 10 for only two of the eight mutations at a neoplastic cell content of 0.20% and one out of the three mutations at neoplastic cell contents of 0.02%. Molecular barcoding yielded SNRs > 10 for 7 of the 10 mutations at these neoplastic cell contents. In contrast, BiSeqS yielded SNR > 10 for all 10 mutations at all tested neoplastic cell fractions (Fig. 3, Fig. S6, and Dataset S7). Representative SNR plots of the MAF for mutations in NRAS and TP53 are shown in Fig. S7 A and B, respectively.
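The SNR criterion can be illustrated with the average background MAFs reported in the specificity experiments. This is a sketch only: the backgrounds used here are averages over all scorable positions rather than the position-specific values in Dataset S7, the tumor MAF is set equal to the nominal dilution, and treating a background of zero as an infinite SNR is an assumption of this example.

```python
# Signal-to-noise ratio (SNR) criterion, using average backgrounds from the text.
def snr(maf_tumor, maf_normal):
    """MAF in the tumor specimen divided by MAF in normal cells."""
    if maf_normal == 0:
        return float("inf")   # assumption: no background means unbounded SNR
    return maf_tumor / maf_normal


def is_true_mutation(maf_tumor, maf_normal, threshold=10):
    return snr(maf_tumor, maf_normal) > threshold


# Average background MAFs in percent; NGS and barcoding average both strands.
background = {
    "NGS": (0.0720 + 0.0751) / 2,
    "barcoding": (0.0091 + 0.0080) / 2,
    "BiSeqS": 0.0001,
}

calls = {
    dilution: {m: is_true_mutation(dilution, bg) for m, bg in background.items()}
    for dilution in (0.20, 0.02)   # nominal dilutions, in percent
}
```

With these average backgrounds, standard NGS fails the criterion at both dilutions, molecular barcoding passes only at 0.20%, and BiSeqS passes at both, which mirrors the trend reported for the individual mutations.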
BiSeqS Simultaneously Detects Methylation Status on Both Strands.
Cytosine bases in 5′-CpG dinucleotides that are methylated are protected from conversion to uracil during bisulfite treatment, allowing BiSeqS to detect the methylation status of the plus and minus strands simultaneously. Although not the primary purpose of BiSeqS, this discrimination could prove useful for the analysis of methylation that occurs at low levels, either for basic research or clinical purposes. Although bisulfite treatment and specially designed primers have often been used to evaluate methylation in the past for a variety of clinical purposes (41–43), the combination of molecular barcoding with simultaneous amplification of both strands provides unprecedented sensitivity in this type of analysis.
To demonstrate the ability of BiSeqS to discriminate the methylation status on both strands simultaneously, we evaluated a region of the TP53 gene that contains a known methylated CpG at hg19 position 7,572,973–4. Greater than 90% of the UIDs on both strands were found to be methylated at the C at the plus strand of position 7,572,973 and the C opposite the G on the minus strand at position 7,572,974. Greater than 99.8% of the C residues that were not at 5′-CpG dinucleotides within this amplicon were found to be converted to Ts, providing an essential control for interpreting the extent of methylation. We then searched for evidence of double-stranded methylation within all eight amplicons evaluated in this study in normal white blood cells (WBCs). There were two 5′-CpG residues within the 608 bp that could be evaluated. Of these, we found that both CpGs were methylated on both strands, with the fraction of methylated alleles ranging from 92.10% to 96.10% (Dataset S8).
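The methylated fraction at a CpG site can be computed per strand from the consensus base of each UID family, because a retained C indicates protection from conversion and a T indicates conversion. The sketch below uses hypothetical UID-level calls and is not the analysis code used for Dataset S8.

```python
# Fraction of UID families that retain C (methylated) at a CpG cytosine.
def methylated_fraction(uid_calls):
    """uid_calls: consensus base per UID family at the CpG cytosine,
    read in the orientation of the strand being analyzed.
    'C' = unconverted (methylated), 'T' = converted (unmethylated)."""
    informative = [base for base in uid_calls if base in ("C", "T")]
    if not informative:
        raise ValueError("no informative UID families at this site")
    return informative.count("C") / len(informative)


# Hypothetical calls for the two strands at one CpG site.
plus_calls = ["C"] * 94 + ["T"] * 6
minus_calls = ["C"] * 96 + ["T"] * 4

plus_fraction = methylated_fraction(plus_calls)     # 0.94
minus_fraction = methylated_fraction(minus_calls)   # 0.96
```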
Discussion
The results described above show that BiSeqS can accurately quantify rare mutations in a highly sensitive and specific manner. We envision that its major use will be in the surveillance of patients with cancer whose primary tumors have been sequenced. It has already been shown that liquid biopsies can be used for this purpose and can accurately identify patients who are in clinical remission but are destined to recur (7, 11, 44). Many such patients, particularly when their residual burden of disease is small and therefore most likely to be cured by adjuvant therapy (45), have only one or two mutant DNA molecules in 10 mL of plasma. In such situations, a technique like BiSeqS, which can efficiently use all template molecules while maintaining high specificity, could prove particularly useful.
A disadvantage of BiSeqS is that it cannot be applied to most transition mutations because bisulfite conversion of C to U mimics such transitions and makes them ambiguous. Although one strand can still be analyzed, the power of the technology lies in its ability to detect mutations in both strands, so BiSeqS offers no advantage over molecular barcoding for such mutations. For example, KRAS codons 12, 13, and 61 are commonly affected by single base substitutions in colon, rectal, and pancreatic adenocarcinomas (46). BiSeqS can be used to quantify KRAS mutations in 38.7%, 43.4%, and 47.6% of these cancers, respectively (47). Across all cancers and mutations cataloged in the IARC TP53 database, approximately 44% of all mutations (i.e., SBS and indels) are amenable to BiSeqS analysis (IARC TP53 Database, R18).
Additionally, bisulfite treatment can result in conversion of methylated C bases to T in rare instances, depending upon the incubation time and reagent concentration (48). The protocol used for BiSeqS employs reduced incubation temperatures that appear to minimize this possibility (48), but sequence heterogeneity at methylated CpG sites may raise background, and such sites are not preferred for mutation evaluation.
However, for liquid biopsies in surveillance, limitations inherent to a single gene are not a major issue because several different mutations, including transversions and indels, are generally observed upon genome-wide sequencing of cancers (1–3), and any identified mutation could in principle be applied to this clinical scenario. Based on a recent study of 3,281 cancer samples, it was evident that most cancers have at least one driver gene mutation that should be amenable to BiSeqS analysis (49). It is also worth noting that passenger gene mutations that are clonal can also be useful for diagnostic evaluation (50). Because there are at least 10-fold as many passenger mutations as driver gene mutations in nearly all cancers, it is likely that the vast majority of cancers will have several somatic mutations that could be assessed by BiSeqS. For example, in a study of SBSs, insertions, and deletions detected in breast cancer, we calculate that 62.1% of mutations would be amenable to BiSeqS analysis (51). Because breast cancers nearly always contain >25 clonal substitutions, virtually all breast cancers would have many mutations amenable to BiSeqS analysis.
BiSeqS can complement screening for other genomic alterations, such as structural variants (SVs), for rare allele detection and monitoring (52). SVs provide exquisitely specific markers for cancer that can be used for liquid biopsies (9, 50). Simple polymerase errors do not produce SVs, which gives SVs an advantage over single base substitutions as diagnostic targets. On the other hand, there are disadvantages to the use of SVs as diagnostic markers. First, the initial detection of SVs requires whole genome sequencing of tumors rather than targeted sequencing; the latter is currently much less expensive than the former. Second, and more importantly, SVs are “private”; that is, they are generally confined to one or a small number of patients. To be used as a tumor marker, primers that specifically amplify the translocation junction must be designed and tested on the patient's tumor to ensure that the SV is somatic and the amplicon is specific. Although this approach is feasible in a research setting, it is not easily practicable in large-scale settings. In contrast, single base substitutions and indels in driver genes are observed in numerous independent tumors, and a small set of “off-the-shelf” primers can be used to assess most patients. For example, we estimate that >98% of patients with colorectal cancer have mutations detectable through amplification with one of 130 predesigned primer pairs.
In the future, it is possible that chemical treatments of DNA that convert A:T base pairs (rather than C:G base pairs) to other base pairs could substitute for bisulfite when transition mutations must be analyzed. Another avenue for future research is multiplexing, which would permit mutations in a variety of amplicons to be assessed simultaneously in screening scenarios. Such multiplexing is more difficult than conventional multiplexing because two amplicons must be designed for each region of interest while achieving homogeneous efficiency of every amplicon in all regions of interest.
Materials and Methods
Detailed materials and methods are available in SI Materials and Methods. Briefly, DNA from macrodissected formalin-fixed paraffin-embedded (FFPE) tumor sections was extracted and bisulfite treated with an EZ DNA Methylation Kit (Zymo Research, cat. no. D5001). Custom primers containing a unique identifier (UID) and amplicon-specific sequence were used to amplify both strands of DNA, and the resulting products were sequenced on an Illumina MiSeq instrument. To characterize the specificity of BiSeqS, DNA isolated from one normal tissue was bisulfite-treated and processed through the BiSeqS pipeline to query for single base substitutions and indels. To characterize the sensitivity of BiSeqS, macrodissected tumor samples with known MAFs were diluted with the DNA from normal WBCs to obtain final neoplastic cell contents ranging from 0.02% to 0.20%, bisulfite-treated, and processed through the BiSeqS pipeline. All tissues were obtained from consenting patients at the Johns Hopkins Hospital with the approval of the Johns Hopkins Institutional Review Board.
SI Materials and Methods
Human Tissues.
FFPE tumor sections were macrodissected under a dissecting microscope to ensure a neoplastic cellularity of >30%. DNA was purified with a Qiagen FFPE Kit (Qiagen, cat. no. 56494). Tumor samples with known MAFs were diluted with the DNA from normal WBCs to obtain final neoplastic cell contents ranging from 0.02% to 0.20%. To precisely quantify the DNA concentrations of the tumor and normal DNA samples, various mixtures of tumor and normal DNA were amplified with primers that revealed normal single nucleotide polymorphisms within the final amplicons. NGS was then used to quantify the fraction of neoplastic cells within each of the tested mixtures, and the same mixtures were then used as template DNA for BiSeqS, as described below.
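The dilution arithmetic described above can be sketched as follows. This is only an illustration, not the authors' procedure: it assumes that neoplastic and normal cells contribute equal amounts of DNA and that both DNA stocks are at the same concentration. The function name and the 40% starting cellularity are hypothetical.

```python
def tumor_dna_fraction(initial_neoplastic, target_neoplastic):
    """Fraction of the final mixture that must come from the tumor DNA stock.

    Both arguments are neoplastic cell fractions between 0 and 1.
    Assumes equal DNA yield per neoplastic and normal cell (an assumption,
    not something stated in the text).
    """
    if not 0 < initial_neoplastic <= 1:
        raise ValueError("initial_neoplastic must be in (0, 1]")
    if not 0 <= target_neoplastic <= initial_neoplastic:
        raise ValueError("target cannot exceed the starting neoplastic content")
    return target_neoplastic / initial_neoplastic


# Hypothetical tumor sample with 40% neoplastic cells, diluted to the two
# ends of the range named in the text (0.02% and 0.20%).
for target in (0.0002, 0.0020):
    frac = tumor_dna_fraction(0.40, target)
    print(f"target {target:.2%}: {frac:.4%} tumor DNA, {1 - frac:.4%} normal DNA")
```

In practice the text notes that the actual neoplastic fraction of each mixture was measured by NGS of polymorphic amplicons, so a calculation like this would only provide a starting estimate.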
Bisulfite Treatment and PCR Amplification of Purified DNA for BiSeqS.
After extensive testing of various commercially available bisulfite conversion kits, we chose the EZ DNA Methylation Kit (Zymo Research, cat. no. D5001) to bisulfite treat and desulphonate DNA samples following the manufacturer’s recommended protocol. DNA was eluted in 10 µL of Elution Buffer and stored at –20 °C. Custom HPLC-purified PCR primers (IDT) were designed for each bisulfite-converted strand of the DNA double helix at the amplified loci (Dataset S5). Compared with traditional PCR primers, the custom primers were longer to account for the reduced sequence complexity of bisulfite-converted DNA. Each forward primer contained the sequence necessary for well barcode amplification at the 5′ end, followed by a string of 14 random nucleotides that served as the unique identifier (UID), and amplicon-specific primer sequences at the 3′ end (Fig. S1 A and B). Each reverse primer contained the sequence necessary for well barcode amplification at the 5′ end, followed by amplicon-specific primer sequences. For the primers to anneal to bisulfite-converted DNA, specific nucleotides in the wild-type amplicon-specific primer sequences had to be replaced. T replaced C in the plus strand forward primer, whereas A replaced G in the plus strand reverse primer. A replaced G in the minus strand forward primer, and T replaced C in the minus strand reverse primer.
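The four substitution rules above can be expressed as a small lookup table. The sketch below is illustrative only: it assumes every C (or G) in the primer site is unmethylated and therefore fully converted, the function names are hypothetical, and the example sequence is made up rather than taken from Dataset S5. It also counts the number of distinct sequences available to a 14-nucleotide random UID.

```python
# Substitution applied to each wild-type primer, as listed in the text:
# (strand, primer role) -> (base replaced, replacement base)
PRIMER_RULES = {
    ("plus", "forward"): ("C", "T"),
    ("plus", "reverse"): ("G", "A"),
    ("minus", "forward"): ("G", "A"),
    ("minus", "reverse"): ("C", "T"),
}


def bisulfite_primer(wild_type, strand, role):
    """Return the primer sequence adapted to bisulfite-converted template.

    Assumes complete conversion, i.e. no methylated cytosines in the
    primer-binding site (an assumption of this sketch).
    """
    seq = wild_type.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("primer must contain only A, C, G, T")
    old, new = PRIMER_RULES[(strand, role)]
    return seq.replace(old, new)


def uid_space(length=14):
    """Number of distinct UIDs available from `length` random nucleotides."""
    return 4 ** length


example = "ACGTCCGA"  # made-up sequence, for illustration only
for (strand, role) in PRIMER_RULES:
    print(strand, role, bisulfite_primer(example, strand, role))
print("possible 14-nt UIDs:", uid_space())
```

Because two of the four bases collapse into one in each converted primer, the resulting sequences are less complex, which is consistent with the need for longer primers noted above.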
The molecular barcoding PCR was performed in 25 µL reactions containing 12.5 µL of 2X Phusion U Hot Start PCR Master Mix (ThermoFisher, cat. no. F533S) and optimized concentrations of each forward and reverse primer, ranging from 0.125 µM to 4 µM, for a total of four primers per well. The following cycling conditions were used: one cycle of 95 °C for 3 min, 20 cycles of 95 °C for 10 s, 63 °C for 2 min, and 72 °C for 2 min.
AMPure XP (Beckman Coulter, cat. no. A63881) was used to remove the primers used for UID assignment. We used 0.025% of the PCR product generated from the UID cycles as template for the well barcode (WBC) cycles. Primers used for the well barcode step were identical to those described previously (27) and are diagrammed in Fig. S1 A and B. The WBC cycles were performed in 25 µL reactions containing 11.8 µL of water (ThermoFisher UltraPure, cat. no. 10977–023), 5 µL of 5X Phusion HF Buffer (ThermoFisher, cat. no. F518L), 0.5 µL of 10 mM dNTPs (NEB, cat. no. N0447L), and 0.25 µL of Phusion Hot Start II DNA Polymerase (ThermoFisher, cat. no. F549L). The following cycling conditions were used: one cycle of 98 °C for 2 min, 24 cycles of 98 °C for 10 s, 65 °C for 2 min, and 72 °C for 2 min.
Sequencing.
Sequencing of all of the amplicons described above was performed using an Illumina MiSeq instrument. The total length of the reads used for each instrument varied from 79 to 130 bases. Reads passing Illumina CASAVA Chastity filters were used for subsequent analysis.
BiSeqS Pipeline.
High-quality reads were processed with the SafeSeqS pipeline (27) to generate aligned data that were then organized into tables for each BiSeqS analysis. Each of the tables contains (i) strand information, (ii) well barcode and UID sequences, (iii) information listing all differences from the reference amplicon, and (iv) prevalence of each UID family corresponding to a change with respect to all UID families per amplicon. To determine whether a combination of plus and minus strand changes constitutes a double-strand mutant, the various mutations detected at a specific genomic locus were compared with respect to (i) sample identity, (ii) chromosome, (iii) genomic position, and (iv) mutation type. Changes were called as true mutations when (i) the change appeared on both the plus and the minus strands and (ii) the MAFs corresponding to the plus and minus strands differed by less than 10-fold.
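The two calling criteria can be illustrated with a short sketch. It is not the authors' pipeline: the record type, field names, and example values are hypothetical, and it assumes that each strand-level change has already been translated into a common genomic representation so that plus and minus strand calls can be matched by key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrandCall:
    sample: str
    chrom: str
    pos: int
    mutation: str   # mutation type in a shared genomic representation
    strand: str     # "plus" or "minus"
    maf: float      # mutant allele fraction observed on that strand


def call_double_strand(calls, max_fold=10.0):
    """Return keys of changes seen on both strands with MAFs within max_fold."""
    plus, minus = {}, {}
    for c in calls:
        key = (c.sample, c.chrom, c.pos, c.mutation)
        # If a key repeats on one strand, the last record wins in this sketch.
        (plus if c.strand == "plus" else minus)[key] = c.maf

    confirmed = []
    for key in plus.keys() & minus.keys():   # criterion (i): both strands
        low, high = sorted((plus[key], minus[key]))
        if low > 0 and high / low < max_fold:  # criterion (ii): <10-fold apart
            confirmed.append(key)
    return sorted(confirmed)


# Hypothetical strand-level calls.
calls = [
    StrandCall("s1", "chr12", 100, "G>T", "plus", 0.001),
    StrandCall("s1", "chr12", 100, "G>T", "minus", 0.002),
    StrandCall("s1", "chr12", 200, "G>C", "plus", 0.001),    # plus strand only
    StrandCall("s1", "chr17", 300, "delA", "plus", 0.0001),
    StrandCall("s1", "chr17", 300, "delA", "minus", 0.01),   # 100-fold apart
]
print(call_double_strand(calls))
```

Only the first change satisfies both criteria; the second lacks a minus strand observation, and the third fails the fold-difference filter.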
Characterization of BiSeqS Specificity.
To characterize the specificity of BiSeqS, DNA isolated from one normal tissue was bisulfite-treated and processed through the BiSeqS pipeline to query for single base substitutions and indels. Analysis using NGS across 8 amplicons and 608 bases for indels yielded 907 unique mutations identified on the plus strand and 958 unique mutations identified on the minus strand that were ultimately amenable to analysis by BiSeqS. For each strand of each amplicon, we calculated the MAF by dividing the number of reads or the number of UIDs containing >2 mutant reads per UID (UID Family Count > 2) by the number of total reads or the number of total UIDs, respectively. Using molecular barcodes to group reads into families decreased the number of unique mutations to 92 on the plus strand and 71 on the minus strand (Dataset S6). After matching the plus and minus strand amplicons and imposing a filter requiring the ratio of the MAF observed on the plus strand to the MAF observed on the minus strand (and vice versa) to be less than 10, four mutations were identified (Dataset S6). The number of SDMs was taken to be the minimum of the number of supermutants on the plus or the minus strand that corresponded to a mutation, as this is the limiting number of double-stranded supermutant molecules detectable. The total number of double-stranded molecules was similarly taken to be the minimum of the number of total UIDs on the plus or the minus strand, as this is the limiting number of double-stranded template molecules detected. Standard NGS detected 197 and 167 indels on the plus and minus strands, respectively. Use of molecular barcodes reduced the number of detected indels to 6 and 5 for the plus and minus strand, respectively, whereas BiSeqS double-strand analysis reduced the number of indels to zero.
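The counting conventions in this paragraph can be sketched in a few lines. This is a simplified reading of the text rather than the authors' code: a UID family is counted as mutant when it contains more than two mutant reads, and the double-strand counts are the minima across the two strands. All names and numbers below are hypothetical.

```python
from collections import Counter


def count_mutant_uid_families(reads, min_mutant_reads=3):
    """Count UIDs with more than two mutant reads (UID Family Count > 2).

    `reads` is an iterable of (uid, is_mutant) pairs for one strand
    of one amplicon.
    """
    mutant_reads = Counter(uid for uid, is_mutant in reads if is_mutant)
    return sum(1 for n in mutant_reads.values() if n >= min_mutant_reads)


def double_strand_summary(plus_mutant_uids, minus_mutant_uids,
                          plus_total_uids, minus_total_uids):
    """Limiting counts of double-stranded mutant and template molecules."""
    sdm = min(plus_mutant_uids, minus_mutant_uids)
    total = min(plus_total_uids, minus_total_uids)
    maf = sdm / total if total else 0.0
    return sdm, total, maf


# Hypothetical reads: UID "A" has 3 mutant reads, "B" has 2, "C" has none.
reads = [("A", True)] * 3 + [("B", True)] * 2 + [("C", False)] * 5
print(count_mutant_uid_families(reads))

# Hypothetical per-strand counts for one mutation in one amplicon.
print(double_strand_summary(4, 6, 20000, 25000))
```

Taking the minimum on each side is conservative: a double-stranded molecule can only be counted if both of its strands were recovered, so the strand with fewer observations sets the limit.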
Acknowledgments
We thank Margaret Hoang, Surojit Sur, Nick Wyhs, Wyatt McMahon, and Ming Zhang for their helpful comments on the project and manuscript as well as Lisa Dobbyn, Janine Ptak, Joy Schaefer, Natalie Silliman, and Maria Papoli for their expert technical assistance. This work was supported by The Virginia and D.K. Ludwig Fund for Cancer Research, the Lustgarten Foundation for Pancreatic Cancer Research, and National Institutes of Health Grants P50-CA62924, CA 06973, and GM 07309. All sequencing was performed at the Sol Goldman Sequencing Facility at Johns Hopkins.
Footnotes
Conflict of interest statement: N.P., K.W.K., and B.V. have no conflicts of interest with respect to the new technology described in this manuscript, as defined by the Johns Hopkins University policy on conflict of interest. N.P., K.W.K., and B.V. are founders of Personal Genome Diagnostics, Inc. and PapGene, Inc. K.W.K. and B.V. are members of the Scientific Advisory Board of Sysmex-Inostics. B.V. is also a member of the Scientific Advisory Boards of Morphotek and Exelixis GP. These companies and others have licensed technologies from Johns Hopkins University; N.P., K.W.K., and B.V. are the inventors of some of these technologies and receive equity or royalties from their licenses. The terms of these arrangements are being managed by the Johns Hopkins University in accordance with its conflict of interest policies.
Data deposition: The sequences reported in this paper have been deposited in the European Genome-Phenome Archive (EGA) database (accession no. EGAS00001002406; https://www.ebi.ac.uk/ega/home).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1701382114/-/DCSupplemental.
References
- 1. Garraway LA, Lander ES. Lessons from the cancer genome. Cell. 2013;153:17–37. doi: 10.1016/j.cell.2013.03.002.
- 2. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458:719–724. doi: 10.1038/nature07943.
- 3. Vogelstein B, et al. Cancer genome landscapes. Science. 2013;339:1546–1558. doi: 10.1126/science.1235122.
- 4. Sidransky D, et al. Identification of ras oncogene mutations in the stool of patients with curable colorectal tumors. Science. 1992;256:102–105. doi: 10.1126/science.1566048.
- 5. Sidransky D, et al. Identification of p53 gene mutations in bladder cancers and urine samples. Science. 1991;252:706–709. doi: 10.1126/science.2024123.
- 6. Hruban RH, van der Riet P, Erozan YS, Sidransky D. Brief report: Molecular biology and the early detection of carcinoma of the bladder--The case of Hubert H. Humphrey. N Engl J Med. 1994;330:1276–1278. doi: 10.1056/NEJM199405053301805.
- 7. Tie J, et al. Circulating tumor DNA analysis detects minimal residual disease and predicts recurrence in patients with stage II colon cancer. Sci Transl Med. 2016;8:346ra92. doi: 10.1126/scitranslmed.aaf6219.
- 8. Dawson SJ, et al. Analysis of circulating tumor DNA to monitor metastatic breast cancer. N Engl J Med. 2013;368:1199–1209. doi: 10.1056/NEJMoa1213261.
- 9. Bettegowda C, et al. Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci Transl Med. 2014;6:224ra24. doi: 10.1126/scitranslmed.3007094.
- 10. Kinde I, et al. Evaluation of DNA from the Papanicolaou test to detect ovarian and endometrial cancers. Sci Transl Med. 2013;5:167ra4. doi: 10.1126/scitranslmed.3004952.
- 11. Wang Y, et al. Detection of somatic mutations and HPV in the saliva and plasma of patients with head and neck squamous cell carcinomas. Sci Transl Med. 2015;7:293ra104. doi: 10.1126/scitranslmed.aaa8507.
- 12. Wang Y, et al. Detection of tumor-derived DNA in cerebrospinal fluid of patients with primary tumors of the brain and spinal cord. Proc Natl Acad Sci USA. 2015;112:9704–9709. doi: 10.1073/pnas.1511694112.
- 13. Wang Y, et al. Diagnostic potential of tumor DNA from ovarian cyst fluid. eLife. 2016;5:5. doi: 10.7554/eLife.15175.
- 14. Springer S, et al. A combination of molecular markers and clinical features improve the classification of pancreatic cysts. Gastroenterology. 2015;149:1501–1510. doi: 10.1053/j.gastro.2015.07.041.
- 15. Forshew T, et al. Noninvasive identification and monitoring of cancer mutations by targeted deep sequencing of plasma DNA. Sci Transl Med. 2012;4:136ra68. doi: 10.1126/scitranslmed.3003726.
- 16. De Mattos-Arruda L, Caldas C. Cell-free circulating tumour DNA as a liquid biopsy in breast cancer. Mol Oncol. 2016;10:464–474. doi: 10.1016/j.molonc.2015.12.001.
- 17. Vogelstein B, Kinzler KW. Digital PCR. Proc Natl Acad Sci USA. 1999;96:9236–9241. doi: 10.1073/pnas.96.16.9236.
- 18. Dressman D, Yan H, Traverso G, Kinzler KW, Vogelstein B. Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc Natl Acad Sci USA. 2003;100:8817–8822. doi: 10.1073/pnas.1133470100.
- 19. Margulies M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380. doi: 10.1038/nature03959.
- 20. Mitra RD, Church GM. In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Res. 1999;27:e34. doi: 10.1093/nar/27.24.e34.
- 21. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26:1135–1145. doi: 10.1038/nbt1486.
- 22. Do H, Dobrovic A. Dramatic reduction of sequence artefacts from DNA isolated from formalin-fixed cancer biopsies by treatment with uracil-DNA glycosylase. Oncotarget. 2012;3:546–558. doi: 10.18632/oncotarget.503.
- 23. Do H, Wong SQ, Li J, Dobrovic A. Reducing sequence artifacts in amplicon-based massively parallel sequencing of formalin-fixed paraffin-embedded DNA by enzymatic depletion of uracil-containing templates. Clin Chem. 2013;59:1376–1383. doi: 10.1373/clinchem.2012.202390.
- 24. Bratman SV, Newman AM, Alizadeh AA, Diehn M. Potential clinical utility of ultrasensitive circulating tumor DNA detection with CAPP-Seq. Expert Rev Mol Diagn. 2015;15:715–719. doi: 10.1586/14737159.2015.1019476.
- 25. Bokulich NA, et al. Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods. 2013;10:57–59. doi: 10.1038/nmeth.2276.
- 26. Sykes PJ, et al. Quantitation of targets for PCR by use of limiting dilution. Biotechniques. 1992;13:444–449.
- 27. Kinde I, Wu J, Papadopoulos N, Kinzler KW, Vogelstein B. Detection and quantification of rare mutations with massively parallel sequencing. Proc Natl Acad Sci USA. 2011;108:9530–9535. doi: 10.1073/pnas.1105422108.
- 28. Casbon JA, Osborne RJ, Brenner S, Lichtenstein CP. A method for counting PCR template molecules with application to next-generation sequencing. Nucleic Acids Res. 2011;39:e81. doi: 10.1093/nar/gkr217.
- 29. Schmitt MW, et al. Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci USA. 2012;109:14508–14513. doi: 10.1073/pnas.1208715109.
- 30. Hoang ML, et al. Genome-wide quantification of rare somatic mutations in normal human tissues using massively parallel sequencing. Proc Natl Acad Sci USA. 2016;113:9846–9851. doi: 10.1073/pnas.1607794113.
- 31. He Y, Vogelstein B, Velculescu VE, Papadopoulos N, Kinzler KW. The antisense transcriptomes of human cells. Science. 2008;322:1855–1857. doi: 10.1126/science.1163853.
- 32. Frommer M, et al. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci USA. 1992;89:1827–1831. doi: 10.1073/pnas.89.5.1827.
- 33. Levy D, Wigler M. Facilitated sequence counting and assembly by template mutagenesis. Proc Natl Acad Sci USA. 2014;111:E4632–E4637. doi: 10.1073/pnas.1416204111.
- 34. Hayatsu H, Wataya Y, Kai K, Iida S. Reaction of sodium bisulfite with uracil, cytosine, and their derivatives. Biochemistry. 1970;9:2858–2865. doi: 10.1021/bi00816a016.
- 35. Clark SJ, Statham A, Stirzaker C, Molloy PL, Frommer M. DNA methylation: Bisulphite modification and analysis. Nat Protoc. 2006;1:2353–2364. doi: 10.1038/nprot.2006.324.
- 36. Li M, et al. Sensitive digital quantification of DNA methylation in clinical samples. Nat Biotechnol. 2009;27:858–863. doi: 10.1038/nbt.1559.
- 37. Lewis F, Maughan NJ, Smith V, Hillan K, Quirke P. Unlocking the archive--Gene expression in paraffin-embedded tissue. J Pathol. 2001;195:66–71. doi: 10.1002/1096-9896(200109)195:1<66::AID-PATH921>3.0.CO;2-F.
- 38. Koch I, et al. Real-time quantitative RT-PCR shows variable, assay-dependent sensitivity to formalin fixation: Implications for direct comparison of transcript levels in paraffin-embedded tissues. Diagn Mol Pathol. 2006;15:149–156. doi: 10.1097/01.pdm.0000213450.99655.54.
- 39. Kinde I, et al. TERT promoter mutations occur early in urothelial neoplasia and are biomarkers of early disease and disease recurrence in urine. Cancer Res. 2013;73:7162–7167. doi: 10.1158/0008-5472.CAN-13-2498.
- 40. Krimmel JD, et al. Ultra-deep sequencing detects ovarian cancer cells in peritoneal fluid and reveals somatic TP53 mutations in noncancerous tissues. Proc Natl Acad Sci USA. 2016;113:6005–6010. doi: 10.1073/pnas.1601311113.
- 41. Chung W, et al. Detection of bladder cancer using novel DNA methylation biomarkers in urine sediments. Cancer Epidemiol Biomarkers Prev. 2011;20:1483–1491. doi: 10.1158/1055-9965.EPI-11-0067.
- 42. Taby R, Issa JP. Cancer epigenetics. CA Cancer J Clin. 2010;60:376–392. doi: 10.3322/caac.20085.
- 43. Issa JP. DNA methylation as a clinical marker in oncology. J Clin Oncol. 2012;30:2566–2568. doi: 10.1200/JCO.2012.42.1016.
- 44. Harris FR, et al. Quantification of somatic chromosomal rearrangements in circulating cell-free DNA from ovarian cancers. Sci Rep. 2016;6:29831. doi: 10.1038/srep29831.
- 45. Bozic I, et al. Evolutionary dynamics of cancer in response to targeted combination therapy. eLife. 2013;2:e00747. doi: 10.7554/eLife.00747.
- 46. Fearon ER, Vogelstein B. A genetic model for colorectal tumorigenesis. Cell. 1990;61:759–767. doi: 10.1016/0092-8674(90)90186-i.
- 47. Prior IA, Lewis PD, Mattos C. A comprehensive survey of Ras mutations in cancer. Cancer Res. 2012;72:2457–2467. doi: 10.1158/0008-5472.CAN-11-2612.
- 48. Shiraishi M, Hayatsu H. High-speed conversion of cytosine to uracil in bisulfite genomic sequencing analysis of DNA methylation. DNA Res. 2004;11:409–415. doi: 10.1093/dnares/11.6.409.
- 49. Kandoth C, et al. Mutational landscape and significance across 12 major cancer types. Nature. 2013;502:333–339. doi: 10.1038/nature12634.
- 50. Leary RJ, et al. Detection of chromosomal alterations in the circulation of cancer patients with whole-genome sequencing. Sci Transl Med. 2012;4:162ra154. doi: 10.1126/scitranslmed.3004742.
- 51. Wood LD, et al. The genomic landscapes of human breast and colorectal cancers. Science. 2007;318:1108–1113. doi: 10.1126/science.1145720.
- 52. Macintyre G, Ylstra B, Brenton JD. Sequencing structural variants in cancer for precision therapeutics. Trends Genet. 2016;32:530–542. doi: 10.1016/j.tig.2016.07.002.