Abstract
Although next-generation sequencing technologies have been widely adopted for clinical diagnostic applications, an urgent need exists for multianalyte calibrator materials and controls to evaluate the performance of these assays. Control materials will also play a major role in the assessment, development, and selection of appropriate alignment and variant calling pipelines. We report an approach to provide effective multianalyte controls for next-generation sequencing assays, referred to as the control plasmid spiked-in genome (CPSG). Control plasmids that contain approximately 1000 bases of human genomic sequence with a specific mutation of interest positioned near the middle of the insert and a nearby 6-bp molecular barcode were synthesized, linearized, quantitated, and spiked into genomic DNA derived from formalin-fixed, paraffin-embedded–prepared HapMap cell lines at defined copy number ratios. Serial titration experiments demonstrated that the CPSGs performed with similar efficiency of variant detection as formalin-fixed, paraffin-embedded cell line genomic DNA. Repetitive analyses of one lot of CPSGs 90 times during 18 months revealed that the reagents were stable, with consistent detection of each of the plasmids at similar variant allele frequencies. CPSGs are designed to work across most next-generation sequencing methods, platforms, and data analysis pipelines. CPSGs are robust controls and can be used to evaluate the performance of different next-generation sequencing diagnostic assays, assess data analysis pipelines, and ensure robust assay performance metrics.
Next-generation sequencing (NGS) technology is having major effects on biomedical research. Decreasing costs and increasing data generation are driving rapid uptake of this method. Clinical applications have quickly followed.1, 2 NGS technology is currently under evaluation for guiding cancer patient treatment selection.3, 4 However, there is uncertainty that there is sufficient interlaboratory concordance for meaningful clinical use. The rapid proliferation of different sequencing methods, platforms, and data analysis tools has resulted in a high discordance of mutations reported from different clinical NGS assays.5, 6 Reference and control materials that contain known analytes (variants) at known allele fraction [variant allele frequency (VAF)] in a form comparable to clinical specimens are essential for comparing and monitoring the assay performance and will be valuable in the study of cross-platform comparisons and identifying weaknesses in informatics pipelines (ie, alignment and variant calling). However, unlike most conventional assays (eg, Sanger sequencing and PCR-based methods) that typically detect single or only a few analytes, an NGS assay usually measures hundreds to thousands of genomic loci. Currently, there is no standardized set of clinically relevant materials useful as controls or calibrators to standardize the assessment of NGS data across platforms, assays, and informatics pipelines. Genome in a Bottle, a public consortium led by the National Institute of Standards and Technology, has released a reference genome and will soon release several other genomes.7 These are valuable resources but do not directly address the need for clinically relevant controls and calibrators. Therefore, there is an urgent need to implement highly multiplexed materials as calibrators and controls for the clinical use of NGS assays.5, 6, 8
One approach to NGS calibrators and controls relies on the use of cell line genomic DNA. A mixture of variant types and VAF can be manufactured by combining genomes at defined molar ratios.9 This approach is limited by the number of genomes that can be mixed while maintaining an adequate VAF and by the number of different mutations that can be introduced into a single cell line.
Another approach is the use of synthetic nucleic acid molecules, such as long oligonucleotides as used in the SNaPshot assay10 and in vitro transcribed RNA molecules from the External RNA Control Consortium (ERCC) used in gene expression and RNAseq assays.11 In taking the first step toward building highly multiplex control materials, we report the development and characterization of a control plasmid–based multianalyte calibrator and control material for NGS assays, termed the control plasmid spiked-in genome (CPSG). We found that these materials are scalable in their ability to incorporate many different variants with different allele frequencies in a complex mixture, are easy to design and manufacture, are distinguishable from a clinical specimen, and are detectable by various genomic assays. Our results indicate that CPSGs can serve as routine assay controls to monitor performance of NGS assays and standards for cross-site and cross-platform comparison studies and as valuable tools for the evaluation, development, and testing of new informatics pipelines. Such an approach was previously accepted by the US Food and Drug Administration as an effective method of validating the detection of rare germline variants with an NGS platform in a submission of 510 (k) premarket notification (Food and Drug Administration, http://www.accessdata.fda.gov/cdrh_docs/pdf13/K132750.pdf, last accessed November 20, 2015) by Illumina (Illumina MiSeqDx Cystic Fibrosis Clinical Sequencing Assay; Illumina Inc., San Diego, CA). Importantly, we also found that the efficiency of variant detection in CPSG samples is similar to that of formalin-fixed, paraffin-embedded (FFPE) genomic DNA samples.
Materials and Methods
Design and Construction of Control Plasmids
To evaluate the performance of various NGS assays on different types of mutations, a panel of 69 control plasmids was designed and constructed and a subset of them used for this study. This panel of 69 control plasmids contains 38 single-nucleotide variants (SNVs), nine SNVs at a homopolymeric region (HP; >3 identical bases in a row), 12 insertion/deletions (indels), five indels at HP, and five large indels (gap size >4 bp). Mutations of interest (MOIs) in these control plasmids were selected because of their known clinical actionable value and high recurrent frequency in the Catalogue of Somatic Mutations in Cancer database or because they represent rare mutation types. For each MOI, an approximate 1000-bp region flanking (approximately 500 bp upstream and approximately 500 bp downstream) the MOI was synthetically generated (DNA 2.0, Menlo Park, CA). In addition, a 6-bp insert sequence (ACATCG), which functions as a molecular barcode, was placed 5 to 20 bp away from the MOI. Each of the approximate 1000-bp fragments was flanked with attB sites (attB1 - ACAACTTTGTACAAAAAAGTTGGC at 5′ end and attB2 - TCAACTTTCTTGTACAAAGTTG at 3′ end) and then cloned into an entry vector (pDONR253, Thermo Fisher Scientific, Waltham, MA) by the Gateway cloning system. The full-length insert sequences, including the MOI and molecular barcode in each entry clone, were verified by Sanger sequencing. The entry clones were then used to generate the final construct by recombining the insert fragments via LR reaction into pDEST-318, a small pUC19-based ampicillin-resistant holding vector. An example construct, pNF1_34041, is shown in Figure 1A. The control plasmid DNAs were purified using the GenElute XL kit (Sigma-Aldrich, St. Louis, MO) and quantitated by spectrophotometry using a NanoDrop 2000 (Thermo Fisher Scientific). The pertinent mutation information of 69 control plasmids, CPSG pool composition, and NGS assay used in this study are listed in Table 1. 
Because these 69 plasmids were constructed gradually over time, we made only two pools (CPSG13 and CPSG51) from the plasmids that were available when the characterization experiments were launched.
Figure 1.
Design of control plasmids. A: Map of a representative control plasmid, pNF1_34041. Each control plasmid was constructed by inserting approximately 1000 bp of genomic DNA (blue box) spanning the mutation of interest (MOI) (red star). A 6-bp (ACATCG) molecular barcode (orange rectangle in B) was inserted near the MOI to track variant reads. Single-cut restriction sites are indicated by yellow triangles. B: Coordination of the molecular barcode with the MOI. A subset of sequencing reads from MOIs [an A deletion (red box)] and 6-bp molecular barcode confirms the mutation is plasmid borne.
Table 1.
List of 69 Control Plasmids
| Plasmid name | Mutation position (hg19) | Transcript | CDS mutation | AA mutation | Mutation type | Restriction enzyme used | CPSG13 (yes/no) | CPSG51 (yes/no) | NCI-MPACT (yes/no) | TSCA (yes/no) | WES (yes/no) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| pAKT1_33765 | 14:105246551 | ENST00000349310 | c.49G>A | p.E17K | SNV at HP | ScaI | No | Yes | Yes | Yes | Yes |
| pAKT1_36918 | 14:105246455 | ENST00000349310 | c.145G>A | p.E49K | SNV | PvuI | No | Yes | Yes | Yes | Yes |
| pAKT2_93894 | 19:40761084 | ENST00000392038 | c.268G>T | p.V90L | SNV | BglI | No | Yes | Yes | Yes | Yes |
| pAKT3_48227 | 1:243809253 | ENST00000366539 | c.371A>T | p.Q124L | SNV at HP | BglI | No | Yes | Yes | Yes | Yes |
| pAPC_13127 | 5:112175639 | ENST00000457016 | c.4348C>T | p.R1450∗ | SNV | BglI | No | Yes | Yes | Yes | Yes |
| pAPC_18561 | 5:112175957 | ENST00000457016 | c.4666_4667insA | p.T1556fs∗3 | Indel at HP | BglI | No | Yes | Yes | No | Yes |
| pAPC_18584 | 5:112175539 | ENST00000457016 | c.4248delC | p.I1417fs∗2 | Indel | BglI | No | No | Yes | Yes | Yes |
| pARHGAP5_88502 | 14:32561739 | ENST00000345122 | c.1864G>A | p.E622K | SNV | BglI | No | Yes | Yes | Yes | Yes |
| pATM_21924 | 11:108117847 | ENST00000278616 | c.1058_1059delGT | p.C353fs∗5 | Indel | PvuI | No | No | Yes | Yes | Yes |
| pATM_41596† | 11:108175462 | ENST00000278616 | c.5557G>A | p.D1853N | SNV at HP | BglI | No | Yes | Yes | Yes | Yes |
| pATR_20627 | 3:142254973 | NM_001184 | c.3790_3796delATAAAAG | p.I1264fs∗24 | Large Indel | ScaI | Yes | Yes | Yes | No | Yes |
| pBRAF_476 | 7:140453135 | ENST00000288602 | c.1799T>A | p.V600E | SNV | PvuI | Yes | Yes | Yes | Yes | Yes |
| pCTNNB1_5664 | 3:41266124 | ENST00000349496 | c.121A>G | p.T41A | SNV | ScaI | No | Yes | Yes | Yes | Yes |
| pDNMT3A_53042 | 2:25457243 | ENST00000321117 | c.2644C>T | p.R882C | SNV | BglI | No | No | No | No | Yes |
| pEGFR_12378 | 7:55249012 | ENST00000275493 | c.2310_2311insGGT | p.D770_N771insG | Indel | ScaI | Yes | No | Yes | Yes | Yes |
| pEGFR_6224 | 7:55259515 | ENST00000275493 | c.2573T>G | p.L858R | SNV at HP | BglI | No | Yes | Yes | No | Yes |
| pEGFR_6225 | 7:55242466 | ENST00000275493 | c.2236_2250del15 | p.E746_A750del | Large Indel | BglI | Yes | Yes | Yes | Yes | Yes |
| pEGFR_6240 | 7:55249071 | ENST00000275493 | c.2369C>T | p.T790M | SNV | BglI | No | Yes | Yes | Yes | Yes |
| pERBB2_682 | 17:37880993 | ENST00000269571 | c.2322_2323ins12 | p.M774_A775insAYVM | Large Indel | ScaI | No | Yes | Yes | No | Yes |
| pERCC1_140843 | 19:45924470 | ENST00000013807 | c.287C>A | p.A96E | SNV at HP | ScaI | No | Yes | Yes | No | Yes |
| pEZH2_37028 | 7:148508727 | ENST00000320356 | c.1937A>T | p.Y646F | SNV | ScaI | No | No | No | No | Yes |
| pFBXW7_22965 | 4:153249384 | ENST00000281708 | c.1394G>A | p.R465H | SNV | ScaI | No | Yes | Yes | Yes | Yes |
| pFGFR3_715 | 4:1803568 | ENST00000440486 | c.746C>G | p.S249C | SNV at HP | ScaI | No | Yes | Yes | Yes | Yes |
| pFLT3_783 | 13:28592642 | ENST00000241453 | c.2503G>T | p.D835Y | SNV | BglI | No | Yes | Yes | Yes | Yes |
| pGABRA6_70853 | 5:161117296 | ENST00000274545 | c.763G>C | p.V255L | SNV | ScaI | No | Yes | Yes | No | Yes |
| pGABRG2_74722 | 5:161580301 | ENST00000356592 | c.1355A>G | p.Y452C | SNV | ScaI | No | Yes | Yes | Yes | Yes |
| pGNAQ_28758 | 9:80409488 | ENST00000286548 | c.626A>C | p.Q209L | SNV at HP | ScaI | No | Yes | Yes | No | Yes |
| pGNAS_27887 | 20:57484420 | ENST00000371085 | c.601C>T | p.R201C | SNV | ScaI | No | Yes | Yes | No | Yes |
| pIDH1_28747 | 2:209113113 | ENST00000345146 | c.394C>T | p.R132C | SNV | BglI | No | Yes | Yes | Yes | Yes |
| pIDH2_33733 | 15:90631838 | ENST00000330062 | c.515G>A | p.R172K | SNV | ScaI | No | No | No | No | Yes |
| pIDH2_41590 | 15:90631934 | ENST00000330062 | c.419G>A | p.R140Q | SNV | ScaI | No | No | No | No | Yes |
| pJAK2_12600 | 9:5073770 | ENST00000381652 | c.1849G>T | p.V617F | SNV at HP | BglI | No | Yes | Yes | No | Yes |
| pKIT_1314 | 4:55599321 | ENST00000288135 | c.2447A>T | p.D816V | SNV | BglI | No | Yes | Yes | No | Yes |
| pKRAS_521 | 12:25398283 | ENST00000311936 | c.35G>A | p.G12D | SNV | PvuI | Yes | Yes | Yes | Yes | Yes |
| pMET_700 | 7:116423428 | ENST00000318493 | c.3757T>G | p.Y1253D | SNV | PvuI | No | Yes | Yes | Yes | Yes |
| pMLH1_26085 | 3:37067240 | ENST00000231790 | c.1151T>A | p.V384D | SNV | BglI | No | Yes | Yes | Yes | Yes |
| pMPL_18918 | 1:43815009 | ENST00000372470 | c.1544G>T | p.W515L | SNV | BglI | No | Yes | Yes | Yes | Yes |
| pMSH2_111644 | 2:47705450 | ENST00000233146 | c.2250delG | p.G751fs∗12 | Indel at HP | BglI | Yes | Yes | Yes | Yes | Yes |
| pMSH2_26122 | 2:47705559 | ENST00000233146 | c.2359_2360delCT | p.L787fs∗11 | Indel | BglI | No | No | Yes | Yes | Yes |
| pMTOR_94356 | 1:11291097 | ENST00000361445 | c.2664A>T | p.L888F | SNV | ScaI | No | Yes | Yes | No | Yes |
| pMYD88_85940 | 3:38182641 | ENST00000396334 | c.794T>C | p.L265P | SNV | ScaI | No | No | No | No | Yes |
| pNBN_35664 | 8:90947833 | NM_006904 | c.2242C>T | p.P748S | SNV | ScaI | No | Yes | Yes | No | Yes |
| pNF1_24443 | 17:29576111 | ENST00000358273 | c.4084C>T | p.R1362∗ | SNV at HP | BglI | Yes | Yes | Yes | Yes | Yes |
| pNF1_24468 | 17:29679318 | ENST00000358273 | c.7501delG | p.E2501fs∗22 | Indel | BglI | Yes | No | Yes | Yes | Yes |
| pNF1_34041 | 17:29554610 | ENST00000358274 | c.2395delA | p.M799fs∗22 | Indel at HP | BglI | Yes | Yes | Yes | Yes | Yes |
| pNF1_41820 | 17:29556989 | ENST00000358273 | c.2987_2988insAC | p.R997fs∗16 | Indel | BglI | No | No | Yes | Yes | Yes |
| pNPM1_17559 | 5:170837547 | ENST00000517671 | c.863_864insTCTG | p.W288fs∗12 | Large Indel | BglI | Yes | Yes | Yes | Yes | Yes |
| pNRAS_584 | 1:115256529 | ENST00000369535 | c.182A>G | p.Q61R | SNV | ScaI | Yes | Yes | Yes | No | Yes |
| pPARP1_21691 | 1:226551692 | ENST00000366794 | c.2738delG | p.G913fs∗4 | Indel at HP | ScaI | No | Yes | Yes | Yes | Yes |
| pPARP2_75849 | 14:20820412 | NM_005484.2 | c.398A>C | p.D133A | SNV | ScaI | No | Yes | Yes | No | Yes |
| pPDGFRA_28053 | 4:55141048 | ENST00000257290 | c.1694_1695insA | p.S566fs∗6 | Indel | BglI | Yes | No | Yes | Yes | Yes |
| pPDGFRA_736‡ | 4:55152093 | ENST00000257290 | c.2525A>T | p.D842V | SNV | ScaI | No | Yes | Yes | Yes | Yes |
| pPIK3CA_12464 | 3:178952149 | NM_006218.1 | c.3204_3205insA | p.N1068fs∗4 | Indel | BglI | No | No | Yes | Yes | Yes |
| pPIK3CA_763 | 3:178936091 | NM_006218.1 | c.1633G>A | p.E545K | SNV | PvuI | Yes | Yes | Yes | Yes | Yes |
| pPIK3CA_775 | 3:178952085 | NM_006218.1 | c.3140A>G | p.H1047R | SNV | PvuI | No | Yes | Yes | Yes | Yes |
| pPTEN_4986 | 10:89717716 | ENST00000371953 | c.741_742insA | p.P248fs∗5 | Indel | BglI | No | No | Yes | Yes | Yes |
| pPTEN_5152 | 10:89692904 | ENST00000371953 | c.388C>T | p.R130∗ | SNV | ScaI | No | Yes | Yes | No | Yes |
| pPTEN_5809 | 10:89717775 | ENST00000371953 | c.800delA | p.K267fs∗9 | Indel at HP | BglI | No | Yes | Yes | Yes | Yes |
| pPTPN11_13000 | 12:112888210 | ENST00000351677 | c.226G>A | p.E76K | SNV | ScaI | No | Yes | Yes | Yes | Yes |
| pRAD51_117943 | 15:41001312 | ENST00000267868 | c.433C>T | p.Q145∗ | SNV | ScaI | No | Yes | Yes | Yes | Yes |
| pRB1_891 | 13:48941648 | ENST00000267163 | c.958C>T | p.R320∗ | SNV | ScaI | No | Yes | Yes | Yes | Yes |
| pRET_965 | 10:43617416 | ENST00000355710 | c.2753T>C | p.M918T | SNV | BglI | No | Yes | Yes | No | Yes |
| pSMAD4_14105 | 18:48603093 | ENST00000342988 | c.1394_1395insT | p.A466fs∗28 | Indel | BglI | No | No | Yes | Yes | Yes |
| pTP53_10648 | 17:7578406 | ENST00000269305 | c.524G>A | p.R175H | SNV | BglI | No | Yes | Yes | Yes | Yes |
| pTP53_10660 | 17:7577120 | ENST00000269305 | c.818G>A | p.R273H | SNV | BglI | No | Yes | Yes | Yes | Yes |
| pTP53_10662 | 17:7577538 | ENST00000269305 | c.743G>A | p.R248Q | SNV | BglI | No | Yes | Yes | No | Yes |
| pTP53_18610 | 17:7579419 | ENST00000269305 | c.263delC | p.S90fs∗33 | Indel | PvuI | No | No | Yes | Yes | Yes |
| pTP53_6530 | 17:7577558 | ENST00000269305 | c.723delC | p.C242fs∗5 | Indel | BglI | No | No | Yes | Yes | Yes |
| pVHL_18578 | 3:10188282 | ENST00000256474 | c.426_429delTGAC | p.G144fs∗14 | Large Indel | PvuI | No | No | Yes | Yes | Yes |
AA, amino acid change notation; CDS, coding DNA sequence change notation; CPSG, control plasmid spiked-in genome; HP, homopolymeric region; indel, insertion/deletion; NCI, National Cancer Institute; TSCA, TruSeq Custom Amplicon; SNV, single-nucleotide variant; WES, whole exome sequencing.
∗Premature stop codon created by mutation.
†Excluded from the limit of detection analysis using three data analysis pipelines and the next-generation sequencing data set from the NCI-MPACT assay because this variant occurs as a single-nucleotide polymorphism in the NA12878 HapMap background into which the plasmid was spiked.
‡Excluded from analysis because of overdilution during CPSG51 preparation.
Preparation of CPSG Samples
The workflow of CPSG DNA sample preparation is illustrated in Figure 2. Briefly, each control plasmid was linearized with a restriction enzyme that cuts once within the vector backbone, as indicated in Table 1, and purified with the Qiagen PCR Cleanup kit (Qiagen, Valencia, CA). An aliquot was run on a Bioanalyzer DNA 7500 Chip (Agilent Technologies Inc., Santa Clara, CA) to verify complete digestion and the correct size of the plasmid. The purified linear plasmids were quantitated by spectrophotometry using a NanoDrop 2000, and the number of copies per microliter was calculated for each plasmid from its size. Genomic DNA extracted from FFPE cell pellets of HapMap CEPH NA12878 (Coriell Institute for Medical Research, Camden, NJ) was also quantitated by spectrophotometry on a NanoDrop 2000, and the number of copies per microliter was calculated. The quantitated plasmids were pooled at an equimolar ratio, and the pooled plasmids were spiked into CEPH DNA at the indicated copy number ratios (plasmid versus genome of HapMap cells). The composition of the plasmid pools selected for each of the studies below is given in Table 1.
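The copy number calculation described above can be sketched as follows. This is a minimal illustration rather than the calculation actually used in the study: the mean mass of 650 g/mol per base pair, the function names, and the treatment of the copy number ratio as plasmid copies per genome copy are all assumptions made for the example, and the 3 billion bp genome size is the approximate figure quoted in this article.

```python
AVOGADRO = 6.022e23         # molecules per mole
BP_MASS_G_PER_MOL = 650.0   # ASSUMPTION: average mass of one double-stranded base pair
HUMAN_GENOME_BP = 3.0e9     # approximate genome size, as quoted in the text


def copies_per_ul(conc_ng_per_ul, length_bp):
    """Convert a spectrophotometric concentration (ng/uL) into molecules per uL."""
    grams_per_ul = conc_ng_per_ul * 1e-9
    moles_per_ul = grams_per_ul / (length_bp * BP_MASS_G_PER_MOL)
    return moles_per_ul * AVOGADRO


def plasmid_volume_ul(genome_copies, copy_ratio, plasmid_copies_per_ul):
    """Volume of plasmid stock giving `copy_ratio` plasmid copies per genome copy.

    ASSUMPTION: the ratio is taken as plasmid copies / genome copies; the
    article does not spell out the exact definition.
    """
    return genome_copies * copy_ratio / plasmid_copies_per_ul


if __name__ == "__main__":
    plasmid = copies_per_ul(1.0, 3000)             # hypothetical 3000-bp plasmid at 1 ng/uL
    genome = copies_per_ul(10.0, HUMAN_GENOME_BP)  # hypothetical genomic DNA at 10 ng/uL
    print(f"plasmid copies/uL: {plasmid:.3e}")
    print(f"genome copies/uL:  {genome:.1f}")
    print(f"plasmid volume for a 25% ratio: {plasmid_volume_ul(genome, 0.25, plasmid):.3e} uL")
```

The very small volume this returns for an undiluted stock illustrates why plasmid stocks need extensive serial dilution before spiking, a step at which quantitation and pipetting errors can accumulate.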
Figure 2.
Control plasmid spiked-in genome (CPSG) titration workflow. Schematic showing the procedure for generating a CPSG sample. Briefly, plasmids are linearized, quantified, pooled, and spiked into a background genome at a determined copy number ratio.
Description of NGS Assays
The CPSG DNA samples were characterized by three different NGS assays. Library preparation methods and the sequencers used are indicated below. The National Cancer Institute's MPACT (NCI-MPACT) assay12 is a targeted amplicon sequencing assay using the AmpliSeq technology on the Personal Genome Machine (PGM) sequencer (Thermo Fisher Scientific). Briefly, 20 ng of CPSG pool DNA was used to generate the library by multiplex PCR using the NCI-MPACT custom amplicon panel and the Ion AmpliSeq Library Kit version 2.0 with barcode incorporation (Thermo Fisher Scientific). The libraries were quantified using the Ion Library Quantification Kit (Thermo Fisher Scientific), and 10 μL of a 10 pM library dilution was used for clonal amplification onto ion sphere particles using the Ion Template OT2 200 Kit (Thermo Fisher Scientific) on the Ion One Touch 2 instrument before sequencing. Templated ion sphere particles were subjected to 500 flows of 200-bp bidirectional sequencing on an Ion Torrent PGM system using Ion 316 chips. All procedures were performed according to the manufacturer's instructions.
The TruSeq Custom Amplicon (TSCA) assay is based on Illumina's TSCA technology and sequenced on the MiSeq sequencer (Illumina). The same list of mutations used to design the NCI-MPACT assay was submitted into the TSCA panel design website (Illumina) to build the NCI-MPACT TSCA panel (Supplemental Table S1). Libraries were prepared from 250 ng of CPSG according to the manufacturer's guidelines and quantified using the KAPA Library Quantitation Kit (KAPA Biosystems, Wilmington, MA). After quantification, libraries were normalized to 4 nmol/L (MiSeq version 3 chemistry), pooled in equal volumes, denatured with 0.2 N NaOH, and diluted to 12 pM (MiSeq version 3 chemistry). The libraries were then sequenced on the MiSeq using 2 × 300 paired end mode.
The whole exome sequencing (WES) assay uses the Agilent SureSelect XT Human All Exon version 5.0 baits (Agilent Technologies Inc.) on a HiSeq 2000 sequencer (Illumina). The library preparation and sequencing procedures followed the vendor's user manuals. For WES, 500 ng of CPSG was sheared to 150 to 200 bp using a Covaris E220 sonicator (Covaris, Woburn, MA). After cleanup with AMPure XP Beads (Beckman Coulter, Brea, CA), samples were checked for correct size distribution using a Bioanalyzer 2100 system (Agilent Technologies Inc.). These fragmented DNA samples were then processed to add sequencing adaptors, hybridized with the biotinylated RNA bait set (SureSelect XT Human All Exon version 5, Agilent Technologies Inc.), and enriched for the captured fragments for sequencing. The AMPure XP purified libraries were examined for size distribution (300 to 400 bp) using an Agilent Bioanalyzer and quantified using the KAPA Library Quantification Kit (KAPA Biosystems). A pooled library made by mixing the two final libraries at an equimolar ratio was clustered at 16 pM per flow cell lane using the Illumina cBot before sequencing on an Illumina HiSeq 2000 platform (Illumina). Sequencing reactions were run using 2 × 100 paired-end mode.
NGS Data Analysis and Bioinformatics
NGS data from the NCI-MPACT assay were analyzed by the Torrent Suite Software (TSS) version 4.4.2 (Thermo Fisher Scientific, Waltham, MA), which includes alignment and variant calling. The data analysis parameters recommended by the manufacturer were maintained with the exception of increasing the necessary minimum number of variant reads for each type to 25 reads (snp_min_coverage, indel_min_coverage, etc.) and relaxing the strand-specific error threshold to 36% (sse_prob_threshold = 0.36), as per our clinical protocol. For variants called in flow space, VAF was calculated by the pipeline as the number of flow space alt allele observations (FAOs) in the variant call format (VCF) file divided by the sum of the FAO reads and flow space reference allele observations (FROs) in the VCF [approximately equivalent to flow evaluator read depth at the locus (FDP)]. For variants called by the long indel assembler module, VAF was calculated by the number of alt allele observations (AOs) in the VCF divided by the sum of AOs and reference allele observations (ROs) in the VCF (approximately equivalent to the read depth). To assess the limit of detection, the cutoffs for the lowest allele frequency called by the pipeline were manually reduced to 1%. In addition, comparisons were run with the 3.2.1, 4.0.2, and 4.4.2 versions of the pipeline to reveal performance improvements as the bioinformatics algorithms have developed over time.
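The two VAF definitions above reduce to the same ratio of alternate-allele observations to total observations. The sketch below illustrates that arithmetic on counts that have already been read from a VCF record; the function names are hypothetical, no VCF parsing is attempted, and the 25-read threshold is the minimum variant read count stated in the text.

```python
def allele_fraction(alt_obs, ref_obs):
    """VAF as alt / (alt + ref); returns None when there are no observations.

    Pass FAO and FRO for flow-space calls, or AO and RO for calls made by
    the long indel assembler.
    """
    depth = alt_obs + ref_obs
    if depth == 0:
        return None
    return alt_obs / depth


def meets_min_variant_reads(alt_obs, min_reads=25):
    """True if the variant is supported by at least `min_reads` reads."""
    return alt_obs >= min_reads


if __name__ == "__main__":
    fao, fro = 250, 750  # hypothetical counts for one variant
    print(allele_fraction(fao, fro), meets_min_variant_reads(fao))
```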
NGS data from the TSCA assay were analyzed using the built-in MiSeq Reporter version 2.5 pipeline (Illumina) using the default parameters and cutoffs provided by the manufacturer. VAF was calculated by dividing the allelic depth in the VCF by approximate read depth in the VCF.
For WES data analysis, demultiplexed FASTQ files were generated from .bcl files with Casava version 1.8.2 configureBclToFastq.pl (Illumina). The multiple FASTQ files generated by this script were concatenated and primer trimmed using the ea-utils fastq-mcf tool with the options -l 30 -q 10 -u -P 33 to remove Illumina PCR and sequencing primers from the sequences. The trimmed sequences were mapped to human reference genome hg19 using the Burrows-Wheeler Aligner version 0.6.2 in aln and sampe mode with default settings.13 The resulting SAM files were converted to BAM format, sorted, deduplicated, realigned, and base quality score recalibrated using samtools, Picard, and GATK tools14 following the best practices guidelines version 3 as described on the GATK website (https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq, last accessed November 5, 2015). The variants were called using HaplotypeCaller within GATK version 3.3 with ploidy = 20 to increase sensitivity to the low allele frequency variants expected in the samples. VAF was calculated by dividing the allelic depth in the VCF by the approximate read depth in the VCF. Because we were interested only in the variants present in the spiked-in plasmids, we used a custom BED file (Supplemental Table S2) that contained 100 bp of flanking sequence around the variants to be identified. This file limited variant calling to these regions (GATK tools option -L) and decreased the time needed to call variants in the exome samples both before and after base recalibration.
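As an illustration of how such a region file could be built, the sketch below converts 1-based hg19 positions from Table 1 into BED intervals with 100 bp of flanking sequence on each side. It is not the authors' Supplemental Table S2: the function name is hypothetical, the interval is centered on a single position (multi-base indels would need their full span), and the chromosome naming must match the reference used for alignment (a "chr" prefix may be required).

```python
FLANK_BP = 100  # flanking sequence on each side, as described in the text


def bed_line(name, chrom, pos_1based, flank=FLANK_BP):
    """Return one BED record (0-based, half-open) around a 1-based position."""
    start = max(0, pos_1based - 1 - flank)  # BED start is 0-based, inclusive
    end = pos_1based + flank                # BED end is exclusive
    return f"{chrom}\t{start}\t{end}\t{name}"


if __name__ == "__main__":
    # Positions taken from Table 1 (hg19).
    variants = [
        ("pBRAF_476", "7", 140453135),
        ("pKRAS_521", "12", 25398283),
    ]
    for name, chrom, pos in variants:
        print(bed_line(name, chrom, pos))
```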
NGS data were visualized using the Integrative Genomics Viewer, version 2.3.56,15 and CLC Genomics Workbench version 8.0.2 (Qiagen). Statistical analysis was performed using the R statistical software suite,16 and graphs were generated using the R package ggplot2.17 This work used the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov, last accessed November 5, 2015).
Performance Comparison Study
To evaluate whether the efficiency of detecting variants in CPSG samples is comparable to that in FFPE genomic DNA, we conducted a parallel spike-in study with two pairs of sample series. Each pair consisted of a control plasmid carrying a mutation and genomic DNA, derived from FFPE-processed cell pellets, carrying the identical mutation (Figure 3A). The first pair, pBRAF_476 and the FFPE-prepared melanoma cell line MALME-3M, carries the same BRAF c.1799T>A SNV mutation (ENST00000288602, p.V600E, COSM476). Both pBRAF_476 plasmid DNA and MALME-3M FFPE genomic DNA were spiked into FFPE CEPH genomic DNA at the same copy number ratios (50%, 25%, 12.5%, 6.25%, and 3.125%). The DNA samples of both series were sequenced in triplicate by the NCI-MPACT assay on the PGM platform, and regression lines for the observed versus expected VAF were generated. As a marker of detection efficiency over a dilution series, the slopes of these regression lines were determined, and an analysis of covariance was performed to determine whether the slopes of the regression lines of the two series were different. The second pair of sample series carries the same APC c.4248delC indel mutation (ENST00000457016, p.I1417fs*2, COSM18584) in the pAPC_18584 control plasmid and the FFPE colon cell line HCT-15. The starting titration point for this series was increased to 75%, followed by a similar twofold dilution series (ie, 50%, 25%, 12.5%, 6.25%), to accommodate the 10% VAF cutoff in the default pipeline parameters.
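The slope comparison described above can be sketched as an F test that compares a model with separate slopes for each series against a model with one common slope and separate intercepts. The code below is a plain-Python illustration under that assumption; it is not the authors' R analysis, the function names are hypothetical, and the example values are invented. It returns the F statistic and its residual degrees of freedom, and a P value would be obtained from an F distribution with (1, df) degrees of freedom in a statistics package.

```python
def _centered_sums(xs, ys):
    """Return (Sxx, Sxy, Syy) for one series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxx, sxy, syy


def slope_difference_f(series_a, series_b):
    """Test equality of two regression slopes (observed vs expected VAF).

    Each series is a pair (expected, observed). Returns (slopes, F, df_resid).
    """
    groups = [_centered_sums(*series_a), _centered_sums(*series_b)]
    n_total = len(series_a[0]) + len(series_b[0])
    slopes = [sxy / sxx for sxx, sxy, _ in groups]
    # Full model: each series has its own slope and intercept.
    rss_full = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in groups)
    # Reduced model: common slope, separate intercepts.
    sxx_sum = sum(g[0] for g in groups)
    sxy_sum = sum(g[1] for g in groups)
    syy_sum = sum(g[2] for g in groups)
    rss_reduced = syy_sum - sxy_sum ** 2 / sxx_sum
    df_resid = n_total - 4  # two slopes and two intercepts estimated
    f_stat = (rss_reduced - rss_full) / (rss_full / df_resid)
    return slopes, f_stat, df_resid


if __name__ == "__main__":
    expected = [50, 25, 12.5, 6.25, 3.125]
    plasmid_obs = [51, 24, 13, 6, 3.5]  # invented values for illustration
    cell_obs = [49, 26, 12, 7, 3]       # invented values for illustration
    print(slope_difference_f((expected, plasmid_obs), (expected, cell_obs)))
```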
Figure 3.
Parallel spike-in study strategy. A: Schematic diagram of parallel spike-in study. A pair of serially diluted sample sets was made by spiking control plasmid pBRAF_476 or genomic DNA from cell line MALME-3M, which carries the same BRAF V600E mutation, into the HapMap genomic DNA at the same titration points, followed by sequencing and regression analysis of the expected allele frequency versus the observed allele frequency. B: Scatterplot and regression analysis for pBRAF_476/MALME-3M and pAPC_18584/HCT-15. Linear regression models were fit for each. FFPE, formalin-fixed, paraffin-embedded; PGM, Personal Genome Machine; VAF, variant allele frequency.
Reproducibility Assessment
To assess the reproducibility of CPSG performance, a 13-plasmid pool named CPSG13, consisting of four SNVs, one SNV at HP, three indels, two indels at HP, and three large indels (Table 1), was made and spiked into CEPH (NA12878) HapMap genomic DNA at an estimated ratio of 25%. A large preparation of this material was aliquoted into 25-use tubes and stored frozen at −80°C until use. The CPSG13 sample was used as a positive control for the NCI-MPACT assay and characterized 90 times during a period of 18 months by three operators (30 times by each). The VAFs of the 13 mutations detected by the NCI-MPACT assay were plotted against the date of assay performance to assess variation within and between operators across the 13 plasmids. A two-way analysis of variance test was used to evaluate whether there was a statistically significant difference among VAFs measured by the three operators after adjusting for differences in plasmids.
Assessment of the Effect of Data Analysis Pipelines on NGS Results
To assess the effect of different NGS data analysis pipelines on variant calling, a pool of 51 control plasmids, named CPSG51, was spiked into HapMap CEPH (NA12878) genomic DNA at five titration points: 50%, 25%, 12.5%, 6.25%, and 3.125% (Table 1). CPSG51 samples were sequenced by the NCI-MPACT assay, and the same raw NGS data (ie, pre–base-called, unaligned sequence) were analyzed by TSS versions 3.2.1, 4.0.2, and 4.4.2. The default limits of detection for SNVs and indels in all three pipelines were manually reduced to 1% to call variants at lower allele frequencies and thus determine the low end of detection for each version of the software. pATM_41596 (c.5557G>A, p.D1853N) was excluded from the calculation because it was found to be a naturally occurring heterozygous single-nucleotide polymorphism (rs1801516) in CEPH NA12878, which inflates the observed VAF compared with the expected VAF of the plasmid alone. Plasmid pPDGFRA_736 (c.2525A>T, p.D842V) was also excluded from the calculation because of a dilution error made during the CPSG51 preparation. The data from each of the pipelines were used to calculate the variant detection rate, defined as the number of detected variants divided by the total number of variants spiked into each titration point, expressed as a percentage, for the 49 mutations in four mutation types. The 95% CI was estimated using the Clopper and Pearson method.16, 18
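For readers unfamiliar with the Clopper and Pearson (exact binomial) interval, the sketch below computes it by numerically inverting the binomial tail probabilities. It is a plain-Python illustration with hypothetical function names, not the R code used in the study, and the counts in the usage example are invented.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(func, target, iterations=100):
    """Bisection on [0, 1] for a function that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(detected, total, alpha=0.05):
    """Exact two-sided CI for a detection rate of `detected` out of `total`."""
    if detected == 0:
        lower = 0.0
    else:
        # Lower bound: P(X >= detected | p) = alpha / 2
        lower = _solve_increasing(
            lambda p: 1 - binom_cdf(detected - 1, total, p), alpha / 2)
    if detected == total:
        upper = 1.0
    else:
        # Upper bound: P(X <= detected | p) = alpha / 2
        upper = _solve_increasing(
            lambda p: 1 - binom_cdf(detected, total, p), 1 - alpha / 2)
    return lower, upper


if __name__ == "__main__":
    detected, total = 45, 49  # invented counts for one titration point
    low, high = clopper_pearson(detected, total)
    print(f"detection rate {detected / total:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

Because the interval is exact rather than approximate, it remains valid when every variant is detected, in which case the upper bound is 100% and only the lower bound is informative.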
Performance Comparison of Three NGS Assays on Different Platforms
CPSG51 series samples were used to assess the limit of detection of three NGS assays run on different platforms: the AmpliSeq/PGM-based NCI-MPACT assay, the TruSeq Custom Amplicon/MiSeq-based TSCA assay, and the SureSelect Human All Exon v5 baits/HiSeq 2000 WES assay. The following 17 plasmids in CPSG51 were excluded from the three-platform comparison: pAPC_18561, pATR_20627, pEGFR_6224, pERBB2_682, pERCC1_140843, pGABRA6_70853, pGNAQ_28758, pGNAS_27887, pJAK2_12600, pKIT_1314, pMTOR_94356, pNBN_35664, pNRAS_584, pPARP2_75849, pPTEN_5152, pRET_965, and pTP53_10662. These plasmids could not be detected by the locked TSCA design and data analysis methods, either because the MOI was outside the region covered by the custom TSCA assay panel or because the molecular barcode resided within the library primer binding region and interfered with amplification (see Using CPSG to Assess Performance of NGS Assays Designed for Different Platforms). For the reasons given earlier, pATM_41596 and pPDGFRA_736 were also excluded. Therefore, only 32 plasmids were used in all three assays for interassay comparison. The data from each assay were used to calculate the variant detection rates as defined above for 32 mutations in four mutation types. The 95% CI was estimated using the Clopper and Pearson method.16, 18
Results
Construction of Control Plasmids and Function of Molecular Barcode
All 69 control plasmids were successfully constructed and manufactured with a mean yield >100 μg. Given a mean size of <3000 bp for each control plasmid and a human genome size of approximately 3 billion bp, this quantity is sufficient to prepare a very large amount of CPSG material. To determine the ability of the inserted barcode to discriminate spiked plasmids from human genomic sequence, a pilot pool containing pNF1_34041 spiked into CEPH NA12878 genomic DNA was sequenced by the NCI-MPACT assay. The resulting BAM file was visualized in CLC Genomics Workbench, and as shown in Figure 1B, the 6-bp molecular barcode consistently co-localized with the mutation of interest in the sequencing reads, indicating that the variant-containing sequence was in fact derived from the plasmid.
Comparison of the Performance of CPSG with Cell Line Genomic DNA
Although the primary sequence is identical, there are several differences between mutation-containing plasmid DNA and genomic DNA derived from FFPE cells, such as integrity of the DNA, including fragmentation and chemical modifications induced by the FFPE processing, and the presence of synthetically inserted sequence (ie, the inserted molecular barcode). Before further characterization of CPSG samples, a parallel spike-in study (Figure 3A) was conducted to determine whether the efficiency of detecting variants in control plasmids is similar to that in FFPE genomic DNA. Two regression lines were plotted from the observed versus expected VAFs for the BRAF c.1799T>A control plasmid and genomic DNA samples (Figure 3B). Similar values for the slope (plasmid = 1.0516, cell line = 0.9864) were observed between the two species. An analysis of covariance performed to test for a difference between the two slopes found no evidence of significantly different slopes between the plasmid and cell line species (F = 1.03, P = 0.321). A similar analysis performed for a deletion mutation, APC c.4248delC, showed nearly identical slopes (plasmid = 1.0967, cell line = 1.0643). Again, no significant difference was found between the two slopes after statistical analysis (F = 1.02, P = 0.321). These results indicate that performance of CPSG is very similar to FFPE genomic DNA in detecting different types of mutations over a wide range of allele frequencies.
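One standard way to formulate the slope comparison described above is a linear model with an interaction term. The notation below is an assumption introduced for illustration (the article does not give its model specification): $S$ is an indicator equal to 1 for the plasmid series and 0 for the cell line series, and $N$ is the total number of observations.

```latex
\mathrm{VAF}_{\mathrm{obs}}
  = \beta_0 + \beta_1\,\mathrm{VAF}_{\mathrm{exp}} + \beta_2\,S
  + \beta_3\,(\mathrm{VAF}_{\mathrm{exp}} \times S) + \varepsilon

% Slope for the cell line series: \beta_1
% Slope for the plasmid series:   \beta_1 + \beta_3
% Test of equal slopes:           H_0 : \beta_3 = 0

F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/1}{\mathrm{RSS}_1/(N - 4)}
```

Here $\mathrm{RSS}_0$ is the residual sum of squares of the model without the interaction term and $\mathrm{RSS}_1$ that of the full model. A small $F$ with a large $P$ value, as reported for both variants, gives no evidence against equal slopes.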
Assessment of Reproducibility of CPSG Samples
Because these CPSG samples were intended to be used as internal controls to monitor assay performance over time, the ability to reproducibly detect the mutations in these plasmids by different operators at different times was assessed. The 13 mutations in CPSG13 were all detected 90 times by three operators during an 18-month period (Figure 4A). Aggregate data for each of the 13 plasmids for each operator are given in Table 2, and the VAF SD for each plasmid is indicated. A two-way analysis of variance test revealed that minimal differences in the detected VAFs were observed among the three operators (F = 1.54, P = 0.040). The detected VAF for each of the 13 mutations was a mean of 25.55% with a mean SD of approximately ±5.4% VAF in 90 replicates. Because of difficulties in quantifying these materials, there was a large range of VAFs observed for each plasmid, which does not seem to correlate with the variant type being detected (Table 3).
Figure 4.
CPSG13 reproducibility during 18 months with three operators. A: CPSG13 was repeatedly sequenced by the National Cancer Institute's MPACT (NCI-MPACT) assay 90 times by three operators (OP1, OP2, and OP3) during an 18-month period. Boxplots showing the variant allele frequency (VAF) distribution binned by plasmid for each operator were plotted. The dashed line represents the 25% VAF point at which the plasmids were intended to be titrated. B: The mean VAF for each of the 13 plasmids and three operators was plotted as a green solid line during 18 months to evaluate the variability and stability of the material. The blue dashed line represents the expected 25% VAF. The red line represents a regression model through the series. The gray shadow represents the range of VAF for the 13 plasmids at each test date on the x axis.
Table 2.
CPSG13 90 Replicate Operator Variant Allele Frequency Variances
| Plasmid | Operator 1 mean, % | Operator 1 SD | Operator 2 mean, % | Operator 2 SD | Operator 3 mean, % | Operator 3 SD |
|---|---|---|---|---|---|---|
| pATR_20627 | 25.59 | 2.85 | 26.03 | 2.70 | 24.81 | 2.46 |
| pBRAF_476 | 23.64 | 1.58 | 23.73 | 1.67 | 23.56 | 1.49 |
| pEGFR_12378 | 24.66 | 2.02 | 26.02 | 2.06 | 25.35 | 1.93 |
| pEGFR_6225 | 28.51 | 2.23 | 28.35 | 1.51 | 28.12 | 1.74 |
| pKRAS_521 | 22.57 | 1.86 | 23.08 | 2.04 | 24.07 | 1.79 |
| pMSH2_111644 | 29.18 | 1.84 | 29.20 | 2.11 | 29.80 | 1.86 |
| pNF1_24443 | 17.18 | 1.36 | 17.93 | 1.49 | 17.46 | 1.44 |
| pNF1_24468 | 23.38 | 2.16 | 22.98 | 2.02 | 22.46 | 1.79 |
| pNF1_34041 | 35.63 | 2.58 | 36.30 | 2.48 | 36.02 | 1.91 |
| pNPM1_17559 | 28.17 | 2.87 | 27.94 | 1.90 | 27.90 | 1.85 |
| pNRAS_584 | 19.41 | 1.92 | 19.62 | 1.81 | 19.51 | 1.15 |
| pPDGFRA_28053 | 20.19 | 2.08 | 20.31 | 1.86 | 21.00 | 1.71 |
| pPIK3CA_763 | 31.69 | 2.02 | 32.18 | 2.30 | 32.92 | 1.95 |
Table 3.
CPSG13 90 Replicate Reproducibility Study VAF Data
| Type | Plasmid | Minimum VAF | Maximum VAF | Mean | SD | CV, %∗ | 5th Percentile | 95th Percentile | Minimum Δ† | Maximum Δ† |
|---|---|---|---|---|---|---|---|---|---|---|
| Indel | pEGFR_12378 | 19.98 | 30.60 | 25.34 | 2.06 | 8.13 | 22.32 | 29.09 | 2.34 | 1.51 |
| Indel | pNF1_24468 | 18.11 | 27.70 | 22.94 | 2.01 | 8.76 | 19.90 | 26.55 | 1.79 | 1.15 |
| Indel | pPDGFRA_28053 | 17.31 | 24.96 | 20.50 | 1.91 | 9.30 | 17.86 | 24.29 | 0.55 | 0.67 |
| Indel at HP | pMSH2_111644 | 25.03 | 34.24 | 29.39 | 1.94 | 6.61 | 27.05 | 32.96 | 2.02 | 1.28 |
| Indel at HP | pNF1_34041 | 30.78 | 41.02 | 35.98 | 2.33 | 6.47 | 32.71 | 40.41 | 1.93 | 0.61 |
| Large indel | pATR_20627 | 20.53 | 35.70 | 25.48 | 2.69 | 10.56 | 21.91 | 30.27 | 1.38 | 5.43 |
| Large indel | pEGFR_6225 | 24.29 | 32.34 | 28.33 | 1.84 | 6.48 | 25.55 | 32.17 | 1.26 | 0.17 |
| Large indel | pNPM1_17559 | 23.92 | 34.09 | 28.00 | 2.24 | 7.98 | 25.06 | 32.22 | 1.14 | 1.87 |
| SNV | pBRAF_476 | 20.70 | 27.85 | 23.64 | 1.57 | 6.62 | 21.54 | 26.24 | 0.84 | 1.61 |
| SNV | pKRAS_521 | 18.61 | 27.70 | 23.24 | 1.98 | 8.52 | 20.29 | 26.33 | 1.68 | 1.38 |
| SNV | pNRAS_584 | 16.01 | 23.76 | 19.52 | 1.64 | 8.41 | 16.91 | 22.12 | 0.90 | 1.64 |
| SNV | pPIK3CA_763 | 27.98 | 38.67 | 32.26 | 2.13 | 6.61 | 29.55 | 35.93 | 1.57 | 2.74 |
| SNV at HP | pNF1_24443 | 14.88 | 21.52 | 17.52 | 1.45 | 8.26 | 15.39 | 20.30 | 0.51 | 1.22 |
CPSG, control plasmid spiked-in genome; HP, homopolymeric region; indel, insertion/deletion; SNV, single-nucleotide variant; VAF, variant allele frequency.
∗The CV was calculated as (SD/mean) × 100%.
†Minimum Δ is the difference between the fifth percentile and the minimum VAF. Maximum Δ is the difference between the 95th percentile and the maximum VAF. Minimum Δ and maximum Δ are used to estimate the outlier data point.
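The derived columns of Table 3 follow directly from the footnote definitions. The short sketch below recomputes them for one row (pEGFR_12378) from the tabulated summary values. The function names are illustrative and not taken from the authors' analysis scripts.

```python
def cv_percent(mean: float, sd: float) -> float:
    """Coefficient of variation as defined in the footnote: (SD / mean) x 100%."""
    return sd / mean * 100.0


def tail_deltas(minimum: float, maximum: float, p5: float, p95: float) -> tuple[float, float]:
    """Return (minimum delta, maximum delta).

    Minimum delta is the 5th percentile minus the minimum VAF;
    maximum delta is the maximum VAF minus the 95th percentile.
    """
    return p5 - minimum, maximum - p95


if __name__ == "__main__":
    # Summary values for pEGFR_12378 as listed in Table 3 (VAF, %).
    row = {"min": 19.98, "max": 30.60, "mean": 25.34, "sd": 2.06, "p5": 22.32, "p95": 29.09}
    min_delta, max_delta = tail_deltas(row["min"], row["max"], row["p5"], row["p95"])
    print(f"CV = {cv_percent(row['mean'], row['sd']):.2f}%")
    print(f"minimum delta = {min_delta:.2f}, maximum delta = {max_delta:.2f}")
```

A large minimum or maximum Δ relative to the SD (for example, the maximum Δ of 5.43 for pATR_20627) points to an isolated extreme replicate, not a generally wide distribution.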
Analysis of the CPSG13 DNA samples during 18 months revealed only minor changes in VAF (mean = 25.55, SD = 1.40) of each of the 13 plasmids (Figure 4B), which is well within the expected normal variance of an NGS assay.9 These results indicate that CPSG material is highly stable and variant detection is highly reproducible among different operators for a long period. In addition, subjecting the material to at least 25 freeze/thaw cycles during the period the material was tested has no observable effect on detection of the variants and allele frequencies within the sample. Taken together, these data indicate that the CPSG samples are reliable, stable positive controls for monitoring the performance of NGS assays.
Using CPSG to Assess Performance of Different Data Analysis Pipelines
It is known that different data analysis pipelines can markedly affect the results of variant calling, yet it is difficult to evaluate which pipelines are more accurate because of the lack of well-characterized calibrator materials. To demonstrate the value of using the CPSG as standards in assessing the performance of NGS data analysis pipelines, the same prealigned NGS data generated from sequencing CPSG51 samples by the NCI-MPACT assay were analyzed by three sequential versions of the TSS data analysis pipeline: 3.2.1, 4.0.2, and 4.4.2. The detection rates for each of the titration points for 49 plasmids (excluding pATM_41596 and pPDGFRA_736 as described) in CPSG51 were used to evaluate the performance of the three pipelines. The overall detection rates of all 49 plasmids over the five titration points are given in Table 4. A tile plot (Figure 5) was generated for each of the three pipelines to indicate the performance of each of the plasmids over the development of the alignment and variant calling pipelines. The detection rates of each variant type by the three pipeline versions are summarized in Supplemental Table S3. In general, the performance of the three pipelines was similar for the SNV, SNV at HP, and large indel variant types. However, improvements made in the TSS version 4.0.2 and TSS version 4.4.2 pipelines have made them more effective at calling three of the five indel at HP variants at lower titration points compared with the earlier TSS version 3.2.1 pipeline. Two variants that appear to be challenging to all versions of the pipeline are an SNV in pAKT1_33765 located 1 bp away from the end of the amplicon (Supplemental Figure S1) and a 1-bp insertion in pAPC_18561 within a long homopolymeric region composed of a repeat of six consecutive adenosine residues (Supplemental Figure S2).
These results indicate that CPSG samples can serve as powerful calibrators to assess the performance of data analysis pipelines in detecting and calling different types of variants and can highlight and identify the weaknesses in these pipelines, facilitating the ability to further improve on these algorithms.
Table 4.
Comparison of Detection Rates across Three Data Analysis Pipelines
| Titration point, % | Total No. of mutations | TSS version 3.2.1: No. of detected mutations | TSS version 3.2.1: Detection rate, % (95% CI) | TSS version 4.0.2: No. of detected mutations | TSS version 4.0.2: Detection rate, % (95% CI) | TSS version 4.4.2: No. of detected mutations | TSS version 4.4.2: Detection rate, % (95% CI) |
|---|---|---|---|---|---|---|---|
| 50 | 49 | 48 | 97.96 (89.15–99.95) | 47 | 95.92 (86.02–99.50) | 47 | 95.92 (86.02–99.50) |
| 25 | 49 | 46 | 93.88 (83.13–98.72) | 47 | 95.92 (86.02–99.50) | 46 | 93.88 (83.13–98.72) |
| 12.5 | 49 | 43 | 87.76 (75.23–95.37) | 46 | 93.88 (83.13–98.72) | 47 | 95.92 (86.02–99.50) |
| 6.25 | 49 | 42 | 85.71 (72.76–94.06) | 46 | 93.88 (83.13–98.72) | 47 | 95.92 (86.02–99.50) |
| 3.125 | 49 | 42 | 85.71 (72.76–94.06) | 45 | 91.84 (80.40–97.73) | 46 | 93.88 (83.13–98.72) |
TSS, Torrent Suite Software.
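The detection rates in Table 4 are the detected counts divided by the 49 plasmids evaluated. As a small worked example, the sketch below recomputes the rates from the counts in the table. The dictionary layout and names are illustrative assumptions, not the authors' code.

```python
TOTAL_MUTATIONS = 49
TITRATION_POINTS = [50, 25, 12.5, 6.25, 3.125]  # titration points, %

# Number of detected mutations per titration point, copied from Table 4.
DETECTED = {
    "TSS 3.2.1": [48, 46, 43, 42, 42],
    "TSS 4.0.2": [47, 47, 46, 46, 45],
    "TSS 4.4.2": [47, 46, 47, 47, 46],
}


def detection_rate(detected: int, total: int) -> float:
    """Detection rate in percent."""
    return 100.0 * detected / total


# Rates rounded to two decimals, in the same order as TITRATION_POINTS.
RATES = {
    version: [round(detection_rate(d, TOTAL_MUTATIONS), 2) for d in counts]
    for version, counts in DETECTED.items()
}

if __name__ == "__main__":
    for version, rates in RATES.items():
        cells = ", ".join(f"{t}%: {r:.2f}" for t, r in zip(TITRATION_POINTS, rates))
        print(f"{version} -> {cells}")
```

Laid out this way, the difference between versions is easy to see: the 3.2.1 pipeline loses six detections between the 50% and 3.125% titration points, whereas the later versions lose at most two.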
Figure 5.
Comparison of the three CPSG51 data analysis pipelines. CPSG51 was sequenced by the National Cancer Institute's MPACT assay, and the same raw data were analyzed by Torrent Suite Software (TSS) versions 3.2.1, 4.0.2, and 4.4.2. Tile plots indicating whether a variant was detected (red box) or not (gray box) were generated for each of the three pipeline versions. Each row represents a mutation in a control plasmid, and each column represents a titration point. The plasmid name and the variant type (in parentheses) are indicated on the left. HP, homopolymeric region; indel, insertion/deletion; SNV, single-nucleotide variant.
Using CPSG to Assess Performance of NGS Assays Designed for Different Platforms
In addition to the AmpliSeq/PGM-based NCI-MPACT assay, the same CPSG51 series was used to evaluate the performance of two other NGS assays that are based on different chemistry and sequencing platforms: the MiSeq-based TSCA assay and HiSeq-based WES assay. A tile plot indicating detection of the plasmids with each of the three platforms was generated (Figure 6).
Figure 6.
Comparison of the three CPSG51 assay platforms. A 32-plasmid subset detectable across three different next-generation sequencing chemistries and platforms was compared. A tile plot was generated for each of the three platforms, and an indication was made as to whether the variant was detected (red box) or not (gray box). Each row represents a mutation in a control plasmid, and each column represents a titration point. The plasmid name and the variant type (in parentheses) are indicated on the left. HP, homopolymeric region; indel, insertion/deletion; TSCA, TruSeq Custom Amplicon; SNV, single-nucleotide variant; WES, whole exome sequencing.
With the TSCA assay, variants in 32 plasmids were detected at the expected VAF (Figure 6), but variants in 15 plasmids (pMTOR_94356, pRET_965, pPTEN_5152, pPARP2_75849, pERBB2_682, pERCC1_140843, pGNAS_27887, pATR_20627, pKIT_1314, pAPC_18561, pGABRA6_70853, pEGFR_6224, pNBN_35664, pJAK2_12600, pGNAQ_28758) were either not detected or detected with VAFs <20% in the 50% titration point sample. Inspection of the locations of the library primers, the molecular barcode, and the MOI showed that the molecular barcode was located within the 5′ library primer in 13 plasmids and within the 3′ library primer in one (pAPC_18561). In addition, the MOI is located within 3 bp of the end of the amplicon in most of these cases. Taken together, variants in these plasmids were either not detected or detected at a highly diminished VAF compared with the expected finding because of a bias toward the cell line genomic DNA background into which the plasmids were spiked, which does not contain the molecular barcode. In two other cases, the MOI itself occurs within the 5′ library primer: a 15-bp tandem repeat insertion in pERBB2_682 and a G>T SNV in pERCC1_140843. Therefore, neither of these two variants could be detected using this panel because the MOI would be masked by the library primer. To get an informative comparison among the three different platforms, we removed these 15 plasmids from the analysis, leaving 32 that were detected consistently across all three platform-based assays.
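The design conflicts described above (barcode inside a library primer, MOI inside a primer, MOI within a few base pairs of the amplicon end) can be screened for before a panel is locked. The sketch below is a hypothetical helper, not part of any vendor design tool or of the authors' workflow. All coordinates in the example are invented for illustration, and intervals are 0-based with an exclusive end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int  # 0-based, inclusive
    end: int    # exclusive

    def overlaps(self, other: "Interval") -> bool:
        return self.start < other.end and other.start < self.end

    def contains(self, pos: int) -> bool:
        return self.start <= pos < self.end


def flag_design_issues(
    amplicon: Interval,
    primers: list[Interval],
    barcode: Interval,
    moi_pos: int,
    edge_bp: int = 3,
) -> list[str]:
    """Return labels for potential conflicts between a control plasmid and a panel."""
    issues = []
    if any(p.overlaps(barcode) for p in primers):
        issues.append("barcode_in_primer")
    if any(p.contains(moi_pos) for p in primers):
        issues.append("moi_in_primer")
    # Distance from the MOI to the nearest amplicon end, in bases.
    dist_to_end = min(moi_pos - amplicon.start, amplicon.end - 1 - moi_pos)
    if 0 <= dist_to_end <= edge_bp:
        issues.append("moi_near_amplicon_end")
    return issues


if __name__ == "__main__":
    # Hypothetical amplicon with 25-bp primers at both ends.
    amplicon = Interval(100, 250)
    primers = [Interval(100, 125), Interval(225, 250)]
    # A 6-bp barcode that runs into the 5' primer, with the MOI mid-amplicon.
    print(flag_design_issues(amplicon, primers, Interval(120, 126), moi_pos=150))
```

Running such a check for every plasmid against every panel would flag cases like those reported here before any sequencing is performed.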
The WES assay detected all 32 variants in the control plasmids at higher titration points. One plasmid, pNPM1_17559, was missed at the 12.5% titration point, a variant in pTP53_10648 was missed at the 6.25% titration point, and six plasmids, pFGFR3_715, pIDH1_28747, pMLH1_26085, pPTEN_5809, pNPM1_17559, and pPIK3CA_763, were missed at the 3.125% titration point (Figure 6). Further analyses suggest that the lack of detection of those low VAF variants was due to lower read depth within the targeted region. On average, the whole exome sequencing assay produced a read depth of 176.5× in targeted regions. Given this mean read depth, the number of sequencing reads harboring the plasmid-derived variants at the lower allele frequencies would decrease to single digits, and the default cutoffs of the data analysis pipelines prohibit detection of these variants at such low variant allele frequencies.
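The read-depth argument above can be checked with simple arithmetic. The sketch below multiplies the reported mean depth (176.5) by each nominal titration point, treating the titration percentage as the expected VAF. That treatment, the binomial sampling model, and the `min_alt_reads` threshold are illustrative assumptions and not the actual pipeline cutoffs.

```python
from math import comb

MEAN_DEPTH = 176.5  # mean WES read depth in targeted regions, from the text


def expected_variant_reads(depth: float, vaf: float) -> float:
    """Expected number of reads carrying the variant at a given depth and VAF."""
    return depth * vaf


def prob_at_least(min_alt_reads: int, depth: int, vaf: float) -> float:
    """P(at least min_alt_reads variant reads), assuming binomial sampling of reads."""
    below = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i)
        for i in range(min_alt_reads)
    )
    return 1.0 - below


if __name__ == "__main__":
    hypothetical_min_alt_reads = 8  # illustrative threshold only
    for pct in (50, 25, 12.5, 6.25, 3.125):
        vaf = pct / 100.0
        reads = expected_variant_reads(MEAN_DEPTH, vaf)
        p = prob_at_least(hypothetical_min_alt_reads, round(MEAN_DEPTH), vaf)
        print(f"{pct}% titration: ~{reads:.1f} variant reads expected, "
              f"P(>= {hypothetical_min_alt_reads} reads) = {p:.3f}")
```

At the 3.125% titration point the expected count is about 5.5 variant reads, which is consistent with the statement that supporting reads fall to single digits at this depth.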
Overall, the 32 mutations had a very high concordance in detection rates among the three assay platforms over the five titration points. These data indicate that the plasmids are detected at a similar sensitivity over each of the 32 plasmids (Table 5) or by each variant type (Supplemental Table S4) across the three platforms.
Table 5.
Comparison of Detection Rates of 32 Plasmids across Three Platforms
| Titration point, % | Total No. of mutations | NCI-MPACT: No. of detected mutations | NCI-MPACT: Detection rate, % (95% CI) | TSCA: No. of detected mutations | TSCA: Detection rate, % (95% CI) | WES: No. of detected mutations | WES: Detection rate, % (95% CI) |
|---|---|---|---|---|---|---|---|
| 50 | 32 | 31 | 96.88 (82.00–99.84) | 32 | 100.00 (86.66–100.00) | 32 | 100.00 (86.66–100.00) |
| 25 | 32 | 31 | 96.88 (82.00–99.84) | 32 | 100.00 (86.66–100.00) | 32 | 100.00 (86.66–100.00) |
| 12.5 | 32 | 31 | 96.88 (82.00–99.84) | 32 | 100.00 (86.66–100.00) | 31 | 96.88 (82.00–99.84) |
| 6.25 | 32 | 31 | 96.88 (82.00–99.84) | 32 | 100.00 (86.66–100.00) | 31 | 96.88 (82.00–99.84) |
| 3.125 | 32 | 31 | 96.88 (82.00–99.84) | 32 | 100.00 (86.66–100.00) | 26 | 81.25 (62.96–92.14) |
NCI, National Cancer Institute; TSCA, TruSeq Custom Amplicon; WES, whole exome sequencing.
Discussion
To date, there are no widely accepted multianalyte standards or controls for clinical NGS assays. Well-characterized and available multianalyte controls would be valuable for a variety of applications for clinical NGS assays used for oncology patient diagnosis and treatment selection. They can be used as routine run controls for assessment of analytical performance of a given run or lot of reagents used and for the assessment of assay and reagent analytical performance over time, operators, laboratories, and instruments. These materials may even serve as good controls for validation of clinical NGS assays in the absence of samples with hard-to-find mutations, as was reported in a recent submission of a 510(k) premarket notification to the Food and Drug Administration by Illumina for their MiSeqDx Cystic Fibrosis 139-Variant Assay panel. We designed and generated a plasmid-based calibrator and quality control material by constructing 69 control plasmids that contain frequently occurring mutations in tumors, representing many different types of variants (SNVs, small indels, large indels, and variants located within or near homopolymeric sequence) (Table 1). We found that the performance of synthetic plasmids is nearly identical to endogenous variants contained within genomic DNA derived from FFPE cell lines (Figure 3B). Our data indicate that CPSG samples are highly stable and generate reproducible results when assayed by different operators for a long period (Figure 4). Importantly, we illustrated the utility of CPSG samples in evaluating different DNA analysis pipelines (Figure 5) and compared the performance of NGS assays designed for different NGS platforms (Figure 6). To our knowledge, our work represents the first example of establishing highly multiplexed analyte control materials suitable for NGS clinical assays.
Although only two plasmid pools (CPSG13 and CPSG51) were used for proof of principle and demonstrating the utilities of 69 control plasmids, we believe that these results were representative and larger panels should perform similarly.
By comparison to homologous recombination–mediated site-specific mutagenesis of cell line genomes, CPSG materials represent an approach far less expensive and time consuming, as well as more straightforward, with no limitation in variant number and types, for manufacturing high-quality controls for NGS assays. Because the size of the plasmids is much smaller in relation to a human genome, the mass of each plasmid spiked into a reference genome is negligible, allowing one to spike a very large number of unique targets into a single genome, and at varying quantities. Plasmid-based panels could include clinically relevant, frequently occurring, and rare mutations in tumors with actionable value. In contrast to the inherent limitation of the highest available allele fraction in the cell line genomic DNA blending approach, the allele fraction for all mutations in a pool of CPSG are flexible over a very wide range. In addition, genetic instability is often problematic in transformed cancer-derived cell lines,19, 20 whereas plasmids offer clonal selection and stability. The hapmap genomic DNA (NA12878) selected for plasmid spike is derived from normal cells and is extremely well characterized by multiple sequencing platforms and available as a certified reference material from the National Institute of Standards and Technology.7 In addition, our data clearly indicate that CPSG is applicable to a variety of frequently used library construction methods, sequencing chemistries, and sequencers. Taken together, we believe that the advantages in cost, turnaround time, scalability, flexibility, stability, and applicability make the CPSG an ideal control material for NGS assays.
The 6-bp molecular barcode sequence was originally designed to function as a distinguishable marker and to offer a possibility to spike the control plasmid directly into the clinical specimen. Although these barcodes served as a distinguishable marker of the control plasmid–bearing mutations (Figure 1B), this sequence also created difficulty for NGS informatics pipelines, especially when the mutation was located at the end of an amplicon (Supplemental Figure S1). We also experienced either a failure to detect or a severely reduced VAF for several mutations when using our TSCA panel, which we originally thought might be due to an inability of the alignment and variant calling pipeline to map this sequence. However, after looking in depth at the resultant reads in the BAM file and identifying the location of the target-specific library primers, we learned that the difficulty was actually due to a less than ideal panel design in which the molecular barcode was located within the primer binding region. This reduced the robustness of amplification for these variant-containing sequence reads (Supplemental Figures S3 and S4). Because the TSCA assay panel was designed after the control plasmids were designed and synthesized, the interference in assay chemistry by the molecular barcode and its effect on target selection were not expected or discovered until the results were obtained. Because of the routine soft clipping and/or sequence trimming in the MiSeq Reporter pipeline, as with most alignment and variant calling pipelines, it is unknown whether these variants would have been detected even in the absence of the molecular barcode sequence. Perhaps induced by the presence of the molecular barcode, aggressive sequence trimming in almost all of these cases led to strand bias (Supplemental Figures S3 and S4).
It is likely, though, that these problems can be rectified by moving the primers further away and downstream of the MOI or by moving the molecular barcode toward the center of amplicon, on the other side of the MOI.
These two elements highlight how useful such material is in ensuring not only that the chemistry is ideal to identify mutations of interest in lieu of being able to obtain samples with rare variants but also that the pipeline being used is performing as well as possible.
We also observed that in many cases the observed VAF deviated by a large amount (up to 10.5%) from an expected 25% titration point in CPSG13 samples (Table 3) and speculated that this could result from the inaccurate quantification method we used. Better upfront quantification methods, such as digital PCR, may allow for more precise quantification of each plasmid species before pooling, which may result in more similar observed VAFs among the population of plasmids to be detected.
To add further value to these materials, we have designed plasmids that contain specific gene fusion transcripts that can be in vitro transcribed for RNAseq experiments and have found the possibility of using the relative copy number of plasmids as calibration standards for CNV assays and circulating tumor DNA variant detection. Circulating tumor DNA is an area of keen research interest because data support the quantitative assessment of circulating tumor DNA as a marker of therapy response and disease progression.21, 22 CPSG could be a good calibration control for circulating tumor DNA assessment because VAFs in this material can often be ultralow.
For the NGS-based diagnostics field to progress, it is imperative that assay control and calibration standards are made available. On the basis of nearly identical analytical performance of the CPSG and genomic DNA, we have established a reagent that can act as a control material for already developed assays, a test material for the development of new assays and informatics pipelines and algorithms, and routine testing of new reagent lots and informatics pipelines. We recognize that the CPSG material would not be useful as a full process control from tissue biopsy to a final clinical report because preanalytical procedures, such as tissue fixation and embedding, tumor enrichment, and nucleic acid extraction, are outside the scope of the CPSG material. However, use of control materials, such as CPSG, would eliminate the need to find and consume limited tumor specimens and materials from laboratory archives for assay and pipeline development and improvements and save these clinical materials for more informative uses, such as analytical validation and proficiency testing. Furthermore, this would allow for a common set of materials that could be used across sites for standardization testing of clinical assay validation results.
Acknowledgments
We thank the core facilities at the Frederick National Laboratory for Cancer Research: the Protein Expression Laboratory for support in constructing and preparing the control plasmids, the Laboratory of Molecular Technology for Sanger sequencing support, and the Advanced Biomedical Computation Center for computation support.
D.J.S., P.M.W., and C.-J.L. conceived and designed the study; D.J.S., T.D.F., M.G.M., P.M.W., and C.-J.L. developed the methodology; D.J.S., R.D.H., T.D.F., B.D., K.N.H., P.M.M., C.E.C., and C.H.B. acquired data; D.J.S., R.D.H., T.D.F., B.D., P.M.W., and C.-J.L. acquired data; D.J.S., R.D.H., T.D.F., B.D., B.A.C., J.H.D., P.M.W., and C.-J.L. wrote, reviewed, and revised the manuscript.
Footnotes
Supported by National Cancer Institute, NIH, grants HHSN261200800001E and NO1-CO-2008-00001.
This work does not express or represent the opinion of the National Cancer Institute, National Institutes of Health, or Department of Health and Human Services.
Disclosures: None declared.
Current address of M.G.M., Department of Pediatrics, University of Washington, Seattle, WA.
Supplemental material for this article can be found at http://dx.doi.org/10.1016/j.jmoldx.2015.11.008.
Supplemental Data
Inspection of the pAKT1_33765 variant not detectable by the National Cancer Institute's MPACT (NCI-MPACT) assay. The G to A substitution variant is located at the last base of the amplicon (pale blue in the NCI-MPACT_Regions_2.0 track) in the NCI-MPACT assay and is therefore difficult to detect because of amplicon trimming and quality as well as interference from the adjacent molecular barcode in the plasmid. MOI, mutation of interest.
Inspection of pAPC_18561 not detectable by the three next-generation sequencing assays. This variant is not called by many variant callers because of low-confidence signal despite being clearly present in the reads. This is due to the long A homopolymer tract. MOI, mutation of interest.
Control plasmid pGABRA6_70853 with molecular barcode in the TruSeq Custom Amplicon (TSCA) primer. The TSCA read data, the National Cancer Institute's MPACT (NCI-MPACT) read data, 5′ primer location, molecular barcode, and mutation of interest (MOI) location are indicated. The molecular barcode was found to be within the TSCA assay primer region, which created strand bias and therefore a negative call.
Control plasmid pJAK2_12600 with molecular barcode in the TruSeq Custom Amplicon (TSCA) primer. The TSCA read data (top panel), the National Cancer Institute's MPACT (NCI-MPACT) read data (lower panel), 5′ primer location, molecular barcode, and mutation of interest (MOI) location are indicated. The molecular barcode was found to be within the TSCA assay primer region, which created strand bias and therefore a negative call.
References
- 1. Schrijver I., Aziz N., Farkas D.H., Furtado M., Gonzalez A.F., Greiner T.C., Grody W.W., Hambuch T., Kalman L., Kant J.A., Klein R.D., Leonard D.G., Lubin I.M., Mao R., Nagan N., Pratt V.M., Sobel M.E., Voelkerding K.V., Gibson J.S. Opportunities and challenges associated with clinical diagnostic genome sequencing: a report of the Association for Molecular Pathology. J Mol Diagn. 2012;14:525–540. doi: 10.1016/j.jmoldx.2012.04.006.
- 2. Xuan J., Yu Y., Qing T., Guo L., Shi L. Next-generation sequencing in the clinic: promises and challenges. Cancer Lett. 2013;340:284–295. doi: 10.1016/j.canlet.2012.11.025.
- 3. Tran B., Dancey J.E., Kamel-Reid S., McPherson J.D., Bedard P.L., Brown A.M., Zhang T., Shaw P., Onetto N., Stein L., Hudson T.J., Neel B.G., Siu L.L. Cancer genomics: technology, discovery, and translation. J Clin Oncol. 2012;30:647–660. doi: 10.1200/JCO.2011.39.2316.
- 4. Andre F., Mardis E., Salm M., Soria J.C., Siu L.L., Swanton C. Prioritizing targets for precision cancer medicine. Ann Oncol. 2014;25:2295–2303. doi: 10.1093/annonc/mdu478.
- 5. Rehm H.L., Bale S.J., Bayrak-Toydemir P., Berg J.S., Brown K.K., Deignan J.L., Friez M.J., Funke B.H., Hegde M.R., Lyon E. ACMG clinical laboratory standards for next-generation sequencing. Genet Med. 2013;15:733–747. doi: 10.1038/gim.2013.92.
- 6. Gargis A.S., Kalman L., Berry M.W., Bick D.P., Dimmock D.P., Hambuch T. Assuring the quality of next-generation sequencing in clinical laboratory practice. Nat Biotechnol. 2012;30:1033–1036. doi: 10.1038/nbt.2403.
- 7. Zook J.M., Chapman B., Wang J., Mittelman D., Hofmann O., Hide W., Salit M. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014;32:246–251. doi: 10.1038/nbt.2835.
- 8. Pant S., Weiner R., Marton M.J. Navigating the rapids: the development of regulated next-generation sequencing-based clinical trial assays and companion diagnostics. Front Oncol. 2014;4:78. doi: 10.3389/fonc.2014.00078.
- 9. Singh R.R., Patel K.P., Routbort M.J., Reddy N.G., Barkoh B.A., Handal B., Kanagal-Shamanna R., Greaves W.O., Medeiros L.J., Aldape K.D., Luthra R. Clinical validation of a next-generation sequencing screen for mutational hotspots in 46 cancer-related genes. J Mol Diagn. 2013;15:607–622. doi: 10.1016/j.jmoldx.2013.05.003.
- 10. Dias-Santagata D., Akhavanfard S., David S.S., Vernovsky K., Kuhlmann G., Boisvert S.L., Stubbs H., McDermott U., Settleman J., Kwak E.L., Clark J.W., Isakoff S.J., Sequist L.V., Engelman J.A., Lynch T.J., Haber D.A., Louis D.N., Ellisen L.W., Borger D.R., Iafrate A.J. Rapid targeted mutational analysis of human tumours: a clinical platform to guide personalized cancer medicine. EMBO Mol Med. 2010;2:146–158. doi: 10.1002/emmm.201000070.
- 11. Baker S.C., Bauer S.R., Beyer R.P., Brenton J.D., Bromley B., Burrill J. The External RNA Controls Consortium: a progress report. Nat Methods. 2005;2:731–734. doi: 10.1038/nmeth1005-731.
- 12. Lih C.-J., Sims D.J., Harrington R.D., Polley E.C., Zhao Y., Mehaffey M.G., Forbes T.D., Das B., Datta V., Harper K.N., Bouk C.H., Rubinstein L.V., Simon R.M., Conley B.A., Chen A.P., Kummar S., Doroshow J.H., Williams P.M. Analytical validation and application of a targeted next generation sequencing mutation detection assay for use in treatment assignment in the NCI-MPACT trial (NCT01827384). J Mol Diagn. 2016;18:51–67. doi: 10.1016/j.jmoldx.2015.07.006.
- 13. Li H., Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–595. doi: 10.1093/bioinformatics/btp698.
- 14. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 15. Thorvaldsdottir H., Robinson J.T., Mesirov J.P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14:178–192. doi: 10.1093/bib/bbs017.
- 16. R Development Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2014.
- 17. Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag; 2009.
- 18. Clopper C.J., Pearson E.S. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika. 1934;26:404–413.
- 19. Roschke A.V., Tonon G., Gehlhaus K.S., McTyre N., Bussey K.J., Lababidi S., Scudiero D.A., Weinstein J.N., Kirsch I.R. Karyotypic complexity of the NCI-60 drug-screening panel. Cancer Res. 2003;63:8634–8647.
- 20. Stults D.M., Killen M.W., Shelton B.J., Pierce A.J. Recombination phenotypes of the NCI-60 collection of human cancer cells. BMC Mol Biol. 2011;12:23. doi: 10.1186/1471-2199-12-23.
- 21. Sausen M., Phallen J., Adleff V., Jones S., Leary R.J., Barrett M.T., Anagnostou V., Parpart-Li S., Murphy D., Kay Li Q., Hruban C.A., Scharpf R., White J.R., O'Dwyer P.J., Allen P.J., Eshleman J.R., Thompson C.B., Klimstra D.S., Linehan D.C., Maitra A., Hruban R.H., Diaz L.A., Jr., Von Hoff D.D., Johansen J.S., Drebin J.A., Velculescu V.E. Clinical implications of genomic alterations in the tumour and circulation of pancreatic cancer patients. Nat Commun. 2015;6:7686. doi: 10.1038/ncomms8686.
- 22. Piotrowska Z., Niederst M.J., Karlovich C.A., Wakelee H.A., Neal J.W., Mino-Kenudson M., Fulton L., Hata A.N., Lockerman E.L., Kalsy A., Digumarthy S., Muzikansky A., Raponi M., Garcia A.R., Mulvey H.E., Parks M.K., DiCecca R.H., Dias-Santagata D., Iafrate A.J., Shaw A.T., Allen A.R., Engelman J.A., Sequist L.V. Heterogeneity underlies the emergence of EGFRT790 wild-type clones following treatment of T790M-positive cancers with a third-generation EGFR inhibitor. Cancer Discov. 2015;5:713–722. doi: 10.1158/2159-8290.CD-15-0399.
Supplementary Materials
Inspection of the pAKT_33765 variant, which was not detectable by the National Cancer Institute's MPACT (NCI-MPACT) assay. The G-to-A substitution is located at the last base of the amplicon (pale blue in the NCI-MPACT_Regions_2.0 track) in the NCI-MPACT assay and is therefore difficult to detect because of amplicon trimming and quality, as well as interference from the adjacent molecular barcode in the plasmid. MOI, mutation of interest.
Inspection of pAPC_18561, which was not detectable by the three next-generation sequencing assays. Although clearly present in the reads, this variant is not called by many variant callers because the long A homopolymer tract produces a low-confidence signal. MOI, mutation of interest.
Control plasmid pGABRA6_70853 with molecular barcode in the TruSeq Custom Amplicon (TSCA) primer. The TSCA read data, the National Cancer Institute's MPACT (NCI-MPACT) read data, 5′ primer location, molecular barcode, and mutation of interest (MOI) location are indicated. The molecular barcode was found to be within the TSCA assay primer region, which created strand bias and therefore a negative call.
Control plasmid pJAK2_12600 with molecular barcode in the TruSeq Custom Amplicon (TSCA) primer. The TSCA read data (top panel), the National Cancer Institute's MPACT (NCI-MPACT) read data (lower panel), 5′ primer location, molecular barcode, and mutation of interest (MOI) location are indicated. The molecular barcode was found to be within the TSCA assay primer region, which created strand bias and therefore a negative call.






