Abstract
Medical sequencing for diseases with locus and allelic heterogeneities has been limited by the high cost and low throughput of traditional sequencing technologies. “Second-generation” sequencing (SGS) technologies allow the parallel processing of a large number of genes and, therefore, offer great promise for medical sequencing; however, their use in clinical laboratories is still in its infancy. Our laboratory offers clinical resequencing for dilated cardiomyopathy (DCM) using an array-based platform that interrogates 19 of more than 30 genes known to cause DCM. We explored both the feasibility and cost effectiveness of using PCR amplification followed by SGS technology for sequencing these 19 genes in a set of five samples enriched for known sequence alterations (109 unique substitutions and 27 insertions and deletions). While the analytical sensitivity for substitutions was comparable to that of the DCM array (98%), SGS technology performed better than the DCM array for insertions and deletions (90.6% versus 58%). Overall, SGS performed substantially better than did the current array-based testing platform; however, the operational cost and projected turnaround time do not meet our current standards. Therefore, efficient capture methods and/or sample pooling strategies that shorten the turnaround time and decrease reagent and labor costs are needed before implementing this platform into routine clinical applications.
Genetic testing for disorders with locus and allelic heterogeneity has been a challenge due to the high cost of sequencing entire coding regions of numerous genes. Classically, medical sequencing has used capillary-based “Sanger” sequencing technology and this has remained the gold standard for three decades. However, this method is expensive and has low throughput. It was not until novel technology platforms emerged that comprehensive testing came within reach. One such technology is array-based sequencing, which drastically increased the number of genes that could be analyzed simultaneously.1,2,3,4,5,6,7 We previously developed an array-based resequencing test for dilated cardiomyopathy (DCM) that doubled the number of genes analyzed in parallel while reducing test cost and turnaround time.6 However, this technology has two major drawbacks, particularly in a clinical setting. First, resequencing arrays have a poor detection rate for insertions and deletions (in/dels).8,9 Second, the somewhat static nature of the chip design makes it time-consuming and impractical to add new content, especially in a disease area like DCM where genes are being discovered at a rapid pace. As such, novel technologies are needed to provide comprehensive sequencing of all DCM genes with high analytical sensitivity and with the goal of further reducing the cost of diagnostic testing.
Second-generation sequencing (SGS) technologies, commonly referred to as “next-generation sequencing,” are based on massive parallel sequencing of millions of DNA templates through cycles of enzymatic treatment and image-based data acquisition. Several platforms have been developed in the last few years based on different biochemistries and cluster generation.10,11,12 The most commonly used platforms include the Illumina Genome Analyzer (GAII; Solexa Technology),13 Roche Applied Sciences (454 sequencing),14 and ABI-Applied Biosystems (SOLiD platform).15 These technologies have been adopted for a wide variety of research applications10,11,16 and have now matured sufficiently to be considered as robust enough for clinical applications. For example, SGS has recently been applied to resequencing the NF1 locus as well as the mitochondrial and small-cell lung cancer genomes.17,18,19 In addition, whole exome resequencing has been conducted to discover genes underlying rare monogenic diseases.20,21,22,23
The specific advantages offered by SGS are twofold: the low cost per base and the ability to sequence millions of reads in parallel, allowing for simultaneous analysis of a large number of genes. However, these are offset by two major disadvantages: shorter reads and reduced accuracy as compared to Sanger sequencing.11 Despite their disadvantages, improvements in these technologies promise to meet the technical requirements and strict quality standards of clinical diagnostics including analytical sensitivity, reproducibility and cost effectiveness.
Several methodological approaches to capturing target gene regions have evolved to complement the higher capacity and throughput of novel sequencing technologies. These methods involve constructing and enriching a DNA “library” and use PCR, hybridization, or both as the mode of target selection. The classic PCR approach to library generation requires amplification of target regions, pooling, concatenation, shearing and ligation of adaptors and sequencing primers. Secondary droplet-based microfluidic technologies have evolved to facilitate high-throughput PCR in picoliter droplets.24 With hybridization-based methods, libraries are constructed by shearing total gDNA followed by adaptor ligation and hybridization to oligonucleotides that are complementary to the desired target. Hybridization can be performed on a solid surface array, on a filter, or in solution.21,25,26,27 A third general approach to target selection uses molecular inversion probes. Molecular inversion probes consist of two primers linked together by a backbone and, similar to PCR, bind to specific target DNA. This is followed by gap filling, ligation, and enrichment steps.28,29
Important considerations in choosing a method include total amount of starting DNA, the size of the target region and the types of sequence alteration under investigation. Technical parameters such as target specificity, uniformity, and completeness of coverage also vary with each methodology.12 Although each method has advantages and disadvantages depending on the application, we selected PCR, a well-established and robust targeted amplification method, for this study.
Building on our array-based resequencing test for DCM,6 we evaluated the suitability of targeted PCR followed by Illumina GAII resequencing for a clinical testing environment. We used samples enriched for a large number of substitutions as well as insertions and deletions in DCM genes to assess the analytical performance parameters. We also evaluated the cost and turnaround time of such a test and compared them to those of our current, array-based resequencing test.
Materials and Methods
Validation Samples
This study was approved by the Partners HealthCare Institutional Review Board. Anonymized samples with known sequence alterations were used to create five pooled positive controls. Towards this end, individual exons harboring known mutations were PCR amplified from multiple individual patient DNA samples. In a given sample, each exon is represented once though DNA of different individuals was used to maximize the number of known mutations that could be incorporated. PCR was done by multiplexing different exons together or individually and then PCR products were combined to form a “pool” as described in the library construction methods. All mutations were confirmed by Sanger sequencing. The 19 genes and 268 exons analyzed in the current DCM CardioChip were sequenced in this study. One additional gene (HOPX) and two non-coding exons in other genes were included bringing the total number of genes and exons to 20 and 275, respectively.
Library Construction
Equal volumes (25 μl/reaction) of PCR products from each PCR reaction were pooled. PCR products were precipitated, washed, pelleted, and resuspended in 50 μl of TE buffer. PCR fragments were end repaired using the End-IT DNA End Repair kit (EpiCentre, Madison, WI) and purified with the MinElute kit (Qiagen, Valencia, CA) according to the manufacturer's instructions. PCR products were then concatenated using quick ligase (New England Biolabs, Ipswich, MA). DNA was sheared to 100–500 bp by sonication using a Covaris S2 Instrument as recommended. End repair of DNA fragments, “A” addition to the 3′ end of blunt-ended DNA fragments (Klenow Exo Enzyme from Illumina, San Diego, CA) and adapter ligation were all done as recommended by Illumina's library construction protocol. Enrichment of adapter-modified DNA fragments was carried out by PCR using Phusion DNA polymerase according to the manufacturer's protocol, followed by purification using the MinElute kit.
Sample Run Using Illumina GAII
Libraries were subjected to standard Illumina protocols for cluster generation and detection. Detection was done for 36 cycles in an Illumina Genome Analyzer II using a standard 1.0-mm flow cell.
Illumina Pipeline Processing
The Illumina Genome Analyzer Pipeline version 1.0 was used for data analysis. Image analysis was done using the Firecrest module, followed by base calling using the Bustard module, and a quality check alignment using ELAND. Default settings were used for base calling and alignment. The Gerald module, also part of the pipeline, filters the reads based on a given condition and then performs alignment on the filtered reads. The reads are filtered based on a CHASTITY score allocated to each base. The CHASTITY score for cycle $i$ is defined as

$$C_i = \frac{I_1}{I_1 + I_2}$$

where $I_1$ is the highest intensity and $I_2$ is the next highest intensity obtained in a base position. The read passes the filter if

$$\sum_{i=1}^{12} \mathrm{Score}_{C_i} = 12$$

where $\mathrm{Score}_{C_i} = 1$ if $C_i > 0.6$ and $\mathrm{Score}_{C_i} = 0$ if $C_i \le 0.6$.
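The filtering rule described above can be sketched in a few lines. This is an illustration only, not the Illumina pipeline code: it assumes the usual chastity definition (highest intensity divided by the sum of the two highest intensities) and, following the description in Results, that a read passes only if each of its first 12 cycles has a chastity above 0.6. The function names and the input layout (one list of four channel intensities per cycle) are hypothetical.

```python
from typing import Sequence

CHASTITY_THRESHOLD = 0.6   # per-cycle cutoff stated in the text
FILTER_CYCLES = 12         # number of cycles inspected (taken from Results)


def chastity(intensities: Sequence[float]) -> float:
    """Return I1 / (I1 + I2) for one cycle, where I1 and I2 are the two
    highest channel intensities (assumed definition)."""
    top = sorted(intensities, reverse=True)
    i1, i2 = top[0], top[1]
    if i1 + i2 == 0:
        return 0.0  # no signal at all: treat the cycle as failing
    return i1 / (i1 + i2)


def passes_filter(read_intensities: Sequence[Sequence[float]]) -> bool:
    """A read passes if every one of its first FILTER_CYCLES cycles scores 1,
    i.e. has chastity above the threshold."""
    cycles = read_intensities[:FILTER_CYCLES]
    if len(cycles) < FILTER_CYCLES:
        return False
    score = sum(1 for c in cycles if chastity(c) > CHASTITY_THRESHOLD)
    return score == FILTER_CYCLES
```

A read with one ambiguous cycle among the first 12 (two channels of nearly equal intensity) is rejected, whereas the same ambiguity at cycle 13 or later would not affect the filter under these assumptions.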
The filtered reads were then subjected to variant and in/del detection using NextGENe software (SoftGenetics, LLC, State College, PA).
NextGENe Software
Three main steps are involved in calling variants: i) condensation of reads; ii) alignment to reference; and iii) variant calling by applying thresholds. The Condensation Tool is used to polish and lengthen short sequence reads into fragments that are longer and more accurate, because short Illumina Genome Analyzer reads are often not unique within the genome. Reads are first clustered and grouped based on a 12-bp anchor sequence determined from the reads themselves. This step statistically removes errors while maintaining true variations. Allele frequency information is recorded for each condensed read. This step can be repeated automatically for many cycles of condensation, thereby increasing the read length.
Alignment is then performed by determining possible matching positions using 12-bp seed sequences within the reads. The best matching position is chosen and the alignment is then extended. If reads match to multiple positions in the reference with the same number of mismatches, the uniqueness score is used to determine the best position for alignment. A more unique region of the reference is chosen for alignment over regions that are found repeatedly in the reference. Usage of the smaller seed sequences for alignment allows the accurate mapping of reads with in/dels.
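The seed-and-extend idea described above can be illustrated with a toy example. This is not NextGENe's algorithm, whose internals are not described here: the sketch indexes every 12-bp seed of the reference, collects candidate positions from seeds shared with the read, and keeps the placement with the fewest mismatches. Gapped extension (needed for in/dels) and the uniqueness score are omitted; ties are simply resolved in favor of the leftmost position, which is an assumption made for brevity. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

SEED_LEN = 12  # seed length stated in the text


def build_seed_index(reference: str, k: int = SEED_LEN) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def align_read(read: str, reference: str, index: Dict[str, List[int]],
               k: int = SEED_LEN) -> Optional[Tuple[int, int]]:
    """Return (reference_start, mismatches) for the best ungapped placement,
    or None if no seed from the read occurs in the reference."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best: Optional[Tuple[int, int]] = None
    for start in sorted(candidates):  # leftmost candidate wins ties
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best
```

Because a 36-bp read contains 25 overlapping 12-bp seeds, a single substitution still leaves several intact seeds to anchor the read, which is the property the text relies on when it notes that smaller seeds help map reads carrying variants.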
Variant calling is based on three thresholds set by the user: A) mutation % (percentage of reads with a variant above a given threshold); B) coverage (the number of reads that are aligned at a given position); and C) SNP allele (the number of reads that contain a variant). A variant is called when all three parameters exceed the set thresholds. For this study we set thresholds of 10%, 10 reads, and one read for parameters A, B, and C, respectively.
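The three-threshold rule can be written out explicitly. The sketch below uses the thresholds stated for this study (10%, 10 reads, one read); the class and function names are hypothetical, and treating a value equal to the threshold as passing is an assumption, since the text only says the parameters must "exceed" the thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    mutation_pct: float = 10.0   # A: percentage of reads carrying the variant
    coverage: int = 10           # B: reads aligned at the position
    variant_reads: int = 1       # C: reads that contain the variant


def call_variant(variant_reads: int, coverage: int,
                 t: Thresholds = Thresholds()) -> bool:
    """Return True if a position satisfies all three thresholds.

    Assumption: a value equal to a threshold counts as passing (>=).
    """
    if coverage <= 0:
        return False
    mutation_pct = 100.0 * variant_reads / coverage
    return (mutation_pct >= t.mutation_pct
            and coverage >= t.coverage
            and variant_reads >= t.variant_reads)
```

Under these settings a heterozygous variant seen in 25 of 50 reads is called, a variant seen in 2 of 50 reads (4%) is rejected on parameter A, and a variant seen in 4 of 8 reads is rejected on parameter B regardless of its allele fraction.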
PCR and Capillary-Based Sanger Sequencing
All mutations were confirmed by Sanger sequencing. Exons were amplified by PCR using standard conditions (15-μl reactions containing 5 ng of genomic DNA, 2 mmol/L MgCl2, 0.75 μl of dimethyl sulfoxide, 1 mol/L Betaine, 0.2 mmol/L dNTPs, 20 pmol primers, 0.2 μl of AmpliTaq Gold (Applied Biosystems, Foster City, CA)) using the following cycling conditions: 95°C for 10 minutes; 95°C for 30 seconds, 60°C for 30 seconds; 72°C for 1 minute for 30 cycles; and 72°C for 10 minutes. PCR products were purified with Ampure magnetic beads (Agencourt, Beverly, MA) and then sequenced bidirectionally by dideoxy sequencing using BigDye terminator ddNTPs. Original primer sets were designed with M13 linkers for universal priming for the sequencing reaction. PCR primer sequences are available on request. Sequencing products were purified using Cleanseq magnetic beads (Agencourt) and separated by capillary electrophoresis on an ABI 3730 DNA Analyzer. Version 3.0 of the ABI DNA data collection software was used (Applied Biosystems).
Our laboratory analyzes capillary sequencing data using both a manual and semiautomated method: an automated pipeline bins traces based on Phred and Mutation Surveyor (SoftGenetics, LLC) quality scores and the presence of variants. Traces with scores above a set threshold and no variants are manually reviewed once, traces with moderate quality scores and/or variants have two independent reviews and traces with low quality scores are considered failed.
Results
Approach
Our current, array-based DCM test covers 19 DCM genes (Table 1). To evaluate the performance of targeted PCR-based Solexa sequencing, we generated five positive control samples enriched for a large number of sequence alterations in these genes (between 28 and 49 unique mutations per sample) (Figure 1). These positive control samples were sequenced in five different lanes of a single flow cell using the Illumina GAII with a 36-bp cycle run. Mutation calls were extracted from Illumina sequence reads using a program called NextGENe (SoftGenetics, LLC). For all five samples, the results were compared to capillary as well as array-based resequencing data.
Table 1.
List of Genes
| Gene | Name | Location/function | Locus | Reference sequence | Exons | Bases |
|---|---|---|---|---|---|---|
| ABCC9 | ATP-binding cassette, subfamily C, member 9 | ATP-sensitive K channel | 12p12.1 | NM_005691 | 39 | 5568 |
| ACTC | Actin, alpha | Sarcomere | 15q14 | NM_005159 | 6 | 1277 |
| ACTN2 | Alpha actinin | Z-disk | 1q42-q43 | NM_005691 | 21 | 3105 |
| CSRP3 | Cysteine- and glycine-rich protein 3 | Z-disk | 11p15.1 | NM_003476 | 6 | 835 |
| CTF1 | Cardiotrophin 1 | Cytokine | 16p11.2-p11.1 | NM_001330 | 3 | 666 |
| DES | Desmin | Intermediate filament | 2q35 | NM_001927 | 9 | 1593 |
| EMD | Emerin | Nuclear lamina | Xq28 | NM_000117 | 6 | 885 |
| HOPX* | HOP homeobox | Transcription | 4q11-q12 | NM_139212.1; exon 2: NM_032495; exon 3: NM_139211 | 5 | 760 |
| LDB3 | LIM domain binding 3 | Z-disk | 10q22.2-q23.3 | NM_007078.2; exons 5, 6 and 9: NM_001080116.1 | 16 | 2828 |
| LMNA | Lamin A/C | Nuclear lamina | 1q21.2 | NM_170707.1 | 12 | 2235 |
| MYBPC3 | Myosin-binding protein 3 | Sarcomere | 11p11.2 | NM_000256 | 34 | 4524 |
| MYH7 | Myosin heavy chain 7, beta | Sarcomere | 14q12 | NM_000257 | 38 | 6577 |
| PLN | Phospholamban | Inhibitor of cardiac Ca-ATPase | 6q22.1 | NM_002667.2 | 1 | 219 |
| SGCD | Sarcoglycan, delta | Dystrophin-glycoprotein complex | 5q33 | NM_172244; exons 8b and 9: NM_000337 | 10 | 1789 |
| TAZ | Tafazzin | Mitochondrial | Xq28 | NM_000116.2 | 11 | 1099 |
| TCAP | Titin-cap | Z-disk | 17q12 | NM_003673 | 2 | 544 |
| TNNI3 | Troponin I | Sarcomere | 19q13.4 | NM_000363 | 8 | 793 |
| TNNT2 | Troponin T2 | Sarcomere | 1q32 | NM_001001430.1; exon 2: NM_000364.2 | 15 | 1182 |
| TPM1 | Tropomyosin 1 | Sarcomere | 15q22.1 | NM_000366.5; exon 6b: NM_001018005.1 | 11 | 1151 |
| VCL | Vinculin | Z-disk | 10q22.1-q23 | NM_014000 | 22 | 3845 |
| Total | 20 | | | | 275 | 41475 |
*Sequenced in addition to the 19 genes analyzed by the array-based DCM test.
Figure 1.

Flow chart of experimental design and analytical pipeline. Pooled, positive controls were amplified by PCR as described in Materials and Methods. The number of mutations is indicated for each positive control sample. Each positive control sample was sequenced on one lane of one flow cell on the Illumina GAII instrument. The Illumina GAII pipeline uses three software packages for image analysis (Firecrest), base calling (Bustard), and analysis (Gerald). Variant detection was carried out using the NextGENe software package (SoftGenetics, LLC). The number of false positives (FP), false negatives (FN), and true positives (TP) identified by NextGENe are indicated for the given parameters and thresholds described in Materials and Methods.
Analytical Performance Parameters
Illumina GAII Sequence Data
We used the Illumina Genome Analysis Pipeline version 1.0 for performing image analysis, base calling, and QC alignment. The number of raw 36-bp reads ranged from 6.62 to 9.88 × 10^6 for the five samples. The pipeline filtered out those reads that were below a CHASTITY score of 0.6 for the first 12 cycles (see Materials and Methods for details), after which we obtained between 3.67 and 4.66 × 10^6 reads. Of these, 59% to 69% of the reads aligned to our reference sequence, yielding between 2.19 and 3.16 × 10^6 sequences per lane (see Supplemental Figure 1 at http://jmd.amjpathol.org). Although the number of quality reads that can be sequenced on an Illumina GAII has steadily improved, this compares well with the number of reads obtained in other similar experiments at the time this work was done.21,30
Target Coverage
The accuracy of base calls is largely dependent on the extent of coverage. To assess this parameter, we obtained the distribution of average condensed coverage (or read depth) within the region of interest for each amplicon (Figure 2, A–C). Within each sample or lane the distribution of the average condensed coverage per amplicon is fairly constant at ∼50-fold condensed coverage across all samples (Figure 2A). This excludes PCR products that failed amplification and therefore had an average read depth of zero. The uniformity was determined by plotting the fraction of bases with a given condensed read depth. In each sample on average, more than 90% of all bases were within a twofold range. For example, in sample 1 more than 60% of the bases have read depths between 40 and 50 and more than 90% of the bases have read depths between 40 and 80. More than 99% of reads were within a 10-fold range (Figure 2B). This uniformity of coverage is consistent with other PCR-based amplification methods,31 and higher than other genomic target capture methods such as molecular inversion probes, in-solution–based hybridization,27 and solid-phase Agilent arrays.26,32
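The uniformity figures quoted above (for example, the share of bases with read depths between 40 and 80) reduce to a simple count over per-base depths. A minimal sketch follows; the function name is hypothetical and the depth values in the usage note are invented purely for illustration, not taken from the study data.

```python
from typing import Iterable


def fraction_in_range(depths: Iterable[int], low: int, high: int) -> float:
    """Fraction of bases whose read depth lies in the closed range [low, high]."""
    values = list(depths)
    if not values:
        raise ValueError("no depths supplied")
    inside = sum(1 for d in values if low <= d <= high)
    return inside / len(values)
```

For a made-up region in which 65 bases have depth 45, 28 bases have depth 60, and 7 bases have depth 20, `fraction_in_range(depths, 40, 80)` gives 0.93, meaning 93% of bases fall within the twofold range of 40 to 80.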
Figure 2.

Illumina GAII sequence data. A: Average condensed coverage for each amplicon within the region of interest (ROI) for each sample. Amplicons are arranged by concatenation of the ROI in alphabetical order of gene names and by sequential order of exons (see Supplemental Table 2 at http://jmd.amjpathol.org). B: Distribution of the fraction of bases within the ROI with a given condensed coverage in intervals of 10 (see Supplemental Table 3 at http://jmd.amjpathol.org). C: Graphical representation of the distribution of condensed coverage within and surrounding the ROI. Dotted lines demarcate the boundaries of the ROI. The read depths of all bases within a given bin are averaged over all amplicons and plotted as a point in the graph. To do this, we divided each amplicon's region of interest into 10 bins and calculated the average read depth within each bin by taking the mean of the coverage of all of the bases. The flanking regions are obtained from a 200-bp window divided into four bins.
To explore the read depths further we looked at the uniformity of coverage within amplicons across the region of interest. The average distribution for all amplicons was fairly uniform within and across the region of interest, with reduced coverage near the amplicon ends (Figure 2C), but differences in read depths were observed between individual amplicons (Figure 2A). The concatenation of PCR products followed by shearing might explain this observation, because reads that span the junction of two concatenated amplicons do not align to the reference and are filtered out.
Analytical Sensitivity
The total number of unique variants covered by these samples was 136 (109 substitutions and 27 in/dels). Of those, 37 were present in more than one sample and were considered as independent counts for calculating the sensitivity and specificity (191 in total: 160 single base substitutions and 31 in/dels) (Table 2 and Supplemental Table 1 at http://jmd.amjpathol.org). As shown in Table 2, 4/160 substitutions were not detected (after PCR failure had been excluded), resulting in an analytical sensitivity of 97.93 ± 1.28%, which is comparable to the sensitivity of the DCM array.6 One of the main drawbacks of array-based sequencing technology is poor performance in detecting small insertions and deletions.6 As expected, Illumina GAII-based sequencing performed significantly better, with only 3/31 insertions and deletions missed, resulting in an analytical sensitivity of 90.6 ± 4.72% (compared to almost two-thirds being missed by the DCM array).6 This is significantly better than the reported analytical sensitivity of 81.8% for in/dels.30
Table 2.
False Positive and False Negative Substitutions and in/dels
| Sample no. | Substitutions: FN | Substitutions: FP | Substitutions: total | In/dels: FN | In/dels: FP | In/dels: total | Total mutations |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 0 | 45 | 1 | 4 | 4 | 49 |
| 2 | 0 | 1 | 37 | 0 | 1 | 5 | 42 |
| 3 | 2 | 3 | 34 | 0 | 3 | 2 | 36 |
| 4 | 0 | 1 | 29 | 1 | 4 | 7 | 36 |
| 5 | 0 | 2 | 15 | 1 | 3 | 13 | 28 |
| Total | 4 | 7 | 160 | 3 | 15 | 31 | 191 |
FP, false positive; FN, false negative.
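The reported sensitivities (97.93 ± 1.28% for substitutions and 90.6 ± 4.72% for in/dels) differ slightly from the pooled ratios 156/160 and 28/31. They are reproduced, to within rounding, by averaging the per-sample sensitivities from Table 2 and taking the standard error of the mean. That reading is an inference from the table rather than something stated in the text, and the sketch below should be taken in that spirit; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# (false negatives, total variants) per sample, copied from Table 2
SUBSTITUTIONS = [(2, 45), (0, 37), (2, 34), (0, 29), (0, 15)]
INDELS = [(1, 4), (0, 5), (0, 2), (1, 7), (1, 13)]


def sensitivity_summary(counts):
    """Return (mean, SEM) of the per-sample sensitivity, both in percent."""
    per_sample = [100.0 * (total - fn) / total for fn, total in counts]
    sem = stdev(per_sample) / sqrt(len(per_sample))
    return mean(per_sample), sem


for label, counts in (("substitutions", SUBSTITUTIONS), ("in/dels", INDELS)):
    m, s = sensitivity_summary(counts)
    print(f"{label}: {m:.2f} ± {s:.2f}%")
```

Averaging per sample gives every lane equal weight, so the small in/del counts in samples 1 and 3 (four and two variants) influence the estimate as much as the 13 in/dels in sample 5.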
We investigated coverage levels for the seven substitutions and in/dels that were not detected by GAII sequencing (Table 3). Two variants (c.574-5A>C in exon 5 of ABCC9, c.1357C>T in exon 14 of MYH7) had a raw read coverage ≤10, suggesting that insufficient coverage was the reason why they were missed. Two additional variants (c.3476_3477delTT in exon 31 of MYBPC3 and c.3332_3335dupAGTG in exon 31 of MYBPC3) had low but acceptable raw coverage. These in/dels are likely to have been missed because short reads that contain them are difficult to align. One missed variant (c.3742_3759dup in exon 33 of MYBPC3) had elevated raw coverage, but this was an 18-base duplication and may therefore not have been properly aligned to the reference sequence. Finally, two substitutions (c.308C>T in VCL and c.25-8T>A in TNNI3) had more than adequate coverage and it is unclear why they were not detected. Amplification bias or allelic dropout may underlie the inability to detect these substitutions. A common source of allelic dropout is the presence of SNPs within primer binding sites. However, none of our PCR primers had known SNPs (dbSNP 130) and both variants were readily detected by the DCM array using the same PCR primers.
Table 3.
False Negative Variants and Depth of Coverage
| Amplicon (gene–exon) | FN variant | Sample | Raw coverage | Condensed coverage (NextGENe) |
|---|---|---|---|---|
| MYH7-EX14 | c.1357C>T | 3 | 4 | 1 |
| ABCC9-EX05 | c.574-5A>C | 1 | 10 | 8 |
| MYBPC3-EX31 | c.3476_3477delTT | 1 | 38 | 29 |
| MYBPC3-EX31 | c.3332_3335dupAGTG | 4 | 44 | 33 |
| MYBPC3-EX33 | c.3742_3759dup | 5 | 505 | 68 |
| TNNI3-EX03 | c.25-8T>A | 3 | 2080 | 88 |
| VCL-EX03 | c.308C>T | 1 | 2367 | 181 |
While certain regions of DNA are difficult to sequence due to high GC content, polynucleotide repeats, or extended homopolymers,33 there is no evidence that these issues led to the false negatives in this study since the variants missed do not fall under this sequence context (data not shown).
Analytical Specificity
From the five samples analyzed using Illumina technology, we obtained an average of 4.4 false positive mutations per 41,475 bases sequenced corresponding to a false positive rate of 0.011 ± 0.002%. This is a ninefold reduction compared to the DCM array (average false positive rate of 0.09%).6 In a clinical setting all identified mutations are confirmed with an independent reaction, often using a second independent methodology (Sanger sequencing in our laboratory). This confirmatory process results in a final false positive rate of close to 0% (equal to the false positive rate of dideoxy sequencing).
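The false positive rate quoted above follows from the per-sample false positive counts in Table 2 and the 41,475 bases of the region of interest. The sketch below recomputes it, again assuming (as an inference, not a stated method) that the ± value is the standard error across the five samples; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

BASES_SEQUENCED = 41_475  # total bases in the region of interest (Table 1)
ARRAY_FP_RATE_PCT = 0.09  # DCM array false positive rate quoted in the text

# False positives per sample (substitutions + in/dels), from Table 2
FALSE_POSITIVES = [0 + 4, 1 + 1, 3 + 3, 1 + 4, 2 + 3]


def false_positive_rate(fp_counts, bases):
    """Return (mean, SEM) of the per-sample false positive rate in percent."""
    rates = [100.0 * fp / bases for fp in fp_counts]
    return mean(rates), stdev(rates) / sqrt(len(rates))


rate, sem = false_positive_rate(FALSE_POSITIVES, BASES_SEQUENCED)
fold_reduction = ARRAY_FP_RATE_PCT / rate
print(f"{rate:.3f} ± {sem:.3f}%  (about {fold_reduction:.1f}-fold lower than the array)")
```

With these inputs the rate rounds to 0.011 ± 0.002%, and the ratio to the array's 0.09% comes out between eight and nine, so the "ninefold" figure in the text is best read as approximate.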
Translation into Clinical Testing
Confirmatory Follow-Up
In a clinical laboratory, the amount of confirmatory follow-up per sample generally consists of the number of variant calls as well as the number of low-quality reactions that are deemed insufficient to return an accurate result. The definition of “low quality” depends on the technology used. For capillary sequencing it is the amount of nonspecific background; for array-based sequencing it is the call rate across an amplicon. Additional assays are often carried out using another technology such as capillary sequencing, which is still deemed the gold standard for medical resequencing. Therefore, to assess the suitability of a novel platform for routine clinical operations, one must also assess these factors, which in turn impact turnaround time (TAT) and test cost.
For second-generation sequencing technologies, the confirmatory follow-up is defined by the number of variant calls (including mostly false positive calls) and the number of amplicons with suboptimal coverage (coverage that was shown to be insufficient to allow reliable variant detection). In our validation experiment, all positions at which variants were missed due to insufficient coverage had a condensed read coverage of close to 10 or below (Table 3). Although insufficient coverage may not have been the reason for missing the two in/dels that had raw coverage of 38 and 44 (condensed coverage of 29 and 33), we assigned a conservative condensed coverage threshold of 30. Any amplicon with one or more bases with coverage below 30 would need to be confirmed by Sanger sequencing. Using this threshold, an average of 23 amplicons (out of a total of 246 amplicons) would need to be repeated (Table 4). Considering sample 1 to be an outlier, this average reduces to 16. Together with the average number of amplicons containing a variant call, the total projected number of amplicons that need to be followed up per sample is 18 (Table 4). This is lower than the average number of amplicons currently followed up for our array-based resequencing test (∼40) and therefore does not limit the translation of this technology into clinical practice.
Table 4.
Total Number of Amplicons Followed Up by Sanger Sequencing
| Sample | Number of amplicons with <30× coverage | Number of amplicons with a false positive variant | Total number of unique amplicons resequenced |
|---|---|---|---|
| 1* | 53 | 4 | 56 |
| 2 | 14 | 2 | 15 |
| 3 | 14 | 5 | 17 |
| 4 | 17 | 4 | 20 |
| 5 | 19 | 3 | 20 |
| Average | 23.4 | 3.6 | 25.6 |
| Average (excluding sample 1) | 16 | 3.5 | 18 |
*Considered to be an outlier.
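The follow-up rule (route an amplicon to Sanger sequencing if any base falls below 30× condensed coverage) and the averages in Table 4 can be sketched as follows. The function name and the input layout (a mapping from amplicon name to per-base condensed depths) are hypothetical; the per-sample totals are copied from Table 4.

```python
from statistics import mean
from typing import Dict, List

COVERAGE_THRESHOLD = 30  # condensed-coverage cutoff chosen in the text


def needs_sanger(amplicon_depths: Dict[str, List[int]],
                 threshold: int = COVERAGE_THRESHOLD) -> List[str]:
    """Return the amplicons that have at least one base below the threshold.

    Each amplicon is assumed to have a non-empty list of per-base depths.
    """
    return sorted(name for name, depths in amplicon_depths.items()
                  if min(depths) < threshold)


# Unique amplicons followed up per sample, from Table 4 (sample 1 first)
TOTAL_FOLLOW_UP = [56, 15, 17, 20, 20]

all_samples = mean(TOTAL_FOLLOW_UP)          # 25.6
without_outlier = mean(TOTAL_FOLLOW_UP[1:])  # 18
```

As a usage example with invented depths, `needs_sanger({"MYH7-EX14": [120, 4, 95], "TNNI3-EX03": [88, 90, 91]})` flags only `MYH7-EX14`, because a single base at 4× is enough to send the whole amplicon for confirmation.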
Operational Considerations
Due to the complexity of many genetic tests, their cost and TAT can be extremely high compared to other clinical tests. For genetically heterogeneous diseases, diagnostic laboratories must strive to increase clinical sensitivity by including more genes. However, while a test's clinical sensitivity is of utmost importance to the patient, so is the overall test cost and TAT. In addition, throughput (dependent on the number of samples one can process simultaneously) is an important consideration for the testing laboratory to ensure that the testing volume is sustainable at a given TAT when migrating a test to a novel platform. Until very recently, the chief limiting factor for adding test content has been the high cost of Sanger sequencing. We and others showed that parallel processing of genes using array-based sequencing decreased test cost and TAT substantially (by almost 50% compared to capillary based sequencing)2,6 and thus one would expect second-generation sequencing technologies to offer even greater cost savings.
Figure 3 compares the workflows of our existing DCM array to the SGS method described in this study. The majority of steps are either identical (eg, generation of PCR products and confirmatory Sanger sequencing) or very similar with regard to their impact on labor (eg, the use of a custom informatics pipeline to determine the confirmatory follow-up). Steps leading to an increase in either test cost (through higher reagent cost or increased labor) or TAT are highlighted in gray and warrant further description. Hybridization of PCR products to the GeneChip would be replaced by library generation and subsequent sequencing on the Illumina GAII, which is currently the biggest cost factor in SGS technologies and therefore one of the bottlenecks for clinical adoption. A detailed cost analysis of these two steps revealed that they would be five times more expensive than the array processing, mostly due to the high lane charges. An additional cost factor is data storage, which is negligible for array-based or capillary sequence data. Currently there are no set standards as to what SGS data need to be stored and for how long, but for this exercise we assumed that one will store the intensity files as well as raw sequence reads. At our facility, storage of these data for 10 years would currently add several hundred dollars per lane. Turnaround time is mostly impacted by the GAII sequencing run, which adds roughly a week depending on the read length. Minor increases in TAT are due to the alignment of raw sequence reads (∼4 hours/lane using NextGENe) and a possible need to transfer data from one operating system to another (Linux to MS-DOS in our laboratory, using NextGENe). The overall projected TAT of an SGS assay as used in this study would be roughly 3 weeks longer than our current DCM array. Finally, throughput would be negatively impacted by the smaller batch sizes for library construction and the GAII run.
In routine clinical operations, the potential for fatal errors such as sample switches increases dramatically with the number of steps that need to be performed. Due to its complexity, library construction (if performed manually) lends itself to such errors and a smaller batch size is beneficial to avoid sample errors.
Figure 3.

Comparative analysis of array-based resequencing and Illumina GAII sequencing workflows with regard to cost and turnaround time. Laboratory workflows for the DCM array (left) and DCM SGS test (right) are shown. Steps that lead to increased test cost or turnaround time are highlighted in gray. The estimated relative impact of transitioning the DCM test from the array-based technology to SGS-based technology is shown on the right by arrows as well as the level of expected increase.
Overall, our results show that the analytical sensitivity of an SGS-based DCM test exceeds that of the current DCM array. However, for this SGS test to be viable in a genetic diagnostic laboratory, additional improvements are needed to reduce the overall operational cost as well as the TAT. To this end, alternative genomic capture or selection methods as well as sample pooling strategies that rely on molecular barcoding will be necessary.
Discussion
SGS technologies are on the verge of being broadly used in clinical laboratories. Over the next years, they will likely replace many traditional (ie, Sanger-based as well as array-based) sequencing tests used for genetically heterogeneous disorders such as DCM. Due to the ease of adding content without paying a cost penalty, these platforms will pave the way to clinical screening of very large gene sets such as the “cardiome” or, ultimately the exome or genome.
Our current, array-based DCM test relies on upfront PCR amplification, which can also be coupled to SGS technologies. Several studies have reported acceptable analytical performance parameters, suggesting that SGS technology is sufficiently mature for diagnostic applications. Our study assessed the suitability of PCR-based SGS for clinical applications by comparing it to an existing array-based test that screens 19 DCM genes with regard to analytical sensitivity and specificity, as well as by evaluating its impact on operational cost.
It is well established that sequencing arrays have a poor ability to detect in/dels,8,9 and this is considered one of the key weaknesses of this technology in a diagnostic setting, as it prevents further addition of genes whose variant spectrum contains a significant number of this type of variant. In/dels are not frequently observed in the genes that are most commonly associated with DCM (MYH7, MYBPC3, TNNT2, TNNI3, TPM1, ACTC1),6 and the variant spectrum for those genes that contribute infrequently to disease is not yet well defined. Therefore, array-based sequencing has been an acceptable compromise, as it enabled cost-effective sequencing of an unprecedented number of genes compared to traditional Sanger sequencing. However, the availability of SGS, which has shown great promise in the research realm, makes it imperative to evaluate these technologies in a clinical setting to further improve the DCM test. We demonstrate that, although it did not reach the analytical sensitivity of Sanger-based sequencing, GAII sequencing performed significantly better than the DCM array at detecting in/dels. Importantly, GAII sequencing missed only 9% of in/dels, which is a significant improvement compared to array-based tests.6,8,9 Possible reasons for failure to detect variants include low coverage, amplification bias or allelic dropout, high G/C or A/T content,33,34 or a combination of these factors. Several known sequence alterations were clearly missed due to insufficient coverage; this class of false negatives can be avoided in a diagnostic setting by applying a strict coverage threshold below which an amplicon is routed for Sanger sequencing. Several missed variants were in/dels, which are more likely to be missed because short reads are more difficult to align to the reference sequence. Sequencing read lengths of up to 50 bp and 76 bp are now routinely being used for the GAII and are likely to overcome this problem.
Furthermore, new in/del calling algorithms are being developed and will further improve alignment and detection of these variants (http://www.broadinstitute.org/science/software, last accessed April 23, 2010). In summary, while improvements can and will be made, we consider the technical performance of PCR-based GAII sequencing sufficient for genetic testing.
By establishing a set coverage threshold that is sufficient to accurately detect variants, we show that the level of confirmatory follow-up for amplicons that did not meet the threshold and/or contained potential variants was lower for SGS than for our current array-based DCM test and would therefore not constitute an obstacle in routine operations. Therefore, one should be able to include additional content without dramatically increasing reflex follow-up testing.
While analytical sensitivity and the level of confirmatory follow-up are of critical importance in a diagnostic setting, other factors impact the overall ability to implement a workflow into clinical operations. Our analysis showed that without additional improvements, both cost and turnaround time for the SGS-based assay as used in this study are higher than those of the current DCM array. While one could argue that a higher test cost may be acceptable as long as the per-gene cost for testing decreases by increasing the number of tested genes, we feel that it is critical to decrease the absolute test cost given that expensive genetic tests are often not accessible to a broad community where insurance coverage is not guaranteed.
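The threshold-based routing described above can be illustrated with a minimal sketch. It is not the laboratory's actual pipeline: the function name `route_amplicons`, the amplicon labels, the coverage values, and the threshold of 30 are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List, Tuple


def route_amplicons(
    mean_coverage: Dict[str, float], min_coverage: float
) -> Tuple[List[str], List[str]]:
    """Split amplicons by a coverage threshold.

    Amplicons at or above `min_coverage` are kept for SGS variant calling;
    those below it are flagged for Sanger follow-up.
    """
    passed: List[str] = []
    sanger_followup: List[str] = []
    # Sort by amplicon name so the output order is deterministic.
    for amplicon, coverage in sorted(mean_coverage.items()):
        if coverage >= min_coverage:
            passed.append(amplicon)
        else:
            sanger_followup.append(amplicon)
    return passed, sanger_followup


# Hypothetical amplicon names and coverage values, for illustration only.
example_coverage = {"MYH7_ex12": 250.0, "TNNT2_ex3": 12.0, "LMNA_ex1": 40.0}
passed, sanger_followup = route_amplicons(example_coverage, min_coverage=30.0)
print("Analyzed by SGS:", passed)
print("Routed to Sanger:", sanger_followup)
```

In practice the threshold would have to be derived from validation data, since it determines both the false-negative rate and the volume of reflex testing.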
One logical and straightforward improvement is sample pooling, which drastically reduces test cost through spreading the high lane charges across several samples. In addition, emerging genomic selection technologies such as molecular inversion probes are attractive because they do not rely on the rather complex and error-prone library generation process.
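The cost argument for pooling is simple arithmetic, sketched below under stated assumptions. The function `per_sample_cost` and all dollar amounts are hypothetical and do not come from this study; the sketch also assumes that per-sample preparation cost (including any barcoding) stays constant regardless of pool size.

```python
def per_sample_cost(
    lane_cost: float, samples_per_lane: int, prep_cost_per_sample: float
) -> float:
    """Return the sequencing cost attributed to one sample.

    The fixed lane charge is divided evenly across all pooled samples,
    while preparation cost is incurred once per sample.
    """
    if samples_per_lane < 1:
        raise ValueError("samples_per_lane must be at least 1")
    return lane_cost / samples_per_lane + prep_cost_per_sample


# Hypothetical figures: a 1000-unit lane charge and 200 units of prep per sample.
unpooled = per_sample_cost(1000.0, 1, 200.0)
pooled = per_sample_cost(1000.0, 4, 200.0)
print(f"Unpooled: {unpooled:.2f}, pooled (4 samples per lane): {pooled:.2f}")
```

The sketch also makes the limit of pooling visible: as the pool grows, the per-sample cost approaches the preparation cost, so pooling alone cannot reduce the cost of PCR and library generation.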
As part of the initial test to determine suitability of the PCR-based capture and the Illumina GAII sequencing methods, we only estimated some critical parameters such as overall cost and turnaround time in addition to the false-positive and false-negative rates for variant detection. As a laboratory with standard workflows, we find the robustness of individual steps such as targeted PCR (which is part of our current DCM test) as well as library generation in addition to the GAII sequence detection (routinely performed in our sequencing core) to be adequate. However, a full-scale clinical validation is still needed to determine robustness and run-to-run variability before SGS can be implemented for clinical diagnostic testing.
We conclude that, while the technical performance parameters of a PCR-based SGS test are acceptable and would readily allow the addition of more gene content, the projected overall operational cost is limiting, and alternative approaches such as genomic selection methods and sample pooling will be necessary to bring the overall test cost down. Given the speed with which SGS is evolving, the above-described limitations are likely to be short-lived, allowing the cost of genetic testing to drop and leading to a new paradigm in how we think about personalized genetic medicine.
Acknowledgements
We thank the members of the Partners Healthcare Center for Personalized Genetic Medicine High Throughput Sequencing Facility for technical help with this study.
Footnotes
Supported by CIHR Fellowship 200702OWF (J.P.L.-E.).
S.G. and J.P.L.-E. contributed equally to this work.
M.M., K.L., and J.L. are employed by SoftGenetics, LLC. NextGENe, the software used to analyze the data described in this manuscript, is made by this company. Their authorship is the result of a collaboration to evaluate NextGENe for clinical use.
Supplemental material for this article can be found on http://jmd.amjpathol.org.
Web Extra Material
Numbers (in millions) of raw reads; number of reads after filtering (passing filter: PF); and number of reads that align to the ROI for all samples using ELAND.
This table lists all known variants used for pooled, positive control sample sets. False-positive variants that were identified by sequencing are highlighted. Variants are listed by gene and then by coding variant. Variants are described by the position of the first coding base in the cDNA where c.1 is the first A of the ATG start site. Call status is determined by whether a known variant with the correct zygosity was identified or not. Call status was defined as FN: false-negative call; FP: false-positive call; TP: true-positive call; and “–” : not applicable (NA) (not included as part of a pooled, positive control sample set). The total number of FN, FP, TP, and NA calls are listed for substitutions and indels.
Average condensed coverage per exon for the region of interest. All genes are listed by alphabetical order and exon number.
Percentage of bases with a given coverage across the region of interest.
References
1. Denning L, Anderson JA, Davis R, Gregg JP, Kuzdenyi J, Maselli RA. High throughput genetic analysis of congenital myasthenic syndromes using resequencing microarrays. PLoS One. 2007;2:e918. doi: 10.1371/journal.pone.0000918.
2. Fokstuen S, Lyle R, Munoz A, Gehrig C, Lerch R, Perrot A, Osterziel KJ, Geier C, Beghetti M, Mach F, Sztajzel J, Sigwart U, Antonarakis SE, Blouin JL. A DNA resequencing array for pathogenic mutation detection in hypertrophic cardiomyopathy. Hum Mutat. 2008;29:879–885. doi: 10.1002/humu.20749.
3. Mandal MN, Heckenlively JR, Burch T, Chen L, Vasireddy V, Koenekoop RK, Sieving PA, Ayyagari R. Sequencing arrays for screening multiple genes associated with early-onset human retinal degenerations on a high-throughput platform. Invest Ophthalmol Vis Sci. 2005;46:3355–3362. doi: 10.1167/iovs.05-0007.
4. Waldmuller S, Muller M, Rackebrandt K, Binner P, Poths S, Bonin M, Scheffold T. Array-based resequencing assay for mutations causing hypertrophic cardiomyopathy. Clin Chem. 2008;54:682–687. doi: 10.1373/clinchem.2007.099119.
5. Zhou S, Kassauei K, Cutler DJ, Kennedy GC, Sidransky D, Maitra A, Califano J. An oligonucleotide microarray for high-throughput sequencing of the mitochondrial genome. J Mol Diagn. 2006;8:476–482. doi: 10.2353/jmoldx.2006.060008.
6. Zimmerman RS, Cox S, Lakdawala N, Cirino A, Mancini-DiNardo D, Clark E, Leon A, Duffy E, White E, Baxter S, Alaamery M, Farwell L, Weiss S, Seidman CE, Seidman JG, Ho CY, Rehm HL, Funke BH. A Novel Custom Resequencing Array For Dilated Cardiomyopathy. Genet Med. 2010;12:268–278. doi: 10.1097/GIM.0b013e3181d6f7c0.
7. Zwick ME, McAfee F, Cutler DJ, Read TD, Ravel J, Bowman GR, Galloway DR, Mateczun A. Microarray-based resequencing of multiple Bacillus anthracis isolates. Genome Biol. 2005;6:R10. doi: 10.1186/gb-2004-6-1-r10.
8. Clark RM, Schweikert G, Toomajian C, Ossowski S, Zeller G, Shinn P, Warthmann N, Hu TT, Fu G, Hinds DA, Chen H, Frazer KA, Huson DH, Scholkopf B, Nordborg M, Ratsch G, Ecker JR, Weigel D. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science. 2007;317:338–342. doi: 10.1126/science.1138632.
9. Zeller G, Clark RM, Schneeberger K, Bohlen A, Weigel D, Ratsch G. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res. 2008;18:918–929. doi: 10.1101/gr.070169.107.
10. Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008;9:387–402. doi: 10.1146/annurev.genom.9.081307.164359.
11. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26:1135–1145. doi: 10.1038/nbt1486.
12. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11:31–46. doi: 10.1038/nrg2626.
13. Bentley DR. Whole-genome re-sequencing. Curr Opin Genet Dev. 2006;16:545–552. doi: 10.1016/j.gde.2006.10.009.
14. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380. doi: 10.1038/nature03959.
15. Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM. Accurate multiplex polony sequencing of an evolved bacterial genome. Science. 2005;309:1728–1732. doi: 10.1126/science.1117389.
16. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008;40:1413–1415. doi: 10.1038/ng.259.
17. Chou LS, Liu CS, Boese B, Zhang X, Mao R. DNA sequence capture and enrichment by microarray followed by next-generation sequencing for targeted resequencing: neurofibromatosis type 1 gene as a model. Clin Chem. 2010;56:62–72. doi: 10.1373/clinchem.2009.132639.
18. Pleasance ED, Stephens PJ, O'Meara S, McBride DJ, Meynert A, Jones D, Lin ML, Beare D, Lau KW, Greenman C, Varela I, Nik-Zainal S, Davies HR, Ordonez GR, Mudie LJ, Latimer C, Edkins S, Stebbings L, Chen L, Jia M, Leroy C, Marshall J, Menzies A, Butler A, Teague JW, Mangion J, Sun YA, McLaughlin SF, Peckham HE, Tsung EF, Costa GL, Lee CC, Minna JD, Gazdar A, Birney E, Rhodes MD, McKernan KJ, Stratton MR, Futreal PA, Campbell PJ. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature. 2010;463:184–190. doi: 10.1038/nature08629.
19. Vasta V, Ng SB, Turner EH, Shendure J, Hahn SH. Next generation sequence analysis for mitochondrial disorders. Genome Med. 2009;1:100. doi: 10.1186/gm100.
20. Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, Shendure J, Bamshad MJ. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2010;42:30–35. doi: 10.1038/ng.499.
21. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461:272–276. doi: 10.1038/nature08250.
22. ten Bosch JR, Grody WW. Keeping up with the next generation: massively parallel sequencing in clinical diagnostics. J Mol Diagn. 2008;10:484–492. doi: 10.2353/jmoldx.2008.080027.
23. Tucker T, Marra M, Friedman JM. Massively parallel sequencing: the next big thing in genetic medicine. Am J Hum Genet. 2009;85:142–154. doi: 10.1016/j.ajhg.2009.06.022.
24. Kiss MM, Ortoleva-Donnelly L, Beer NR, Warner J, Bailey CG, Colston BW, Rothberg JM, Link DR, Leamon JH. High-throughput quantitative polymerase chain reaction in picoliter droplets. Anal Chem. 2008;80:8975–8981. doi: 10.1021/ac801276c.
25. Herman DS, Hovingh GK, Iartchouk O, Rehm HL, Kucherlapati R, Seidman JG, Seidman CE. Filter-based hybridization capture of subgenomes enables resequencing and copy-number detection. Nat Methods. 2009;6:507–510. doi: 10.1038/nmeth.1343.
26. Hodges E, Rooks M, Xuan Z, Bhattacharjee A, Benjamin Gordon D, Brizuela L, Richard McCombie W, Hannon GJ. Hybrid selection of discrete genomic intervals on custom-designed microarrays for massively parallel sequencing. Nat Protoc. 2009;4:960–974. doi: 10.1038/nprot.2009.68.
27. Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, Gabriel S, Jaffe DB, Lander ES, Nusbaum C. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol. 2009;27:182–189. doi: 10.1038/nbt.1523.
28. Li JB, Gao Y, Aach J, Zhang K, Kryukov GV, Xie B, Ahlford A, Yoon JK, Rosenbaum AM, Zaranek AW, LeProust E, Sunyaev SR, Church GM. Multiplex padlock targeted sequencing reveals human hypermutable CpG variations. Genome Res. 2009;19:1606–1615. doi: 10.1101/gr.092213.109.
29. Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F, Gao Y, Church GM, Shendure J. Multiplex amplification of large sets of human exons. Nat Methods. 2007;4:931–936. doi: 10.1038/nmeth1110.
30. Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S, Frazer KA. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol. 2009;10:R32. doi: 10.1186/gb-2009-10-3-r32.
31. Brouzes E, Medkova M, Savenelli N, Marran D, Twardowski M, Hutchison JB, Rothberg JM, Link DR, Perrimon N, Samuels ML. Droplet microfluidic technology for single-cell high-throughput screening. Proc Natl Acad Sci USA. 2009;106:14195–14200. doi: 10.1073/pnas.0903542106.
32. Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, Middle CM, Rodesch MJ, Albert TJ, Hannon GJ, McCombie WR. Genome-wide in situ exon capture for selective resequencing. Nat Genet. 2007;39:1522–1527. doi: 10.1038/ng.2007.42.
33. Yang A. Solutions for Sequencing Difficult Regions. In: Kieleczawa J, editor. Jones and Bartlett; Sudbury, MA: 2009. pp. 65–90.
34. Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods. 2009;6:291–295. doi: 10.1038/nmeth.1311.
