Next-generation sequencing (NGS) approaches are highly applicable to clinical studies. We review recent advances in sequencing technologies, as well as their benefits and tradeoffs, to provide an overview of clinical genomics from study design to computational analysis. Sequencing technologies enable genomic, transcriptomic, and epigenomic evaluations. Studies that use a combination of whole genome, exome, mRNA, and bisulfite sequencing are now feasible due to decreasing sequencing costs. Single-molecule sequencing increases read length, and the MinION™ nanopore sequencer offers a uniquely portable option at a lower cost. Many of the published comparisons we review here address the challenges associated with different sequencing methods. Overall, NGS techniques, coupled with continually improving analysis algorithms, are useful for clinical studies in many realms, including cancer, chronic illness, and neurobiology. We, and others in the field, anticipate that the clinical use of NGS approaches will continue to grow, especially as we shift into an era of precision medicine.
Keywords: Sequencing, genomics, clinical utility
Next-generation sequencing (NGS) has become ubiquitous over the past few years, producing a deluge of new data at an unprecedented rate. However, how to incorporate the novel insights from these data into clinical practice is not always obvious. Here, we review the current challenges associated with different genomic, transcriptomic, and epigenomic sequencing approaches and platforms, as well as important considerations when designing sequencing studies to maximize statistical power and clinical utility. We also describe the current applications of these technologies across a range of topics including cancer genomics and precision medicine, with a focus on integrative study design and computational analyses. Collectively, this review provides a guide to experimental and computational methods of using NGS in clinical research. NGS technologies have already improved medical interventions and will continue to transform medicine in the clinic and at a personal level by offering individuals increased opportunities to manage their health throughout a lifetime.
A. Sequencing Has Promising Clinical Utility
The development of NGS has lowered the cost of sequencing from $100 million in 2001 to $1000 in 2014. The lower cost has made sequencing more accessible to the medical community for diagnostic support (Fig. 1). Sequencing can generate a wide variety of data types, which favors its use over other existing techniques to characterize nucleic acids, including PCR and microarrays. Traditional NGS platforms, such as the Illumina HiSeq sequencer, are widely used for DNA sequencing, RNA sequencing, and bisulfite sequencing. Emerging sequencing techniques from the past few years provide alternatives to the short reads produced by these platforms. Together, the analysis of patients’ genomes, transcriptomes, and DNA methylomes can aid diagnosis and prognostic classifications.
FIG. 1.
Overview of genomic, transcriptomic, and methylomic sequence analysis workflows for disease characterization and precision medicine. Computational analysis pipelines are further described in Fig. 2.
B. Single-Molecule Sequencing
Both the PacBio RS and the MinION™ nanopore sequencer offer longer read lengths than other sequencing technologies, on the order of kilobases or tens of kilobases.1 Pacific Biosciences released the PacBio RS sequencer in 2010, and although accuracy was initially poor at 86%, repeated sequencing of each strand can increase accuracy to 99%. While specific bioinformatics tools have been developed over the past few years to cope with the error rate,2 the high cost of the instrument has limited its adoption.
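To give a rough sense of why repeated passes over the same molecule raise accuracy, the sketch below treats each pass as an independent call that is correct with a fixed probability and takes a majority vote across passes. This is an illustrative simplification, not the vendor's consensus algorithm; the function name, the independence assumption, and the right-or-wrong error model are all assumptions introduced here.

```python
from math import comb


def majority_vote_accuracy(per_pass_accuracy: float, passes: int) -> float:
    """Probability that a strict majority of passes calls a base correctly.

    Assumes (hypothetically) that passes are independent and that each call is
    simply right or wrong; real consensus callers are more sophisticated.
    """
    if passes < 1 or passes % 2 == 0:
        raise ValueError("use an odd, positive number of passes to avoid ties")
    needed = passes // 2 + 1  # smallest number of correct passes that wins the vote
    wrong = 1.0 - per_pass_accuracy
    return sum(
        comb(passes, k) * per_pass_accuracy**k * wrong ** (passes - k)
        for k in range(needed, passes + 1)
    )


# Start from the 86% single-pass accuracy quoted above and add passes.
for n in (1, 3, 5, 7, 9):
    print(n, round(majority_vote_accuracy(0.86, n), 4))
```

Under these assumptions the consensus accuracy climbs quickly with the number of passes, which is consistent in spirit with the improvement from 86% to 99% described above.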
Oxford Nanopore Technologies began distributing the MinION™ sequencer to researchers through an early-access program in 2014 before releasing the sequencer commercially in 2015. Unlike the PacBio RS, the MinION™ is highly portable, about the size of a large USB stick, and requires a relatively small investment of approximately $1000. These features could help avoid the time and cost of sending samples to reference laboratories by bringing sequencing to clinics themselves, particularly in remote locations. However, nanopore technology is still in the nascent stages of development. Estimates have placed per-base error rates at 10–15%,3 which needs to improve drastically before nanopore sequencers can be considered a viable tool for many diagnostic applications. Efforts to demonstrate the clinical potential of the MinION™ have focused on pathogen identification and characterization, including sequencing of the influenza virus,4 antibiotic resistance genes in Salmonella enterica serovar Typhi,5 and the Ebola virus to identify transmission patterns during the recent outbreak.6 Researchers have also taken advantage of long read lengths to analyze isoform expression of alternatively spliced RNA using cDNA libraries.7 Current coverage depths and error rates best facilitate targeted studies of specific genes or RNA isoforms but not whole-genome or whole-transcriptome analysis. Yet, as the chemistry and analysis continue to evolve, nanopore sequencing shows increasing promise as an accessible and powerful means of evaluating patients and the pathogens that affect them.
C. DNA Sequencing for Clinical Applications
Many consortiums are devoted to standardizing sequencing performance. Accuracy and reproducibility are the two key requirements for a sequencing technology to be widely used in clinical practice. DNA sequencing enables the detection of germline and somatic mutations. Whole-genome sequencing (WGS) is an approach to determining the complete DNA sequence of a genome with a single assay. A cross-platform WGS performance comparison revealed that 88.1% of SNVs detected were shared by Illumina and Complete Genomics.8 The concordance of insertion and deletion (indel) calling is much lower, with just 26% shared.8 Another study comparing the Illumina MiSeq, the 454 GS Junior, and the Ion Torrent PGM from Life Technologies for bacterial genome sequencing showed that Illumina has the lowest error rate and no homopolymer-associated indel errors.9
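As a minimal sketch of how a cross-platform concordance figure of this kind can be computed, the example below represents each call set as a set of variants and reports the fraction of all unique calls that both platforms share. The variant tuple layout, the function name, and the choice of the union as the denominator are assumptions made for illustration; the cited study may define concordance differently, and the calls shown are toy data.

```python
from typing import Iterable, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def shared_fraction(calls_a: Iterable[Variant], calls_b: Iterable[Variant]) -> float:
    """Fraction of all unique calls (union) that appear in both call sets."""
    set_a, set_b = set(calls_a), set(calls_b)
    union = set_a | set_b
    if not union:
        return 0.0  # no calls on either platform
    return len(set_a & set_b) / len(union)


# Toy call sets, not real data.
platform_a = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")]
platform_b = [("chr1", 100, "A", "G"), ("chr2", 40, "G", "A"), ("chr3", 7, "T", "C")]

print(shared_fraction(platform_a, platform_b))
```

Keying variants on position and alleles together means that two platforms reporting different alternate alleles at the same site count as discordant, which is the stricter of the common conventions.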
Whole-exome sequencing (WES), which captures only genic regions, provides a cost-efficient alternative to whole genome sequencing. WES shows high accuracy for detecting single-nucleotide variants (SNVs) and short indels; however, when compared to high-coverage WGS, WES has limited power for detecting copy-number variation (CNV).10 A recent assessment of WES and exome array comparative genomic hybridization (CGH) using clinical samples has shown that WES has the potential for clinical CNV detection, but currently, the combination of an array-based approach with WES improves the accuracy of CNV calling, especially for intergenic regions and single-exon changes.11
When using WES, the choice of exome-seq protocol affects results. A comprehensive comparison between Agilent, Roche, and Illumina exome-seq protocols showed varying strengths in the detection of variants across genic and untranslated regions.12 NimbleGen, from Roche, is the only platform that uses high-density overlapping baits and has higher sensitivity in variant detection. A concurrent study also confirmed that the NimbleGen platform has higher coverage of exonic regions, with at least 20× coverage.13 The Agilent and Illumina platforms, however, target a wider range of genomic regions, and with deeper sequencing, these two platforms detect more variants.14 Another advantage to Illumina’s capture method is that it provides coverage for untranslated areas, which might be of interest to researchers who would like to include noncoding variants in their analyses.
For an even more targeted and affordable method than WES, specific cancer panels are commonly used. These require prior knowledge of recurrent genetic or epigenetic lesions. Recurrent somatic mutations appear in many cancer types and can predict risk levels of the disease.15 In acute myeloid leukemia (AML), 15 biomarkers have been used to further stratify patients who were previously all placed in the intermediate risk group by cytogenetic classification.16 This method helps to develop treatment plans for AML patients tailored to the risk for each group. Indeed, targeted sequencing provides a much deeper view of the known genes and hotspots for mutations. However, with ever-decreasing sequencing cost and increasing detection of possible drug targets, exome-seq covering larger areas of the genome has the potential for wider application in clinical diagnosis and prognostic decisions.
D. RNA Sequencing (RNA-seq): A Promising Candidate for Clinical Applications
This technique enables whole-transcriptome examination, including detection of gene expression, alternative isoforms, fusion genes, and expressed variants.17,18 However, RNA-seq is also very sensitive to systematic bias.19,20 Previously, we and others have defined multiple quality metrics that flag samples with potential gene expression quantification issues, including gene body coverage evenness, GC content, insert size, and base error rate.21 The FDA-led Sequencing Quality Control (SEQC) study for RNA-seq performance evaluation showed that gene body coverage evenness, GC content, and insert size relate to library preparation and that base error rate depends on the sequencer used.19
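One of the quality metrics named above, GC content, is simple enough to sketch directly. The example below computes the GC fraction of individual reads and the mean across a sample; the function names are hypothetical, ambiguous bases such as N are ignored by assumption, and no flagging threshold is implied because the text does not give one.

```python
from typing import Iterable


def gc_fraction(read: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) in a read that are G or C."""
    bases = read.upper()
    gc = bases.count("G") + bases.count("C")
    at = bases.count("A") + bases.count("T")
    if gc + at == 0:
        raise ValueError("read contains no unambiguous bases")
    return gc / (gc + at)


def mean_gc(reads: Iterable[str]) -> float:
    """Mean per-read GC fraction for a sample."""
    values = [gc_fraction(read) for read in reads]
    if not values:
        raise ValueError("no reads supplied")
    return sum(values) / len(values)


# Toy reads, not real data.
print(mean_gc(["ACGT", "GGCC", "ATAT"]))
```

In practice a sample-level summary like this would be compared across a cohort, so that libraries whose GC distribution departs from the rest can be inspected before expression is quantified.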
Multiple software packages exist for gene expression normalization. EDAseq, which corrects for both the intra-group variations and quantification bias caused by GC-content and gene length, offers the best accuracy for differential gene expression analysis.21 PEER and sva show greater power to detect latent variables for the quantification of gene expression among different sites of sequencing data.22 For a statistically powerful RNA-seq study design, consistent experimental strategies are recommended, including sequencer, read length, sequencing depth, and protocol.23 High sequencing depth is critical for the discovery of new genes and accurate gene expression profiling.24 Follow-up studies on differential gene expression analysis have shown that increasing biological replicates improves the accuracy of gene quantifications.25 Therefore, experimental design for RNA-seq analysis is critical for accurate differential gene expression analysis.
E. DNA Methylation Provides a Complementary Approach to Clinical Measures for Patient Classification
In humans, DNA methylation involves the addition of a methyl group to the fifth position of cytosine, which has the specific effect of suppressing gene expression. DNA methylation is one of the hallmarks of cancers and aging.26,27 Many different types of cancers show consistent dysregulation of DNA methylation.28–31 The Cancer Genome Atlas (TCGA) consortium and many other research studies have shown that cancers can be classified based on their degree of DNA methylation.28,32 Subgroups of many cancers exhibit CpG island methylator phenotype (CIMP), including breast cancer,33 brain cancer,28,30,34 blood cancer,29 gastric cancer,35 liver cancer,36 and lung cancer.31 Groups of patients classified based on DNA methylation patterns show distinct clinical outcomes, including overall survival and disease free progression.28,29 The CIMP-positive group can be used to differentiate and stratify patients into groups with distinct clinical outcomes. For example, in glioblastoma patients, a CIMP-positive phenotype is usually associated with distinct copy number changes, appears exclusively in the proneural subtypes, and is associated with IDH1 mutations and improved clinical outcomes.28 In a recent study of ependymoma, which is the third most common pediatric brain tumor, researchers showed that CIMP-positive patients with posterior fossa ependymoma have worse clinical outcomes than CIMP-negative patients.30,34 The genetic background of CIMP-positive patients presents a blended picture and indicates the importance of DNA methylation as an alternative approach for patient risk stratification.30
There are many advantages to using DNA methylation analysis for clinical profiling. First, this analysis does not rely on the genetic alterations of the diseases; thus, it can be applied to diseases with sparse somatic mutations. Second, the material under analysis is DNA, which is advantageous because DNA is less sensitive to heat or enzymatic degradation than RNA, resulting in more accurate profiling.
Several types of methods have emerged to quantitatively measure DNA methylation, grouped here into three categories: (1) PCR-based methods, (2) microarray-based methods, and (3) sequencing-based methods. The PCR-based methods are usually used as a validation approach for high-throughput quantification. Among microarray-based methods, the HpaII tiny fragment Enrichment by Ligation-mediated PCR Assay (HELP Assay) is a common regional DNA methylation quantification approach for research and clinical sample profiling.37 It relies on the restriction enzyme HpaII, which is methylation sensitive and cleaves its CCGG recognition site only when the CpG is unmethylated. Another common microarray-based DNA methylation quantification approach with single-base resolution is the Illumina Infinium BeadChip Kit. The BeadChip array platform uses two different bead types to measure DNA methylation levels at single cytosines. The Infinium HumanMethylation450 BeadChip Kit (450K array) is one of the Infinium Kits that covers the most methylation sites for human samples (485,000 sites). This kit covers 99% of RefSeq genes, which, on average, have 17 CpG sites per gene. The 450K array has been widely used in DNA methylation quantification over the past few years, with more than 10,000 entries in the Gene Expression Omnibus (GEO) database, providing a valuable international resource for comparison among different cohorts of patient samples.38
Sequencing-based methods provide either single-base resolution or regional quantifications of DNA methylation levels.39–41 Single-base resolution methods mainly use bisulfite conversion sequencing, where bisulfite converts cytosines without methylation into uracil but leaves cytosines with methylation intact as cytosines. In the final sequencing readout, unmethylated CpG sites appear as thymine instead of cytosine.39,40 CpG methylation levels for individual sites are calculated based on the percentage of reads with cytosine among the total number of reads mapped. Bisulfite-based methods include whole genome bisulfite sequencing (WGBS),39 reduced representation bisulfite sequencing (RRBS),40 and targeted methylation sequencing (TMS).42 WGBS requires high sequencing depth, as at least four reads must cover each base in the whole genome to achieve accurate quantification. WGBS enables the inclusion of regions with both high and low CG density. RRBS and TMS each cover a subset of regions in the genome, providing cheaper alternatives to WGBS, and accurately quantify approximately 15% of higher density CpG sites, including CpG islands and promoter regions.40,42 These targeted approaches make it possible to profile more patients with regions that are of particular interest in transcriptome regulation.
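The per-site calculation described above can be sketched in a few lines. The example assumes that reads covering a CpG have already been tallied into cytosine (methylated) and thymine (unmethylated) counts; the function name and the way the four-read minimum is applied as a per-site filter are illustrative assumptions.

```python
from typing import Optional


def methylation_level(c_reads: int, t_reads: int, min_coverage: int = 4) -> Optional[float]:
    """Percent methylation at one CpG site from bisulfite read counts.

    c_reads: reads showing cytosine (protected from conversion, so methylated)
    t_reads: reads showing thymine (converted, so unmethylated)
    Returns None when coverage is below min_coverage, so that poorly covered
    sites are not quantified.
    """
    if c_reads < 0 or t_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = c_reads + t_reads
    if total < min_coverage:
        return None
    return 100.0 * c_reads / total


print(methylation_level(3, 1))  # covered by four reads
print(methylation_level(1, 1))  # too few reads to quantify
```

Returning `None` rather than a number for low-coverage sites keeps them distinguishable from sites that are truly unmethylated, which would otherwise both read as zero.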
Regional quantification approaches for methylation analysis mainly use affinity-based DNA methylation sequencing, such as methylated DNA immunoprecipitation sequencing (MeDIP-seq).41 This approach uses antibodies that recognize genomic locations with methylated CpGs. Comparison of the 450K array and WGS approaches showed sufficient correlation (Spearman correlation coefficient = 0.68).43 Another study showed the 450K array generated highly reproducible data between seven technical replicates of clinical samples.37 However, a large-scale effort comparing different platforms remains to be done.
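Platform comparisons such as the one above rest on a rank correlation between per-site methylation values. The sketch below implements the Spearman coefficient with only the standard library, assigning tied values their average rank and then taking the Pearson correlation of the ranks. The function names are hypothetical and the input values are toy data, not measurements from the cited study.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy per-site methylation fractions from two hypothetical platforms.
array_values = [0.10, 0.50, 0.90, 0.30]
sequencing_values = [0.20, 0.40, 0.95, 0.45]
print(spearman(array_values, sequencing_values))
```

A rank-based measure is a natural choice here because array and sequencing readouts are on different scales and need not be linearly related.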
F. Computational Analysis of Multi-Omics Data
One of the biggest challenges in going from bench to bedside in sequencing studies is the accurate and reproducible analysis of the resulting terabytes of data. Sequencing data analysis is a multistep process that researchers often need to adapt for specific experiments or scientific questions. For any sequencing data, computational analysis generally begins with aligning reads to a reference genome (Fig. 2). Commonly used programs include BWA for DNA reads, STAR for RNA-seq data, and Bismark, BSMAP, or BSmapper for bisulfite sequencing data.44–48 The choice between different aligners available for a specific type of data depends on such factors as sequencing platform, read length, and desired SNP tolerance, with various programs optimized for different read characteristics.49 After mapping to the genome, analysis depends on the scientific question, with specific programs designed broadly to call variants, identify differential expression, or quantify the extent of methylation. For example, the Genome Analysis Toolkit (GATK) offers variant calling algorithms for both DNA- and RNA-seq data, the results of which are often followed with annotation using SnpEff or Oncotator for cancer studies.50–52 Another method useful for the analysis of variants in cancer data, specifically intra-tumor heterogeneity, is PyClone.53 For RNA-seq data, the pipeline r-make, mediated by GNU Make, provides an easy, one-step method to align data with STAR, perform quality assessments, and generate gene counts, which can then be used for differential expression analysis with tools such as edgeR.19,54 Software for downstream analysis of methylation data includes methylKit, eDMR, and methclone.55–57 When multiple data types are available, their integration can identify a network of interacting and interdependent processes contributing to disease states using tools such as iCluster and Cytoscape.58,59 Indeed, clustering patient samples using models that computationally combine different data types has revealed new subtypes not seen when evaluating a single data type.60,61 Despite challenges in cost, cross-platform comparisons, technical standards, and analysis methods, advances in massively parallel sequencing techniques present new opportunities to improve clinical research, which we explore in the next section.
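The tool choices named in this section can be summarized as a small lookup table, sketched below. The tool names come from the text; the data-type keys, the `AnalysisPlan` structure, and the `plan_for` function are hypothetical conveniences, and the sketch deliberately records names only rather than command lines or options.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AnalysisPlan:
    aligners: Tuple[str, ...]    # candidate read aligners for this data type
    downstream: Tuple[str, ...]  # tools named in the text for later steps


# Keys are hypothetical labels; tool names are those listed in the text.
PLANS: Dict[str, AnalysisPlan] = {
    "dna": AnalysisPlan(
        aligners=("BWA",),
        downstream=("GATK", "SnpEff", "Oncotator", "PyClone"),
    ),
    "rna": AnalysisPlan(
        aligners=("STAR",),
        downstream=("GATK", "edgeR"),
    ),
    "bisulfite": AnalysisPlan(
        aligners=("Bismark", "BSMAP", "BSmapper"),
        downstream=("methylKit", "eDMR", "methclone"),
    ),
}


def plan_for(data_type: str) -> AnalysisPlan:
    """Look up the tools discussed in the text for a given data type."""
    key = data_type.strip().lower()
    if key not in PLANS:
        raise ValueError(f"no plan recorded for data type: {data_type!r}")
    return PLANS[key]


print(plan_for("RNA"))
```

A table like this only records the starting point; as noted above, the final choice of aligner still depends on platform, read length, and the SNP tolerance required.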
FIG. 2.
Computational analysis pipelines for integrative genomic, transcriptomic, and methylomic data. SNV, single-nucleotide variant; SV, structural variant; VAF, variant allele frequency; CN, copy number; CNV, copy-number variant; LOH, loss of heterozygosity.
A. Leveraging Electronic Health Records Data
Many aspects of patient care increasingly incorporate genomics and informatics, especially with the transition to electronic health records (EHR). Despite the relatively recent shift to EHR, large-scale studies using machine learning and data mining methods are already leveraging the data, as EHR offers unprecedented access to large sample sizes and diverse patient cohorts. These studies include mining for adverse drug effects,62 and developing a classifier for disease phenotype severity.63 The implications of a transition to EHR for clinical genomics, including genetic testing, have been reviewed previously.64
B. Genomics and Chronic Illnesses
Genomic approaches are important for preventing and managing chronic illnesses, such as diabetes and inflammatory bowel disease. The Human Microbiome Project and other metagenomic studies have revealed the role of gut microbiota in health. Fecal microbiota transplants for treating Clostridium difficile infections, ulcerative colitis, Crohn’s disease, and other digestive illnesses represent the translation of this finding into clinical practice.65,66
C. Personalized Healthcare and Direct-to-Consumer Genomics
Statistical models can incorporate genomic features and family history, coupled with factors such as age, weight, and ethnicity, for disease risk prediction in healthy individuals. These models have been especially useful for early intervention in individuals at high risk for diabetes and cardiovascular disease. Clinical genomics platforms such as Foundation Medicine, Ingenuity, and Personalis facilitate the implementation of genetic testing in clinical settings.67 As of August 2015, the NIH’s Genetic Testing Registry catalogued 28,542 tests spanning 4,726 genes for the purpose of diagnosing any of 9,927 conditions. This registry includes tests not only for classical Mendelian diseases, such as Huntington’s chorea, but also for predisposition to complex diseases, such as Alzheimer’s, and for drug response, such as sensitivity to the anticoagulant warfarin. With direct-to-consumer tools like 23andMe and Ancestry.com, which make this type of information accessible to interested individuals, people are more empowered than ever to advocate for their own health. Research continues into disease risk prediction through computational methods that use patients’ genetic information, coupled with EHR in some cases.68 Federal policies are changing to reflect the shift to clinical genomics, as evidenced by the 2015 repeal of the FDA’s shutdown of 23andMe’s genetic testing arm, and by the 2013 landmark Supreme Court case that barred the previously common practice of patenting genes.69 Other legal and ethical issues surrounding clinical genomics include those relating to genetic testing in children and adolescents, previously reviewed by Botkin et al.70
D. Genomics and Cancer
Despite challenges, genomics has produced a paradigm shift in medicine, especially in the treatment of cancer. Where historically cancer was categorized by the tissue type it affects, it is now increasingly being defined by genetic alterations. The vast breadth of knowledge we have gained from large national and international cancer sequencing efforts, mainly The Cancer Genome Atlas and the International Cancer Genome Consortium, has immeasurably increased our understanding of the genetic mechanisms, molecular subtypes, and heterogeneity of cancers.71,72 These data are easily accessible to the scientific community. Tools like the cBio Portal, for example, allow anyone to query the mutation load of any given gene in all assayed cancer types (Figure 1). Thus, cancer genomics is continuously being translated to clinical settings.73 One such case is recurrent mantle cell lymphoma, for which researchers used an integrative genomics and transcriptomics approach coupled with extensive functional studies to attribute the cause of relapse after ibrutinib treatment to a relapse-specific SNV in the drug target, BTK.74 The therapeutic decision-making pipeline can now incorporate this discovery by testing for this BTK mutation. Similar efforts in a wide variety of cancers have categorized subtypes of cancers based on genetic information, and these classifications are actively used in diagnoses, prognoses, and therapeutics.
A classic success story surrounding the use of genomics in cancer therapy relates to the BRAF inhibitor vemurafenib in metastatic melanoma. Genomic screening of metastatic melanoma patients identified BRAF V600 mutations, which increase the sensitivity of cancer cells to BRAF inhibitors, in half of all patients.75 One of the common challenges of targeted therapies, however, has been the development of resistance, which occurred in cases of melanoma treated with BRAF inhibitors. Combinations of drugs as opposed to monotherapies lower the risk of resistance and relapse. For example, dabrafenib in combination with trametinib prolongs progression-free survival and increases response rates in BRAF V600 melanoma compared to monotherapy.76
Combination therapies often perform better than monotherapies because resistance is less likely to develop. Computational methods for predicting effective drug combinations alleviate the enormous cost of exhaustive experimental testing in every cancer model. Instead, these machine learning methods can use data from cell line assays as training sets and predict successful combinations for genetically defined subtypes that researchers can then test in patient-derived xenograft models.77 Some of the experimental data sets currently available for use in computational models are the NCI 60 cancer cell line and drug screening data,78 NIH’s Library of Integrated Cellular Signatures (LINCS), and the Broad Institute’s connectivity map.79 By modeling drug-gene interactions coupled with the genomic alterations of a patient’s tumor, doctors are now able to predict the efficacy of different chemotherapies or targeted therapies in a personalized manner. These models include not only rule-based decision-tree methods but also more complex computational models. In addition to predicting the efficacy of combination therapies, computational methods for drug repositioning are also continuously gaining popularity and producing effective therapies.80
Because many of these drug development and prediction approaches rely on accurate and detailed patient stratification based on genomic data, clinical samples increasingly undergo whole genome, exome, and transcriptome sequencing, either at the time of collection for rapid turnaround or after banking for future analysis. The vast amount of sequencing data now available has enabled better assessment of prognosis in many cases, although molecular stratification is not new to the sequencing era. By 2000, microarrays were being used for molecular stratification of cancer samples through the identification of gene signatures defining differential survival.81 Unsurprisingly, the advent of NGS methods increased studies in this vein.
Even with applications to all aspects of human health and disease, cancer remains the one disease (really an innumerable collection of diseases) on which genomics has had the biggest impact. Cancer is genetic in nature; cancers arise from the accumulation of inherited and somatic genetic alterations.82 Heterogeneous subpopulations comprising tumors have been experimentally observed through cytogenetic, Sanger sequencing, and NGS experiments.83 As originally proposed by Nowell in 1976, these subpopulations compete with each other for space and resources, and the clones better equipped to survive and proliferate in the tumor microenvironment will progress.84 Genomics enables researchers to assess the compositions of tumors and infer the molecular characteristics of distinct subpopulations.
The main challenge in accurately inferring heterogeneity and clonal evolution is that most tumor-profiling methods involve a bulk sample of cells, effectively masking intratumoral variability. With novel technological developments in single-cell sequencing, we can now measure these subpopulations directly and at an unprecedented resolution. Single-cell sequencing will add a new level to clinical applications of tumor sequencing by increasing the resolution with which we can model complex tumor dynamics and incorporate that into prognosis assessment and drug efficacy prediction (Figure 3). Single-cell RNA-seq in particular has been used to study immune cells,85 breast cancer,86 melanoma circulating tumor cells,87 and glioblastoma.88 Each of these cases revealed new levels of heterogeneity that are undetectable in bulk samples, suggesting that single-cell resolution is necessary to accurately characterize complex tissue samples.
FIG. 3.
Conceptual overview of single-cell sequencing for clinical applications.
An added benefit is that all of these sequencing data are submitted to curated repositories upon publication, such as the database of Genotypes and Phenotypes, the Sequence Read Archive, and the Gene Expression Omnibus. Public data help alleviate the problem of small sample sizes common in clinical settings and/or rare diseases. Researchers interested in any of these data can download them and apply their own analysis. For those unfamiliar with computational and bioinformatics methods, there are also pipelines with graphical user interfaces that facilitate these steps, such as STORMseq,89 Genesifter, Ingenuity variant analysis software, and more. Research is also currently being conducted in software design for use by non-computational clinical scientists.90 Although the data repositories in place serve a much-needed purpose, there are opportunities for better infrastructure, support for IRB approvals, ease of submission, and ease of access.
E. Advances in Genomics Approaches for Neurobiology
The use of molecular stratification with genomic sequencing to guide patient therapy is not limited to cancer drugs. Although less understood, genomic approaches also apply to neurobiology, especially in the study of Alzheimer’s and autism spectrum disorders.91 With large-scale efforts in mapping the human brain using cutting edge brain imaging techniques, high-volume data approaches are becoming increasingly useful in understanding neurodegenerative diseases. Understanding mutations and predispositions to these diseases would allow for early intervention, which is often the only hope for therapy.
F. National and International Personalized Medicine Initiatives
Overall, clinical genomics pervasively affects human health and disease, especially in oncology. Federal policy changes mirror this evolution in our understanding and treatment of cancer, most notably through President Obama’s Precision Medicine Initiative, announced in his 2015 State of the Union Address. This initiative includes increased funding to the National Cancer Institute for researching genomic drivers in cancer and for streamlining the design and testing of targeted therapies based on genetics. Relatedly, the prototypical clinical trial is transforming to reflect a personalized medicine approach, as seen in the success of the IMPACT study and the subsequent IMPACT2 study. Importantly, these changes in clinical genomics are occurring on a global scale, inspiring international cooperation to advance medicine.92–93
We would like to thank the Epigenomics Core Facility at Weill Cornell Medicine. The authors would like to thank the following sources of financial support: Ty Louis Campbell Foundation, Elizabeth’s Hope and the Families of the Children’s Brain Tumor Project, WorldQuant Foundation, the Starr Cancer Consortium grants (I7-A765, I9-A9-071) and funding from the Irma T. Hirschl and Monique Weill-Caulier Charitable Trusts, Bert L and N Kuggie Vallee Foundation, the Pershing Square Sohn Cancer Research Alliance, NASA (NNX14AH50G, 15-15Omni2-0063), the National Institutes of Health (R25EB020393, R01NS076465, R01AI125416, R01ES021006), the Bill and Melinda Gates Foundation (OPP1151054), and the Alfred P. Sloan Foundation (G-2015-13964).
