American Journal of Epidemiology
2017 May 10;186(8):1000–1009. doi: 10.1093/aje/kww224

Human Genome Sequencing at the Population Scale: A Primer on High-Throughput DNA Sequencing and Analysis

Rachel L Goldfeder, Dennis P Wall, Muin J Khoury, John P A Ioannidis, Euan A Ashley*
PMCID: PMC6250075  PMID: 29040395

Abstract

Most human diseases have underlying genetic causes. To better understand the impact of genes on disease and its implications for medicine and public health, researchers have pursued methods for determining the sequences of individual genes, then all genes, and now complete human genomes. Massively parallel high-throughput sequencing technology, where DNA is sheared into smaller pieces, sequenced, and then computationally reordered and analyzed, enables fast and affordable sequencing of full human genomes. As the price of sequencing continues to decline, more and more individuals are having their genomes sequenced. This may facilitate better population-level disease subtyping and characterization, as well as individual-level diagnosis and personalized treatment and prevention plans. In this review, we describe several massively parallel high-throughput DNA sequencing technologies and their associated strengths, limitations, and error modes, with a focus on applications in epidemiologic research and precision medicine. We detail the methods used to computationally process and interpret sequence data to inform medical or preventative action.

Keywords: DNA sequence analysis, DNA sequencing, genetics, genomics, high-throughput sequencing, next-generation sequencing, sequencing technologies


DNA sequence variants are responsible for much of the diversity in human appearances, traits, and disease status. For instance, over 4,500 rare genetic conditions collectively affect 3%–4% of Americans (1). Additionally, many common traits are heritable; obesity, for instance, is approximately 70% heritable (2). Genetic variation can also influence the metabolism of medications, affecting their efficacy and toxicity.

Approaches for identifying genetic contributions to disease have evolved over time. For decades, twin and family studies were the dominant approach to estimation of heritability. In the mid-1980s, twin studies gave way to positional cloning methods that had higher power to detect disease variants, such as a genomic locus responsible for cystic fibrosis (3–5). Shortly thereafter, Sanger sequencing (6) became the primary approach for both small- and large-scale projects, including the Human Genome Project (7). From this 13-year, $3 billion effort, we learned the sequence of the approximately 6.4 billion nucleotide bases spread across 23 pairs of chromosomes that make up the human genome. Only 1% of these bases comprise the approximately 20,000 protein coding genes found in humans (8). The Human Genome Project also resulted in a critical advance for the field: a human reference genome to use in identifying and cataloging clinically meaningful variants. Because the reference genome is haploid, meaning that only 1 copy of each chromosome is represented, and because it is derived from the DNA of several anonymous donors (a majority from 1 African-American individual), the reference genome does not necessarily contain the most common alleles in the population, and it even includes some known disease risk alleles (9).

While Sanger sequencing provided critical advances in the rise of genomics, it is prohibitively expensive and time-consuming to use for sequencing entire human genomes. This reality drove innovators to develop faster, cheaper, and higher-throughput sequencing approaches capable of sequencing billions of bases in only a matter of hours (albeit with a higher error rate than Sanger sequencing) (10). These massively parallel high-throughput technologies, often referred to as next-generation sequencing (NGS), permit sequencing to be performed at truly epidemiologic scales. In fact, a single system today (Illumina's X10; Illumina Inc., San Diego, California) can sequence 18,000 human genomes every year at a cost of approximately $1,000 per genome.

In this review, we describe several NGS technologies and analytical methods and discuss key challenges for the field, with a focus on applications in epidemiologic research and precision medicine.

MASSIVELY PARALLEL HIGH-THROUGHPUT DNA SEQUENCING TECHNOLOGY

There are several NGS platforms, all of which leverage the ability to sequence millions or billions of fragmented pieces of DNA simultaneously. Table 1 lists the characteristics of several popular NGS platforms and their typical uses. Each approach makes trade-offs between frequency and type of errors, throughput, and sequence read (the sequence of nucleotides observed from a DNA fragment) length.

Table 1.

Characteristics (Preparation, Sequencing Method, Results, and Typical Uses) of Several Available DNA Sequencing Platforms

Platform: Illumina (Illumina Inc., San Diego, California)
Preparation for sequencing: DNA is isolated from a sample and fragmented into smaller pieces, adapter sequences are ligated, and then DNA is inserted into a flowcell and clonally amplified to create clusters.
Sequencing method: Reversible terminator technology.
Read length, base pairs: 36, 75, 100, 150, 250, 300.
Raw error rate, %: 0.1–1.
Error mode: Higher error rate at the ends of reads; most errors are substitutions.
Throughput, GB: MiSeq Reagent Kit v2 (2 × 150): 4.5–5.1; HiSeq X: 1,600–1,800.
Time, hours: MiSeq Reagent Kit v2 (2 × 150): 24; HiSeq X: <72.
Typical uses: Cost-effective for whole human genomes or whole human exomes. Strengths in SNV and small INDEL detection. Difficulties in aligning short reads to reference genome.

Platform: Ion Torrent (Thermo Fisher Scientific Inc., Waltham, Massachusetts)
Preparation for sequencing: DNA is isolated from a sample and fragmented into smaller pieces, adapter sequences are ligated, and then DNA is amplified on bead surfaces via emulsion PCR. The beads are added to wells on a semiconductor chip (1 bead per well) for sequencing.
Sequencing method: Semiconductor sequencing.
Read length, base pairs: 200, 400.
Raw error rate, %: 1–2.
Error mode: Most errors are insertions or deletions, particularly in homopolymers.
Throughput, GB: Ion PGM 318 Chip v2: 0.6–2; Ion Proton, Ion PI Chip: approximately 10.
Time, hours: Ion PGM 318 Chip v2: 4–7.5; Ion Proton, Ion PI Chip: 2–4.
Typical uses: Strengths include low start-up costs for machine and reagents and short run time. Used for sequencing whole human genomes or targeted regions.

Platform: Pacific Biosciences (Pacific Biosciences of California, Inc., Menlo Park, California)
Preparation for sequencing: The samples do not need to be amplified before sequencing. DNA is fragmented and transformed to SMRTbell library format, which is a circularized fragment (double-stranded DNA flanked by 2 hairpin loops).
Sequencing method: Single-molecule real-time sequencing.
Read length, base pairs: Selectable up to approximately 10,000–15,000.
Raw error rate, %: 14–15.
Error mode: Errors are stochastic; most errors are insertions or deletions.
Throughput, GB: RS II (P6–C4): 0.5–1; Sequel: 5–10.
Time, hours: RS II (P6–C4): 0.5–4; Sequel: 0.5–4.
Typical uses: Long reads are advantageous for phasing small variants or identifying large structural variants. Sequencing (machine and reagents) is expensive, so typically used for specifically targeted regions of human genomes.

Platform: Oxford Nanopore (Oxford Nanopore Technologies, Oxford, United Kingdom)
Preparation for sequencing: The samples do not need to be amplified before sequencing. DNA is fragmented and adapters are ligated.
Sequencing method: Nanopore sequencing.
Read length, base pairs: Selectable up to approximately 250,000.
Raw error rate, %: 5–40.
Error mode: Errors are stochastic.
Throughput, GB: MinION Mk1B: 5–10; PromethION: 1,400–2,800.
Time, hours: MinION Mk1B: up to 48; PromethION: up to 48.
Typical uses: Portable device is useful for sequencing in the field, but not used for personal genomes yet due to high (and inconsistent) error rate. Default run time is 48 hours, but the user may reduce run time when less output is desired.

Abbreviations: INDEL, insertion or deletion variant; PCR, polymerase chain reaction; PGM, Personal Genome Machine; SMRT, single-molecule real-time; SNV, single-nucleotide variant.

Short-read platforms

Illumina's reversible terminator technology (Illumina Inc.) and Ion Torrent semiconductor sequencing (Thermo Fisher Scientific Inc., Waltham, Massachusetts) are examples of short-read sequencing platforms; the sequence reads produced by these machines range in length from tens of bases to a few hundred bases. Both of these platforms require a DNA amplification step in preparation for sequencing to ensure that a sufficient signal is available for detection by the sequencing device during the sequencing reaction.

Illumina: reversible terminator technology

The Illumina sequencing pipeline begins with inserting DNA fragments into a flowcell and amplifying (copying) them to create “clonal clusters” containing many copies of the same single-stranded DNA fragment. Then, in each sequencing cycle, all 4 types of fluorescently labeled bases compete to bind to the template strand; only the nucleotide that complements the template is incorporated (along with a blocking group that stops the sequencing reaction so that only 1 nucleotide is incorporated per cycle). A laser identifies the incorporated nucleotide based on the fluorescence, and then the blocking group and fluorescent labeling are removed. These steps occur simultaneously for all clusters.

Thermo Fisher Scientific: Ion Torrent semiconductor sequencing

In preparation for Ion Torrent sequencing, DNA fragments are attached to a bead and amplified until the bead is covered. The beads are added to wells of a semiconductor chip (1 bead per well); sequencing takes place in the millions or billions of wells on the chip simultaneously. In each round of sequencing, the chip is flooded with one of the 4 types of nucleotides in a prespecified order. If a nucleotide is incorporated, a hydrogen ion is released into the solution in the well. This results in a change in the pH of the solution (proportional to the number of nucleotides incorporated), which is converted into voltage and detected by the chip.

Long-read platforms

Pacific Biosciences’ single-molecule real-time (SMRT) sequencing (Pacific Biosciences of California, Inc., Menlo Park, California) and Oxford Nanopore's nanopore sequencing (Oxford Nanopore Technologies, Oxford, United Kingdom) are examples of long-read platforms, producing reads containing thousands to hundreds of thousands of base pairs. These approaches do not require input material amplification, which is advantageous because amplification biases can result in errors or nonuniform coverage of the genome. However, in practice, users often choose to amplify starting material before sequencing for applications that require larger amounts of input DNA.

Pacific Biosciences: SMRT sequencing

Pacific Biosciences’ sequencing takes place on a SMRT cell. Each SMRT cell consists of tens of thousands of chambers, each containing a DNA polymerase and a “zero-mode waveguide” (a nanophotonic structure that confines light to a volume small enough to observe a single molecule). In preparation for sequencing, the input DNA is circularized and then added to a SMRT cell chamber as a template for DNA replication. Fluorophore-labeled nucleotides are added to the chambers, and the DNA polymerase incorporates the nucleotide complementing the template DNA. When a nucleotide is incorporated, its fluorophore is detached, emitting a fluorescent light signal that is unique for each type of nucleotide and is recorded through the chamber's zero-mode waveguide.

Oxford Nanopore: nanopore sequencing

Oxford Nanopore's sequencing approach takes place on a flowcell containing an array of wells; each well has a membrane that contains a nanopore. As a DNA molecule passes through the nanopore, the change in electrical current across the nanopore can be measured. Since each nucleotide creates a characteristic change in current, this change in current can be used to determine the DNA strand's nucleotide sequence. Notably, the MinION sequencing device from Oxford Nanopore is similar in size to a USB stick, enabling sequencing from the field rather than requiring a laboratory setting.

Error profiles

NGS affords fast and inexpensive sequencing at the cost of a higher raw error rate compared with Sanger sequencing. To overcome this, researchers sequence each region of the genome multiple times; depth of coverage, the number of times a position is sequenced, can be increased to yield a lower consensus error rate (11) (Figure 1A).

Figure 1.

A) Sequence reads aligned to the reference genome and coverage calculation at 1 genomic location. Depth of coverage is the number of times a genomic locus has been sequenced. The mean coverage over the region is 2.8; the depth of coverage at the circled locus is 3. B) Two reads in a read pair from paired-end sequencing. A DNA molecule of known length is sequenced from both ends. Read A and read B are considered a read pair.
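The depth-of-coverage calculation illustrated in Figure 1A can be sketched directly. Below is a minimal Python example using hypothetical 0-based read coordinates (toy data, not the reads drawn in the figure):

```python
# Sketch: computing depth of coverage from aligned read intervals.
# Reads are hypothetical 0-based [start, end) coordinates on one chromosome.

def depth_of_coverage(reads, region_start, region_end):
    """Return the number of reads covering each position in the region."""
    depth = [0] * (region_end - region_start)
    for start, end in reads:
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth

# Five 10-bp reads tiling a 25-bp region.
reads = [(0, 10), (3, 13), (8, 18), (12, 22), (15, 25)]
depth = depth_of_coverage(reads, 0, 25)
mean_coverage = sum(depth) / len(depth)
print(depth)
print(mean_coverage)  # 2.0 (50 sequenced bases over a 25-bp region)
```

In practice, this quantity is computed from a BAM file by tools rather than by hand, but the definition is exactly this per-position count.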

Because the platforms take different approaches, each has a distinctive error profile. For instance, Illumina's main source of error is substitutions (overall error rate = 0.1%–1%, with >90% of errors being substitutions). Ion Torrent tends to create insertion or deletion errors, particularly in homopolymers, stretches of the genome where the same nucleotide appears consecutively (e.g., AAAAAAA); its overall error rate is 1%–2%, with >70% of errors being insertions or deletions, and accuracy for deletions drops as low as 60% in homopolymers of 6 bases or longer (12, 13). Meanwhile, Pacific Biosciences’ and Oxford Nanopore's approaches both have higher overall error rates (approximately 15% and 5%–40%, respectively). However, these errors, which are primarily insertions and deletions, tend to be random, so they can in theory be overcome by additional sequencing (14).

Despite the known differences among platforms, Ross et al. (13) found commonalities in the error profiles across platforms. For instance, Pacific Biosciences, Ion Torrent, and Illumina all exhibit lower coverage in regions with high or low GC content. However, this can be overcome by improved chemistry optimized for these regions (15).

Experimental design: whole-genome versus targeted sequencing

Rather than sequence a subject's entire genome, many researchers focus only on the exome, the portion of the genome that codes for protein. Although whole-genome sequencing provides sequence information for the noncoding regions of the genome, it is difficult to interpret the effect of variants in these regions. Since the exome is only approximately 1% of the genome, fewer resources (time, computational power, and storage) are needed to analyze exome sequencing data compared with whole-genome sequencing (16). Additionally, most Mendelian disease risk alleles are expected to lie within the exome. Some researchers choose to narrow their search even further by capturing only specific genes known to be involved in their disease of interest.

Exome sequencing and other targeted sequencing approaches do have drawbacks, however, in comparison with whole-genome sequencing (Figure 2). There are upfront costs for capture kits and time required to design target panels, and these approaches provide no data for the large noncoding portion of the genome, which contains variants associated with complex diseases as well as many important regulatory regions (17). Additionally, downstream algorithms to detect copy number variants or structural variants (discussed below) perform poorly with data from targeted sequencing, because current capture methods yield a nonuniform distribution of reads (18, 19).

Figure 2.

Suitability of whole-genome sequencing, whole-exome sequencing, and targeted sequencing for various applications. Lighter shading indicates that the sequencing approach is more suitable for the task, and darker shading indicates that the approach is less suitable. eQTL, expression quantitative trait loci; GWAS, genome-wide association studies.

Due to differences in their error profiles and costs, particular platforms are more commonly used for certain endeavors (Table 1). For instance, long-read sequencing approaches are more expensive; thus, they are often used for targeted sequencing, while short-read approaches are common for whole-genome sequencing.

Currently, the most widespread sequencing approach is paired-end sequencing deployed on Illumina HiSeq machines (2000, 2500, X10, and now X5). In paired-end sequencing, both ends of a DNA fragment are sequenced (Figure 1B). Therefore, paired-end sequencing provides the sequence content of the 2 reads, as well as information about the 2 reads’ relative positions, which is useful during data analysis.

DATA ANALYSIS PIPELINE

An NGS run results in a collection of text files containing each DNA fragment's nucleotide sequence, along with quality scores indicating the probability of error for each base in the sequence. The subsequent data analysis involves ordering the reads, determining the subject's consensus genome sequence, and then interpreting the impact of each genomic variant.

Alignment

Methodology

The goal of alignment is to find the best placement of a sequence read against the reference genome. Algorithms exist to find the optimal alignment between 2 sequences, such as the Needleman and Wunsch (20) algorithm for optimal global alignment and the Smith and Waterman (21) algorithm for optimal local alignment, but their running time scales with the product of the sequence lengths, which makes them prohibitively slow for routine use on human genomes. As a result, optimal alignment methods are not routinely used at genome scale. In fact, one of the most commonly used alignment software programs, BWA, uses a Burrows-Wheeler transform and other heuristics to increase speed and reduce memory requirements, albeit at some cost to sensitivity (22, 23). The current version of BWA, called BWA-MEM, takes a seed-and-extend heuristic approach to alignment. Many alignment tools, including BWA-MEM, leverage information about the expected distance between paired-end reads to find the best alignment for both reads in the pair.
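The Needleman-Wunsch dynamic program can be sketched compactly; the quadratic cost of filling its score matrix is precisely why production aligners fall back on heuristics. A minimal score-only version with illustrative scoring parameters:

```python
# Sketch of the Needleman-Wunsch dynamic program for optimal global alignment.
# Scoring parameters (match/mismatch/gap) are illustrative, not tool defaults.

def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):        # aligning a prefix of a against nothing
        score[i][0] = i * gap
    for j in range(1, cols):        # aligning a prefix of b against nothing
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,                      # match/mismatch
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    return score[-1][-1]

print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7 (perfect match)
print(needleman_wunsch_score("GATTACA", "GATACA"))   # 4 (6 matches, 1 gap)
```

Filling the matrix takes time proportional to len(a) × len(b), which is manageable for a pair of reads but not for billions of reads against a 3-gigabase reference.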

Limitations of alignment

Alignment is particularly complicated by the presence of repeat sequences in the genome, which can cause a single read to align equally well with multiple loci in the reference genome. This is a major problem with whole-genome sequencing, as an estimated 45%–67% of the genome consists of repetitive sequences (7, 24). Additionally, reads that contain large insertions or deletions relative to the reference genome may not align to the correct place in the genome, if at all.

Genotype and variant detection

Alignment is followed by variant calling: the process of determining where the subject's genome differs from the reference. Aligned sequence reads are used to determine the subject's genome sequence. In the following sections, we discuss several classes of human genetic variation and how they are identified from short-read NGS data.

Single-nucleotide variants

A single-nucleotide variant (SNV) is a type of variation where the subject's genome contains a different nucleotide than the reference. In a typical human genome, there are approximately 2.4 million SNVs (25). Common approaches for detecting SNVs from NGS data employ Bayesian methods to determine the most probable genotype at each genomic position given the nucleotides present in the reads aligned with that position (26, 27).
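The Bayesian framing can be sketched for a single position under strong simplifying assumptions (one fixed error rate and toy genotype priors; production callers additionally model per-base quality scores, mapping quality, and allele frequencies):

```python
import math

# A minimal Bayesian genotype sketch for one genomic position.
# Error rate and priors are illustrative, not values used by any real caller.

def genotype_posteriors(pileup, ref, alt, error=0.01, het_prior=1e-3):
    """Posterior probability of each genotype given the aligned bases."""
    priors = {"0/0": 1 - 1.5 * het_prior, "0/1": het_prior, "1/1": het_prior / 2}

    def loglik(gt):
        total = 0.0
        for base in pileup:
            if gt == "0/0":
                p = 1 - error if base == ref else error
            elif gt == "1/1":
                p = 1 - error if base == alt else error
            else:  # heterozygous: each chromosome contributes half the reads
                p = 0.5 if base in (ref, alt) else error
            total += math.log(p)
        return total

    unnorm = {gt: math.log(priors[gt]) + loglik(gt) for gt in priors}
    m = max(unnorm.values())
    weights = {gt: math.exp(v - m) for gt, v in unnorm.items()}
    z = sum(weights.values())
    return {gt: w / z for gt, w in weights.items()}

calls = genotype_posteriors("AAGAGAGG", ref="A", alt="G")
print(max(calls, key=calls.get))  # 0/1 (roughly half the reads carry each allele)
```

With 4 reference and 4 alternate bases, the heterozygous genotype dominates despite its small prior, illustrating how the data overwhelm the prior as coverage grows.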

Small insertions and deletions and large structural variants

Small insertion or deletion variants (INDELs) are sequence variations where the subject has more (insertion) or fewer (deletion) nucleotides than the reference. INDELs are typically 1–5 bases long, though they may be up to 1,000 bases long (Figure 3). There are, on average, roughly 600,000 INDELs in any human subject's genome (25).

Figure 3.

Difference between the reference genome and a genome containing a single-nucleotide variant, insertion, or deletion.

Structural variants refer to more substantial chromosomal abnormalities, such as insertions and deletions larger than 1,000 bases. Other types of structural variants include inversions, where a segment of a chromosome is reversed; translocations, where segments of 2 chromosomes are exchanged; and copy number variants, where a segment of a chromosome has been deleted or duplicated (Figure 4).

Figure 4.

Difference between the reference genome and a genome containing a structural variant, such as a duplication (A), inversion (B), or translocation (C). Chromosome(s) to the left of each arrow depict the reference genome, and chromosome(s) to the right of each arrow depict the variant.

Common approaches to detecting small INDELs or larger structural variants employ split-read, read-pair, or local de novo assembly techniques. Split-read methods break reads into several smaller pieces and align the smaller pieces. Variants are then called when the smaller pieces do not align with the reference genome in consecutive positions. Read-pair methods (for paired-end sequencing) identify INDELs by searching for genomic regions where read pairs align significantly further apart (or closer together) than expected (28). Local de novo assembly methods use reads that aligned with a candidate region to develop a consensus sequence for the region based on the overlap of the reads themselves (i.e., without making use of a reference genome). Then, variants are called on the basis of differences between the region's consensus sequence and the reference genome.
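The read-pair signal can be sketched as a simple outlier test on insert sizes. The library mean and standard deviation below are illustrative, and real callers require multiple supporting pairs before emitting a call:

```python
# Sketch of the read-pair INDEL signal: flag read pairs whose insert size
# deviates strongly from the sequencing library's expected distribution.
# Mean, standard deviation, and threshold here are illustrative values.

def discordant_pairs(insert_sizes, mean=400, sd=50, threshold=3):
    """Return (index, inferred event) for pairs > threshold SDs from the mean."""
    flagged = []
    for i, size in enumerate(insert_sizes):
        z = (size - mean) / sd
        if abs(z) > threshold:
            # Pairs mapping too far apart suggest a deletion between the reads;
            # pairs mapping too close together suggest an insertion.
            flagged.append((i, "deletion" if z > 0 else "insertion"))
    return flagged

sizes = [390, 410, 405, 980, 395, 120]
print(discordant_pairs(sizes))  # [(3, 'deletion'), (5, 'insertion')]
```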

Copy-number variants are a type of structural variant in which the subject's genome has a different number of copies of a particular genomic segment than the reference genome. A common approach for detecting copy-number variants involves identifying regions with a read depth that is significantly higher or lower than average. The underlying theory is that if the subject's genome has additional copies of a genomic region, sequence reads will be generated from all copies of the region, and then reads from all copies will align to the single copy that is present in the reference genome. Therefore, the read depth at that reference genome locus will appear inflated (or deflated for missing copies). Twelve percent of the genome is thought to be copy-number-variable (29).
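The read-depth intuition can be sketched by comparing windowed depth against the expected genome-wide coverage. Window depths and the gain/loss thresholds below are illustrative; real tools also correct for GC content and mappability before calling:

```python
# Read-depth CNV sketch: windows with depth well above the expected coverage
# suggest duplications; windows well below suggest deletions.
# Thresholds and depths are illustrative toy values.

def call_cnv_windows(window_depths, expected_depth, gain=1.5, loss=0.5):
    """Return (window index, call) for windows exceeding the gain/loss ratios."""
    calls = []
    for i, depth in enumerate(window_depths):
        ratio = depth / expected_depth
        if ratio >= gain:
            calls.append((i, "duplication"))
        elif ratio <= loss:
            calls.append((i, "deletion"))
    return calls

depths = [30, 31, 29, 61, 60, 30, 14, 15, 31]  # ~30x genome-wide coverage
print(call_cnv_windows(depths, expected_depth=30))
# [(3, 'duplication'), (4, 'duplication'), (6, 'deletion'), (7, 'deletion')]
```

Windows at roughly twice the expected depth are consistent with one extra copy on one chromosome; windows at roughly half are consistent with a heterozygous deletion.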

Another substantial chromosomal abnormality, called aneuploidy, can occur when an individual's genome does not contain 2 of each chromosome. Autosomal aneuploidies typically do not support life; however, there are some well-known survivable aneuploidies, such as trisomy 21, which results in Down syndrome.

Limitations of variant detection

Current algorithms for identifying SNVs perform very well, but analyses for other forms of variation remain immature (30). For instance, Dewey et al. (25) recently compared variant calls from 2 sequencing platforms for 9 individuals and found SNV calls to be highly consistent, averaging 99%–100% concordance. However, INDEL calls agreed only 53%–59% of the time (25). While short-read sequencing approaches are more popular for identifying SNVs and short INDELs, long-read sequencing approaches are beneficial for identifying larger structural variants.

Variant prioritization

The initial list of putative variants represents the starting point of the investigation. This list will be trimmed (i.e., filters exclude low-quality variants) and prioritized to identify variants influencing or associated with the phenotype.

Associating genetic variation with phenotype

Case-control sequencing studies

Genetic differences between groups of cases and controls can be associated with traits using statistical approaches. For instance, basic allelic association analyses for a single position employ χ2 or Fisher's exact tests to test for significant differences in allele frequencies between cases and controls. Modifications of this approach test for significant differences in counts of genotypes or even groups of genotypes based on the expected inheritance model. Additionally, Cochran-Armitage tests (and others) for trend may be employed when an ordered effect is expected, such as a more severe phenotype for a homozygous variant than for a heterozygous variant. These analyses can easily be performed genome-wide, with appropriate adjustments for multiple testing, using freely available software like PLINK (31).
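A basic allelic association test can be sketched as a 1-degree-of-freedom chi-square test on a 2 × 2 table of allele counts. The counts below are toy data, and a genome-wide analysis would also need the multiple-testing adjustment described above:

```python
import math

# Sketch of a basic allelic association test: chi-square (1 df) on a 2 x 2
# table of case/control x reference/alternate allele counts. Toy counts.

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Return (chi-square statistic, p-value) for the 2 x 2 allele-count table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = allelic_chi_square(case_alt=60, case_ref=140, control_alt=30, control_ref=170)
print(round(chi2, 2), round(p, 5))  # chi2 ~ 12.9, p < 0.001
```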

Interestingly, in the Exome Sequencing Project, which focused on discovering genes contributing to heart, lung, and blood disorders, Tennessen et al. (32) found that 86% of detected SNVs were present in fewer than 0.5% of the 2,440 individuals sequenced as part of their project, as were 95.7% of SNVs predicted to be of functional importance. Traditional association tests for common variants are underpowered for such rare variants. Therefore, rare variant association tests, such as burden tests, provide an approach to finding associations between groups of rare variants—combined by gene or other genomic region—and phenotypes (33–37). Simple burden tests collapse all rare variants in a genomic region into 1 representative count and then test for a difference between cases and controls (38). These methods are particularly useful when the variation affecting the phenotype acts in the same direction (i.e., deleterious or protective). More advanced burden and nonburden association tests weight variants on the basis of allele frequency or variant type, allow variants within the same gene to have a different direction of effect, or incorporate adjustments for clinical covariates (39–41).
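A simple burden test of this kind can be sketched by collapsing each individual's rare variants in a gene into a single carrier indicator and comparing carrier counts between cases and controls, here with a one-sided Fisher exact test (genotype data are toy values, not from the studies cited above):

```python
from math import comb

# Burden-test sketch: each individual is a list of rare-variant allele counts
# for one gene; anyone with at least 1 rare allele is a "carrier".
# Data and gene are hypothetical.

def carrier_count(genotypes):
    """Number of individuals carrying at least 1 rare allele in the gene."""
    return sum(1 for person in genotypes if any(count > 0 for count in person))

def fisher_exact_one_sided(a, b, c, d):
    """P(>= a carriers among cases) under the hypergeometric null."""
    n, cases, carriers = a + b + c + d, a + b, a + c
    return sum(
        comb(cases, x) * comb(n - cases, carriers - x)
        for x in range(a, min(cases, carriers) + 1)
    ) / comb(n, carriers)

cases = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
controls = [[0, 0], [0, 0], [1, 0], [0, 0], [0, 0], [0, 0]]
a = carrier_count(cases)       # carriers among cases
c = carrier_count(controls)    # carriers among controls
p = fisher_exact_one_sided(a, len(cases) - a, c, len(controls) - c)
print(a, c, round(p, 3))
```

Collapsing in this way trades per-variant resolution for power, which is why it works best when the rare variants in a gene act in the same direction.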

Family sequencing studies

In family sequencing studies, variants that segregate with the phenotype in the family are prioritized. For example, the transmission-disequilibrium test leverages genetic information from a sample consisting of families (parents and 1 or more affected children) to test a locus for association with a phenotype by testing whether a particular allele is transmitted from a heterozygous parent to an affected child more often than would be expected by chance (42). Other tests do not require parental genotype information, such as the discordant-alleles test, which uses genotype information from 1 affected sibling and 1 unaffected sibling to identify alleles associated with the trait (43). Family-based association tests aim to be robust against spurious associations due to population stratification.
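The transmission-disequilibrium test reduces to a McNemar-style chi-square statistic on transmissions from heterozygous parents: with b transmissions and c non-transmissions of the candidate allele, (b − c)²/(b + c) is chi-square with 1 df under the null. A sketch with illustrative counts:

```python
import math

# Transmission-disequilibrium test sketch. Counts are illustrative:
# among heterozygous parents, the candidate allele was transmitted to an
# affected child 45 times and not transmitted 22 times.

def tdt(transmitted, not_transmitted):
    """Return (chi-square statistic, p-value) for the TDT."""
    chi2 = (transmitted - not_transmitted) ** 2 / (transmitted + not_transmitted)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival, 1 df
    return chi2, p_value

chi2, p = tdt(transmitted=45, not_transmitted=22)
print(round(chi2, 2), round(p, 4))
```

Because both counts come from the same heterozygous parents, the test is internally matched, which is what makes it robust to population stratification.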

Genomic data from family members of a proband can be leveraged for other important purposes. For instance, sequencing data from a proband's parents enables identification of de novo variants (variants that are present in the proband but not the proband's parents) (44), which are expected to be key in sporadic diseases. It is estimated that the average genome contains approximately 74 de novo SNVs (44). More commonly, variants are inherited from parent to offspring. As such, parental genotype information can be used to determine which variants are on the same chromosome, an analysis known as phasing. Phasing can increase the quality of genotype calls in complex regions of the genome (45–48) and can provide insight into variant impact when cis (same chromosome) and trans (different chromosome) variants have different impacts.

Variant interpretation

Candidate variants are further characterized and prioritized by prediction algorithms and database annotations, including predicted functional effect, population allele frequency, and previous knowledge of gene function.

Functional effect prediction

Variants in the coding regions of the genome, where groups of 3 nucleotides code for protein, are described by the impact they have on the resulting amino acid. An INDEL or structural variant within the protein-coding regions of the genome causes a frameshift mutation when the number of nucleotides inserted or deleted is not a multiple of 3. Frameshift mutations alter the downstream string of amino acids, dramatically changing that portion of the protein (Figure 5). A coding SNV is nonsynonymous if it changes the amino acid and synonymous if it does not. Computational approaches have been developed for predicting the functional effect of nonsynonymous variants based on the resulting amino acid change (49, 50). The degree of deleteriousness is estimated on the basis of the chemical structure of the new amino acid, as well as how well the amino acid sequence is conserved in closely related species.

Figure 5.

Impact of coding-region DNA variation on the resulting protein. Single-nucleotide variants may cause synonymous or nonsynonymous changes. Insertions and deletions involving a number of bases that is not a multiple of 3 cause a frameshift mutation, disrupting the triplet reading frame and usually resulting in a truncated or otherwise altered protein product.
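The reading-frame logic in Figure 5 can be made concrete with a toy translation function. The codon table below is a small subset of the standard genetic code (only the codons used in the example), so this is an illustration rather than a complete translator:

```python
# Frameshift illustration. CODON_TABLE is a small subset of the standard
# genetic code (the full code has 64 codons); sequences are toy examples.
CODON_TABLE = {
    "ATG": "Met", "AAA": "Lys", "GCC": "Ala", "TGG": "Trp",
    "GAA": "Glu", "GGC": "Gly", "CTG": "Leu", "TGA": "Stop",
}

def translate(dna):
    """Translate a coding DNA string codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "???")
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return "-".join(protein)

print(translate("ATGAAAGCCTGG"))     # original: Met-Lys-Ala-Trp
print(translate("ATGAAAGAAGCCTGG"))  # in-frame 3-base insertion: Met-Lys-Glu-Ala-Trp
print(translate("ATGAAAGGCCTGG"))    # 1-base insertion (frameshift): Met-Lys-Gly-Leu
```

The 3-base insertion adds one amino acid but preserves the downstream protein, while the 1-base insertion scrambles every codon after the insertion point.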

Most variants occur in the 99% of the genome that does not code for protein; the functional importance of these noncoding variants is more difficult to predict. The Encyclopedia of DNA Elements (ENCODE) Project characterized the noncoding regions, including areas important for DNA regulation (17), and these annotations can be used to predict the impact of noncoding variants (51).

Allele frequency estimation

Population allele frequencies can be used to prioritize variants. For example, variants that are very common in the population may be considered less likely to cause rare conditions. Large sequencing projects such as the Exome Sequencing Project, the Exome Aggregation Consortium (52), and the 1000 Genomes Project contribute catalogs of genetic variation that can be used to estimate population allele frequencies, as well as new methods, tools, and file standards (32, 53–57). For instance, Exome Aggregation Consortium databases contribute aggregated exome sequencing data from 60,706 unrelated individuals sequenced as part of various disease-specific and population genetic studies. Additionally, the pilot phase of the 1000 Genomes Project, which cataloged SNVs, INDELs, and structural variants with a focus on understanding how genetics impact disease (58–60), established some of the first tools for population genotype calling to overcome challenges associated with low-coverage sequencing (61).

Gene-phenotype association

Several databases, including Online Mendelian Inheritance in Man (http://www.omim.org/) and ClinVar (62), catalog the association between gene (or variant) and phenotype. Many of these associations come from case-control studies, described above. Information from these databases aids variant interpretation; for instance, variants in genes that have been previously implicated with a phenotype of interest can be given higher priority.

Limitations of variant prioritization

There are many challenges in identifying disease-causing variants. Candidate variants, those that are rare and predicted to be deleterious, are enriched for variants that are likely to be errors. Additionally, interpreting the functional impact of any variant is challenging, as public information is frequently conflicting or incorrect: Approximately 25% of all variants within variant databases are misclassified (25). Even after substantial algorithmic prioritization, a subjective (and laborious) manual literature curation process (25) is required in order to interpret the meaning of the top candidate variant findings and to choose which variants to pursue for further experimental validation and functional follow-up studies.

BIG DATA CHALLENGES

The cost of DNA sequencing itself is rapidly plummeting, encouraging researchers to generate sequence data at an unprecedented rate. At this point, generation of sequence data is no longer the major obstacle; rather, it is the computational power and storage needed to analyze them. Storing the data files for the 18,000 whole human genomes that an Illumina X10 can produce annually requires approximately 5.4 petabytes. As of January 2017, more than 500,000 individuals have had their genomes sequenced; that number will continue to rise in the coming years (63, 64).

DNA sequence analysis is highly parallelizable; therefore, many research groups choose to use cloud computing for their analysis rather than local computer clusters. Software packages (and even entire companies) have been designed to optimize the parallelization of these tasks (65, 66).
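The parallelism described above arises because each chromosome (or genomic shard) can be analyzed independently and the results merged afterward. The sketch below illustrates the pattern with Python's standard library; `call_variants` is a hypothetical stand-in for a real variant caller, and production pipelines would distribute such shards across cluster or cloud workers rather than local threads.

```python
# Illustrative sketch of shard-level parallelism in sequence analysis:
# each chromosome is processed independently, then results are merged.
from concurrent.futures import ThreadPoolExecutor

def call_variants(chromosome):
    """Placeholder per-chromosome analysis; returns a (chrom, result) pair."""
    return (chromosome, f"variants_for_{chromosome}")

chromosomes = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(call_variants, chromosomes))
```

Because the shards share no state, the same pattern scales from a laptop thread pool to hundreds of cloud instances, which is what the workflow tools cited above (65, 66) orchestrate.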

OPPORTUNITIES AND FUTURE DIRECTIONS

President Barack Obama highlighted the importance of precision medicine through genomics in his 2015 State of the Union address (67). The process of subcategorizing diseases using genomics and then targeting those subcategories with specific therapies has already begun (68, 69). Population-scale data will be needed to extend genomics-assisted precision medicine to both Mendelian and complex diseases.

In addition, as sequencing technologies continue to move towards longer reads, error rates in alignment and variant discovery, particularly for INDELs and structural variants, will drastically decrease. Further, variant prioritization and interpretation will improve as more genomes are sequenced and shared publicly, thereby allowing researchers to learn more about how genetic variants affect human health and disease.

Despite widespread enthusiasm, the full potential of genomics-backed precision medicine cannot yet be gauged (70–72). Technical, diagnostic, and therapeutic challenges remain, limiting our ability to predict how genome sequencing will ultimately affect medicine. Certainly, the utility of genomic information should not be taken for granted, and we should continue to rigorously assess outcomes for patients and families. This may not always be straightforward, since certain outcomes are hard to quantify, for example, the value to a family of knowing the cause of long-unexplained symptoms. An emphasis not just on technological advancement but also on innovation in the measurement of outcomes will maximize our ability to translate the exciting advances described here to the bedsides of our patients.

CONCLUSION

Next-generation sequencing has given us the ability to study the entire human genome at a lower cost and faster speed than ever before. Remaining challenges include improvement of sequencing technology to limit errors; continued development of analysis algorithms; and identification of the appropriate opportunities and clinical or prevention settings where these technologies can maximize the potential benefits for patients and populations.

ACKNOWLEDGMENTS

Author affiliations: Biomedical Informatics Training Program, School of Medicine, Stanford University, Stanford, California (Rachel L. Goldfeder); Division of Systems Medicine, Department of Pediatrics, School of Medicine, Stanford University, Stanford, California (Dennis P. Wall); Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, California (Dennis P. Wall); Office of Public Health Genomics, Centers for Disease Control and Prevention, Atlanta, Georgia (Muin J. Khoury); and Department of Medicine, School of Medicine, Stanford University, Stanford, California (John P. A. Ioannidis, Euan A. Ashley).

R.L.G. was supported by National Library of Medicine training grant T15 LM7033; grant U01FD004979 from the Food and Drug Administration, which supports the UCSF-Stanford Center of Excellence in Regulatory Sciences and Innovation; and a National Science Foundation graduate research fellowship.

The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the US Department of Health and Human Services or the Food and Drug Administration.

Conflict of interest: none declared.

REFERENCES

1. Yaneva-Deliverska M. Rare diseases and genetic discrimination. J IMAB Annu Proc. 2011;17(1):116–119.
2. Walley AJ, Blakemore AI, Froguel P. Genetics of obesity and the prediction of risk for health. Hum Mol Genet. 2006;15(suppl 2):R124–R130.
3. Riordan JR, Rommens JM, Kerem B, et al. Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science. 1989;245(4922):1066–1073.
4. Kerem B, Rommens JM, Buchanan JA, et al. Identification of the cystic fibrosis gene: genetic analysis. Science. 1989;245(4922):1073–1080.
5. Rommens JM, Iannuzzi MC, Kerem B, et al. Identification of the cystic fibrosis gene: chromosome walking and jumping. Science. 1989;245(4922):1059–1065.
6. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA. 1977;74(12):5463–5467.
7. Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860–921.
8. Pruitt K, Brown G, Tatusova T, et al. Chapter 18: the Reference Sequence (RefSeq) database. In: The NCBI Handbook [Internet]. Bethesda, MD: National Center for Biotechnology Information, National Library of Medicine; 2002. http://www.ncbi.nlm.nih.gov/books/NBK21091/. Accessed November 24, 2016.
9. Chen R, Butte AJ. The reference human genome demonstrates high risk of type 1 diabetes and other disorders. Pac Symp Biocomput. 2011:231–242.
10. National Human Genome Research Institute. The cost of sequencing a human genome. http://www.genome.gov/sequencingcosts/. Updated July 6, 2016. Accessed November 24, 2016.
11. Ajay SS, Parker SC, Abaan HO, et al. Accurate and comprehensive sequencing of personal genomes. Genome Res. 2011;21(9):1498–1505.
12. Loman NJ, Misra RV, Dallman TJ, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol. 2012;30(5):434–439.
13. Ross MG, Russ C, Costello M, et al. Characterizing and measuring bias in sequence data. Genome Biol. 2013;14(5):R51.
14. Carneiro MO, Russ C, Ross MG, et al. Pacific Biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics. 2012;13(1):375.
15. Patwardhan A, Harris J, Leng N, et al. Achieving high-sensitivity for clinical applications using augmented exome sequencing. Genome Med. 2015;7(1):71.
16. Bamshad MJ, Ng SB, Bigham AW, et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011;12(11):745–755.
17. ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004;306(5696):636–640.
18. Karakoc E, Alkan C, O'Roak BJ, et al. Detection of structural variants and indels within exome data. Nat Methods. 2012;9(2):176–178.
19. Goldfeder RL, Ashley EA. A precision metric for clinical genome sequencing [preprint]. 2016. http://www.biorxiv.org/content/early/2016/05/24/051490. Accessed November 28, 2016.
20. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48(3):443–453.
21. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195–197.
22. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–1760.
23. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM [preprint]. 2013. http://arxiv.org/pdf/1303.3997v2.pdf. Accessed November 28, 2016.
24. de Koning AP, Gu W, Castoe TA, et al. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 2011;7(12):e1002384.
25. Dewey FE, Grove ME, Pan C, et al. Clinical interpretation and implications of whole-genome sequencing. JAMA. 2014;311(10):1035–1045.
26. Teer JK, Bonnycastle LL, Chines PS, et al. Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Res. 2010;20(10):1420–1431.
27. DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491–498.
28. Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011;12(5):363–376.
29. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature. 2006;444(7118):444–454.
30. O'Rawe J, Jiang T, Sun G, et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med. 2013;5(3):28.
31. Purcell S, Neale B, Todd-Brown K, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–575.
32. Tennessen JA, Bigham AW, O'Connor TD, et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337(6090):64–69.
33. Lee S, Abecasis GR, Boehnke M, et al. Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet. 2014;95(1):5–23.
34. Price AL, Kryukov GV, de Bakker PI, et al. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010;86(6):832–838.
35. Asimit JL, Day-Williams AG, Morris AP, et al. ARIEL and AMELIA: testing for an accumulation of rare variants using next-generation sequencing data. Hum Hered. 2012;73(2):84–94.
36. Asimit J, Zeggini E. Rare variant association analysis methods for complex traits. Annu Rev Genet. 2010;44:293–308.
37. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–321.
38. Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res. 2007;615(1-2):28–56.
39. Wu MC, Lee S, Cai T, et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.
40. Neale BM, Rivas MA, Voight BF, et al. Testing for an unusual distribution of rare variants. PLoS Genet. 2011;7(3):e1001322.
41. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5(2):e1000384.
42. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52(3):506–516.
43. Boehnke M, Langefeld CD. Genetic association mapping based on discordant sib pairs: the discordant-alleles test. Am J Hum Genet. 1998;62(4):950–961.
44. Veltman JA, Brunner HG. De novo mutations in human genetic disease. Nat Rev Genet. 2012;13(8):565–575.
45. Dewey FE, Chen R, Cordero SP, et al. Phased whole-genome genetic risk in a family quartet using a major allele reference sequence. PLoS Genet. 2011;7(9):e1002280.
46. Roach JC, Glusman G, Smit AF, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328(5978):636–639.
47. Tewhey R, Bansal V, Torkamani A, et al. The importance of phase information for human genomics. Nat Rev Genet. 2011;12(3):215–223.
48. Yang H, Chen X, Wong WH. Completely phased genome sequencing through chromosome sorting. Proc Natl Acad Sci USA. 2011;108(1):12–17.
49. Adzhubei IA, Schmidt S, Peshkin L, et al. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7(4):248–249.
50. Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc. 2009;4(7):1073–1081.
51. Schaub MA, Boyle AP, Kundaje A, et al. Linking disease associations with regulatory information in the human genome. Genome Res. 2012;22(9):1748–1759.
52. Lek M, Karczewski KJ, Minikel EV, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285–291.
53. Emond MJ, Louie T, Emerson J, et al. Exome sequencing of extreme phenotypes identifies DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis. Nat Genet. 2012;44(8):886–889.
54. Krumm N, Sudmant PH, Ko A, et al. Copy number variation detection and genotyping from exome sequence data. Genome Res. 2012;22(8):1525–1532.
55. Norton N, Robertson PD, Rieder MJ, et al. Evaluating pathogenicity of rare variants from dilated cardiomyopathy in the exome era. Circ Cardiovasc Genet. 2012;5(2):167–174.
56. Boileau C, Guo DC, Hanna N, et al. TGFB2 mutations cause familial thoracic aortic aneurysms and dissections associated with mild systemic features of Marfan syndrome. Nat Genet. 2012;44(8):916–921.
57. Regalado ES, Guo DC, Villamizar C, et al. Exome sequencing identifies SMAD3 mutations as a cause of familial thoracic aortic aneurysm and dissection with intracranial and other arterial aneurysms. Circ Res. 2011;109(6):680–686.
58. 1000 Genomes Project Consortium, Abecasis GR, Auton A, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65.
59. Sudmant PH, Rausch T, Gardner EJ, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526(7571):75–81.
60. 1000 Genomes Project Consortium, Auton A, Brooks LD, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.
61. 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–1073.
62. Landrum MJ, Lee JM, Benson M, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 2016;44(D1):D862–D868.
63. Herper M. Illumina promises to sequence human genome for $100—but not quite yet. https://www.forbes.com/sites/matthewherper/2017/01/09/illumina-promises-to-sequence-human-genome-for-100-but-not-quite-yet/#20099e76386d. Published January 9, 2017. Accessed August 3, 2017.
64. Stephens ZD, Lee SY, Faghri F, et al. Big data: astronomical or genomical? PLoS Biol. 2015;13(7):e1002195.
65. Gafni E, Luquette LJ, Lancaster AK, et al. COSMOS: Python library for massively parallel workflows. Bioinformatics. 2014;30(20):2956–2958.
66. Souilmi Y, Lancaster AK, Jung JY, et al. Scalable and cost-effective NGS genotyping in the cloud. BMC Med Genomics. 2015;8(1):64.
67. Ashley EA. The precision medicine initiative: a new national effort. JAMA. 2015;313(21):2119–2120.
68. Lindeman NI, Cagle PT, Beasley MB, et al. Molecular testing guideline for selection of lung cancer patients for EGFR and ALK tyrosine kinase inhibitors: guideline from the College of American Pathologists, International Association for the Study of Lung Cancer, and Association for Molecular Pathology. J Thorac Oncol. 2013;8(7):823–859.
69. Brodlie M, Haq IJ, Roberts K, et al. Targeted therapies to improve CFTR function in cystic fibrosis. Genome Med. 2015;7:101.
70. Khoury MJ, Ioannidis JP. Big data meets public health. Science. 2014;346(6213):1054–1055.
71. Khoury MJ, Evans JP. A public health perspective on a national precision medicine cohort: balancing long-term knowledge generation with early health benefit. JAMA. 2015;313(21):2117–2118.
72. Joyner MJ, Paneth N. Seven questions for personalized medicine. JAMA. 2015;314(10):999–1000.
