Author manuscript; available in PMC: 2015 May 19.
Published in final edited form as: Per Med. 2014;11(5):523–544. doi: 10.2217/pme.14.34

The road from next-generation sequencing to personalized medicine

Manuel L Gonzalez-Garay 1
PMCID: PMC4437232  NIHMSID: NIHMS626136  PMID: 26000024

Abstract

Moving from a traditional medical model of treating pathologies to an individualized, predictive and preventive model of personalized medicine promises to reduce costs for an overburdened and overwhelmed healthcare system. Next-generation sequencing (NGS) has the potential to accelerate the early detection of disorders and the identification of pharmacogenetic markers to customize treatments. This review explains the historical developments that led to NGS, along with the strengths and weaknesses of the technology, with special emphasis on the analytical aspects used to process NGS data. Solutions exist for every step required to perform NGS in the clinical context, and most of them are very efficient, but a few crucial steps in the process need immediate attention.

Keywords: CADD, functional prediction program, genomics, GWAVA, NGS, personalized medicine, workflow management system


The current medical model focuses on the detection and treatment of pathologies. Treating disorders, especially in advanced stages, is very expensive for patients and for society in general. Screening for five of the most common disorders in the USA (cardiovascular disorders, stroke, cancer, chronic obstructive pulmonary disease and diabetes) could save millions of lives and reduce the healthcare deficit [1]. Tailoring drug therapies by practicing personalized medicine (PM) has the potential to improve the treatment of cancer and to save lives by preventing drug-related fatalities. A new technology, next-generation sequencing (NGS), has the potential to accelerate the early detection of disorders and to detect pharmacogenetic markers that can be used to customize treatments [2].

Initial work to generate the human genome template

In 1977, the Nobel laureate Frederick Sanger developed the ‘dideoxy’ chain-termination method, coupled with electrophoretic size separation, for sequencing DNA molecules [3]. Sanger sequencing, as it is known today, started with low efficiency and high cost, but thanks to the work of a large number of scientists the cost of sequencing dropped dramatically, reaching US$0.0024/base by the mid-1990s [4]. The Human Genome Project started in 1990, after the scientific community recognized the urgent need for a complete map of the human genome. The project lasted 13 years, cost an astronomical US$3 billion and involved thousands of international scientists [5]. The Human Genome Project transformed molecular biology by eliminating the need to individually clone and sequence genes of interest. During this period, there was ferocious competition between the International Human Genome Sequencing Consortium (IHGSC), under the direction of Francis Collins (MD, USA), head of the National Human Genome Research Institute at the NIH, and the private sector (Celera [CA, USA]), headed by Craig Venter (MD, USA). Both groups published the first drafts of their human genome assemblies in 2001: the IHGSC published its sequence on 15 February [6], while Venter published on 16 February [7]. Venter’s group used a whole-genome shotgun approach, while the IHGSC used an independent bacterial artificial chromosome (BAC)-by-BAC approach. We now know that both first drafts contained mistakes; there were hundreds of thousands of gaps and misassembled regions in both drafts [8].

It took 3 years for the IHGSC sequencing centers to finish filling the gaps in the draft. The finished version of the human assembly was published by the National Center for Biotechnology Information (NCBI) as NCBI build 35, also known as hg17 [9]. At the time of this writing, three subsequent versions have been released. The Genome Reference Consortium (GRC) is the organization now in charge of maintaining genome assemblies; the latest version of the human assembly, known as GRCh38, was released on 24 December 2013. However, the majority of sequencing groups still use GRCh37 (hg19), since it takes time and effort to migrate all the previously generated genomes to the new assembly.

Annotating the first human genome

Before and during the release of the first human genome assembly, thousands of scientists produced information about the structure and function of individual genes. Projects like the expressed sequence tag (EST) project generated millions of short subsequences of cDNA sequences. The EST project revealed the presence of thousands of genes and provided valuable information about alternative splice variants of genes [10,11]. During this period, bioinformaticians developed programs to scan the human genome assemblies for potential new genes. The IHGSC selected three gene-prediction programs to scan the human assemblies: Genscan [12], developed by Burge et al., which identifies complete gene structures, including exon–intron boundaries, using a general probabilistic model of gene structure and GC composition; Genie [13], a gene-prediction program based on generalized Hidden Markov models and originally developed for the Drosophila genome; and FGENES [14], a commercial program developed by Softberry, Inc. (NY, USA). The predicted gene models are continually validated against biological data from well-annotated databases.

With the release of the first human genome, a group of human geneticists became interested in generating a map of human genetic variation, or haplotype map (HapMap). For the international HapMap project, four populations were selected, with a total of 270 people. Two populations consisted of trios (a father, a mother and an adult child): the Yoruba people of Ibadan, Nigeria, provided 30 trios, and the USA provided 30 trios from US residents with northern and western European ancestry (Centre d’Étude du Polymorphisme Humain [CEPH]). The remaining two populations consisted of unrelated individuals: Japan provided 45 samples and China provided another 45 samples [15]. By 2005, approximately 1 million variants had been genotyped and their linkage disequilibrium patterns characterized in Phase I of the project [16]. A second set of results was published in 2007, in which more than 3 million variants were identified and characterized [17]. During the third phase of the HapMap project, additional samples were genotyped, increasing the total number of samples to 1301 from a variety of human populations [18]. For a more detailed review of the HapMap project and its impact on the discovery of SNPs associated with common diseases, see Manolio et al. [19]. The information generated by the HapMap project, including allele frequencies, has been incorporated into the public catalog of variant sites in the Database of SNPs (dbSNP) [20].

The birth of the NGS technology

The next logical objective after the human genome was finished was to sequence the diploid genome of a single person. The main obstacle was that Sanger sequencing technology was expensive and slow. These arguments did not stop Venter from sequencing his own genome: in September 2007, Venter published the first diploid human genome (called ‘HuRef’) [21]. The HuRef genome was the most expensive personal genome in history (US$100 million).

On the other hand, visionaries like Jay Shendure (WA, USA) and George Church (MA, USA) concentrated their efforts on developing faster and more economical technologies. Church’s group developed the first multiplex sequencing technology (polony sequencing), which combined the use of emulsion PCR, ligation and four-color imaging [22]. The sequencing machine, named the Polonator, was a low-cost instrument (US$170,000) [23].

Rothberg (CT, USA) developed an alternative sequencing technology based on miniaturized pyrosequencing reactions that run in parallel [24]. The technology captures the signals using charge-coupled device (CCD) camera-based imaging [25]. The final product was marketed as 454 technology, and it was quickly used to sequence multiple organisms, including bacteria. In 2008, the entire genome of James Watson was sequenced using 454 technology [26]. Watson’s genome was sequenced in a record time of 4 months at a cost of US$1,500,000 [27]. After 454 was sold to Roche (Basel, Switzerland) and Rothberg departed, there was no significant improvement in the technology, and in October 2013 Roche shut down 454.

Life Technologies (CA, USA) developed a sequencing system that borrowed the chemistry used by polony sequencing [28]. The machines were commercialized under the name SOLiD. SOLiD instruments allowed the sequencing of whole genomes at a lower price of US$100,000. The first genome sequenced using SOLiD technology was that of Lupski, a geneticist from Baylor College of Medicine (TX, USA) [29]. Even though SOLiD was the most accurate sequencing technology, the major obstacles to its acceptance were the complexity of analyzing color-space data and the large amount of computational resources required for the analysis. In addition, the read length was very short, 50 bp, in comparison with Illumina® (CA, USA), which normally generates reads over 100 bp for each side of every fragment (using the paired-end mode).

A fourth sequencing company, Solexa, emerged from the Cambridge Chemistry Department, with offices in Chesterford (UK) and Hayward (CA, USA). Solexa’s technology differed from the existing NGS technologies: it was based on clonal arrays and massively parallel sequencing of short reads using solid-phase sequencing with reversible terminators. The first machine was commercialized under the name Genome Analyzer and became available in 2006. Solexa was acquired by Illumina in early 2007. Illumina eventually became the predominant sequencing technology, thanks to its aggressive marketing, the simplicity of its technology and its constant efforts to improve it [30,31].

DNA nanoball sequencing is a technology developed by Complete Genomics, Inc. (CGI; CA, USA) [32]. CGI’s business strategy was different from that of other companies. Instead of selling machines, CGI exclusively sequenced human genomes and performed the downstream analysis, delivering an annotated human genome as the final product. Their analysis included copy number variations, structural variations, variant calling, variant annotation, detection of mobile elements and multiple additional reports [33]. This analysis reduced the computational challenges for customers. CGI was a very important player in the field; CGI’s marketing forced competitors to lower the price of whole human genomes. In addition, CGI changed the model of purchasing expensive equipment to a model of genome sequencing as a service. CGI was a very creative company, but it was limited in that its only product was its genome service, whereas its competitors had multiple sources of revenue (e.g., instruments, reagents, support and service, among others).

Other technologies, like Ion Torrent Systems, entered the market at a later time (February 2010). Ion Torrent brought semiconductor-based detection systems to the sequencing arena, a significant improvement over the ubiquitous but slow approach of image acquisition [34]. Ion Torrent keeps increasing its market share. Their system has the benefit of a very short turnaround time, an advantage when working with critical care patients who need an answer on the same day.

Single-molecule real-time (SMRT) sequencing is based on sequencing by synthesis and real-time detection of the incorporation of fluorescent labels. The advantage of this technology is the continuous long reads generated by the instruments [35]. The technology was developed by Pacific Biosciences® (PacBio; CA, USA), whose latest machine, the PacBio RS II, was released in April 2013. PacBio sequencing technology plays a very important role in filling the gaps in current assemblies [36].

There are many other new technologies in development that will make sequencing even faster and more economical, such as Oxford Nanopore Technologies (the GridION System, based on nanopore sensing), Fluidigm® (single-cell sequencing) and Nabsys (positional sequencing), among others. Figure 1 highlights the major events in next-generation sequencing.

Figure 1. Timeline: the major events in next-generation sequencing. On the left is the year of the event.


EST: Expressed sequence tag; IHGSC: International Human Genome Sequencing Consortium; ENCODE: Encyclopedia of DNA Elements; NCBI: National Center for Biotechnology Information; WGS: Whole-genome sequencing.

Focus on the protein-coding genome

The best and most direct approach to studying a person’s genome would be to sequence the whole genome. However, since only roughly 2–3% of the human genome codes for proteins, yet harbors approximately 85% of the mutations with large effects on disease-related traits [37], it becomes a logical choice to focus efforts on the smaller subset of the genome that contains the exons (i.e., the exome). In addition, the interpretation of the functional effect of a mutation in a noncoding region of the genome is an extremely difficult task, as discussed later in this review. This targeted approach reduces the cost and time needed to sequence samples but, more importantly, it reduces the computational processing time by at least 50-fold.

The process of enrichment by hybridization has been commercialized mainly by three companies: Illumina, NimbleGen (Basel, Switzerland) and Agilent (CA, USA). Illumina offers three products: Nextera (target region of 37 Mb), the Nextera Expanded Exome Kit (target region of 62 Mb) and TruSight One (12 Mb, covering exons of known human disease genes) [38]. NimbleGen offers ‘SeqCap EZ Exome v3’ (target region of 64 Mb) [39]. Agilent offers ‘SureSelect Human All Exon’ (target region of 75 Mb) [40]. All the enrichment kits, with the exception of TruSight One, are capable of capturing exons, 5′ UTRs, 3′ UTRs, miRNAs and other noncoding RNAs.

The challenge of working with billions of short reads

The development of new instruments capable of generating data on the gigabase-pair scale created a new problem: the lack of software capable of aligning and assembling short reads. During the early days of NGS (2007–2008), there were direct requests from the NIH to the scientific community, especially computational biologists, to design short-read sequencing mapping tools (SRSMTs) that work with NGS data. The bioinformatics community solved the problem very quickly: by 2008, the first open-source SRSMT, ‘Mapping and Assembly with Quality’ (Maq), had been released [41]. Maq is capable of mapping short reads to reference sequences and building an assembly. A recent survey estimates that the current number of SRSMTs is over 70 [42]. Most current SRSMTs accelerate the mapping by creating indexes (hash tables) for the reads or for the reference genome; some bioinformaticians therefore categorize SRSMTs as genome-indexing or read-indexing. In general, read-indexing SRSMTs like Maq or RMAP [43] perform better on short genomes, while genome-indexing SRSMTs perform better with larger genomes like the human genome. The majority of current SRSMTs are genome-indexing. Genome-indexing SRSMTs differ from each other by the presence or absence of features or by the algorithm used to implement a given feature. The main differences between genome-indexing SRSMTs lie in the following features: the technique used to create the index; the seeding algorithm; the usage of base-quality scores; the allowance of gaps during the alignment; and the quality threshold. The combination of these features makes each SRSMT unique and makes selecting the right one a challenge for the user. The most widely used SRSMTs are Bowtie2 [44], BWA [45], SOAP2 [46], GSNAP [47], Novoalign [48] and mrsFAST/mrFAST [49,50]. Each has its own strengths and weaknesses, and there is no single best tool, as each performs better under different conditions [51].
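To make the mapping step concrete, the following minimal sketch shows how a genome-indexing SRSMT such as BWA is typically driven from a script, assuming BWA and SAMtools are installed and a BWA index has already been built for the reference FASTA; the file names and thread count are illustrative placeholders rather than a prescribed configuration.

```python
import subprocess

# Hypothetical file names; a real run needs a pre-built BWA index (bwa index GRCh37.fa).
reference = "GRCh37.fa"
reads_1 = "sample_R1.fastq.gz"
reads_2 = "sample_R2.fastq.gz"
sorted_bam = "sample.sorted.bam"

# Align paired-end reads with BWA-MEM and pipe the SAM output straight into
# samtools sort to obtain a coordinate-sorted BAM file.
bwa = subprocess.Popen(
    ["bwa", "mem", "-t", "8", reference, reads_1, reads_2],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["samtools", "sort", "-o", sorted_bam, "-"],
    stdin=bwa.stdout,
    check=True,
)
bwa.stdout.close()
bwa.wait()

# Index the sorted BAM so downstream tools can access it by region.
subprocess.run(["samtools", "index", sorted_bam], check=True)
```

Equivalent invocations exist for the other aligners listed above; only the command names and options change.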

Variant callers

After the short reads have been aligned against the reference genome, variants need to be extracted from the alignments. Software packages that detect single nucleotide variations (SNVs) and small insertions and deletions (indels) are called SNV callers, while programs that determine the genotype at each site are called genotype callers. Before submitting information to the SNV callers, it is necessary to minimize the experimental errors in the alignment files, that is, the binary files containing the Sequence Alignment/Map format (BAM files). Experimental errors and technology-specific artifacts can be introduced systematically or randomly.

SNV detection relies on the identification of statistical differences between the base found at a site of the template and the corresponding bases found in the aligned reads. Any sequencing error can lead to an incorrect SNV identification. To avoid this problem, the Broad Institute (MA, USA) created the Picard programming suite [52] to identify and correct systematic errors in the initial BAM files. The Picard suite complements and provides functionality to the Genome Analysis Toolkit (GATK) [53]. The GATK was developed at the Broad Institute to analyze NGS data and facilitate variant discovery; it was designed by geneticists and engineers with a very robust architecture. Some of the available high-quality variant callers are capable of identifying both SNVs and indels, while others detect only SNVs. The most commonly used variant callers are listed in Table 1. High-quality BAM files with high coverage are processed very well by all of them, but BAM files with low coverage and/or low quality are processed very poorly (for additional information and comparisons see [54–56]).

Table 1.

The most frequently used variant callers.

Name Institution Comments Ref.
GATK Broad Institute GATK is a suite of tools designed by geneticists and engineers with a very robust architecture. It provides two widely used tools to detect variants: UnifiedGenotyper, a Bayesian genotype-likelihood program, and HaplotypeCaller, which uses an affine-gap-penalty pair Hidden Markov model [53,57]
FreeBayes Boston College FreeBayes is a Bayesian haplotype-based variant discovery program. It solves the problem of detecting haplotypes on regions where multiple alignments are possible [58,59]
Atlas2 HGSC, Baylor College of Medicine Atlas2 uses a logistic regression model that has been trained on a group of validated variants [60,61]
Bambino The National Cancer Institute’s Center for Biomedical Informatics and Information Technology Bambino takes advantages of pooling samples. It is specially designed for detection of somatic mutations. It takes a new approach of padding the reads to improve detection of insertions and deletions [62,63]
SAMtools The Wellcome Trust Sanger Institute SAMtools provides an additional tool, bcftools, and a Perl script to extract the variants from a multialignment format (mpileup) generated from BAM files [64,65]
SNVer New Jersey Institute of Technology It takes a statistical approach using a binomial–binomial model and tests the significance of each allele, generating a p-value [66,67]

GATK: Genome Analysis Toolkit; HGSC: Human Genome Sequencing Center.
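None of the callers in Table 1 works exactly this way, but the toy sketch below illustrates the statistical core they share: deciding whether the number of non-reference bases observed at a site is too large to be explained by the sequencing error rate alone. The pileup counts, error rate and thresholds are invented for illustration.

```python
from math import comb

def binomial_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
    non-reference bases if mismatches arise only from sequencing error."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def call_site(ref_count, alt_count, error_rate=0.01, alpha=1e-6):
    """Toy SNV call at one site: reject the 'errors only' hypothesis when the
    observed alternate-allele count is extremely improbable under error_rate."""
    depth = ref_count + alt_count
    p_value = binomial_sf(alt_count, depth, error_rate)
    if p_value < alpha:
        # Crude genotype guess from the allele balance; real callers use
        # likelihood models over per-base qualities instead.
        genotype = "1/1" if alt_count / depth > 0.8 else "0/1"
        return {"variant": True, "genotype": genotype, "p_value": p_value}
    return {"variant": False, "p_value": p_value}

# Example: 30x coverage with 14 reads supporting the alternate base.
print(call_site(ref_count=16, alt_count=14))
```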

Distinguishing the forest from the trees: rare variants

As described in a previous section, population geneticists have been studying the distribution of variants in populations for many years, and they have found a correlation between the frequency of a variant and the expression of a phenotype (penetrance). Population geneticists postulated that a very low-frequency allele is more likely to be responsible for an extreme and rare Mendelian phenotype, and that a common variant that is fixed in the genome carries a low risk of being responsible for the phenotype [68,69]. This observation provides a good explanation for Mendelian disorders and has become the practical basis for identifying potentially damaging mutations in NGS experiments. Common variants in a population are called SNPs; the exact minor allele frequency (MAF) used to distinguish a rare variant from a SNP is a subject of debate among population geneticists. It has become common practice to filter out any variant that has a MAF greater than 1.0%. The 1.0% threshold is an arbitrary cutoff, and the appropriate value depends on the source (population) and size of the samples used to generate the MAF information. Large sequencing centers, which have sequenced thousands or millions of local patients, will have better information about what frequency values to use as a cutoff in such filters. A small laboratory has to use publicly available databases to estimate the MAF. Using publicly available data as the sole source of frequency information to filter NGS data increases the risk of over- or under-filtering variants. Resources for allele frequency information are listed in Table 2.

Table 2.

Resources for allele frequency information.

Name License Comments Ref.
HapMap project Free access The HapMap project focuses on the characterization of common SNPs with a minor allele frequency of ≥5% [15,18,70]
1000 Genomes project Free access Based on the extended HapMap collection, the 1000 Genomes Project captured up to 98% of the SNPs with a minor allele frequency of ≥1% in 1092 individuals from 14 populations [71–73]
The NHLBI (MD, USA) Exome Sequencing Project Free access A project directed at discovering genes responsible for heart, lung and blood disorders; the allele frequency of each variant detected in the exome sequencing project has been released [74–76]
The Personal Genome Project Free access Currently, the Personal Genome Project has the genomes of 174 individuals and the exomes of over 400 volunteers available for download [77,78]
NextCode Health Commercial 40 million validated variants collected from the genotypes of 140,000 volunteers from Iceland [79,80]
CHARGE consortia Fee for access; requires permission from the CHARGE consortium 1000 whole-exome data sets of well-phenotyped individuals from the CHARGE consortium [81,82]

CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology; HapMap: Haplotype map; NHLBI: National Heart, Lung, and Blood Institute.
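A minimal sketch of the MAF-based filtering step described above, assuming the variants have already been annotated with allele frequencies from resources such as those in Table 2; the variant records and the 1% cutoff shown are illustrative.

```python
# Each record is a variant already annotated with a population minor allele
# frequency (MAF); None means the variant was not found in the reference panels.
variants = [
    {"id": "chr1:12345:A>G", "maf": 0.32},   # common SNP -> filtered out
    {"id": "chr2:67890:C>T", "maf": 0.004},  # rare variant -> kept
    {"id": "chr7:11111:G>A", "maf": None},   # novel variant -> kept
]

MAF_CUTOFF = 0.01  # the commonly used, but arbitrary, 1% threshold

def is_rare(variant, cutoff=MAF_CUTOFF):
    """Keep variants that are absent from the panels or below the MAF cutoff."""
    maf = variant["maf"]
    return maf is None or maf <= cutoff

rare_variants = [v for v in variants if is_rare(v)]
print([v["id"] for v in rare_variants])
```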

Information & material required to take NGS to the clinic

With the availability of many sequencing methods, short-read aligners and variant callers, there are significant differences between variant calls and interpretations of results. Efforts have been made to identify the most common practices among the top sequencing groups and to suggest standards for best practices. A recent publication by the international CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases [83]. Their surveys and best practices provide important insights for clinical laboratories, but they do not provide the tools for laboratories to evaluate their own implementation of the process. A universal, highly accurate set of genotypes across a genome that can be used as a benchmark is required to standardize clinical laboratories that offer clinical exomes and genomes.

The National Institute of Standards and Technology organized the ‘Genome in a Bottle Consortium’ (GBC) to develop such benchmarks. GBC developed and made publicly available reference material, reference methods and reference data [84]. In a recent publication, GBC describes the sample selected as reference material, the HapMap CEU (CEPH) female NA12878, and the 14 data sets generated with six different sequencing platforms, eight different mapping programs and various variant callers. GBC integrated all the information and provided a validated set of SNPs and indels; in addition, they provided recommendations on how to deal with complex variants and genomic regions that are difficult to genotype [85]. Their work was essential for the recent authorization by the US FDA of the first next-generation sequencer, Illumina’s MiSeqDx [86].
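As a sketch of how a laboratory might use such a benchmark, the following example (hypothetical file names; indel normalization and the restriction to high-confidence regions are omitted for brevity) compares a pipeline’s call set against a truth set such as the GBC NA12878 calls and reports sensitivity and precision.

```python
def load_calls(vcf_path):
    """Parse a (simplified) VCF into a set of (chrom, pos, ref, alt) keys.
    Real benchmarking tools also normalize the representation of indels."""
    calls = set()
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            calls.add((chrom, int(pos), ref, alt))
    return calls

def benchmark(test_calls, truth_calls):
    """Sensitivity and precision of a test call set against a truth set."""
    true_positives = len(test_calls & truth_calls)
    sensitivity = true_positives / len(truth_calls)
    precision = true_positives / len(test_calls)
    return sensitivity, precision

# Hypothetical usage against the NA12878 high-confidence calls:
# truth = load_calls("NA12878_high_confidence.vcf")
# test = load_calls("my_lab_pipeline.vcf")
# print(benchmark(test, truth))
```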

Distinguishing between benign & deleterious mutations

When a mutation occurs in the coding sequence of a protein, the result can be: a synonymous change (no amino acid change); a missense mutation (a single amino acid substitution in the protein); a premature chain termination; a frameshift in the protein due to the addition or deletion of one or more nucleotides; or an altered exon–intron splice junction. The interpretation of the functional effect is readily done for all of these cases except for missense mutations. If a variant has not been studied before, it is considered a variant of unknown significance. Such variants are a source of diagnostic challenge and uncertainty for families.

The most straightforward approach to analyzing a variant is to search databases that store information about known disease-causing mutations (DCMs). Catalogs of DCMs are very useful, but the information has to be evaluated very carefully: DCM databases are very small and include errors that were carried over from the original scientific studies. The most widely used catalogs of DCMs are listed in Table 3. In most clinical laboratories, pathogenic variants are detected using the Human Gene Mutation Database (HGMD) Professional [87,88] and the ClinVar database [89]. HGMD is unquestionably the largest catalog of DCMs, with approximately 116,000 DCMs (release dated December 2013; variantType = DCM), while the latest release of ClinVar (March 2014) has only approximately 29,000 variants considered ‘pathogenic’. Unfortunately, the number of pathogenic variants in both databases represents only a small fraction of the potential number of pathogenic mutations in a population of approximately 7 billion humans. Consequently, the majority of the missense mutations found in an NGS experiment will not be classified by DCM databases, and alternative approaches are needed for the interpretation of such variants.

Table 3.

Human catalogs of disease-causing mutations.

Name License Ref.
Human Gene Mutation Database (HGMD) Commercial [87,88,90]
ClinVar database Open [89,91]
Human Genome Variation Society Locus-Specific Mutation Databases Open [92,93]
Leiden Open-source Variation Database (LOVD) Open [94,95]
Catalogue of Somatic Mutations in Cancer Open [96,97]
The Diagnostic Mutation Database (DMuDB) Commercial [98]
A human mitochondrial genome database (MITOMAP) Open [99,100]
PhenCode Open [101,102]
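In practice, the catalog lookup amounts to a simple key match on the variant coordinates and alleles. The sketch below assumes the laboratory has exported known pathogenic entries (for example from ClinVar) into a lookup table; the two records shown are illustrative placeholders, not verified catalog entries.

```python
# Hypothetical lookup table of known disease-causing mutations, keyed by
# (chromosome, position, reference allele, alternate allele). In practice this
# would be built from ClinVar or a licensed HGMD export, not hard-coded.
known_pathogenic = {
    ("chr7", 117199644, "ATCT", "A"): "CFTR-associated disorder (example entry)",
    ("chr11", 5248232, "T", "A"): "HBB-associated disorder (example entry)",
}

def classify(variant_key):
    """Return the catalog annotation if present; otherwise flag the variant
    as a variant of unknown significance needing further interpretation."""
    return known_pathogenic.get(variant_key, "variant of unknown significance")

print(classify(("chr11", 5248232, "T", "A")))
print(classify(("chr3", 1000000, "G", "C")))
```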

To interpret the functional effect of variants that are not in a DCM catalog, functional prediction programs (FPPs) have to be used. FPPs are capable of detecting pathogenic variations with some degree of certainty. Table 4 lists the majority of the FPPs and a few databases with precomputed scores. The method employed by each FPP is used to categorize them, and it is provided in the column labeled ‘Category’ of Table 4.

Table 4.

Functional prediction programs.

Tool Date Access Category Ref.
PANTHER 2003 A and C 3 [103,104]
Logre 2004 H 3 [105,106]
topoSNP 2004 C 3 [107,108]
MAPP 2005 A and C 3 [109,110]
nsSNPAnalyzer 2005 C 4 [111,112]
PMut 2005 H 4 [113]
LS-SNP 2005 C 2 [114,115]
FoldX 2005 A and F 1 [116,117]
Align-GVGD 2006 C 3 [118,119]
PhD-SNP 2006 A and B and C 4 [120,121]
FASTSNP 2006 C and H 4 [122,123]
Mupro 2006 A and C 1 [124,125]
snps3D 2006 C 1 [126,127]
CanPredict 2007 H 4 [128]
Parepro 2007 H 4 [129]
SNAP 2007 A and B and C 4 [130,131]
BONGO 2008 H 2 [132]
ETA 2008 C 1 and 4 [133,134]
MutPred 2009 C 4 [135,136]
SIFT 2009 A and B and C and E 3 [137,138]
SNPs&GO 2009 C 4 [139,140]
MuD 2010 C and H 4 [141,142]
Hope 2010 C 2 [143,144]
MutationTaster 2010 C 4 [145,146]
PolyPhen-2 2010 A and B and C and E 2 and 4 [147,148]
Condel & FannsDb 2011 B and C 7 [149152]
SDM 2011 C 1 [153,154]
PopMuSic 2011 C and F 1 [155,156]
Mutation-assessor 2011 C 3 [157,158]
PON-P 2012 C 2 [159,160]
PROVEAN 2012 A and B and C and E 3 [161,162]
KD4v 2012 C and D and I 1 and 4 [163,164]
SNPdbe 2012 C and G 6 [165,166]
VariBench 2012 C and G 5 [167,168]
CAROL 2012 B 7 [169,170]
Hansa 2012 C 4 [171,172]
SNPeffect 4 2012 C and F 2 [173,174]
Meta-SNP 2013 C 7 [175,176]
VAAST 2.0 2013 A and F 8 [177,178]
logit 2013 H 7 [179]
dbNSFP v2.0 2013 G 6 [180,181]
CoVEC 2013 A and B and C 7 [182,183]
PredictSNP 2014 C 7 [184,185]
mCSM 2014 C 1 [186,187]
HMM 2014 A 3 [188,189]
GWAVA 2014 B and C and E 4 [190,191]
CADD 2014 C and E 4 [192,193]

Access keys = A: Executables; B: Source; C: Web interface; D: Web services; E: Precomputed scores; F: Require registration; G: Download entire database; H: Site not available; I: Access to rules and training sets.

Category keys = 1: Protein stability; 2: Protein sequence and structure; 3: Sequence and evolution conservation; 4: Machine learning; 5: Data for benchmark; 6: Database; 7: Consensus classifier; 8: Conservation and frequency.

Under category 1 (protein stability) are FPPs that evaluate how the stability of the protein is affected by an amino acid change. In an ideal situation, we would expect the interpretation of the functional effect of a variant to be easily done by analyzing the 3D structure of the protein and querying for the effect of the change on that structure. However, the process is much more complicated. The 3D structures of proteins are stored in the Protein Data Bank (PDB), which contains structures for only a very small fraction of the entire set of human proteins (the human proteome). In many cases, sections of a protein cannot be crystallized, leaving regions of the protein without a 3D structure. In addition, the majority of genes produce alternative splice variants during expression; alternative splicing generates multiple protein isoforms from a single genetic locus, and the vast majority of protein isoforms lack 3D structures. Furthermore, to be certain about the structural effect of an amino acid substitution, we need the 3D structures of both the wild-type protein and the mutated protein. If we only have the 3D structure of the wild-type protein, it is possible to estimate the structural changes of the mutated protein by using molecular modeling [194] (for a recent review on molecular modeling, see [195]).

The FPPs under category 2 (protein sequence and structure) evaluate the consequences of amino acid changes by looking at individual amino acid properties and locations. For example, if an amino acid change is located in an important motif of the protein or in a region associated with the activity of the protein, the probability that the change will affect the protein is high. The most widely used FPP in this category is PolyPhen-2, which is also a machine-learning FPP, using a Bayesian classifier composed of eight sequence-based and three structure-based predictive features [147].

The FPPs grouped in category 3 are based on sequence and evolutionary conservation. The FPPs that use this method require multispecies sequence alignments to calculate the divergence at a location. If the amino acid change occurred in a region that is highly conserved and the change is not observed in other species, the amino acid change is likely to affect the protein. Some of these FPPs use special matrices based on physicochemical properties to evaluate the changes; others use Hidden Markov models to evaluate whether the change is tolerated. The most widely used FPPs in this category are SIFT [137], MAPP [109] and PANTHER [103].

Category 8 (conservation and frequency) contains only one member, the Variant Annotation, Analysis and Search Tool 2 (VAAST 2.0) [177]. VAAST 2.0 employs a novel conservation-controlled amino acid substitution matrix (CASM) to incorporate information about phylogenetic conservation.

The newest generation of FPPs has been developed using machine-learning algorithms (category 4). Learning algorithms include naïve Bayes classifiers, neural networks, support vector machines and random forests. Most often, these FPPs use a neural network or a support vector machine, because these methods are designed to be trained with two data sets: for example, benign versus pathogenic variants. The FPPs learn to differentiate between the two groups of variants. The most commonly used FPPs in this category are PMut [113], PhD-SNP [120], SNPs&GO [139] and MutationTaster [145].
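The general recipe behind these category 4 tools can be sketched as follows, using a random forest on a synthetic table of variant features labeled benign versus pathogenic; this assumes NumPy and scikit-learn are available, and the features, labels and model choice are illustrative rather than those of any program in Table 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic feature matrix: each row is a missense variant described by
# illustrative features (conservation score, change in hydrophobicity,
# solvent accessibility), labeled 1 = pathogenic, 0 = benign.
rng = np.random.default_rng(0)
n = 200
conservation = rng.uniform(0, 1, n)
hydrophobicity_delta = rng.normal(0, 1, n)
accessibility = rng.uniform(0, 1, n)
X = np.column_stack([conservation, hydrophobicity_delta, accessibility])
# Synthetic labels: highly conserved, buried positions are more often pathogenic.
y = (conservation + 0.3 * np.abs(hydrophobicity_delta) - accessibility
     + rng.normal(0, 0.3, n) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```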

Recently, several groups have begun developing methods to combine the scores of multiple FPPs into a single score (category 7). The Combined Annotation scoRing toOL (CAROL) [169] combines the scores of two FPPs: PolyPhen-2 [147] and SIFT [137]. The Consensus deleteriousness score of missense mutations (Condel) [149] combines the scores of five FPPs: Logre [105], MAPP [109], Mutation assessor [157], PolyPhen-2 [147] and SIFT [137]. Evaluations of tools that use a weighted average of the normalized scores from multiple FPPs indicate greater confidence in classifying missense mutations [196,197]. It is becoming common practice to use this combinatorial approach.
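A minimal sketch of the weighted-average consensus idea follows; the score ranges, normalization and equal weights are illustrative and are not the actual parameters used by CAROL, Condel or CoVEC.

```python
def normalize(score, low, high, higher_is_damaging=True):
    """Map a raw predictor score onto [0, 1], where 1 means most damaging.
    SIFT-like scores (low = damaging) are flipped so all tools agree."""
    scaled = (score - low) / (high - low)
    return scaled if higher_is_damaging else 1.0 - scaled

def consensus(scores, weights):
    """Weighted average of normalized scores from several predictors."""
    total = sum(weights[tool] for tool in scores)
    return sum(weights[tool] * scores[tool] for tool in scores) / total

# Hypothetical predictions for one missense variant.
raw = {"polyphen2": 0.93, "sift": 0.02, "mutation_assessor": 3.1}
normalized = {
    "polyphen2": normalize(raw["polyphen2"], 0.0, 1.0),
    "sift": normalize(raw["sift"], 0.0, 1.0, higher_is_damaging=False),
    "mutation_assessor": normalize(raw["mutation_assessor"], -5.0, 6.0),
}
weights = {"polyphen2": 1.0, "sift": 1.0, "mutation_assessor": 1.0}  # illustrative
print("consensus deleteriousness:", round(consensus(normalized, weights), 3))
```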

In 2013, a group directed by Simpson evaluated seven predictive tools plus the two consensus tools, CAROL and Condel [182]. Their comparison showed that MutPred [135] had the highest sensitivity and the lowest number of false positives; PolyPhen-2 [147] was the second best, and SNPs&GO [139] was the third best. The two combinatorial-score programs, CAROL [169] and Condel [149], performed very well but not as well as MutPred [135] alone. Simpson’s group then developed their own Consensus Variant Effect Classification tool (CoVEC). CoVEC integrates the prediction results from four predictors: SIFT [137], PolyPhen-2 [147], SNPs&GO [139] and Mutation assessor [157]. According to their evaluation, CoVEC performed almost as well as MutPred [135] and better than CAROL [169], Condel [149] and PolyPhen-2 [147].

The column labeled ‘Access’ in Table 4 points to several problems: many of the available FPPs are not released for users to run locally, and the authors provide access only through web servers. Unfortunately, many of the web servers are not consistently available. Only one group provides web services (application programming interfaces) to access their tools. Other groups provide simple batch processing, and some require that variants be tested manually on their server, which is an impossible task when working with NGS, where hundreds of missense mutations need to be evaluated. This problem is partly solved by databases with precomputed scores, like dbNSFP [180]. However, the major problem is the lack of standards between groups: each group develops its own format and requires a different input for the data. In addition, each group invents its own scoring system, and in many cases it is difficult to figure out what data sets were used to train the programs. An urgent call for standardization is required.

All the available FPPs are limited to evaluating the effect of single missense mutations. The effect of indels, or of multiple missense mutations in a single protein, is beyond the scope of most, if not all, of the available programs. There is also a lack of FPPs capable of evaluating the effect of variants in noncoding regulatory regions, even though there is a plethora of annotations from the Encyclopedia of DNA Elements (ENCODE) project. However, at the time of this writing, a new method has been published: Genome-Wide Annotation of Variants (GWAVA).

GWAVA uses a machine-learning algorithm (random forest) trained with annotations from ENCODE, GENCODE and other sources to evaluate the effect of regulatory variants in the noncoding portion of the genome. GWAVA uses a normalized score of 0–1 to report the pathogenicity of variants. In addition, the group provides precomputed scores for all known noncoding variants available in Ensembl [190].

Very recently, the Combined Annotation-Dependent Depletion (CADD) framework was published [192]. CADD is based on the evolutionary principle that damaging mutations will be removed from the gene pool by natural selection. Shendure’s group trained their support vector machine with two data sets: the first set consisted of 14.7 million simulated variants that reflect known mutational events; the second set of 14.7 million variants contained variants known to be fixed in the human genome. The CADD framework incorporates annotations from 63 different sources and generates a single metric, the C score. The C score measures deleteriousness, a property that strongly correlates with both molecular functionality and pathogenicity. Shendure’s group also precomputed and made available scores for all possible single nucleotide variants at every position in the genome. In addition, CADD is capable of evaluating the effect of indels, but only a limited set of indels has been precomputed at this time. The authors provided several examples of the correlation between the C score and pathogenicity and tested CADD on several sets of known pathogenic variants. Their analysis shows that CADD outperforms PolyPhen-2 [147] in distinguishing between pathogenic and benign variants. The precomputed data provide two types of scores: a raw score, which ranges from negative to positive values (a negative value indicates that the variant behaves like variants fixed in the population, while a positive value indicates that it behaves like simulated or rare variants), and a normalized, Phred-like scaled score. The advantage of using the Phred scale, a ranking score, is that most people who work with sequence analysis are already familiar with it, and the scores should be consistent between releases. For example, if a mutation ranks in the top 1% (CADD-20) of the whole set of possible mutations in the human genome and the program is updated, the rank for the mutation tested would remain the same regardless of the absolute value of the raw score or the Phred value generated by the updated program [192].
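The Phred-like scaling is a rank transformation, which is why it remains stable across releases; the sketch below reproduces the idea on a small set of invented raw scores (CADD itself distributes precomputed scaled scores, so this is only to illustrate the calculation).

```python
import math

def phred_scale(raw_scores):
    """Rank-based Phred-like scaling: a variant in the top 1% of raw scores
    gets a scaled score of 20, the top 0.1% gets 30, and so on
    (scaled = -10 * log10(rank / total), ranking from the most damaging)."""
    ordered = sorted(raw_scores, reverse=True)  # most damaging first
    total = len(ordered)
    scaled = {}
    for rank, score in enumerate(ordered, start=1):
        scaled.setdefault(score, -10.0 * math.log10(rank / total))
    return scaled

# Invented raw scores: positive values resemble simulated/rare-like variants,
# negative values resemble variants fixed in the population.
raw = [4.2, 3.7, 1.1, 0.3, -0.2, -0.9, -1.5, -2.8, -3.3, -4.0]
for score, phred in sorted(phred_scale(raw).items(), reverse=True):
    print(f"raw={score:+.1f}  scaled={phred:.1f}")
```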

Integrated software & commercial solutions to analyze your data

During the last few years, many institutions have been able to acquire NGS sequencers, but many of them lack the infrastructure and expertise to perform the bioinformatics analysis and the medical interpretation of the data. For a small laboratory that processes a small number of samples, annotating the variant call format (VCF) file and selecting a subset of variants to study is sufficient. Several software packages that annotate an entire VCF file are listed in Table 5 (under the type ‘VCF annotator’).

Table 5.

Software to annotate variant call format files and manage workflow.

Name Type of analysis or system provided Access Ref.
Cassandra VCF annotator Free [198]
AnnTools VCF annotator Free [199]
Ensembl SNP Effect Predictor VCF annotator Free [200]
snpEff VCF annotator/predictor Free [201]
ANNOVAR VCF annotator Commercial and free [202]
Varianttools VCF annotator Free [203]
Galaxy Workflow management system Free [204]
Mercury Workflow management system Free [205]
NGSANE Workflow management system Free [206]
Seven Bridges Genomics, Inc. Workflow management system Commercial [207]
Chipster Workflow management system Free [208]
Anduril Workflow management system Free [209]
Genomatix Hardware and software Commercial [210]
CLC Bio Hardware and software Commercial [211]
Knome, Inc. Hardware and software Commercial [212]
SoftGenetics Software Commercial [213]
DNAStar, Inc. Software Commercial [214]
Partek, Inc. Software Commercial [215]
Complete Genomics, Inc. Whole genome and analysis Commercial [216]
Personalis Exome sequencing and analysis Commercial [217]
Omicia Analysis Commercial [218]
NextCODE Health Analysis Commercial [79]
Invitae Corp. Analysis Commercial [219]
Genformatic Analysis Commercial [220]
Bina Analysis Commercial [221]
Real Time Genomics Analysis Commercial [222]
DNAnexus Cloud service, storage and analysis Commercial [223]
Ingenuity Analysis Commercial [224]

VCF: Variant call format.

For a large laboratory that has to analyze hundreds or thousands of samples, a manual process is not a viable solution: a large laboratory needs to analyze every sample consistently and automatically. There are many bioinformatics steps between the raw data and the final report (Figure 2). For such laboratories, the installation of a workflow management system is essential. Table 5 lists several workflow management systems, some of them free and others commercial. Alternatively, there are many companies dedicated to providing solutions for analyzing the data (Table 5). Several companies, like Genomatix and Knome, offer one-stop solutions; others offer only the software; and a third group offers to perform the bioinformatics analysis and return the results.

Figure 2. Generic pipeline for the analysis of next-generation sequencing.


Multiple steps are involved in the analysis of data from next-generation sequencing. The paired-end short reads from the sequencing machine are submitted to a quality control process. The adaptors are removed from the reads, and the reads are then mapped to the human reference using short-read sequencing mapping tools. The alignments in the Sequence Alignment/Map format are cleaned with tools like Picard and transformed into a binary version of the Sequence Alignment/Map format (BAM). The BAM file is processed with tools like the Genome Analysis Toolkit to clean up the alignments. Quality control reports are generated, and variants are extracted by variant callers. The document containing the variants, the variant call format file, is annotated and filtered. Low-frequency variants that are known or predicted to be damaging are validated and used to generate a final report for the physicians or genetic counselors.

BAM: Binary Sequence Alignment/Map format; dbNSFP: Lightweight database of human nonsynonymous SNPs and their functional predictions; GATK: Genome Analysis Toolkit; HGMD: Human gene mutation database; HPG: High performance genomics; MAF: Minor allele frequency; QC: Quality control; SAM: Sequence Alignment/Map format; SRSMT: Short read sequencing mapping tools; VEF: Variant effect predictor; VCF: Variant call format.

Use of NGS to diagnose human disorders

One of the major goals of medical diagnosis is to identify the genes and mutations responsible for human disorders. Early identification of causative mutations enables the early detection of a myriad of disorders. We are living in an age of high healthcare costs; the early detection of genetic disorders, carrier status and genetic predispositions to cancer and cardiovascular disease could substantially reduce those costs.

The first proof of concept that NGS technology could be used to detect genetic disorders was provided by Shendure’s group in September 2009 [225]. A few months later, the same group reported the detection of the first recessive disorder (Miller syndrome) by whole-exome sequencing (WES) [226]. These two papers marked a new era in which NGS became the preferred tool for identifying genes underlying rare Mendelian diseases. There are several excellent reviews that describe the exponential growth in disease gene identification that started in 2010 [227–229]. As of 27 February 2014, the number of genes with phenotype-causing mutations had reached 3162, according to the Online Mendelian Inheritance in Man (OMIM) gene map statistics [230]. In a recent review, Rabbani et al. estimated that from January 2010 to May 2012, over 100 causative genes for various Mendelian disorders were identified by means of exome sequencing [231].

WES is now a valid and standard diagnostic approach for the identification of molecular defects in patients with suspected genetic disorders. This was demonstrated last year by a publication in the New England Journal of Medicine from the Medical Genetics Laboratory group at Baylor College of Medicine. The group reported WES of 250 probands referred by physicians; 98% of the cases were billed to insurance. They reported a 25% molecular diagnostic rate (62 cases) [232]. In September 2013, the NIH funded four groups to explore the use of NGS for newborn screening [233]. With the cost per genome approaching US$1000, it is becoming affordable to be sequenced at an early age, allowing the reanalysis of our genetic information at multiple intervals during a person’s life (Figure 3). A recent review outlines the approach, challenges and benefits of such screening for adult genetic disease risks [2]. We also recently published a proof-of-concept project aimed at evaluating the benefits of screening healthy adults using WES. Our pilot project demonstrated that when WES is combined with medical and family history, the findings are substantial. In a cohort of 81 unrelated individuals, we identified 271 recessive risk alleles (214 genes), 126 dominant risk alleles (101 genes) and three X-linked recessive risk alleles (three genes). In addition, we linked personal disease histories with causative disease genes in 18 volunteers [234].

Figure 3. The road from next-generation sequencing to personalized medicine.


An overall view of how next-generation sequencing will be incorporated into the medical healthcare system. At the time of birth, a small blood sample is taken from the patient and submitted for whole-genome sequencing. The physicians and genetic counselors will provide a detailed family and medical history to an entity that will store and analyze the next-generation sequencing data. This entity will receive additional information, such as metabolomics, proteomics and transcriptomes, among others, and new bioinformatics interpretations will be performed in collaboration with molecular biologists, physicians and genetic counselors. The physicians will review the reports and formulate recommendations and treatments for the patient. The process will be interactive, with constant communication between the doctor, the patient and the entity in charge of data interpretation.

Conclusion

The development of NGS was a monumental achievement that involved thousands of individuals from multiple professions and with a myriad of motivations, but with a common goal: to understand what makes us unique. The major milestone on the road to that goal was sequencing the first human genome, accomplished under the Human Genome Project (HGP). Reaching this first milestone took 13 years and cost US$3 billion; however, we should not forget the overlapping effort to annotate the human genome. Annotating the human genome was essential to understanding and applying our newly acquired knowledge to improve human health. Before the end of the project, it became obvious that sequencing an individual genome was only the beginning of a long road toward providing cures and prevention for genetic diseases. Two independent efforts were born after the completion of the HGP: one directed at understanding the variability of the human population (the HapMap project), and a second, undertaken by commercial enterprises, that developed the most economical massively parallel sequencing technologies ever seen. The success of both efforts, together with the growing catalog of human disorders, merged to form what we now know as clinical and medical genetics.

Multiple commercial enterprises have been very successful in developing fast and affordable technology. We can now sequence the entire genome of an individual for approximately US$1000 in less than 2 weeks (summarized in Figure 1). With such overwhelming success in generating large amounts of short reads, several groups of developers were motivated to create efficient tools to align reads and detect variants. Currently, we have excellent short-read sequencing mapping tools (SRSMTs) and very accurate variant callers (Table 1). The process of interpreting an individual genome starts by separating the variations that are common in the population from the unique mutations; to complete this task, the resources developed by population geneticists are essential (Table 2).

Only 5 years before the publication of this review, Shendure’s group provided the first proof of concept that NGS could be used to detect human disorders. Since that time, the number of genes with pathogenic mutations has surpassed the 3000 mark. Human catalogs of disease-causing mutations are also expanding very quickly (Table 3), but since there is an extraordinarily large number of potentially damaging mutations in humans, improving our repertoire of techniques to predict damaging mutations should become a priority. Currently, there are over 40 functional prediction programs (FPPs) capable of detecting pathogenic variants (Table 4). However, the variable degree of accuracy and agreement between them, together with the lack of standards, maintenance and consistent forms of distribution, is our biggest liability for the acceptance of personalized medicine. We have come a long way since 2007; we now have a large number of commercial and free workflows capable of analyzing the enormous amount of information from NGS sequencers (Table 5 & Figure 2). I feel confident that future generations will have much brighter and healthier lives with the incorporation of NGS into medicine. Figure 3 shows how the use of NGS, in combination with additional information from the patient at different stages of life, will improve early treatment and deliver truly personalized medicine in real time.

Future perspective

Despite its young age, NGS has successfully extended our knowledge about disease phenotype–genotype relationships and disease gene discovery. The number of genetic disorders with a corresponding causative gene is growing very quickly and will continue to grow exponentially during the next few years. NGS technology has been adopted for the clinical diagnosis of suspected genetic disorders, with a 25% success rate [232]. The success rate will increase with the development of new sequencing technologies and better analytical tools. NGS is now moving into the areas of carrier testing, newborn screening and prenatal screening. We expect that during the next few years NGS will become a part of the standard set of newborn screening tests.

Currently, many laboratories offer NGS panels for patients with different types of cardiomyopathies that could have a genetic cause and for patients with family histories of hereditary cancers. Some laboratories offer services for the detection of variants that could improve the treatment of cancer patients, such as pharmacogenomics panels. Some groups, like the Mayo Clinic (MN, USA) [235], Foundation Medicine (MA, USA) [236], Genekey (CA, USA) [237] and Molecular Health (TX, USA) [238], offer genetic tests, work with oncologists to improve the treatment of their patients and provide state-of-the-art technologies to personalize cancer treatments. Some of their analyses include molecular profiling, gene expression profiling, the identification of genetic rearrangements in tumor samples, the detection of circulating tumor cells and the detection of somatic mutations in tumor samples. During the next few years, we expect an exponential increase in the number of organizations that not only offer NGS tests but also provide professional guidance to oncologists for the personalized treatment of cancer patients. The role of these professional counselors will extend from cancer to other genetic disorders, personalizing many medical treatments.

At the moment, screening healthy adults for genetic risks is a controversial issue. However, as patients become more aware of the benefits of using NGS for the early detection of adult-onset disorders, there will be an increase in the number of requests for NGS analyses, especially from healthy adults looking for new approaches to prevent disorders. Eventually, NGS will become part of routine yearly physical examinations, or it may become a medical specialty in its own right [234].

New technologies such as the GridION System (Oxford Nanopore Technologies [Oxford, UK]), single-cell sequencing (Fluidigm), positional sequencing (Nabsys) and long fragment read technology (CGI) will provide cheaper, faster and more accurate sequencing data. The use of supercomputers, in conjunction with parallelization, will accelerate the analysis of genomic data. The increasing number of catalogs of causative and risk genes will provide a foundation for PM and pharmacogenomics. The use of NGS technology for patients in critical care units will become possible with three elements in place: high-quality whole-genome sequences delivered very quickly; fast analysis times; and large catalogs of DCMs and pharmacogenomics markers. Predicting the functional effects of a mutation is a complex area in need of standardization, but it is of crucial importance for the identification of variants with high impact. New developments in this area, such as GWAVA and CADD, are helping to provide light at the end of a dark tunnel.

Executive summary.

Moving from traditional medicine to personalized medicine

  • With an overburdened and overwhelmed healthcare system, new alternative strategies are required to reduce costs and improve the well-being of patients.

  • Personalized medicine is a medical model that proposes the customization of healthcare, using biological markers and pharmacogenomics to tailor the treatment of individual patients.

  • A new technology, next-generation sequencing (NGS), has the potential to make personalized medicine a reality by accelerating the early detection of disorders and the identification of pharmacogenetic markers to customize treatments.

Brief history of NGS

  • The Human Genome Project lasted 13 years with a cost of US$3 billion and the involvement of thousands of international scientists.

  • The Human Genome Project provided the first draft of the human genome assemblies in 2001.

  • During the Human Genome Project the cost of sequencing was reduced dramatically with the development of better chemistry, the involvement of robotics and automation.

  • Bioinformatics and functional genomics flourished during this period, resulting in a myriad of biological annotations for the human genome.

  • The engagement of visionaries and entrepreneurs in the development of novel sequencing technologies bootstrapped the birth of NGS technology.

The goal of having an affordable diploid genome of a single person

  • The first diploid human genome, that of Dr Craig Venter (MD, USA), was published in 2007 at a cost of US$100 million.

  • In 2008, 454 technologies enabled the sequencing of the second human genome at a cost of US$1,500,000.

  • In 2010, SOLiD technology reduced the cost of a genome to US$100,000.

  • The development of targeted sequencing of all human exons lowered the price of sequencing to a few thousand dollars.

  • By 2012, a furious competition between Complete Genomics (CA, USA) and Illumina® (CA, USA) reduced the cost of a genome to US$3000.

The use of NGS to diagnose human disorders

  • The streamlining and standardization of sequencing analysis made it possible to detect the variations present in a single individual.

  • The comparison of variants from an individual against those found in populations allows the identification of rare variants.

  • The evaluation of rare variants using functional prediction programs has identified small subsets of variants that could explain pathology.

  • The demonstration that NGS analysis could be used to detect genetic disorders was provided by Shendure’s laboratory (WA, USA) in September 2009.

  • Since 2010, NGS has identified hundreds of causative genes in various Mendelian disorders.

Future perspective

  • The identification of causative genes will continue to increase exponentially.

  • The involvement of NGS in generating personalized pharmacogenomics profiles will increase and move into standard medical practice.

  • NGS will become part of the standard set of newborn screening tests, and ethicists, politicians and geneticists will debate for years to come about the value and risks of creating national databases for all newborn babies.

  • The role of NGS in prenatal screening will increase along with the debates between pro-life and pro-choice groups on whether or not we should use NGS for prenatal screening.

  • NGS will become part of the standard repertoire of techniques to guide the treatment of cancer patients.

  • Patients’ requests to primary care physicians for an NGS analysis will increase, especially from healthy adults looking for early detection or prevention of disorders.

Footnotes

For reprint orders, please contact: reprints@futuremedicine.com

Financial & competing interests disclosure

The research was supported by the Cullen Foundation for Higher Education. The funding organizations made the Awards to The University of Texas Health Science Center at Houston (UTHSCH). The author has no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

No writing assistance was utilized in the production of this manuscript.

References

Papers of special note have been highlighted as:

• of interest;

•• of considerable interest
