Abstract
Next-generation sequencing (NGS) technologies provide the potential for developing high-throughput and low-cost platforms for clinical diagnostics. A limiting factor to clinical applications of genomic NGS is downstream bioinformatics analysis for data interpretation. We have developed an integrated approach for end-to-end clinical NGS data analysis from variant detection to functional profiling. Robust bioinformatics pipelines were implemented for genome alignment and for detection of single nucleotide polymorphisms (SNPs), small insertions/deletions (InDels), and copy number variations (CNVs) in whole exome sequencing (WES) data from the Illumina platform. Quality-control metrics were analyzed at each step of the pipeline by use of a validated training dataset to ensure data integrity for clinical applications. Variants were annotated with data regarding the disease population and variant impact. Custom algorithms were developed to filter variants based on criteria such as quality of variant, inheritance pattern, and impact of variant on protein function. The developed clinical variant pipeline links the identified rare variants to Integrative Genomics Viewer for visualization in a genomic context and to the Protein Information Resource’s iProXpress for rich protein and disease information. With the application of our system of annotations, prioritizations, inheritance filters, and functional profiling and analysis, we have created a unique methodology for downstream variant filtering that empowers clinicians and researchers to interpret more effectively the relevance of genomic alterations within a rare genetic disease.
Keywords: bioinformatics, genetic alterations, Mendelian Genetics, protein information resources
NGS is a technique that involves the massively parallel sequencing of many DNA molecules at once.1 NGS generates massive amounts of data, and analyzing these data appropriately requires multiple computationally intensive steps. Currently, there are numerous publicly available algorithms for analyzing NGS data; however, the majority focus on 1 component of the NGS analysis at a time, such as alignment to a reference genome, and require integration with other algorithms. There are commercially available packages that provide a suite of tools for analyzing NGS data, but these are expensive, can be difficult to integrate with other algorithms, and often do not contain all of the algorithms required to analyze NGS data from start to finish. Nonetheless, NGS has rapidly enhanced the ability to detect rare and novel genomic variants, as, unlike other techniques such as microarrays or targeted resequencing, it does not require a priori knowledge.
Over the last several years, NGS, in particular WES, has significantly impacted the study of rare Mendelian diseases,2 by focusing in a nonhypothesis-driven approach on genetic alterations that cause changes in protein-coding regions. A cost-effective approach for the use of NGS in the clinical environment for detection of genetic variants associated with Mendelian diseases is to first perform exon-capture, followed by NGS.3 Exon-capture is a technique for enriching regions of the genome that are responsible for coding proteins, which represent ∼1% of the human genome.3 With the use of this WES approach, the coverage (both depth and breadth) of protein-coding regions should be sufficient to allow accurate detection of genomic alterations within those regions. For example, targeted NGS was recently used to analyze a region of the genome connected with Noonan syndrome and was able to validate variants previously associated with the syndrome, while also detecting new variants of interest.4 Exome capture can have technical biases, such as poor capture of a region of interest, and it is important to take these technical limitations into consideration when analyzing the results, as regions of the genome may not be represented equally.
Although WES is a powerful and cost-saving approach to study Mendelian disorders, a current bottleneck is in data analysis and data interpretation. The analysis of NGS data, whether with publicly available algorithms or commercially available tools, requires an appropriate computational infrastructure and basic skill sets in computer science, as the data files are large, and a graphical interface is not available for the majority of analysis software. Therefore, we established this project, in collaboration with clinical investigators, to develop solid and user-friendly tools that streamline NGS data analysis for the study of rare Mendelian disorders, thus bridging the gap between bioinformaticians and clinicians.
This article presents bioinformatics methodologies that were developed in response to specific needs of clinical investigators. A training dataset, previously validated, was used as hands-on clinical data to develop our user-friendly and robust clinical data-analysis pipeline. The training set included sequence data from Illumina paired-end WES, performed on 6 patients with a rare disorder characterized by facioskeletal abnormalities and on their unaffected family members (n = 14 samples in total). Specifically, we outline the use of publicly available databases and algorithms, plus the use of custom scripts and graphical interfaces, that when combined during data analysis, enhance a researcher’s ability to analyze genomic variant data from a systems biology context. The main objective was to develop downstream bioinformatics methodologies for filtering and annotating short exomic variants (SNP/InDel) and to create a flexible and robust pipeline that could integrate other types of genomic variant data, such as CNV. A custom graphical interface, iProXpress, was used for visualizing protein and disease-rich information for genes that contained a potential variant of interest and enabled pathway and Gene Ontology (GO) enrichment analysis.5, 6 Genomic variants were viewed in the context of aligned sequence reads by use of Integrative Genomics Viewer.7, 8
MATERIALS AND METHODS
Data analysis
Paired-end, exome-capture Illumina sequencing data were received as binary alignment (bam) files from Nemours Alfred I. duPont Hospital for Children (Newark, DE, USA). The NGS reads were extracted from the aligned bam files into fastq files by use of bam2fastq (Genomic Services Laboratory at HudsonAlpha, Huntsville, AL, USA), and the original bam files were discarded. The quality of the sequenced reads was examined by use of fastqc, a platform-independent NGS quality tool (Babraham Institute, Cambridgeshire, United Kingdom). The sequenced reads were of appropriate quality, and no adaptor sequences (i.e., over-represented sequences) were detected. The reads were then aligned to the human reference genome (hg19) by use of Burrows-Wheeler aligner (bwa)-mem, version bwa-0.7.4 (http://bio-bwa.sourceforge.net/), producing new bam alignment files. Average depth of coverage per exon (vertical) and average exon coverage (horizontal) were calculated per sample by use of Genome Analysis Toolkit (GATK) DiagnoseTarget (version 2.5-2). Following GATK best practices,9 bam files were processed by use of Picard Tools, version 1.67 (http://broadinstitute.github.io/picard/), and the GATK UnifiedGenotyper (hereafter GATK Unified), version 2.5-2, for SNP and InDel detection by use of the default parameters.10 GATK Unified is cited as being sensitive for detecting small InDels (<20 bp).11 The minimum base quality score used during SNP detection was 17, and the minimum phred-scaled confidence threshold at which variants should be called was 30. The GATK Unified variant call format (VCF; version 4.1) file was then annotated with SnpEff, version 3.3a,12 by use of the GRCh37.69 annotation package recommended by SnpEff.
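As a rough illustration of the alignment and variant-calling steps described above, the following Python sketch assembles command lines for bwa-mem and for the GATK 2.x UnifiedGenotyper walker. It is not the authors' script. The file names, the `GenomeAnalysisTK.jar` location, the thread count, and the choice to request SNPs and InDels together with `-glm BOTH` are assumptions, and the Picard preprocessing steps are omitted. The commands are only built here, not executed.

```python
def bwa_mem_command(reference, fastq1, fastq2, threads=4):
    """Build a paired-end bwa-mem command (bwa writes SAM to stdout)."""
    return ["bwa", "mem", "-t", str(threads), reference, fastq1, fastq2]


def unified_genotyper_command(reference, bams, out_vcf,
                              gatk_jar="GenomeAnalysisTK.jar"):
    """Build a multisample GATK 2.x UnifiedGenotyper command.

    The two thresholds mirror the values reported in the text
    (minimum base quality 17, calling confidence 30).
    """
    cmd = ["java", "-jar", gatk_jar, "-T", "UnifiedGenotyper", "-R", reference]
    for bam in bams:
        cmd += ["-I", bam]            # one -I per sample for joint calling
    cmd += [
        "-glm", "BOTH",               # assumption: call SNPs and InDels together
        "-mbq", "17",                 # minimum base quality score
        "-stand_call_conf", "30",     # phred-scaled calling threshold
        "-o", out_vcf,
    ]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(bwa_mem_command("hg19.fa", "proband1_1.fastq", "proband1_2.fastq")))
    print(" ".join(unified_genotyper_command(
        "hg19.fa", ["proband1.bam", "mother1.bam", "father1.bam"], "family1.vcf")))
```

Building the argument lists separately from execution keeps the parameters reviewable and easy to log, which is useful when quality-control settings need to be audited.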
Copy number inference from exome reads (CoNIFER), version 0.2.2, was used to determine CNV.13 All 14 samples were analyzed at the same time by use of CoNIFER [singular value decomposition (SVD) = 3] plus 3 exon-capture 1000 Genomes samples (NA18517, ERR034551; NA18507, SRR764745; NA18956, SRR766028). The visualization of CoNIFER data was generated by use of the plot call script provided. Text and data mining were executed by use of Search Tool for the Retrieval of Interacting Genes/Proteins (STRING),14 a database of known and predicted protein interactions derived from genomic context, high-throughput experiments, coexpression, and previous knowledge. The network maps were downloaded as text files and viewed in Cytoscape for customization of the network maps.15
A custom pipeline was created for annotating and filtering the SNPs and InDels. The first module of the pipeline, the Confidence Module, uses the genotype quality score (GQ), calculated by GATK Unified, and read depth. The Confidence Module annotates the VCF (version 4.1) with PASS (GQ > 20 and read depth ≥ 5) or FAIL. The next module in the pipeline, the Rare Module, first annotates the variant with a global minor allele frequency (MAF) by use of the SNP database (dbSNP)16 and then annotates the variant as PASS or FAIL based on the user-specified requirements. For the example analyzed in this paper, a MAF ≤ 4% was used as the threshold. This module also adjusts for positions at which the reference genome carries the minor allele rather than the major allele. The VCF file was then annotated with the Genetic Module to prioritize potential variants based on observed heritability patterns. The Genetic Module algorithm can annotate the alleles with the following heritability patterns: recessive, dominant, de novo, X-linked, or Y-linked. For the training set presented in this paper, a recessive or X-linked pattern of inheritance was assumed based on available pedigree evidence. An allele was considered a candidate for recessive inheritance if the parents were 0/1 (heterozygous) and the child was 1/1 (homozygous alternative). Likewise, a variant in an X-linked gene was considered a candidate when shared uniquely between the proband and his heterozygous mother. In previous validation studies of NGS data, our clinical collaborators had seen evidence of allelic imbalance resulting from tissue-specific somatic mosaicism (i.e., blood vs. buccal cells). Therefore, we also annotated alleles in which either of the parents was 0/1 or 0/0 (homozygous reference) and the child was 1/1 as potential candidates that would need further validation for a recessive pattern of inheritance. Any variant in which either of the parents was 1/1 and the child was 1/1 was not considered relevant, as none of the parents were affected.
With the use of the Human Genome Organization (HUGO) gene name or transcript ID, as annotated by SnpEff, variants were mapped to UniProt accession numbers. A custom output was then generated for iProXpress visualization. iProXpress is a publicly available web tool (http://proteininformationresource.org/iproxpress2/) that provides an integrated protein expression analysis system designed to help analyze proteomic and genomic data. The system includes protein data analysis for pathway and network discovery and is connected to >150 underlying databases. The resulting VCF file was converted into a user-friendly output format, simplifying review and interpretation by laboratory/clinical collaborators.
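The recessive genotype rules of the Genetic Module can be sketched in Python as follows. This is an illustrative reading of the rules stated above, not the authors' implementation. The label names (`PASS`, `VALIDATE`, `FAIL`) and the function names are hypothetical, the sketch assumes a complete trio with biallelic genotypes, and the X-linked rule is not modeled.

```python
import re

HOM_REF = ("0", "0")
HET = ("0", "1")
HOM_ALT = ("1", "1")


def normalize(genotype):
    """Turn a VCF GT string into a sorted allele tuple, e.g. '1|0' -> ('0', '1')."""
    return tuple(sorted(re.split(r"[/|]", genotype)))


def classify_recessive(child, mother, father):
    """Label one variant for a recessive model in a complete trio.

    PASS      both parents heterozygous, child homozygous alternative
    VALIDATE  child 1/1 and parents 0/1 or 0/0 (possible mosaicism or
              allelic imbalance; needs orthogonal validation)
    FAIL      everything else, including an unaffected 1/1 parent
    """
    c, m, f = normalize(child), normalize(mother), normalize(father)
    if c != HOM_ALT:
        return "FAIL"
    if HOM_ALT in (m, f):
        return "FAIL"                 # an unaffected parent carries 1/1
    if m == HET and f == HET:
        return "PASS"
    if m in (HET, HOM_REF) and f in (HET, HOM_REF):
        return "VALIDATE"
    return "FAIL"                     # missing or unexpected genotypes


if __name__ == "__main__":
    print(classify_recessive("1/1", "0/1", "0/1"))
    print(classify_recessive("1/1", "0/0", "0/1"))
    print(classify_recessive("1/1", "1/1", "0/1"))
```

Returning a label rather than dropping the record matches the annotate-rather-than-remove approach of the pipeline, so the same VCF can be re-evaluated under a different inheritance model.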
RESULTS AND DISCUSSION
The test dataset included 14 samples (6 probands and their unaffected parents and siblings) of Illumina paired-end WES. Figure 1 represents the pedigree information for the dataset that was used to aid in the development of the bioinformatics methodologies. Ideally, when analyzing Mendelian diseases, complete trio datasets are preferred; however, it is not always possible to obtain genomic sequencing data from all individuals, which is a common challenge in the clinical setting. Therefore, we focused on developing our pipelines, such that the downstream analysis would be robust and flexible for complete and incomplete trio datasets.
Figure 1.
Overview of pedigree. In total, 14 exon-captured samples were analyzed from 5 different families. Two of the families had complete trio datasets, 2 of the families were a mother and proband dataset, and the 5th family was only proband data. Circles, female; squares, male; black fill, affected individuals; white fill, unaffected individuals; gray fill, individuals without data.
Genome alignment of the exon-captured Illumina reads was performed by use of bwa-mem (Fig. 2), a recently developed algorithm that was added to the bwa alignment tool suite and has features similar to those of bwa-sw. bwa-mem is recommended for high-quality queries, as it is faster and more accurate compared with bwa-sw and has better performance than bwa-backtrack for 70–100 bp Illumina reads (http://bio-bwa.sourceforge.net/).
Figure 2.
Overview of bioinformatics methodologies and custom pipeline for annotating and filtering SNPs and InDels. Green box is the starting input fastq file. Square boxes represent processes in the workflow. Blue cylinder represents dbSNP. Orange boxes represent output files from the pipeline. Not all output files are represented in the general workflow. The iProXpress tool is diagrammed separately (see Fig. 4).
High-quality genome alignment of NGS reads is essential for accurate detection of genomic variants, although the measurement of the quality of genome alignment is not a simple task with a single well-defined formula. Publications tend to mention only percentage of reads mapped; however, this is not a comprehensive metric of genome alignment. To enhance further the quality checks implemented in our pipeline, GATK DiagnoseTarget was incorporated into the data analysis. DiagnoseTarget is an algorithm that measures the depth and breadth of coverage for a defined list of genome coordinates, which enables our pipeline to output summary statistics for the exact exons captured during library preparation.
As an example, in our test data, the average depth of coverage per exon per sample was ∼75 reads (Fig. 3A). Each sample had several extreme outliers, that is, exons with excessively high coverage. These upstream quality-control metrics should be taken into consideration in downstream analysis; accordingly, variants located in exons that had excessive coverage were flagged as potential artifacts. Several exons had a depth of coverage equal to 0, which matters for interpretation: if a variant is not detected in a potential gene of interest, it is essential to verify whether the gene had appropriate coverage and to note possible limitations of genomic variant detection.
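The coverage checks described above can be illustrated with a short Python sketch. It assumes per-base read depths are already available for each exon (for example, exported from a coverage tool), and it treats "excessive" coverage as anything above the upper boxplot fence, Q3 + 1.5 × IQR. That threshold is an assumption made for illustration, as the text does not state the cutoff that was used. The function and label names are hypothetical.

```python
from statistics import mean, quantiles


def summarize_exon(depths):
    """Return (mean depth, breadth) for one exon from per-base read depths.

    Breadth is the fraction of bases covered by at least one read.
    """
    if not depths:
        return 0.0, 0.0
    covered = sum(1 for d in depths if d > 0)
    return mean(depths), covered / len(depths)


def flag_exons(mean_depth_by_exon, iqr_factor=1.5):
    """Label each exon ZERO_COVERAGE, HIGH_COVERAGE, or OK.

    Assumption: "excessive" means above Q3 + iqr_factor * IQR.
    Requires at least two exons (a limit of statistics.quantiles).
    """
    q1, _, q3 = quantiles(mean_depth_by_exon.values(), n=4)
    upper_fence = q3 + iqr_factor * (q3 - q1)
    flags = {}
    for exon, depth in mean_depth_by_exon.items():
        if depth == 0:
            flags[exon] = "ZERO_COVERAGE"
        elif depth > upper_fence:
            flags[exon] = "HIGH_COVERAGE"
        else:
            flags[exon] = "OK"
    return flags


if __name__ == "__main__":
    # Hypothetical exon identifiers and depths.
    depths = {"exon1": 0, "exon2": 70, "exon3": 75, "exon4": 80,
              "exon5": 78, "exon6": 72, "exon7": 900}
    print(flag_exons(depths))
```

Exons labeled `ZERO_COVERAGE` mark regions where the absence of a variant call is uninformative, whereas `HIGH_COVERAGE` exons mark calls that may deserve extra scrutiny.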
Figure 3.
Summary of genome alignment. A) Boxplot of the depth of coverage for each exon captured. B) Bar plot of the percentage of exons without 100% coverage.
Algorithms for detecting SNPs in aligned bam files have been developed extensively over the past decade and are well established in the literature. For the dataset presented, GATK Unified was selected for detecting SNPs and InDels by use of the multisample option (Table 1). GATK Unified was selected as the preferred algorithm, as it is well maintained and supported; furthermore, upon initial comparisons with Samtools and GATK Haplotyper, it was more consistent in detecting previously established SNPs (data not shown).
TABLE 1.
Summary SNP and InDel detection
| Family | ID | Number of variants | Transition/transversion ratio | Heterozygous genotypes | Nonreference genotypes |
|---|---|---|---|---|---|
| 1 | Proband 1 | 421,972 | 2.08 | 78,804 | 211,524 |
| 1 | Mother | 511,566 | 2.04 | 92,661 | 282,545 |
| 1 | Father | 445,033 | 2.08 | 81,917 | 227,191 |
| 2 | Proband 2 | 455,220 | 2.06 | 84,801 | 232,552 |
| 3 | Proband 3 | 460,964 | 2.04 | 83,931 | 246,113 |
| 3 | Mother | 497,875 | 2.05 | 91,740 | 279,767 |
| 3 | Father | 515,782 | 2.04 | 96,387 | 291,874 |
| 3 | Brother | 487,733 | 2.05 | 90,222 | 268,627 |
| 3 | Half-sister | 470,183 | 2.06 | 90,098 | 252,503 |
| 4 | Proband 4 | 440,493 | 2.09 | 82,323 | 222,682 |
| 4 | Mother | 431,266 | 2.09 | 81,889 | 217,590 |
| 5 | Proband 5 | 427,036 | 2.13 | 97,201 | 232,074 |
| 5 | Proband 6 | 457,682 | 2.11 | 104,792 | 256,846 |
| 5 | Mother | 487,126 | 2.12 | 112,406 | 282,454 |
Quality checks for SNP and InDel detection are an important factor for establishing bioinformatics methodologies in clinical applications. Therefore, quality checks were implemented in the workflow for SNP and InDel detection. VCF tools are part of a publicly available package of programs designed to work with VCF files.17 We used a combination of VCF tools and PSEQ (http://pngu.mgh.harvard.edu/~purcell/plink/) to extract metrics on the SNPs and small InDels detected. Table 1 presents four calculations (number of total variants detected, transition/transversion ratio, number of heterozygous genotypes, and number of nonreference genotypes) and highlights the need to develop downstream annotation and filtering capabilities to further assess the high number of variants called. The transition/transversion ratio is expected to be ∼2 for exon-capture datasets,18 and our data were consistent with this trend. This metric was monitored at various stages of the downstream filtering steps.
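For readers unfamiliar with the metric, the transition/transversion ratio can be computed as in the following Python sketch. It is a minimal illustration, not the VCF tools or PSEQ implementation. It assumes the input is a list of (reference allele, alternative allele) pairs and simply skips InDels and other records that are not single-base substitutions.

```python
# Transitions exchange purines (A<->G) or pyrimidines (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")


def ti_tv_ratio(snps):
    """Return the transition/transversion ratio for (ref, alt) pairs.

    Returns None when no transversions are present, since the ratio is
    undefined in that case.
    """
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip InDels and anything that is not a simple SNP
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else None


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"),
             ("A", "C"), ("G", "T"), ("AT", "A")]
    print(ti_tv_ratio(calls))
```

Recomputing the ratio after each filtering step gives a quick check that the filters are not enriching for artifacts, which tend to pull the ratio away from the expected value.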
Next, we created a robust pipeline for annotating and filtering the detected variants. The first step of our annotation uses the publicly available algorithm SnpEff,12 which annotates variants and predicts the effects of the genetic alteration, enabling downstream clustering and analysis of SNPs based on functional predictions. Furthermore, we developed 3 custom modules capable of further annotating the SnpEff-annotated VCF with user-specified criteria. The Confidence Module extracts the GQ and allele depths per individual. If the GQ and depth meet the user-specified criteria, the variant is annotated as PASS for that individual; otherwise, it is annotated as FAIL. The Confidence Module was developed with flexibility to allow the user to specify whether both alleles need to have the same read depth. This is an important consideration, as allelic imbalance can arise from mosaic samples. Mosaicism could indicate a mutation that happened early in development and therefore is not present in every cell of the body.19 However, such apparent allelic imbalance in the data may also arise artificially as a result of errors during sequencing. As with all SNPs and InDels detected by NGS, it is essential for the researcher to validate the candidates of interest via an alternative method, such as Sanger sequencing.
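The per-individual PASS/FAIL decision of the Confidence Module can be sketched in Python as follows. The sketch assumes the standard VCF per-sample keys `GQ` and `DP` and uses the thresholds given in the Methods (GQ > 20, read depth ≥ 5) as defaults. The function name is hypothetical, and the optional per-allele depth comparison is not modeled.

```python
def confidence_label(format_field, sample_field, min_gq=20, min_depth=5):
    """Return 'PASS' or 'FAIL' for one sample column of a VCF record.

    format_field: the FORMAT column, e.g. 'GT:AD:DP:GQ:PL'
    sample_field: the matching sample column, e.g. '1/1:0,101:101:99:3500,300,0'
    Missing or non-numeric GQ/DP values are treated as FAIL.
    """
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))
    try:
        gq = float(fields["GQ"])
        depth = int(fields["DP"])
    except (KeyError, ValueError):
        return "FAIL"
    return "PASS" if gq > min_gq and depth >= min_depth else "FAIL"


if __name__ == "__main__":
    fmt = "GT:AD:DP:GQ:PL"
    print(confidence_label(fmt, "1/1:0,101:101:99:3500,300,0"))
    print(confidence_label(fmt, "0/1:2,2:4:35:60,0,60"))
    print(confidence_label(fmt, "./."))
```

Keeping the thresholds as parameters reflects the user-specified nature of the criteria, so stricter or looser settings can be applied without changing the code.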
The next portion of the pipeline involves annotating the alternative allele with a MAF. For rare inherited disorders, a common paradigm is that a true causative SNP will have an extremely low MAF and/or may not be present in any databases, such as dbSNP. With the vast amount of NGS data produced and analyzed and the incorporation of disease and nondisease results into public databases, we did not focus our pipeline on prioritizing variants that were considered novel (i.e., no dbSNP reference SNP ID number). Instead, we annotated the SNPs with a global MAF by use of dbSNP and allowed the user to specify the MAFs of interest. For the presented case study, a MAF ≤ 4% was used (Table 2). A quality check implemented for this module was the creation of a distribution plot of the annotated MAFs. During the initial development of this module, it was noted that several of the SNPs had MAFs reported by dbSNP >50%, which is a contradictory metric, indicating that for these alleles, the reference genome used for alignment (hg19) represents the minor allele and not the major allele. This is a very important point that can be overlooked easily in analysis, resulting in true minor alleles being filtered out, as the reported frequency was, for instance, in the 96–100% range rather than the expected 0–4%. Therefore, we enhanced the Rare Module to take into consideration alleles that were identified at positions in which the reference genome does not represent the major allele (Table 2).
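One way to handle positions where the reference genome carries the minor allele is to fold the reported frequency, as in the following Python sketch. This is an assumed approach for illustration; the text does not specify how the Rare Module performs its adjustment. The function name, the returned tuple, and the decision to let variants without a reported frequency pass are all hypothetical choices.

```python
def rare_label(alt_allele_freq, max_maf=0.04, missing_passes=True):
    """Return (label, minor_allele_freq, reference_is_minor).

    alt_allele_freq: population frequency reported for the alternative
        allele (0 to 1), or None when the variant has no reported frequency.
    A frequency above 0.5 means the reference allele is the minor allele,
    so the minor allele frequency is taken as 1 - alt_allele_freq.
    """
    if alt_allele_freq is None:
        # Assumption: variants absent from the database are treated as rare.
        return ("PASS" if missing_passes else "FAIL"), None, False
    reference_is_minor = alt_allele_freq > 0.5
    maf = 1.0 - alt_allele_freq if reference_is_minor else alt_allele_freq
    label = "PASS" if maf <= max_maf else "FAIL"
    return label, maf, reference_is_minor


if __name__ == "__main__":
    for freq in (0.034, 0.98, 0.30, None):
        print(freq, rare_label(freq))
```

Reporting the `reference_is_minor` flag alongside the label makes these positions easy to review separately, since the rare allele at such a site is the reference allele rather than the called alternative.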
TABLE 2.
Summary SNP detection and annotation
| Family | ID | Genomic variants detected | Variants passing Confidence Module | Variants with MAF ≤ 4% | Variants passing recessive genetic analysis | Controls subtracted |
|---|---|---|---|---|---|---|
| 1 | Proband 1 | 421,972 | 185,033 | 70,753 | 385 | 202 |
| 1 | Mother | 511,566 | 212,023 | 82,205 | – | – |
| 1 | Father | 445,033 | 195,864 | 74,950 | – | – |
| 2 | Proband 2 | 455,220 | 201,467 | 76,843 | – | 506 |
| 3 | Proband 3 | 460,964 | 203,705 | 77,384 | 481 | 417 |
| 3 | Mother | 497,875 | 200,679 | 77,855 | – | – |
| 3 | Father | 515,782 | 204,281 | 78,865 | – | – |
| 3 | Brother | 487,733 | 213,720 | 81,722 | – | – |
| 3 | Half-sister | 470,183 | 209,129 | 79,959 | – | – |
| 4 | Proband 4 | 440,493 | 198,252 | 75,692 | 833 | 318 |
| 4 | Mother | 431,266 | 192,376 | 73,843 | – | – |
| 5 | Proband 5 | 427,036 | 186,047 | 69,891 | 543 | 306 |
| 5 | Proband 6 | 457,682 | 202,122 | 76,133 | 543 | 306 |
| 5 | Mother | 487,126 | 209,695 | 79,244 | – | – |
Following the Rare Module, the VCF file was annotated with the Genetic Module. For this example, a recessive inheritance pattern (Table 2) was used based on the pedigree information and the expert knowledge of our research collaborators; however, other inheritance models are also supported. It is important to note that during the filtering process, no candidate variants are truly removed from the VCF; they are simply annotated with PASS/FAIL indications for given filters, allowing for flexibility to refilter the data easily. After our robust annotation and filtering of the variants, the VCF is then decomposed into focused lists of genomic variants of interest for each individual. A user-friendly Variant Output File (Table 3) is generated to allow the clinician to filter easily on desired characteristics, such as MAF or effect of variant. A single variant is represented per line (row), and the filterable columns can be divided into 5 categories: genomic, protein, pathway, disease, and patient-specific. Table 3 lists all of the filterable characteristics and provides a single-row example from the Variant Output File (Example row Variant Output). By providing the user-friendly Variant Output File, our pipeline connects numerous resources and enables a flexible filtering strategy.
TABLE 3.
Filterable characteristics Variant Output File
| Characteristic | Feature | Resource | Example row Variant Output |
|---|---|---|---|
| Genome | Chromosome | hg19 | chr10 |
| Genome | Position of variant | hg19 | 17171656 |
| Genome | Variant ID | dbSNP | rs76788243 |
| Genome | Reference allele | hg19 | T |
| Genome | Alternative allele | GATK Unified | G |
| Genome | Quality of variant | GATK Unified | 9485.04 |
| Genome | Gene name | HUGO | CUBN |
| Genome | Effect of variant | SnpEff | Nonsynonymous change |
| Genome | MAF | dbSNP | 0.034 |
| Genome | Rare Module | Custom | Pass |
| Protein | Biotype | Ensembl | Protein coding |
| Protein | Amino acid change | SnpEff | I37L |
| Protein | Protein feature | UniProt | Mature (processed) chain |
| Protein | UniProt accession | UniProt | O60494 |
| Pathway | KEGG pathway | KEGG | Organismal systems |
| Disease | Gene associated with disease | UniProt | Recessive hereditary megaloblastic anemia 1 MIM:261100 |
| Disease | Genetic inheritance pattern | Custom | Recessive |
| Patient-specific | Genotype probands | GATK Unified | G/G |
| Patient-specific | Confidence Module | Custom | Pass |
| Patient-specific | Allele depth | GATK Unified | 101 |
KEGG, Kyoto Encyclopedia of Genes and Genomes.
Variants were selected that passed all 3 custom modules and had 1 of the following SnpEff categories: codon change and insertion, codon insertion, frame shift, nonsynonymous coding, or start gained. Gene lists were created for each proband, mapped to UniProt, and loaded into the custom iProXpress web interface. The web-based iProXpress provides tools for functional profiling, such as pathway and GO enrichment analysis, and allows for custom display of selected fields from >150 underlying databases, such as Online Mendelian Inheritance in Man disease information. The pathway enrichment tool can be used to identify pathways in which probands share mutations, although not necessarily in the same gene (Fig. 4).
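The selection step described above can be sketched in Python as a simple filter over annotated variant rows. The dictionary keys (`confidence`, `rare`, `genetic`, `effect`, `gene`) are hypothetical column names, and the effect identifiers are written in the upper-case style used in SnpEff 3.x output; the exact strings should be checked against the annotated VCF before use.

```python
# SnpEff effect categories named in the text (identifier spelling assumed).
SELECTED_EFFECTS = {
    "CODON_CHANGE_PLUS_CODON_INSERTION",
    "CODON_INSERTION",
    "FRAME_SHIFT",
    "NON_SYNONYMOUS_CODING",
    "START_GAINED",
}

# Hypothetical column names holding the PASS/FAIL label of each module.
MODULES = ("confidence", "rare", "genetic")


def is_selected(variant):
    """True when a variant passed all 3 modules and has a selected effect."""
    passed_all = all(variant.get(module) == "PASS" for module in MODULES)
    return passed_all and variant.get("effect") in SELECTED_EFFECTS


def gene_list(variants):
    """Return the sorted, de-duplicated gene names of the selected variants."""
    return sorted({v["gene"] for v in variants if is_selected(v)})


if __name__ == "__main__":
    rows = [
        {"gene": "CUBN", "effect": "NON_SYNONYMOUS_CODING",
         "confidence": "PASS", "rare": "PASS", "genetic": "PASS"},
        {"gene": "GENE_A", "effect": "SYNONYMOUS_CODING",
         "confidence": "PASS", "rare": "PASS", "genetic": "PASS"},
        {"gene": "GENE_B", "effect": "FRAME_SHIFT",
         "confidence": "PASS", "rare": "FAIL", "genetic": "PASS"},
    ]
    print(gene_list(rows))
```

Because the filter only reads the PASS/FAIL annotations, the underlying variant records stay intact and a different combination of modules or effects can be applied later.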
Figure 4.
Integrating genomic variant data with protein and disease-rich information: iProXpress. The iProXpress web interface 1) allows the user to select which data fields will be visualized on the webpage; 2) links the gene of interest to rich protein information by providing a hyper-text link to the UniProt resource; 3) displays GO terms for the protein of interest with hyper-text links; and 4) provides pathway and GO enrichment analysis.
SNPs and InDels are not the only type of genomic variant that can influence protein-coding regions of the genome. CNVs in gene-coding regions may also influence gene-expression levels,20 and it has been reported previously that approximately half of identified CNVs overlap with protein-coding regions.21 Computational algorithms for determining structural variants, such as CNVs, from aligned bam files follow various strategies. For example, some tools use paired-end and/or longer read mapping to identify “breakpoints” in the DNA coverage, whereas others use differences in read depth. CoNIFER was selected as the algorithm for detecting CNV, as it uses SVD to normalize exon read-depth data and avoids batch bias by integrating multiple samples.13
All 14 samples, plus 3 control samples chosen from the 1000 Genomes Project data, were analyzed with CoNIFER simultaneously. With the use of the inflection point in the singular values plot, it was determined that an SVD value of 3 was appropriate for the analysis. Table 4 summarizes the CNVs detected. Figure 5 is a graphical display of chromosome 16 for 3 of the probands plus a control. This region was selected for graphical display, as 3 of the probands from 2 unrelated families had a similar deletion pattern, and there have been recent publications linking this region of the human genome with phenotypes closely related to this dataset.22
TABLE 4.
Summary CNV-detected CoNIFER
| Family | ID | Total | Duplications | Deletions | Chromosomes |
|---|---|---|---|---|---|
| 1 | Proband 1 | 8 | 2 | 6 | 16 |
| 1 | Mother | 5 | 5 | 0 | 13, 14, 16 |
| 1 | Father | 11 | 3 | 8 | 14, 16, 19, 5 |
| 2 | Proband 2 | 6 | 5 | 1 | 13, 14, 4, 5, 6, 8 |
| 3 | Proband 3 | 3 | 1 | 2 | 19 |
| 3 | Mother | 4 | 3 | 1 | 19, 7, 9 |
| 3 | Father | 2 | 2 | 0 | 1, 13 |
| 3 | Half-sister | 7 | 6 | 1 | 1, 16, 17, Y, 5, 7 |
| 4 | Proband 4 | 1 | 0 | 1 | 1 |
| 4 | Mother | 2 | 2 | 0 | 15 |
| 5 | Proband 5 | 10 | 5 | 5 | 1, 16, 19, 6, 7 |
| 5 | Proband 6 | 8 | 1 | 7 | 1, 16, 19, 4 |
| 5 | Mother | 4 | 2 | 2 | 1, 17, Y |
Figure 5.
CNV detection. Graphical display of CNV detected by CoNIFER. x-Axis, position on chromosome; y-axis, SVD-ZRPKM values for each exon calculated by CoNIFER; red lines, SVD-ZRPKM values for each probe from the sample of interest; purple bars, genes; gray lines, smoothed SVD-ZRPKM values for each probe for a given sample. RPKM, reads per thousand bases per million reads mapped (Mortazavi et al. 2008). RPKM values were transformed into standardized Z-scores (termed ZRPKM values) based on the mean and standard deviation across all analyzed exomes.
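The RPKM and ZRPKM quantities defined in the legend above can be illustrated with the following Python sketch. It follows the legend's description (Z-scores from the mean and standard deviation across exomes) and is not CoNIFER's own code. The use of the population standard deviation is an assumption, the function names are hypothetical, and the subsequent SVD step that yields SVD-ZRPKM values is not shown.

```python
from statistics import mean, pstdev


def rpkm(reads_in_exon, exon_length_bp, total_mapped_reads):
    """Reads per thousand bases of exon per million mapped reads."""
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return reads_in_exon / (kilobases * millions)


def zrpkm(rpkm_by_sample):
    """Standardize one exon's RPKM values across samples (Z-scores).

    Assumption: the population standard deviation is used. If all
    samples have the same RPKM, every Z-score is reported as 0.0.
    """
    values = list(rpkm_by_sample.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {sample: 0.0 for sample in rpkm_by_sample}
    return {sample: (value - mu) / sigma
            for sample, value in rpkm_by_sample.items()}


if __name__ == "__main__":
    # Hypothetical counts for a single 500-bp exon in three samples.
    exon_rpkm = {
        "proband": rpkm(25, 500, 2_000_000),
        "mother": rpkm(50, 500, 2_000_000),
        "control": rpkm(55, 500, 2_000_000),
    }
    print(zrpkm(exon_rpkm))
```

A strongly negative Z-score for an exon in one sample relative to the others is the kind of signal that read-depth methods interpret as a candidate deletion, and a strongly positive one as a candidate duplication.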
A list of unique genes that contained significant CNVs was created per each individual and was combined with the gene lists of interest identified from the SNP and InDel detection. This strategy allowed us to integrate the SNP analysis and the CNV analysis to discover the putative candidate gene associated with this rare disorder. This gene is still currently under investigation by our clinical collaborators. To enhance the development of the pipeline further, text- and data-mining modules were developed. The unique gene lists for each sample were combined into one large list and batch-uploaded into the STRING interface to view protein–protein interactions between the translated gene candidates. The network text file generated by STRING was downloaded and viewed inside Cytoscape. The STRING analysis yielded several protein–protein interactions between the genes of interest for each of the probands and provided a potential system for connecting genes with genomic variants between different probands.
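The integration of the SNP/InDel and CNV gene lists described above amounts to set operations, as in the following Python sketch. The function names and the input layout (a mapping from a sample identifier to its gene names) are hypothetical, and the gene names in the example are placeholders rather than results from this study.

```python
def combine_gene_lists(snv_genes, cnv_genes):
    """Merge SNP/InDel and CNV gene lists per sample.

    Both arguments map a sample ID to an iterable of gene names.
    Returns a dict mapping each sample ID to the set of unique genes.
    """
    samples = set(snv_genes) | set(cnv_genes)
    return {
        sample: set(snv_genes.get(sample, ())) | set(cnv_genes.get(sample, ()))
        for sample in samples
    }


def pooled_gene_list(genes_by_sample):
    """Pool all per-sample genes into one sorted list for batch upload."""
    pooled = set()
    for genes in genes_by_sample.values():
        pooled |= genes
    return sorted(pooled)


if __name__ == "__main__":
    snv = {"proband1": ["GENE_A", "GENE_B"], "proband2": ["GENE_C"]}
    cnv = {"proband1": ["GENE_B", "GENE_D"], "proband3": ["GENE_E"]}
    combined = combine_gene_lists(snv, cnv)
    # One gene per line is a convenient layout for pasting into a web form.
    print("\n".join(pooled_gene_list(combined)))
```

Keeping the per-sample sets alongside the pooled list makes it straightforward to trace any interaction found in the network back to the probands that carry variants in the interacting genes.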
Whole-genome sequencing generates massive amounts of data that, even when processed, can be difficult for clinicians and biomedical scientists to analyze and apply appropriately in the medical health field. Rigorous bioinformatics methodologies are required to analyze the data with appropriate statistical methods that will ultimately link the genetic data to the disease phenotype. In total, 14 exome NGS samples were analyzed, and our innovative annotation of variants allowed a thorough analysis of genomic variants by use of a systems biology approach.
Our bioinformatics methodologies incorporated several different types of genomic alteration detection methods and allowed for a comprehensive understanding of the genomic architecture for the rare disease analyzed. By applying our system of annotations, prioritizations, inheritance filters, and functional profiling and analysis, we have created a unique methodology for further filtering of disease-relevant variants that impact protein-coding genes. After applying filters for high-impact variants and recessive variants inherited from parents, we obtained a focused list of potential clinically relevant variants. Our methodology allows for the analysis of variant lists (e.g., variants with high and moderate impact) by use of pathway/GO enrichment tools that use a systems biology approach. Taken together, the integrative approach allows better selection of disease relevant genomic variants by use of both genomic and disease/protein-centric information. Overall, our analysis pipeline makes a complex bioinformatics workflow more approachable for clinicians and researchers.
Acknowledgments
The authors give special thanks to Karol Miaskiewicz and the BioIT staff for extensive computer systems support. This project benefitted from access to the Biohen computational cluster, hosted by the Delaware Biotechnology Institute and University of Delaware Bioinformatics Core. This project was partially supported by the Delaware IDeA Network of Biomedical Research Excellence program, with a grant from the National Institute of General Medical Sciences, U.S. National Institutes of Health (8 P20 GM103446-13).
REFERENCES
- 1. Metzker ML. Sequencing technologies—the next generation. Nat Rev Genet 2010;11:31–46.
- 2. Gilissen C, Hoischen A, Brunner HG, Veltman JA. Unlocking Mendelian disease using exome sequencing. Genome Biol 2011;12:228.
- 3. Wang Z, Liu X, Yang B-Z, Gelernter J. The role and challenges of exome sequencing in studies of human diseases. Front Genet 2013;4:160.
- 4. Lepri FR, Scavelli R, Digilio MC, Gnazzo M, Grotta S, Dentici ML, Pisaneschi E, Sirleto P, Capolino R, Baban A, Russo S, Franchin T, Angioni A, Dallapiccola B. Diagnosis of Noonan syndrome and related disorders using target next generation sequencing. BMC Med Genet 2014;15:14.
- 5. Huang H, Hu Z-Z, Arighi CN, Wu CH. Integration of bioinformatics resources for functional analysis of gene expression and proteomic data. Front Biosci 2007;12:5071–5088.
- 6. McGarvey PB, Zhang J, Natale DA, Wu CH, Huang H. Protein-centric data integration for functional analysis of comparative proteomics data. Methods Mol Biol 2011;694:323–339.
- 7. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. Integrative genomics viewer. Nat Biotechnol 2011;29:24–26.
- 8. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013;14:178–192.
- 9. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:491–498.
- 10. Mckenna A, Hanna M, Banks E, et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–1303.
- 11. Narzisi G, O’Rawe JA, Iossifov I, Fang H, Lee YH, Wang Z, Wu Y, Lyon GJ, Wigler M, Schatz MC. Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nat Methods 2014;11:1033–1036.
- 12. Cingolani P, Platts A, Wang L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012;6:80–92.
- 13. Krumm N, Sudmant PH, Ko A, O’Roak BJ, Malig M, Coe BP, Quinlan AR, Nickerson DA, Eichler EE; NHLBI Exome Sequencing Project. Copy number variation detection and genotyping from exome sequence data. Genome Res 2012;22:1525–1532.
- 14. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, Bork P, von Mering C. STRING 8—a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res 2009;37:D412–D416.
- 15. Saito R, Smoot ME, Ono K, Ruscheinski J, Wang PL, Lotia S, Pico AR, Bader GD, Ideker T. A travel guide to Cytoscape plugins. Nat Methods 2012;9:1069–1076.
- 16. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29:308–311.
- 17. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R; 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics 2011;27:2156–2158.
- 18. Keller I, Bensasson D, Nichols RA. Transition-transversion bias is not universal: a counter example from grasshopper pseudogenes. PLoS Genet 2007;3:e22.
- 19. Graham JM Jr., Hennekam RC. Genetics of common malformations. Eur J Med Genet 2014;57:353–354.
- 20. Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics 2013;14 (Suppl 11):S1.
- 21. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M. Large-scale copy number polymorphism in the human genome. Science 2004;305:525–528.
- 22. Tropeano M, Ahn JW, Dobson RJ, Breen G, Rucker J, Dixit A, Pal DK, McGuffin P, Farmer A, White PS, Andrieux J, Vassos E, Ogilvie CM, Curran S, Collier DA. Male-biased autosomal effect of 16p13.11 copy number variation in neurodevelopmental disorders. PLoS ONE 2013;8:e61365.





