Multi-Perspective Quality Control of Illumina Exome Sequencing Data Using QC3

Yan Guo; Shilin Zhao; Quanhu Sheng; Fei Ye; Jiang Li; Brian Lehmann; Jennifer Pietenpol; David C Samuels; Yu Shyr

doi:10.1016/j.ygeno.2014.03.006

Author manuscript; available in PMC: 2018 Jan 5.

Published in final edited form as: Genomics. 2014 Apr 3;103(5-6):323–328. doi: 10.1016/j.ygeno.2014.03.006

PMCID: PMC5755963 NIHMSID: NIHMS912921 PMID: 24703969

Abstract

Advances in next-generation sequencing (NGS) technologies have greatly improved our ability to detect genomic variants for biomedical research. The advance in NGS technologies has also created significant challenges in bioinformatics. One of the major challenges is the quality control of sequencing data. There has been heavy focus on performing raw data quality control. In order to correctly interpret the quality of the DNA sequencing data, however, proper quality control should be conducted at all stages of DNA sequencing data analysis: raw data, alignment, and variant detection. We designed QC3, a quality control tool aimed at those three major stages of DNA sequencing. QC3 monitors quality control metrics at each stage of NGS data and provides unique and independent evaluations of the data quality from different perspectives. QC3 offers unique features such as detection of batch effect and cross contamination. QC3 and its source code are freely downloadable at https://github.com/slzhao/QC3.

1. Introduction

High throughput sequencing is the most effective way to screen for non-specific germline variants, somatic mutations, and structural variants. Some of the most popular sequencing paradigms in DNA sequencing are exome sequencing, whole genome sequencing, and target panel sequencing. While vastly informative, sequencing data poses significant bioinformatics challenges in areas such as data storage, computation time, and variant detection accuracy. One of the easily overlooked challenges associated with sequencing is quality control (QC). Quality control for DNA sequencing data can be categorized into three stages: raw data, alignment, and variant calling [1]. One of the major misconceptions of DNA sequencing quality control is that quality control is only needed at the raw data stage. QC on raw sequencing data has been given more attention than QC on alignment and variant calling, and this is also reflected in the number of tools developed for raw data QC versus alignment and variant calling. Raw data QC tools include FastQC [2], FastQ Screen [3], FASTX-Toolkit [4], NGS QC Toolkit [5], PRINSEQ [6], and QC-Chain [7]. In contrast, fewer tools have been developed for conducting quality control on alignment and variant calling, despite the fact that quality control is essential at all three stages of DNA sequencing data processing. Notable alignment QC tools include SAMStat [8] and GATK’s QC measures [9]. SAMStat focuses on read mapping statistics such as the distribution of mismatches and insertions. When representing mapping statistics, SAMStat uses a pie chart, which arbitrarily divides reads by mapping quality score. In our approach, we use a distribution plot, which does not arbitrarily stratify reads. GATK’s QC measures consist of a series of independent scripts and tools for various purposes. The disadvantages are that GATK must be run multiple times to obtain complete QC results and that its results are hard to interpret for less experienced users.
At the raw data stage, quality control serves as a quick screening for data with serious quality issues resulting from the sequencer, flagging those data with questionable quality. Quality control at the alignment stage focuses on alignment quality, which is crucial for successful variant detection. Quality control at the variant calling stage is the last chance to identify samples with quality issues that slipped through quality control at earlier stages and to further reduce the number of false-positive variants.

We have previously detailed three-stage QC strategies in Briefings in Bioinformatics [1]. Here, we present the actual implementation of these strategies in a single package: QC3, a quality control tool for all three stages of DNA sequencing data processing. QC3 provides both graphic and tabulated reports for quality control results. It also offers several unique features such as separation of good and bad quality reads (based on Illumina’s filter) and the detection of batch effects and cross contamination of samples. QC3 accepts three types of data as input: FASTQ [10], Binary Alignment Map (BAM) [11], and Variant Call Format (VCF), corresponding to the three stages of raw data, alignment, and variant detection, respectively. The overview of the concept behind QC3 is shown in Figure 1. QC3 is written in a combination of Perl and R and is freely available for public use. QC3 can be configured to run in serial or in parallel. We recommend running the complete QC in the sequential order of the sequencing analysis. If needed, however, QC3 can perform the QC checks individually at each stage. QC3 produces results within a reasonable processing time. For example, on a Linux machine with a single-core 2.8 GHz CPU and 8 GB of memory, QC3 can process a standard sequencing dataset (30–50 million paired-end reads) in around 20 minutes at each stage, with some variation depending on file size.

Figure 1. The quality control strategies behind QC3.

2. Materials and methods

2.1. Raw data

Sequencing data analysis depends on raw data quality control for informative results, and there are many publicly available tools that suit this need. Investigators need to be aware of several considerations regarding these quality control parameters. The first is the Phred scale used to measure base quality. The standard scale is Phred+33 (ASCII 33 to 126), but older Illumina pipelines used an offset of 64: Solexa-scaled scores before pipeline 1.3 (ASCII 59 to 126) and Phred+64 from pipeline 1.3 until Casava 1.8 (ASCII 64 to 126) [10]. Misidentifying the Phred scale [12, 13] will result in errors. Correctly scaled data will have a median base quality score between 35 and 40 [14] for Illumina-generated data, and the shape of a base quality vs. cycle plot should be the same regardless of Phred scale. The second consideration is the nucleotide distribution across sequencing cycles. Ideally, the four nucleotides should have a stable distribution across all reads, although there may be some fluctuation at the ends of reads or in the beginning cycles on some platforms. Nucleotide distribution is closely related to base quality, and an unstable nucleotide distribution will often accompany poor base quality. Both are good indicators of raw data quality. The third consideration is GC content. While the percentage of GC content varies across species and genomic regions, a significant deviation (over 10%) from the expected range of the sample may indicate contamination. For raw data quality control, QC3 offers functionality similar to that of existing tools: base quality distribution by cycle (visualized as a Q-score vs. cycle plot), nucleotide distribution by cycle, and GC content.
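
As an illustration of the Phred scale consideration, the following minimal Python sketch guesses the ASCII offset of a set of FASTQ quality strings and decodes them. It is not taken from QC3 (which is written in Perl and R); the function names are hypothetical, and the heuristic assumes that any character below ASCII 59 can only come from offset-33 data. A small file containing only high-quality offset-33 bases could therefore be misclassified.

```python
from statistics import median

def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest quality character seen."""
    lowest = min(ord(ch) for qual in quality_strings for ch in qual)
    # Characters below ASCII 59 cannot occur in any offset-64 encoding.
    return 33 if lowest < 59 else 64

def decode_qualities(quality_string, offset):
    """Convert one FASTQ quality string into a list of integer quality scores."""
    return [ord(ch) - offset for ch in quality_string]

def median_base_quality(quality_strings, offset):
    """Median quality score over all bases of all reads."""
    scores = [q for qual in quality_strings
              for q in decode_qualities(qual, offset)]
    return median(scores)
```

A median far outside the expected range after decoding is a hint that the wrong offset was assumed.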

An important feature that distinguishes QC3 from other QC tools is that QC3 performs data quality assessment by separating the low-quality reads from the high-quality reads. Low-quality reads are denoted with the letter “Y” in the 9th field of each read’s description row, which is present in all FASTQ data generated by Illumina Casava 1.8+. Since the majority of current DNA sequencing data is paired-end, differences between read 1 and read 2 can provide information about read quality. Due to reasons such as phasing/pre-phasing, decreased signal-to-noise ratio, and template damage over many cycles of laser imaging, the second read sometimes has significantly worse quality than the first read. QC3 takes both reads into account and can detect discrepancies by calculating Pearson’s correlation and the Euclidean distance between the base quality scores of the two reads. A significantly different base quality distribution between the two reads of a pair would indicate a quality issue.
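
The two ideas in this paragraph can be sketched in a few lines of Python. This is an illustrative sketch rather than QC3's own code: the helper names are hypothetical, the header layout assumed is the Casava 1.8+ form `@instrument:run:flowcell:lane:tile:x:y read:is_filtered:control:index`, and the inputs to the two distance measures are assumed to be per-cycle mean base qualities of read 1 and read 2.

```python
import math

def is_filtered(header):
    """True when a Casava 1.8+ header flags the read as filtered ('Y')."""
    parts = header.split()
    if len(parts) < 2:
        raise ValueError("not a Casava 1.8+ style header")
    # Second block is read:is_filtered:control:index
    return parts[1].split(":")[1] == "Y"

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)

def euclidean(xs, ys):
    """Euclidean distance between two equally long numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))
```

A low correlation or a large distance between the read 1 and read 2 quality profiles would then prompt a closer look at the sample.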

Quality control on raw sequencing data provides quick insight into sample quality and can potentially save a significant amount of time in later analysis by allowing early identification of questionable samples. Passing quality control at the raw data level, however, does not necessarily mean that a sample is free from quality issues. Nor does questionable quality within a sample mean that the sample is necessarily useless for subsequent analysis. For example, a portion of reads in a sample can be bad, causing the sample to fail the raw data quality control checks, but after removing those bad reads, a sufficient number of good quality reads may still be present to allow further meaningful analysis to be carried out. Raw data quality control is necessary and informative, but one cannot determine the sample quality solely based on the raw data quality control results.

2.2. Alignment

Alignment is a non-optional step for any re-sequencing analysis. Quality control on the alignment provides additional insights into sample quality and can help identify bad samples that slip through the raw data quality control checks. Despite this, quality control of the alignment stage is not formally included in many sequencing data analysis pipelines and is not performed on a regular basis. For exome sequencing, one of the most important QC parameters is the capture efficiency, defined as the number of reads aligned to capture regions divided by the total number of reads remaining after quality filtering. Capture efficiency is highly dependent on the capture kit. There are three major exome capture kits on the market: Illumina TruSeq, Agilent SureSelect, and NimbleGen SeqCap EZ. The capture regions for the exome capture kits range from 37.6 to 62.1 million base pairs. Other capture techniques available include array-based and selector probe-based methods. Previous studies have shown that capture efficiencies between 40% and 70% are typical for exome sequencing [14–17]. Lower capture efficiencies can indicate low complexity in the target library, sub-optimal probe hybridization conditions, or low-stringency washes post-capture. QC3 evaluates capture efficiency if the capture kit’s BED file is provided. Otherwise, QC3 uses the standard exome annotation file from RefSeq to define the exome regions as capture regions.
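
The capture efficiency definition above can be illustrated with a short Python sketch. It is a simplified stand-in, not QC3's implementation: the function names are hypothetical, intervals and reads are assumed to be 0-based half-open `(chrom, start, end)` tuples as in BED, the reads are assumed to have already passed the quality filter, and a read is counted as on target when it overlaps a capture interval by at least one base.

```python
from bisect import bisect_right

def build_index(bed_intervals):
    """Merge overlapping intervals per chromosome; return {chrom: (starts, ends)}."""
    by_chrom = {}
    for chrom, start, end in sorted(bed_intervals):
        merged = by_chrom.setdefault(chrom, [])
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous interval
        else:
            merged.append([start, end])
    return {c: ([s for s, _ in iv], [e for _, e in iv])
            for c, iv in by_chrom.items()}

def overlaps_target(index, chrom, start, end):
    """True if the read [start, end) overlaps any capture interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last interval that begins before the read ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start

def capture_efficiency(reads, index):
    """Fraction of reads that overlap the capture regions."""
    on_target = sum(overlaps_target(index, c, s, e) for c, s, e in reads)
    return on_target / len(reads)
```

Because the intervals are merged and sorted, a single binary search per read is enough to decide whether it is on target.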

Furthermore, it has been shown that reads aligned to intronic and intergenic regions can generate high-quality data [17], and that the mitochondrial genome sequence can be extracted from exome sequencing data [18]. Thus, we designed QC3 to separate reads into several different categories (on target, intronic, intergenic, mitochondrial, and unmapped) and to compute quality control parameters such as the median and mean mapping quality, insert size, and depth for each category. Median depth is not calculated by default due to the long computation time required, but the user can choose to run the calculation. As informative as quality control on alignment can be, contaminated samples might not be detectable at this stage.

2.3. SNP Detection

For the majority of exome sequencing studies, detecting SNPs is one of the pivotal steps leading toward the final conclusion of the study. Quality control on SNP calls will not only help identify bad samples that have passed through the raw data and alignment quality control checks but will also minimize the rate of false positive SNP calls.

QC3 uses several important criteria for evaluating SNP calls, including the transition/transversion (Ti/Tv) ratio and the heterozygosity to non-reference homozygosity ratio (het/non-ref-hom). The Ti/Tv ratio has been used by multiple studies [18, 19] as a quality control parameter for assessing the overall SNP quality. The Ti/Tv ratio is computed as the number of transition SNPs divided by the number of transversion SNPs. Even though there are twice as many possible transversions as possible transitions, which would lead to a Ti/Tv ratio of 0.5 if all mutations occurred at equal rates, the actual Ti/Tv ratio differs by genomic region. The Ti/Tv ratio is around 3.0 for SNPs inside exons and about 2.0 elsewhere [14, 20], and the ratio also differs between synonymous and non-synonymous SNPs [21]. Because the target regions of exome capture kits often cover more than just exons, the Ti/Tv ratio for SNPs inside these target regions is expected to lie between 2.0 and 3.0, with the value depending on the fraction of exons inside the target regions. Ti/Tv ratios in exome sequencing below the 2–3 range may be cause for concern. The het/non-ref-hom ratio is another good quality control parameter for DNA sequencing data. We have mathematically proved that for whole genome sequencing data the het/non-ref-hom ratio is 2.0 based on the Hardy-Weinberg equilibrium [1]. For exome regions, het/non-ref-hom would be significantly lower. QC3 also checks the number of non-synonymous SNPs. In 2009, Bamshad et al. [22] showed that about 200 novel nonsynonymous SNPs should be expected per person through exome sequencing when compared to dbSNP 131. The most recent dbSNP version, 137, contains 72,952,578 additional SNPs, a 63% increase compared to version 131. Thus, the number of novel nonsynonymous SNPs should be less than what Bamshad et al. reported. An excessive number of novel nonsynonymous SNPs would raise a quality control concern. QC3 performs annotation against dbSNP 137 and reports novel non-synonymous SNPs by sample using the ANNOVAR tool [23].
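
The two ratios can be computed directly from called SNPs, as the following Python sketch shows. It is an illustration under stated assumptions rather than QC3's code: the function names are hypothetical, SNPs are given as `(ref, alt)` base pairs, and genotypes are diploid VCF-style GT strings such as `0/1` or `1|1`, with missing calls (`./.`) skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)

def titv_ratio(snps):
    """Number of transitions divided by number of transversions."""
    ti = sum(is_transition(ref, alt) for ref, alt in snps)
    tv = len(snps) - ti
    return ti / tv

def het_nonref_hom_ratio(genotypes):
    """Heterozygous calls divided by non-reference homozygous calls."""
    het = hom_alt = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # missing genotype
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom_alt += 1
    return het / hom_alt
```

Enumerating all twelve possible single-base substitutions gives four transitions and eight transversions, which reproduces the 0.5 baseline mentioned above.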

QC3 is capable of detecting cross-contamination of samples, which can slip through the raw data and alignment quality control checks without drawing any attention. Cross-contamination happens when the DNA of different samples is accidentally mixed. QC3 detects such problematic samples by computing the genotype consistencies between all possible sample pairs. Three different types of genotype consistency are computed: 1) heterozygous consistency using sample 1 in the pair as the denominator; 2) heterozygous consistency using sample 2 in the pair as the denominator; and 3) overall consistency including both heterozygous and homozygous genotypes. The genotype consistency check is similar to the idea of computing identity by descent (IBD) in a genetic study. Excessive heterozygous consistency between two random samples indicates cross contamination or unknown relatedness between the samples. Because the majority of the human genome is homozygous, the overall consistency should be much higher than the heterozygous consistency. Low heterozygous or overall consistency between known related samples can indicate a pedigree error or other contamination.
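
A minimal Python sketch of the three consistency measures follows. The exact definitions used by QC3 are not spelled out here, so the sketch makes explicit assumptions: each sample is a hypothetical dictionary mapping a position to a sorted genotype tuple such as `(0, 1)`, only positions called in both samples are compared, and heterozygous consistency is the number of positions where both samples carry the same heterozygous genotype divided by the number of heterozygous positions in the denominator sample.

```python
def is_het(genotype):
    """A diploid genotype is heterozygous when its two alleles differ."""
    return genotype[0] != genotype[1]

def _ratio(numerator, denominator):
    return numerator / denominator if denominator else float("nan")

def genotype_consistency(sample1, sample2):
    """Return the three pairwise consistency measures for two samples."""
    shared = sample1.keys() & sample2.keys()
    het1 = sum(is_het(sample1[p]) for p in shared)
    het2 = sum(is_het(sample2[p]) for p in shared)
    same = sum(sample1[p] == sample2[p] for p in shared)
    same_het = sum(is_het(sample1[p]) and sample1[p] == sample2[p]
                   for p in shared)
    return {
        "het_by_sample1": _ratio(same_het, het1),
        "het_by_sample2": _ratio(same_het, het2),
        "overall": _ratio(same, len(shared)),
    }
```

Running this over every pair of samples yields the matrices that can be displayed as heatmaps, with unusually high heterozygous consistency pointing to contamination or relatedness.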

Additionally, QC3 is able to estimate gender. Gender is estimated from two perspectives: the read count on the Y chromosome and the proportion of heterozygous SNPs on the X chromosome. Females do not have a Y chromosome. Thus, for a female sample, the number of reads aligned to the Y chromosome should be significantly lower than for a male sample. Some reads may still align to the Y chromosome due to homologous regions. Males have only one X chromosome. Thus, all SNPs on the X chromosome should be homozygous. Excessive heterozygous SNPs on the X chromosome indicate that the sample is from a female host. By examining both the read count on the Y chromosome and the proportion of heterozygous SNPs on the X chromosome, we can make a reasonably accurate inference of gender.
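
The two lines of evidence can be combined as in the following Python sketch. This is only an illustration: the function name and both cutoff values are hypothetical placeholders chosen for the example, not thresholds used by QC3, and in practice they would need to be calibrated on samples of known gender.

```python
def estimate_gender(y_reads, total_reads, x_het_snps, x_snps,
                    max_female_y_fraction=0.0005,  # hypothetical cutoff
                    min_female_x_het=0.2):         # hypothetical cutoff
    """Infer gender from chrY read fraction and chrX heterozygosity."""
    y_fraction = y_reads / total_reads
    x_het_fraction = x_het_snps / x_snps
    # Evidence 1: few reads on the Y chromosome suggests a female sample.
    y_call = "female" if y_fraction <= max_female_y_fraction else "male"
    # Evidence 2: many heterozygous SNPs on X suggests a female sample.
    x_call = "female" if x_het_fraction >= min_female_x_het else "male"
    # Report a call only when both perspectives agree.
    return y_call if y_call == x_call else "ambiguous"
```

Returning "ambiguous" when the two perspectives disagree keeps a conflicting sample visible for manual review instead of forcing a call.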

2.4. Batch Effect Detection

Batch effects are systematic technical variations in data caused by processing data in batches. They can complicate data analysis, especially for large datasets. The Cancer Genome Atlas (TCGA), one of the largest publicly available datasets, has also documented evidence of batch effects. The most common sequencing failures often occur non-randomly, by lane, flow cell, run, or machine. QC3 recognizes and records the machine name, run ID, flow cell ID, and lane ID of an experiment from either the FASTQ files or the BAM files. Based on this information, QC3 determines whether a batch effect exists using the nonparametric Kruskal-Wallis test [24] and the Fligner-Killeen test of homogeneity of variances [25] on the QC parameters (read count, mapping quality, base quality, and capture efficiency).
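
To make the batch test concrete, the sketch below extracts the batch identifiers from a Casava 1.8+ style read header and computes the Kruskal-Wallis H statistic (with tie correction) for a QC parameter grouped by batch. It is a pure-Python illustration with hypothetical function names, not QC3's R code, and it stops at the statistic: the p-value comes from comparing H with a chi-square distribution with k - 1 degrees of freedom for k batches. Statistical libraries provide both tests ready-made, for example `scipy.stats.kruskal` and `scipy.stats.fligner` in SciPy.

```python
def batch_key(header):
    """Return (machine, run, flowcell, lane) from a Casava 1.8+ read header."""
    fields = header.lstrip("@").split()[0].split(":")
    return tuple(fields[:4])

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of non-empty numeric groups."""
    values = sorted((v, gi) for gi, group in enumerate(groups) for v in group)
    n = len(values)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and values[j][0] == values[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[values[k][1]] += avg_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    h = 12 / (n * (n + 1)) * sum(r * r / len(g)
                                 for r, g in zip(rank_sums, groups)) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    return h / correction
```

For example, grouping per-sample read counts by the lane component of `batch_key` and passing the groups to `kruskal_wallis_h` gives the statistic for a lane effect on read count.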

3. Results

To demonstrate the effectiveness of QC3, we performed three-stage quality control using QC3 on a breast cancer exome sequencing study. This study contains a total of 30 breast tumor samples. The exomes were captured using Illumina’s TruSeq capture kit, and the libraries were sequenced on a HiSeq 2000 platform. The reads are paired-end and 75 base pairs long.

We first performed raw data quality control on the 30 samples using QC3. Immediately, we noticed that sample 37 had only 14,353 reads, which clearly indicated failed sequencing. Interestingly, sample 37 was not picked up on the base sequence quality plot or the nucleotide distribution plot, which most raw data quality control programs draw (Figure 2B, C). The batch effect check showed that the low read count of sample 37 also skewed the read count distribution of lane 7, which was captured by the Fligner-Killeen test (p-value < 0.001) (Figure 2A, TotalReadsByLane). Other interesting findings from the batch effect test were the relatively lower GC content and base quality for samples on lane 8 (Figure 2D). These results prompted us to re-sequence sample 37 and to flag the samples run on lane 8. For demonstration purposes, we included the failed sample 37 in the later quality control steps. The tabular report of raw data quality control can be viewed in Table S1.

Figure 2. A. Distribution of read count by lane identifies lane 7 as a clear outlier. Lane 7 contains sample 37, which was later determined to have failed sequencing. B. Base quality by cycle plot for sample 37 fails to detect sample 37 as an outlier. C. Nucleotide distribution by cycle plot for sample 37 does not detect sample 37 as an outlier. D. GC content by lane shows that samples on lane 8 have a slightly lower GC content, causing us to flag these samples.

After raw data quality control, we performed alignment using BWA [26] and performed alignment quality control on the aligned BAM files. Unsurprisingly, sample 37 was picked out as the obvious outlier (Figure 3A). In addition, sample 46 showed lower mean mapping quality, which caused us to flag it (Figure 3B). Furthermore, we did not detect any abnormal behavior in the alignment statistics for the flagged samples on lane 8. The complete tabular report can be found in Table S2.

Figure 3. A. Number of aligned reads by sample identifies sample 37 as an outlier again. B. Average mapping quality (MQ) based on BWA alignment shows a slightly lower average for sample 46, causing us to flag it. C. Mean insert size distribution. D. Mean depth distribution.

SNPs were called using GATK’s Unified Genotyper [9]. The VCF files were quality controlled by QC3. The Ti/Tv ratio and the heterozygous/non-reference homozygous ratio of all samples appear normal, even for sample 37 (Table S3). This shows that even though the number of reads is too small for the sequencing to be considered successful, the few reads that did get sequenced produced a few hundred reliable SNPs. Through contamination analysis, we found possible cross-contamination between samples 10 and 14, and between samples 39 and 40 (Figure 4A, B). Heterozygous consistency rates of 0.91 and 0.99 were observed for these two pairs of samples, respectively. The normal heterozygous consistency rate between two unrelated samples should be around 0.2 to 0.3. Sample 43 also showed high heterozygous consistency with multiple samples, which likewise indicates possible contamination. No outliers were observed for the samples run on lane 8, despite the slight abnormalities observed for these samples during raw data quality control. We therefore conclude that these samples are usable. The complete tabular report of contamination can be found in Table S4. QC3 organizes the figures and tables into an HTML file for better readability. The figures and tables presented here represent only part of QC3’s output.

Figure 4. A. Heatmap displaying pairwise sample heterozygous genotype consistency using the first sample in the pair as the denominator. Pairwise comparison identifies possible cross-contamination between samples 10 and 14, and between samples 39 and 40. B. Heatmap displaying pairwise sample heterozygous genotype consistency using the second sample in the pair as the denominator. C. Overall pairwise consistency, computed as the number of consistent genotypes divided by the total number of shared positions with genotypes called for both samples.

A second example was run using exome sequencing data from 11 TCGA breast cancer samples. TCGA data go through rigorous quality control before release to the public and thus would generally not be expected to have quality issues. The samples were chosen randomly. Of the 11 samples, 10 are from tumor tissue, and the 11th is a blood sample from an individual already represented among the other 10. The 11th sample was chosen from the same individual to artificially create a situation in which two samples have very similar heterozygous SNP consistency, which resembles the scenario of cross contamination. As expected, raw data and alignment QC on this dataset showed great quality. However, by running VCF QC, we successfully identified that two samples are closely related through their high heterozygous SNP consistency rate. This example once again demonstrates that performing QC on raw data and alignment alone can fail to identify a cross-contamination problem. The complete QC report of raw data, alignment, and VCF, together with information on the TCGA samples’ barcodes, can be viewed at https://github.com/slzhao/QC3.

4. Discussion

Illumina’s sequencing platform has dominated the sequencing market for the last few years with no sign of diminishing. Thus, we developed our tool based on the Illumina sequencing platform. However, the general concepts discussed here are applicable across a range of sequencing platforms, with appropriate modifications where necessary. QC3 is designed with a large sample size in mind. Although it works with single samples in general, some of its features such as detection of contamination and batch effect are only appropriate for large sample size studies.

As we have demonstrated, QC3 performs quality control at three different stages and from different perspectives. It is essential that we do not make hasty decisions based on the quality control results from just a single perspective. Evaluating high throughput data quality from the perspectives of raw data, alignment, and variant calling allows researchers to identify problematic samples with greater confidence.

Even with the most rigorous quality control protocol, there are still certain false positive results that can evade our quality control efforts. For high impact studies, the use of additional methods such as RT-PCR, Sequenom, or Sanger sequencing to validate the findings independently from NGS sequencing is highly recommended.

5. Availability

The QC3 source code, instruction manual, and the examples used in this manuscript can be accessed at https://github.com/slzhao/QC3.

Supplementary Material

NIHMS912921-supplement.xlsx (65.3 KB)

Highlights.

QC3 provides quality control for Illumina sequencing data at three stages: raw data, alignment, and variant calling.
QC3 can be used to check for batch effects.
QC3 can be used to check for cross contamination or mislabeling.

Acknowledgments

This study was supported by CCSG (P30 CA068485). We would also like to thank Margot Bjoring for editorial support.

Footnotes

Competing interests statement

The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

References

1. Guo Y, et al. Three-stage quality control strategies for DNA re-sequencing data. Briefings in Bioinformatics. 2013. doi: 10.1093/bib/bbt069.
2. Babraham Bioinformatics. FastQC. Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
3. Babraham Bioinformatics. FastQ Screen. Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/
4. Hannon Lab. FASTX-Toolkit.
5. Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One. 2012;7(2):e30619. doi: 10.1371/journal.pone.0030619.
6. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011;27(6):863–4. doi: 10.1093/bioinformatics/btr026.
7. Zhou Q, et al. QC-Chain: fast and holistic quality control method for next-generation sequencing data. PLoS One. 2013;8(4):e60234. doi: 10.1371/journal.pone.0060234.
8. Lassmann T, Hayashizaki Y, Daub CO. SAMStat: monitoring biases in next generation sequencing data. Bioinformatics. 2011;27(1):130–1. doi: 10.1093/bioinformatics/btq614.
9. DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491–8. doi: 10.1038/ng.806.
10. Cock PJ, et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research. 2010;38(6):1767–71. doi: 10.1093/nar/gkp1137.
11. Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9. doi: 10.1093/bioinformatics/btp352.
12. Ewing B, et al. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998;8(3):175–85. doi: 10.1101/gr.8.3.175.
13. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998;8(3):186–94.
14. Guo Y, et al. Exome sequencing generates high quality data in non-target regions. BMC Genomics. 2012;13:194. doi: 10.1186/1471-2164-13-194.
15. Yi X, et al. Sequencing of 50 human exomes reveals adaptation to high altitude. Science. 2010;329(5987):75–8. doi: 10.1126/science.1190371.
16. Ng SB, et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2010;42(1):30–5. doi: 10.1038/ng.499.
17. Samuels DC, et al. Finding the lost treasures in exome sequencing data. Trends Genet. 2013. doi: 10.1016/j.tig.2013.07.006.
18. Guo Y, et al. MitoSeek: Extracting Mitochondria Information and Performing High Throughput Mitochondria Sequencing Analysis. Bioinformatics. 2013. doi: 10.1093/bioinformatics/btt118.
19. Durbin RM, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–1073. doi: 10.1038/nature09534.
20. Bainbridge MN, et al. Targeted enrichment beyond the consensus coding DNA sequence exome reveals exons with higher variant densities. Genome Biology. 2011;12(7):R68. doi: 10.1186/gb-2011-12-7-r68.
21. Yang Z, Nielsen R. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. Journal of Molecular Evolution. 1998;46(4):409–18. doi: 10.1007/pl00006320.
22. Bamshad MJ, et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011;12(11):745–55. doi: 10.1038/nrg3031.
23. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research. 2010;38(16):e164. doi: 10.1093/nar/gkq603.
24. Kruskal WH, Wallis WA. Use of Ranks in One-Criterion Variance Analysis. Journal of the American Statistical Association. 1952;47(260):583–621.
25. Conover WJ, Johnson ME, Johnson MM. A Comparative Study of Tests for Homogeneity of Variances, with Applications to the Outer Continental Shelf Bidding Data. Technometrics. 1981;23(4):351–361.
26. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60. doi: 10.1093/bioinformatics/btp324.
