American Journal of Human Genetics. 2018 Jan 4;102(1):142–155. doi: 10.1016/j.ajhg.2017.12.007

A Comprehensive Workflow for Read Depth-Based Identification of Copy-Number Variation from Whole-Genome Sequence Data

Brett Trost 1, Susan Walker 1, Zhuozhi Wang 1, Bhooma Thiruvahindrapuram 1, Jeffrey R MacDonald 1, Wilson WL Sung 1, Sergio L Pereira 1, Joe Whitney 1, Ada JS Chan 1,2, Giovanna Pellecchia 1, Miriam S Reuter 1, Si Lok 1, Ryan KC Yuen 1, Christian R Marshall 1,3, Daniele Merico 1,4,6, Stephen W Scherer 1,2,5,6
PMCID: PMC5777982  PMID: 29304372

Abstract

A remaining hurdle to whole-genome sequencing (WGS) becoming a first-tier genetic test has been accurate detection of copy-number variations (CNVs). Here, we used several datasets to empirically develop a detailed workflow for identifying germline CNVs >1 kb from short-read WGS data using read depth-based algorithms. Our workflow is comprehensive in that it addresses all stages of the CNV-detection process, including DNA library preparation, sequencing, quality control, reference mapping, and computational CNV identification. We used our workflow to detect rare, genic CNVs in individuals with autism spectrum disorder (ASD), and 120/120 such CNVs tested using orthogonal methods were successfully confirmed. We also identified 71 putative genic de novo CNVs in this cohort, which had a confirmation rate of 70%; the remainder were incorrectly identified as de novo due to false positives in the proband (7%) or parental false negatives (23%). In individuals with an ASD diagnosis in which both microarray and WGS experiments were performed, our workflow detected all clinically relevant CNVs identified by microarrays, as well as additional potentially pathogenic CNVs < 20 kb. Thus, CNVs of clinical relevance can be discovered from WGS with a detection rate exceeding microarrays, positioning WGS as a single assay for genetic variation detection.

Keywords: whole-genome sequencing, WGS, copy-number variation, CNV, structural variation, SV, variation detection, read depth

Introduction

Genetic variations are generally grouped by size and class, and all forms have been associated with disease. Smaller “sequence-level” variations include single-nucleotide variations (SNVs) and insertions/deletions < 50 bp (indels).1 Larger “structural variations” (SVs) comprise copy-number variations (CNVs; unbalanced changes from 50 bp to entire chromosomes), balanced translocations, inversions, and combinations thereof. Compared with the human reference genome, individual genome sequences have ∼3.2 million SNVs, ∼760,000 indels, tens of thousands of SVs, and very rarely larger balanced structural changes.2, 3, 4, 5, 6, 7

In medical genetics, detection of large genomic changes like trisomy 21 has been accomplished by karyotyping for 50 years.8, 9 Subsequently, chromosomal microarrays (CMAs) became the first-tier clinical test for detecting disease-causing CNVs,10 followed by exome sequencing for finding SNVs and indels affecting the <2% of the genome that encodes genes.11 Whole-genome sequencing (WGS) promises to eventually supplant combination genetic testing using karyotyping, CMAs, and/or exome sequencing, since, in principle, it covers all types of genetic variations. Further, it provides higher diagnostic yields than targeted panels,12 can reduce health-care costs by providing earlier diagnoses and reducing the need for other tests,13 and may have benefits for the primary care of healthy individuals.14

For extracting SNVs and indels from short-read WGS data (the current industry standard15), there is already a well-accepted “best practices” workflow, which involves mapping reads using Burrows-Wheeler Aligner (BWA)16 and detecting variations using the Genome Analysis Toolkit (GATK).17, 18, 19 In contrast, there are >50 heterogeneous algorithms for detecting SVs from sequence data (Figure S1).20, 21 The large number of methods, combined with the distribution of their citations (the most-cited algorithm has <12% of the total citations; Figure S2), show that there is currently no widely accepted method for identifying SVs from WGS data. The absence of best practices for identifying SVs extends to other steps in the workflow, including DNA library preparation, sequencing, reference mapping, and quality control.

Here, we describe the development and evaluation of a detailed workflow for the read depth-based identification of germline CNVs from short-read WGS data. We first focus on CNVs because they comprise the majority of SVs, which has led to CMAs being used worldwide in diagnostic settings.10, 22, 23, 24 We restrict our evaluation to algorithms that are primarily based on read depth because (1) they can be analyzed cohesively (i.e., we expect them to exhibit similar issues, such as with amplification bias in sequence data from PCR-based DNA library preparations); (2) they are conceptually the most direct analog for CMAs; (3) there is a wealth of CMA data for comparison and benchmarking; and (4) they are the method of choice for detecting larger CNVs, which are more likely to be clinically relevant. (Strategies for detecting smaller CNVs and other types of SVs based on paired-end mapping or split reads will be the focus of a separate study.) Although our workflow does not detect all possible CNVs (we restrict the analysis to CNVs > 1 kb, as these are most amenable to detection by read depth), it comprehensively addresses all stages of the process of detecting CNVs from short-read WGS data based on read depth. Specifically, in addition to evaluating the accuracy of read depth-based CNV detection algorithms, we investigate the effects of DNA library preparation (including insert size and PCR-based versus PCR-free protocols), sequencing depth, algorithm parameters, read-mapping software, and choice of reference genome. We also explore filtering strategies, quality-control metrics, and deletion breakpoint accuracy. Using data from a large cohort of individuals with autism spectrum disorder (ASD) and their family members, we demonstrate that our workflow has a low false discovery rate (FDR) for clinically relevant CNVs.

Material and Methods

Study Subjects

HuRef DNA was the same as that used for the original sequencing and assembly of the HuRef genome.3 DNA from individual NA12878 (ref. 25) was purchased from Coriell. In addition, 43 DNA samples were from participants in the Personal Genome Project Canada (PGPC), 3,369 from participants in the Autism Speaks MSSNG autism WGS project (1,846 individuals with ASD and 1,523 parents of those individuals),26 111 from pediatric development research (unpublished data), and 8 from children referred for pediatric clinical genetics testing.23 All DNA was blood derived. All samples analyzed in this study were collected under approved protocols through The Hospital for Sick Children and its Research Ethics Board.

DNA Library Preparation and Sequencing

DNA was quantified using the Qubit dsDNA HS Assay and sample purity checked by OD260/OD280 ratio using a Nanodrop. For PCR-based libraries, 100 ng of DNA were used for the Illumina TruSeq Nano DNA library preparation following the manufacturer’s protocol. Briefly, DNA was fragmented to an average size of 350 bp on a Covaris S2 or LE220 instrument, end-repaired, and A-tailed. Indexed TruSeq Illumina adapters were added by ligation, followed by six or eight PCR cycles. For PCR-free libraries, 100, 500, or 1,000 ng of DNA were used for library preparation using either the TruSeq Nano DNA library preparation protocol (omitting the PCR step) or the Lucigen NxSeq AmpFREE Low DNA library preparation protocol, following the manufacturers’ instructions. All libraries were assessed using a Bioanalyzer High Sensitivity DNA Chip and quantified by qPCR using the KAPA Library Quantification Illumina/ABI Prism protocol (Kapa Biosystems). Validated libraries were pooled in equimolar quantities and sequenced on a HiSeq X following Illumina’s recommended protocol. DNA library preparation and sequencing were performed either by The Centre for Applied Genomics (TCAG) or Macrogen.

HuRef CNV Benchmark

Previously, our laboratory compiled an extensive dataset of variations in the HuRef genome4, 27, 28 using a number of technologies and strategies, both CMA based (Affymetrix Genome-Wide Human single nucleotide polymorphism [SNP] 6.0; Agilent 24M comparative genomic hybridization [CGH];29 Nimblegen 42M CGH;30 and Illumina BeadChip 1M) and sequencing based (comparison of de novo-assembled sequences; mate pair- and read depth-based analysis of data from Complete Genomics; and mate pair- and split read-based analysis of Sanger reads). While standard CGH microarrays typically detect CNVs ≥20 kb, the NimbleGen 42M and Agilent 24M have more probes and thus can identify CNVs as small as 500 bp.29, 30 Although sequencing based, the Complete Genomics variations can be considered orthogonal to those from Illumina WGS data, with different library preparation, sequencing chemistry, read size, alignment method, and variation-detection methods. For this study, we are interested only in CNVs, so other variations (e.g., insertions) were removed from the benchmark. Only CNVs ≥1 kb were retained, because three of the six CNV-detection algorithms tested did not report smaller CNVs (see below) and because smaller CNVs are more readily detected by strategies not involving read depth. Overlapping CNVs from the same technology were manually merged. CNVs based on mate-pair analysis of Sanger data were removed, as the breakpoints and CNV sizes could not be precisely determined. 
If the coordinates for a given benchmark CNV were relative to a reference assembly other than the one to which reads were mapped (GRCh37/hg19 or GRCh38/hg38), then they were converted using the University of California Santa Cruz (UCSC) Batch Coordinate Conversion (LiftOver) tool.31 Because CGH microarrays rely on a reference sample, they are susceptible to falsely identifying duplications for regions deleted in the reference sample and vice versa.24, 32 Thus, we removed any benchmark CNV (CGH microarray or otherwise) that had ≥50% reciprocal overlap with an opposite-direction CNV of ≥30% frequency in either the Database of Genomic Variants Gold Standard33 or in MSSNG parents (detected by both the CNVnator and Estimation by Read Depth with Single-nucleotide variants [ERDS] methods).26 Files S1 and S2 contain the full HuRef benchmarks for GRCh37/hg19 and GRCh38/hg38, respectively.

NA12878 CNV Benchmark

A previously published benchmark of NA12878 variations34 was downloaded (Web Resources) and filtered to retain only deletions ≥1 kb (File S3).

AK1 Reads and CNV Benchmark

HiSeq X reads from the DNA of individual AK1 were downloaded in FASTQ format from the NCBI Sequence Read Archive.35 To make the average depth consistent with our HuRef WGS data, as well as with the depth at which genomes are typically sequenced, we used just one of the two available read sets (accession number SRA: SRR3602759). A benchmark of variations in the AK1 genome, derived by comparing the AK1 assembly with the human reference assembly, was obtained from the supplementary materials of Seo et al.35 and filtered to retain only deletions ≥1 kb (File S4).

Preprocessing of Sequence Data

For all CNV-detection algorithms except Canvas, base-calling was performed using bcl2fastq2 v2.17.1.14 and reads were mapped to the reference genome using BWA-MEM v.0.7.12 (for GRCh37/hg19) or v.0.7.15 (for GRCh38/hg38).16 BWA-MEM v.0.7.15 was used for GRCh38/hg38 because it fixed an error related to ALT-aware mapping that was present in v.0.7.12; the output from the two versions is identical in the absence of ALT contigs (GitHub BWA webpage, Web Resources). Read quality was checked using FastQC prior to mapping. GRCh37/hg19 was used as the reference genome except where specified otherwise. Duplicate reads were marked using Picard v.1.133. Base quality score recalibration, indel realignment (no longer necessary for GATK v.3.6 and later; see Web Resources), and SNV and indel detection were performed with GATK v.3.4-46 using best practices recommendations.17, 18, 19 SNV and indel detection were performed because ERDS requires these variations in variant call format (VCF) as part of its input. We tested two mapping pipelines designed to handle contigs representing alternate haplotypes: one described by BWA author Heng Li and the other by the developers of GATK at the Broad Institute (Web Resources). For Canvas, HiSeq Analysis Software v.2.5.55.1311 (Illumina) was used for base calling and Isaac version iSAAC-SAAC00776.15.01.27 for alignment. For all alignments, binary alignment/map (BAM) file manipulation, including read subsampling, was performed using SAMtools v.0.1.19-44428cd.36 Read-depth statistics were calculated using GATK’s DepthOfCoverage function or the “depth” function in SAMtools. Commands for performing these tasks, as well as scripts for performing other analyses described in this paper, are available on GitHub (Web Resources).

Definition of Repetitive and Low-Complexity Regions of the Human Genome

Both array- and sequencing-based methods of CNV detection can be confounded by repetitive and low-complexity regions (RLCRs).24 To define a comprehensive set of RLCRs, we combined four datasets: (1) the set of assembly gaps defined by UCSC, which includes centromeres, telomeres, constitutive heterochromatin domains, gaps between or within clones and contigs, and the repeat-dominated short arms of chromosomes 13, 14, 15, 21, and 22; (2) the UCSC list of segmental duplications; (3) the pseudoautosomal regions of the sex chromosomes; and (4) repeat regions as defined by RepeatMasker version open-3-2-7 (see Web Resources) using the -s (sensitive) option with release 20090120 of library RepeatMaskerLib.embl. The number of regions present in each data source, as well as the total number of base pairs, is given in Table S1. Links to download the full RLCR definition, as well as a version that omits the RepeatMasker definitions, are provided in Web Resources.

Filtering out CNVs with Substantial Repeat Content

For generating filtered sets of CNVs, a given CNV was removed if ≥70% of the CNV overlapped with RLCRs. This overlap could consist of a single continuous region or multiple regions. To illustrate, if a CNV spans coordinates 100,000–105,000 of chromosome 1 and two RLCRs span coordinates 100,000–102,000 and 103,000–105,000 of chromosome 1, then 80% of the CNV (4,000 of 5,000 bp) overlaps with RLCRs, and the CNV is removed.
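The overlap calculation can be sketched in a few lines of Python. This is an illustration only, not the authors' script (those are in the GitHub repository listed in Web Resources). The function names are hypothetical, and coordinates are assumed to be half-open (start included, end excluded) on a single chromosome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome; end-exclusive (assumption)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so that no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def rlcr_fraction(cnv: Interval, rlcrs: List[Interval]) -> float:
    """Fraction of the CNV's length covered by the union of the RLCRs."""
    cnv_start, cnv_end = cnv
    covered = 0
    for start, end in merge_intervals(rlcrs):
        covered += max(0, min(cnv_end, end) - max(cnv_start, start))
    return covered / (cnv_end - cnv_start)


def passes_filter(cnv: Interval, rlcrs: List[Interval], max_fraction: float = 0.70) -> bool:
    """Keep a CNV only if less than 70% of it overlaps RLCRs."""
    return rlcr_fraction(cnv, rlcrs) < max_fraction


# The example from the text: 4,000 of 5,000 bp overlap RLCRs.
cnv = (100_000, 105_000)
rlcrs = [(100_000, 102_000), (103_000, 105_000)]
print(rlcr_fraction(cnv, rlcrs))   # 0.8
print(passes_filter(cnv, rlcrs))   # False
```

Merging the RLCRs first matters because the four source datasets can overlap one another; summing raw overlaps would otherwise overstate the repeat content.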

CNV-Detection Algorithms

The CNV-detection algorithms tested were Canvas v.1.3.5,37 Copy Number estimation by a Mixture Of PoissonS (cn.MOPS) v.1.16.2,38 CNVnator v.0.3.2,39 ERDS v.1.1,20 Genome STRucture in Populations (Genome STRiP) v.2.00.1696,40 and Read Depth eXplorer (RDXplorer) v.2.0 release 3.41 Seven other read depth-based algorithms (Figure S2) were considered but not tested because they were specific to certain experimental designs (e.g., matched tumor-control samples) or had no implementation available. All algorithms were initially evaluated using default parameters. Genome STRiP contains two separate, complementary pipelines, but we evaluated only the read depth-based “CNV Discovery Pipeline” using parameters recommended for genomes sequenced at 30–40× average depth (Web Resources). Both cn.MOPS and Genome STRiP require multiple genome sequences as input. For cn.MOPS, the authors recommend ≥6 genomes,38 while the authors of Genome STRiP recommend ≥20 (Web Resources); thus, we conservatively used 30 genome sequences from MSSNG participants (in addition to the genome sequence of interest) as input to both algorithms. CNVs from Genome STRiP that overlapped one another were merged to form a single CNV. Supplementary alignments were removed from BAM files prior to running ERDS.20 The raw output files from all six CNV-detection algorithms for the HuRef, NA12878, and AK1 genomes are included in File S5.

Concordance of CNV-Detection Algorithms

To determine sets of equivalent CNVs, CNVs were iteratively added to sets as follows. For each possible pair of CNVs (where each CNV in a pair was identified by a different algorithm or HuRef benchmark method), we determined whether they satisfied the 50% reciprocal overlap criterion. If so, and if neither CNV was yet in a set, then a new set was created including both CNVs. If only one of the CNVs was already in a set, then the other CNV was added to it. If both CNVs were already in different sets (a rare “conflict”), then they were left as such. When determining how many unique benchmark CNVs fell into each size bin, the size of a particular unique benchmark CNV was calculated as the median of the sizes of all the CNVs in the corresponding set. The overall similarity of two sets of CNVs A and B, where A and B were detected by two different methods, was measured using the Jaccard index $J(A, B) = |A \cap B| / |A \cup B|$. As not all algorithms reported absolute copy number, it was not taken into account when comparing CNVs.
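A minimal Python sketch of this set-building procedure and of the Jaccard index follows. It is an interpretation of the description above, not the authors' code. Two points are assumptions: CNVs that match nothing are placed in singleton sets so that they count toward the union, and the Jaccard index is computed over set identifiers. The tuple layout and function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

CNV = Tuple[str, str, int, int]  # (method, chromosome, start, end); end-exclusive (assumption)


def reciprocal_overlap(a: CNV, b: CNV, min_frac: float = 0.5) -> bool:
    """True if the shared region covers at least min_frac of both CNVs."""
    if a[1] != b[1]:
        return False
    shared = min(a[3], b[3]) - max(a[2], b[2])
    return shared > 0 and shared >= min_frac * max(a[3] - a[2], b[3] - b[2])


def build_sets(cnvs: List[CNV]) -> Dict[CNV, int]:
    """Map each CNV to the identifier of its set of equivalent CNVs."""
    set_of: Dict[CNV, int] = {}
    next_id = 0
    for a, b in combinations(cnvs, 2):
        if a[0] == b[0] or not reciprocal_overlap(a, b):
            continue  # same method, or not equivalent
        if a not in set_of and b not in set_of:
            set_of[a] = set_of[b] = next_id
            next_id += 1
        elif a in set_of and b not in set_of:
            set_of[b] = set_of[a]
        elif b in set_of and a not in set_of:
            set_of[a] = set_of[b]
        # Both already assigned (same set, or a rare conflict): leave as is.
    for cnv in cnvs:  # assumption: unmatched CNVs form singleton sets
        if cnv not in set_of:
            set_of[cnv] = next_id
            next_id += 1
    return set_of


def jaccard(set_of: Dict[CNV, int], method_a: str, method_b: str) -> float:
    """Jaccard index of the sets of equivalent CNVs seen by two methods."""
    ids_a = {sid for cnv, sid in set_of.items() if cnv[0] == method_a}
    ids_b = {sid for cnv, sid in set_of.items() if cnv[0] == method_b}
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 0.0


calls = [
    ("ERDS", "chr1", 1000, 3000),
    ("CNVnator", "chr1", 1200, 3200),
    ("Canvas", "chr1", 1100, 2900),
    ("CNVnator", "chr2", 5000, 8000),
]
sets = build_sets(calls)
print(jaccard(sets, "ERDS", "CNVnator"))  # 0.5
```

Because the result depends on the order in which pairs are visited, a different ordering could assign a borderline CNV to a different set; the text notes that such conflicts were rare.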

Accuracy of CNV-Detection Algorithms

The accuracies of the CNV-detection algorithms were evaluated according to their sensitivity and FDR when compared with a benchmark of known CNVs (HuRef, NA12878, or AK1 benchmarks). A CNV detected by an algorithm and a benchmark CNV were deemed to be the same if the first CNV overlapped with at least 50% of the second CNV and vice versa (50% reciprocal overlap).42 Sensitivity was defined as TP / (TP + FN), where TP (“true positives”) is the number of benchmark CNVs identified correctly and FN (“false negatives”) is the number of benchmark CNVs not detected by the algorithm. FDR was defined as FP / (FP + TP), where FP (“false positives”) is the number of identified CNVs not found in the benchmark. For the HuRef benchmark, which contains CNVs identified by multiple technologies and strategies, sensitivity was determined for CNVs having 50% reciprocal overlap with a benchmark CNV from at least one technology (denoted “1+”) or at least two technologies (“2+”).
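The definitions above translate directly into code. The following Python sketch is illustrative rather than the authors' implementation; the function names and the half-open coordinate convention are assumptions.

```python
from typing import List, Tuple

CNV = Tuple[str, int, int]  # (chromosome, start, end); end-exclusive (assumption)


def same_cnv(a: CNV, b: CNV, min_frac: float = 0.5) -> bool:
    """50% reciprocal overlap: the shared region covers >= 50% of each CNV."""
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return shared > 0 and shared >= min_frac * max(a[2] - a[1], b[2] - b[1])


def sensitivity_and_fdr(calls: List[CNV], benchmark: List[CNV]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); FDR = FP / (FP + TP), as defined in the text."""
    # TP: benchmark CNVs matched by at least one call; FN: the remaining benchmark CNVs.
    tp = sum(any(same_cnv(c, b) for c in calls) for b in benchmark)
    fn = len(benchmark) - tp
    # FP: calls that match no benchmark CNV.
    fp = sum(not any(same_cnv(c, b) for b in benchmark) for c in calls)
    sensitivity = tp / (tp + fn) if benchmark else float("nan")
    fdr = fp / (fp + tp) if (fp + tp) else float("nan")
    return sensitivity, fdr


benchmark = [("chr1", 1000, 2000), ("chr1", 5000, 9000), ("chr2", 100, 1300)]
calls = [("chr1", 1100, 2100), ("chr1", 20000, 22000)]
sens, fdr = sensitivity_and_fdr(calls, benchmark)
print(sens, fdr)  # one of three benchmark CNVs found; one of two calls is false
```

The all-pairs comparison is quadratic and is shown only for clarity; for genome-scale call sets, an interval-based tool or a sorted sweep would be the more practical choice.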

Deletion Breakpoint Verification Assays

PCR primers were designed to span the putative breakpoint and the deletion-specific PCR product was sequenced by Sanger sequencing.

Read-Depth Uniformity

Read-depth uniformity was measured by the inter-quartile range (IQR). Let $R_a$ denote the read depth for which a% of the bases in the reference genome have smaller depths. Then $\mathrm{IQR} = R_{75} - R_{25}$. The GitHub repository for this paper (Web Resources) provides scripts for calculating IQR from BAM files.
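As a rough illustration of the IQR definition, the sketch below computes it from a list of per-base depths, such as those reported by the SAMtools “depth” function. It is not the authors' script. The percentile convention used here (the smallest depth at which the cumulative fraction of bases reaches a%) is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def depth_percentile(hist: Counter, a: float) -> int:
    """Smallest depth d such that at least a% of bases have depth <= d."""
    total = sum(hist.values())
    threshold = a / 100 * total
    running = 0
    for depth in sorted(hist):
        running += hist[depth]
        if running >= threshold:
            return depth
    raise ValueError("no depth values supplied")


def read_depth_iqr(depths: Iterable[int]) -> int:
    """IQR = R75 - R25 of the per-base read depths."""
    hist = Counter(depths)  # depth -> number of bases with that depth
    return depth_percentile(hist, 75) - depth_percentile(hist, 25)


# Toy input: one base at each depth from 1 to 100.
print(read_depth_iqr(range(1, 101)))  # 50
```

Working from a histogram rather than a sorted array keeps memory use small, since read depths take few distinct values even across billions of bases.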

Visualizations

Set intersections were visualized using UpSetR v.1.3.3.43 Circle diagrams were created using circos v.0.69-3,44 and other plots were generated using ggplot2 v.2.2.1.45

Detection of Rare, Genic Variants in the Autism Speaks MSSNG WGS Dataset

Two algorithms, ERDS and CNVnator, were used for detecting CNVs in the MSSNG WGS dataset. Both programs were run using default parameters. For CNVnator, CNVs with q0 > 50 were removed except for homozygous deletions or hemizygous X-linked deletions in males (normalized read depth < 0.03). Adjacent CNVs of the same type were merged when separated by a gap in the genome assembly or separated by a small region (sum of the lengths of individual CNVs > 70% of the merged CNV length). The ERDS and CNVnator CNVs were merged independently and the merged ERDS CNVs were annotated with CNVnator CNVs having 50% reciprocal overlap.
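The merging rule for CNVs separated by a small region can be sketched as follows. This Python example is one reading of the rule, not the authors' code. It applies the 70% criterion greedily from left to right, assumes the input calls do not overlap one another, and omits merging across assembly gaps (which would require the gap coordinates). The tuple layout and function name are hypothetical.

```python
from typing import List, Tuple

Call = Tuple[str, int, int, str]  # (chromosome, start, end, type); end-exclusive (assumption)


def merge_adjacent(calls: List[Call], min_covered: float = 0.70) -> List[Call]:
    """Merge neighbouring same-type CNVs when the summed lengths of the
    individual CNVs exceed min_covered of the merged CNV's length."""
    merged: List[list] = []  # each entry: [chrom, start, end, type, summed_length]
    for chrom, start, end, kind in sorted(calls, key=lambda c: (c[0], c[3], c[1])):
        if merged:
            last = merged[-1]
            if last[0] == chrom and last[3] == kind:
                span = max(end, last[2]) - last[1]
                covered = last[4] + (end - start)
                if covered > min_covered * span:
                    last[2] = max(end, last[2])
                    last[4] = covered
                    continue
        merged.append([chrom, start, end, kind, end - start])
    return [(c, s, e, k) for c, s, e, k, _ in merged]


calls = [
    ("chr1", 1000, 5000, "DEL"),
    ("chr1", 5500, 9000, "DEL"),    # 500 bp gap: 7,500 / 8,000 bp covered, so merged
    ("chr1", 20000, 21000, "DEL"),  # too far away to merge
    ("chr1", 9100, 9500, "DUP"),    # different type, never merged with deletions
]
print(merge_adjacent(calls))
```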

BAM Confirmation

To manually assess the accuracy of predicted CNVs, Integrative Genomics Viewer46 was used to visually compare the read depth of the CNV with that of the surrounding regions, with specific attention to deviations in read depth corresponding to the predicted change in copy number (e.g., a 50% reduction for a heterozygous deletion or a 50% increase for a heterozygous duplication). Predicted CNVs with clear start and end breakpoints and that were supported by split-read and/or read-pair information were deemed more likely to be correct. A more detailed description of how we used BAM confirmation to evaluate putative CNVs is given in the Supplemental Tutorial.

Results

An overview of this study is given in Figure 1.

Figure 1.

Overview of the Three Stages of This Study

In stage 1 (“algorithm selection”), three WGS datasets and corresponding CNV benchmarks (HuRef [refs. 3, 4, 28], NA12878 [ref. 34], and AK1 [ref. 35]) were used to assess the accuracy of six read depth-based CNV-detection algorithms: Canvas, cn.MOPS, CNVnator, ERDS, Genome STRiP, and RDXplorer. In stage 2 (“workflow development”), other factors influencing CNV detection were evaluated in the context of the most accurate algorithms identified in stage 1. Based on results from the first two stages, we propose a comprehensive workflow for detecting CNVs from short-read WGS data. In stage 3 (“workflow evaluation”), we show that our workflow can accurately identify clinically relevant CNVs. Green parallelograms represent data, and gray rectangles represent actions. The blue shape represents the CNV detection workflow developed from the results of the first two stages.

Stage 1: Algorithm Selection

Concordance of CNV-Detection Algorithms

We determined the concordance in the CNVs detected by the six CNV-detection algorithms when applied to WGS data from the HuRef-Free-500 DNA library. File S6 contains information on this library and the other libraries used in this study, including library preparation methods and common sequencing metrics. The algorithms exhibited only moderate agreement in terms of number (Table 1), size distribution (Figure S3), and overlap (Figures 2, S4, and S5) of detected CNVs. Concordance was higher for deletions than for duplications (Figure S5). The low concordance was unsurprising given that different algorithms for SNV and indel detection also exhibit modest agreement,47 and similarly with different CMA platforms and algorithms.42

Table 1.

Sensitivities and False Discovery Rates (FDRs) of the Six CNV-Detection Algorithms when the CNVs Identified in the HuRef Genome Were Compared with Those in the HuRef CNV Benchmark


              Unfiltered                                    Filtered
Algorithm     n       Sensitivity 1+  Sensitivity 2+  FDR   n       Sensitivity 1+  Sensitivity 2+  FDR

Deletions

Canvas        493     0.23            0.44            0.56  253     0.35            0.64            0.53
cn.MOPS       386     0.16            0.28            0.62  93      0.19            0.38            0.33
CNVnator      2,093   0.35            0.57            0.84  356     0.44            0.76            0.59
ERDS          680     0.39            0.69            0.46  251     0.51            0.89            0.32
Genome STRiP  582     0.25            0.42            0.60  241     0.37            0.68            0.49
RDXplorer     12,685  0.55            0.76            0.96  3,232   0.66            0.94            0.93

Duplications

Canvas        225     0.06            0.62            0.89  56      0.11            0.67            0.77
cn.MOPS       421     0.05            0.38            0.95  61      0.08            0.33            0.84
CNVnator      337     0.06            0.33            0.93  27      0.07            0.33            0.70
ERDS          303     0.07            0.50            0.91  48      0.09            0.53            0.77
Genome STRiP  499     0.03            0.21            0.98  110     0.03            0.20            0.96
RDXplorer     7,500   0.10            0.46            1.00  3,052   0.15            0.47            0.99

For each algorithm, we determined the total number of CNVs detected (n), the number identified by at least one method in the HuRef benchmark (1+), and the number identified by at least two benchmark methods (2+). The algorithms’ accuracies varied substantially, both in terms of sensitivity (the proportion of benchmark CNVs identified by an algorithm) and FDR (the proportion of CNVs identified by an algorithm that were not in the benchmark). “Unfiltered” refers to all CNVs ≥1 kb, while “filtered” refers to CNVs that had <70% overlap with repetitive and low-complexity regions of the genome. The apparent superiority of Canvas for duplications is somewhat misleading, as it was the most accurate for only some sizes of CNVs (shown subsequently).

Figure 2.

Overlap in the CNVs Detected by the Six Algorithms

The bottom-left bar chart shows the number of CNVs identified by each algorithm. The remainder shows the number of CNVs detected by various intersections of the algorithms; for instance, the far-left bar for deletions represents the number of CNVs detected by RDXplorer only, while the far-right bar represents deletions detected by Canvas, cn.MOPS, CNVnator, and RDXplorer but not ERDS or Genome STRiP. Due to the log scale, zero-height bars represent a count of 1.

Accuracies of CNV-Detection Algorithms

The CNVs detected by each algorithm in the sequencing data from the HuRef-Free-500 library were compared with those in the HuRef CNV benchmark (Tables S2 and S3) using both unfiltered CNVs and filtered CNVs (those with <70% overlap with RLCRs). For unfiltered deletions, the most accurate algorithm was ERDS, which had high sensitivity and a low FDR (Table 1; Figure S6). For unfiltered duplications, ERDS and Canvas had the highest sensitivities (although much lower than the highest sensitivities for deletions), and FDRs were uniformly high. Accuracies for filtered CNVs were consistently better than for unfiltered CNVs, and the algorithm rankings remained the same. Thus, subsequent analyses use only filtered CNVs unless specified otherwise.

Detection was more accurate for larger CNVs (Table S4). For deletions, ERDS was the most accurate in all size bins. The most accurate algorithm for duplications varied: Canvas for 1–5 kb, cn.MOPS for 5–10 kb, and ERDS for 10–100 kb.

Combining Algorithms

We tested a scoring procedure in which a particular CNV was considered to be detected only if it was identified by at least m algorithms. Overall, this strategy was no better than the most accurate individual algorithm alone (Tables 1 and S5). Considering deletions, the most accurate pair of algorithms was CNVnator and ERDS, which had a lower FDR than ERDS alone, but lower sensitivity (Table 1; Figure S7). For duplications, several pairs of algorithms had lower FDRs than any individual one, but again at the expense of sensitivity.
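The “at least m algorithms” rule can be illustrated with a short Python sketch. It is not the authors' implementation. It assumes that two calls count as the same CNV when they have 50% reciprocal overlap (the criterion used elsewhere in this paper), and the names below are hypothetical.

```python
from typing import Dict, List, Tuple

Call = Tuple[str, int, int]  # (chromosome, start, end); end-exclusive (assumption)


def reciprocal_overlap(a: Call, b: Call, min_frac: float = 0.5) -> bool:
    """True if the shared region covers at least min_frac of both calls."""
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return shared > 0 and shared >= min_frac * max(a[2] - a[1], b[2] - b[1])


def support(call: Call, calls_by_algorithm: Dict[str, List[Call]]) -> int:
    """Number of algorithms (including the caller) reporting an equivalent CNV."""
    return sum(
        any(reciprocal_overlap(call, other) for other in calls)
        for calls in calls_by_algorithm.values()
    )


def consensus(calls_by_algorithm: Dict[str, List[Call]], m: int) -> List[Tuple[str, Call]]:
    """Calls supported by at least m algorithms, labelled by reporting algorithm."""
    return [
        (algorithm, call)
        for algorithm, calls in calls_by_algorithm.items()
        for call in calls
        if support(call, calls_by_algorithm) >= m
    ]


calls_by_algorithm = {
    "ERDS": [("chr1", 1000, 3000)],
    "CNVnator": [("chr1", 1200, 3200), ("chr2", 5000, 8000)],
    "Canvas": [],
}
print(consensus(calls_by_algorithm, m=2))
```

The output lists each supported CNV once per reporting algorithm; collapsing these into a single record would require choosing which algorithm's coordinates to keep.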

Additional Genomes and Benchmarks

To ensure the HuRef results were representative of genomes in general, we used the above strategies to analyze deletions detected from Illumina HiSeq X reads from the NA12878 and AK1 genomes (Tables S6 and S7). The results were highly concordant with those from the HuRef genome and benchmark (Tables S8–S11).

Reproducibility of CNV Detection

To evaluate reproducibility, we analyzed monozygotic twins and their parents. Two metrics were used: percentage of CNVs detected in both twins (“twin concordance”) and percentage consistent with Mendelian inheritance (“parent concordance”). Most algorithms exhibited high concordance (≥80%) for both filtered and unfiltered deletions ≥5 kb (Tables 2 and S12)—better than the reproducibility of CMAs.42 Lower concordances were observed for deletions <5 kb and duplications.

Table 2.

Reproducibility of the Filtered CNVs Identified in the Genomes of a Mother, Father, and Their Monozygotic Twins


              [1 kb,5 kb)         [5 kb,10 kb)        [10 kb,100 kb)      [100 kb,1,000 kb)   [1,000 kb,…)
Algorithm     n        %          n        %          n        %          n        %          n        %

Deletions: Twin Concordance

Canvas        26       42.3       38       55.3       32       68.8       4        25.0       18       55.6
cn.MOPS       54       61.1       34       91.2       19       73.7       0        N/A        0        N/A
CNVnator      264      39.0       49       83.7       31       80.6       2        100.0      0        N/A
ERDS          303      55.1       39       84.6       24       87.5       2        100.0      0        N/A
Genome STRiP  484      35.5       26       88.5       20       70.0       1        100.0      0        N/A
RDXplorer     33,849   17.8       104      47.1       30       93.3       1        0.0        0        N/A

Deletions: Parent Concordance

Canvas        26       53.8       38       63.2       32       87.5       4        25.0       18       61.1
cn.MOPS       54       68.5       34       85.3       19       89.5       0        N/A        0        N/A
CNVnator      264      54.9       49       91.8       31       93.5       2        100.0      0        N/A
ERDS          303      63.0       39       92.3       24       100.0      2        100.0      0        N/A
Genome STRiP  484      42.1       26       84.6       20       75.0       1        100.0      0        N/A
RDXplorer     33,849   38.8       104      61.5       30       93.3       1        100.0      0        N/A

Duplications: Twin Concordance

Canvas        22       22.7       12       50.0       13       84.6       0        N/A        0        N/A
cn.MOPS       29       62.1       13       69.2       10       60.0       0        N/A        0        N/A
CNVnator      0        N/A        10       60.0       14       92.9       3        100.0      0        N/A
ERDS          19       63.2       13       61.5       11       81.8       4        100.0      0        N/A
Genome STRiP  79       55.7       29       65.5       21       76.2       0        N/A        0        N/A
RDXplorer     19,753   1.2        21       57.1       31       80.6       1        0.0        0        N/A

Duplications: Parent Concordance

Canvas        22       36.4       12       58.3       13       76.9       0        N/A        0        N/A
cn.MOPS       29       51.7       13       53.8       10       90.0       0        N/A        0        N/A
CNVnator      0        N/A        10       70.0       14       85.7       3        100.0      0        N/A
ERDS          19       73.7       13       76.9       11       81.8       4        100.0      0        N/A
Genome STRiP  79       68.4       29       75.9       21       71.4       0        N/A        0        N/A
RDXplorer     19,753   3.6        21       76.2       31       80.6       1        100.0      0        N/A

n refers to the number of CNVs in a given size bin that were detected in at least one of the twins. “%” refers to the percentage of those CNVs that were also detected in the other twin (“twin concordance”) or that were also detected in at least one of the parents (“parent concordance”).

CNVnator and ERDS Excel

Overall, ERDS was the most accurate algorithm for deletions (Table 1). Although the ranking of the remaining algorithms for deletions was less clear, CNVnator exhibited several advantages over its closest competitors (Canvas and Genome STRiP): it had particularly strong sensitivity for deletions ≥5 kb, making it complementary to ERDS; its reproducibility was generally better than Canvas and Genome STRiP; and it is comparatively easy to use and has modest compute time and storage requirements.

Given the lesser accuracy for duplications among all algorithms, and that sample sizes were smaller, there was insufficient evidence to justify using additional algorithms beyond CNVnator and ERDS. Thus, in stage 2 of this study, CNVnator and ERDS are used as the basis for examining other steps in the CNV-detection workflow.

Stage 2: Workflow Development

CNVnator and ERDS Parameters

We investigated two parameters potentially influencing the sensitivity/FDR tradeoff of CNVnator and ERDS: the window size and fraction of multi-mapping reads (q0). For filtered CNVs, there was limited or no benefit to modifying default parameters, but for unfiltered CNVs, using a q0 cutoff of ∼0.5 for CNVnator gave an excellent sensitivity-FDR tradeoff versus no cutoff (Figures S8–S10).

Read Depth

The optimal sensitivity-FDR tradeoff was achieved at average depths of 25–40× for both CNVnator (Figure S11) and ERDS (Figure S12); however, only CNVs 1–5 kb were particularly sensitive to depth.

Sequence Read Mapping

To determine how CNVnator and ERDS are affected by different read-mapping software, we tested them using alignments produced by Isaac48 and NovoAlign (Web Resources) but found they were most accurate using BWA16 (Figure S13).

Library Preparation Methods

We generated four HuRef DNA libraries—two using PCR-free protocols and two using PCR-based protocols. All were similar in terms of typical quality metrics (File S6); however, the WGS data from the PCR-based libraries exhibited lower read-depth uniformity (RDU), which appeared to cause the detection of false CNVs (Figure S14). Regions of high or low depth were consistent in the two PCR-based libraries, suggesting a role for systematic PCR biases. Read depth in the PCR-free libraries was largely unrelated to guanine/cytosine (GC) content, while they were positively correlated (except at high GC content) in the PCR-based libraries (Figure S15). CNVnator had a substantially higher FDR for deletions in the PCR-based libraries (Figure S16).

Quality-Control Metrics

To investigate quality-control metrics and to further compare PCR-based with PCR-free libraries, we used DNA from 43 Personal Genome Project Canada (PGPC) participants. For 22 participants, a single library was sequenced (11 PCR-based and 11 PCR-free); for the remaining 21, both a PCR-based and a PCR-free library were sequenced. For each WGS dataset, we determined its RDU as measured by IQR. All WGS datasets with high IQR (≥15) were generated from PCR-based libraries and had an excess of CNVs detected by CNVnator (and often ERDS) (Figure S17). Despite the high false positive rate for CNVs detected by ERDS and CNVnator alone in WGS datasets with high IQR, these datasets had very high false negative rates when considering CNVnator and ERDS in combination (i.e., there were very few CNVs detected by both algorithms; Figure S17), further highlighting the poor quality of the sequencing data and the resulting CNVs. Overall, both IQR and CNV count are useful quality metrics—the accuracy of read depth-based algorithms on WGS datasets with high IQR is problematic, even in the absence of aberrant CNV counts; further, if excess CNVs are detected by ERDS or CNVnator (or very few CNVs are detected by both), then it is reasonable to assume that the data are of poor quality regardless of IQR. Based on Figure S17, samples with more than ∼400–500 filtered CNVs detected by either ERDS or CNVnator, or fewer than 100 CNVs detected by both, could reasonably be categorized as outliers. In the future, it would be desirable to develop a statistically rigorous method of detecting outliers in CNV count data, especially for clinical purposes. However, such a method would undoubtedly be complex—due not only to the inherent difficulty of outlier detection, but also due to ethnicity considerations49, 50 and because counts from ERDS and CNVnator appear to follow different statistical distributions (ERDS appears to be Gaussian, whereas CNVnator appears to be Poisson). 
In the absence of such a method, and in the presence of larger datasets, we suggest performing the analysis depicted in Figure S17 to intuitively identify reasonable cutoffs.
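The empirical cutoffs discussed above can be written down as a simple rule-based check. The sketch below is illustrative only: the default thresholds (IQR of at least 15, more than 500 filtered CNVs from either algorithm, fewer than 100 CNVs detected by both) are taken from the approximate values quoted in the text, and they should be re-derived for each dataset as suggested. The function name `qc_flags` and its parameters are hypothetical.

```python
def qc_flags(iqr, n_erds, n_cnvnator, n_both,
             iqr_cutoff=15, max_single=500, min_both=100):
    """Return a list of reasons a WGS dataset looks like a quality outlier.

    iqr        -- read-depth IQR of the dataset (computed elsewhere)
    n_erds     -- number of filtered CNVs called by ERDS
    n_cnvnator -- number of filtered CNVs called by CNVnator
    n_both     -- number of CNVs called by both algorithms
    The default cutoffs are assumptions based on the approximate values
    in the text, not fixed recommendations.
    """
    flags = []
    if iqr >= iqr_cutoff:
        flags.append("high IQR")
    if n_erds > max_single or n_cnvnator > max_single:
        flags.append("excess CNVs from a single algorithm")
    if n_both < min_both:
        flags.append("few CNVs detected by both algorithms")
    return flags


# Hypothetical example values, for illustration only
print(qc_flags(iqr=8, n_erds=250, n_cnvnator=300, n_both=180))
print(qc_flags(iqr=17, n_erds=900, n_cnvnator=2000, n_both=40))
```

Returning the individual reasons, rather than a single pass/fail value, reflects the point made above that IQR and CNV count are complementary metrics.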

Effect of Alternate Haplotypes in the Reference Genome

Unlike GRCh37/hg19 (used throughout this study), GRCh38/hg38 contains “ALT contigs” representing alternate haplotypes, which could interfere with read depth-based CNV detection by artificially sequestering reads from the primary assembly. This is best handled using ALT-aware workflows (Figures S18–S20), with differences between builds mainly consisting of more false duplications detected when using GRCh38/hg38.

Other Technical Issues

In running ERDS on various WGS datasets, we observed that the number of deletions (but not duplications) detected depended on the median insert size of the DNA library (Figure S21) and that this problem could be best corrected via a simple modification of ERDS (Figure S22). In addition, ERDS consistently detected more deletions in WGS data from v.2.0 of the HiSeq X flow cell than from v.2.5 (Figure S21); these additional deletions were mostly false (shown subsequently).

Deletion Breakpoint Accuracy

We selected 18 deletions between 3 and 468 kb identified by both CNVnator and ERDS and verified their breakpoints by Sanger sequencing. ERDS’ breakpoints were generally close to the experimentally determined ones, while those determined by CNVnator were substantially less accurate (Table S13).

Best Practices Workflow

A “best practices” workflow based on the above observations is given in Figure 3.

Figure 3.

Recommended Workflow for Use of Read Depth-Based Algorithms for Detecting Germline CNVs from Short-Read WGS Data

The green and blue shapes represent the beginning and end of the workflow, respectively. Red rectangles represent quality-control steps, and other actions are colored in gray. Yellow diamonds represent decision points. For maximum stringency, the action “Remove CNVs with ≥70% overlap with RLCRs” may be performed using the full RLCR definition, including RepeatMasker (as in the algorithm selection and workflow development sections). For increased sensitivity, such as when examining rare, genic CNVs, it may be performed using the RLCR definition that omits RepeatMasker, as was done in the workflow evaluation section.

Stage 3: Workflow Evaluation

Identification of Clinically Relevant CNVs

To test our workflow (Figure 3) for CNV detection in medical genetics (where CMAs are the standard), we identified rare (<1% frequency in unrelated, unaffected individuals26 and absent from the Database of Genomic Variants33), genic CNVs detected by both CNVnator and ERDS in HiSeq X WGS datasets from 1,846 individuals with ASD.26 Preliminary analysis suggested that for rare, genic CNVs, filtering based on RepeatMasker reduced sensitivity slightly (by approximately 5%) and did not reduce FDR, so we filtered out CNVs only if they had ≥70% overlap with the non-RepeatMasker portion of our RLCR definition. The resulting CNVs were grouped into three categories: (1) ≥3 Mb, (2) representing a known genomic disorder, or (3) overlapping putative ASD-risk genes (File S7). We experimentally tested all CNVs in the first two categories (n = 77) and a subset of those in the third (n = 43/313 total) using either CMAs or PCR-based methods, and all were successfully confirmed, giving an FDR of 0% (Table 3; File S8). The high confirmation rate for duplications differs from the high FDR observed in our HuRef analysis (Table 1), indicating that the detection of rare, genic duplications may be more reliable than duplications in general. In addition, all clinically relevant CNVs identified by CMA (≥20 kb) were detected via WGS.
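The overlap filter described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: it assumes half-open coordinates on a single chromosome, merges overlapping RLCR intervals so that no base is counted twice, and removes a CNV when at least 70% of its length is covered. The function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def fraction_covered(cnv_start, cnv_end, regions):
    """Fraction of the CNV's length covered by the union of the regions."""
    covered = 0
    for start, end in merge_intervals(regions):
        covered += max(0, min(cnv_end, end) - max(cnv_start, start))
    return covered / (cnv_end - cnv_start)


def passes_rlcr_filter(cnv_start, cnv_end, rlcrs, max_overlap=0.70):
    """Keep a CNV only if less than max_overlap of it lies within RLCRs."""
    return fraction_covered(cnv_start, cnv_end, rlcrs) < max_overlap


# Hypothetical coordinates: 60% of this 1-kb CNV lies within RLCRs, so it is kept
rlcrs = [(900, 1300), (1200, 1500), (1900, 2500)]
print(fraction_covered(1000, 2000, rlcrs))
print(passes_rlcr_filter(1000, 2000, rlcrs))
```

Switching between the full RLCR definition and the one that omits RepeatMasker would then amount to passing a different interval list to the same function.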

Table 3.

Confirmation of Potentially Clinically Relevant CNVs Detected by Both CNVnator and ERDS in the Genomes of Participants in the Autism Speaks MSSNG WGS Project

| | ≥3 Mb | ASD-Risk Genes | Genomic Disorders |
|---|---|---|---|
| # deletions identified | 10 | 143 | 17 |
| # deletions successfully confirmed/total tested | 10/10 | 27/27 | 17/17 |
| # duplications identified | 16 | 170 | 34 |
| # duplications successfully confirmed/total tested | 16/16 | 16/16 | 34/34 |

CNVs that were both ≥3 Mb and represented a genomic disorder were assigned to the latter category. Most CNVs were tested as part of another study,26 and the remainder were tested specifically for this study.

To evaluate our workflow’s ability to accurately detect clinically relevant CNVs in the sub-CMA size range, we used the same cohort and selected rare CNVs <20 kb detected by both CNVnator and ERDS that overlapped disease-associated genes from the Clinical Genomics Database (CGD), resulting in 41 deletions and 30 duplications. Although these CNVs are not expected to contribute to ASD risk, we deemed them representative of pathogenic CNVs. All CNVs appeared correct based on “BAM confirmation” (manual read pileup inspection) (File S8), and 6/6 deletions and 2/2 duplications selected for PCR confirmation were deemed correct.

As another test, we used the HiSeq X to sequence the DNA of eight children referred for pediatric clinical genetics testing, from which nine clinically relevant CNVs between 300 kb and 90 Mb had previously been detected using clinical CMAs and Complete Genomics WGS (Table 2 of Stavropoulos et al.23). All nine CNVs were detected by our workflow, with breakpoints close to those of the other technologies (Table S14).

ERDS Plus CNVnator versus ERDS Alone

Our evaluations thus far did not clearly establish whether accuracy improved when using CNVnator in addition to ERDS (Table 1; Figure S7). We therefore evaluated the accuracy of rare, genic ERDS-only CNVs (i.e., <50% reciprocal overlap with a CNVnator CNV). ERDS-only deletions exhibited several indications of systematic biases (detailed in Figure S23), and only 8/40 ERDS-only deletions overlapping CGD genes selected for verification appeared to be real by BAM confirmation (File S8). ERDS-only duplications had more favorable confirmation rates (35/40 by BAM confirmation). Of the 80 ERDS-only CNVs described above, 14 were tested using PCR, and the PCR result matched the BAM result in all but one case (File S8). Finally, ERDS-only CNVs were more sensitive to technical confounders, as shown in the workflow development section. For this reason, we recommend using CNVnator and ERDS in combination. However, to improve sensitivity, ERDS-only duplications can be used as a “discovery set,” for which the correctness of any duplications of potential clinical or experimental importance should be verified either experimentally or using BAM confirmation.
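The definition of an ERDS-only CNV used above (<50% reciprocal overlap with a CNVnator CNV) can be illustrated with a short sketch. It assumes that each call is a `(chrom, start, end, kind)` tuple with half-open coordinates and that calls are compared only when they lie on the same chromosome and are of the same type; the latter is an assumption of this example, as are the function names.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions (overlap / length of each call)."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def erds_only(erds_calls, cnvnator_calls, min_reciprocal=0.5):
    """Return ERDS calls lacking a CNVnator call with >= min_reciprocal overlap."""
    result = []
    for chrom, start, end, kind in erds_calls:
        matched = any(
            chrom == c_chrom and kind == c_kind
            and reciprocal_overlap(start, end, c_start, c_end) >= min_reciprocal
            for c_chrom, c_start, c_end, c_kind in cnvnator_calls
        )
        if not matched:
            result.append((chrom, start, end, kind))
    return result


# Hypothetical calls, for illustration only
erds = [("chr1", 1000, 2000, "DEL"),
        ("chr1", 5000, 6000, "DEL"),
        ("chr2", 1000, 2000, "DUP")]
cnvnator = [("chr1", 1100, 2100, "DEL"),
            ("chr1", 5000, 9000, "DEL")]
print(erds_only(erds, cnvnator))
```

In this example the second ERDS deletion is fully contained in a much larger CNVnator call, yet its reciprocal overlap is only 25%, so it is treated as ERDS-only. This is the practical difference between reciprocal and one-sided overlap.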

De Novo CNVs

We identified genic CNVs that were detected by both CNVnator and ERDS in a given ASD-affected proband but by neither algorithm in either parent. This was done for the 959 ASD-affected case subjects in the MSSNG dataset for which the proband and both parental samples were sequenced on HiSeq X (total corresponding parental samples = 1,523). Of 71 such CNVs, 50 were confirmed as de novo after BAM confirmation and experimental validation. Of the 21 that were not de novo, five were due to false positives in the proband and 16 were parental false negatives (File S8). Our detection rate for genic de novo CNVs (approximately one per 19 children with ASD) was similar to that of previous studies from our group and others, several of which used much more complex and labor-intensive sets of algorithms and methods than suggested here.51, 52, 53, 54 For falsely detected de novo CNVs that were due to parental false negatives, the raw read depth of the CNV region in one parent was almost always clearly different from the surrounding region (Table S15), suggesting an easily automatable method for filtering putative de novo CNVs. Scripts for comparing the read depth of a CNV with the surrounding region are provided on the GitHub page for this paper (Web Resources).
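The idea of comparing read depth within a CNV region to that of the surrounding region can be sketched as follows. This is not the script distributed with the paper; it is a minimal illustration that assumes per-position (or per-window) depths for one chromosome are held in a list, uses medians to reduce the influence of local spikes, and applies illustrative cutoffs of 0.75 and 1.25 that are assumptions of this example. The function names are hypothetical.

```python
from statistics import median


def depth_ratio(depths, start, end, flank):
    """Median depth inside [start, end) divided by median depth in the flanks."""
    inside = depths[start:end]
    flanking = depths[max(0, start - flank):start] + depths[end:end + flank]
    if not inside or not flanking:
        raise ValueError("CNV region or flanking region is empty")
    flank_median = median(flanking)
    if flank_median == 0:
        raise ValueError("flanking region has zero median depth")
    return median(inside) / flank_median


def consistent_with_cnv(ratio, kind, del_max=0.75, dup_min=1.25):
    """Check whether a depth ratio supports a deletion or duplication.

    The cutoffs are illustrative assumptions, not values from the paper.
    """
    if kind == "DEL":
        return ratio <= del_max
    if kind == "DUP":
        return ratio >= dup_min
    raise ValueError("kind must be 'DEL' or 'DUP'")


# Hypothetical parental depths: a region at half the depth of its flanks
parent_depths = [30] * 10 + [15] * 5 + [30] * 10
ratio = depth_ratio(parent_depths, start=10, end=15, flank=5)
print(ratio, consistent_with_cnv(ratio, "DEL"))
```

Applied to each parent at the coordinates of a putative de novo CNV, a ratio that supports the CNV would suggest a parental false negative rather than a true de novo event.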

Small CNVs

Earlier, we showed that FDRs are generally higher for CNVs <5 kb (Tables S4, S10, and S11).30, 55 To determine whether this pattern applies to rare, genic CNVs, we specifically examined the CNVs analyzed above that were <5 kb. Three such deletions overlapping ASD genes and 18 overlapping CGD genes were tested, and all were successfully confirmed (File S8). Further, 11 of the putative de novo deletions were <5 kb, and all but two were successfully confirmed as CNVs in the proband (although five of these CNVs were missed in a parent, suggesting that, consistent with Table S4, sensitivity may be lower for small CNVs). There were no duplications <5 kb in these categories (both ERDS and CNVnator were required to have detected these CNVs, and CNVnator rarely detects duplications <5 kb); however, eight of the rare, genic, ERDS-only duplications were <5 kb, and all but one were successfully confirmed. The observed FDR for rare, genic CNVs <5 kb was thus quite low; nonetheless, given the higher FDRs observed for general CNVs in this size range, we recommend that small CNVs that appear to be clinically relevant be subjected to additional validation measures (BAM confirmation or laboratory testing).

Discussion

We have developed a robust workflow for applying read depth-based computational algorithms to short-read WGS data in order to identify all CNVs, and more, detected by CMAs. Coupling this workflow with the established best-practices pipeline for SNV and indel detection16, 17, 18, 19 positions WGS as a single experiment capable of replacing combined testing using CMAs and exome sequencing, and for the most part karyotyping. However, while our workflow is comprehensive in terms of the steps required to detect CNVs >1 kb using read depth, it does not yet address the detection of several other classes of variation. For instance, most CNVs are <1 kb in size,6 and these are more readily detected from short-read data using algorithms involving paired-end mapping and split reads. These strategies are also more suitable for detecting other types of structural variations, such as translocations and inversions. Additionally, our workflow undoubtedly misses some CNVs >1 kb, as evidenced by our own comparisons to CNV benchmarks (Table 1) and because long-read sequencing data detects some such CNVs not discovered by short-read data (though these are mostly <5 kb6). Finally, our workflow is suitable only for detecting deviations from diploid copy number (or for deviations from one copy on sex chromosomes in males). For the precise detection and characterization of multiallelic CNVs, more specialized methods are required, such as that developed by Handsaker et al.56 Overall, further work is necessary to establish workflows that cover the full repertoire of human genetic variation.

In our recommended workflow (Figure 3), we have suggested redoing library preparation and sequencing when quality-control metrics such as IQR and CNV count are poor. However, we recognize that this may be impractical (e.g., due to cost or lack of DNA); thus, an important area of future work is developing methods for making the best use of suboptimal sequencing data.57

The HuRef CNV benchmark that formed the core of our analysis has several advantages: the technologies used to generate it detect a wide spectrum of CNV sizes,4 it allowed us to identify high-confidence CNVs (those detected by multiple technologies), and it was not used in the development of the six algorithms evaluated, ensuring an unbiased comparison. The NA12878 and AK1 deletion benchmarks were useful for ensuring that our findings generalized to other genomes. However, a difficulty in this study was the lesser quality and quantity of duplication benchmark data.42, 58 The marked difference in accuracy between deletions and duplications suggests that the HuRef benchmark duplications are less complete and have a higher false-positive rate than the deletions (Table 1). We are unaware of any other complete duplication benchmark; however, we partially addressed this deficiency by showing that many clinically relevant duplications were detected with a low FDR in WGS data from ASD-affected individuals (Table 3). In the future, our ability to perform even more thorough testing of variation-detection methods will rely on the creation of additional comprehensive, multi-technology, well-validated benchmarks of all types of variations (especially duplications).34, 59

This study also sheds light on limitations inherent in current methods for the read depth-based detection of CNVs from WGS data. Although our workflow has a low FDR, particularly for rare, coding CNVs (Table 3), there is room for improvement with respect to sensitivity (Tables 1, S4, and S15). Although we did not examine the detection of germline mosaic CNVs, imperfect sensitivity for heterozygous CNVs suggests that sensitivity for mosaic CNVs would be low. The window-based approaches used by most CNV-detection algorithms are not well suited to aneuploidy detection, as read depth comparisons are made only within chromosomes. More advanced algorithms that are less dependent on parameters, windows, and thresholds may be able to address the above issues, as well as others observed in this study (e.g., repeat content, insert-size dependency, and PCR-amplification noise).

In addition to improved algorithms, CNV and SV detection from WGS data will be aided by new sequencing and library preparation technologies that provide long reads or other mechanisms of obtaining long-range genetic information.15 Already, such technologies have been used to create high-quality genome assemblies and to catalog SVs that would be difficult to resolve using short reads.60, 61 However, so far these technologies are costlier, provide lower coverage, and have higher error rates than the industry-standard short-read technology, limiting broad applicability. Thus, our workflow promises to have impact in high-throughput clinical genomic12, 23, 26, 51, 62, 63 and population-based7, 64 studies for the foreseeable future.

Acknowledgments

We thank Richard Wintle for critical feedback on the manuscript, Sylvia Lamoureux, Kiera Drew, and Akshaya Raajkumar for validation work, Lok Kan Lee for assistance with writing scripts, David Glazer, Matt Bookman, and others at Verily (Google) for their work with the Autism Speaks MSSNG Whole Genome Sequencing Project, and The Centre for Applied Genomics for technical support. This work was funded by Autism Speaks, the Canada Foundation for Innovation, the Canadian Institute for Advanced Research, the University of Toronto McLaughlin Centre, Genome Canada/Ontario Genomics Institute, the Government of Ontario, the Canadian Institutes of Health Research (CIHR) (grant number FDN-143295), Ontario Brain Institute, and The Hospital for Sick Children Foundation. B.T. is funded by the CIHR Banting Postdoctoral Fellowship. S.W.S. is funded by the GlaxoSmithKline-CIHR Chair in Genome Sciences at the University of Toronto and The Hospital for Sick Children.

Published: January 4, 2018

Footnotes

Supplemental Data include 23 figures, 15 tables, 8 files, and a tutorial and can be found with this article online at https://doi.org/10.1016/j.ajhg.2017.12.007.

Web Resources

Supplemental Data

Document S1. Figures S1–S23, Tables S1–S15, and Tutorial
mmc1.pdf (3.4MB, pdf)
File S1. HuRef CNV Benchmark (GRCh37/hg19 Coordinates)
mmc2.txt (81.1KB, txt)
File S2. HuRef CNV Benchmark (GRCh38/hg38 Coordinates)
mmc3.txt (78.6KB, txt)
File S3. NA12878 CNV Benchmark (GRCh37/hg19 Coordinates)
mmc4.txt (39.9KB, txt)
File S4. AK1 CNV Benchmark (GRCh37/hg19 Coordinates)
mmc5.txt (41.1KB, txt)
File S5. Raw Output Files from the Six CNV-Detection Algorithms for the HuRef, NA12878, and AK1 Genomes
mmc6.zip (4.9MB, zip)
File S6. Details on the DNA Libraries Used in This Study and Corresponding WGS Data
mmc7.xlsx (60KB, xlsx)
File S7. Putative ASD-Risk Genes
mmc8.xlsx (34.3KB, xlsx)
File S8. CNVs Analyzed in Stage 3 of This Paper
mmc9.xlsx (73.8KB, xlsx)
Document S2. Article plus Supplemental Data
mmc10.pdf (4.6MB, pdf)

References

1. Zarrei M., MacDonald J.R., Merico D., Scherer S.W. A copy number variation map of the human genome. Nat. Rev. Genet. 2015;16:172–183. doi: 10.1038/nrg3871.
2. Feuk L., Carson A.R., Scherer S.W. Structural variation in the human genome. Nat. Rev. Genet. 2006;7:85–97. doi: 10.1038/nrg1767.
3. Levy S., Sutton G., Ng P.C., Feuk L., Halpern A.L., Walenz B.P., Axelrod N., Huang J., Kirkness E.F., Denisov G. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254. doi: 10.1371/journal.pbio.0050254.
4. Pang A.W., MacDonald J.R., Pinto D., Wei J., Rafiq M.A., Conrad D.F., Park H., Hurles M.E., Lee C., Venter J.C. Towards a comprehensive structural variation map of an individual human genome. Genome Biol. 2010;11:R52. doi: 10.1186/gb-2010-11-5-r52.
5. Auton A., Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., Korbel J.O., Marchini J.L., McCarthy S., McVean G.A., Abecasis G.R., 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
6. Chaisson M.J.P., Huddleston J., Dennis M.Y., Sudmant P.H., Malig M., Hormozdiari F., Antonacci F., Surti U., Sandstrom R., Boitano M. Resolving the complexity of the human genome using single-molecule sequencing. Nature. 2015;517:608–611. doi: 10.1038/nature13907.
7. Maretty L., Jensen J.M., Petersen B., Sibbesen J.A., Liu S., Villesen P., Skov L., Belling K., Theil Have C., Izarzugaza J.M.G. Sequencing and de novo assembly of 150 genomes from Denmark as a population reference. Nature. 2017;548:87–91. doi: 10.1038/nature23264.
8. Jacobs P.A., Browne C., Gregson N., Joyce C., White H. Estimates of the frequency of chromosome abnormalities detectable in unselected newborns using moderate levels of banding. J. Med. Genet. 1992;29:103–108. doi: 10.1136/jmg.29.2.103.
9. Lee C., Scherer S.W. The clinical context of copy number variation in the human genome. Expert Rev. Mol. Med. 2010;12:e8. doi: 10.1017/S1462399410001390.
10. Miller D.T., Adam M.P., Aradhya S., Biesecker L.G., Brothman A.R., Carter N.P., Church D.M., Crolla J.A., Eichler E.E., Epstein C.J. Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. Am. J. Hum. Genet. 2010;86:749–764. doi: 10.1016/j.ajhg.2010.04.006.
11. Alexander R.P., Fang G., Rozowsky J., Snyder M., Gerstein M.B. Annotating non-coding regions of the genome. Nat. Rev. Genet. 2010;11:559–571. doi: 10.1038/nrg2814.
12. Lionel A.C., Costain G., Monfared N., Walker S., Reuter M.S., Hosseini S.M., Thiruvahindrapuram B., Merico D., Jobling R., Nalpathamkalam T. Improved diagnostic yield compared with targeted gene sequencing panels suggests a role for whole-genome sequencing as a first-tier genetic test. Genet. Med. 2017. doi: 10.1038/gim.2017.119.
13. Soden S.E., Saunders C.J., Willig L.K., Farrow E.G., Smith L.D., Petrikin J.E., LePichon J.-B., Miller N.A., Thiffault I., Dinwiddie D.L. Effectiveness of exome and genome sequencing guided by acuity of illness for diagnosis of neurodevelopmental disorders. Sci. Transl. Med. 2014;6:265ra168. doi: 10.1126/scitranslmed.3010076.
14. Vassy J.L., Christensen K.D., Schonman E.F., Blout C.L., Robinson J.O., Krier J.B., Diamond P.M., Lebo M., Machini K., Azzariti D.R., MedSeq Project. The impact of whole-genome sequencing on the primary care and outcomes of healthy adult patients: a pilot randomized trial. Ann. Intern. Med. 2017;167:159–169. doi: 10.7326/M17-0188.
15. Goodwin S., McPherson J.D., McCombie W.R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 2016;17:333–351. doi: 10.1038/nrg.2016.49.
16. Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
17. DePristo M.A., Banks E., Poplin R., Garimella K.V., Maguire J.R., Hartl C., Philippakis A.A., del Angel G., Rivas M.A., Hanna M. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–498. doi: 10.1038/ng.806.
18. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110.
19. Van der Auwera G.A., Carneiro M.O., Hartl C., Poplin R., Del Angel G., Levy-Moonshine A., Jordan T., Shakir K., Roazen D., Thibault J. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics. 2013;43:1–33. doi: 10.1002/0471250953.bi1110s43.
20. Zhu M., Need A.C., Han Y., Ge D., Maia J.M., Zhu Q., Heinzen E.L., Cirulli E.T., Pelak K., He M. Using ERDS to infer copy-number variants in high-coverage genomes. Am. J. Hum. Genet. 2012;91:408–421. doi: 10.1016/j.ajhg.2012.07.004.
21. Guan P., Sung W.-K. Structural variation detection using next-generation sequencing data: A comparative technical review. Methods. 2016;102:36–49. doi: 10.1016/j.ymeth.2016.01.020.
22. Noll A.C. Clinical detection of deletion structural variants in whole-genome sequences. NPJ Genom. Med. 2016;1:16026. doi: 10.1038/npjgenmed.2016.26.
23. Stavropoulos D.J., Merico D., Jobling R., Bowdin S., Monfared N., Thiruvahindrapuram B., Nalpathamkalam T., Pellecchia G., Yuen R.K.C., Szego M.J. Whole-genome sequencing expands diagnostic utility and improves clinical management in paediatric medicine. NPJ Genom. Med. 2016;1:15012. doi: 10.1038/npjgenmed.2015.12.
24. Scherer S.W., Lee C., Birney E., Altshuler D.M., Eichler E.E., Carter N.P., Hurles M.E., Feuk L. Challenges and standards in integrating surveys of structural variation. Nat. Genet. 2007;39(7, Suppl):S7–S15. doi: 10.1038/ng2093.
25. Eberle M.A., Fritzilas E., Krusche P., Källberg M., Moore B.L., Bekritsky M.A., Iqbal Z., Chuang H.-Y., Humphray S.J., Halpern A.L. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 2017;27:157–164. doi: 10.1101/gr.210500.116.
26. Yuen R.K.C., Merico D., Bookman M., Howe J.L., Thiruvahindrapuram B., Patel R.V., Whitney J., Deflaux N., Bingham J., Wang Z. Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nat. Neurosci. 2017;20:602–611. doi: 10.1038/nn.4524.
27. Pang A.W.C., Migita O., Macdonald J.R., Feuk L., Scherer S.W. Mechanisms of formation of structural variation in a fully sequenced human genome. Hum. Mutat. 2013;34:345–354. doi: 10.1002/humu.22240.
28. Pang A.W.C., Macdonald J.R., Yuen R.K.C., Hayes V.M., Scherer S.W. Performance of high-throughput sequencing for the discovery of genetic variation across the complete size spectrum. G3 (Bethesda). 2014;4:63–65. doi: 10.1534/g3.113.008797.
29. Park H., Kim J.-I., Ju Y.S., Gokcumen O., Mills R.E., Kim S., Lee S., Suh D., Hong D., Kang H.P. Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nat. Genet. 2010;42:400–405. doi: 10.1038/ng.555.
30. Conrad D.F., Pinto D., Redon R., Feuk L., Gokcumen O., Zhang Y., Aerts J., Andrews T.D., Barnes C., Campbell P., Wellcome Trust Case Control Consortium. Origins and functional impact of copy number variation in the human genome. Nature. 2010;464:704–712. doi: 10.1038/nature08516.
31. Speir M.L., Zweig A.S., Rosenbloom K.R., Raney B.J., Paten B., Nejad P., Lee B.T., Learned K., Karolchik D., Hinrichs A.S. The UCSC Genome Browser database: 2016 update. Nucleic Acids Res. 2016;44(D1):D717–D725. doi: 10.1093/nar/gkv1275.
32. Iafrate A.J., Feuk L., Rivera M.N., Listewnik M.L., Donahoe P.K., Qi Y., Scherer S.W., Lee C. Detection of large-scale variation in the human genome. Nat. Genet. 2004;36:949–951. doi: 10.1038/ng1416.
33. MacDonald J.R., Ziman R., Yuen R.K.C., Feuk L., Scherer S.W. The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Res. 2014;42:D986–D992. doi: 10.1093/nar/gkt958.
34. Parikh H., Mohiyuddin M., Lam H.Y.K., Iyer H., Chen D., Pratt M., Bartha G., Spies N., Losert W., Zook J.M., Salit M. svclassify: a method to establish benchmark structural variant calls. BMC Genomics. 2016;17:64. doi: 10.1186/s12864-016-2366-2.
35. Seo J.-S., Rhie A., Kim J., Lee S., Sohn M.-H., Kim C.-U., Hastie A., Cao H., Yun J.-Y., Kim J. De novo assembly and phasing of a Korean human genome. Nature. 2016;538:243–247. doi: 10.1038/nature20098.
36. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
37. Roller E., Ivakhno S., Lee S., Royce T., Tanner S. Canvas: versatile and scalable detection of copy number variants. Bioinformatics. 2016;32:2375–2377. doi: 10.1093/bioinformatics/btw163.
38. Klambauer G., Schwarzbauer K., Mayr A., Clevert D.-A., Mitterecker A., Bodenhofer U., Hochreiter S. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 2012;40:e69. doi: 10.1093/nar/gks003.
39. Abyzov A., Urban A.E., Snyder M., Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011;21:974–984. doi: 10.1101/gr.114876.110.
40. Handsaker R.E., Korn J.M., Nemesh J., McCarroll S.A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat. Genet. 2011;43:269–276. doi: 10.1038/ng.768.
41. Yoon S., Xuan Z., Makarov V., Ye K., Sebat J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009;19:1586–1592. doi: 10.1101/gr.092981.109.
42. Pinto D., Darvishi K., Shi X., Rajan D., Rigler D., Fitzgerald T., Lionel A.C., Thiruvahindrapuram B., Macdonald J.R., Mills R. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nat. Biotechnol. 2011;29:512–520. doi: 10.1038/nbt.1852.
43. Lex A., Gehlenborg N., Strobelt H., Vuillemot R., Pfister H. UpSet: visualization of intersecting sets. IEEE Trans. Vis. Comput. Graph. 2014;20:1983–1992. doi: 10.1109/TVCG.2014.2346248.
44. Krzywinski M., Schein J., Birol I., Connors J., Gascoyne R., Horsman D., Jones S.J., Marra M.A. Circos: an information aesthetic for comparative genomics. Genome Res. 2009;19:1639–1645. doi: 10.1101/gr.092759.109.
45. Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag; 2009.
46. Robinson J.T., Thorvaldsdóttir H., Winckler W., Guttman M., Lander E.S., Getz G., Mesirov J.P. Integrative genomics viewer. Nat. Biotechnol. 2011;29:24–26. doi: 10.1038/nbt.1754.
47. O’Rawe J., Jiang T., Sun G., Wu Y., Wang W., Hu J., Bodily P., Tian L., Hakonarson H., Johnson W.E. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med. 2013;5:28. doi: 10.1186/gm432.
48. Raczy C., Petrovski R., Saunders C.T., Chorny I., Kruglyak S., Margulies E.H., Chuang H.-Y., Källberg M., Kumar S.A., Liao A. Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms. Bioinformatics. 2013;29:2041–2043. doi: 10.1093/bioinformatics/btt314.
49. Pinto D., Marshall C., Feuk L., Scherer S.W. Copy-number variation in control population cohorts. Hum. Mol. Genet. 2007;16 Spec No. 2:R168–R173. doi: 10.1093/hmg/ddm241.
50. Uddin M., Thiruvahindrapuram B., Walker S., Wang Z., Hu P., Lamoureux S., Wei J., MacDonald J.R., Pellecchia G., Lu C. A high-resolution copy-number variation resource for clinical and population genetics. Genet. Med. 2015;17:747–752. doi: 10.1038/gim.2014.178.
51. Yuen R.K.C., Thiruvahindrapuram B., Merico D., Walker S., Tammimies K., Hoang N., Chrysler C., Nalpathamkalam T., Pellecchia G., Liu Y. Whole-genome sequencing of quartet families with autism spectrum disorder. Nat. Med. 2015;21:185–191. doi: 10.1038/nm.3792.
52. Kloosterman W.P., Francioli L.C., Hormozdiari F., Marschall T., Hehir-Kwa J.Y., Abdellaoui A., Lameijer E.-W., Moed M.H., Koval V., Renkens I., Genome of Netherlands Consortium. Characteristics of de novo structural changes in the human genome. Genome Res. 2015;25:792–801. doi: 10.1101/gr.185041.114.
53. Brandler W.M., Antaki D., Gujral M., Noor A., Rosanio G., Chapman T.R., Barrera D.J., Lin G.N., Malhotra D., Watts A.C. Frequency and complexity of de novo structural mutation in autism. Am. J. Hum. Genet. 2016;98:667–679. doi: 10.1016/j.ajhg.2016.02.018.
54. Turner T.N., Coe B.P., Dickel D.E., Hoekzema K., Nelson B.J., Zody M.C., Kronenberg Z.N., Hormozdiari F., Raja A., Pennacchio L.A. Genomic patterns of de novo mutation in simplex autism. Cell. 2017;171:710–722.e12. doi: 10.1016/j.cell.2017.08.047.
55. Redon R., Ishikawa S., Fitch K.R., Feuk L., Perry G.H., Andrews T.D., Fiegler H., Shapero M.H., Carson A.R., Chen W. Global variation in copy number in the human genome. Nature. 2006;444:444–454. doi: 10.1038/nature05329.
56. Handsaker R.E., Van Doren V., Berman J.R., Genovese G., Kashin S., Boettger L.M., McCarroll S.A. Large multiallelic copy number variations in humans. Nat. Genet. 2015;47:296–303. doi: 10.1038/ng.3200.
57. Laehnemann D., Borkhardt A., McHardy A.C. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief. Bioinform. 2016;17:154–179. doi: 10.1093/bib/bbv029.
58. Buchanan J.A., Scherer S.W. Contemplating effects of genomic structural variation. Genet. Med. 2008;10:639–647. doi: 10.1097/gim.0b013e318183f848.
59. Zook J.M., Chapman B., Wang J., Mittelman D., Hofmann O., Hide W., Salit M. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 2014;32:246–251. doi: 10.1038/nbt.2835.
60. Huddleston J., Chaisson M.J.P., Steinberg K.M., Warren W., Hoekzema K., Gordon D., Graves-Lindsay T.A., Munson K.M., Kronenberg Z.N., Vives L. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 2017;27:677–685. doi: 10.1101/gr.214007.116.
61. Norris A.L., Workman R.E., Fan Y., Eshleman J.R., Timp W. Nanopore sequencing detects structural variants in cancer. Cancer Biol. Ther. 2016;17:246–253. doi: 10.1080/15384047.2016.1139236.
62. Yuen R.K.C., Merico D., Cao H., Pellecchia G., Alipanahi B., Thiruvahindrapuram B., Tong X., Sun Y., Cao D., Zhang T. Genome-wide characteristics of de novo mutations in autism. NPJ Genom. Med. 2016;1:160271. doi: 10.1038/npjgenmed.2016.27.
63. Jiang Y.H., Yuen R.K.C., Jin X., Wang M., Chen N., Wu X., Ju J., Mei J., Shi Y., He M. Detection of clinically relevant genetic variants in autism spectrum disorder by whole-genome sequencing. Am. J. Hum. Genet. 2013;93:249–263. doi: 10.1016/j.ajhg.2013.06.012.
  • 64.Telenti A., Pierce L.C.T., Biggs W.H., di Iulio J., Wong E.H.M., Fabani M.M., Kirkness E.F., Moustafa A., Shah N., Xie C. Deep sequencing of 10,000 human genomes. Proc. Natl. Acad. Sci. USA. 2016;113:11901–11906. doi: 10.1073/pnas.1613365113. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Document S1. Figures S1–S23, Tables S1–S15, and Tutorial
mmc1.pdf (3.4MB, pdf)
File S1. HuRef CNV Benchmark (GRCh37/hg19 Coordinates)
mmc2.txt (81.1KB, txt)
File S2. HuRef CNV Benchmark (GRCh38/hg38 Coordinates)
mmc3.txt (78.6KB, txt)
File S3. NA12878 CNV Benchmark (GRCh37/hg19 Coordinates)
mmc4.txt (39.9KB, txt)
File S4. AK1 CNV Benchmark (GRCh37/hg19 Coordinates)
mmc5.txt (41.1KB, txt)
File S5. Raw Output Files from the Six CNV-Detection Algorithms for the HuRef, NA12878, and AK1 Genomes
mmc6.zip (4.9MB, zip)
File S6. Details on the DNA Libraries Used in This Study and Corresponding WGS Data
mmc7.xlsx (60KB, xlsx)
File S7. Putative ASD-Risk Genes
mmc8.xlsx (34.3KB, xlsx)
File S8. CNVs Analyzed in Stage 3 of This Paper
mmc9.xlsx (73.8KB, xlsx)
Document S2. Article plus Supplemental Data
mmc10.pdf (4.6MB, pdf)

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics
