Detection of elusive DNA copy-number variations in hereditary disease and cancer through the use of noncoding and off-target sequencing reads

Mathieu Quinodoz; Karolina Kaminska; Francesca Cancellieri; Ji Hoon Han; Virginie G Peter; Elifnaz Celik; Lucas Janeschitz-Kriegl; Nils Schärer; Daniela Hauenstein; Bence György; Giacomo Calzetti; Vincent Hahaut; Sónia Custódio; Ana Cristina Sousa; Yuko Wada; Yusuke Murakami; Almudena Avila Fernández; Cristina Rodilla Hernández; Pablo Minguez; Carmen Ayuso; Koji M Nishiguchi; Cristina Santos; Luisa Coutinho Santos; Viet H Tran; Veronika Vaclavik; Hendrik PN Scholl; Carlo Rivolta

doi:10.1016/j.ajhg.2024.03.001

. 2024 Mar 25;111(4):701–713. doi: 10.1016/j.ajhg.2024.03.001

Detection of elusive DNA copy-number variations in hereditary disease and cancer through the use of noncoding and off-target sequencing reads

Mathieu Quinodoz ^1,^2,³, Karolina Kaminska ^1,², Francesca Cancellieri ^1,², Ji Hoon Han ^1,², Virginie G Peter ^1,^2,⁴, Elifnaz Celik ^1,², Lucas Janeschitz-Kriegl ^1,², Nils Schärer ^1,², Daniela Hauenstein ^1,², Bence György ^1,², Giacomo Calzetti ^1,², Vincent Hahaut ^1,², Sónia Custódio ⁵, Ana Cristina Sousa ⁵, Yuko Wada ⁶, Yusuke Murakami ⁷, Almudena Avila Fernández ^8,⁹, Cristina Rodilla Hernández ^8,⁹, Pablo Minguez ^8,⁹, Carmen Ayuso ^8,⁹, Koji M Nishiguchi ¹⁰, Cristina Santos ^11,¹², Luisa Coutinho Santos ¹², Viet H Tran ^13,¹⁴, Veronika Vaclavik ¹³, Hendrik PN Scholl ^1,², Carlo Rivolta ^1,^2,^3,^∗

PMCID: PMC11023916 PMID: 38531366

Summary

Copy-number variants (CNVs) play a substantial role in the molecular pathogenesis of hereditary disease and cancer, as well as in normal human interindividual variation. However, they are still rather difficult to identify in mainstream sequencing projects, especially involving exome sequencing, because they often occur in DNA regions that are not targeted for analysis. To overcome this problem, we developed OFF-PEAK, a user-friendly CNV detection tool that builds on a denoising approach and the use of “off-target” DNA reads, which are usually discarded by sequencing pipelines. We benchmarked OFF-PEAK on data from targeted sequencing of 96 cancer samples, as well as 130 exomes of individuals with inherited retinal disease from three different populations. For both sets of data, OFF-PEAK demonstrated excellent performance (>95% sensitivity and >80% specificity vs. experimental validation) in detecting CNVs from in silico data alone, indicating its immediate applicability to molecular diagnosis and genetic research.

Copy-number variations are a type of DNA variants that can lead to genetic disease and cancer. Because of important technical limitations, they are usually difficult to identify. We have created a software, OFF-PEAK, which makes use of data that are discarded by conventional procedures to efficiently detect these particular variants.

Introduction

Targeted next-generation sequencing (NGS) approaches, such as whole-exome sequencing (WES), are widely used to investigate the molecular origin of hereditary disease, cancer, or normal interindividual genetic variability. Small DNA changes such as single-nucleotide variants (SNVs) or short insertions and deletions are often identified as the underlying cause of these disorders or phenotypes (e.g., Töpf et al.,¹ Perea-Romero et al.,² Bae et al.³). However, in many cases, pathogenic genetic variants consist of DNA rearrangements, such as deletions or duplications, involving dozens to millions of base pairs. These larger events, collectively termed copy-number variants (CNVs), can be responsible for disease in up to 20% of affected individuals, depending on the specific condition and the ethnicity of the cohorts analyzed.⁴^,⁵^,⁶

CNVs can be detected using specific molecular biology techniques, such as microarray-based comparative genomic hybridization (array-CGH) or multiplex ligation dependent probe amplification (MLPA).⁷ However, these analyses involve higher costs with respect to mainstream genomic technologies and are not routinely applied in genetic diagnosis. A few studies have shown that CNVs can be detected by exploiting the information contained in NGS data, such as WES, whole-genome sequencing (WGS), or targeted sequencing (NGS panels). More specifically, they can be inferred using multiple layers of information that are embedded in sequencing data: relative read coverage, split-reads, split pairs, B-allele frequency (BAF) of SNVs, de novo assembly, or a combination of these.⁸^,⁹^,¹⁰

Approaches based on coverage of captured regions are the most relevant ones when data from WES or from NGS panels are considered, since these experiments are unlikely to include split-reads, split pairs, or enough SNVs for BAF analysis. In these instances, CNVs are detected by comparing the depth of reads aligning to the reference sequence of the human genome, since deletions should in theory result in lower local coverage, whereas duplications should result in increased coverage. However, coverage-based CNV detection faces a major challenge, namely the considerable variability in read depth that is normally detected within and across samples. Coverage variability, due to differences in DNA quality, capture efficiency, read mappability, etc., represents in fact an overwhelming source of noise that can easily mask the true signal originating from CNVs.¹¹ To circumvent this problem, several denoising approaches have already been implemented and incorporated into in silico tools, including principal components analysis (PCA) and singular value decomposition (SVD), leading to variable results.¹⁰ These tools are also based on various algorithms and may be adapted to mine specific sequencing sets (e.g., NGS panels, WGS, etc.) or be primarily suitable for particular CNV types (e.g., large CNVs associated with cancer or, conversely, small CNVs associated with Mendelian diseases). Moreover, it can be noted that the degree of user-friendliness and the type of output files differ from tool to tool.¹⁰

In WES and NGS panels, the template DNA to be sequenced is pre-processed by hybridization capture or by other techniques, in order to select some regions of the genome for downstream analysis while discarding others. Interestingly, however, in addition to sequences from captured regions, the raw output of these experiments also contains many off-target reads, i.e., sequences belonging to portions of the genome that were not selected for further processing but were nonetheless present as contaminants in sequencing libraries. Such sequences may represent in fact up to 60% of the total reads¹² and can be used, in principle, to detect CNVs. We reasoned that this a priori unwanted information, in combination with region-specific denoising approaches, could be harnessed to predict the presence of CNVs from NGS data. By exploiting this concept, we have developed OFF-PEAK, a tool that, by taking BAM and BED files as inputs, is capable of performing three essential functions: (1) detecting rare CNVs (in targeted regions, untargeted regions, or both) by using off-target reads, (2) achieving high performance in this task by using only primary data from WES or NGS panels, and (3) providing comprehensive and user-friendly output files. In essence, as described below, we created a robust and versatile CNV detection software that can be used to analyze any generic NGS project.

Material and methods

Samples from human subjects

This study was performed according to the tenets of the Declaration of Helsinki and was approved by the Ethics Committees of the respective Institutions involved: the Ethikkommission Nordwest- und Zentralschweiz, the Commission Cantonale d’Étique de la Recherche sur l’Être Humain du Canton de Vaud, the Comissão de Ética para a Saúde do Instituto de Oftalmologia Dr. Gama Pinto, the Comité de Ética de la Investigación de la Fundación Jiménez Díaz, the Institutional Review Boards of the Kyushu University Hospital, the Yuko Wada Eye Clinic, and the Nagoya University Hospital. Written informed consent was obtained from all participants or their legal guardians prior to their inclusion in this study. DNA was extracted from the participants’ whole-blood or saliva samples.

We selected 130 persons with inherited retinal diseases (IRDs) who did not have a clear molecular diagnosis prior to CNV detection to be used as a validation set for OFF-PEAK. For the quantification of on- vs. off-target reads, we used 22 of these samples, as well as 172 other affected individuals; cross-platform comparison was achieved by analyzing data from 60 additional subjects (51 + 9), as detailed below.

Whole-exome sequencing procedures

WES was performed at CeGaT GmbH. There, sequencing libraries were generated using either the Twist Human Core Exome (kit 1, batch of 11 samples), the Twist Human Core Exome Plus (kit 2, batch of 97 samples), or the Twist Exome 2.0 Panel (kit 3, batch of 22 samples) (Twist Bioscience) following the manufacturers’ protocols. Libraries underwent paired-end sequencing on a NovaSeq 6000 (Illumina), resulting in reads of 100 bases. Obtained reads were subsequently processed by our team.

Scoring of on- and off-target reads

We used the CollectHsMetrics command from Picard (v.2.23.8) with default parameters to retrieve the number of mapped reads in targeted regions, near them, or somewhere else. The command was run for the 194 samples (172 + 22) described above, which were also sequenced using the Twist Exome 2.0 panel (kit 3) and Illumina NovaSeq 6000 machines with at least 12 Gb of output per sample. Additionally, the same command was run on 51 samples sequenced using the SureSelect Human All Exon V6 capture kit by Agilent (sequenced on a NovaSeq 6000), as well as 9 samples sequenced using the TruSight One Expanded capture kit by Illumina and sequenced with a NextSeq 500 system (Illumina).

Mapping and variant calling

The raw sequence files were assessed, trimmed, and finally mapped back to the human genome reference sequence (build hg19/GRCh37) using BWA (v.0.7.17). Then, Picard (v.2.14.0-SNAPSHOT) and GATK (v.4.1.4.1) were used to process mapped reads and perform base quality score recalibration and variant calling. DNA variants were processed and scored according to an internal computational pipeline,¹³ using ANNOVAR.¹⁴

Processing of on- and off-target intervals by OFF-PEAK

The 01_targets-offtargets.sh script of OFF-PEAK was used to process the input BED files provided by Twist Bioscience for capture kits 1–3, containing the information relative to targeted regions (--targets option) defined for either the hg19 or the hg38 genome builds (--genome option). The process developed as follows: first, regions smaller than a given threshold (--minOntarget, default 100 bp) were extended to reach this specific threshold, and regions larger than a given value (--maxOntarget, default 300 bp) were split into equal parts to reach a size below this value. Next, off-target regions were defined as the whole reference sequence minus all padded (--paddingOfftarget, default 300 bp) on-target regions. These were then further divided into equal parts if they were larger than a specific value (--maxOfftarget, default 50,000 bp) or discarded if they were smaller than a minimum size (--minOfftarget, default 1 bp). RefSeq exons within both on- and off-target regions were annotated as such in the output BED file. This step required the use of R (annotate-off-targets.R and annotate-targets.R scripts), without the use of any particular libraries, as well as of the reference genome as a single FASTA file (--ref option).

All parameters were heuristically optimized following the analysis of more than 1,000 internal WES data and can be further optimized for other types of capture kits or sequencing procedures. More specifically, we recommend adapting the maxOfftarget parameter according to the percent of off-target vs. on-target reads, i.e., maxOfftarget = 2 ^∗ maxOntarget ^∗ (% on-target reads)/(% off-target reads), the value 2 representing a “safety” parameter that takes into account the variability of the coverage for off-target regions. For instance, if the maximum size of on-target regions is 300 bp, we would advise to select a maximum size of off-target regions of 50 kbp for the Twist capture kit, 70 kbp for the Agilent kit, and 60 kbp for the Illumina kit.

Determination of coverage of on- and off-target regions by OFF-PEAK

To compute the number of base coverage to each specific region, OFF-PEAK (02_bam-count.sh script) used the mosdepth software (v.0.3.2),¹⁵ with parameters: --no-per-base, --threads 2, --mapq 50. As input, it used the processed targets (output of 01_targets-offtargets.sh script, --targetsBED), the working directory (--work), the location of the mosdepth software (--mosdepth), and a tab-delimited text file containing one or two columns (BAM file and possibly sample IDs, --listBAM).

CNV detection by OFF-PEAK

For the whole procedure, the following R libraries were used: optparse (v.1.7.3),¹⁶ gplots (v.3.1.3),¹⁷ ExomeDepth (v.1.1.16),¹⁸ pROC (v.1.18.0),¹⁹ and caTools (v.1.18.2).²⁰ CNV detection was achieved by analyzing each sample separately, according to the following procedure. First, control samples were selected as those displaying high correlation values with respect to the test sample (--mincor, default: 0.9), based on 10,000 randomly selected autosomal target regions passing the requirement on minimum coverage and maximum variance (--minsignal and --maxvar, defaults: 2,500 and −0.2, respectively). The smallest number of control samples selected was 15 (unless there were fewer samples in total), the maximum was 96.

Target regions were then selected to include an annotated exon or to have a minimal size (--minOfftarget, default: 1,000). The signal on such targets was subsequently normalized to their GC content by using the lm function from the R Stats package on the signal for the test sample, divided by the average signal from the control samples. This procedure was applied separately for autosomes and for the X chromosome. Normalization of the signal from each target region was also performed with respect to the total coverage from the sample it belonged to (sum of coverages of all regions). This, again, was done separately for autosomes and for the X chromosome. After that, all targets with an average coverage inferior to a pre-defined value or with variance of coverage superior to a threshold were filtered out (--minsignal and --maxvar options).

Noise removal was achieved by using a leave-one-out principal components analysis (LOO-PCA) approach (see supplemental methods for details). Following this denoising process, CNVs were computed on the test sample versus the control samples, excluding control samples with coverage values that were outside of 2 standard deviations from the average of controls. CNVs involving more than one target were detected using the viterbi.hmm function (ExomeDepth package v.0.8.0),¹⁸ with transition probabilities of 0.00005 between a normal copy number state and another state, and of 0.5 to continue in the same state. Single-target events were selected based on the absolute Z score (--minZ, default 4), and only if they were not part of events involving multiple targets. Ploidy was determined as corresponding to nearest entire number with the smaller Z score with respect to the test sample.

Further details on detection and annotation of CNVs, as well as on the production of output files, are described in the supplemental methods.

Results

Scoring of off-target reads in WES data

We found that a considerable number of contaminant DNA from untargeted regions of the genome (Figure 1A) are present in the sequence files of WES, even when using the latest technologies for DNA capture and parallel sequencing (Figure S1). Specifically, the analysis of 194 WES datasets that were recently generated by our team using a capture kit from Twist (Exome 2.0) revealed that, on average, 47.2% of the reads obtained mapped directly to the targeted regions, 33.6% lay in their close vicinity (within 250 bp), 18.0% mapped to further non-captured regions, and 1.2% did not map anywhere. Since targeted regions and their proximal sequences represent only 5.7% of the human genome, we can expect the coverage of off-target regions to be approximately 80 times lower than that of targeted ones (18.0/(47.2 + 33.6)^∗0.057 ≈ 1/80). This would imply a coverage of ∼2× in off-target regions for a WES having an average coverage of 200×, representing a major source of data and an unexpected opportunity for CNV detection. Following this reasoning, a CNV tool that could identify a CNV affecting a single exon should also, in principle, be able to detect a CNV affecting a ∼100 times larger off-target region.

Use of uncaptured reads contaminating WES or NGS panels’ libraries for the detection of CNVs

(A) Schematic view of a typical coverage of a genomic region, following targeted NGS. Individual reads are represented by thin horizontal lines and coverage by grey-shaded areas. Since, at the pre-sequencing stage, capturing probes target only specific sequences of the genome (e.g., exons, in blue, in B), coverage tends to be higher for such regions of interest. However, target selection and downstream fragment purification are not 100% efficient, and DNA fragments that lie in uncaptured regions are frequently co-purified along with fragments from targeted areas.

(B and C) Regions of the genome that are targeted by a capturing system, along with their size and position, can be used to define on-target (green boxes) and off-target (orange boxes) sets for further computing the presence of CNVs occurring in captured areas, in uncaptured areas, or in both. Other procedures to improve CNV calling, such as intron padding and resizing of targets, are also shown.

To investigate whether the number and distribution of off-target reads was comparable to those of other capture kits, we analyzed sequencing data produced by using the Agilent (SureSelect) and Illumina (TruSight One) systems. We found indeed very similar values, and specifically that 50.9% and 48.5% of reads mapped to targeted regions, 35.6% and 34.9% to their close vicinities, and 12.7% and 14.2% further in non-captured regions for the Agilent and the Illumina kits, respectively.

Implementation of OFF-PEAK

Based on these data and on the possibility of exploiting off-target reads for CNV detection, we designed a specific workflow, ideally meant to be run on a set of samples sequenced in similar experimental conditions, and condensed it into a single software, OFF-PEAK. The analytical process of this tool was divided into four main steps; (1) target region pre-processing, (2) counting reads and coverage, (3) CNV detection, and (4) CNV annotation and graphical representation (Figure 2, details in supplemental methods).

Outline of the procedures used by OFF-PEAK

The workflow is divided into four main steps: (1) processing of target regions, (2) counting reads and coverage, (3) CNV detection, and (4) CNV annotation and graphical representation.

As part of the first step, in order to harvest as much information as possible from off-target reads, we had OFF-PEAK divide the genome based on captured and uncaptured regions (Figure 1B). This was achieved by processing the file containing the coordinates of the targeted sequences that are usually provided by the vendor of the capture kit (BED file). Then, both uncaptured and captured regions were subdivided in parts of similar lengths, defining “on-target regions” and “off-target regions,” respectively (Figure 1C). Sizes of these latter regions were purposely set to be larger than those from the former regions, to compensate for the higher coverage displayed by exonic/captured stretches of DNA and allow for a similar number of reads to be present within any given region. In other words, we performed differential partitioning to increase the sensitivity to differences in coverage for every single DNA stretch, and specifically to allow the detection of partial exonic rearrangements. In addition, within off-target sets, we excluded DNA regions that were in close vicinity to on-target regions (default range: 300 bp from each side). Again, the goal of this padding process was to allow for a better identification of differences in coverage in off-target regions, since the high number of reads in these proximal regions, a byproduct of exonic capture, would hide the true coverage value of introns (Figure 1B). Then, small captured regions were extended in silico to result in larger on-target regions, using these byproduct reads to allow reliable estimation of coverage (Figures 1B and 1C). Finally, all regions were annotated with exons from RefSeq transcripts for further use.

As a second step, we used the mosdepth software¹⁵ on BAM files to determine the coverage of each on-target and off-target region, for each sample. We then merged these individual data to produce a single coverage file, which contained standardized coverage information relative to all regions.

Step 3 involved the actual detection of CNVs, based on the comparison between every single sample with samples from the rest of the pool (used as controls). Specifically, we first selected all controls that showed highly correlated coverage with the test sample (based on 10,000 randomly selected regions, see material and methods) and then normalized all coverage values based on each sample’s total coverage and each region’s GC content, using a regression method (details in material and methods). We subsequently filtered the on- and off-target regions based on average coverage and standard deviation to remove outlier regions with insufficient signal or with very high variation in coverage, mostly representing regions with high homology or repeats such as telomeres, centromeres, and pseudogenes. After that, for the test sample we operated a random selection of 1,000 on-target or off-target regions. To half of them, we assigned an artificial coverage corresponding to 50% of the real one, simulating a heterozygous deletion, and to the remaining 500 a 150% coverage, to simulate a heterozygous duplication. OFF-PEAK then applied the leave-one-out PCA (LOO-PCA) method associated with these artificial CNVs to compute how many principal components (PCs) need to be removed to achieve the best performance in retrieving all 1,000 artificial CNVs (see supplemental methods for details on this selection). Noise removal using LOO-PCA is very similar to PCA-based methods, the difference being that the test sample is not included in the computation of PCs and is simply projected on them afterward. By this method, the variation in coverage resulting from the presence of a true CNV in the test sample is not used to build the PCs and therefore the signal is not lost when PCs are removed for noise correction. Our analysis showed that computing PCs only on control samples (not including the test sample) results in higher robustness in CNV detection. In other words, CNV signals remain strong even following the removal of multiple PCs and this is, in fact, a signature feature of OFF-PEAK. As an example to illustrate the difference between PCA and LOO-PCA, we selected a heterozygous deletion affecting 20 exons that was detected in an individual sequenced with the capture kit 1. When no PC is removed, the CNV is detectable by visual inspection but cannot be scored as a true CNV since it lies within the range of experimental uncertainty, represented by the gray area in Figure 3A. When an increasing number of PCs is removed with standard PCA, the noise decreases, but so does the signal originating from the true CNV (green squares and blue circles), to the point that it is, in the end, completely lost. Conversely, with LOO-PCA, the noise progressively decreases with an increasing number of PCs removed, but the signal originating from the CNV does not. To better score this phenomenon, we assessed the detection performance on artificial CNVs when PCs are removed with PCA or LOO-PCA. We found that performance of LOO-PCA remains robust, even when many PCs are removed, whereas it drops quickly with PCA. This effect was found to be similar for samples sequenced within small (Figure 3B) and large (Figure 3C) batches. After noise removal using PCs, CNVs were detected by comparing PC-processed coverage values of the test sample vs. control samples. CNVs affecting multiple consecutive targets were further inferred with a Hidden Markov Model (HMM) approach using the Viterbi algorithm,¹⁸ as described in the material and methods.

Denoising approaches for CNV detection

(A) Example of the effect of principal components (PC) removal on the detection of a CNV, using PCA or LOO-PCA. Plots indicate coverage for each target along a common DNA stretch. When no PC is removed, coverage (dotted line) of the CNV (in this case, a heterozygous deletion) falls into the normal range of variation of 10 control samples (gray area) and cannot be automatically detected. Removal of initial PC components reduces the noise linked to normal variation of coverage; however, when standard PCs removal is applied, it also reduces the amplitude of the true signal linked to the real CNV (top row). This is not the case for PC removal using the LOO-PCA approach, which reduces the noise associated with normal variation of coverage but does not affect the true signal (bottom row).

(B) OFF-PEAK’s performance in retrieving artificial CNVs (500 heterozygous deletions and 500 heterozygous duplications) by removing an increasing number of PCs (x axis) using either PCA or LOO-PCA for a small batch of 11 samples. Performance values are presented as area under the curve (AUC) of a receiver operating characteristic (ROC) curve.

(C) Same procedure depicted in (B), but for a larger WES batch (97 samples).

As a fourth step, multiple output files were created: text files annotating all detected CNVs with various features, files compatible with AnnotSV²¹ and the Integrative Genomics Viewer (IGV),²² graphical representations of CNVs, as well as genome-wide and chromosome-specific coverage plots (Figure S2). These files allow the user to analyze all identified CNVs in detail and, if needed, take advantage of additional software for further investigations. The graphical representation of the detected CNVs (Figure S3), in particular, allows for a quick visualization of the CNV and its neighboring reads (including off-target reads, frequent [benign] CNVs from the gnomAD database, and pathogenic CNVs from the ClinVar database). It also allows a better representation of the noise level around a CNV, compared to standard quality metrics.

Moreover, we implemented an additional filter, called OFF-PEAK-HQ, to select “high-quality” CNVs, i.e., structural variants that are called with higher confidence and are more likely to represent true positive events (see supplemental methods for details).

Validation of the tool on the ICR96 dataset

As a first testing ground for OFF-PEAK, we selected the ICR96 reference dataset (ICR96 exon CNV validation series), comprising a collection of 96 cancer samples for which 26 genes were sequenced by NGS. All their exons were further experimentally investigated for the presence of copy-number events using MLPA.²³ This set is routinely used for estimating the performance of exon-CNV calling software on NGS data. As a comparison set, we selected eight widely used tools that were developed for CNV detection starting from WES or panel-NGS reads,¹⁰ and specifically: cn.mops,²⁴ CNVkit,²⁵ CODEX2,²⁶ CoNIFER,²⁷ Control-FREEC,²⁸ ExomeDepth,¹⁸ GATK gCNV,²⁹ and SavvyCNV.³⁰ All these tools were run by using their default parameters, as recommended by their respective developers.

We started by scoring the number of CNVs detected by each tool that overlapped with the true CNVs found by MLPA for at least one captured region, considering the correct predicted ploidy as well. At the end of the process, OFF-PEAK was the only tool that retrieved all of the 68 CNVs validated by MLPA (sensitivity = 100%), with a specificity of 56.7% (Figure 4A; Tables S1 and S2). Three other tools, SavvyCNV, ExomeDepth, and GATK gCNV, displayed very good sensitivity as well (>92%), missing only 2, 4, and 5 CNVs, respectively, but had relatively low specificity (9.8%, 29.6%, and 1.4%, respectively) (Figure 4A; Table S2). The remaining five tools all displayed much lower sensitivity (<36%), clearly separating from the previous ones, while one did not produce any output, because the number of target regions per chromosome was lower than the minimal required input (Figure 4A; Table S2).

Performance of OFF-PEAK and other tools with respect to different testing sets

(A) Specificity-sensitivity plot for the ICR96 dataset of 96 cancer samples, on 68 validated CNVs.

(B) Specificity-sensitivity plot for the cohort of 130 individuals with retinal phenotypes, on 37 validated CNVs. The curves in light gray represent the F-score or harmonic mean of sensitivity and specificity.

(C) Bar plots of sensitivities for the latter cohort, stratified according to the type of CNVs considered: off-target (e.g., noncoding), on-target single-exon, small batches (kits 1 and 3), and large batch (kit 2, see supplemental methods).

We then repeated the analysis by scoring only predicted CNVs that overlapped fully with the captured regions of true CNVs. Sensitivity was slightly lower for all well-performing tools (OFF-PEAK [−4.4%], ExomeDepth [−4.4%], GATK gCNV [−5.8%], and SavvyCNV [−4.4%]), usually because they detected the correct event but missed one captured region or included one adjacent extra region with normal ploidy (Figure S4; Table S2).

When the OFF-PEAK-HQ filter was deployed, specificity increased from 56.7% to 86.8%, at the price of a slightly lower sensitivity, which decreased to 97.1%, due to two CNVs that did not have sufficient quality to be called with high confidence (Figure 4A; Table S2).

Validation on WES from individuals with rare diseases

As an additional test for evaluating the performance of our tool, we gathered WES data from 130 individuals with inherited retinal diseases (IRDs), for whom no causative SNVs or small insertions or deletions (indels) had been found (Table S3). IRDs are rare Mendelian disorders for which mutations in any one of multiple disease-associated genes are at the same time a sufficient and necessary cause of the condition.³¹ Specifically, every person with IRD necessarily bears in their genome one (dominant, mitochondrial, X-linked in males) or two (recessive) pathogenic variants, and failure to detect them is generally attributed to technical limitations. The samples analyzed were from individuals of various ethnicities, recruited in Switzerland (n = 79), Portugal (n = 29), and Japan (n = 22). Since, unlike the ICR96 set, true CNVs were not known, we performed a CNV discovery phase within the 132 genes associated with IRDs (Table S4). Using all tools considered above, 1,743 CNVs were identified (Table S5), which were further filtered to be compatible with each gene’s inheritance mode (see supplemental methods). CNVs detected by controlFREEC (869 events on 130 exomes) and by CNVkit (75 events, of which 72 in 3 out of 130 exomes) were not considered for the next step, since they represented either falsely positive or artifactual data (see supplemental methods and Table S5). The remaining 556 CNVs were manually examined, looking at the number and the distribution of mapped reads, and then curated based on the correlation between genotype (gene affected by a given CNV) and phenotype (specific clinical signs possibly resulting from the inactivation of that gene). The final list of candidates included 37 CNVs in 34 individuals, which were all experimentally validated and found to correspond to real events by polymerase chain reaction (PCR), WGS, MLPA, or a combination of these techniques (Table S6).

We therefore considered these 37 CNVs as true positives and computed the performance of OFF-PEAK vs. that of the other tools on these specific data, first by taking into account CNV calls having the correct ploidy and overlapping with true CNVs. OFF-PEAK outputted a total of 138 predictions, including the 37 true positive ones, corresponding to a sensitivity value of 100% and a specificity of 27.4% (Figure 4B; Tables S2 and S7). OFF-PEAK-HQ detected 35 events out of 37 as high-quality CNVs, resulting in a sensitivity of 94.6% and a specificity of 79.5% (Figure 4B). Interestingly, the two missed CNVs belonged to the two samples with the lowest maximum pairwise correlations, which emphasizes the importance of using high-quality DNA as starting material and the need for a sufficiently large batch of samples in order to obtain a reliable analysis. As for the ICR96 data, ExomeDepth, GATK gCNV, and SavvyCNV also displayed good performances, although inferior to OFF-PEAK, missing 7, 7, and 13 CNVs, respectively (sensitivity = 81.1%, 81.1%, and 64.9%; specificity = 34.9%, 37.5%, and 10.8%, respectively) (Figure 4B). Of note, for all tools considered, specificity values were potentially underestimated in this analysis, since it is likely that other true CNVs, apart from the validated ones, were detected in silico but were not considered as true positives, since they were not judged to be causative of the disease.

As before, we repeated the analysis by considering only predicted CNVs that strictly overlapped with all captured regions of the true events. OFF-PEAK and SavvyCNV maintained the same performance, with all CNVs completely overlapping with the true ones. ExomeDepth failed to detect complete overlap for one CNV, although it detected it with partial overlap (−2.7% in sensitivity), and GATK gCNV did the same for 4 CNVs (−10.8% in sensitivity) (Figure S4; Table S2).

The graphical output of OFF-PEAK for validated CNVs is shown in Figure S3, and selected examples can be seen in Figure 5.

Four relevant examples of OFF-PEAK graphical outputs

(A) Heterozygous deletion affecting 13 exons of *PRPF31* and 3 other genes for CHbasl0108 with retinitis pigmentosa.

(B) Heterozygous duplication affecting exon 35 of *EYS* for CHbasl0212 with retinitis pigmentosa.

(C) Homozygous deletion affecting a non-coding and non-covered exon of *PCDH15* for CHbasl0303 with Usher syndrome type I.

(D) Heterozygous deletion affecting only a part of exon 4 of *RP1* for OPH-102-940 with retinitis pigmentosa.

Detailed performance results

Heterozygous CNVs are notoriously difficult to detect, especially with respect to their homozygous counterparts, due to a lower difference in coverage compared to controls. OFF-PEAK identified all 25 heterozygous CNVs from the retinal disease cohort with high performance (vs. 21 for ExomeDepth and GATK gCNV, the second-best performers; Table S8, example in Figure 5A).

Similarly, single-exon events are generally more difficult to identify because of a lower overall read coverage. This is also the case for CNVs affecting only parts of an exon, since differences in coverage involve only a subset of a captured region. The ICR96 set comprised 25 single-exon CNVs, and OFF-PEAK was able to detect them all (sensitivity = 100%, Table S2). Three other tools—SavvyCNV, ExomeDepth, and GATK gCNV—had good performances, detecting 24, 22, and 21 of them (96%, 88%, and 84% sensitivity, respectively) (Table S2). The other tools tested displayed only 20% sensitivity or lower. In the retinal disease cohort, there were 10 single-exon events, including one partial exonic deletion. Again, OFF-PEAK could identify all of them (vs. 9 identified by ExomeDepth and GATK gCNV and 6 by SavvyCNV; Figure 4C; Table S8, example Figures 5B–5D).

Another type of challenging CNVs are those occurring in off-target regions, since they have much lower coverage with respect to targeted ones. In the IRD set, OFF-PEAK detected all of the 5 off-target CNVs (vs. 2 identified by SavvyCNV; Figure 4C; Table S8, example in Figure 5C). These CNVs involved non-coding exons comprising the 5′ UTR of EYS, PRPF31, and PCDH15. Overall, ExomeDepth and GATK gCNV missed all CNVs affecting non-coding exons that were not targeted for DNA capture. This was expected, given that ExomeDepth analyzes only coding exons and GATK gCNV solely uses targeted regions as reference (Figure 4C; Table S8). SavvyCNV detected some CNVs in non-captured regions but missed CNVs in samples from smaller batches (Figure 4C; Table S3, see supplemental methods for details on batches). This is consistent with previous analyses showing that SavvyCNV needs more than 50 samples to achieve high performances.³⁰

We also assessed the presence of split reads by manually inspecting the regions containing validated CNVs using the IGV software.²² In 73% of all cases (n = 27), no split read could be found, and precise breakpoints could not be resolved (Table S8). In 13.5% of cases (n = 5), only one split read was identified and, finally, in an additional 13.5% of cases more than 1 (precisely, 20 or more) split reads were found, involving the same breakpoint, which was present in a captured region (Table S8).

Finally, we evaluated all tools in terms of time needed to compute the three WES batches used in the IRD tests. For the largest batch (97 samples), computing time ranged from 4.8 to 34.3 h. Among the tools with good performance, GATK gCNV was the fastest (6.8 h), while OFF-PEAK took 21.2 h to complete the analysis. For the smaller batches (11 and 22 samples), the time needed was of 5 h or less for most tools (Figure S5).

Clinical classification of the CNVs detected and further analyses of IRD data

When evaluated for their pathogenicity, 34 out of the 37 validated CNVs detected by OFF-PEAK in the IRD set were classified as likely pathogenic or pathogenic according to the American College of Medical Genetics and Genomics (ACMG) guidelines³²^,³³ (Table S6). The most frequently affected genes were EYS (n = 11), PRPF31 (n = 5), and USH2A (n = 3). Thirteen CNVs were found in genes for which rare and deleterious small variants had already been detected, presumably in trans (Table S9). A homozygous deletion of exon 9 of DRAM2 (GenBank: NM_001349884.2), resulting in an in-frame deletion of 30 amino acid residues, was classified to be of uncertain significance due to insufficient evidence for pathogenicity, while a duplication affecting exons 1 to 9 of RCBTB1, occurring in two affected individuals, was classified as likely benign, because it still allows for a complete copy of the gene to be present.

To further investigate all genomes for which causative mutations had not yet been identified and to indirectly test the adaptability of our tool to particular datasets/conditions, we performed a supplementary OFF-PEAK analysis with lower stringency (see supplemental methods). This led to the detection of two additional likely pathogenic CNVs: a partial heterozygous deletion of exon 6 of ABCA4 in LL359, who also harbors a likely pathogenic missense variant in the same gene (GenBank: NM_000350.3; c.1749G>C [p.Lys583Asn]), and an apparent partial heterozygous deletion of exon 4 of RP1 in YWC267, which was in fact the sign of a homozygous insertion of an Alu element—a frequent pathogenic variant found in the Japanese population (Table S3).³⁴ Both events were validated by PCR and Sanger sequencing. Of note, these events were also identified by the Scramble software,³⁵ which detects mobile element insertions (MEIs) and small deletions by identifying clusters of soft-clipped reads.

Discussion

Short-read NGS procedures, including targeted and whole-exome sequencing, are the most commonly used techniques in molecular medical genetics, in particular to detect germline DNA variants that are associated with rare hereditary conditions or somatic mutations leading to cancer. However, despite being tremendously effective in identifying single-nucleotide variants or small pathogenic events, NGS panels and WES perform rather poorly in detecting CNVs. This is not due to a lack of primary information contained in sequencing data but, rather, to the fact that such information is not routinely exploited during data analysis. Specifically, most algorithms focus on coverage of captured regions and use this value as a proxy of ploidy, therefore discarding the data associated with the mapping of off-target reads.

Conversely, OFF-PEAK makes primary use of this type of contaminating data. It has been previously shown that off-target reads can be considered for the detection of CNVs in off-target regions, and a few tools, such as CNVkit,²⁵ SavvyCNV,³⁰ cnvOffSeq,³⁶ and CopywriteR,³⁷ already make use of them for this purpose. The main differences between OFF-PEAK and existing algorithms consist of two specific and important points. The first is the use of LOO-PCA in order to solve the most critical confounding factor in CNV detection: coverage variability. This noise is intrinsic to any NGS panel or WES, and therefore cannot be completely eliminated at the experimental level. LOO-PCA allows for a high reduction of such noise while keeping most of the signal related to variations in coverage that are linked to the presence of CNVs. In more technical terms, this denoising approach excludes the sample that is under investigation from the calculation of principal components of the coverage data (representing the actual noise), allowing real signal from CNVs not to be lost when PCs are removed by the denoising process. Specifically, the variance associated with coverage values for targets included in a CNV is smaller if the test sample is not taken into consideration. Therefore, such targets have a smaller correction associated with the removal of the first PCs, compared to standard PCA. In principle, it would be advisable not to include multiple members of the same family in the same batch, since such members would be used as controls in the LOO-PCA denoising procedure, and rare CNVs shared with the test sample could result in decreased true signal. In practice, however, it seems that OFF-PEAK’s denoising procedure is rather insensitive to this effect, as our test on the IRD large batch showed for instance that up to 7% of samples from the same batch could carry the same rare heterozygous deletion in the gene DMBT1 without interfering with the call of the CNV itself (Table S10).

The second advantage of OFF-PEAK with respect to other software is the pre-processing of captured and uncaptured regions, which allows for a better scoring of CNVs by the use of both on- and off-target reads. For targeted regions, DNA sequences are subdivided into bins of similar sizes, regardless of the total length of a consecutive captured stretch, defining on-target intervals. This operation normalizes the signal from such regions (usually exonic sequences) and allows the identification of both small and large CNVs. It also allows the recognition of events that affect captured sequences only in part (e.g., intra-exonic CNVs, Figure 5), which are usually difficult to detect. With respect to non-targeted regions, the same process is applied, although the size of the bins is set to be larger. Such a procedure allows harvesting enough reads for reliable CNV calls and, at the same time, to normalize them with respect to data from captured regions. Most importantly, OFF-PEAK operates a “padding” procedure on off-target regions that are immediately proximal to captured sequences. This process prevents the true signal from off-target regions to be masked (overscored) by the high number of reads covering such non-exonic flanking sequences, which are present in NGS data by virtue of their partial matches with capturing probes at pre-NGS stages. Moreover, unlike other tools, our software is particularly effective at identifying CNVs involving isolated small exons (or captured regions). This is possible because of an OFF-PEAK-specific process, which artificially extends the actual captured sequence for such small regions on both their 5′ and 3′ ends and provides increased sensitivity in coverage detection. All these processes, based on the analysis of off-target regions, also permit restricting the number and size of candidate regions harboring CNV-induced breakpoints and facilitate their identification by molecular biology techniques (e.g., by PCR), even in the absence of split-reads.

When tested on data from 96 cancer samples, OFF-PEAK had the highest performance, detecting all of the 68 MLPA-validated CNVs (100% sensitivity). Since CNVs in this dataset involve on-target regions, the advantage of OFF-PEAK over other tools was mainly due to the use of LOO-PCA, rather than the scoring of off-target reads. Some tools, such as ExomeDepth, SavvyCNV, and GATK gCNV, showed high sensitivity as well, although lower than that displayed by OFF-PEAK. In addition, OFF-PEAK also achieved the highest specificity of all tools considered.

Similarly, when tested on WES data from 130 individuals with hereditary retinal diseases, OFF-PEAK was the only tool that could identify all 37 experimentally validated CNVs affecting genes linked to such conditions. In this case, however, such a high performance could be attributed to the specific use of the information contained in off-target reads. This is evidenced, for instance, by the fact that most CNVs detected solely by our tool were located in untargeted regions. In terms of specificity, OFF-PEAK had a similar performance with respect to the other software, identifying likely causative CNVs in 32 out of 130 affected individuals (24.6%). Conversely, OFF-PEAK-HQ displayed the highest specificity (79.5%), with only a limited reduction in sensitivity with respect to OFF-PEAK (−5.4%). This high specificity is likely due to a more stringent filtering of CNVs, based on various metrics, since this is the only difference between OFF-PEAK and OFF-PEAK-HQ.

Like all the tools evaluated in this study, OFF-PEAK does not use information deriving from split reads to refine breakpoints of CNVs, which is limiting its capacity to identify precise chromosomal junctions. However, because of the paucity of split reads that are normally present in targeted sequencing data, we decided not to use such information, also considering that the presence of split reads can be individually assessed by using dedicated software (e.g., IGV)²² on the chromosomal regions identified by OFF-PEAK. In addition, even by using coverage information only, our tool demonstrated elevated performances in detecting true CNVs.

In summary, our tests showed that specific strong points of OFF-PEAK are related to the identification of types of events that are difficult to detect by other in silico methods, such as heterozygous CNVs, single-exon CNVs, intraexonic events, and CNVs occurring in non-targeted regions of the genome.

CNV rearrangements currently represent one of the most common yet elusive types of pathogenic genotypes in medical and cancer genetics, especially when mainstream sequencing procedures such as WES and targeted NGS are used. By using a specific denoising algorithm, a tailored scoring of different genomic regions and, most importantly, by exploiting the information contained in off-target NGS reads, we created a software that can analyze data from such experiments to detect the presence of small to very large rearrangements with high performance. Our hope is that OFF-PEAK will contribute to a more robust and sensitive detection of pathogenic CNVs, helping molecular diagnosis and basic genetic research alike.

Data and code availability

The OFF-PEAK code is available at https://github.com/mquinodo/OFF-PEAK. The code used for the development of OFF-PEAK and the testing of other tools is available at https://github.com/mquinodo/OFF-PEAK-publication.

Acknowledgments

This work was supported by the Swiss National Science Foundation (grant #176097 to C.R.) and by the Swiss RetinAward (to M.Q.). The authors would like to thank the people with IRDs and their families for their participation in this study. The authors are also grateful to Prof. Nazneen Rahman at The Institute of Cancer Research in London for providing access to the ICR96 data, to Mendurim Rashiti and Marc Perea for IT support, to Beryl Macé for suggestions, and to Sitta Föhr for her careful revision of this manuscript.

Declaration of interests

The authors declare that there is no conflict of interest.

Published: March 25, 2024

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2024.03.001.

Web resources

ClinVar, https://www.ncbi.nlm.nih.gov/clinvar
GenBank, https://www.ncbi.nlm.nih.gov/genbank/
gnomAD, https://gnomad.broadinstitute.org
UCSC, http://genome.ucsc.edu/cgi-bin/hgTables

Supplemental information

Document S1. Figures S1–S7 and supplemental methods

mmc1.pdf^{(4.4MB, pdf)}

Table S1. CNVs detected by all tools on the ICR96 data

mmc2.xlsx^{(527.3KB, xlsx)}

Table S2. Performance of the tested tools on both the ICR96 dataset and the IRD cohort

mmc3.xlsx^{(17.9KB, xlsx)}

Table S3. Samples from the IRD cohort

mmc4.xlsx^{(18.6KB, xlsx)}

Table S4. Disease genes for the IRD cohort, and inheritance model considered

mmc5.xlsx^{(14KB, xlsx)}

Table S5. Rare CNVs detected by all tested tools in the IRD cohort

mmc6.xlsx^{(73.9KB, xlsx)}

Table S6. List of all validated CNVs in the retinal disease cohort (n=37), along with their features

mmc7.xlsx^{(14.5KB, xlsx)}

Table S7. OFF-PEAK output for relevant CNVs from the retinal disease cohort

mmc8.xlsx^{(27.5KB, xlsx)}

Table S8. Characteristics of individual CNVs from the IRD cohort, and their detection by various tools

mmc9.xlsx^{(17.2KB, xlsx)}

Table S9. Details on SNVs and indels found in compound heterozygosity with CNVs in the retinal disease cohort

mmc10.xlsx^{(15.5KB, xlsx)}

Table S10. OFF-PEAK output for a common heterozygous deletion in the large IRD batch

mmc11.xlsx^{(12.3KB, xlsx)}

Table S11. Primers used for CNV validation in the IRD cohort

mmc12.xlsx^{(13.5KB, xlsx)}

Document S2. Article plus supplemental information

mmc13.pdf^{(7.3MB, pdf)}

References

1.Töpf A., Johnson K., Bates A., Phillips L., Chao K.R., England E.M., Laricchia K.M., Mullen T., Valkanas E., Xu L., et al. Sequential targeted exome sequencing of 1001 patients affected by unexplained limb-girdle weakness. Genet. Med. 2020;22:1478–1488. doi: 10.1038/s41436-020-0840-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Perea-Romero I., Gordo G., Iancu I.F., Del Pozo-Valero M., Almoguera B., Blanco-Kelly F., Carreño E., Jimenez-Rolando B., Lopez-Rodriguez R., Lorda-Sanchez I., et al. Genetic landscape of 6089 inherited retinal dystrophies affected cases in Spain and their therapeutic and extended epidemiological implications. Sci. Rep. 2021;11:1526. doi: 10.1038/s41598-021-81093-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Bae J.S., Kim N.K.D., Lee C., Kim S.C., Lee H.R., Song H.R., Park K.B., Kim H.W., Lee S.H., Kim H.Y., et al. Comprehensive genetic exploration of skeletal dysplasia using targeted exome sequencing. Genet. Med. 2016;18:563–569. doi: 10.1038/gim.2015.129. [DOI] [PubMed] [Google Scholar]
4.Retterer K., Scuffins J., Schmidt D., Lewis R., Pineda-Alvarez D., Stafford A., Schmidt L., Warren S., Gibellini F., Kondakova A., et al. Assessing copy number from exome sequencing and exome array CGH based on CNV spectrum in a large clinical cohort. Genet. Med. 2015;17:623–629. doi: 10.1038/gim.2014.160. [DOI] [PubMed] [Google Scholar]
5.Gambin T., Akdemir Z.C., Yuan B., Gu S., Chiang T., Carvalho C.M.B., Shaw C., Jhangiani S., Boone P.M., Eldomery M.K., et al. Homozygous and hemizygous CNV detection from exome sequencing data in a Mendelian disease cohort. Nucleic Acids Res. 2017;45:1633–1648. doi: 10.1093/nar/gkw1237. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Pfundt R., Del Rosario M., Vissers L.E.L.M., Kwint M.P., Janssen I.M., de Leeuw N., Yntema H.G., Nelen M.R., Lugtenberg D., Kamsteeg E.J., et al. Detection of clinically relevant copy-number variants by exome sequencing in a large cohort of genetic disorders. Genet. Med. 2017;19:667–675. doi: 10.1038/gim.2016.163. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Hills A., Ahn J.W., Donaghue C., Thomas H., Mann K., Ogilvie C.M. MLPA for confirmation of array CGH results and determination of inheritance. Mol. Cytogenet. 2010;3:19. doi: 10.1186/1755-8166-3-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Zhao M., Wang Q., Wang Q., Jia P., Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinf. 2013;14(Suppl 11):S1. doi: 10.1186/1471-2105-14-S11-S1. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Zare F., Dow M., Monteleone N., Hosny A., Nabavi S. An evaluation of copy number variation detection tools for cancer using whole exome sequencing data. BMC Bioinf. 2017;18:286. doi: 10.1186/s12859-017-1705-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Gabrielaite M., Torp M.H., Rasmussen M.S., Andreu-Sánchez S., Vieira F.G., Pedersen C.B., Kinalis S., Madsen M.B., Kodama M., Demircan G.S., et al. A Comparison of Tools for Copy-Number Variation Detection in Germline Whole Exome and Whole Genome Sequencing Data. Cancers. 2021;13 doi: 10.3390/cancers13246283. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Gordeeva V., Sharova E., Babalyan K., Sultanov R., Govorun V.M., Arapidi G. Benchmarking germline CNV calling tools from exome sequencing data. Sci. Rep. 2021;11 doi: 10.1038/s41598-021-93878-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Samuels D.C., Han L., Li J., Quanghu S., Clark T.A., Shyr Y., Guo Y. Finding the lost treasures in exome sequencing data. Trends Genet. 2013;29:593–599. doi: 10.1016/j.tig.2013.07.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Royer-Bertrand B., Castillo-Taucher S., Moreno-Salinas R., Cho T.J., Chae J.H., Choi M., Kim O.H., Dikoglu E., Campos-Xavier B., Girardi E., et al. Mutations in the heat-shock protein A9 (HSPA9) gene cause the EVEN-PLUS syndrome of congenital malformations and skeletal dysplasia. Sci. Rep. 2015;5 doi: 10.1038/srep17154. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Wang K., Li M., Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164. doi: 10.1093/nar/gkq603. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Pedersen B.S., Quinlan A.R. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics. 2018;34:867–868. doi: 10.1093/bioinformatics/btx699. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Davis T.L. 2022. optparse: Command Line Option Parser. [Google Scholar]
17.Warnes G.R., Bolker B., Bonebakker L., Gentleman R., Huber W., Liaw A., Lumley T., Maechler M., Magnusson A., Moeller S., et al. 2022. gplots: Various R Programming Tools for Plotting Data. [Google Scholar]
18.Plagnol V., Curtis J., Epstein M., Mok K.Y., Stebbings E., Grigoriadou S., Wood N.W., Hambleton S., Burns S.O., Thrasher A.J., et al. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling. Bioinformatics. 2012;28:2747–2754. doi: 10.1093/bioinformatics/bts526. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Robin X., Turck N., Hainard A., Tiberti N., Lisacek F., Sanchez J.C., Müller M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinf. 2011;12:77. doi: 10.1186/1471-2105-12-77. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Tuszynski J. 2021. caTools: Moving Window Statistics, GIF, Base64, ROC AUC, etc. [Google Scholar]
21.Geoffroy V., Herenger Y., Kress A., Stoetzel C., Piton A., Dollfus H., Muller J. AnnotSV: an integrated tool for structural variations annotation. Bioinformatics. 2018;34:3572–3574. doi: 10.1093/bioinformatics/bty304. [DOI] [PubMed] [Google Scholar]
22.Robinson J.T., Thorvaldsdóttir H., Winckler W., Guttman M., Lander E.S., Getz G., Mesirov J.P. Integrative genomics viewer. Nat. Biotechnol. 2011;29:24–26. doi: 10.1038/nbt.1754. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Mahamdallie S., Ruark E., Yost S., Ramsay E., Uddin I., Wylie H., Elliott A., Strydom A., Renwick A., Seal S., Rahman N. The ICR96 exon CNV validation series: a resource for orthogonal assessment of exon CNV calling in NGS data. Wellcome Open Res. 2017;2:35. doi: 10.12688/wellcomeopenres.11689.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Klambauer G., Schwarzbauer K., Mayr A., Clevert D.A., Mitterecker A., Bodenhofer U., Hochreiter S. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 2012;40:e69. doi: 10.1093/nar/gks003. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Talevich E., Shain A.H., Botton T., Bastian B.C. CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. PLoS Comput. Biol. 2016;12 doi: 10.1371/journal.pcbi.1004873. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Jiang Y., Wang R., Urrutia E., Anastopoulos I.N., Nathanson K.L., Zhang N.R. CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing. Genome Biol. 2018;19:202. doi: 10.1186/s13059-018-1578-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Krumm N., Sudmant P.H., Ko A., O'Roak B.J., Malig M., Coe B.P., NHLBI Exome Sequencing Project. Quinlan A.R., Nickerson D.A., Eichler E.E. Copy number variation detection and genotyping from exome sequence data. Genome Res. 2012;22:1525–1532. doi: 10.1101/gr.138115.112. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Boeva V., Popova T., Bleakley K., Chiche P., Cappo J., Schleiermacher G., Janoueix-Lerosey I., Delattre O., Barillot E. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics. 2012;28:423–425. doi: 10.1093/bioinformatics/btr670. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Laver T.W., De Franco E., Johnson M.B., Patel K.A., Ellard S., Weedon M.N., Flanagan S.E., Wakeling M.N. SavvyCNV: Genome-wide CNV calling from off-target reads. PLoS Comput. Biol. 2022;18 doi: 10.1371/journal.pcbi.1009940. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Henderson R.H. Inherited retinal dystrophies. Paediatr. Child Health. 2020;30:19–27. [Google Scholar]
32.Richards S., Aziz N., Bale S., Bick D., Das S., Gastier-Foster J., Grody W.W., Hegde M., Lyon E., Spector E., et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 2015;17:405–424. doi: 10.1038/gim.2015.30. [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Riggs E.R., Andersen E.F., Cherry A.M., Kantarci S., Kearney H., Patel A., Raca G., Ritter D.I., South S.T., Thorland E.C., et al. Technical standards for the interpretation and reporting of constitutional copy-number variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen) Genet. Med. 2020;22:245–257. doi: 10.1038/s41436-019-0686-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Nikopoulos K., Cisarova K., Quinodoz M., Koskiniemi-Kuendig H., Miyake N., Farinelli P., Rehman A.U., Khan M.I., Prunotto A., Akiyama M., et al. A frequent variant in the Japanese population determines quasi-Mendelian inheritance of rare retinal ciliopathy. Nat. Commun. 2019;10:2884. doi: 10.1038/s41467-019-10746-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Torene R.I., Galens K., Liu S., Arvai K., Borroto C., Scuffins J., Zhang Z., Friedman B., Sroka H., Heeley J., et al. Mobile element insertion detection in 89,874 clinical exomes. Genet. Med. 2020;22:974–978. doi: 10.1038/s41436-020-0749-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Bellos E., Coin L.J.M. cnvOffSeq: detecting intergenic copy number variation using off-target exome sequencing data. Bioinformatics. 2014;30:i639–i645. doi: 10.1093/bioinformatics/btu475. [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Kuilman T., Velds A., Kemper K., Ranzani M., Bombardelli L., Hoogstraat M., Nevedomskaya E., Xu G., de Ruiter J., Lolkema M.P., et al. CopywriteR: DNA copy number detection from off-target sequence data. Genome Biol. 2015;16:49. doi: 10.1186/s13059-015-0617-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Document S1. Figures S1–S7 and supplemental methods

mmc1.pdf^{(4.4MB, pdf)}

Table S1. CNVs detected by all tools on the ICR96 data

mmc2.xlsx^{(527.3KB, xlsx)}

Table S2. Performance of the tested tools on both the ICR96 dataset and the IRD cohort

mmc3.xlsx^{(17.9KB, xlsx)}

Table S3. Samples from the IRD cohort

mmc4.xlsx^{(18.6KB, xlsx)}

Table S4. Disease genes for the IRD cohort, and inheritance model considered

mmc5.xlsx^{(14KB, xlsx)}

Table S5. Rare CNVs detected by all tested tools in the IRD cohort

mmc6.xlsx^{(73.9KB, xlsx)}

Table S6. List of all validated CNVs in the retinal disease cohort (n=37), along with their features

mmc7.xlsx^{(14.5KB, xlsx)}

Table S7. OFF-PEAK output for relevant CNVs from the retinal disease cohort

mmc8.xlsx^{(27.5KB, xlsx)}

Table S8. Characteristics of individual CNVs from the IRD cohort, and their detection by various tools

mmc9.xlsx^{(17.2KB, xlsx)}

Table S9. Details on SNVs and indels found in compound heterozygosity with CNVs in the retinal disease cohort

mmc10.xlsx^{(15.5KB, xlsx)}

Table S10. OFF-PEAK output for a common heterozygous deletion in the large IRD batch

mmc11.xlsx^{(12.3KB, xlsx)}

Table S11. Primers used for CNV validation in the IRD cohort

mmc12.xlsx^{(13.5KB, xlsx)}

Document S2. Article plus supplemental information

mmc13.pdf^{(7.3MB, pdf)}

Data Availability Statement

[bib1] 1.Töpf A., Johnson K., Bates A., Phillips L., Chao K.R., England E.M., Laricchia K.M., Mullen T., Valkanas E., Xu L., et al. Sequential targeted exome sequencing of 1001 patients affected by unexplained limb-girdle weakness. Genet. Med. 2020;22:1478–1488. doi: 10.1038/s41436-020-0840-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] 2.Perea-Romero I., Gordo G., Iancu I.F., Del Pozo-Valero M., Almoguera B., Blanco-Kelly F., Carreño E., Jimenez-Rolando B., Lopez-Rodriguez R., Lorda-Sanchez I., et al. Genetic landscape of 6089 inherited retinal dystrophies affected cases in Spain and their therapeutic and extended epidemiological implications. Sci. Rep. 2021;11:1526. doi: 10.1038/s41598-021-81093-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] 3.Bae J.S., Kim N.K.D., Lee C., Kim S.C., Lee H.R., Song H.R., Park K.B., Kim H.W., Lee S.H., Kim H.Y., et al. Comprehensive genetic exploration of skeletal dysplasia using targeted exome sequencing. Genet. Med. 2016;18:563–569. doi: 10.1038/gim.2015.129. [DOI] [PubMed] [Google Scholar]

[bib4] 4.Retterer K., Scuffins J., Schmidt D., Lewis R., Pineda-Alvarez D., Stafford A., Schmidt L., Warren S., Gibellini F., Kondakova A., et al. Assessing copy number from exome sequencing and exome array CGH based on CNV spectrum in a large clinical cohort. Genet. Med. 2015;17:623–629. doi: 10.1038/gim.2014.160. [DOI] [PubMed] [Google Scholar]

[bib5] 5.Gambin T., Akdemir Z.C., Yuan B., Gu S., Chiang T., Carvalho C.M.B., Shaw C., Jhangiani S., Boone P.M., Eldomery M.K., et al. Homozygous and hemizygous CNV detection from exome sequencing data in a Mendelian disease cohort. Nucleic Acids Res. 2017;45:1633–1648. doi: 10.1093/nar/gkw1237. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] 6.Pfundt R., Del Rosario M., Vissers L.E.L.M., Kwint M.P., Janssen I.M., de Leeuw N., Yntema H.G., Nelen M.R., Lugtenberg D., Kamsteeg E.J., et al. Detection of clinically relevant copy-number variants by exome sequencing in a large cohort of genetic disorders. Genet. Med. 2017;19:667–675. doi: 10.1038/gim.2016.163. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] 7.Hills A., Ahn J.W., Donaghue C., Thomas H., Mann K., Ogilvie C.M. MLPA for confirmation of array CGH results and determination of inheritance. Mol. Cytogenet. 2010;3:19. doi: 10.1186/1755-8166-3-19. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] 8.Zhao M., Wang Q., Wang Q., Jia P., Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinf. 2013;14(Suppl 11):S1. doi: 10.1186/1471-2105-14-S11-S1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib9] 9.Zare F., Dow M., Monteleone N., Hosny A., Nabavi S. An evaluation of copy number variation detection tools for cancer using whole exome sequencing data. BMC Bioinf. 2017;18:286. doi: 10.1186/s12859-017-1705-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] 10.Gabrielaite M., Torp M.H., Rasmussen M.S., Andreu-Sánchez S., Vieira F.G., Pedersen C.B., Kinalis S., Madsen M.B., Kodama M., Demircan G.S., et al. A Comparison of Tools for Copy-Number Variation Detection in Germline Whole Exome and Whole Genome Sequencing Data. Cancers. 2021;13 doi: 10.3390/cancers13246283. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] 11.Gordeeva V., Sharova E., Babalyan K., Sultanov R., Govorun V.M., Arapidi G. Benchmarking germline CNV calling tools from exome sequencing data. Sci. Rep. 2021;11 doi: 10.1038/s41598-021-93878-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] 12.Samuels D.C., Han L., Li J., Quanghu S., Clark T.A., Shyr Y., Guo Y. Finding the lost treasures in exome sequencing data. Trends Genet. 2013;29:593–599. doi: 10.1016/j.tig.2013.07.006. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] 13.Royer-Bertrand B., Castillo-Taucher S., Moreno-Salinas R., Cho T.J., Chae J.H., Choi M., Kim O.H., Dikoglu E., Campos-Xavier B., Girardi E., et al. Mutations in the heat-shock protein A9 (HSPA9) gene cause the EVEN-PLUS syndrome of congenital malformations and skeletal dysplasia. Sci. Rep. 2015;5 doi: 10.1038/srep17154. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib14] 14.Wang K., Li M., Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164. doi: 10.1093/nar/gkq603. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] 15.Pedersen B.S., Quinlan A.R. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics. 2018;34:867–868. doi: 10.1093/bioinformatics/btx699. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] 16.Davis T.L. 2022. optparse: Command Line Option Parser. [Google Scholar]

[bib17] 17.Warnes G.R., Bolker B., Bonebakker L., Gentleman R., Huber W., Liaw A., Lumley T., Maechler M., Magnusson A., Moeller S., et al. 2022. gplots: Various R Programming Tools for Plotting Data. [Google Scholar]

[bib18] 18.Plagnol V., Curtis J., Epstein M., Mok K.Y., Stebbings E., Grigoriadou S., Wood N.W., Hambleton S., Burns S.O., Thrasher A.J., et al. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling. Bioinformatics. 2012;28:2747–2754. doi: 10.1093/bioinformatics/bts526. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] 19.Robin X., Turck N., Hainard A., Tiberti N., Lisacek F., Sanchez J.C., Müller M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinf. 2011;12:77. doi: 10.1186/1471-2105-12-77. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib20] 20.Tuszynski J. 2021. caTools: Moving Window Statistics, GIF, Base64, ROC AUC, etc. [Google Scholar]

[bib21] 21.Geoffroy V., Herenger Y., Kress A., Stoetzel C., Piton A., Dollfus H., Muller J. AnnotSV: an integrated tool for structural variations annotation. Bioinformatics. 2018;34:3572–3574. doi: 10.1093/bioinformatics/bty304. [DOI] [PubMed] [Google Scholar]

[bib22] 22.Robinson J.T., Thorvaldsdóttir H., Winckler W., Guttman M., Lander E.S., Getz G., Mesirov J.P. Integrative genomics viewer. Nat. Biotechnol. 2011;29:24–26. doi: 10.1038/nbt.1754. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib23] 23.Mahamdallie S., Ruark E., Yost S., Ramsay E., Uddin I., Wylie H., Elliott A., Strydom A., Renwick A., Seal S., Rahman N. The ICR96 exon CNV validation series: a resource for orthogonal assessment of exon CNV calling in NGS data. Wellcome Open Res. 2017;2:35. doi: 10.12688/wellcomeopenres.11689.1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] 24.Klambauer G., Schwarzbauer K., Mayr A., Clevert D.A., Mitterecker A., Bodenhofer U., Hochreiter S. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 2012;40:e69. doi: 10.1093/nar/gks003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib25] 25.Talevich E., Shain A.H., Botton T., Bastian B.C. CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. PLoS Comput. Biol. 2016;12 doi: 10.1371/journal.pcbi.1004873. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib26] 26.Jiang Y., Wang R., Urrutia E., Anastopoulos I.N., Nathanson K.L., Zhang N.R. CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing. Genome Biol. 2018;19:202. doi: 10.1186/s13059-018-1578-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib27] 27.Krumm N., Sudmant P.H., Ko A., O'Roak B.J., Malig M., Coe B.P., NHLBI Exome Sequencing Project. Quinlan A.R., Nickerson D.A., Eichler E.E. Copy number variation detection and genotyping from exome sequence data. Genome Res. 2012;22:1525–1532. doi: 10.1101/gr.138115.112. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib28] 28.Boeva V., Popova T., Bleakley K., Chiche P., Cappo J., Schleiermacher G., Janoueix-Lerosey I., Delattre O., Barillot E. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics. 2012;28:423–425. doi: 10.1093/bioinformatics/btr670. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib29] 29.McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib30] 30.Laver T.W., De Franco E., Johnson M.B., Patel K.A., Ellard S., Weedon M.N., Flanagan S.E., Wakeling M.N. SavvyCNV: Genome-wide CNV calling from off-target reads. PLoS Comput. Biol. 2022;18 doi: 10.1371/journal.pcbi.1009940. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib31] 31.Henderson R.H. Inherited retinal dystrophies. Paediatr. Child Health. 2020;30:19–27. [Google Scholar]

[bib32] 32.Richards S., Aziz N., Bale S., Bick D., Das S., Gastier-Foster J., Grody W.W., Hegde M., Lyon E., Spector E., et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 2015;17:405–424. doi: 10.1038/gim.2015.30. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib33] 33.Riggs E.R., Andersen E.F., Cherry A.M., Kantarci S., Kearney H., Patel A., Raca G., Ritter D.I., South S.T., Thorland E.C., et al. Technical standards for the interpretation and reporting of constitutional copy-number variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen) Genet. Med. 2020;22:245–257. doi: 10.1038/s41436-019-0686-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib34] 34.Nikopoulos K., Cisarova K., Quinodoz M., Koskiniemi-Kuendig H., Miyake N., Farinelli P., Rehman A.U., Khan M.I., Prunotto A., Akiyama M., et al. A frequent variant in the Japanese population determines quasi-Mendelian inheritance of rare retinal ciliopathy. Nat. Commun. 2019;10:2884. doi: 10.1038/s41467-019-10746-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib35] 35.Torene R.I., Galens K., Liu S., Arvai K., Borroto C., Scuffins J., Zhang Z., Friedman B., Sroka H., Heeley J., et al. Mobile element insertion detection in 89,874 clinical exomes. Genet. Med. 2020;22:974–978. doi: 10.1038/s41436-020-0749-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib36] 36.Bellos E., Coin L.J.M. cnvOffSeq: detecting intergenic copy number variation using off-target exome sequencing data. Bioinformatics. 2014;30:i639–i645. doi: 10.1093/bioinformatics/btu475. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib37] 37.Kuilman T., Velds A., Kemper K., Ranzani M., Bombardelli L., Hoogstraat M., Nevedomskaya E., Xu G., de Ruiter J., Lolkema M.P., et al. CopywriteR: DNA copy number detection from off-target sequence data. Genome Biol. 2015;16:49. doi: 10.1186/s13059-015-0617-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Detection of elusive DNA copy-number variations in hereditary disease and cancer through the use of noncoding and off-target sequencing reads

Mathieu Quinodoz

Karolina Kaminska

Francesca Cancellieri

Ji Hoon Han

Virginie G Peter

Elifnaz Celik

Lucas Janeschitz-Kriegl

Nils Schärer

Daniela Hauenstein

Bence György

Giacomo Calzetti

Vincent Hahaut

Sónia Custódio

Ana Cristina Sousa

Yuko Wada

Yusuke Murakami

Almudena Avila Fernández

Cristina Rodilla Hernández

Pablo Minguez

Carmen Ayuso

Koji M Nishiguchi

Cristina Santos

Luisa Coutinho Santos

Viet H Tran

Veronika Vaclavik

Hendrik PN Scholl

Carlo Rivolta

Summary

Introduction

Material and methods

Samples from human subjects

Whole-exome sequencing procedures

Scoring of on- and off-target reads

Mapping and variant calling

Processing of on- and off-target intervals by OFF-PEAK

Determination of coverage of on- and off-target regions by OFF-PEAK

CNV detection by OFF-PEAK

Results

Scoring of off-target reads in WES data

Figure 1.

Implementation of OFF-PEAK

Figure 2.

Figure 3.

Validation of the tool on the ICR96 dataset

Figure 4.

Validation on WES from individuals with rare diseases

Figure 5.

Detailed performance results

Clinical classification of the CNVs detected and further analyses of IRD data

Discussion

Data and code availability

Acknowledgments

Declaration of interests

Footnotes

Web resources

Supplemental information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases