Skip to main content
PLOS ONE logoLink to PLOS ONE
. 2020 Jan 3;15(1):e0227427. doi: 10.1371/journal.pone.0227427

Sequencing artifacts derived from a library preparation method using enzymatic fragmentation

Norio Tanaka 1,#, Akihisa Takahara 1,2,#, Taichi Hagio 1,2, Rika Nishiko 1, Junko Kanayama 1, Osamu Gotoh 1, Seiichi Mori 1,*
Editor: Ruslan Kalendar3
PMCID: PMC6941819  PMID: 31899787

Abstract

DNA fragmentation is a fundamental step during library preparation in hybridization capture-based, short-read sequencing. Ultra-sonication has been used thus far to prepare DNA of an appropriate size, but this method is associated with a considerable loss of DNA sample. More recently, studies have employed library preparation methods that rely on enzymatic fragmentation with DNA endonucleases to minimize DNA loss, particularly in nano-quantity samples. Yet, despite their wide use, the effect of enzymatic fragmentation on the resultant sequences has not been carefully assessed. Here, we used pairwise comparisons of somatic variants of the same tumor DNA samples prepared using ultrasonic and enzymatic fragmentation methods. Our analysis revealed a substantially larger number of recurrent artifactual SNVs/indels in endonuclease-treated libraries as compared with those created through ultrasonication. These artifacts were marked by palindromic structure in the genomic context, positional bias in sequenced reads, and multi-nucleotide substitutions. Taking advantage of these distinctive features, we developed a filtering algorithm to distinguish genuine somatic mutations from artifactual noise with high specificity and sensitivity. Noise cancelling recovered the composition of the mutational signatures in the tumor samples. Thus, we provide an informatics algorithm as a solution to the sequencing errors produced as a consequence of endonuclease-mediated fragmentation, highlighted for the first time in this study.

Introduction

Next-generation sequencing (NGS) technologies have facilitated the delivery of precision medical care to patients with cancer. Short-read sequencing technology has been widely exploited for this purpose, encompassing amplicon- or hybridization capture-based library preparations [1]. This diagnostic strategy relies on accurate sequencing and interpretation to provide patients with the right clinical decision [1, 2]. Genome sequencing using hybridization capture comprises multiple steps, including tissue processing, tissue storage, DNA isolation, DNA fragmentation, probe hybridization, library amplification, sequencing, and informatics analysis [25]. Sequencing errors can be introduced at any of these steps, and nucleotides can be further modified through oxidation during tissue processing, tissue storage, DNA isolation, and DNA fragmentation [24, 6]. Nucleotide incorporation errors can in turn create polymerase reaction biases, affect precise library amplification, and generate sequencing noise [24]. Although substantial efforts have been made to minimize such sequencing noise experimentally, stochastic errors remain persistent [24].

Ultra-sonication has long been a standard method for DNA fragmentation during hybridization capture-based, short-read sequencing. Ultrasonication creates even cuts in the DNA across the entire genome, thereby providing a simple means of controlling fragment size in a non-biased manner [2, 3]. However, the physical scattering of DNA solution during the process often leads to a loss of DNA sample, which can be critical when the sample amount is limited to nano- or picogram quantities, such as found with biopsied tissue fragments. Several commercial library preparation kits are available, including the HyperPlus (KAPA Biosystems), SureSelect QXT (Agilent Technologies), Fragmentase (New England Biolabs), and Nextera Tagmentation (Illumina) kits, each of which uses endonucleases or transposases for DNA fragmentation. Although these kits minimize DNA loss, it remains largely unknown what degree of sequencing errors are caused by the enzymatic fragmentation process.

In the current study, we identified numerous artifactual SNVs/indels among libraries constructed using the HyperPlus kit for DNA fragmentation. These sequencing errors—characterized by variants located at the center of palindromic structures and near the 5’ or 3’ ends of the read, with multi-nucleotide substitutions—were deemed to have been introduced by the endonuclease treatment step and the following fill-in process for end repair, but not as a result of the whole sequencing process per se. Taking advantage of these noise properties, we developed an algorithm to efficiently distinguish sequencing errors from genuine mutations. This algorithm could be used in future studies to improve datasets that rely on enzymatic fragmentation using the same or a similar enzyme during library preparation.

Results

Distinct features of somatic SNVs/indels derived from different library preparation kits

In our sequencing facility, the SureSelect kit (Agilent, Santa Clara, California, United States) is the default protocol for library preparation for exome analysis and requires 200 ng of DNA as the starting amount. However, in some cases, the starting amount is less than 200 ng; this typically occurs when samples are extracted from small tissue fragments. For such samples, we use the HyperPlus kit (KAPA Biosystems, Cape Town, South Africa) for library preparation, which requires a minimum of only 20 ng of DNA. Typically, there tends to be sufficient amount of matched normal DNA for the standard processing, and consequently—although not ideal—it is often the case that somatic mutation calling occurs for tumor and matched normal samples prepared using different DNA fragmentation methods.

Using exome sequencing of tumor samples—with the exception of hypermutators—we will typically detect several tens or hundreds of somatic SNVs/indels per sample. However, in our experience, we noted that some tumor samples exhibited an extraordinarily large number of SNVs/indels, exceeding a few thousand, regardless of tissue origin, histological type, or method of tissue preservation. We also noted that these tumor samples were prepared using the HyperPlus kit and that the paired normal DNA samples were prepared with the SureSelect kit. Further inspection of data pertaining to 31 tumors (16 gastric, 13 lung and 2 rectal cancers) prepared using the HyperPlus kit revealed a higher median number of SNVs (median: 2,308, range: 1,119–3,996) and indels (median: 89, range: 33–437) as compared with data from tumor samples prepared using the SureSelect kit.

This obvious discrepancy prompted us to perform pairwise comparisons of sequenced reads from the same tumor DNA libraries prepared with the SureSelect kit versus those prepared with the HyperPlus kit (Fig 1). Six tumor tissues preserved as fresh-frozen samples were used for this analysis. For somatic calls, normal DNA samples were prepared using the SureSelect kit. In our comparisons of the exome sequencing process, we noted one major difference between the two library preparation kits: the SureSelect kit uses ultra-sonication for DNA fragmentation whereas the HyperPlus kit relies on endonuclease treatment. Quality assessment of the sequencing data from the six tumor samples generated by both methods showed no difference in the percentage of reads with Q30 values (the Phred score assigns a Q score of 30). However, the percentage of on-target reads, which is dependent on the library preparation method, differed between the SureSelect and HyperPlus kits (Table 1). Therefore, we obtained and analyzed somatic SNVs and indels from two combinations: tumor DNA–SureSelect with normal DNA–SureSelect (“Somatic SNV/indels with SureSelect”) and tumor DNA–HyperPlus with normal DNA–SureSelect (“Somatic SNVs/indels with HyperPlus”) (Fig 1).

Fig 1. Experimental procedure to prepare sequencing libraries using Agilent SureSelect or KAPA HyperPlus kits, and the analytical pipeline to call somatic SNVs and indels.

Fig 1

A major difference between the two kits is the use of ultra-sonication versus enzymatic treatment with an endonuclease for DNA fragmentation. The same tumor DNA samples were used for library preparation, and the SureSelect kit was used for all of the paired normal DNA in the somatic mutation detection.

Table 1. Quality assessment of sequencing.

Sample Study Cohort Number of Mapped Reads On-Target Reads (%) Reads ≥ Q30 (%)
HyperPlus SureSelect HyperPlus SureSelect HyperPlus SureSelect
A001 Lung Cancer 477,599,892 437,724,129 67.04% 74.85% 94.43 95.46
A004 512,922,172 436,510,390 55.35% 73.67% 93.68 95.42
A005 520,804,987 455,109,465 56.65% 74.93% 93.13 94.91
B012 496,356,981 433,005,359 57.48% 74.58% 92.39 95.39
C742 Rectal Cancer 198,466,488 359,482,010 61.26% 71.63% 93.99 95.93
C772 230,911,546 362,714,354 58.19% 71.67% 94.33 95.94

The features of the somatic SNV/indel call results derived from the SureSelect and HyperPlus treatments are shown in Fig 2. Despite starting with the same sample of DNA, the HyperPlus libraries resulted in 2.3- to 9.9-times more SNV/indel detections than the SureSelect libraries (Fig 2A). Importantly, most of SNVs/indels derived from the SureSelect treatment were nested within those from the HyperPlus libraries, but most of the SNVs/indels from the HyperPlus treatment were not also common to the SureSelect-treated libraries (Fig 3A). Given that the numbers of SNVs/indels from the SureSelect libraries for the six tumor samples were comparable with that described in previous literature [7, 8], we concluded that the HyperPlus libraries generated a substantial number of somatic SNVs/indels as non-biological sequencing artifacts.

Fig 2. Differences in the sequenced results using the HyperPlus kit versus the SureSelect kit.

Fig 2

For each sample, the same DNA was sequenced with the two different kits: HyperPlus (magenta) and SureSelect (green) kits. (A) Number of SNVs/indels. Bar plots show the number of SNVs and indels. (B) Frequency of SNVs per length of palindrome. The frequency of SNVs from all samples (n = 6) is shown as bar plots per odd length of palindrome, in which a somatic SNV was detected at the center. The number on the x-axis indicates the total length of the palindrome. Note that a length “zero” indicates a sequence that lacks a palindrome. (C) Mutational signature heatmap. The heatmap represents the proportions of the 30 COSMIC mutational signatures [9] computed from trinucleotide frequencies of nucleotide substitutions in each sample. The number above the heatmap indicates each of the 30 signatures computed according to the definitions in the COSMIC database. Sample IDs are shown on the left.

Fig 3. Features of SNV/indel noise caused by the HyperPlus kit.

Fig 3

(A) Venn diagram of somatic SNVs/indels detected in the HyperPlus- and SureSelect-treated samples. Magenta and green lines indicate HyperPlus and SureSelect treatments, respectively. The regions [a] (red) and [b] (blue) indicate somatic SNVs/indels detected in the HyperPlus-treated but not the SureSelect-treated samples (red), and those shared by both HyperPlus- and SureSelect-treated samples (blue), respectively. Note: features were extracted from BAM files from libraries prepared with the HyperPlus kit for tumor samples. (B) Histogram depicting recurrence in somatic SNV/indel detection across 28-sample libraries prepared with the HyperPlus kit. X- and y-axes indicate the number of recurrent detections of identical SNVs/indels and the number of SNV/indels, respectively. Red and blue colors indicate somatic SNVs/indels in [a] and [b] in (A), respectively. (C) Histogram for the distribution of SNV/indel position in the read. X- and y-axes indicate the positions of SNVs/indels within 5-bases and the number of reads, respectively. Left panel. Position of somatic SNVs/indels detected in [a] (red). Right panel. Positions of somatic SNVs/indels detected in [b] (blue). The number of wildtype nucleotide reads that mapped to the same genomic coordinate as the detected SNV/indel is indicated in grey. (D) Histogram for the ratio of the soft-clipped reads. X- and y-axes indicate the ratio of soft-clipped reads at 0.05 intervals and the number of SNV/indels, respectively.

Closer inspection of the data uncovered that many of these somatic SNVs were coincidently located at the center of palindromic sequences, herein designated as “SNV-centered palindromes” (SCPs). HyperPlus libraries also more frequently generated longer SCPs, whereas no SCP over 15 bases in length was detected among the SureSelect libraries (Fig 2B).

COSMIC mutational signature analyses [10] were performed to assess the overall pattern of the SNV artifacts generated by the HyperPlus kit; the scores are depicted as heatmaps in Fig 2C. Consistent with previous reports [11, 12], the SureSelect libraries produced tumor-type–associated signature profiles, with higher scores for signatures 1, 2, and 4 for lung cancers (A001, A004, A005 and B012) and signatures 1 and 6 for rectal cancers (C742 and C772). However, the HyperPlus libraries showed constant peaks associated with signatures 3, 4, and 22, even across tissues of origin, indicating that the HyperPlus treatment generated a specific set of nucleotide substitutions in the genome as the “HyperPlus signature” (Fig 2C). From these observations, we concluded that the HyperPlus treatment method led to non-biological sequencing artifacts, with biased nucleotide substitutions at characteristic palindromic parts of the genome.

Attributes of sequencing artifacts by HyperPlus

To obtain a more detailed characterization of these sequencing artifacts, we divided the somatic SNV/indel calls from the HyperPlus libraries into two categories of variants: HyperPlus-specific SNVs/indels (category [a]), and commonly detected SNVs/indels, which were found with both libraries (category [b]). This categorization was based on the premise that most of the SNVs/indels in category [a] were likely to be noise generated by the HyperPlus method of preparation, and that genuine somatic SNVs/indels would predominantly be found in category [b] (Fig 3A). We noted three distinctions in the SNVs/indels between the two categories: 1) Most SNVs/indels in category [a] were at least once detected across the pooled data from the HyperPlus libraries, but this was not the case for those in category [b] (Fig 3B); 2) SNVs/indels in category [a] were frequently located 10- to 15-bases away from the 5’ or 3’ edge of the read (defined as “positional bias”), whereas SNVs/indels in category [b] were more uniformly distributed (Fig 3C); 3) Reads with SNVs/indels from category [a] were more substantially soft-clipped than those from category [b] (50.8% and 5.0% on average), which implies multi-nucleotide substitutions at the 5’ or 3’ end of the read (Fig 3D).

Designing a filtering algorithm to remove sequencing artifacts derived from HyperPlus

Despite these shortcomings, enzymatic fragmentation for library preparation is often unavoidable, particularly when only nanograms of DNA sample is available. Taking advantage of the salient properties of the sequencing noise generated by the HyperPlus method, we sought to develop a filtering algorithm to remove these artifacts from somatic SNV/indel call results to optimize the sequencing data (Fig 4A). The algorithm comprised two filtering steps: First, we excluded recurrently detected SNVs/indels across the pooled HyperPlus data, unless the SNVs/indels were already registered in the COSMIC database. Second, we developed and utilized a predictive model to remove SNVs/indels that showed a positional bias in a read and/or those on frequently soft-clipped reads. Positional bias was quantified using the Kolmogorov-Smirnov (KS) test to compare variant and wildtype alleles. The extent of soft clipping was measured using the ratio of soft-clipped reads per total reads with SNVs/indels. Overall, the predictive model was based on logistic regression to classify the SNV/indel as noise or signal (Fig 4A).

Fig 4. Filtering process for the removal of somatic SNV/indel noise generated by treatment with the HyperPlus kit.

Fig 4

(A) Algorithm for the filtration of somatic SNV/indel noise in libraries prepared with the HyperPlus kit. (B) Relationship between soft clipping of a read and an SNV/indel position in the read. 2-D scatter plots present the relation between the ratio of soft-clipped reads and the p-value, as determined using the Kolmogorov-Smirnov (KS) test for positional bias of the SNV/indel in the read. Circles denote SNVs/indels detected using the HyperPlus kit only (red; category [a] in Fig 3A) and SNVs/indels detected using both the HyperPlus and SureSelect kits (blue; category [b] in Fig 3A). The black line indicates the threshold by logistic regression to distinguish SNVs/indels in category [a] from those in category [b] according to the generalized linear model.

2-D scatter plots show the relationship between the KS p-value for positional bias and the ratio of soft-clipped reads (Fig 4B). Whereas category [a] SNVs/indels (mostly sequencing artifacts) were characterized by lower KS p-values and/or a higher ratio of soft-clipped reads, SNVs/indels in category [b] (mostly genuine SNVs/indels) had higher KS p-values and a lower ratio of soft-clipped reads. A threshold was then estimated to distinguish SNVs/indels between the two categories using a generalized linear model with the logit link function. Using receiver operating characteristic (ROC) curve analysis for the six-sample training data, the final model was established and shown to be capable of distinguishing SNVs/indels between the two categories with a specificity of 0.914 and a sensitivity of 0.979 (Fig 4B).

Noise reduction in the training data

We next applied our noise filtering algorithm to the six-sample training data to assess how filtering affects the data derived using the HyperPlus and SureSelect treatments (Fig 5). The total numbers of SNVs/indels in category [a] (likely noises from HyperPlus treatment) and category [b] (likely genuine mutations) were 11,731 and 2,984, respectively. Of these, recurrently detected SNVs/indels across the in-house pooled data prepared with the HyperPlus kit reached 10,928 and 16, of which 389 (3.6%) and 11 (68.8%) were registered in the COSMIC database (ver. 82). Because these 389 and 11 SNVs/indels were considered probably genuine, they were returned to the filtering process. This left 1,192 and 2,979 SNVs/indels in categories [a] and [b], respectively.

Fig 5. Noise reduction in the six-sample training data.

Fig 5

We show changes in the features before and after noise filtration for each sample. (A) Venn diagram of somatic SNVs/indels in the HyperPlus- and SureSelect-treated samples after noise filtering. Magenta and green lines indicate HyperPlus and SureSelect treatments. Region [a] (red) indicates somatic SNVs/indels detected in the HyperPlus- but not the SureSelect-treated samples, whereas region [b] (blue) indicates those shared by both treated samples. The features shown in (A) and (B) were extracted from BAM files from the HyperPlus libraries. (B) Dot plots for changes in the SNV/indel number for each sample after noise reduction. (C) Frequency of SNVs per length of palindrome before and after filtering. Categories [a] and [b] are shown on the left and right panels. Faded and full shading indicate before and after filtration. The frequency of SNVs from all six samples is shown as bar plots per odd length of palindrome, in which a somatic SNV was detected at the center (SCP: SNV-centered palindrome). The number on the x-axis indicates the total length of the palindrome. A length “zero” indicates a sequence that lacks a palindrome. (D) Number of filtered palindromes in a histogram. The number of SCPs was counted per odd-number-length palindrome and shown. Gray and black colors indicate one time and recurrently detected palindromes across the samples, respectively. The reference allele sequence was used to count the SCPs. (E) Comparisons of several features in the sequenced results between the HyperPlus and SureSelect treatments after filtering the HyperPlus artifacts. Note that the noise filter was applied to the HyperPlus data but not to the SureSelect data. Left panel: Bar plots for number of SNVs/indels. Middle panel: Frequency of SNVs per length of palindrome. The frequency of SNVs from all samples (n = 6) is shown as bar plots per odd length of palindrome, in which a somatic SNV was detected at the center. The number on the x-axis indicates the total length of the palindrome. Note that a length “zero” indicates a sequence that lacks a palindrome. Right panel: Mutational signature heatmap. The heatmap represents the proportions of 30 COSMIC mutational signatures [9] computed from trinucleotide frequencies of nucleotide substitutions in each sample. The number above the heatmap indicates each signature computed according to the definitions in the COSMIC database. Sample IDs are shown on the left.

We then proceeded to the next step of the logistic regression based on positional bias and soft clipping. The predictive model classified 1,090 and 62 SNVs/indels as HyperPlus noise in categories [a] and [b], respectively. As anticipated, after filtering, most of the SNVs/indels in category [a] (99.1%; 11,628/11,731) were removed, but far fewer SNVs/indels were removed for category [b] (2.2%; 67/2,984) (Fig 5A and 5B). The resultant number of SNVs/indels in the HyperPlus data after filtering was similar to that in the unfiltered SureSelect data (Fig 5E, left panel). Filtering efficiently removed SNVs with SCPs longer than 13 nucleotides from category [a], whereas most of the SNVs in category [b] remained in the group (Fig 5C). Among 11,695 filtered SNVs, 3,407 SNVs were located at the center of odd-length palindromes (length ≥ 5 bases) and 66.6% of such SCPs were recurrently observed across the samples (Fig 5D). An inspection of the substrings of the palindromes revealed substantial diversity in the length and nucleotide sequence among the samples (371–655 [median 568] different palindromes per sample; Fig 5D and S1 Table). Furthermore, consistent with the presence of positional bias of the artifactual SNVs (Fig 3C), we found that, in 90.4% of SCPs, the entire palindrome sequence was nested within 30 bases from the edge of the read (S1 Fig).

Consequently, the frequency of SNVs per length of SCP among the HyperPlus data after filtering was normalized to that of the SureSelect data (Fig 5E middle panel). Similarly, filtering rendered the mutational signature profiles of the six tumors mostly indistinguishable between the HyperPlus and SureSelect treatments (Fig 5E right panel). These observations confirmed the validity of the filtering algorithm in the six-tumor training samples.

Noise reduction in test data

We next assessed the effects of the filtering algorithm on the remaining samples not used to develop the predictive model for filtering (Fig 6). For this, we used 39 tumor data derived from three independent genomic cohorts: a gastric cancer cohort 1 (n = 3), a lung cancer cohort (n = 9), and a gastric cancer cohort 2 (n = 27). Among the 39 samples, 25, 9, and 5 samples were sequenced with the KAPA HyperPlus, KAPA Hyper, and Agilent SureSelect library preparation kits, respectively. There were nine formalin-fixed paraffin-embedded (FFPE) and 30 fresh-frozen tumor samples. We show the number of SNVs and indels, and the pattern of the mutational signatures before and after filtering the data (Fig 6).

Fig 6. Noise reduction in the test data.

Fig 6

Changes in the features of the somatic variants detected before and after noise filtration for each sample. Libraries were prepared with KAPA HyperPlus (enzymatic fragmentation), KAPA Hyper (ultrasonic fragmentation), or Agilent SureSelect (ultrasonic fragmentation) kits. Note that the only difference between the HyperPlus and Hyper kits is that the Hyper kit uses ultra-sonication in the DNA fragmentation step, whereas endonuclease treatment is used in the HyperPlus kit. (A) Dot plots show changes in the numbers of SNVs (upper panels) and indels (lower panels) for each sample, with libraries prepared with HyperPlus (magenta), Hyper (orange), or SureSelect (green) kits. (B) Heatmap presentations of mutational signatures before and after noise filtration for each sample. Libraries were prepared with HyperPlus (magenta), Hyper (orange), or SureSelect (green) kits. The heatmap represents the proportions of the 30 COSMIC mutational signatures [9] computed from trinucleotide frequencies of nucleotide substitutions in each sample. The number above the heatmap indicates each of the 30 signatures computed according to the definitions in the COSMIC database. The study cohorts and tissue preservation types are indicated on the left.

The experimental procedure for the Hyper kit is similar to that for the HyperPlus kit, except that the Hyper kit uses ultra-sonication for DNA fragmentation instead of endonuclease treatment, similar to the SureSelect kit. Noteworthy, there was no significant difference in the number of SNVs/indels or the pattern of the mutational signature between the SureSelect-treated and Hyper-treated samples before filtering, suggesting that the Hyper kit per se does not produce the sequencing errors recorded for the HyperPlus kit.

Filtering substantially reduced the number of SNVs/indels in the HyperPlus data but had little effect on the Hyper and SureSelect data. The median (range) proportions of the remaining SNVs/indels were 10.8% (0.01%–46.9%), 85.2% (47.6%–98.8%), and 94.3% (86.5%–98.6%) for the HyperPlus, Hyper, and SureSelect datasets, respectively (Fig 6A). In the mutational signatures (Fig 6B), filtering removed cancer type-independent peaks for signatures 3, 4, and 22, with a more uniform distribution of signatures in the HyperPlus data. Subsequently, the noise reduction rendered more signals for signatures 1 and 6 for both gastric cancer cohorts (#1 and #2) and signatures 1, 4, 7, 13, 20, 22, and 24 for the lung cancer cohort (Fig 6B). On the other hand, the filtering algorithm did not change the profiles of the Hyper and SureSelect data (Fig 6B). These observations demonstrate that the artifacts introduced by sample preparation with the HyperPlus kit were removed selectively and efficiently by the noise reduction algorithm.

Reduced but persistent artifactual SNVs/indels by somatic mutation calling with normal–HyperPlus/ tumor–HyperPlus libraries

For a more controlled analysis, we sought to examine the error production rates with the HyperPlus and SureSelect kits using the same DNA fragmentation method in paired normal–tumor samples; e.g., HH combination (normal–HyperPlus versus tumor–HyperPlus) and SS combination (normal–SureSelect versus tumor–SureSelect). Normal and tumor samples from two rectal cancer cases (C742 and C772) were analyzed. We found a substantially reduced number of somatic SNVs/indels for the HH combination (190 and 168 for C742 and C772 tumors), which was almost similar to that found for the SS combination (194 and 179 for C742 and C772 tumors). Whereas the SS and HH combinations detected common and specific SNVs/indels, it is important to note that the HyperPlus-associated sequencing errors and the production of SNV-centered palindromes were persistent in the HH combination, as detected by the filtering algorithm (Table 2). These findings clearly indicate an experimental difficulty in being able to completely cancel the sequencing noise produced by the HyperPlus treatment, even after using the same fragmentation method in paired normal and tumor samples, and suggest the necessity of using informatics to filter the noise.

Table 2. Somatic mutation calling with the same DNA fragmentation method in pairs of normal and tumor DNA.

HH and SS Common HH Only SS Only
C742 C772 C742 C772 C742 C772
No. of Detected SNVs/Indels 165 140 25 28 29 39
No. of SNVs/indels Classified as HyperPlus Noise 3 5 17 21 6 2
No. of SNVs with SCP Length ≥15 0 0 2 6 0 0

Abbreviations:

HH: Normal DNA treated with HyperPlus kit/ Tumor DNA treated with HyperPlus kit

SS: Normal DNA treated with SureSelect kit/ Tumor DNA treated with SureSelect kit

SCP: SNV-Centered Palindrome

Discussion

The advent of NGS has meant that DNA analysis can be achieved in an efficient and highly sensitive high-throughput manner, offering a means to generate large amounts of data, decipher the subtle yet potentially informative distinctions between samples, and help to facilitate an understanding of genetic disease. In hybridization capture-based short-read sequencing, DNA fragmentation is a necessary step in the preparation of nucleic acids, as the quality of the sequencing is contingent on both the randomness of the DNA fragmentation as well as the overlap of the resultant fragments. Furthermore, because fragment size tends to differ across NGS platforms and sequencing runs, efficient control of DNA fragment size is imperative.

Ultra-sonication is one such method that can control DNA fragment size by evenly cleaving DNA throughout the entire genome and, as such, has remained a gold standard in sequencing. However, studies have reported that ultra-sonication produces sequencing noise in the form of oxidative nucleotide modifications, such as guanine to 8-oxo guanine (8-oxo-G) and cytosine deamination [2, 6, 13]. Nebulization is another commonly used mechanical method of DNA shearing. In this method, compressed nitrogen or air is forced into the DNA through a small hole, generating random sheared fragments with both overhangs and blunt ends.

In addition to these mechanical modes of fragmentation, several kits have been developed recently using enzymatic treatment to shear the DNA; albeit, it remains largely unknown whether sequencing errors occur with these alternative modes of cutting. One previous report showed that Fragmentase (New England Biolabs) causes more artifactual indels than sonication or nebulization; although, the number of indels generated by Fragmentase appeared to be within the two-fold range of that produced by the physical methods [14].

We consider the sequencing noise in the HyperPlus-treated samples to be derived as a consequence of endonuclease treatment. There are three major reasons for this proposition. First, we note positional biases in the mutations, with errors frequently located 10- to 15-bases from the 5’ or 3’ end of the read. This implies that the positions are associated with the cutting sites of the HyperPlus endonuclease. Second, the Hyper kit, manufactured by the same company as the HyperPlus kit, uses ultra-sonication for DNA fragmentation instead of endonuclease treatment, and did not produce the same amount of noise as that generated by the HyperPlus kit. Third, artifactual SNVs were often observed at the centers of palindromic sequences, suggestive of another bias in sequence recognition by the endonuclease(s) in the fragmentation step.

Previous studies have highlighted biases in the cleavage sites targeted by “non-specific” endonucleases, such as DNase I [1518]. The HyperPlus endonuclease—the type and composition have not been disclosed (KAPA Biosystems)—seemingly has preferential recognition sites for genomic DNA, and these include palindromic sequences. Importantly, the SCPs were not only substantially diverse in length and sequence but also 66.6% of SCPs recurrently appeared across a range of samples. In addition, in almost all (90.4%) of the SCPs, the entire palindromic sequence was nested within 30 bases from the edge of the read. Based on these properties, the HyperPlus endonuclease is considered to be an endonuclease(s), which prefer DNA sequences with diverse palindromic structure (over 1,000 palindromes with different lengths and sequences) without any specificity. Since a restriction enzyme is defined as an endonuclease with specific recognition site [19], we speculate that the HyperPlus endonuclease is not a mixture of restriction enzymes. Nevertheless, limited information prevented us from further inferring the exact enzyme(s) responsible for the sequencing noise measured in our study. Other endonucleases for DNA fragmentation, such as Fragmentase [14], may also generate sequencing noise that could be misinterpreted as genuine mutations. Fragmentase is a mix of two enzymes: one randomly creates nicks in the dsDNA while the other one cuts the strand opposite to the nicks. It is possible that the noise created by Fragmentase could be similarly ameliorated from the data through a specific algorithm, like the one employed in this study.

Given that endonucleases themselves are incapable of incorporating nucleotides into the DNA or causing mutations [19], we speculate that mutations arise after enzymatic fragmentation during the “fill-in process” orchestrated by the DNA polymerase for end repair (“End repair & A-tailing enzyme” prior to adaptor ligation in the HyperPlus kit). Ultra-sonication randomly cleaves DNA molecules at different genomic positions and, therefore, in the subsequent fill-in process, nucleotides are incorporated at different genomic positions in different DNA molecules. Even if an erroneous nucleotide is incorporated into the cleaved sites, the resultant artifact would not be recognized as a mutation, because it would not consistently appear at the same position on different molecules. However, because the HyperPlus endonuclease preferentially cleaves specific sites on the DNA, when an erroneous nucleotide is incorporated, the resultant artifact could be mistakenly recognized as a mutation because it appears repeatedly at the same position on different molecules. For instance, hairpin structures made in palindromes may result in nucleotide mis-incorporation into the center of a palindromic sequence, which would ordinarily be detectable as a mutation, albeit incorrectly. Moreover, multi-nucleotide substitutions near the end of the read—another feature of the artifactual noise—can arise as more than one mis-incorporation during the fill-in process. By filtering the data using our algorithm, these positional biases and other artifacts are identified and excluded, thereby minimizing the number of non-genuine mutations. For instance, the algorithm designed in this study will identify and exclude mutation-based sequencing artifacts within the center of palindromic sequences, as well as multi-nucleotide substitutions near the ends of the read.

We found a substantial number of somatic SNVs/indels in the paired analysis of the six tumor samples using the SureSelect treatment for normal samples and the HyperPlus treatment for tumor samples (SH). We considered that such noise could be avoided by using the same DNA fragmentation method for paired samples (i.e., HH combination), and tested this using samples from two rectal cancer cases. Even though we confirmed a substantial reduction in the number of SNVs/indels using just one fragmentation method, upon careful examination, we detected the persistence of HyperPlus noise among the resultant SNVs/indels from the HH combination; this noise was frequently classified by the algorithm in other pairwise comparisons and characterized by palindromic structure. This finding reinforces our proposal of the risk that persistent errors may be confused with genuine mutations due to their recurrent appearance in a cohort. In such situations, the algorithm developed in this study can be used to distinguish true mutations from sequencing errors. The current study hence provides the technical basis to remove sequencing noise derived from HyperPlus endonuclease treatment.

Materials and methods

Starting amount of DNA

In our sequencing facility, the default protocol for library preparation in exome sequencing is the use of the SureSelect kit (Agilent Technologies). In cases where there is less than 200 ng of DNA, we use the HyperPlus (KAPA Biosystems) kit. Thus, for the purposes of this comparative study, the starting amounts of DNA were 40 and 200 ng for preparation with the HyperPlus and SureSelect kits, respectively.

DNA fragmentation by ultra-sonication

DNA shearing by ultra-sonication was performed with the E220 Focused-ultra-sonicator (Covaris) for 360 s at 4°C according to the manufacturer’s recommendations. After shearing, the median peak in fragment length was 177 bp (range, 160–185 bp), as measured using the 2200 TapeStation (Agilent Technologies).

DNA fragmentation using HyperPlus endonuclease

DNA was incubated with the HyperPlus “Frag Enzyme” (KAPA Biosystems) at 37°C for 30 min, according to the manufacturer’s recommendations.

Library preparation

After enzymatic fragmentation (HyperPlus) or ultrasonic shearing (SureSelect), we performed end repair, phosphorylation, and the ligation of barcoded adaptors according to each of the manufacturer’s protocols. DNA samples were then captured by hybrid capture using the SureSelect Human All Exon V5 kit (Agilent Technologies). The captured libraries were amplified with the addition of index sequences, and were multiplexed before sequencing.

Sequencing

Libraries were sequenced using the HiSeq2500 (Illumina), according to the manufacturer’s recommendations, with a median depth of coverage of 260 (124–271) per tumor with the HyperPlus kit, 294 (257–334) per tumor with the SureSelect kit, and 172 (148–225) per normal tissue sample with the SureSelect kit.

Bioinformatics tools for somatic SNV/indel calls

Sequenced reads were aligned with BWA (Burrows-Wheeler Aligner; ver. 0.7.12) to the human genome reference (hg19) [20]. GATK (GenomeAnalysisTK; ver 3.4–46) was used to recalibrate the variant quality score and to perform local realignment [21]. Somatic SNVs were called with VarScan (ver. 2.3.7), MuTect (ver.1.1.5), and Karkinos (ver. 3.0.22) [2224]. VarScan (ver. 2.3.7), SomaticIndelDetector (ver.2.3–9), and Karkinos2 (ver.0.1) were used to detect somatic indels [21, 22, 24]. SNVs and indels were considered as genuine only when they were detected by at least two of three callers and used for subsequent analyses. SNVs/indels were annotated with ANNOVAR [25] (2015 Mar 22 released version). COSMIC (Catalogue Of Somatic Mutations In Cancer; v82) database [12] was integrated into the ANNOVAR database and used to identify the SNVs/indels registered in the COSMIC database.

Informatics methods to characterize and filter sequencing artifacts by HyperPlus

We designated a somatic SNV coincidently located at the center of a palindromic sequence as an “SNV-centered palindrome” (SCP). The SNV-centered sequences were extracted from a reference FASTA file (hg19.fa; http://hgdownload.cse.ucsc.edu/goldenPath/hg19/) and were determined to have palindromic structure or not by an in-house script. Reads with and without somatic SNVs/indels were extracted from a BAM file derived from a tumor sample. The position of a variant in a read was subsequently assigned according to the mapped position, and the CIGAR string of the read with an in-house Python (version 3.7.2) script using the Pysam module (https://github.com/pysam-developers/pysam).

COSMIC mutational signatures [9] were computed from trinucleotide frequencies of nucleotide substitutions in each sample. Probabilities for the signatures were downloaded from the COSMIC website (https://cancer.sanger.ac.uk/cosmic/signatures). Each signature contribution to a tumor signature profile was computed by minimizing the difference between the trinucleotide frequencies and the linear sum of probabilities using the Rsolnp (version 1.1.6) library.

Variant recurrence was counted when a variant in a sample had the identical genomic coordinate and altered nucleotides to that in the other samples among the 28 tumors, for which the libraries were prepared with the HyperPlus kit. We utilized this variant recurrence as part of the filtering algorithm for data containing artifacts generated due to preparation with the HyperPlus kit. Because genuine driver mutations were also recurrent in various types of cancer and because most of them were already registered in the COSMIC database, these “COSMIC database-recurrent” variants were returned for further filtering.

To remove SNVs/indels having positional bias in a read, we used p-values from Kolmogorov-Smirnov (KS) comparisons of the position between the variant and wildtype alleles in a read with R (version 3.5.2). The ratio of soft-clipped reads per total reads with SNVs/indels was computed according to information from the BAM file, and this was also used for filtering. A predictive model was developed based on the logistic regression to classify the SNV/indel as noise or signal. Threshold lines were estimated by generalized linear models with the logit link function. The threshold line was then optimized to provide the maximum summation of sensitivity and specificity.

Supporting information

S1 Fig. Location of SNV-centered palindromes (SCPs) on the read.

Each panel indicates the location of SCPs filtered by the noise-canceling algorithm (x-axis) and the number of reads (y-axis) for each odd-numbered-length palindrome. Palindrome length is indicated by the number at the right shoulder of each panel. Palindromic sequences and SNVs are shown as black lines and red points, respectively. Thin and thick dashed lines indicate the end and 30 bases from edge of the read, respectively. Each read is 100 bases in length, as shown above the panel.

(DOCX)

S1 Table. Sequence and number of filtered palindrome sequences.

(XLSX)

Acknowledgments

The authors thank Ken Uchibori, Makoto Nishio, Takashi Akiyoshi, Takeru Wakatsuki, Keisho Chin, Ken Namikawa and Junko Fujisaki for kindly allowing us to re-analyze their data originally generated for other genomic cohort studies. The authors also thank Tetsuo Noda, Kazuma Kiyotani, Yasuo Uemura, Izuma Nakayama and Yu Imamura for helpful discussions. The authors also thank Megumi Nakai, Tomoko Kaneyasu, Sayuri Amino, Yuki Ota, Rie Furuya and Ikumi Haraguchi, for technical assistance; Minako Hoshida for administrative assistance; and Rebecca Jackson for editing a draft of this manuscript.

Data Availability

The two rectal cancer datasets are available on National Bioscience Database Center (NBDC; https://humandbs.biosciencedbc.jp/) with Accession ID hum0140. The other data (lung and two gastric cohorts) are not publicly accessible because our institutional review board prohibits public sharing of the data. However, these data can be available upon request to the JFCR Data Control Office (Data_Control@jfcr.or.jp).

Funding Statement

MEXT | Japan Society for the Promotion of Science (JSPS) - JP15K06861 [S. M. ]; MEXT | Japan Society for the Promotion of Science (JSPS) - JP17K18337 [O. G. ]; MEXT | Japan Society for the Promotion of Science (JSPS) - JP18K07338 [S. M. ]. This work was supported by JSPS KAKENHI Grant Numbers JP17K18337, JP15K06861, and JP18K07338. Data4C’s co. ltd. was neither funding nor funded by any governmental research funding agency. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Whereas Data4C’s co. ltd. provided support in the form of salaries for authors A.T. and T.H., Japanese Foundation for Cancer Research paid to Data4C’s co. ltd. as a business consignment for their assisting analytic parts of this work.

References

  • 1.Taber KAJ, Dickinson BD, Wilson M. The Promise and Challenges of Next-Generation Genome Sequencing for Clinical Care. JAMA Intern. Med. 2014. 174:275–80. 10.1001/jamainternmed.2013.12048 [DOI] [PubMed] [Google Scholar]
  • 2.Ma X, Shao Y, Liqing T, Flasch DA, Mulder HL, Edomonson MN, et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biol. 2019. 20:50 10.1186/s13059-019-1659-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Head SR, Komori HK, LaMere AS, Whisenant T, Van Nieuwerburgh F, Salomon DR, et al. Library construction for next-generation sequencing: Overviews and challenges. BioTechniques 2014. 56:61–77. 10.2144/000114133 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Robasky K, Lewis NE, Church GM. The role of replicates for error mitigation in next-generation sequencing. Nat. Rev. Genet. 2014. 15:56–62. 10.1038/nrg3655 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Abnizova I, te Boekhorst R, Yuriy LO. Computational Errors and Biases in Short Read Next Generation Sequencing. J Proteomics Bioinform. 2017. 10(1):1–17. [Google Scholar]
  • 6.Chen G, Moiser S, Gocke CD, Lin MT, Eshleman JR, Cytosine Deamination is a Major Cause of Baseline Noise in Next Generation Sequencing. Mol Diagn Ther. 2014. 18(5): 587–593. 10.1007/s40291-014-0115-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.The Cancer Genome Atlas Network Comprehensive molecular characterization of human colon and rectal cancer. Nature 2012. 487:330–7. 10.1038/nature11252 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.The Cancer Genome Atlas Network Comprehensive molecular profiling of lung adenocarcinoma. Nature 2014. 511:543–50. 10.1038/nature13385 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Alexandrov LB, Nik-zainal S, Wedge DC, Campbell PJ, Stratton MR Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 2013. 3:246–59. 10.1016/j.celrep.2012.12.008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Nik-Zainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, Raine K, et al. Mutational processes molding the genomes of 21 breast cancers. Cell 2012. 149:979–93. 10.1016/j.cell.2012.04.024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Alexandrov LB, Nik-zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, et al. Signatures of mutational processes in human cancer. Nature 2013. 500:415–21. 10.1038/nature12477 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 2010. 38:164. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Costello M, Pugh TJ, Fennel TJ, Stewart C, Lichtenstein L, Meldrim JC, et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 2013. 41:67. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Knierim E, Lucke B, Schwarz JM, Schuelke M, Seelow D, Systematic comparison of three methods for fragmentation of long-range PCR products for next generation sequencing. PLoS One 2011. 6:e28240 10.1371/journal.pone.0028240 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.He HH, Meyer CA, Hu SS, Chen MW, Zang C, Liu Y, et al. Refined DNase-seq protocol and data analysis reveals intrinsic bias in transcription factor footprint identification. Nat. Methods 2014. 11:73–8. 10.1038/nmeth.2762 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Lazarovici A, Zhou T, Shafer A, Dantas Machado AC, Riley TR, Sandstrom R, et al. Probing DNA shape and methylation state on a genomic scale with DNase I. Proc. Natl. Acad. Sci. U. S. A. 2013. 110:6376–81. 10.1073/pnas.1216822110 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Koohy H, Down TA, Hubbard TJ. Chromatin accessibility data sets show bias due to sequence specificity of the DNase I enzyme. PLoS One 2013. 8:e69853 10.1371/journal.pone.0069853 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Meyer CA, Liu XS. Identifying and mitigating bias in next-generation sequencing methods for chromatin biology. Nat. Genet. Rev. 2014. 15:709–721. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Roberts RJ, Halford SE. Type II Restriction Enzymes In Roberts RJ, Linn SM, Lloyd RS, editors, Nucleases, 2nd Ed Cold Spring Harbor Laboratory Press; 1993. pp. 35–88. [Google Scholar]
  • 20.Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009. 25:1754–60. 10.1093/bioinformatics/btp324 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.DePristo MA, Banks E, Poplin RE, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011. 43:491–8. 10.1038/ng.806 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012. 22:568–76. 10.1101/gr.129684.111 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 2013. 31:213–9. 10.1038/nbt.2514 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Hiroki Ueda Karkinos Available from https://github.com/genome-rcast/karkinos
  • 25.Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010. 38:164. [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Ruslan Kalendar

31 Oct 2019

PONE-D-19-26447

Sequencing Artifacts Derived from a Library Preparation Method using Enzymatic Fragmentation

PLOS ONE

Dear Dr. Mori,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Dec 15 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Ruslan Kalendar, PhD

Academic Editor

PLOS ONE

Journal Requirements:

1. When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Financial Disclosure section:

MEXT | Japan Society for the Promotion of Science (JSPS) - JP15K06861 [S. M. ]

MEXT | Japan Society for the Promotion of Science (JSPS) - JP17K18337 [O. G. ]

MEXT | Japan Society for the Promotion of Science (JSPS) - JP18K07338 [S. M. ]

We note that one or more of the authors are employed by a commercial company: Data4C’s Co. Ltd.

1. Please provide an amended Funding Statement declaring this commercial affiliation, as well as a statement regarding the Role of Funders in your study. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. You can update author roles in the Author Contributions section of the online submission form.

Please also include the following statement within your amended Funding Statement.

“The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

If your commercial affiliation did play a role in your study, please state and explain this role within your updated Funding Statement.

2. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc.  

Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and  there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include both an updated Funding Statement and Competing Interests Statement in your cover letter. We will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Reviewer #1:

This paper presents an interesting comparison of the effects of preparing DNA fragments using two types of kits - sonication and Restriction enzyme based. The authors have deveoped a way to bring the restriction enzyme fragments into agreement with the sonicated samples. Unfortunately not all of the data set used are publicly available since they are patient data.

The paper does, however, need some additional background information. Some summary of the number of restriction enzymes that act upon palindromic sequences, and a clear explanation of why this leads to the errors that their approach is able to correct. One wonders whether it might be possible to infer which restriction enzymes are actually used in the kits, but this might raise intellectual property issues.

Reviewer #2:

High throughput, short-read DNA sequencing has become an essential tool for the analysis and targeted treatment of human cancers. Samples for sequencing are derived from human tumors (and control samples), and the quantities available for analysis are typically very small. Therefore technologies have been developed to facilitate accurate sequencing from very limited input material. A key step in the preparation of these samples is the fragmentation of the isolated DNA. The standard method for fragmentation has been ultra-sonication. However, ultra-sonication can lead to loss of material. Given the small amount of DNA in the initial samples, this can be a serious problem. Therefore, DNA isolation protocols have recently been introduced using endonucleases as an alternative to ultra-sonication. Unfortunately, however, the sequences obtained from the two preparations (fragmentation by endonucleases versus ultra-sonication) are not identical. Which is correct? And can one develop computational methods to identify and remove the errors?

Addressing these two challenges is the focus of the manuscript by Tanaka et al. The authors sequenced a range of samples from healthy and cancerous tissues using fragmentation either by endonucleases or by ultra-sonication. Using pairwise comparisons of many sequences from both methods, they show that fragmentation by endonucleases introduces errors. Moreover, they show that these errors are typically introduced in regions of DNA close to palindromic sequences.

Building on their understanding of the distinctive features of these errors, the authors developed an algorithm to identify the artifactual noise. They then used this algorithm to cancel this noise and recover correct sequences.

This is an important study that will find use, as more and more scientists and clinicians rely on high throughput sequencing of very small samples. The science is compelling and the manuscript is clear and well written.

This referee does not have any recommendations for revisions except for one small issue. I would like to see more discussion (perhaps a paragraph) about why enzymatic cleavage introduces errors. This finding is at the core of this whole study, and it’s perplexing. I understand the answer is not fully clear. Nonetheless will be good to see a few more sentences speculating about why enzymatic cleavage causes errors.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Reviewer #1: No

Reviewer #2: No

PLoS One. 2020 Jan 3;15(1):e0227427. doi: 10.1371/journal.pone.0227427.r002

Author response to Decision Letter 0


4 Dec 2019

Reviewer #1:

This paper presents an interesting comparison of the effects of preparing DNA fragments using two types of kits - sonication and restriction enzyme based. The authors have developed a way to bring the restriction enzyme fragments into agreement with the sonicated samples. Unfortunately not all of the data set used are publicly available since they are patient data.

The paper does, however, need some additional background information. Some summary of the number of restriction enzymes that act upon palindromic sequences, and a clear explanation of why this leads to the errors that their approach is able to correct. One wonders whether it might be possible to infer which restriction enzymes are actually used in the kits, but this might raise intellectual property issues.

This comment is related to the comment raised by Reviewer #2. In the previous version of the manuscript, we neither showed any detailed information regarding SNV-centered palindromes (SCPs) nor fully discussed the reason(s) why endonuclease treatment introduces artifactual mutations. This was simply because we did not know which endonuclease is used in the HyperPlus kit. However, from the comments raised by the two reviewers, we realized that we could still speculate on the mechanisms by which artifactual mutations were being generated based on the properties of the artifacts—i.e., the palindromic structure, positional bias and multi-nucleotide substitutions—even without knowledge of the enzyme.

In the revised manuscript, we provide information on the frequencies (Figure 5D), locations (on the read; Supplementary Figure 1), and nucleotide sequences of the SCPs (Supplementary Table 1) filtered by the noise-cancelling algorithm. The data showed three characteristic features: first, there was substantial diversity in the lengths and sequences of the SCPs (371-655 [median 568] different palindromes per sample; Figure 5D); second, 66.6% of the SCPs were recurrently observed across the samples (Figure 5D), and; third, in almost all (90.4%) of the SCPs, the entire palindromic sequence was nested within 30 bases from the edge of the read (Supplementary Figure 1). Based on these properties, the HyperPlus endonuclease is considered to be an endonuclease(s), which prefer DNA sequences with diverse palindromic structure (over 1,000 palindromes with different lengths and sequences; Supplementary Table 1) without any specificity. Since a restriction enzyme is defined as an endonuclease with specific recognition site (as a note, over 3,000 type II restriction enzymes, each of which recognizes specific palindrome sequence, have been identified) [1], we speculate that the HyperPlus endonuclease is not a mixture of restriction enzymes. Nevertheless, limited information prevented us from further inferring which specific enzyme is used in the kit.

In response to the second part of the Reviewer’s concern, given that endonucleases themselves are incapable of incorporating nucleotides into the DNA or causing mutations [2], we speculate that mutations likely arise after enzymatic fragmentation during the fill-in process with the DNA polymerase for end repair (“End repair & A-tailing enzyme” prior to adaptor ligation in the HyperPlus kit). Ultra-sonication randomly cleaves DNA molecules at different genomic positions; therefore, in the subsequent fill-in process, nucleotides are incorporated at different genomic positions in different DNA molecules. Even if an erroneous nucleotide is incorporated into the cleaved site, the resultant artifact would not be recognized as a mutation, because it would not repeatedly appear at the same position on different molecules. However, when nucleotide mis-incorporation occurs in a site preferentially cleaved by the HyperPlus endonuclease, the resultant artifact could be mistakenly recognized as a mutation because it appears repeatedly at the same position on different molecules. For instance, hairpin structures made in palindromes may result in nucleotide mis-incorporation into the center of a palindrome, which would ordinarily be detectable as a mutation; albeit, incorrectly. Moreover, multi-nucleotide substitution near the end of a read—another feature of the artifactual noise we see—can arise as more than one mis-incorporation during the fill-in process. By filtering the data using our algorithm, these positional biases and other artifacts are identified and excluded, thereby minimizing the number of non-genuine mutations. For instance, the algorithm designed in this study will identify and exclude mutation-based sequencing artifacts within the center of palindromic sequences, as well as multi-nucleotide substitutions near the ends of the read.

As mentioned above, the revised manuscript includes information on the frequencies (Figure 5D), locations (on the read; Supplementary Figure 1), and nucleotide sequences of the SCPs (Supplementary Table 1) filtered by the noise-cancelling algorithm. We accordingly amended our description in the section “Noise Reduction in the Training Data” in the Results, and include a paragraph in the Discussion to elaborate on these speculations. We hope that these changes address the concern raised by Reviewer 1.

Reviewer #2:

High throughput, short-read DNA sequencing has become an essential tool for the analysis and targeted treatment of human cancers. Samples for sequencing are derived from human tumors (and control samples), and the quantities available for analysis are typically very small. Therefore technologies have been developed to facilitate accurate sequencing from very limited input material. A key step in the preparation of these samples is the fragmentation of the isolated DNA. The standard method for fragmentation has been ultra-sonication. However, ultra-sonication can lead to loss of material. Given the small amount of DNA in the initial samples, this can be a serious problem. Therefore, DNA isolation protocols have recently been introduced using endonucleases as an alternative to ultra-sonication. Unfortunately, however, the sequences obtained from the two preparations (fragmentation by endonucleases versus ultra-sonication) are not identical. Which is correct? And can one develop computational methods to identify and remove the errors?

Addressing these two challenges is the focus of the manuscript by Tanaka et al. The authors sequenced a range of samples from healthy and cancerous tissues using fragmentation either by endonucleases or by ultra-sonication. Using pairwise comparisons of many sequences from both methods, they show that fragmentation by endonucleases introduces errors. Moreover, they show that these errors are typically introduced in regions of DNA close to palindromic sequences.

Building on their understanding of the distinctive features of these errors, the authors developed an algorithm to identify the artifactual noise. They then used this algorithm to cancel this noise and recover correct sequences.

This is an important study that will find use, as more and more scientists and clinicians rely on high throughput sequencing of very small samples. The science is compelling and the manuscript is clear and well written.

This referee does not have any recommendations for revisions except for one small issue. I would like to see more discussion (perhaps a paragraph) about why enzymatic cleavage introduces errors. This finding is at the core of this whole study, and it’s perplexing. I understand the answer is not fully clear. Nonetheless will be good to see a few more sentences speculating about why enzymatic cleavage causes errors.

This comment is related to the comment raised by Reviewer #1. As the Reviewers have pointed out, endonucleases cannot incorporate nucleotides into DNA [2], and, therefore, endonuclease treatment does not directly cause erroneous nucleotide incorporation. We speculate that mutations likely arise as part of the fill-in process by the DNA polymerase for end repair (“End repair & A-tailing enzyme” prior to adaptor ligation in the HyperPlus kit) after enzymatic fragmentation. Because nucleotide mis-incorporation occurs at sites preferentially cleaved by the HyperPlus endonuclease, the resultant artifacts are recognizable because of their recurrence at the same position on different molecules. We added this comment as a paragraph into the Discussion.

Reference

1. Roberts RJ, Vincze T, Posfai J, and Macelis D. REBASE—enzymes and genes for DNA restriction and modification. Nucleic Acids Res. Volume 35:D269–270 (2007)

2. Roberts RJ, Halford SE. Type II Restriction Enzymes. In Roberts RJ, Linn SM, Lloyd RS, editors, Nucleases, 2nd Ed. Cold Spring Harbor Laboratory Press. pp. 35-88 (1993)

Attachment

Submitted filename: Tanaka_Response_to_Reviewers_191119.docx

Decision Letter 1

Ruslan Kalendar

19 Dec 2019

Sequencing Artifacts Derived from a Library Preparation Method using Enzymatic Fragmentation

PONE-D-19-26447R1

Dear Dr. Mori,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Ruslan Kalendar, PhD

Academic Editor

PLOS ONE

Acceptance letter

Ruslan Kalendar

26 Dec 2019

PONE-D-19-26447R1

Sequencing Artifacts Derived from a Library Preparation Method using Enzymatic Fragmentation

Dear Dr. Mori:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Ruslan Kalendar

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Location of SNV-centered palindromes (SCPs) on the read.

    Each panel indicates the location of SCPs filtered by the noise-canceling algorithm (x-axis) and the number of reads (y-axis) for each odd-numbered-length palindrome. Palindrome length is indicated by the number at the right shoulder of each panel. Palindromic sequences and SNVs are shown as black lines and red points, respectively. Thin and thick dashed lines indicate the end and 30 bases from edge of the read, respectively. Each read is 100 bases in length, as shown above the panel.

    (DOCX)

    S1 Table. Sequence and number of filtered palindrome sequences.

    (XLSX)

    Attachment

    Submitted filename: Tanaka_Response_to_Reviewers_191119.docx

    Data Availability Statement

    The two rectal cancer datasets are available on National Bioscience Database Center (NBDC; https://humandbs.biosciencedbc.jp/) with Accession ID hum0140. The other data (lung and two gastric cohorts) are not publicly accessible because our institutional review board prohibits public sharing of the data. However, these data can be available upon request to the JFCR Data Control Office (Data_Control@jfcr.or.jp).


    Articles from PLoS ONE are provided here courtesy of PLOS

    RESOURCES