Abstract
Chen et al. (Reports, 17 February 2017, p. 752) highlight an important problem of sequencing artifacts caused by DNA damage at the time of sample processing. However, their manuscript contains several errors that led the authors to incorrect conclusions. Moreover, the same sequencing artifacts were previously described and mitigated in The Cancer Genome Atlas and other published sequencing projects.
The sequencing artifacts discussed in Chen et al. (1) have been described in publications from The Cancer Genome Atlas (TCGA) and other cancer genome projects (2–4). Accordingly, effective mitigation strategies have long been implemented, including software pipelines and improved library preparation methods, as described in Costello et al. (2). Thus, findings described in TCGA publications (5–9) were not impacted by oxidative damage as suggested by Chen et al. While they do raise general awareness of sequencing artifacts, which include machine errors (2), DNA oxidation (2–4), DNA cross-linking (in clinically used formalin-fixed tissues) (10) as well as others, their paper contains several errors that led to incorrect conclusions. The errors affect the following:
(i) Estimation of oxidative damage levels.
Although their reported oxidative damage (8-oxoG) metric, GIVG_T, is essentially equivalent to an earlier reported metric, oxoQ (2–4, 11), their software implementation is limited by using reads aligned only to the forward strand and is biased by filtering out low-quality bases. OxoQ was designed to be interpreted as a (Phred-like) base quality score. As such, oxoQ can be compared to typical levels of sequencing error, which for most Illumina sequencing protocols are roughly Q ≈ 30 (i.e., an error rate of 1/1000 bases). We initially noticed poor agreement between the oxoQ and GIVG_T metrics for damaged samples (oxoQ < 30, Fig. 1a), whereas scores were consistent for samples with low levels of DNA damage. Examining the methods of Chen et al., we found that the authors’ code with default parameters applied a high base quality threshold (Q > 30) and removed sites where total coverage exceeds 100x, which discarded the bulk of the data for the most damaged samples (Fig. 1c-f). Adjusting the filtering criteria to include bases with Q > 20 and to allow sites with depth above 100x restored concordance between GIVG_T and oxoQ scores (Fig. 1b).
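The Phred-like reading of oxoQ can be made concrete with a short sketch. It assumes the standard Phred relation Q = −10·log10(p), which is what "Phred-like" implies; the function names are illustrative and are not taken from Picard or from Chen et al.'s code.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred-scaled quality Q into an error probability p."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert an error probability p into a Phred-scaled quality Q."""
    return -10 * math.log10(p)


# Q = 30 corresponds to roughly one error per 1000 bases, the typical
# sequencing-error level quoted in the text.
typical_error_rate = phred_to_error_rate(30)

# Under this assumed relation, a lower score (e.g. oxoQ = 24.2, the damaged
# sample in Fig. 1C) implies a higher artifact rate than the Q30 baseline.
damaged_error_rate = phred_to_error_rate(24.2)
```

Under this reading, a sample with oxoQ below 30 has 8-oxoG-type errors occurring more often than typical sequencing errors, which is why oxoQ < 30 is used to mark damaged samples.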
Figure 1.
(A) Picard oxoQ score vs. GIVG_T for ~1,900 TCGA tumor exomes. Lack of agreement between oxoQ and GIVG_T scores is apparent in samples with high DNA damage (oxoQ < 30), highlighted in the gray box. GIVG_T was calculated with Chen et al.’s code (estimate_damage.pl) with default parameters; oxoQ was calculated from Picard CollectSequencingArtifactMetrics (11) output. (B) OxoQ vs. corrected GIVG_T (using base quality Q > 20 and coverage depth ≥ 20, comparable to Picard defaults), showing excellent agreement. (C) Histogram of individual base qualities in a typical TCGA 8-oxoG damaged sample (oxoQ = 24.2) compared to a low-damage sample (oxoQ = 48.9). Most of the bases in the damaged sample have Q < 30 and hence are ignored by default GIVG_T. This explains both the lack of agreement between the scores seen in (A) and the authors’ failure to detect the sequence context associated with 8-oxoG damage (D vs. E): the Q > 30 filtering removed bases with the type of damage that is being quantified. (D, E) “Lego” plots showing the distribution of errors in different 5′ and 3′ sequence contexts (plotted in reverse complement by convention), with Chen et al.’s (D) and our corrected (E) filtering criteria, showing distortion from the characteristic 8-oxoG context pattern arising from data selection criteria. (F) The fraction of bases that remain after applying Chen et al.’s filters, showing that most bases are removed in highly damaged samples.
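The effect of the two filters on data retention (panel F) can be illustrated with a minimal sketch. The function and the toy observations below are hypothetical and are not part of estimate_damage.pl. Each observation is a (base quality, site depth) pair, and the thresholds mirror the defaults described in the text (Q > 30, depth no greater than 100x) and the relaxed setting (Q > 20, no depth cap).

```python
from typing import Iterable, Optional, Tuple


def fraction_retained(
    bases: Iterable[Tuple[int, int]],
    min_quality: int,
    max_depth: Optional[int] = None,
) -> float:
    """Fraction of (base_quality, site_depth) observations passing the filters.

    A base is kept when its quality is strictly above min_quality and, if a
    depth cap is given, when the site depth does not exceed max_depth.
    """
    bases = list(bases)
    if not bases:
        return 0.0
    kept = sum(
        1
        for quality, depth in bases
        if quality > min_quality and (max_depth is None or depth <= max_depth)
    )
    return kept / len(bases)


# Hypothetical observations from a damaged sample: mostly Q < 30, some deep sites.
toy_bases = [(22, 80), (25, 150), (28, 60), (31, 90), (35, 120), (24, 40)]

strict = fraction_retained(toy_bases, min_quality=30, max_depth=100)
relaxed = fraction_retained(toy_bases, min_quality=20)
```

In this toy input only one of six bases survives the strict setting, whereas all six survive the relaxed one. The real fractions depend on each sample's quality and depth distributions.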
(ii) Claim of no context specificity of 8-oxoG damage
The authors’ conclusion that they “did not observe nucleotide context specificity” of 8-oxoG damage (1) is a consequence of their filtering out most of the supporting data, as described above. Sequence context specificity is a key feature of 8-oxoG damage (12), can impact mutational signature analysis, and can be visualized using “Lego” plots (Fig. 1d-e). These plots show the error rate of each type of base substitution in its 3-base sequence contexts (2). The damaged bases reported by Chen et al.’s script (Fig. 1d) have a severely distorted error profile inconsistent with the previously reported (12) 8-oxoG damage pattern (Fig. 1e, C>A errors in the Lego plot with a clear peak at the sequence context CCG>CAG), whereas the corrected GIVG_T script (and the previous implementation in (2)) results in a distribution of errors (Fig. 1e) consistent with 8-oxoG damage. Thus, 8-oxoG damage does in fact have a sequence context.
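The tabulation behind a Lego plot can be sketched as follows. This is a minimal illustration, not the implementation used in (2): it assumes each mismatch is given as a 3-base reference context (mismatched base in the middle) plus the observed base, and it collapses purine-centered contexts onto their reverse complement so that G>T errors are counted as C>A. All names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(context: str, alt: str) -> Tuple[str, str]:
    """Express a mismatch with a pyrimidine (C or T) reference base."""
    context, alt = context.upper(), alt.upper()
    if context[1] in "AG":
        return reverse_complement(context), alt.translate(_COMPLEMENT)
    return context, alt


def context_counts(mismatches: Iterable[Tuple[str, str]]) -> Counter:
    """Count mismatches per trinucleotide change, e.g. 'CCG>CAG'."""
    counts: Counter = Counter()
    for context, alt in mismatches:
        ctx, observed = canonical(context, alt)
        counts[f"{ctx}>{ctx[0]}{observed}{ctx[2]}"] += 1
    return counts


# A G>T error in context CGG is the same event as C>A in context CCG.
example = context_counts([("CGG", "T"), ("CCG", "A"), ("ACA", "T")])
```

Turning these counts into the error rates shown in a Lego plot would additionally require dividing each count by the number of times its context was observed, which is omitted here.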
(iii) Interpretation of 8-oxoG damage levels
The authors state that “73% of the TCGA sequencing runs showed extensive damage, with a GIVG_T > 2”. A GIVG_T metric of 2 indicates that the G>T error rate in the 8-oxoG mode is twice the error rate of the non–8-oxoG mode (background rate for G>T errors). Even if the 8-oxoG artifacts are twice the context-specific background level (i.e., GIVG_T=2), this corresponds to only a 5–10% increase in the overall base-level error rate (summed over all sequence contexts, Fig. 2a-c), which is less than the inter-sample variability of error rates at a fixed oxoQ. A 5% increase in the base-level error rate results in a minor, if any, increase in false-positive mutation calls (Fig. 2e-f), since calling algorithms are designed to handle typical levels of sequencing error. Only at GIVG_T ≳ 5 (equivalent to oxoQ ≲ 35) do the additional errors from 8-oxoG become comparable to the sum of all other errors and adversely impact variant calling. The vast majority of samples in TCGA exhibit only minor 8-oxoG damage that has minimal impact on mutation calling.
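The arithmetic behind this interpretation can be sketched with illustrative numbers. The rates below are hypothetical, chosen only so that the result falls at the low end of the 5–10% range quoted above; they are not measurements from TCGA data, and the function names are not from any published tool.

```python
def giv_ratio(oxog_mode_rate: float, background_rate: float) -> float:
    """Ratio of the G>T error rate in the 8-oxoG mode to the background mode."""
    return oxog_mode_rate / background_rate


def relative_error_increase(
    oxog_mode_rate: float, background_rate: float, baseline_total_rate: float
) -> float:
    """Excess 8-oxoG errors as a fraction of the baseline overall error rate."""
    return (oxog_mode_rate - background_rate) / baseline_total_rate


# Hypothetical per-base rates (assumptions for illustration only).
background_gt_rate = 5e-5   # G>T errors in the non-8-oxoG mode
oxog_gt_rate = 1e-4         # G>T errors in the 8-oxoG mode
baseline_total_rate = 1e-3  # all error types combined, roughly Q30

giv = giv_ratio(oxog_gt_rate, background_gt_rate)
increase = relative_error_increase(
    oxog_gt_rate, background_gt_rate, baseline_total_rate
)
```

With these inputs the ratio is 2 while the overall error rate rises by only about 5%. A doubling of one context-specific error mode is therefore a small change when that mode is a minor share of all errors.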
Figure 2.
Comparison of 8-oxoG–related error rates and other error modes. Samples (A-C) with corrected GIVG_T damage levels below 5 (equivalent to oxoQ ≳ 35) have fewer 8-oxoG–related base errors (red) compared to other sequencing error modes (blue). (A) Count of tumors with a specific corrected GIVG_T score (majority of TCGA samples have very low damage levels). (B) Percentage of additional base mismatches caused by 8-oxoG in relation to GIVG_T score. (C) 8-oxoG–related error rates (red) and other error modes (blue). Other modes dominate until corrected GIVG_T ≈ 5. (D) Copy of Fig 4E from Chen et al. (1). (E) Our estimated FDR vs. corrected GIVG_T for ~1,900 TCGA tumors from MuTect (14) before 8-oxoG filtering (GIVG_T < 7.5). (F, G) Estimated FDR vs. corrected GIVG_T for ~1,900 TCGA tumors from MuTect before (F) and after (G) 8-oxoG filtering (as described in (2)). Results demonstrate no inflation in number of mutation calls after applying the 8-oxoG software filter.
(iv) Estimation of the false positive rate (FPR) of mutation calling
The authors define an FPR metric, which suggests that mutation calls for many samples in public databases contain >50% falsely detected somatic variants due to sequencing artifacts (Fig. 2d). However, their metric does not reflect the actual somatic mutations used for analyses in (3–10) and other TCGA publications (e.g., a recent reanalysis of TCGA data (13)), but instead represents candidate variants supported by as few as two non-reference reads (not considered as somatic variants by most mutation callers (14)). Moreover, the mathematical definition of their FPR metric is neither an FPR nor a false discovery rate (FDR), as shown by the fact that it can range between −1 and 1 (and not between 0 and 1). Consequently, Chen et al. incorrectly concluded that the FPR exceeds 0.5 in samples that have a low level of DNA damage with nearly no damage-induced false-positive mutation calls (Fig. 2d-g).
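For reference, the standard definitions of FPR and FDR can be written as a short sketch. It does not reproduce Chen et al.'s formula, which is not given here; it shows only why a genuine FPR or FDR, being a proportion of non-negative counts, cannot fall below 0. Function names are illustrative.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """FPR = FP / (FP + TN): share of true negatives wrongly called positive."""
    total = false_positives + true_negatives
    return false_positives / total if total else 0.0


def false_discovery_rate(false_positives: int, true_positives: int) -> float:
    """FDR = FP / (FP + TP): share of positive calls that are false."""
    total = false_positives + true_positives
    return false_positives / total if total else 0.0


# Hypothetical counts: 5 artifactual calls among 100 reported mutations.
fdr = false_discovery_rate(false_positives=5, true_positives=95)
fpr = false_positive_rate(false_positives=5, true_negatives=999_995)
```

Because both quantities are ratios of a count to a total that includes it, they always lie between 0 and 1. A metric that can take negative values is measuring something else.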
(v) Finally, Chen et al. overlook publications that describe 8-oxoG damage (2–10) and the fact that TCGA and other projects have mitigated the effects of 8-oxoG with laboratory protocols and, when regenerating sequencing libraries was impractical, with software filtering strategies (Fig. 2f-g). Moreover, they claim that “recent submissions to TCGA (November to December 2015) displayed similar G-to-T imbalances” but mistakenly interpreted the time that the data repository was last updated as the time of sequencing data generation (repository updates occurred periodically for unrelated issues). The actual dates of generating the sequencing data should be obtained from the sequencing metadata. In fact, most TCGA samples sequenced after October 2012 had low 8-oxoG damage due to improved library preparation (as described in (2)).
In conclusion, we disagree with the assessment of Chen et al. regarding the quality of published cancer genome projects. Raw sequencing data inherently contains errors; therefore, to avoid misinterpreting the data, it is important that researchers use established procedures and carefully curated datasets in downstream analyses.
Acknowledgements
We are grateful for useful comments from Eric S. Lander, Maura Costello, Niall J. Lennon, and Lee Lichtenstein as well as editing help from Mendy Miller and the Science editorial staff. This work was partially funded by the United States National Institutes of Health grant (1U24CA210999).
References:
- 1. Chen L, Liu P, Evans TC, Ettwiller LM, DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science 355, 752–756 (2017).
- 2. Costello M, Pugh TJ, Fennell TJ, Stewart C, Lichtenstein L, Meldrim JC, Fostel JL, Friedrich DC, Perrin D, Dionne D, Kim S, Gabriel SB, Lander ES, Fisher S, Getz G, Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 41, e67 (2013).
- 3. Pugh TJ, Morozova O, Attiyeh EF, Asgharzadeh S, Wei JS, Auclair D, Carter SL, Cibulskis K, Hanna M, Kiezun A, Kim J, Lawrence MS, Lichenstein L, McKenna A, Pedamallu CS, Ramos AH, Shefler E, Sivachenko A, Sougnez C, Stewart C, Ally A, Birol I, Chiu R, Corbett RD, Hirst M, Jackman SD, Kamoh B, Khodabakshi AH, Krzywinski M, Lo A, Moore RA, Mungall KL, Qian J, Tam A, Thiessen N, Zhao Y, Cole KA, Diamond M, Diskin SJ, Mosse YP, Wood AC, Ji L, Sposto R, Badgett T, London WB, Moyer Y, Gastier-Foster JM, Smith MA, Guidry Auvil JM, Gerhard DS, Hogarty MD, Jones SJ, Lander ES, Gabriel SB, Getz G, Seeger RC, Khan J, Marra MA, Meyerson M, Maris JM, The genetic landscape of high-risk neuroblastoma. Nat. Genet. 45, 279–84 (2013).
- 4. Crompton BD, Stewart C, Taylor-Weiner A, Alexe G, Kurek KC, Calicchio ML, Kiezun A, Carter SL, Shukla SA, Mehta SS, Thorner AR, de Torres C, Lavarino C, Suñol M, McKenna A, Sivachenko A, Cibulskis K, Lawrence MS, Stojanov P, Rosenberg M, Ambrogio L, Auclair D, Seepo S, Blumenstiel B, DeFelice M, Imaz-Rosshandler I, Schwarz-Cruz Y Celis A, Rivera MN, Rodriguez-Galindo C, Fleming MD, Golub TR, Getz G, Mora J, Stegmaier K, The genomic landscape of pediatric Ewing sarcoma. Cancer Discov. 4, 1326–41 (2014).
- 5. Integrated genomic characterization of papillary thyroid carcinoma. Cell 159, 676–90 (2014).
- 6. Comprehensive and Integrative Genomic Characterization of Diffuse Lower Grade Gliomas. NEJM 372, 2481–2498 (2015).
- 7. Genomic Classification of Cutaneous Melanoma. Cell 161, 1681–1696 (2015).
- 8. Integrated Genomic Characterization of Oesophageal Carcinoma. Nature 541, 169–175 (2017).
- 9. Integrated Genomic Characterization of Pancreatic Ductal Adenocarcinoma. Cancer Cell 32, 185–203 (2017).
- 10. Giannakis M, Mu XJ, Shukla SA, Qian ZR, Cohen O, Nishihara R, Bahl S, Cao Y, Amin-Mansour A, Yamauchi M, Sukawa Y, Stewart C, Rosenberg M, Mima K, Inamura K, Nosho K, Nowak JA, Lawrence MS, Giovannucci EL, Chan AT, Ng K, Meyerhardt JA, Van Allen EM, Getz G, Gabriel SB, Lander ES, Wu CJ, Fuchs CS, Ogino S, Garraway LA, Genomic Correlates of Immune-Cell Infiltrates in Colorectal Carcinoma. Cell Rep. 17, 1206 (2016).
- 11. https://broadinstitute.github.io/picard.
- 12. Margolin Y, Shafirovich V, Geacintov NE, DeMott MS, Dedon PC. DNA sequence context as a determinant of the quantity and chemistry of guanine oxidation produced by hydroxyl radicals and one-electron oxidants. J. Biol. Chem. 283, 35569–35578 (2008).
- 13. Ellrott K, Bailey MH, Saksena G, Covington KR, Kandoth C, Stewart C, Hess J, Ma S, Chiotti KE, McLellan M, Sofia HJ, Hutter C, Getz G, Wheeler D, Ding L; MC3 Working Group; Cancer Genome Atlas Research Network. Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Syst. 6, 271–281 (2018).
- 14. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES, Getz G. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology 31, 213–219 (2013).


