Skip to main content
Nature Communications logoLink to Nature Communications
. 2020 Sep 21;11:4748. doi: 10.1038/s41467-020-18151-y

Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples

Matthew H Bailey 1,2,3, William U Meyerson 4,5, Lewis Jonathan Dursi 6,7, Liang-Bo Wang 1,2, Guanlan Dong 2, Wen-Wei Liang 1,2, Amila Weerasinghe 1,2, Shantao Li 5, Sean Kelso 2; MC3 Working Group; PCAWG novel somatic mutation calling methods working group, Gordon Saksena 8, Kyle Ellrott 9, Michael C Wendl 1,10,11, David A Wheeler 12,13, Gad Getz 8,14,15,16, Jared T Simpson 6,17, Mark B Gerstein 5,18,19,, Li Ding 1,2,3,20,; PCAWG Consortium
PMCID: PMC7505971  PMID: 32958763

Abstract

The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF < 15%) and clonal heterogeneity contribute up to 68% of private WGS mutations and 71% of private WES mutations. We observe that ~30% of private WGS mutations trace to mutations identified by a single variant caller in WES consensus efforts. WGS captures both ~50% more variation in exonic regions and un-observed mutations in loci with variable GC-content. Together, our analysis highlights technological divergences between two reproducible somatic variant detection efforts.

Subject terms: Comparative genomics, Communication and replication, Cancer genomics, Genetic databases


With the generation of large pan-cancer whole-exome and whole-genome sequencing projects, a question remains about how comparable these datasets are. Here, using The Cancer Genome Atlas samples analysed as part of the Pan-Cancer Analysis of Whole Genomes project, the authors explore the concordance of mutations called by whole exome sequencing and whole genome sequencing techniques.

Introduction

Complementary efforts of The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have recently produced two of the highest quality and most elaborate and reproducible somatic variant call sets from exome and whole genome-level data in cancer genomics, respectively. The motivation for these efforts stems from the notion that “scientific crowd sourcing” and combining mutation callers can provide very strong results.

These two efforts produced variant calls from 10 different callers, namely Radia1, Varscan2, MuSE3, MuTect4, Pindel5,6, Indelocator7, SomaticSniper8 for WES and MuSE, Broad-Pipeline (anchored by MuTect), Sanger-pipeline, German Cancer Research Center pipeline (DKFZ), and SMuFin9, for WGS. Briefly, the PCAWG Consortium aggregated whole genome sequencing data from 2658 cancers across 38 tumor types generated by the ICGC and TCGA projects. These sequencing data were re-analyzed with standardized, high-accuracy pipelines to align to the human genome (reference build hs37d5) and identify germline variants and somatically acquired mutations10. Of the 885 TCGA samples in ICGC, 746 were included in the latest exome call set produced by both the Multi-Center Mutation Calling in Multiple Cancers (MC3) effort and the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium set. These 746 samples represent a critical benchmark for high-level analysis of similarities and differences between exome and genome somatic variant detection methods.

Reproducibility of mutations identified by both whole exome capture sequencing and whole genome sequencing (WGS) techniques remains an important issue, not only for the scientific use of large, established data sets, but for data designs of future research projects. Previous work analyzing exome capture effects on sequence read quality has shown that GC-content bias is the major source of variation in coverage11. A performance comparison across exome-captured platforms demonstrated that for most technologies, both high and low GC-content result in reduced coverage in read depth12. Belkadi et al. compared mutation calls between WGS and WES, observing that ~3% of coding variants with high quality were only detected in WGS, and WGS also had a more uniform distribution of coverage depth, genotype quality, and minor read ratio13. Furthermore, due to the relatively high error rate per read in next-generation sequencing14, the detectability of mutations with low variant allele fractions (VAFs) is limited by background noise. Despite these studies’ nuanced preference towards WGS, others contend that WES will remain a better choice until costs of WGS fall15. The decision to sequence exomes or whole genomes is further confounded as more recent publications in oncology select either WGS1620 or WES2124. Recognizing the unresolved nature of this issue, Schwarze et al. have called for more comprehensive studies comparing the WES and WGS studies, especially as this issue has important ramifications for the clinic25.

Our analysis provides confidence that mutation calls within the captured exonic regions of these two data sets are largely consistent. We highlight common sample, cohort, and caller-specific challenges in cancer variant detection from the TCGA and ICGC efforts. We show that variants that are most confidently called in one database i.e., called by multiple callers, are very likely to be called in the other. We assess how reproducibility impacts higher-level mutation signature analysis and illustrate the need for caution in assessing performance that can only be identified by the overlap of these two data sets. Finally, we explore the capacity of WGS to detect recurrent non-coding mutations captured by whole exome sequencing.

Results

Data and workflow

We used publicly available data from the MC3 and PCAWG repositories, consisting of ~3.6 M and ~47 M variants, respectively (Fig. 1a). 746 samples were sequenced by both WES and WGS, comprising various aliquots and portions of the same tumor (Supplementary Data 1, Fig. 1b). Effects of these differences are discussed below for preliminary results, but we ultimately used the entire set of 746 samples in the variant overlap analysis, since the effects of tumor partitioning did not play a significant role (Supplementary Fig. 1). By restricting the public data sets to overlapping samples, we reduced the total corpus to ~220 K (6.1%) and ~23 M (49.6%) mutations for exome and whole genome, respectively. It is notable that there is an enrichment of variants in hypermutated samples from COAD, HNSC, LUAD, and STAD in the PCAWG set used in this study (Supplementary Fig. 2). To begin building a comparable set of mutations between these two studies, we further restricted the whole genome data set to exon regions provided by the MC3 analysis working group. This reduced the WGS data set to 1.6% of its original size, within range of total exome material estimations26 (Fig. 1a). The next step involved removing poorly-covered variants potentially caused by technical anomalies by limiting mutations to those captured in coverage files (distributed as.wig files). A reciprocal coverage strategy was used, meaning PCAWG mutations were restricted to covered genomic regions in MC3 and vice versa, thereby maintaining a complementary set of callable genomic regions. Removal of mutations in uncovered regions reduced the remaining PCAWG data set by approximately one-half, from 387,166 to 183,424 mutations. We also identified 4241 MC3 and 2219 PCAWG mutations that were present in the respective MAF but were not marked as covered in the coverage files provided by a single group. This suggests that different tools consider different minimum coverage strategies. These mutations reflect 2.0% and 1.2%, respectively, of the total mutation discrepancy and were removed because some callers had limited capacity to identify mutations in poorly-covered regions (see “Methods” section). Finally, filter flags provided by MC3 were used to assess somatic mutation filtering strategies. At this stage, we performed filter optimization to comprehensively evaluate all possible combinations of MC3 filters (Supplementary Fig. 3a). Ultimately, we decided to only remove OxoG labeled artifacts and duplicated events produced by these filters (see “Methods” section, Supplementary Data 2). Since each stage of this filtering workflow resulted in many alternative decisions and outcomes, we built MAFit, a web-based graphical user interface that allows users to easily customize comparisons of merged mutations (https://mbailey.shinyapps.io/MAFit/). A MC3 filter assessment also shows that many variants with filter flags in MC3 are present in the PCAWG variant call set, suggesting a need for improved filtering strategies (Supplementary Fig. 3b).

Fig. 1. Workflow and sample inclusion statistics.

Fig. 1

a A workflow diagram illustrates the number of mutations present during each step (gradient) of the filtering processes for MC3 (left, blue) and PCAWG (right, red). A brief description of each step of the intersection process is shown in between. b TCGA barcodes and aliquot IDs were used to match somatic sequencing. The exact match of these IDs is shown for various collection aliquots from tissue to plate. c A volcano plot highlights cancer subtype discrepancy between each PCAWG and MC3 with −log10(p-value) on the y-axis and log2(odds ratio) on the x-axis (Fisher’s exact test). The horizontal red bar indicates a significant threshold after multiple testing correction. Positive values indicate an over-representation of a cancer subtype in PCAWG, while negative values indicate an under-representation of a cancer subtype in PCAWG compared to MC3-separated by a vertical red bar. d Sample counts for each cancer type are shown in a bar chart. The colors coordinate with panel c.

TCGA samples comprise a sizable fraction of the PCAWG sample pool (~30%, Supplementary Data 1) Additional WGS sequencing from TCGA allowed for mutation validation27 and insights into non-coding mutations, such as in TET2. However, this selection process could have potentially influenced our basic comparison of exome-sequenced samples and genome-sequenced samples in two fundamental ways. First, vagaries of tumor extraction and tissue storage protocols may have resulted in many different portions of a tumor being stored, introducing the possibility that different subclones of the same tumor could be present. These could have very different genetic makeups. This information was captured in different substrings of the TCGA identification barcode (see “Methods” section). From the 746 TCGA barcodes, we found that 64% (477) could be traced to the same well of a microtiter plate (Fig. 1b). After correcting for cancer type, we modeled both the impact of matching barcode identifiers between MC3 and PCAWG and variant concordance, finding that differing barcodes did not have an appreciable impact. This result was seen for all samples, even when excluding the hypermutator (Fig. 1). Second, each AWG was able to independently select samples for WGS, which, while not affecting mutation calling, does raise potential biases when comparing PCAWG results to TCGA exome cohort data. An enrichment analysis was performed to identify which tumor subtypes may have been preferentially selected for different cancer types. We found that four tumor subtypes were enriched in the PCAWG effort from TCGA samples: infiltrating ductal breast cancer, endometrial serous adenocarcinoma, differentiated liposarcoma, and low grade oligodendroglioma (FDR < 0.05, Fig. 1c, Supplementary Data 3, and see “Methods” section). Final tumor sample counts for each cancer type are shown in Fig. 1d.

Landscape of mutational overlap between WGS and WES calls

Limiting our analysis to coding regions with sufficient coverage yielded a total of 202,459 variants (155,859 matched, 21,627 unique MC3 variants, and 24,973 unique PCAWG mutations), with 76.7% in concordance between MC3 and PCAWG and 10.7% and 12.3% being unique in MC3 and PCAWG, respectively (Fig. 2a). Concordance can be further separated into SNPs and indels, with 79% and 57% overlapping, respectively (Supplementary Fig. 4). Variant overlap was further investigated to reveal its association with mutation caller, sample, and cancer type (Fig. 2a–c). Consensus variant calling showed 91,705 (45.3%) concordant variants were captured in the intersection of Sanger, MuSE, DKFZ, and Broad callers from PCAWG, as well as Varscan, SomaticSniper, Radia, MuTect, and MuSE callers from MC3. Notably, an additional 7.7% were identified by all SNV mutation detection algorithms, except SomaticSniper. The reduced sensitivity of SomaticSniper is related to its algorithmic consideration of tumor contamination in the matched normal (e.g., skin) for liquid tumors8. After optimizing for filtering strategies, we performed a sample level comparison and found that 70% of samples had greater than 80% mutation concordance across the two cohorts. An additional 20% of samples had greater than 80% mutation recoverability in one or the other technique (Fig. 2b). Skin Cutaneous Melanoma performed the best among all cancer types and had the highest variant-matching rates for both MC3 and PCAWG (Fig. 2c). Generally, when considering all MC3 and all PCAWG mutations separately, we observed that PCAWG variant matching rates were generally higher, especially for Kidney Chromophobe (KICH), Brain Lower Grade Glioma (LGG), Ovarian Serous Cystadenocarcinoma (OV), Rectum Adenocarcinoma (READ), and Thyroid Carcinoma (THCA). The differences in OV are likely driven by whole genome amplified library preparation. Generally, the median fractions for matching MC3 variants were lower than those of matching PCAWG variants. This result was unexpected because MC3 provided fewer unique variants overall, suggesting that a large fraction of PCAWG unique variants reside in a few samples. Furthermore, after accounting for hypermutators, we identified a correlation between non-silent mutations per megabase and mean consensus percentages at the cancer level in both PCAWG consensus percentages (Mann–Whitney p-value = 1.97 × 10−3) and MC3 consensus percentages (Mann–Whitney p-value = 6.59 × 10−4, see “Methods” section, Supplementary Fig. 1c, d). Despite strong rank statistics, neither set exhibited strong correlation values for MC3 variants or PCAWG variants, R2 statistics = 0.31 and 0.17 respectively with the majority of cancer types exceeding 80% mean concordance. Thus, one may expect to observe slightly higher variant fidelity in samples with more mutations.

Fig. 2. Landscape of mutations overlap by caller, sample and cancer type.

Fig. 2

a UpSetR41 plot shows the variant calling set intersection by caller. The y-axis indicates set intersection size and the x-axis uses a connected dot plot to indicate which sets are considered. Only the largest 27 intersecting sets are shown. Two insets of the UpSetR plot highlight a classic Euler diagram (left), which indicates the total number of overlapping mutations. A set-size bar chart (right) illustrates the total number of mutations considered from each caller. The concordance set indicates the agreement between WES and WGS. Indel callers are indicated with an asterisk. b A scatter plot shows the amount of concordance by sample by calculating the fraction of matched variants divided by the total number of mutations made by MC3 exome sequencing and PCAWG whole genome sequencing (x and y-axis, respectively) below the total fraction of samples within each quadrant. Each point within the plot is related to tumor portion data collected from the TCGA barcode ID. c As shown above, this box plot separates panel b by cancer types (blue considers all MC3 variants, and red boxes indicate all PCAWG variants). Sample sizes are displayed for each cancer; points indicate samples that extend past 1.5 times the interquartile range; and horizontal bars within each box and whisker indicates median matched mutation fraction.

Variant allele fraction affects call-rates

After achieving a comparable data set and merging MC3 and PCAWG variants, we found that low VAF is the prevailing attribute of unique mutations. VAF is a fundamental factor in somatic variant detection, as well as sub-clonal structure prediction, and is used to predict subclonal tumor growth rates and metastatic potential. To explore the contribution of VAF, we sought to distinguish the contribution of subclonal structure and statistical chance when exploring private mutations in a single call set. We articulate our findings in six broad categories: modeling sequence noise, departure from idealized behavior, sub-clonal modeling, annotation differences, variant-caller effects, and analysis correlations.

Association of variant allele fraction with recoverability

We have observed that variants with low VAF are less likely to be reported in both call-sets. This finding relates to the lower sensitivity of somatic variant callers for variants with low VAF. To illustrate this principle, we estimated the expected overlap rate between MC3 and PCAWG at different VAFs. The sensitivity of MuSE across a range of VAFs and read depths in synthetic data was reported in Fan et al., 20163. We used these reported benchmarking characteristics of MuSE to estimate the expected overlap rate between the MuSE call-sets of MC3 and PCAWG across a range of VAFs (see “Methods” section). These expectations, which involve lower overlap rates at lower VAFs, generally tracked observed data but tended to overestimate observed overlap rates, especially for predicting the recovery fraction of MC3 variants in PCAWG. (Fig. 3a) The discrepancies between expectations and observations may relate to simplifying assumptions that made this modeling possible (see “Methods” section).

Fig. 3. Recoverability simulation and effects of subclones on mutation concordance.

Fig. 3

a Observed recovery rate of PCAWG variants in MC3 (red) and of MC3 variants in PCAWG (blue), alongside sequencing noise simulations calculated from random draws of a binomial model that incorporates the VAF and estimated read depth at each site (light red simulates PCAWG recoverability of MC3 variants, and light blue simulated MC3 recoverability of PCAWG variants). Y-axis is described with legend. X-axis displays VAF of the comparative data set in regard to Y. b A stacked bar chart displays the proportion of matched and unique variants (y-axis) for different VAF bins (x-axis). 180 variants did not provide read count information and were removed from this figure. c Stacked proportional histogram shows the fractions of PCAWG matched mutations (purple) and PCAWG-unique mutations (red). Mutations were restricted to SNVs, and subclonality predictions are indicated as either ‘Clonal’ or ‘Sub-clonal’. Columns 2–4 reflect sub-clonal assignment provided by PCAWG (Note: only a few samples reported five predicted subclones and were not included in this analysis). The number of variants represented for each clonal assignment is shown on the x-axis. d Similar to panel c, a stacked proportional histogram illustrates the proportion of matched and unique variants for MC3 which provide estimates of total number of matched or unique variants called by MC3.

More generally, we observed that VAF had a greater association with variant recovery rates than predicted by the binomial model (Fig. 3b). A random forest regression model trained on five statistics characteristics of VAF distribution per PCAWG sample and another five for that of the corresponding MC3 call-set predicted the fraction of variants per sample unique to PCAWG with 0.85 (0.86—when restricting to variants called by MuSE) Spearman correlation of test-set observations and a 0.68 (0.78) coefficient of determination (R2).

The strong association of VAF with recovery rates by call-set, despite modest explanatory power of the binomial, indicates important departures from idealized behavior. These departures could include explanations such as: PCR amplification violates the assumption of independence of reads, imputed read depths are systematically inflated, or some low-VAF variants represent sequencing artifacts. We conclude that non-ideal effects of VAF predict the majority of sample-level variance in fraction of co-called variants.

Exploring subclonality

One possible explanation for some variants being private to one call-set is that the sequencing aliquots for the two sequencing projects came from subclonally-distinct microregions of the same tumor. To investigate this possibility, we tested whether the MC3 and PCAWG call-sets differed from each other systematically at the subclonal level (Fig. 3c, d). We hypothesized that tumors with a more complex subclonal structure (i.e., greater number of subclones) would have larger systematic differences in the VAF of shared variants between the MC3 and PCAWG call-sets. We found a small but highly significant effect: each additional subclone increased the average absolute difference in VAF of the shared variants between MC3 and PCAWG by 0.003, with a p-value of 1.3 × 10−11 (linear regression); this effect reversed after controlling for tumor purity, indicating that the observed trend does not provide evidence of this interesting concept in re-sequencing (see “Methods” section for details). We do not have evidence that systematic VAF differences between call-sets of the same underlying sample associate with tumor heterogeneity. Real time effects of VAF differences between these two data sets can be observed using the online MAFit tool (Fig. 4).

Fig. 4. Screenshots of online tool MAFit. Here we display screenshots from the MAFit on-line interface.

Fig. 4

Currently there are three main components to the interface: a A side panel shows sliders and radio buttons to filter data set to remain inclusive. In addition, a download button is available that will download the underlying data table. b MAFit rebuilds Fig. 2b in the first tab of the on-line interface. Each alteration to the radio buttons or VAF sliders will result in an updated figure. In addition, if one’s hovers over a point on the scatter plot, a pop-up window will automatically display, providing the user with basic statistics used to calculate that point, i.e., total number of mutations, number of unique and matched mutations. c A table is also presented based on the selection criteria in panel a.

Annotation differs by call-set

Genome annotation is critical for biological interpretation and downstream analysis of sequencing data. In order to avoid issues that arise from annotation differences, we only considered genomic locations in our intersection strategy. In doing so, we observed 2153 annotation differences where MC3 and PCAWG had different genes annotated for the same mutation. After restricting the mutation type to missense mutations and indels, 789 annotations differences remained. Most of these had the same mutation types annotated by both call-sets (690 SNPs, 15 insertions, 50 deletions), but some discrepancies remained. Notably, 413 out of 789 mismatch variants are labeled coding in MC3 but non-coding in PCAWG (Supplementary Data 4). We also observed four mutations that were annotated as cancer gene mutations by MC3, but as non-cancer gene mutations by PCAWG, and another four mutations that were annotated as cancer gene mutations by PCAWG, but as non-cancer gene mutations by MC3. One such example subsumed two mutations on chromosomal location 3p21.1 (genomic locations chr3:52442525 and chr3:52442604) that were annotated as missense mutations of BAP1 by MC3, but as 5’Flank SNPs of PHF7 by PCAWG. While identical pipelines resolve such differences, we stress the potential for misinterpretations when combining these publicly-available datasets.

Effects of software

Another important issue we assess is the degree to which differences in bioinformatics pipelines impact concordance. We extracted calls from MuSE and MuTect, both of which were executed on each dataset, and examined 6 subsets of results: MuSE-only-calls and all calls save MuSE-calls (the complement), MuTect-only-calls and their complement, and MuSE + MuTect calls and their complement. MuSE and Mutect each generate around 95% of the total calls, of which each respective subset shows close to 80% concordance between WES and WGS (Supplementary Fig. 5). These call sets themselves overlap almost completely, with their combination (MuSE + MuTect) giving a marginally higher concordance. Conversely, the data-specific caller combinations (referred to above as the complements) each furnish small call sets which vary considerably between WES and WGS (concordance as low as 15%). Because of the vast difference in the sizes of the MuSE/MuTect and the complementary call sets, there is little difference in the original analysis versus analyses restricted to variant callers common to both platforms. Differences in software pipelines do not appear to be significant confounding factors in concordance here.

Effects on higher-level analysis

We also sought to assess how higher-level analyses might be impacted using mutation signature analysis as a representative. We ran SignatureAnalyzer28 to ascertain signatures between matched WGS and WES samples for each case. A total of 563 of 739 cases (76%) showed the same dominant signature between WES and WGS and the multi-element signature vectors for each case are very highly correlated with one another, the average Pearson coefficient being almost 90%, with a cohort significance of <2 × 10−6 (Fisher’s Test, “Methods” section, Supplementary Fig. 6). These observations suggest that signature analysis is relatively insensitive to data type when concordance is high, as it is here.

Landscape of private WES and WGS mutations

After identifying many possible sources of variation among private variants, we sought to characterize the fraction of variation explained by previously identified factors (Supplementary Fig. 7, see “Methods” section). As displayed, subclonal and low VAF variants make up the largest fractions of explained variants for private MC3 and PCAWG variants. Notably, for private MC3 calls, indels (not called by MuSE or MuTect) are the next highest source of variation explained. GC-content and poor performing cancers such as THCA, KICH, and PRAD make up a smaller portion of the total number of private mutations.

Variants present in only one public call-set

We sought to classify cancer driver mutations uniquely identified by MC3. After removing two outlier samples having excesses of unique mutations (TCGA-CA-6717-01A-11D-1835-10, TCGA-BR-6452-01A-12D-1800-08), we observed 424 mutations in cancer genes28 (median read depth = 97, median alternative allele count = 9) The four most frequently mutated genes were: KMT2C (22-mutations), PIK3CA (12), SPTA1 (9), and NCOR1 (9). Interestingly, the majority of unique PIK3CA mutations not identified by PCAWG were at 2 locations: E542K/G (5), and E545K (4). Whether this phenomenon reflects technical bias of WGS or is a product of subclonality warrants further investigation.

The MC3 effort produced two mutation files: one controlled access somatic mutation file that represents nearly all mutations found by all callers, and a second was modified by the scientific community for public use. There are two critical differences in these files involving the reporting of mutations in exonic regions and mutations reported by a single variant caller. Since we limited our analysis strictly to exonic regions, we observed that 92% of the 9138 PCAWG private mutations found in the MC3 controlled access file were only identified by a single variant caller (Supplementary Fig. 8). As expected, the highest unique variant caller overlap was observed in MuTect and MuSE, two tools that were used by both MC3 and PCAWG. This observation accounts for 30% of PCAWG private variants.

We investigated how many variants unique to the MC3 somatic public access call-set could be found in the PCAWG germline call-set for the same patients. We identified a total of six such variants (each in a different sample), five of which were flagged in the MC3 public call-set with one filter or another. Overall, this indicates that variants that have been incorrectly designated as germline or somatic are an extremely uncommon source of variation between the two projects.

Variants in GC-extreme intervals

Since it is well-known that the efficiency of exome capture is adversely affected by very high or very low GC-content29,30, we sought to test whether GC-content was associated with call rates in MC3 and PCAWG. We used a plug-in for VEP31 to annotate all matched and private SNVs with CADD32 in order to annotate each variant with the percentage of the neighboring 100 bases that are a G or C. First, we assessed how the distribution of read depth across GC-content changes between MC3 and PCAWG (Fig. 5b). PCAWG was found to have a fairly uniform read depth across GC-content bins, while MC3 read depth was concentrated in regions of moderate GC-content (Fig. 5c). The low read depth in MC3 at regions of extreme GC-content was in turn associated with lower variant recovery rates in these regions but did not grossly affect the number of variants recovered by MC3 because regions of extreme GC-content are relatively rare in the genome overall and in exome-capture regions in particular.

Fig. 5. WGS mutations in exonic regions not captured by WES.

Fig. 5

a A sunburst diagram provides a breakdown of variants that are removed during the coverage step of the tool. The innermost circle represents the total number of variants identified upon filtering for exome beds used by MC3. Then, we restrict PCAWG variants to well-covered MC3 regions for each sample. The majority of gencode.v19 annotated and the BROAD target bed file of exonic regions are sufficiently covered by PCAWG in flanking regions: 3’UTRs, 5’UTR, and 5’Flanking. The outermost ring illustrates the number mutations identified by PCAWG that were poorly covered by MC3. b A density plot illustrates the density of percent GC-content from a 100 bp window surrounding a variant. Four variant-sets are displayed: matched, private to MC3, private to PCAWG, and we extend our dataset to include exonic variants not covered by WES but sufficiently covered in WGS (Covered by PCAWG only). c A scatter plot displays mean sequence depth (y-axis) by increasing GC-content bins (x-axis). Points are colored according to variant set (same as panel b). df Total annotated mutations counts from 3 different annotated regions are shown for 5UTR, 3UTR, and missense mutations, respectively. g Expression Z−Scores for 3’UTR using all TCGA-UCEC samples. Cis-RNAseq expression violin plots are displayed for 13 genes. On top of the gene-level distribution violin plot, box and whisker plots display sample expression based on mutation classification (box include 25th quantile to 75th quantiles, and whiskers extend to 1.5 times the interquartile range).

An in-depth analysis of these regions revealed that 76 mutations in known driver genes, identified in the combined TCGA data by Bailey et al. 2018, were missed in GC poor (GC fraction < 0.3) or GC rich (GC > 0.7) regions28. Three such instances revealed VHL mutations in KIRC that were overlooked in GC rich regions of this gene (two of these three recur). In addition, these 3 samples are not reported to carry a VHL mutation in the MC3 public data set. Other such instances include 7 SOX17 mutations, LATS2 (6), and CACNA1A (6). These findings emphasize the advantages of uniform coverage using WGS.

The bases flanking a mutation (tri-nucleotide context) affect mutation rate, which should be approximately equal between MC3 and PCAWG, and also the rate of introduction of sequencing artifacts. Large differences in the call-rates of MC3 and PCAWG and particular nucleotide sequences could indicate a sequencing artifact unique to one or the other call-set, which might arise from different procedures for computationally filtering or biochemically preventing sequencing oxidation products. Therefore, we sought to test whether the trinucleotide context of variants correlated with relative call-rates in MC3 and PCAWG. Before applying the MC3 OxoG filters, we found a huge predominance of CA variants unique to MC3, with the trinucleotide contexts most specific to one database or another being 7-9 times more specific than the least specific trinucleotide contexts. After applying the MC3 OxoG filters, nucleotide contexts differed by less than four-fold in their specificities. The residual differential specificity by trinucleotide context after filtering can either indicate differences in sequencing artifact abundance and filtration by project, or could merely be a consequence of the fact that nucleotide context is also correlated with VAF and the distance from transcription start sites, which may independently affect MC3 and PCAWG relative call-rates.

We extended the nucleotide context and performed mutation spectrum analysis, comparing all MC3 and all PCAWG mutations found after restricting the two data sets to exonic regions as described above (Step 3 of Fig. 1a). We then calculated transition and transversion frequencies in each cancer type. After removing hypermutated samples and OxoG artifacts, we used a chi-squared test to determine the similarities and differences between cancer types in the full exome space compared versus the captured exome space. Strikingly, we did not identify significant differences in mutation spectrum in the majority of cancers. We did observe significant differences (FDR < 0.05) in the mutation spectrum for COAD, KICH, LUAD, and OV (Supplementary Data 5). These observations included strong discrepancies for AG and CG transition differences in KICH and OV, respectively. AT and CA transversions contributed mostly to COAD and LUAD differences (Supplementary Fig. 9). While these differences may reflect sequencing artifacts, such as whole genome amplified DNA in OV or low sample size, we believe the data can still provide more information pertaining to additional cancer genes and oncogenic mechanisms.

Non-Coding/Flanking intersections with low coverage

With the growing use of WGS in many labs, we sought to identify which mutations are gained by extending to this form of data. One major observation from our pipeline highlighted that some variants in exome regions were not well covered by WES (Fig. 1a Step 3). Using this mutation set we investigated the most recurrent members as derived by WGS but not by MC3 in exonic regions as defined by gencode.v19 (Fig. 5a). We observed 697 mutations in cancer genes28 uniquely called by WGS (Supplementary Data 6). We defined flanking mutations as all non-translated mutations near exons, i.e., 5’UTR, 3’UTR, 5’Flanking, and 3’Flanking regions, as they make up the majority of mutations not present in the MC3 public MAF. Recurrent mutation analysis identified the most frequently mutated genes in 5’UTR (Fig. 5d), 3’UTR (Fig. 5e), and missense mutations (Fig. 5f). We found the most frequently mutated 3’UTR in cancer genes was PGR (91 mutations allowing for multiple mutations per sample), followed by ERBB4 (72), EPHA3 (42), CYLD (41), and PTPRD (34). To extend this analysis, we used RNAseq data collected by TCGA to determine mutation type specific cis-expression patterns, which clearly shows correlation of UTR mutations on RNA abundance (Fig. 5g).

Finally, similar to previous studies33,34, we investigated the potential effect of non-coding mutations when determining significantly mutated genes (SMG). Using MuSiC35 with the no-skip-non-coding option, we rescued non-coding mutations annotated by PCAWG and included them in the significantly mutated gene (SMG) analysis. We only performed SMG analysis on cancer types having greater than 35 samples (BRCA-Breast-AdenoCa, HNSC-Head-SCC, KICH-Kidney-ChRCC, LIHC-Liver-HCC, LUAD-Lung-AdenoCA, LUSC-Lung-SCC, SKCM-Skin-Melanoma, STAD-Stomach-AdenoCA, THCA-Thy-AdenoCA, and UCEC-Uterus-AdenoCA). We initially identified potential driver-gene candidates (FUT9, MMP16, SNHG14, and SFTPB, Fig. 6) not previously reported in Pan-Cancer whole genome analysis, but further investigation did not support these candidates with the exception of SFTPB.

Fig. 6. Significantly mutated gene analysis with the inclusion of UTR mutations.

Fig. 6

OncoPrint plots were generated using the R package ComplexHeatmap42 for four cancer types: LUAD (a), LIHC (b), LUSC (c), and SKCM (d). We report all SMGs identified by Bailey et al. 201828, as well as top significantly mutated gene hits from MuSiC that include non-coding mutations. While many non-coding mutations look promising, further investigation yielded little support for driver identification status.

SFTPB (FDR 1.56e−07) in LUAD was recently reported to be significantly mutated using a larger set of these same data34. As reported, this gene is involved in a lineage-defining surfactant protein. While six mutations contributed to its SMG status, only 1 3’UTR mutation was reported for LUAD in the MC3 controlled data set. Furthermore, this single indel was only found by one variant caller (Varscan). We confirmed the impact of SFTPB UTR mutations by performing a genome-wide association analysis of expression differences and found that samples with SFTPB mutations showed lower RNA abundance in PCDHA7, a gene known to be involved in cells’ self-recognition and non-self-discrimination (chi-squared p-value 3.6 × 10-8). While other promising candidates exist, such as FUT9, a fucosyltransferase involved in organ bud progression during embryogenesis and has been implicated in cancer initiation36, we found no additional evidence for supporting its driver status.

Discussion

The research community is increasingly leveraging technology advances to integrate data at larger scales. We performed a comparative evaluation of ~750 samples with joint exome and whole genome sequencing mutation calls provided by two consensus mutation calling efforts, MC3 and PCAWG. This joint data set is encouraging, suggesting that ~80% of the predicted somatic mutations were captured by both efforts. Furthermore, a combined 90% of samples have greater than 80% variant concordance when considering covered exonic mutations from individual cohorts separately. Analysis of this data set also revealed three major contributors to variant discrepancies: (1) low variant allele fraction, (2) variant filtering decisions, and (3) technological limitations. Software differences were not an appreciable confounder.

Distinct advantages and disadvantages accompany somatic mutation calling when utilizing captured WES or WGS. We found that ~70% of the discrepancies between whole genome and whole exome sequencing are influenced by low variant allele fraction. This information holds many implications in identifying subclonal heterogeneity in the tumor of interest. Other discrepant calls originate from the decisions made on how to filter and distribute publicly available mutation calls. Higher-order mutation signature analysis does not appear to be inordinately affected by these differences. We show that reported germline variants were negligible, but nearly 30% of the private PCAWG mutations were not reported by MC3 because only a single variant detection algorithm identified them. We want to emphasize that, while somatic variant detection in cancer is commonplace, there are still many issues to reconcile.

Finally, we found additional mutations only observable in exonic regions using either WES or WGS. For example, WES uniquely identified 424 mutations in cancer genes with median VAF of ~10%. We also highlight ~700 WGS mutations from cancer genes, of which ~10% are attributable to regions of high and low CG-content; thus, highlighting the advantages of more uniform coverage from WGS.

Only about 2% of the genome is protein coding. For the last dozen years, cancer genomics has provided a comprehensive molecular characterization of many different tumor types, thanks in large part to The Cancer Genome Atlas and other publicly funded efforts. The community is just starting to explore how exomics, transcriptomics, proteomics, and methylomics can be woven together across this 2% of the genome. We anticipate a general transition from WES to WGS, but this analysis is meanwhile reassuring that few clerical mutations were overlooked in WES and that WGS is capable of recapitulating previous genomic findings.

Methods

Human research participants

The Cancer Genome Atlas (TCGA) collected both tumor and non-tumor biospecimens from human samples with informed consent under authorization of local institutional review boards (https://cancergenome.nih.gov/abouttcga/policies/informedconsent).

Sample overlap

TCGA barcodes carry metadata that reflect tumor portions and different aliquots. As noted in Fig. 1b, TCGA barcode differ slightly in the comparison between MC3 and WGS aliquots. A brief description of the breakdown of the TCGA barcode is outlined below.

Example: TCGA-DD-0001-01B-01D-A152

  • - TCGA-Project

  • - DD-Tissue source site: the tissue location of tumor that matches clinical metadata.

  • - 0001-Participant code

  • - 01-Sample type: i.e., solid tumor (01), primary blood derived tumor (03), solid tissue normal (11), blood derived normal (10)

  • - B-Vial: the order in a sequence of samples, i.e., A = first in sequence, B = second in sequence

  • - 01-Portion: sequential order of the 100–120 mg of samples

  • - D-Analyte: molecular analyte type for analysis, i.e., D for DNA and W for WGA.

  • - A152-Plate: sequential location of a 96-well plate

A lookup table outlining these fields is located at the GDC: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables. In order to determine the role of aliquot differences in assessing mutation concordance, we re-analyzed the clonality and overall mutation overlap after stratifying for exact barcode differences. We observed that the effect of matching barcodes on match variant frequency has little effect.

Assessing cancer subtype selection preference

Analysis working groups for TCGA projects were primarily subdivided according to cancer types. Scientific experts gathered in consortia from around the country to participate in characterizing many tumors using high throughput data generated on many substrates, e.g., WES, RNAseq, etc. At the conclusion of these projects, groups were asked to hand-select a subset of samples to perform validation sequencing (WGS, the samples used in this analysis). The selection criteria differed from group-to-group and sometimes resulted in an overabundance of one subtype over another. To determine cancer subtype selection bias, we performed an enrichment analysis. Using clinical data we calculated (for each cancer type) the subtype fraction in the WES cancer cohort and measured whether the fraction was similar to WGS set of samples using a Fisher’s exact test.

Defining exonic regions

We used the same definition as Ellrott et al. to reduce whole genome and exome calls to define genomic coordinates27.

Coverage calculations

Fixed-step (step = 1) Wiggle coverage files (.wig) for both MC3 and PCAWG were provided by the Broad Institute. The wig format is a binary readout of sufficient sequencing coverage for genomic data. Here, sufficient coverage is defined as bases with 8 or more reads at a given location. These wig files were processed and reduced to exonic regions using the wig2bed function from BEDOPTS37.

After the preliminary screen of coverage-reduced MAF files, we observed that matching mutations (identified by PCAWG and MC3) were removed from one technology and not the other after the coverage reduction step. To account for this issue, we performed a self-coverage reduction step to that identified 6460 mutations. We describe some properties of those mutations here. The median tumor depth reported by MC3 from these variants is 12 reads (+/− 3 median absolute difference). The median tumor depth reported by PCAWG in this region is 39 reads (+/− 35.6 median absolute difference), suggesting wide variance of tumor read depths that were removed. However, the mode tumor depth of the PCAWG variants was 13, justifying this removal of variants with low read support. Finally, we determined how many of these poorly-covered variants originated from cancer driver genes. We observed 126 mutations from the MC3 file, and 156 cancer mutations were eliminated at this stage in the comparison.

Overlapping mutations

After reducing the variants to be within exome sequencing target region, within same exon definitions, and having enough sequencing depth, the remaining variants from ICGC and PCAWG were stored in a SQLite database to enable fast lookup. We then executed a full join between the two sources of variants by matching the donor ID, sample ID, and the genomic range of each variant. The full join output was further cleaned up to remove duplicated filters due to naming variations and duplicated variants due to DNPs.

  • - Matching IDs

  • - Matching chromosomes

  • - End position greater than or equal to start position

  • - Start position is less than or equal to end position.

Deduplication of variants

After merging the PCAWG and MC3 data, we observed different strategies were taken by MC3 and PCAWG to capture neighboring variants, i.e., complex indels, di-nucleotide (DNP) and tri-nucleotide (TNP) polymorphisms. To address complex indel events (SNVs in indel regions), the MC3 working group absorbed the variants made by SNV callers into the assignment made by Pindel. Conversely, PCAWG merged DNP and TNP events into a single event. These strategies resulted in many duplication events from MC3 and PCAWG: 1731 and 62, respectively. These events encompassed 3457 and 119 events, respectively. To address these differences, we merged PCAWG variants into MC3s complex indel events, and MC3 variants into single DNP or TNP events.

Filtering optimization

After reducing the starting pool of possible mutations from 746 samples to covered exons, we sought to identify the optimal set of MC3 filters that provide the largest number of samples with greater than 80% concordance from the two technologies with the simplest schema. This was performed comprehensively using all possible combinations of filters, often with more than one filter per variant, with the MC3 cohort (131,071 filter combinations). Filter flags include: “common_in_exac”, “gapfiller”, “native_wga_mix”, “nonpreferredpair”, “oxog”, “StrandBias”, and “wga”. We pre-defined the exclusion of variants in MC3 flagged as OxoG along with the inclusion of all PASS variants. The comprehensive filter analysis resulted in two major clusters of variant recoverability (Supplementary Fig. 3). Here, we observed the computational trade-off of identifying more matched variants at the cost of more unique MC3 calls. Below, we highlight five strategies considered for analysis (Supplementary Data 2).

  1. Only consider variants labeled PASS by the MC3 filter column.

  2. Only remove variants labeled OxoG by MC3.

  3. Prioritize G1 (samples in the most recoverable quadrant, MC3 and PCAWG samples with greater than or equal to 80% from both efforts.)

  4. Prioritize total number of matched variants.

  5. Maximize total number of samples in the most recoverable quadrant (Fig. 2b) while maximizing the difference between unique MC3 variants and matched variants thus generating fewer unique calls.

After considering complexity, we chose to move forward with strategy 2 for the entirety of this study due to its simplicity and relative similarity to other filtering schemes. We recognize that by selecting a single filtering strategy, we are limiting the data slightly and likely introducing some false positive variant calls. However, this strategy allowed us to maintain larger sample sizes and to capture ~15,000 more matched variants than the PASS only strategy at the cost of ~3500 unique mutations calls for MC3.

Assessing mutations per megabase and cancer type concordance

Mutations per megabase data were collected from the broader TCGA dataset and reduced following the same methods outlined previously28. Briefly, this systematically removed hypermutators from the dataset. This resulted in a set of 625 samples from the MC3/PCAWG dataset studied here and 8852 TCGA samples. Both Pearson and Mann-Whitney correlations statistics were performed to assess the association of non-silent mutations per megabase and concordance statistics.

Simulation of sequencing noise and recoverability

Fan et al. benchmarked the sensitivity of MuSE at recovering somatic variants across 24 combinations of VAF and read depth3. When simulating the recovery of PCAWG variants in MC3 we assumed that the VAF observed in PCAWG was the true VAF. We matched the observed VAF of each variant to the closest VAF reported in Fan et al.

For our analysis, the best value to use as the read depth when predicting the MC3 recovery rate of PCAWG variants would be the MC3 read depth at the same site and sample as the PCAWG variant. However, it was not practical to obtain MC3 read depths at sites without MC3 variants, so instead we simulated an MC3 read depth for each PCAWG variant by randomly sampling from the read depths of observed MC3 variants from the same sample as the PCAWG sample. We then matched these simulated read depths for each variant to the closest read depths reported in Fan et al.

For the binned VAFs and read depths for each PCAWG variant obtained as above, we pulled the corresponding sensitivities of MuSE from the Fan et al. paper and simulated MC3 variants with probability equal to these sensitivities.

Integrating clonality

Both consortia considered clonality in their comprehensive characterization of the somatic mutations. Locations of these files are provided in the data availability section. Here, we provide a brief summary of the strategies used to compile these resources. First, the PanCancer Atlas working-groups used MC3 mutations to predict subclonal structures using ABSOLUTE38. This tool uses copy number, recurrent karyotype, and mutation data to calculate copy number purity and cluster identification. Furthermore, the PanCancer Atlas working group only made the distinction of clonal and subclonal mutations and did not attempt to further assign sub-clonal mutations to other likely heterogeneous clusters. PCAWG, on the other hand, used a consensus calling approach incorporating 11 different clustering tools. Here, we evaluated cluster-ID which represents those mutations that are clonal (ID = 1), with other clusters representing mutations that are subclonal (ID = 2 through 4). For this analysis, we restricted our data to SNVs to be consistent with calls made by the PanCancer Atlas calls of MC3 mutations.

Fraction of private variation explained

In Supplementary Fig. 7 we provide a breakdown of different sources of variant described in our analysis using publicly available data. For MC3 all private variants were classified as into 3 variant types (Indel, MissensePlus, and Other). Specifically, indels are comprised of: “Frame_Shift_Del”, “In_Frame_Ins”, “Frame_Shift_Ins”, and “In_Frame_Del”. MissensePlus variants are comprised of: “Missense_Mutation”, “Nonsense_Mutation”, “Nonstop_Mutation”, “Splice_Site”. And Other variants are comprised of: “RNA”, “3’UTR”, “5’UTR”, “5’Flank”, “Silent”, “3’Flank”, “Intron”, “Translation_Start_Site”.

On the other hand, PCAWG variants were also categorized into Indels, MissensePlus, and Other. Specifically, indels are comprised of: “Frame_Shift_Del”, “Frame_Shift_Ins”, “De_novo_Start_InFrame”, “Start_Codon_Ins”, “Stop_Codon_Ins”, “In_Frame_Del”, “In_Frame_Ins”, “Stop_Codon_Del”, and “Start_Codon_Del”. MissensePlus variants are comprised of: “Missense_Mutation”, “Nonsense_Mutation”, “Nonstop_Mutation”, “Splice_Site”. And other variants are comprised of: “5’UTR”, “RNA”, “5’Flank”, “Silent”, “3’UTR”, “Intron”, “IGR”, “lincRNA”, “De_novo_Start_OutOfFrame”, and “Start_Codon_SNP”.

In addition to the three variant type categories, six additional sources of variation were added to private variants: Subclonal, VAF5, VAF10, MMcomplement, THCA KICH or PRAD, and GCcontents. As mentioned, subclonal variants are tagged if labeled as identified by the TCGA or ICGC consortia. VAF5 tags all variants with less that 5% VAF. VAF10 tags all variants with VAF greater than or equal to 5% and less that 10%. MMcompelement tags all variants not detected by MuSE or MuTect. And finally, GCcontent was calculated as the fraction of G or C nucleotides in a 100 bp window surround a variant. This was calculated using a CADD plug-in to VEP. The GCcontent flag was assigned to a variant if GC fraction was less and 0.3 or greater than 0.7.

MAFit: online comparison and visualization tool

MAFit (maf Interaction Tool) is a shinyapp39 tool to visualize and extract mutations from the intersection of PCAWG and MC3 call sets. It is an interactive and graphical web-based interface built using R Shiny. The interface consists of three panels: a main scatter plot display, a side box of control widgets, and a mutation table in the bottom pane. Any alteration of a control widget will update the scatter plot and the mutation table, enabling very rapid browsing. There is also a button to download the current data set displayed in the scatter plot.

The main panel displays a scatter plot with marginal histograms of mutations grouped by sample. The axes are percentage of matched mutations versus matched plus call-set-unique mutations. Mouse-hovering on a data point initiates a pop-up window showing specific information for this sample, such as TCGA barcode, number of matched mutations, and numbers of mutations unique to each call-set.

The side panel contains two sets of control widgets which can be used to select data based on different criteria. The top control set consists of two sliders to set the VAF cut-offs for each call-set. Both ends of the slider can be adjusted so that users can obtain a desired interval of the VAF. The bottom control set consists of check-boxes of both variant-level and sample-level MC3 filters. If only variant-level filters are checked, all PCAWG-only mutations will be retained; if at least one sample-level filter is checked, mutations from samples that do not have any checked filters flagged (variant-level or sample-level) will be filtered out. Both variant-level and sample-level filters result in the union of mutations with any checked filter will be shown.

The bottom panel displays a table of mutations based on the selection criteria from the side panel. Columns include information on each mutation, such as TCGA barcode, gene name, VAF, variant class, Human Genome Variant Society (HGVS) nomenclature, etc. Users can sort the table by each column. There are two drop-down selection boxes where users can view mutations from a specific TCGA barcode or Hugo symbol. There is also a search bar, which results in mutations that contain the input in any columns.

Controlled access files

Having worked with both the TCGA and ICGC consortia, we were privy to the controlled access data (all MC3 somatic variant calls and PCAWG germline calls). These data sets allowed for the further interrogation of unique variants called by both groups.

We performed a mutation variant intersection of MC3 controlled access mutations with unique PCAWG variants in the captured exonic regions. These data can be downloaded using necessary credentials from https://gdc.cancer.gov/about-data/publications/mc3-2017.

We intersected the MC3 public MAF with the PCAWG germline call-set, donor-by-donor. Six MC3 somatic variants were identified as germline variants in PCAWG for the same donor. Of these, five were flagged in MC3 as OxoG or non-preferred-pair artifacts, and only one was marked PASS in MC3. This PASS variant had a depth in the matched MC3 normal of well over 100 with no alternate reads. The minimal intersection between the MC3 somatic call-set and the donor-matched PCAWG germline call-set is evidence that germline contamination in MC3 calls is low.

Assessment of impact on mutation signature analysis

We ran SignatureAnalyzer40 on the corpus of WES and WGS samples from our TCGA cases. This tool reports a vector of J = 7 normalized weights corresponding to mutational signatures. Once weights were computed, we used COSMIC signatures as a reference in order to synchronize labels of the fractional weights between WES and WGS data for each case to enable proper comparison. We excluded 5 cases in which signatures were not computed for WGS data. Using the synchronized results, we then assessed both the number of cases for which the tool identified the same dominant signature between WES and WGS data and also evaluated the correlation between WES and WGS vectors for each case using least-squares regression. Statistical significance of each correlation was calculated using a 2-tailed t-test. Statistical power of each correlation was limited by the paucity of signature weights because the underlying t-statistic is proportional to the square root of J – 2. However, because these cases, and therefore their hypothesis tests, are independent, the cohort constitutes multiple tests of the same hypothesis regarding signatures derived from WES and WGS data. Therefore, we combined individual P-values into an overall cohort P-value using Fisher’s log-transform. Namely, the transform of negative 2 times the natural log of the product of the K = 739 individual P-values is, itself, chi-square distributed with 2 K degrees of freedom. Using this transform, we found an overall cohort P-value of <2 × 10−6.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Supplementary information

Supplementary Information (776.7KB, pdf)
41467_2020_18151_MOESM2_ESM.pdf (62KB, pdf)

Description of Additional Supplementary Files

Supplementary Data 1 (93.3KB, xlsx)
Supplementary Data 2 (9.7KB, xlsx)
Supplementary Data 3 (24.9KB, xlsx)
Supplementary Data 4 (125.9KB, xlsx)
Supplementary Data 5 (10.1KB, xlsx)
Supplementary Data 6 (46.3KB, xlsx)
Reporting Summary (170.3KB, pdf)

Acknowledgements

We acknowledge the contributions of the many clinical networks across ICGC and TCGA who provided samples and data to the PCAWG Consortium and the contributions of the Technical Working Group and the Germline Working Group of the PCAWG Consortium for collation, realignment, and harmonization of the variant calls of the cancer genomes used by this study. We thank the patients and their families for their participation in the individual ICGC and TCGA projects.

Author contributions

L.D. and M.H.B. conceived the project. L.D. and M.B.G. supervised the project. M.H.B., W.M., and G.D. drafted the manuscript. M.H.B., L.-B.W., W.M., S.L., G.S., M.C.W., G.D., M.B.G., L.D., D.A.W., J.T.S., G.G., and L.J.D. provided scientific input. G.S. and K.E. provided sample mapping and quality control files. T.C.G.A.R.N. and P.C organized, generated, and distributed raw data. P.C. and N.S.M.C.M. produced original whole-genome variant calls. M.H.B., W.M., G.D., and L.-B.W. produced figures. Analysis was performed by M.H.B., W.M., L.-B.W., G.D., Y.L., W.-W.L., M.C.W., and A.W. All authors approve the submission of this manuscript. We thank S.K. for providing text edits.

Data availability

Somatic and germline variant calls, mutational signatures, subclonal reconstructions, transcript abundance, splice calls and other core data generated by the ICGC/TCGA Pan-cancer Analysis of Whole Genomes Consortium are described in another publication10 and available for download at https://dcc.icgc.org/releases/PCAWG. Additional information on accessing the data, including raw read files, can be found at https://docs.icgc.org/pcawg/data/. In accordance with the data access policies of the ICGC and TCGA projects, most molecular, clinical and specimen data are in an open tier which does not require access approval. To access potentially identification information, such as germline alleles and underlying sequencing data, researchers will need to apply to the TCGA Data Access Committee (DAC) via dbGaP (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) for access to the TCGA portion of the dataset, and to the ICGC Data Access Compliance Office (DACO; http://icgc.org/daco) for the ICGC portion. In addition, to access somatic single nucleotide variants derived from TCGA donors, researchers will also need to obtain dbGaP authorization. Additional links to data resources used for this project can be found using the following urls: MC3 public MAF file (https://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc), PCAWG Public MAF file (https://www.synapse.org/#!Synapse:syn7364923), bed files used for exome restrictions (MC3) (https://api.gdc.cancer.gov/data/7f0d3ab9-8bef-4e3b-928a-6090caae885b), bed files used for exome restrictions (THE BROAD) (https://api.gdc.cancer.gov/data/b1e303a5-a542-4389-8ddb-1d151218be75), wiggle files MC3 (https://www.synapse.org/#!Synapse:syn21785741), wiggle files PCAWG (https://www.synapse.org/#!Synapse:syn8492850), clonality files MC3 (https://www.synapse.org/#!Synapse:syn7870168), clonality files PCAWG (https://www.synapse.org/#!Synapse:syn8532460), cancer subtypes and histological data (https://www.synapse.org/#!Synapse:syn4983466), MAFit online comparison tool (https://mbailey.shinyapps.io/MAFit/), GitHub Repo (https://github.com/ding-lab/mc3_icgc_variant_pipeline), side-by-side MC3-PCAWG comparison (MAF-like) (https://www.synapse.org/#!Synapse:syn21041380). The remaining data is available within the Article, Supplementary Information or available from the authors upon reasonable request.

Code availability

A public GitHub repository (under an MIT opensource license) at https://github.com/ding-lab/mc3_icgc_variant_pipeline furnishes work-flows, scripts, figures, and computational tools used to assess mutation concordance between maf files. For this project we refrained from accessing alignment files, i.e., BAM files or fastq files. Furthermore, due to the memory footprint of mutation and coverage files we have not included them in the repository. Thus, by removing core data files from the repository the software provided supplies the user with processes and decisions, not a fully automated tool for deployment on additional datasets. In addition to our analysis, the core computational pipelines used by the PCAWG Consortium for alignment, quality control and variant calling are available to the public at https://dockstore.org/search?search=pcawg under the GNU General Public License v3.0, which allows for reuse and distribution.

Competing interests

The authors declare the following competing interests: G.G. receives research funds from IBM and Pharmacyclics. G.G. is an inventor on patent applications related to MuTect, ABSOLUTE, and other tools. The remaining authors declare no competing interests.

Footnotes

Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Lists of authors and their affiliations appear at the end of the paper.

Contributor Information

Mark B. Gerstein, Email: mark.gerstein@yale.edu

Li Ding, Email: lding@wustl.edu.

MC3 Working Group:

Rehan Akbani, Pavana Anur, Matthew H. Bailey, Alex Buchanan, Kami Chiotti, Kyle Covington, Allison Creason, Li Ding, Kyle Ellrott, Yu Fan, Steven Foltz, Gad Getz, Walker Hale, David Haussler, Julian M. Hess, Carolyn M. Hutter, Cyriac Kandoth, Katayoon Kasaian, Melpomeni Kasapi, Dave Larson, Ignaty Leshchiner, John Letaw, Singer Ma, Michael D. McLellan, Yifei Men, Gordon B. Mills, Beifang Niu, Myron Peto, Amie Radenbaugh, Sheila M. Reynolds, Gordon Saksena, Heidi Sofia, Chip Stewart, Adam J. Struck, Joshua M. Stuart, Wenyi Wang, John N. Weinstein, David A. Wheeler, Christopher K. Wong, Liu Xi, and Kai Ye

PCAWG novel somatic mutation calling methods working group:

Matthew H. Bailey, Beifang Niu, Matthias Bieg, Paul C. Boutros, Ivo Buchhalter, Adam P. Butler, Ken Chen, Zechen Chong, Li Ding, Oliver Drechsel, Lewis Jonathan Dursi, Roland Eils, Kyle Ellrott, Shadrielle M. G. Espiritu, Yu Fan, Robert S. Fulton, Shengjie Gao, Josep L. l. Gelpi, Mark B. Gerstein, Gad Getz, Santiago Gonzalez, Ivo G. Gut, Faraz Hach, Michael C. Heinold, Julian M. Hess, Jonathan Hinton, Taobo Hu, Vincent Huang, Yi Huang, Barbara Hutter, David R. Jones, Jongsun Jung, Natalie Jäger, Hyung-Lae Kim, Kortine Kleinheinz, Sushant Kumar, Yogesh Kumar, Christopher M. Lalansingh, Ignaty Leshchiner, Ivica Letunic, Dimitri Livitz, Eric Z. Ma, Yosef E. Maruvka, R. Jay Mashl, Michael D. McLellan, Andrew Menzies, Ana Milovanovic, Morten Muhlig Nielsen, Stephan Ossowski, Nagarajan Paramasivam, Jakob Skou Pedersen, Marc D. Perry, Montserrat Puiggròs, Keiran M. Raine, Esther Rheinbay, Romina Royo, S. Cenk Sahinalp, Gordon Saksena, Iman Sarrafi, Matthias Schlesner, Jared T. Simpson, Lucy Stebbings, Chip Stewart, Miranda D. Stobbe, Jon W. Teague, Grace Tiao, David Torrents, Jeremiah A. Wala, Jiayin Wang, Wenyi Wang, Sebastian M. Waszak, Joachim Weischenfeldt, Michael C. Wendl, Johannes Werner, Zhenggang Wu, Hong Xue, Sergei Yakneen, Takafumi N. Yamaguchi, Kai Ye, Venkata D. Yellapantula, Christina K. Yung, and Junjun Zhang

PCAWG Consortium:

Lauri A. Aaltonen, Federico Abascal, Adam Abeshouse, Hiroyuki Aburatani, David J. Adams, Nishant Agrawal, Keun Soo Ahn, Sung-Min Ahn, Hiroshi Aikata, Rehan Akbani, Kadir C. Akdemir, Hikmat Al-Ahmadie, Sultan T. Al-Sedairy, Fatima Al-Shahrour, Malik Alawi, Monique Albert, Kenneth Aldape, Ludmil B. Alexandrov, Adrian Ally, Kathryn Alsop, Eva G. Alvarez, Fernanda Amary, Samirkumar B. Amin, Brice Aminou, Ole Ammerpohl, Matthew J. Anderson, Yeng Ang, Davide Antonello, Pavana Anur, Samuel Aparicio, Elizabeth L. Appelbaum, Yasuhito Arai, Axel Aretz, Koji Arihiro, Shun-ichi Ariizumi, Joshua Armenia, Laurent Arnould, Sylvia Asa, Yassen Assenov, Gurnit Atwal, Sietse Aukema, J. Todd Auman, Miriam R. Aure, Philip Awadalla, Marta Aymerich, Gary D. Bader, Adrian Baez-Ortega, Matthew H. Bailey, Peter J. Bailey, Miruna Balasundaram, Saianand Balu, Pratiti Bandopadhayay, Rosamonde E. Banks, Stefano Barbi, Andrew P. Barbour, Jonathan Barenboim, Jill Barnholtz-Sloan, Hugh Barr, Elisabet Barrera, John Bartlett, Javier Bartolome, Claudio Bassi, Oliver F. Bathe, Daniel Baumhoer, Prashant Bavi, Stephen B. Baylin, Wojciech Bazant, Duncan Beardsmore, Timothy A. Beck, Sam Behjati, Andreas Behren, Beifang Niu, Cindy Bell, Sergi Beltran, Christopher Benz, Andrew Berchuck, Anke K. Bergmann, Erik N. Bergstrom, Benjamin P. Berman, Daniel M. Berney, Stephan H. Bernhart, Rameen Beroukhim, Mario Berrios, Samantha Bersani, Johanna Bertl, Miguel Betancourt, Vinayak Bhandari, Shriram G. Bhosle, Andrew V. Biankin, Matthias Bieg, Darell Bigner, Hans Binder, Ewan Birney, Michael Birrer, Nidhan K. Biswas, Bodil Bjerkehagen, Tom Bodenheimer, Lori Boice, Giada Bonizzato, Johann S. De Bono, Arnoud Boot, Moiz S. Bootwalla, Ake Borg, Arndt Borkhardt, Keith A. Boroevich, Ivan Borozan, Christoph Borst, Marcus Bosenberg, Mattia Bosio, Jacqueline Boultwood, Guillaume Bourque, Paul C. Boutros, G. Steven Bova, David T. Bowen, Reanne Bowlby, David D. L. Bowtell, Sandrine Boyault, Rich Boyce, Jeffrey Boyd, Alvis Brazma, Paul Brennan, Daniel S. Brewer, Arie B. Brinkman, Robert G. Bristow, Russell R. Broaddus, Jane E. Brock, Malcolm Brock, Annegien Broeks, Angela N. Brooks, Denise Brooks, Benedikt Brors, Søren Brunak, Timothy J. C. Bruxner, Alicia L. Bruzos, Alex Buchanan, Ivo Buchhalter, Christiane Buchholz, Susan Bullman, Hazel Burke, Birgit Burkhardt, Kathleen H. Burns, John Busanovich, Carlos D. Bustamante, Adam P. Butler, Atul J. Butte, Niall J. Byrne, Anne-Lise Børresen-Dale, Samantha J. Caesar-Johnson, Andy Cafferkey, Declan Cahill, Claudia Calabrese, Carlos Caldas, Fabien Calvo, Niedzica Camacho, Peter J. Campbell, Elias Campo, Cinzia Cantù, Shaolong Cao, Thomas E. Carey, Joana Carlevaro-Fita, Rebecca Carlsen, Ivana Cataldo, Mario Cazzola, Jonathan Cebon, Robert Cerfolio, Dianne E. Chadwick, Dimple Chakravarty, Don Chalmers, Calvin Wing Yiu Chan, Kin Chan, Michelle Chan-Seng-Yue, Vishal S. Chandan, David K. Chang, Stephen J. Chanock, Lorraine A. Chantrill, Aurélien Chateigner, Nilanjan Chatterjee, Kazuaki Chayama, Hsiao-Wei Chen, Jieming Chen, Ken Chen, Yiwen Chen, Zhaohong Chen, Andrew D. Cherniack, Jeremy Chien, Yoke-Eng Chiew, Suet-Feung Chin, Juok Cho, Sunghoon Cho, Jung Kyoon Choi, Wan Choi, Christine Chomienne, Zechen Chong, Su Pin Choo, Angela Chou, Angelika N. Christ, Elizabeth L. Christie, Eric Chuah, Carrie Cibulskis, Kristian Cibulskis, Sara Cingarlini, Peter Clapham, Alexander Claviez, Sean Cleary, Nicole Cloonan, Marek Cmero, Colin C. Collins, Ashton A. Connor, Susanna L. Cooke, Colin S. Cooper, Leslie Cope, Vincenzo Corbo, Matthew G. Cordes, Stephen M. Cordner, Isidro Cortés-Ciriano, Kyle Covington, Prue A. Cowin, Brian Craft, David Craft, Chad J. Creighton, Yupeng Cun, Erin Curley, Ioana Cutcutache, Karolina Czajka, Bogdan Czerniak, Rebecca A. Dagg, Ludmila Danilova, Maria Vittoria Davi, Natalie R. Davidson, Helen Davies, Ian J. Davis, Brandi N. Davis-Dusenbery, Kevin J. Dawson, Francisco M. De La Vega, Ricardo De Paoli-Iseppi, Timothy Defreitas, Angelo P. Dei Tos, Olivier Delaneau, John A. Demchok, Jonas Demeulemeester, German M. Demidov, Deniz Demircioğlu, Nening M. Dennis, Robert E. Denroche, Stefan C. Dentro, Nikita Desai, Vikram Deshpande, Amit G. Deshwar, Christine Desmedt, Jordi Deu-Pons, Noreen Dhalla, Neesha C. Dhani, Priyanka Dhingra, Rajiv Dhir, Anthony DiBiase, Klev Diamanti, Li Ding, Shuai Ding, Huy Q. Dinh, Luc Dirix, HarshaVardhan Doddapaneni, Nilgun Donmez, Michelle T. Dow, Ronny Drapkin, Oliver Drechsel, Ruben M. Drews, Serge Serge, Tim Dudderidge, Ana Dueso-Barroso, Andrew J. Dunford, Michael Dunn, Lewis Jonathan Dursi, Fraser R. Duthie, Ken Dutton-Regester, Jenna Eagles, Douglas F. Easton, Stuart Edmonds, Paul A. Edwards, Sandra E. Edwards, Rosalind A. Eeles, Anna Ehinger, Juergen Eils, Roland Eils, Adel El-Naggar, Matthew Eldridge, Kyle Ellrott, Serap Erkek, Georgia Escaramis, Shadrielle M. G. Espiritu, Xavier Estivill, Dariush Etemadmoghadam, Jorunn E. Eyfjord, Bishoy M. Faltas, Daiming Fan, Yu Fan, William C. Faquin, Claudiu Farcas, Matteo Fassan, Aquila Fatima, Francesco Favero, Nodirjon Fayzullaev, Ina Felau, Sian Fereday, Martin L. Ferguson, Vincent Ferretti, Lars Feuerbach, Matthew A. Field, J. Lynn Fink, Gaetano Finocchiaro, Cyril Fisher, Matthew W. Fittall, Anna Fitzgerald, Rebecca C. Fitzgerald, Adrienne M. Flanagan, Neil E. Fleshner, Paul Flicek, John A. Foekens, Kwun M. Fong, Nuno A. Fonseca, Christopher S. Foster, Natalie S. Fox, Michael Fraser, Scott Frazer, Milana Frenkel-Morgenstern, William Friedman, Joan Frigola, Catrina C. Fronick, Akihiro Fujimoto, Masashi Fujita, Masashi Fukayama, Lucinda A. Fulton, Robert S. Fulton, Mayuko Furuta, P. Andrew Futreal, Anja Füllgrabe, Stacey B. Gabriel, Steven Gallinger, Carlo Gambacorti-Passerini, Jianjiong Gao, Shengjie Gao, Levi Garraway, Øystein Garred, Erik Garrison, Dale W. Garsed, Nils Gehlenborg, Josep L. l. Gelpi, Joshy George, Daniela S. Gerhard, Clarissa Gerhauser, Jeffrey E. Gershenwald, Mark B. Gerstein, Moritz Gerstung, Gad Getz, Mohammed Ghori, Ronald Ghossein, Nasra H. Giama, Richard A. Gibbs, Anthony J. Gill, Pelvender Gill, Dilip D. Giri, Dominik Glodzik, Vincent J. Gnanapragasam, Maria Elisabeth Goebler, Mary J. Goldman, Carmen Gomez, Santiago Gonzalez, Abel Gonzalez-Perez, Dmitry A. Gordenin, James Gossage, Kunihito Gotoh, Ramaswamy Govindan, Dorthe Grabau, Janet S. Graham, Robert C. Grant, Anthony R. Green, Eric Green, Liliana Greger, Nicola Grehan, Sonia Grimaldi, Sean M. Grimmond, Robert L. Grossman, Adam Grundhoff, Gunes Gundem, Qianyun Guo, Manaswi Gupta, Shailja Gupta, Ivo G. Gut, Marta Gut, Jonathan Göke, Gavin Ha, Andrea Haake, David Haan, Siegfried Haas, Kerstin Haase, James E. Haber, Nina Habermann, Faraz Hach, Syed Haider, Natsuko Hama, Freddie C. Hamdy, Anne Hamilton, Mark P. Hamilton, Leng Han, George B. Hanna, Martin Hansmann, Nicholas J. Haradhvala, Olivier Harismendy, Ivon Harliwong, Arif O. Harmanci, Eoghan Harrington, Takanori Hasegawa, David Haussler, Steve Hawkins, Shinya Hayami, Shuto Hayashi, D. Neil Hayes, Stephen J. Hayes, Nicholas K. Hayward, Steven Hazell, Yao He, Allison P. Heath, Simon C. Heath, David Hedley, Apurva M. Hegde, David I. Heiman, Michael C. Heinold, Zachary Heins, Lawrence E. Heisler, Eva Hellstrom-Lindberg, Mohamed Helmy, Seong Gu Heo, Austin J. Hepperla, José María Heredia-Genestar, Carl Herrmann, Peter Hersey, Julian M. Hess, Holmfridur Hilmarsdottir, Jonathan Hinton, Satoshi Hirano, Nobuyoshi Hiraoka, Katherine A. Hoadley, Asger Hobolth, Ermin Hodzic, Jessica I. Hoell, Steve Hoffmann, Oliver Hofmann, Andrea Holbrook, Aliaksei Z. Holik, Michael A. Hollingsworth, Oliver Holmes, Robert A. Holt, Chen Hong, Eun Pyo Hong, Jongwhi H. Hong, Gerrit K. Hooijer, Henrik Hornshøj, Fumie Hosoda, Yong Hou, Volker Hovestadt, William Howat, Alan P. Hoyle, Ralph H. Hruban, Jianhong Hu, Taobo Hu, Xing Hua, Kuan-lin Huang, Mei Huang, Mi Ni Huang, Vincent Huang, Yi Huang, Wolfgang Huber, Thomas J. Hudson, Michael Hummel, Jillian A. Hung, David Huntsman, Ted R. Hupp, Jason Huse, Matthew R. Huska, Barbara Hutter, Carolyn M. Hutter, Daniel Hübschmann, Christine A. Iacobuzio-Donahue, Charles David Imbusch, Marcin Imielinski, Seiya Imoto, William B. Isaacs, Keren Isaev, Shumpei Ishikawa, Murat Iskar, S. M. Ashiqul Islam, Michael Ittmann, Sinisa Ivkovic, Jose M. G. Izarzugaza, Jocelyne Jacquemier, Valerie Jakrot, Nigel B. Jamieson, Gun Ho Jang, Se Jin Jang, Joy C. Jayaseelan, Reyka Jayasinghe, Stuart R. Jefferys, Karine Jegalian, Jennifer L. Jennings, Seung-Hyup Jeon, Lara Jerman, Yuan Ji, Wei Jiao, Peter A. Johansson, Amber L. Johns, Jeremy Johns, Rory Johnson, Todd A. Johnson, Clemency Jolly, Yann Joly, Jon G. Jonasson, Corbin D. Jones, David R. Jones, David T. W. Jones, Nic Jones, Steven J. M. Jones, Jos Jonkers, Young Seok Ju, Hartmut Juhl, Jongsun Jung, Malene Juul, Randi Istrup Juul, Sissel Juul, Natalie Jäger, Rolf Kabbe, Andre Kahles, Abdullah Kahraman, Vera B. Kaiser, Hojabr Kakavand, Sangeetha Kalimuthu, Christof von Kalle, Koo Jeong Kang, Katalin Karaszi, Beth Karlan, Rosa Karlić, Dennis Karsch, Katayoon Kasaian, Karin S. Kassahn, Hitoshi Katai, Mamoru Kato, Hiroto Katoh, Yoshiiku Kawakami, Jonathan D. Kay, Stephen H. Kazakoff, Marat D. Kazanov, Maria Keays, Electron Kebebew, Richard F. Kefford, Manolis Kellis, James G. Kench, Catherine J. Kennedy, Jules N. A. Kerssemakers, David Khoo, Vincent Khoo, Narong Khuntikeo, Ekta Khurana, Helena Kilpinen, Hark Kyun Kim, Hyung-Lae Kim, Hyung-Yong Kim, Hyunghwan Kim, Jaegil Kim, Jihoon Kim, Jong K. Kim, Youngwook Kim, Tari A. King, Wolfram Klapper, Kortine Kleinheinz, Leszek J. Klimczak, Stian Knappskog, Michael Kneba, Bartha M. Knoppers, Youngil Koh, Jan Komorowski, Daisuke Komura, Mitsuhiro Komura, Gu Kong, Marcel Kool, Jan O. Korbel, Viktoriya Korchina, Andrey Korshunov, Michael Koscher, Roelof Koster, Zsofia Kote-Jarai, Antonios Koures, Milena Kovacevic, Barbara Kremeyer, Helene Kretzmer, Markus Kreuz, Savitri Krishnamurthy, Dieter Kube, Kiran Kumar, Pardeep Kumar, Sushant Kumar, Yogesh Kumar, Ritika Kundra, Kirsten Kübler, Ralf Küppers, Jesper Lagergren, Phillip H. Lai, Peter W. Laird, Sunil R. Lakhani, Christopher M. Lalansingh, Emilie Lalonde, Fabien C. Lamaze, Adam Lambert, Eric Lander, Pablo Landgraf, Luca Landoni, Anita Langerød, Andrés Lanzós, Denis Larsimont, Erik Larsson, Mark Lathrop, Loretta M. S. Lau, Chris Lawerenz, Rita T. Lawlor, Michael S. Lawrence, Alexander J. Lazar, Xuan Le, Darlene Lee, Donghoon Lee, Eunjung Alice Lee, Hee Jin Lee, Jake June-Koo Lee, Jeong-Yeon Lee, Juhee Lee, Ming Ta Michael Lee, Henry Lee-Six, Kjong-Van Lehmann, Hans Lehrach, Dido Lenze, Conrad R. Leonard, Daniel A. Leongamornlert, Ignaty Leshchiner, Louis Letourneau, Ivica Letunic, Douglas A. Levine, Lora Lewis, Tim Ley, Chang Li, Constance H. Li, Haiyan Irene Li, Jun Li, Lin Li, Shantao Li, Siliang Li, Xiaobo Li, Xiaotong Li, Xinyue Li, Yilong Li, Han Liang, Sheng-Ben Liang, Peter Lichter, Pei Lin, Ziao Lin, W. M. Linehan, Ole Christian Lingjærde, Dongbing Liu, Eric Minwei Liu, Fei-Fei Liu, Fenglin Liu, Jia Liu, Xingmin Liu, Julie Livingstone, Dimitri Livitz, Naomi Livni, Lucas Lochovsky, Markus Loeffler, Georgina V. Long, Armando Lopez-Guillermo, Shaoke Lou, David N. Louis, Laurence B. Lovat, Yiling Lu, Yong-Jie Lu, Youyong Lu, Claudio Luchini, Ilinca Lungu, Xuemei Luo, Hayley J. Luxton, Andy G. Lynch, Lisa Lype, Cristina López, Carlos López-Otín, Eric Z. Ma, Yussanne Ma, Gaetan MacGrogan, Shona MacRae, Geoff Macintyre, Tobias Madsen, Kazuhiro Maejima, Andrea Mafficini, Dennis T. Maglinte, Arindam Maitra, Partha P. Majumder, Luca Malcovati, Salem Malikic, Giuseppe Malleo, Graham J. Mann, Luisa Mantovani-Löffler, Kathleen Marchal, Giovanni Marchegiani, Elaine R. Mardis, Adam A. Margolin, Maximillian G. Marin, Florian Markowetz, Julia Markowski, Jeffrey Marks, Tomas Marques-Bonet, Marco A. Marra, Luke Marsden, John W. M. Martens, Sancha Martin, Jose I. Martin-Subero, Iñigo Martincorena, Alexander Martinez-Fundichely, Yosef E. Maruvka, R. Jay Mashl, Charlie E. Massie, Thomas J. Matthew, Lucy Matthews, Erik Mayer, Simon Mayes, Michael Mayo, Faridah Mbabaali, Karen McCune, Ultan McDermott, Patrick D. McGillivray, Michael D. McLellan, John D. McPherson, John R. McPherson, Treasa A. McPherson, Samuel R. Meier, Alice Meng, Shaowu Meng, Andrew Menzies, Neil D. Merrett, Sue Merson, Matthew Meyerson, William U. Meyerson, Piotr A. Mieczkowski, George L. Mihaiescu, Sanja Mijalkovic, Ana Mijalkovic Mijalkovic-Lazic, Tom Mikkelsen, Michele Milella, Linda Mileshkin, Christopher A. Miller, David K. Miller, Jessica K. Miller, Gordon B. Mills, Ana Milovanovic, Sarah Minner, Marco Miotto, Gisela Mir Arnau, Lisa Mirabello, Chris Mitchell, Thomas J. Mitchell, Satoru Miyano, Naoki Miyoshi, Shinichi Mizuno, Fruzsina Molnár-Gábor, Malcolm J. Moore, Richard A. Moore, Sandro Morganella, Quaid D. Morris, Carl Morrison, Lisle E. Mose, Catherine D. Moser, Ferran Muiños, Loris Mularoni, Andrew J. Mungall, Karen Mungall, Elizabeth A. Musgrove, Ville Mustonen, David Mutch, Francesc Muyas, Donna M. Muzny, Alfonso Muñoz, Jerome Myers, Ola Myklebost, Peter Möller, Genta Nagae, Adnan M. Nagrial, Hardeep K. Nahal-Bose, Hitoshi Nakagama, Hidewaki Nakagawa, Hiromi Nakamura, Toru Nakamura, Kaoru Nakano, Tannistha Nandi, Jyoti Nangalia, Mia Nastic, Arcadi Navarro, Fabio C. P. Navarro, David E. Neal, Gerd Nettekoven, Felicity Newell, Steven J. Newhouse, Yulia Newton, Alvin Wei Tian Ng, Anthony Ng, Jonathan Nicholson, David Nicol, Yongzhan Nie, G. Petur Nielsen, Morten Muhlig Nielsen, Serena Nik-Zainal, Michael S. Noble, Katia Nones, Paul A. Northcott, Faiyaz Notta, Brian D. O’Connor, Peter O’Donnell, Maria O’Donovan, Sarah O’Meara, Brian Patrick O’Neill, J. Robert O’Neill, David Ocana, Angelica Ochoa, Layla Oesper, Christopher Ogden, Hideki Ohdan, Kazuhiro Ohi, Lucila Ohno-Machado, Karin A. Oien, Akinyemi I. Ojesina, Hidenori Ojima, Takuji Okusaka, Larsson Omberg, Choon Kiat Ong, Stephan Ossowski, German Ott, B. F. Francis Ouellette, Christine P’ng, Marta Paczkowska, Salvatore Paiella, Chawalit Pairojkul, Marina Pajic, Qiang Pan-Hammarström, Elli Papaemmanuil, Irene Papatheodorou, Nagarajan Paramasivam, Ji Wan Park, Joong-Won Park, Keunchil Park, Kiejung Park, Peter J. Park, Joel S. Parker, Simon L. Parsons, Harvey Pass, Danielle Pasternack, Alessandro Pastore, Ann-Marie Patch, Iris Pauporté, Antonio Pea, John V. Pearson, Chandra Sekhar Pedamallu, Jakob Skou Pedersen, Paolo Pederzoli, Martin Peifer, Nathan A. Pennell, Charles M. Perou, Marc D. Perry, Gloria M. Petersen, Myron Peto, Nicholas Petrelli, Robert Petryszak, Stefan M. Pfister, Mark Phillips, Oriol Pich, Hilda A. Pickett, Todd D. Pihl, Nischalan Pillay, Sarah Pinder, Mark Pinese, Andreia V. Pinho, Esa Pitkänen, Xavier Pivot, Elena Piñeiro-Yáñez, Laura Planko, Christoph Plass, Paz Polak, Tirso Pons, Irinel Popescu, Olga Potapova, Aparna Prasad, Shaun R. Preston, Manuel Prinz, Antonia L. Pritchard, Stephenie D. Prokopec, Elena Provenzano, Xose S. Puente, Sonia Puig, Montserrat Puiggròs, Sergio Pulido-Tamayo, Gulietta M. Pupo, Colin A. Purdie, Michael C. Quinn, Raquel Rabionet, Janet S. Rader, Bernhard Radlwimmer, Petar Radovic, Benjamin Raeder, Keiran M. Raine, Manasa Ramakrishna, Kamna Ramakrishnan, Suresh Ramalingam, Benjamin J. Raphael, W. Kimryn Rathmell, Tobias Rausch, Guido Reifenberger, Jüri Reimand, Jorge Reis-Filho, Victor Reuter, Iker Reyes-Salazar, Matthew A. Reyna, Sheila M. Reynolds, Esther Rheinbay, Yasser Riazalhosseini, Andrea L. Richardson, Julia Richter, Matthew Ringel, Markus Ringnér, Yasushi Rino, Karsten Rippe, Jeffrey Roach, Lewis R. Roberts, Nicola D. Roberts, Steven A. Roberts, A. Gordon Robertson, Alan J. Robertson, Javier Bartolomé Rodriguez, Bernardo Rodriguez-Martin, F. Germán Rodríguez-González, Michael H. A. Roehrl, Marius Rohde, Hirofumi Rokutan, Gilles Romieu, Ilse Rooman, Tom Roques, Daniel Rosebrock, Mara Rosenberg, Philip C. Rosenstiel, Andreas Rosenwald, Edward W. Rowe, Romina Royo, Steven G. Rozen, Yulia Rubanova, Mark A. Rubin, Carlota Rubio-Perez, Vasilisa A. Rudneva, Borislav C. Rusev, Andrea Ruzzenente, Gunnar Rätsch, Radhakrishnan Sabarinathan, Veronica Y. Sabelnykova, Sara Sadeghi, S. Cenk Sahinalp, Natalie Saini, Mihoko Saito-Adachi, Gordon Saksena, Adriana Salcedo, Roberto Salgado, Leonidas Salichos, Richard Sallari, Charles Saller, Roberto Salvia, Michelle Sam, Jaswinder S. Samra, Francisco Sanchez-Vega, Chris Sander, Grant Sanders, Rajiv Sarin, Iman Sarrafi, Aya Sasaki-Oku, Torill Sauer, Guido Sauter, Robyn P. M. Saw, Maria Scardoni, Christopher J. Scarlett, Aldo Scarpa, Ghislaine Scelo, Dirk Schadendorf, Jacqueline E. Schein, Markus B. Schilhabel, Matthias Schlesner, Thorsten Schlomm, Heather K. Schmidt, Sarah-Jane Schramm, Stefan Schreiber, Nikolaus Schultz, Steven E. Schumacher, Roland F. Schwarz, Richard A. Scolyer, David Scott, Ralph Scully, Raja Seethala, Ayellet V. Segre, Iris Selander, Colin A. Semple, Yasin Senbabaoglu, Subhajit Sengupta, Elisabetta Sereni, Stefano Serra, Dennis C. Sgroi, Mark Shackleton, Nimish C. Shah, Sagedeh Shahabi, Catherine A. Shang, Ping Shang, Ofer Shapira, Troy Shelton, Ciyue Shen, Hui Shen, Rebecca Shepherd, Ruian Shi, Yan Shi, Yu-Jia Shiah, Tatsuhiro Shibata, Juliann Shih, Eigo Shimizu, Kiyo Shimizu, Seung Jun Shin, Yuichi Shiraishi, Tal Shmaya, Ilya Shmulevich, Solomon I. Shorser, Charles Short, Raunak Shrestha, Suyash S. Shringarpure, Craig Shriver, Shimin Shuai, Nikos Sidiropoulos, Reiner Siebert, Anieta M. Sieuwerts, Lina Sieverling, Sabina Signoretti, Katarzyna O. Sikora, Michele Simbolo, Ronald Simon, Janae V. Simons, Jared T. Simpson, Peter T. Simpson, Samuel Singer, Nasa Sinnott-Armstrong, Payal Sipahimalani, Tara J. Skelly, Marcel Smid, Jaclyn Smith, Karen Smith-McCune, Nicholas D. Socci, Heidi J. Sofia, Matthew G. Soloway, Lei Song, Anil K. Sood, Sharmila Sothi, Christos Sotiriou, Cameron M. Soulette, Paul N. Span, Paul T. Spellman, Nicola Sperandio, Andrew J. Spillane, Oliver Spiro, Jonathan Spring, Johan Staaf, Peter F. Stadler, Peter Staib, Stefan G. Stark, Lucy Stebbings, Ólafur Andri Stefánsson, Oliver Stegle, Lincoln D. Stein, Alasdair Stenhouse, Chip Stewart, Stephan Stilgenbauer, Miranda D. Stobbe, Michael R. Stratton, Jonathan R. Stretch, Adam J. Struck, Joshua M. Stuart, Henk G. Stunnenberg, Hong Su, Xiaoping Su, Ren X. Sun, Stephanie Sungalee, Hana Susak, Akihiro Suzuki, Fred Sweep, Monika Szczepanowski, Holger Sültmann, Takashi Yugawa, Angela Tam, David Tamborero, Benita Kiat Tee Tan, Donghui Tan, Patrick Tan, Hiroko Tanaka, Hirokazu Taniguchi, Tomas J. Tanskanen, Maxime Tarabichi, Roy Tarnuzzer, Patrick Tarpey, Morgan L. Taschuk, Kenji Tatsuno, Simon Tavaré, Darrin F. Taylor, Amaro Taylor-Weiner, Jon W. Teague, Bin Tean Teh, Varsha Tembe, Javier Temes, Kevin Thai, Sarah P. Thayer, Nina Thiessen, Gilles Thomas, Sarah Thomas, Alan Thompson, Alastair M. Thompson, John F. Thompson, R. Houston Thompson, Heather Thorne, Leigh B. Thorne, Adrian Thorogood, Grace Tiao, Nebojsa Tijanic, Lee E. Timms, Roberto Tirabosco, Marta Tojo, Stefania Tommasi, Christopher W. Toon, Umut H. Toprak, David Torrents, Giampaolo Tortora, Jörg Tost, Yasushi Totoki, David Townend, Nadia Traficante, Isabelle Treilleux, Jean-Rémi Trotta, Lorenz H. P. Trümper, Ming Tsao, Tatsuhiko Tsunoda, Jose M. C. Tubio, Olga Tucker, Richard Turkington, Daniel J. Turner, Andrew Tutt, Masaki Ueno, Naoto T. Ueno, Christopher Umbricht, Husen M. Umer, Timothy J. Underwood, Lara Urban, Tomoko Urushidate, Tetsuo Ushiku, Liis Uusküla-Reimand, Alfonso Valencia, David J. Van Den Berg, Steven Van Laere, Peter Van Loo, Erwin G. Van Meir, Gert G. Van den Eynden, Theodorus Van der Kwast, Naveen Vasudev, Miguel Vazquez, Ravikiran Vedururu, Umadevi Veluvolu, Shankar Vembu, Lieven P. C. Verbeke, Peter Vermeulen, Clare Verrill, Alain Viari, David Vicente, Caterina Vicentini, K. Vijay Raghavan, Juris Viksna, Ricardo E. Vilain, Izar Villasante, Anne Vincent-Salomon, Tapio Visakorpi, Douglas Voet, Paresh Vyas, Ignacio Vázquez-García, Nick M. Waddell, Nicola Waddell, Claes Wadelius, Lina Wadi, Rabea Wagener, Jeremiah A. Wala, Jian Wang, Jiayin Wang, Linghua Wang, Qi Wang, Wenyi Wang, Yumeng Wang, Zhining Wang, Paul M. Waring, Hans-Jörg Warnatz, Jonathan Warrell, Anne Y. Warren, Sebastian M. Waszak, David C. Wedge, Dieter Weichenhan, Paul Weinberger, John N. Weinstein, Joachim Weischenfeldt, Daniel J. Weisenberger, Ian Welch, Michael C. Wendl, Johannes Werner, Justin P. Whalley, David A. Wheeler, Hayley C. Whitaker, Dennis Wigle, Matthew D. Wilkerson, Ashley Williams, James S. Wilmott, Gavin W. Wilson, Julie M. Wilson, Richard K. Wilson, Boris Winterhoff, Jeffrey A. Wintersinger, Maciej Wiznerowicz, Stephan Wolf, Bernice H. Wong, Tina Wong, Winghing Wong, Youngchoon Woo, Scott Wood, Bradly G. Wouters, Adam J. Wright, Derek W. Wright, Mark H. Wright, Chin-Lee Wu, Dai-Ying Wu, Guanming Wu, Jianmin Wu, Kui Wu, Yang Wu, Zhenggang Wu, Liu Xi, Tian Xia, Qian Xiang, Xiao Xiao, Rui Xing, Heng Xiong, Qinying Xu, Yanxun Xu, Hong Xue, Shinichi Yachida, Sergei Yakneen, Rui Yamaguchi, Takafumi N. Yamaguchi, Masakazu Yamamoto, Shogo Yamamoto, Hiroki Yamaue, Fan Yang, Huanming Yang, Jean Y. Yang, Liming Yang, Lixing Yang, Shanlin Yang, Tsun-Po Yang, Yang Yang, Xiaotong Yao, Marie-Laure Yaspo, Lucy Yates, Christina Yau, Chen Ye, Kai Ye, Venkata D. Yellapantula, Christopher J. Yoon, Sung-Soo Yoon, Fouad Yousif, Jun Yu, Kaixian Yu, Willie Yu, Yingyan Yu, Ke Yuan, Yuan Yuan, Denis Yuen, Takashi Yugawa, Christina K. Yung, Olga Zaikova, Jorge Zamora, Marc Zapatka, Jean C. Zenklusen, Thorsten Zenz, Nikolajs Zeps, Cheng-Zhong Zhang, Fan Zhang, Hailei Zhang, Hongwei Zhang, Hongxin Zhang, Jiashan Zhang, Jing Zhang, Junjun Zhang, Xiuqing Zhang, Xuanping Zhang, Yan Zhang, Zemin Zhang, Zhongming Zhao, Liangtao Zheng, Xiuqing Zheng, Wanding Zhou, Yong Zhou, Bin Zhu, Hongtu Zhu, Jingchun Zhu, Shida Zhu, Lihua Zou, Xueqing Zou, Anna deFazio, Nicholas van As, Carolien H. M. van Deurzen, Marc J. van de Vijver, L. van’t Veer, and Christian von Mering

Supplementary information

Supplementary information is available for this paper at 10.1038/s41467-020-18151-y.

References

  • 1.Radenbaugh, A. J. et al. RADIA: RNA and DNA integrated analysis for somatic mutation detection. PLoS ONE9, e111516 (2014). [DOI] [PMC free article] [PubMed]
  • 2.Koboldt DC, et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22:568–576. doi: 10.1101/gr.129684.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Fan Y, et al. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol. 2016;17:178. doi: 10.1186/s13059-016-1029-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Cibulskis K, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 2013;31:213–219. doi: 10.1038/nbt.2514. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009;25:2865–2871. doi: 10.1093/bioinformatics/btp394. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Ye K, et al. Systematic discovery of complex insertions and deletions in human cancers. Nat. Med. 2016;22:97–104. doi: 10.1038/nm.4002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Chapman MA, et al. Initial genome sequencing and analysis of multiple myeloma. Nature. 2011;471:467–472. doi: 10.1038/nature09837. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Larson DE, et al. Somaticsniper: Identification of somatic point mutations in whole genome sequencing data. Bioinformatics. 2012;28:311–317. doi: 10.1093/bioinformatics/btr665. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Moncunill V, et al. Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads. Nat. Biotechnol. 2014;32:1106–1112. doi: 10.1038/nbt.3027. [DOI] [PubMed] [Google Scholar]
  • 10.The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Network. Pan-cancer analysis of whole genomes. Nature578, 82–93 (2019).
  • 11.Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet. 2014;15:121–132. doi: 10.1038/nrg3642. [DOI] [PubMed] [Google Scholar]
  • 12.Chilamakuri, C. S. R. et al. Performance comparison of four exome capture systems for deep sequencing. BMC Genomics15, 449 (2014). [DOI] [PMC free article] [PubMed]
  • 13.Belkadi A, et al. Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proc. Natl Acad. Sci. USA. 2015;112:5473–5478. doi: 10.1073/pnas.1418631112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Spencer DH, et al. Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data. J. Mol. Diagnostics. 2014;16:75–88. doi: 10.1016/j.jmoldx.2013.09.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Alfares A, et al. Whole-genome sequencing offers additional but limited clinical utility compared with reanalysis of whole-exome sequencing. Genet. Med. 2018;20:1328–1333. doi: 10.1038/gim.2018.41. [DOI] [PubMed] [Google Scholar]
  • 16.Phallen J, et al. Early noninvasive detection of response to targeted therapy in non–small cell lung cancer. Cancer Res. 2019;79:1204–1213. doi: 10.1158/0008-5472.CAN-18-1082. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Głodzik D, et al. Mutational mechanisms of amplifications revealed by analysis of clustered rearrangements in breast cancers. Ann. Oncol. 2018;29:2223–2231. doi: 10.1093/annonc/mdy404. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Ren S, et al. Whole-genome and transcriptome sequencing of prostate cancer identify new genetic alterations driving disease progression [Figure presented] Eur. Urol. 2018;73:322–339. doi: 10.1016/j.eururo.2017.08.027. [DOI] [PubMed] [Google Scholar]
  • 19.Zhang S, et al. Genome evolution analysis of recurrent testicular malignant mesothelioma by whole-genome sequencing. Cell. Physiol. Biochem. 2018;45:163–174. doi: 10.1159/000486355. [DOI] [PubMed] [Google Scholar]
  • 20.Nik-Zainal S, et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature. 2016;534:47–54. doi: 10.1038/nature17676. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Mendoza-Alvarez A, et al. Whole-exome sequencing identifies somatic mutations associated with mortality in metastatic clear cell kidney carcinoma. Front. Genet. 2019;10:1–10. doi: 10.3389/fgene.2019.00439. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Yan P, et al. Whole exome sequencing of ulcerative colitis-associated colorectal cancer based on novel somatic mutations identified in Chinese patients. Inflamm. Bowel Dis. 2019;25:1293–1301. doi: 10.1093/ibd/izz020. [DOI] [PubMed] [Google Scholar]
  • 23.Mogensen, M. B. et al. Genomic alterations accompanying tumour evolution in colorectal cancer: tracking the differences between primary tumours and synchronous liver metastases by whole-exome sequencing. BMC Cancer18, 752 (2018). [DOI] [PMC free article] [PubMed]
  • 24.Sponziello M, et al. Whole exome sequencing identifies a germline MET mutation in two siblings with hereditary wild-type RET medullary thyroid cancer. Hum. Mutat. 2018;39:371–377. doi: 10.1002/humu.23378. [DOI] [PubMed] [Google Scholar]
  • 25.Schwarze K, Buchanan J, Taylor JC, Wordsworth S. Are whole-exome and whole-genome sequencing approaches cost-effective? A systematic review of the literature. Genet. Med. 2018;20:1122–1130. doi: 10.1038/gim.2017.247. [DOI] [PubMed] [Google Scholar]
  • 26.Choi M, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc. Natl Acad. Sci. USA. 2009;106:19096–19101. doi: 10.1073/pnas.0910672106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Ellrott K, et al. Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 2018;6:271–281.e7. doi: 10.1016/j.cels.2018.03.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Bailey MH, et al. Comprehensive characterization of cancer driver genes and mutations. Cell. 2018;173:371–385.e18. doi: 10.1016/j.cell.2018.02.060. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Meienberg, J. et al. New insights into the performance of human whole-exome capture platforms. Nucleic Acids Res.43, e76 (2015). [DOI] [PMC free article] [PubMed]
  • 30.Clark MJ, et al. Performance comparison of exome DNA sequencing technologies. Nat. Biotechnol. 2011;29:908–916. doi: 10.1038/nbt.1975. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.McLaren W, et al. Deriving the consequences of genomic variants with the Ensembl API and SNP effect predictor. Bioinformatics. 2010;26:2069–2070. doi: 10.1093/bioinformatics/btq330. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Kircher M, et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 2014;46:310–315. doi: 10.1038/ng.2892. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Rheinbay, E. et al. Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature578, 102–111 (2019). [DOI] [PMC free article] [PubMed]
  • 34.Imielinski M, Guo G, Meyerson M. Insertions and deletions target lineage-defining genes in human cancers. Cell. 2017;168:460–472.e14. doi: 10.1016/j.cell.2016.12.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Dees ND, et al. MuSiC: Identifying mutational significance in cancer genomes. Genome Res. 2012;22:1589–1598. doi: 10.1101/gr.134635.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Auslander N, et al. An integrated computational and experimental study uncovers FUT 9 as a metabolic driver of colorectal cancer. Mol. Syst. Biol. 2017;13:956. doi: 10.15252/msb.20177739. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Neph S, et al. BEDOPS: High-performance genomic feature operations. Bioinformatics. 2012;28:1919–1920. doi: 10.1093/bioinformatics/bts277. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Carter SL, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 2012;30:413–421. doi: 10.1038/nbt.2203. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Chang, W., Cheng, J., Allaire, J., Xie, Y. & McPherson, J. shiny: Web Application Framework for R. R Packag (2016).
  • 40.Kasar, S. et al. Whole-genome sequencing reveals activation-induced cytidine deaminase signatures during indolent chronic lymphocytic leukaemia evolution. Nat. Commun. 6, 1–12 (2015). [DOI] [PMC free article] [PubMed]
  • 41.Conway JR, Lex A, Gehlenborg N. UpSetR: An R package for the visualization of intersecting sets and their properties. Bioinformatics. 2017;33:2938–2940. doi: 10.1093/bioinformatics/btx364. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016 doi: 10.1093/bioinformatics/btw313. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Information (776.7KB, pdf)
41467_2020_18151_MOESM2_ESM.pdf (62KB, pdf)

Description of Additional Supplementary Files

Supplementary Data 1 (93.3KB, xlsx)
Supplementary Data 2 (9.7KB, xlsx)
Supplementary Data 3 (24.9KB, xlsx)
Supplementary Data 4 (125.9KB, xlsx)
Supplementary Data 5 (10.1KB, xlsx)
Supplementary Data 6 (46.3KB, xlsx)
Reporting Summary (170.3KB, pdf)

Data Availability Statement

Somatic and germline variant calls, mutational signatures, subclonal reconstructions, transcript abundance, splice calls and other core data generated by the ICGC/TCGA Pan-cancer Analysis of Whole Genomes Consortium are described in another publication10 and available for download at https://dcc.icgc.org/releases/PCAWG. Additional information on accessing the data, including raw read files, can be found at https://docs.icgc.org/pcawg/data/. In accordance with the data access policies of the ICGC and TCGA projects, most molecular, clinical and specimen data are in an open tier which does not require access approval. To access potentially identification information, such as germline alleles and underlying sequencing data, researchers will need to apply to the TCGA Data Access Committee (DAC) via dbGaP (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) for access to the TCGA portion of the dataset, and to the ICGC Data Access Compliance Office (DACO; http://icgc.org/daco) for the ICGC portion. In addition, to access somatic single nucleotide variants derived from TCGA donors, researchers will also need to obtain dbGaP authorization. Additional links to data resources used for this project can be found using the following urls: MC3 public MAF file (https://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc), PCAWG Public MAF file (https://www.synapse.org/#!Synapse:syn7364923), bed files used for exome restrictions (MC3) (https://api.gdc.cancer.gov/data/7f0d3ab9-8bef-4e3b-928a-6090caae885b), bed files used for exome restrictions (THE BROAD) (https://api.gdc.cancer.gov/data/b1e303a5-a542-4389-8ddb-1d151218be75), wiggle files MC3 (https://www.synapse.org/#!Synapse:syn21785741), wiggle files PCAWG (https://www.synapse.org/#!Synapse:syn8492850), clonality files MC3 (https://www.synapse.org/#!Synapse:syn7870168), clonality files PCAWG (https://www.synapse.org/#!Synapse:syn8532460), cancer subtypes and histological data (https://www.synapse.org/#!Synapse:syn4983466), MAFit online comparison tool (https://mbailey.shinyapps.io/MAFit/), GitHub Repo (https://github.com/ding-lab/mc3_icgc_variant_pipeline), side-by-side MC3-PCAWG comparison (MAF-like) (https://www.synapse.org/#!Synapse:syn21041380). The remaining data is available within the Article, Supplementary Information or available from the authors upon reasonable request.

A public GitHub repository (under an MIT opensource license) at https://github.com/ding-lab/mc3_icgc_variant_pipeline furnishes work-flows, scripts, figures, and computational tools used to assess mutation concordance between maf files. For this project we refrained from accessing alignment files, i.e., BAM files or fastq files. Furthermore, due to the memory footprint of mutation and coverage files we have not included them in the repository. Thus, by removing core data files from the repository the software provided supplies the user with processes and decisions, not a fully automated tool for deployment on additional datasets. In addition to our analysis, the core computational pipelines used by the PCAWG Consortium for alignment, quality control and variant calling are available to the public at https://dockstore.org/search?search=pcawg under the GNU General Public License v3.0, which allows for reuse and distribution.


Articles from Nature Communications are provided here courtesy of Nature Publishing Group

RESOURCES