Skip to main content
PLOS Computational Biology logoLink to PLOS Computational Biology
. 2021 Oct 19;17(10):e1009542. doi: 10.1371/journal.pcbi.1009542

A data-driven approach for constructing mutation categories for mutational signature analysis

Gal Gilad 1, Mark D M Leiserson 2, Roded Sharan 1,*
Editor: Pedro Mendes3
PMCID: PMC8555780  PMID: 34665813

Abstract

Mutational processes shape the genomes of cancer patients and their understanding has important applications in diagnosis and treatment. Current modeling of mutational processes by identifying their characteristic signatures views each base substitution in a limited context of a single flanking base on each side. This context definition gives rise to 96 categories of mutations that have become the standard in the field, even though wider contexts have been shown to be informative in specific cases. Here we propose a data-driven approach for constructing a mutation categorization for mutational signature analysis. Our approach is based on the assumption that tumor cells that are exposed to similar mutational processes, show similar expression levels of DNA damage repair genes that are involved in these processes. We attempt to find a categorization that maximizes the agreement between mutation and gene expression data, and show that it outperforms the standard categorization over multiple quality measures. Moreover, we show that the categorization we identify generalizes to unseen data from different cancer types, suggesting that mutation context patterns extend beyond the immediate flanking bases.

Author summary

Cancer is a group of genetic diseases that occur as a result of an accumulation of somatic mutations in genes that regulate cellular growth and differentiation. These mutations arise from mutagenic processes such as exposure to environmental mutagens and defective DNA damage repair pathways. Each of these processes results in a characteristic pattern of mutations, referred to as a mutational signature. These signatures reveal the mutagenic mechanisms that have influenced the development of a specific tumor, and thus provide new insights into its causes and potential treatments. Originally, a mutational signature has been defined using 96 mutation categories that take into account solely the information from the mutated base and its flanking bases. Here, we aim to challenge this arbitrary categorization, which is widely used in mutational signature analysis. We have developed a novel framework for the construction of mutation categories that is based on the assumption that the activities of DNA damage repair genes are correlated with the mutational processes that are active in a given tumor. We show that using this approach we are able to identify an alternative mutation categorization that outperforms the standard categorization with respect to multiple metrics. This categorization includes categories that account for bases that extend beyond the immediate flanking bases, suggesting that mutational signatures should be studied in broader sequence contexts.


This is a PLOS Computational Biology Methods paper.

Introduction

Mutational signatures are characteristic combinations of mutation types (here referred to as categories) that arise from a specific mutational process. The analysis of the activity (here referred to as exposure) of these signatures in cancer provides insight to the active biological mechanisms that drive it. The inference of these mutational signatures is based on mutation category counts in a set of samples, acquired via whole-genome or whole-exome sequencing. A common way to do this inference is by employing non-negative matrix factorization (NMF) to express the mutation count matrix as the product of an exposure matrix (where each row represents signature exposures in the corresponding sample) and a signature matrix (where each row is a probability vector over mutation categories, representing a signature).

Current research in the field of mutational signatures typically categorizes mutations to 96 categories (referred to here as the standard categorization), taking into account the point mutation and a single flanking base on each side of it (6 base substitution classes and 4 possible flanking bases on each side) [1, 2]. This limited context ensures that mutation counts per category are not too sparse for downstream analysis, but may be too restrictive. Nevertheless, these categories served to discover state-of-the-art mutational signatures [3] as cataloged in the COSMIC database.

Several authors [35] considered also a wider context of two flanking bases on each side of the mutation, and analyzed mutational signatures using the resulting 1536 categories. Similarly, Shirashi et al. [6] modeled mutational signatures using independent “features” rather than categories, which allowed them to effectively consider two flanking base pairs on each side in their analysis. Nevertheless, these variants of the standard categorization were arbitrary (built on the mutation and 1–2 flanking based on each side) and were not derived systematically from data. Moreover, there is evidence in the literature for higher order mutation contexts [7].

Here we propose a data-driven approach for constructing a mutation categorization for mutational signature analysis. Our approach is based on the assumption that tumor cells that are exposed to similar mutational processes, show similar expression levels of genes that are involved in the underlying mutagenic processes. In particular, we focus on DNA damage repair (DDR) genes, as about one third of mutational processes with known etiology are associated with DNA repair deficiency [3]. The hypothesis that mutational signatures can be related to aberrant gene expression or alterations in DNA repair genes is well supported. For example, the deactivation of MUTYH gene in cancer patients is associated with a specific mutational signature [810]. In addition, Kim et al. [11] identify significant correlations between DDR gene expression and the activity of multiple mutational signatures. We develop an algorithm to find a categorization that maximizes the agreement between mutation and DDR gene expression data, and show that it outperforms the standard categorization over multiple quality measures. We build a data-driven categorization (DDC) of mutations that considers up to three bases away from the mutated base. We show that our categorization leads to improved performance in signature discovery and expression correlation across multiple cancer types. We further compare the identified signatures to those cataloged in COSMIC and analyze their etiologies.

Materials and methods

We consider a wide context for mutations, which consists of three flanking bases on each side of the mutated base. Representing a 7-mer mutation sequence by an 8-long character string (consisting of three 5’ flanking bases, the mutated base, the new base, and the three 3’ flanking bases) we aim to group mutations to 96 categories based on their sequence patterns. The identification and evaluation of mutation categorizations is based on the following measures of categorization quality (described in detail below): (1) agreement between mutation data and DNA damage repair (DDR) gene expression data, (2) reconstruction error of mutation count data, and (3) agreement between mutation data and gene expression information of cancer related genes. We attempt to find a categorization that optimizes (1), and then use measures (2)-(3) to evaluate the categorization obtained. Below we describe these measures in detail and how they are used for optimization and evaluation.

Categorization quality measures

In all measures below we assume that we are given a mutation categorization into M categories and observe a normalized count matrix V of order N × M that reports the proportion of times a mutation category i ∈ [1, M] was observed in sample j ∈ [1, N] (Fig 1A shows an unnormalized count matrix V of N samples and M arbitrary mutation categories). The (normalized) mutation count matrix is assumed to factorize VWH, where H is a matrix of mutational signatures (probability vectors over categories) and W represents their sample-specific exposures.

Fig 1. Overview of the computational pipeline.

Fig 1

(A) For each mutation categorization (a set of M mutation categories), a mutation count matrix of N samples by M mutation categories is built through the assignment of 7-mer mutation sequences to their best matching category. (B) For each mutation categorization, its corresponding normalized mutation count matrix is factorized to produce a mutational signature matrix H and an exposure matrix W using NMF. (C) For each categorization, the correlation between the exposure matrix W and the corresponding gene expression data is computed using canonical correlation analysis to determine its fitness. (D) Categorizations are selected from the population for crossover and mutation to produce offspring for the next generation of the genetic algorithm.

To avoid overfitting, we perform out-of-sample evaluation that is applied both to the data sets that are used for learning a categorization and to the data sets that are used for the evaluation of the categorization. Each data set has two types of samples: (i) samples with whole-genome-sequencing (WGS) mutation information; (ii) samples with mutation count data from whole-exome-sequencing (WES) as well as gene expression data. First, the signature matrix is learned by factorizing the WGS mutation count matrix (using the Kullback–Leibler variant of NMF [12]; details on how the factorization order is determined are given below, see Fitness and Selection). We normalize the signature matrix to account for the difference between 7-mer distribution (representing mutation opportunities) of whole-genome and whole-exome sequences (see Mutation opportunity normalization section below). Then, exposures of the unseen WES samples are derived via a non-negative least squares (NNLS) computation [13]. In case the signatures are known in advance (taken from COSMIC [14]), only the final NNLS step is performed. To allow evaluation of these signatures for varying factorization orders, we compute the average prevalence (in terms of sample exposure) of each COSMIC signature in a data set and rank signatures from the most prevalent to the least.

Agreement between mutation and gene expression data

Given an exposure matrix W and an expression matrix E over the same set of samples, W and E represent two views of the sample data and we can compare them using canonical correlation analysis (CCA) [15] (we use the Pyrcca [16] implementation of regularized CCA), which maximizes the correlation between the two views. This correlation serves as a measure of the categorization quality. To avoid overfitting, we perform 10-fold cross validation, i.e., we learn the CCA coefficients using training samples and compute the resulting correlation on the test samples.

Reconstruction error

We compute the reconstruction error by applying NMF to the WGS samples to learn the signature matrix H and then derive the exposure matrix W from the other samples using NNLS. We then measure the approximation error (Kullback–Leibler divergence) of the factorization with respect to the test (normalized) count matrix V. To minimize biases that arise from NMF’s tendency to converge to local minima, we perform 10 runs with randomly initialized W, H factorization matrices and report the smallest error obtained.

Mutation opportunity normalization

Let H be a matrix of mutational signatures inferred from WGS data. To account for the difference between mutation opportunities in whole-genome and whole-exome data, we first compute the counts of all possible 7-mer sequences in the GRCh37 genome and exome (genome and exon positions were downloaded from Ensembl: https://grch37.ensembl.org). Next, we construct two mutation opportunity 96-long count vectors, UWGS and UWES, that correspond to the set of 96 mutation categories of H. Each position i in those vectors is the number of 7-mer sequences in the genome/exome that map to the ith category. Finally, each row vector of H is normalized by multiplying its ith element by UWES,i/UWGS,i and normalizing the resulting vector so that it sums to one.

Comparing our signatures to COSMIC signatures

Given a mutation count matrix for N samples and all 7-mer mutation sequences, and a signature S (probability vector over 96 mutation categories), we would like to compare its cosine similarity to a COSMIC signature. We construct a transformation matrix T of size 96 × 96 from our mutation categories to the standard categories, where each Ti,j is the fraction of 7-mer mutation sequences that belong to the standard category j out of the total number of sequences that are mapped to our category i. Then, we compute the cosine similarity between S′ = ST and any COSMIC signature.

Categorization optimization

We use a genetic algorithm [17] to optimize the mutation categorization. The algorithm maintains a population of possible solutions, where each individual in the population is a categorization. Each category within a categorization represents a pattern of mutations by a string of 8 characters that correspond to the three left flanking bases, the original base that mutated, the resulting base, and the three right flanking bases. Each symbol can be a single base or a combination of bases, except for the identity of the original base, which can be only C or T. Apart from this base, any other category symbol could represent any of the bases or any pair of bases (marked by a dinucleotide) or a wildcard base (marked by N), with the constraint that the mutated base is different from the original base.

Given a 7-mer mutation sequence of bases describing the mutation and its context, we assign this sequence to the category that best matches it based on three criteria (no mismatches are allowed): (i) the size of the category, i.e., the number of 7-mer mutation sequences that perfectly match it; (ii) the distance between the first and last non-N symbol of the category; and (iii) the distance of the farthest non-N symbol from the mutated site. These criteria are applied in order and used to resolve assignment conflicts.

For example, consider the 7-mer mutation sequence TGTCTAAC, which describes a C-to-T mutation and its context—TGT to the left and AAC to the right. Further consider the following mutation categories (using IUPAC code; e.g. S is G/C)): NSTCTANN and NNNCTARC. The 7-mer mutation sequence can potentially be assigned to both categories as there are no mismatches to both. We follow the three criteria to choose the best matching category: (i) both categories are of the same size—they cover 128 sequences; (ii) the distance between the first and last non-N symbol in each category is 5; (iii) the distance between the farthest non-N symbol and the mutated site is shorter in NSTCTANN, and so the sequence is assigned to it.

Categorizations are created and maintained in such a way that there are less than 1% sequence-to-category assignment conflicts (as described below). Each categorization includes a “joker” category (NNNNNNNN), which guarantees that all sequences are assigned to some category. For ease of comparison to the standard categorization, we focus on categorizations with M = 96 categories.

Ancestors

Each category in a categorization is generated with a clearly defined (C or T) base in the fourth position, as part of an interval of n_symbols consecutive non-N symbols and whose starting position is chosen uniformly at random (see examples in Fig 2). Other non-interval symbols are assumed to be wildcards and the interval is constrained to contain the fourth position. We assume a range of 3 to 7 symbols for n_symbols, with the highest concentration on n_symbols = 4—the number of bases in the standard categories, controlled by parameter c: n_symbolsDir((1, c > 1, 1, 1, 1)). For example, if n_symbols = 5 is generated, then an interval starting position is chosen randomly from the list [0, 1, 2, 3] of possible positions (guarantee a clearly defined base in fourth position). Then, if position 2 is generated, then a sequence of 5 symbols is generated randomly, with the constraints that the second symbol (corresponds with the fourth position—the original base) is C or T, and that the third symbol (corresponds with the mutated base) does not intersect with the original base. If TCRGA is generated, then this example results with the category NNTCRGAN.

Fig 2. Top categories and the percent of data set sequences that are mapped to each of them.

Fig 2

In each category, the bases at the fourth and fifth positions represent the mutation. Flat Xs represent wildcards. The top 20 categories with the highest average prevalence (i.e., the number of sequences that are mapped to a category divided by the number of data set sequences) over all 3 data sets are shown. Each bar is scaled from zero to the highest prevalence value in its column (data set).

Fitness and selection

A pre-step for fitness evaluation is the assignment of 7-mer mutation sequences to categories for each categorization, according to the criteria explained above, which results in a mutation catalog of size N samples by M categories (See Fig 1: Sequence to category assignment). In case of an ambiguous assignment, conflicting categories are replaced with new categories in order to resolve the conflict, until a rate of less than 1% ambiguous assignments of sequences to categories is achieved. The remaining conflicts are resolved by randomly assigning the sequence to one of its “best” matching categories. We now describe the calculation of fitness scores for each categorization in the population. Let K* be the optimal number of NMF factorization components, determined by applying our CV2K method [18]. CV2K automatically chooses K* from data, based on a detailed comparison of the performance of different solutions (using different K values) in reconstructing hidden values of the original mutation catalog in a cross validation setting, combined with a parsimony consideration. We focus on the range K_range = [K* − 2, K* + 2] of components. For each KK_range and each test fold, we compute the CCA correlation between DDR gene expression information and sample exposures as explained above. We take the mean correlation over all K values and test folds to be the fitness score of a categorization.

Following fitness assessment, categorizations are ranked from 1 (worst) to the size of the population S (best) and selected for crossover and mutation processes with probability Pselection=rankipowerj=1Srankjpower, without replacement. The power parameter enables us to adjust the strictness of the selection, i.e., the higher the value the more likely top ranked individuals are to be selected. The two categorizations with the highest fitness values proceed as-is to the next generation, to guarantee non-decreasing top fitness value. To perform the optimization on multiple data sets, we apply the above ranking system with respect to each data set, and construct an average ranking [19], i.e., an average over all data set specific rankings. We consider the individual and average rankings equally in the selection of categorizations for crossover. Finally, the best categorization in the population is defined as the top ranked categorization in the overall (average) ranking.

Crossover and mutation

We perform crossover and mutation in the population to produce the remaining offspring for the next generation. To produce a single offspring, we draw a categorization (parent) from the population with probability Pselection (see above) and cross it with probability q with another parent, or mutate it with probability 1 − q to produce an offspring. In case of a crossover, we randomly draw a different categorization (the other parent) from the population. A random percentage, up to 40%, of the categories are taken from the lower-fitness parent and the rest are taken from the higher-fitness parent, to produce the offspring. Note that if there are not enough unique categories in the two parents to create the offspring, we generate additional categories to fill up the categorization. Mutations occur after crossover is completed (or skipped), and are applied to each category with probability m. As category replacements naturally occur across the generations (either to resolve identified conflicts, or to fill up an offspring categorization), the mutation is responsible for more delicate changes—it replaces a symbol in a random category position with a different intersecting symbol, in order to expand/reduce the category with respect to this position (from nucleotide to dinucleotide symbol or the opposite transition, e.g. ‘A’ − > M (‘A/C’), K (‘G/T’) − > ‘G’). Consequently, a mutation from one symbol to a non-intersecting symbol can occur only indirectly, by a series of two or more mutations in a specific position. This approach was chosen to allow delicate changes in a categorization, while minimizing the introduction of new conflicts to it.

Parameters

We initialize the genetic algorithm with 100 categorizations (ancestors), 40 of which are selected according to their fitness, and a population of 40 individuals is maintained across the generations. We ran the algorithm for 500 generations (two days on 40 CPUs—Intel(R) Xeon(R) CPU E5–2699 v3 @ 2.30GHz). The genetic algorithm has four main parameters: crossover rate (q), mutation rate (m), selection strictness (power) and category concentration (c). To tune each of the four parameters, we set the values of the other three parameters and test the performance of the genetic algorithm (i.e., categorization fitness) on a range of values for the parameter in question, in order to find a value that avoids local minima and leads to greater fitness and faster convergence. For crossover rate, we examined the values 0, 0.4, 0.8, 1 and chose q = 0.8. For mutation rate, we tested rates 0.01, 0.05, 0.1, and set the parameter to m = 0.05. For strictness of selection, we examined values 3, 5, 7 and chose power = 5. For category concentration, we examined values 1, 5, 10, 20 and set the parameter to c = 10.

Data

We use the following TCGA [20] data sets that include gene expression and WES mutation data: (i) 976 breast BRCA samples that had been collected from BRCA-US project, available in ICGC [21]; (ii) 374 prostate PRAD-US samples; and (iii) 185 ovarian OV-US samples. In addition, we use WGS mutation data of: (i) 569 BRCA samples of the BRCA-EU project; (ii) 193 PRAD-UK samples; and (iii) 115 OV-AU samples.

For mutation data, we downloaded simple somatic data files from ICGC, extracted all the single base substitutions and used GRCh37 reference genome to collect 7-mer sequences centered around each of the mutations (3 flanking bases on each side and the substitution that occurred). These 7-mer mutation sequences served as initial categories, with 24,513 of these categories represented in the BRCA data, 23,310 in the PRAD data, and 24,150 in the OV data.

For gene expression data, we collected the corresponding data files from ICGC. For each sample, we extracted the value under ‘normalized read count’ for each gene, to create gene expression data sets of 20,502 genes, for each data set. We consider two types of gene subsets: (i) DDR genes, and (ii) Cancer Gene Census (CGC) genes. For (i), we use expression data for 190 genes from the updated version of [22] that can be found here: www.mdanderson.org/documents/Labs/Wood-Laboratory/human-dna-repair-genes.html. For (ii), we use expression data of 680 COSMIC Cancer Gene Census genes that do not appear in the DDR gene set.

We further split the largest BRCA-US data set to 726 training samples, that are used to learn our categorization with the genetic algorithm, and 250 test samples, that are used for evaluation. Similarly, we split the PRAD-US data set equally to 187 training samples and 187 test samples. We reserve the OV-US samples solely for testing.

As a reference set of signatures we used COSMIC signatures that were found to be active in the investigated cancer types: 1–3, 5, 6, 8, 13, 17, 18, 20, 26 and 30. These add up to a total of 12 signatures in the BRCA data set, 3 signatures in the PRAD data set, and 3 signatures in the OV data set.

Results

Despite an almost decade-long research into mutational signatures, the categorization of mutations still revolves around 96 categories that specify a mutation and its two flanking bases (one on each side).

Here we aimed to take a data-driven approach toward mutation categorization with the goal of maximizing the correlation between the activities of the processes that create these mutations (a.k.a. mutation signature exposures) and the expression of the genes that are involved in the mutational process (DNA damage repair genes). An overview of the computational pipeline is shown in Fig 1.

We applied our pipeline to identify mutation categories over three data sets: breast cancer (BRCA) samples, prostate cancer (PRAD) samples and ovary cancer (OV) samples. To identify a categorization that generalizes across data sets, we performed multi-objective optimization using the two largest sample sets—BRCA and PRAD—for training. We benchmarked our categorization against the standard categorization, which is based on the mutation that occurred and a single flanking base on each side, referred to below as the standard categorization, and also against a random categorization which assigns 7-mer sequences of mutations (incl. the mutation that occurred and 3 flanking bases on each side) uniformly at random to 96 categories. To evaluate the categorizations, we tested their performance on a BRCA and PRAD test sets (250 and 187 left out samples, respectively) and an independent test set from different cancer type (OV), all evaluated in an out-of-sample fashion (see Materials and methods).

The categories that make the resulting categorization were variable in size (i.e. the number of sequences in the data set that are assigned to them) and greatly deviate from the standard categories (Fig 2). Of the 95 non-joker categories, 75 have multiple flanking bases on either side of the mutation, and 41 of them have high specificity up to 3 bases away from the mutation. 13 of these 41 categories have the same base appearing in all three flanking bases, either by itself, or as part of a dinucleotide symbol. There are 48 categories with flanking base specificity in only one side of the mutation. The ‘A/G’ dinucleotide symbol appears 51 times in our categories, while the other five dinucleotide symbols appear 30–37 times. There are 51 categories with C as the original mutated base and 44 with T, but of the most prevalent categories (Fig 2) only one category has T as the original base. This is expected since C mutations are 4–5 times more frequent than T mutations in the WES data sets.

We measured the quality of the resulting matrix factorizations and the agreement with the expression patterns of DDR genes and a set of cancer driver genes from Cancer Gene Census (CGC). Indeed, previous studies identified correlations between several mutational signatures and some cancer drivers and acknowledged that the cause-effect relation between signatures and cancer drivers can be in either direction [23]. We also used a control set that is not expected to be related to mutational signature activities—odorant binding genes (GO:0005549)—which does not share genes with the DDR and CGC sets.

In terms of factorization quality, our categorization performed best across a wide range of number of components (K ∈ [2, max(k_cosmic, K*) + 2], Fig 3). For agreement with expression data, we computed correlations between the inferred exposures of each categorization and expression levels of cancer-related gene sets: DDR genes and CGC genes. In this latter evaluation we also included ‘standard cosmic’ exposures that are inferred using the COSMIC signatures (corresponding to the standard categorization) in the given cancer type (see Materials and methods). Overall, our categorization yielded higher correlation values compared to the others. The results are summarized in Fig 4. Both the DDR gene set and the CGC gene set produced a similar range of correlation levels (0.3–0.5 for non-random categorizations). In contrast, the control set yielded correlation levels of up to 0.15, with no clear advantage for non-random categorizations over the random categorization (S2 Fig). For higher K values, the improvement in performance (correlation to DDR and CGC genes) over the breast cancer data is marginal. This possibly stems from the difference in the signatures’ DDR association between the cancer types. Specifically, BRCA has 12 known associated signatures, 6–8 of which are believed to be associated with DDR, while both OV and PRAD have only 3 known signatures and 2–3 have some association with DDR [3].

Fig 3. Comparative performance evaluation by reconstruction error (log-transformed).

Fig 3

For each K in the range of number of components [2, max(k_cosmic, K*) + 2], we apply NMF to the WGS samples to learn the signature matrix H and then derive the exposure matrix W from the test samples using NNLS. The reported reconstruction error (Kullback–Leibler divergence) is the approximation error of this factorization with respect to the test samples of the (normalized) count matrix V.

Fig 4. Comparative performance evaluation: Correlation to expression of DDR genes (A) and CGC genes (B).

Fig 4

For each K in the range of number of components [2, max(k_cosmic, K*) + 2], we apply NMF to the WGS samples to learn the signature matrix H and then derive the exposure matrix W from the WES samples using NNLS. We learn the CCA coefficients using WES training samples and compute the resulting correlation on the test samples. The reported correlation is the average over 10-fold cross validation. Error bars represent the standard deviation of multiple evaluation runs.

To further examine our categorization and resulting signatures, we (i) computed the similarity between each of the signatures that were extracted using our categorization and all known COSMIC v2 signatures (including signatures believed to be associated with sequencing artifacts according to COSMIC v3), and (ii) tested the agreement between our signatures and the DDR gene expression information. For each of the three data sets, we focused on the signatures predicted with an optimum (see Fitness and Selection) number of components (11 for BRCA, 5 for PRAD and 3 for OV). For (i), we first transformed our signatures to the standard categories (see Materials and methods) and then computed the cosine similarity between the transformed signature and each of the COSMIC signatures. For (ii), we used CCA to compute the correlation between a sample’s signature exposure and its vector of expression of DDR genes. Furthermore, we evaluated the correlation with expression information for subsets of the DDR genes as taken from [22]: Homologous recombination (HR), Base excision repair (BER), Mismatch excision repair (MMR), and Nucleotide excision repair (NER).

In total, 13 of the 19 signatures we discovered were similar to a COSMIC signature (cosine similarity greater than 0.8), spanning 9 distinct COSMIC signatures (1,3,4,5,6,8,12,16,26), the majority of which are known to be active in the studied cancer types. The APOBEC-related COSMIC signatures 2 and 13, that are known to be active in breast cancer, are absent from the signatures extracted using the DDC categorization, which might indicate that this categorization neglects non-DDR mutational processes. Interestingly, a signature similar to COSMIC signature 12 was extracted from both BRCA and OV datasets. The etiology of this signature is unknown and it was previously observed mainly in liver cancer samples. 7 of the 19 signatures were highly correlated (Pearson correlation greater than 0.3) to the expression information. These results are summarized in S1 Table and an example signature (#3) is depicted in Fig 5. This analysis allowed us to compare the suspected activity of each of the 7 signatures with the proposed etiologies of similar COSMIC signatures when such were available. The results are summarized in Table 1. Overall, the gene subsets whose expression was correlated with the exposures of the identified signatures matched the etiologies associated with their COSMIC parallels. For instance, DDC signatures 3 and 6 were correlated with homologous recombination genes and were similar to COSMIC Signature 3, which is associated with HR deficiency. DDC signature 4 was correlated with the expression of base excision repair genes and was similar to COSMIC Signature 1, which is associated with spontaneous deamination of 5-methylcytosine. The correlation to BER genes may be related to their role in the demethylation of 5-mC and the repair of T:G mismatches [24, 25]. DDC signature 5 was most similar to COSMIC Signature 8 and was correlated with the expression of nucleotide excision repair genes. Deficiency of nucleotide excision repair was previously suggested to be associated with COSMIC Signature 8 [26], although the etiology of this signature is still in debate. In two of the cases (signatures 2 and 7, S1 Fig) the COSMIC proposed etiologies did not match the discovered correlations. Finally, DDC signature 1 features a predominance of C-to-T transition, mostly in the G downstream context. While it did not have a clear COSMIC parallel, it was most similar to the aging-related Signatures 1 and 5 (cosine similarity > 0.75), and exhibited a mild correlation to age (Pearson correlation of 0.2). In addition, for each of the 7 DDC signatures, we tested the correlation between the exposures of the corresponding similar COSMIC signature and the expression of the gene subsets. We observed that the most correlated subsets were in agreement with those identified for the corresponding DDC signatures.

Fig 5. An example DDC signature (#3 in Table 1).

Fig 5

The signature is depicted using both the standard categories (top) and the DDC ones (bottom, categories presented using the IUPAC code). This signature is similar to COSMIC Signature 3 (cosine similarity 0.89), and its exposure is correlated with the expression of DDR genes (Pearson correlation 0.37).

Table 1. Signatures that had been discovered using the DDC categorization and were highly correlated (correlation > 0.3) with the expression of DDR genes.

DDC # Similar COSMIC # Proposed etiologies Most correlated subset Most correlated to similar COSMIC
1 NER
2 5,3,8 HR BER BER
3 3 HR HR HR
4 1 5mC deamination BER BER
5 3,8,5 HR NER MMR
6 3,8 HR HR HR
7 1,6 5mC deamination,MMR HR HR

The signatures are ordered according to these correlations (non-decreasing). For each signature, given are its similar COSMIC signatures (ordered, cosine similarity > 0.8), their COSMIC etiologies if available, and highest correlated DDR gene subset (correlation > 0.2). Additionally, for each DDC signature, we report the most correlated DDR subset to the corresponding similar COSMIC signature.

Conclusion

We have devised a framework for data-driven identification of mutation categories. This framework should serve as a pre-step for mutational signature analysis, and should be applied only periodically when the mutation data base is updated. Applying this framework to breast and prostate cancer whole-exome sequencing data, we identified a novel categorization that leads to better reconstruction of mutation data and higher correlations with the expression patterns of related gene sets over multiple cancer types. A majority of the signatures identified using the data-driven categorization were similar to known COSMIC signatures and were associated with DDR gene subsets that agreed well with the proposed etiologies of their COSMIC parallels. However, several signatures did not have a clear COSMIC parallel, including signatures with high correlation to DDR gene expression. This may indicate that the data-driven categorization, which takes into account broader mutation contexts, uncovers new mutational signatures. Potentially, these signatures could provide insight into new mutational processes, or a more accurate depiction of known processes.

Our categorization process is based on the assumption that the activities of DDR genes are correlated with the mutational processes that are active in a given patient. This assumption does not take into account the changes in gene expression over time since the first mutations occurred. It also downweights non-DDR mutational processes such as processes with environmental causes. Nevertheless, the generalization of the identified categorization to different cancer types may indicate that the approach we take is able to uncover real signals and that mutational processes should be studied with broader mutation contexts than those used to date.

The framework can be easily adapted to find data-driven categorization based upon different assumptions and motivations. For example, the genetic algorithm fitness function could be modified to account for correlation with other gene sets and factorization reconstruction error, or to prefer high mutational signature entropy to improve interpretability.

Our work focused on 96 categories—the number of standard categories—in order to eliminate the number of categories in a categorization as a possible explanation for differences to previous work. However, as this number is arbitrary, it would be interesting to investigate (i) the possibility of decreasing the number of categories while maintaining comparable performance, and (ii) whether significant improvements in performance could be obtained by increasing the number of categories. To test these ideas, we suggest running our method with a flexible number of categories per categorization. For example, a simple adjustment to our method would be to introduce category addition and deletion mutations in each generation.

Supporting information

S1 Fig. An example of a non-flat DDC signature (#7 in Table 1).

The signature is depicted using both the standard categories (top) and the DDC ones (bottom, categories presented using the IUPAC code). This signature is similar to COSMIC Signatures 1 and 6 (cosine similarity 0.88 and 0.84 respectively), and its exposure is correlated with the expression of DDR genes (Pearson correlation 0.30).

(PDF)

S2 Fig. Comparative performance evaluation: Correlation to expression of odorant binding genes (used as a control set that is not expected to be related to mutational signature activities).

For each K in the range of number of components [2, max(k_cosmic, K*) + 2], we apply NMF to the WGS samples to learn the signature matrix H and then derive the exposure matrix W from the WES samples using NNLS. We learn the CCA coefficients using WES training samples and compute the resulting correlation on the test samples. The reported correlation is the average over 10-fold cross validation. Error bars represent the standard deviation of multiple evaluation runs.

(PDF)

S3 Fig. Comparative performance evaluation by reconstruction error (log-transformed).

For each K in the range of number of components [2, max(k_cosmic, K*) + 2], we apply NMF to the WGS samples to learn the signature matrix H and then derive the exposure matrix W from the test samples using NNLS. The reconstruction error (Kullback–Leibler divergence) is the approximation error of this factorization with respect to the test samples of the (normalized) count matrix V. NMF is applied 10 times and the reported reconstruction error is the average over these runs. Error bars represent the standard deviation.

(PDF)

S1 Table. All 19 signatures that had been discovered using the DDC categorization on the BRCA, PRAD and OV datasets.

The signatures are ordered according to their correlation with the expression of DDR genes. For each signature, given are also the dataset that it was derived from, its similar COSMIC signatures (ordered, cosine similarity > 0.8) and the similarity to the most similar COSMIC signature. In case there is no COSMIC signature with cosine similarity > 0.8 to the DDC signature, we report the COSMIC signature with the greatest similarity to it.

(JPG)

Acknowledgments

We thank Itay Sason for his critical reading of the manuscript.

Data Availability

Code and links to data can be found here: https://github.com/GalGilad/DDC.

Funding Statement

This research was supported by a grant from the United States - Israel Binational Science Foundation (BSF), Jerusalem, Israel (RS and ML). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Nik-Zainal S, Alexandrov L, Wedge D, Van Loo P, Greenman C, Raine K, et al. Mutational Processes Molding the Genomes of 21 Breast Cancers. Cell. 2012;149(5):979–993. doi: 10.1016/j.cell.2012.04.024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR, Behjati S, Biankin AV, et al. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415–421. doi: 10.1038/nature12477 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, Tian Ng AW, Wu Y, et al. The repertoire of mutational signatures in human cancer. Nature. 2020;578(7793):94–101. doi: 10.1038/s41586-020-1943-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Haradhvala NJ, Kim J, Maruvka YE, Polak P, Rosebrock D, Livitz D, et al. Distinct mutational signatures characterize concurrent loss of polymerase proofreading and mismatch repair. Nature Communications. 2018;9(1):1746. doi: 10.1038/s41467-018-04002-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Fang H, Barbour JA, Poulos RC, Katainen R, Aaltonen LA, Wong JWH. Mutational processes of distinct POLE exonuclease domain mutants drive an enrichment of a specific TP53 mutation in colorectal cancer. PLOS Genetics. 2020;16(2):1–20. doi: 10.1371/journal.pgen.1008572 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Shiraishi Y, Tremmel G, Miyano S, Stephens M. A Simple Model-Based Approach to Inferring and Visualizing Cancer Mutation Signatures. PLOS Genetics. 2015;11(12):1–21. doi: 10.1371/journal.pgen.1005657 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Fredriksson NJ, Elliott K, Filges S, Van den Eynden J, Ståhlberg A, Larsson E. Recurrent promoter mutations in melanoma are defined by an extended context-specific mutational signature. PLOS Genetics. 2017;13(5):1–16. doi: 10.1371/journal.pgen.1006773 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Knijnenburg TA, Wang L, Zimmermann MT, Chambwe N, Gao GF, Cherniack AD, et al. Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas. Cell Reports. 2018;23(1):239–254.e6. doi: 10.1016/j.celrep.2018.03.076 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Chae YK, Anker JF, Carneiro BA, Chandra S, Kaplan J, Kalyan A, et al. Genomic landscape of DNA repair genes in cancer. Oncotarget. 2016;7(17):23312–23321. doi: 10.18632/oncotarget.8196 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Ma J, Setton J, Lee NY, Riaz N, Powell SN. The therapeutic significance of mutational signatures from DNA repair deficiency in cancer. Nature Communications. 2018;9(1):3292. doi: 10.1038/s41467-018-05228-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Kim YA, Wojtowicz D, Sarto Basso R, Sason I, Robinson W, Hochbaum DS, et al. Network-based approaches elucidate differences within APOBEC and clock-like signatures in breast cancer. Genome Medicine. 2020;12(1):52. doi: 10.1186/s13073-020-00745-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Lee DD, Seung HS. Algorithms for Non-Negative Matrix Factorization. In: Proceedings of the 13th International Conference on Neural Information Processing Systems. NIPS’00. Cambridge, MA, USA: MIT Press; 2000. p. 535–541.
  • 13.Lawson CL, Hanson RJ. Solving Least Squares Problems. Society for Industrial and Applied Mathematics; 1995. Available from: https://epubs.siam.org/doi/abs/10.1137/1.9781611971217.
  • 14. Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Research. 2018;47(D1):D941–D947. doi: 10.1093/nar/gky1015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Hotelling H. Relations between two sets of variables. Biometrika. 1936;28(3-4):321–377. doi: 10.1093/biomet/28.3-4.321 [DOI] [Google Scholar]
  • 16. Bilenko NY, Gallant JL. Pyrcca: Regularized Kernel Canonical Correlation Analysis in Python and Its Applications to Neuroimaging. Frontiers in Neuroinformatics. 2016;10:49. doi: 10.3389/fninf.2016.00049 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Holland JH, et al. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press; 1992. [Google Scholar]
  • 18. Gilad G, Sason I, Sharan R. An automated approach for determining the number of components in non-negative matrix factorization with application to mutational signature learning. Machine Learning: Science and Technology. 2020. [Google Scholar]
  • 19. Bentley PJ, Wakefield JP. Finding Acceptable Solutions in the Pareto-Optimal Range using Multiobjective Genetic Algorithms. In: Chawdhry PK, Roy R, Pant RK, editors. Soft Computing in Engineering Design and Manufacturing. London: Springer London; 1998. p. 231–240. [Google Scholar]
  • 20. Network CGAR, Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature genetics. 2013;45(10):1113–1120. doi: 10.1038/ng.2764 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Consortium ICG. International network of cancer genome projects. Nature. 2010;464(7291):993–998. doi: 10.1038/nature08987 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Wood RD, Mitchell M, Sgouros J, Lindahl T. Human DNA Repair Genes. Science. 2001;291(5507):1284–1289. doi: 10.1126/science.1056154 [DOI] [PubMed] [Google Scholar]
  • 23. Poulos RC, Wong YT, Ryan R, Pang H, Wong JWH. Analysis of 7,815 cancer exomes reveals associations between mutational processes and somatic driver mutations. PLOS Genetics. 2018;14(11):1–20. doi: 10.1371/journal.pgen.1007779 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Wu X, Zhang Y. TET-mediated active DNA demethylation: mechanism, function and beyond. Nature Reviews Genetics. 2017;18(9):517–534. doi: 10.1038/nrg.2017.33 [DOI] [PubMed] [Google Scholar]
  • 25. Bayraktar G, Kreutz MR. The Role of Activity-Dependent DNA Demethylation in the Adult Brain and in Neurological Disorders. Frontiers in molecular neuroscience. 2018;11:169–169. doi: 10.3389/fnmol.2018.00169 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Jager M, Blokzijl F, Kuijk E, Bertl J, Vougioukalaki M, Janssen R, et al. Deficiency of nucleotide excision repair is associated with mutational signature observed in cancer. Genome research. 2019;29(7):1067–1077. doi: 10.1101/gr.246223.118 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009542.r001

Decision Letter 0

Pedro Mendes, Sushmita Roy

26 Jul 2021

Dear Gilad,

Thank you very much for submitting your manuscript "A data-driven approach for constructing mutation categories for mutational signature analysis" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Pedro Mendes, PhD

Associate Editor

PLOS Computational Biology

Sushmita Roy

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors present newly developed methodology for mutational signature extraction that takes into account flexibly expanded nucleotide context and matched DDR gene activity. The rationale of searching for patterns that reflect similar DDR processes is reasonable and one would indeed expect the resulting signatures to have better interpretability. The generation of these signatures employs a genetic algorithm which has been adapted for the question but whose components are not entirely motivated. Overall, while the idea of combining mutation and DDR expression information for signature discovery is very interesting, the manuscript lacks transparency in the uncovered signatures, their overall interpretability and applicability.

I have the following major concerns:

• The method employs a genetic algorithm that is based on crossover and mutation. It is unclear to me what relevance/interpretation the crossover would have in this context, given that we are not talking about sexual reproduction in the context of tumour cell proliferation. I understand the “parents” in this case are abstracted to refer to different mutation categories, but why would there be a crossover in the process of cell division?

• Why 96 categories? This is not explained, and I cannot think of why it may be beneficial to replicate the 96-category standard given that the categories are being changed.

• It is somewhat unclear what is going on when exposures and expression are being correlated. Are there any correlation cut-offs applied to guide signature selection? These correlations should also be plotted/reported, as they would enhance the interpretability of the procedure.

• For the calculation of the reconstruction error, 10 runs with randomly initialised W,H are performed and the smallest error obtained is reported. It may be fairer to report the mean and confidence interval instead – and this could be updated in Fig 3.

• Figure 4: DDC improvement is very marginal in BRCA - any particular interpretation as to why this is the case?

• Table 1: it would be useful to discuss further the fact that more often than not the most correlated subset and proposed aetiology do not match. I acknowledge part of this is covered in the discussion already (BER and deamination connection), but it may also be worth considering that 4/7 top signatures are essentially flat signatures. Would there be a tendency for flat signatures to naturally display the highest correlations? Is Pearson’s correlation the best measure of correlation, particularly when some contributions will be so small?

• In addition to this, it may be useful to explore further, or at least display, some of the non-flat signatures - possibly a signature correlated with an MMR-deficiency signature like SBS6?

• All signatures derived in the analysed cohorts should be reported, along with the corresponding DDR correlations.

• Do the signatures uncovered make sense in the context of the biology of the three diseases studied, and that of other reports from the literature? Are there any new biological insights that this analysis brings?

• The authors should expand on the motivation, interpretability and applicability of their newly uncovered signatures. Can these signatures improve DDR deficiency prediction? Other benefits and future directions should be specified.

Reviewer #2: The authors present a new method that uses a genetic algorithm to identify fine-scale latent signatures of mutation processes in cancer genomes. Much of the recent development in mutation signature analysis has stagnated on minor tweaks to the non-negative matrix factorization and focused primarily on signatures defined by trinucleotide mutation context, so it is exciting to see new algorithm developments in this area and consideration for broader sequence contexts. I also thought the use of gene expression data from DNA damage+repair genes was a clever idea for validating the mutation signatures detected with the DDC method. However, I am concerned that the implementation of this method is prohibitively computationally costly, and the authors did not provide sufficient evidence to demonstrate that this method will lead to new insights into mutation processes and cancer biology.

MAJOR COMMENTS

The authors state that their algorithm took 2 days to run on a 40-core server, and the description of the tuning procedure indicates that they ran a total of 4+3+3+4=14 iterations to explore different values of the model parameters, which suggests the tuning process took 4 weeks of compute time. I appreciate that this is a new approach to mutation signature analysis and optimization is not necessarily a priority, but I’m not convinced that the novelty of the findings justifies this massive computational burden. Given the length of the paper and focus on describing the algorithm, I reviewed this primarily as a method/software paper, and as such, I would expect the software to have a runtime at least on the same order of magnitude as existing methods (The authors state that they identified a categorization that generalizes across datasets [line 320], so at the very least I would expect these data to be made publicly available so other users can treat it as a “pre-trained” model). On the other hand, if this is meant to be more of a standalone research paper, the biological insights lack novelty/depth. For example, Table 1 indicates DDC signature 1 is the only signature without a similar COSMIC signature or proposed etiology, but is most correlated to NER genes and mildly correlated with age. How do we know this is not just an artifact of fitting to the data or a spurious correlation? Is there any evidence in prior literature that NER genes or age-related processes are enriched for C>T transitions in the G downstream context (line 407)?

It would also be nice to include:

- A supplementary figure comparing performance across the different parameter values evaluated (similar to Figs 3 & 4 but comparing e.g., q = [0, 0.4, 0.8, 1])

- scripts in the Github repository so users can compare performance of DDC on their data against standard approaches

Line 303-307: The authors state that they selected COSMIC signatures only found to be active in the investigated cancer types. It would be good to include some supplementary results comparing the DDC signatures against all COSMIC signatures to affirm that the matching signatures in Table 1 are not just reflecting a confirmation bias. Moreover, the COSMIC signatures believed to be associated with sequencing artifacts should absolutely be considered in this analysis (https://cancer.sanger.ac.uk/signatures/sbs/).

The authors state that their choice to collapse longer sequence contexts into the 96 most informative categories was solely for the sake of comparing their findings directly with the standard 96-subtype approaches, not by any biological motivation. It would be nice to see a more thorough analysis of how many categories are actually needed--what is the minimal number of categories required to match the reconstruction error of standard approaches? How much is reconstruction error reduced by increasing the number of categories? Do the identified categories remain fairly stable?

I had some difficulty getting the DDC software to run. It would help to include a conda/pip environment config file to ensure users are always using tested versions of the required Python libraries (using a fresh install of Miniconda, I was able to run the software after installing the scikit-learn, matplotlib, pandas, and pyrcca packages). I also got several numpy runtime warnings while running DDC, but it was not obvious if this was informational or an error that needed to be debugged (e.g., “python3.9/site-packages/numpy/lib/function_base.py:380: RuntimeWarning: Mean of empty slice.”). To run the software with the included example data, I had to create a new directory called “categorization_related_data”, move the example files to this directory, and rename them to remove underscores in the file names, because the “--datatype” parameter treats underscores as a delimiter. The code is cleanly written and easy to follow, but additional testing and documentation would go a long way towards making this software functional.

MINOR COMMENTS

Line 35-39--this sentence should include some references to the initial implementations of NMF-based mutation signature analysis using 96 categories (Nik-Zainal et al. 2012 [https://www.sciencedirect.com/science/article/pii/S0092867412005284] and Alexandrov et al. 2013 [https://www.nature.com/articles/nature12477])

Line 75, Line 168, Fig. 1--the terminology of “8-long sequence” is very confusing. Functionally, the authors are describing how they encode a 7-mer mutation subtype as an 8-character string (e.g., TGT[C>T]AAC becomes TGTCTAAC), but calling the latter a “sequence” implies they are looking at an actual 8bp DNA sequence motif, which is misleading. It would be clearer to describe the mutation categories as representing a possible 7-mer mutation subtype and us standard IUPAC codes (http://www.bioinformatics.org/sms/iupac.html) to represent “wildcards” present in the categories, e.g., NNTC(A/G)GAN would be better represented as NNT[C>R]GAN, where R represents an A or a G. This way, all categories are consistently written in a way that indicates the unmutated flanking bases and the specific mutation at the 4th position. (I’ll also note that the authors already use the “7-mer” terminology on the README of their Github repository, and the python code is already using IUPAC notation for ambiguous bases, but the --kmer parameter in the program is inconsistent with this).

Fig. 5--the DDC signature panel would be easier to interpret if the signature barplots indicated which bars correspond to each of the “Top Categories” in the logo plots, e.g., with a number 1-5 above the corresponding bar or color code for each category.

Reviewer #3: The paper presents a data-driven approach to extract mutation categories. Such mutational categories can be used to analyze mutational signatures in cancer going beyond the standard and arbitrary choice of the "canonical" mutational signatures.

While the paper does propose a novel methodology to extract mutation categories, the paper's main contributions are the analysis and the results. The methodology is sound, and the results are interesting. The fact that the resulting categorizations are variable in size and greatly deviate from the standard categories is extremely interesting, and may inspire other approaches to develop data-driven signatures.

The main assumption, that is, tumor cells that are expose to similar mutational processes show similar expression levels of genes that are involved in the underlying mutagenic process may be fairly strong and not always satisfied. However, I think it is a reasonable assumption, and maybe even required, to develop a data-driven approach.

I do not have any major request for edit. As a minor, at the beginning of Section 2 it would be good to mention, as in line 185, that the focus on 96 categories is to compare with the standard categorization. Moreover, do the author have any thoughts on what would be the best choice of the number of categories? This could be added to Conclusions.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No: Missing code for comparing against COSMIC signatures; missing data for the DDC categories & resulting signatures

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009542.r003

Decision Letter 1

Pedro Mendes, Sushmita Roy

6 Oct 2021

Dear Gilad,

We are pleased to inform you that your manuscript 'A data-driven approach for constructing mutation categories for mutational signature analysis' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Pedro Mendes, PhD

Associate Editor

PLOS Computational Biology

Sushmita Roy

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed my concerns in a satisfactory manner. I have no further comments.

Reviewer #2: The authors have sufficiently addressed all of my previous comments.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009542.r004

Acceptance letter

Pedro Mendes, Sushmita Roy

15 Oct 2021

PCOMPBIOL-D-21-00767R1

A data-driven approach for constructing mutation categories for mutational signature analysis

Dear Dr Gilad,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. An example of a non-flat DDC signature (#7 in Table 1).

    The signature is depicted using both the standard categories (top) and the DDC ones (bottom, categories presented using the IUPAC code). This signature is similar to COSMIC Signatures 1 and 6 (cosine similarity 0.88 and 0.84 respectively), and its exposure is correlated with the expression of DDR genes (Pearson correlation 0.30).

    (PDF)

    S2 Fig. Comparative performance evaluation: Correlation to expression of odorant binding genes (used as a control set that is not expected to be related to mutational signature activities).

    For each K in the range of number of components [2, max(k_cosmic, K*) + 2], we apply NMF to the WGS samples to learn the signature matrix H and then derive the exposure matrix W from the WES samples using NNLS. We learn the CCA coefficients using WES training samples and compute the resulting correlation on the test samples. The reported correlation is the average over 10-fold cross validation. Error bars represent the standard deviation of multiple evaluation runs.

    (PDF)

    S3 Fig. Comparative performance evaluation by reconstruction error (log-transformed).

    For each K in the range of number of components [2, max(k_cosmic, K*) + 2], we apply NMF to the WGS samples to learn the signature matrix H and then derive the exposure matrix W from the test samples using NNLS. The reconstruction error (Kullback–Leibler divergence) is the approximation error of this factorization with respect to the test samples of the (normalized) count matrix V. NMF is applied 10 times and the reported reconstruction error is the average over these runs. Error bars represent the standard deviation.

    (PDF)

    S1 Table. All 19 signatures that had been discovered using the DDC categorization on the BRCA, PRAD and OV datasets.

    The signatures are ordered according to their correlation with the expression of DDR genes. For each signature, given are also the dataset that it was derived from, its similar COSMIC signatures (ordered, cosine similarity > 0.8) and the similarity to the most similar COSMIC signature. In case there is no COSMIC signature with cosine similarity > 0.8 to the DDC signature, we report the COSMIC signature with the greatest similarity to it.

    (JPG)

    Attachment

    Submitted filename: Response to reviewers.docx

    Data Availability Statement

    Code and links to data can be found here: https://github.com/GalGilad/DDC.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS

    RESOURCES