Author manuscript; available in PMC: 2024 Nov 1.
Published in final edited form as: Cancer Epidemiol Biomarkers Prev. 2024 May 1;33(5):721–730. doi: 10.1158/1055-9965.EPI-23-0728

Diffsig: Associating Risk Factors With Mutational Signatures

Ji-Eun Park 1, Markia A Smith 2, Sarah C Van Alsten 3, Andrea Walens 3, Di Wu 1,4, Katherine A Hoadley 5,6, Melissa A Troester 2,3,5, Michael I Love 1,6,*
PMCID: PMC11062813  NIHMSID: NIHMS1973567  PMID: 38426904

Abstract

Background:

Somatic mutational signatures elucidate molecular vulnerabilities to therapy and therefore detecting signatures and classifying tumors with respect to signatures has clinical value. However, identifying the etiology of the mutational signatures remains a statistical challenge, with both small sample sizes and high variability in classification algorithms posing barriers. As a result, few signatures have been strongly linked to particular risk factors.

Methods:

Here we develop a statistical model, Diffsig, for estimating the association of one or more continuous or categorical risk factors with DNA mutational signatures. Diffsig takes into account the uncertainty associated with assigning signatures to samples as well as multiple risk factors’ simultaneous effect on observed DNA mutations.

Results:

We applied Diffsig to breast cancer data to assess relationships between five established breast-relevant mutational signatures and etiologic variables, confirming known mechanisms of cancer development. In simulation, our model was capable of accurately estimating expected associations in a variety of contexts.

Conclusions:

Diffsig allows researchers to quantify and perform inference on the associations of risk factors with mutational signatures.

Impact:

We expect Diffsig to provide more robust associations of risk factors with signatures to lead to better understanding of the tumor development process and improved models of tumorigenesis.

Keywords: METHODOLOGY AND MODELING/METHODOLOGY AND MODELING, GENETICS OF CANCER RISK & OUTCOME/GENETICS OF CANCER RISK & OUTCOME, BIOSTATISTICS/BIOSTATISTICS, COMPUTATIONAL METHODS/Software, COMPUTATIONAL METHODS/Sequence analysis

Introduction

Somatic mutations are changes in the DNA sequence that occur throughout the life of an organism due to mutagenic processes such as exposure to ultraviolet light, inhaled smoke, or impaired DNA repair pathways. Each mutagenic process is believed to leave a unique pattern in the DNA, the collective action of which creates “mutational signatures”1 (Friedberg 2010). Mutational signatures were initially introduced by Nik-Zainal et al. (2012), who discovered breast cancer-specific signatures in whole-genome sequenced (WGS) data using non-negative matrix factorization (NMF)2. Subsequent large-scale studies by Alexandrov et al. (2013, 2020) on 30 cancer types led to the construction of a compendium of detected signatures that were deposited at the Catalogue Of Somatic Mutation In Cancer (COSMIC)3 4 5. Single base substitution (SBS) signatures from COSMIC are composed of 96 mutational contexts, where each context represents a mutated base (C>A, C>G, C>T, T>A, T>C, and T>G) along with its immediate 5’ and 3’ bases (A, C, G, T). Each signature has a different composition of the mutational contexts, e.g. SBS2 is 98% composed of T[C>T]N contexts while SBS3 has a more uniform distribution of mutational contexts. Such mutational signatures are increasingly used to predict response to therapy, but also have an important role in understanding cancer development as they allow us to infer the mutational processes that occur throughout life and which sometimes lead to carcinogenesis5.
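The count of 96 contexts follows from 6 substitution classes times 4 possible 5’ bases times 4 possible 3’ bases. A minimal sketch that enumerates them is shown below; the label format and the ordering are assumptions made for illustration and need not match the row order of any particular COSMIC file.

```python
from itertools import product

BASES = "ACGT"
# The six substitution classes, written from the pyrimidine reference base (C or T).
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def sbs96_contexts():
    """Return the 96 trinucleotide context labels, e.g. 'T[C>T]A'."""
    return [
        f"{five}[{sub}]{three}"
        for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)
    ]


contexts = sbs96_contexts()
print(len(contexts))   # 6 substitutions x 4 x 4 flanking bases
print(contexts[:3])
```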

Numerous methods for finding de novo signatures have been introduced. These methods detect mutational signatures based on various algorithms including Bayesian NMF6 7, other variations of NMF8 9 10 11 12 13, expectation-maximization14, or mixed-membership models15 16. These methods can detect novel signatures, including ones that resemble the composition of mutational contexts in COSMIC signatures. Along with the discovered latent signatures, an estimated “contribution” is generated for each sample, which estimates how much of a sample’s count of somatic mutations each signature accounts for.

Although many methods were successful in finding de novo signatures, identifying the etiology is not trivial. In previous studies, etiologies of the signatures were determined based on previous knowledge of somatic and germline mutations or cell lines2 17. Some studies also used statistical tests to detect the association between a risk factor and a signature, for example, Wilcoxon rank-sum tests comparing samples across binary variables, e.g. gender and tobacco/alcohol usage18. Such comparison tests are usually performed directly on the number of mutation counts or the estimated contributions. While such statistical methods presented a way to aid the process of discovering the associated mutagenic processes, a recent study by Yang et al. argued that such comparison tests may struggle with low power and efficiency because the tests ignore the variance of the estimated contributions16. In particular, when the total number of observed mutations is low, which may result from low sequencing coverage, low tumor purity, and/or low mutational burden, the accuracy of the estimated contributions may vary across samples and across signatures. To address this, Yang et al. developed HiLDA, which utilizes a unified hierarchical latent Dirichlet allocation model within a Bayesian framework to identify de novo signatures and test the difference in contributions between two groups. This approach increases power as it directly models the latent contribution of signatures within each sample, though it only allows for testing associations between signatures and covariates in two-group settings. Furthermore, it focuses on de novo signatures instead of leveraging pre-established signatures.

We propose Diffsig, a Bayesian Dirichlet-Multinomial hierarchical model to estimate and test the association of risk factors and mutational signatures. While traditionally, one would consider risk factors that relate to a donor’s exposures, e.g. sunlight and tobacco usage, here we use the term “risk factor” generally to refer to any measured per-sample covariate that we wish to associate with mutational signatures. As with HiLDA, our proposed model is able to account for the heterogeneous uncertainty inherent in cancer patient cohorts, where samples may have different numbers of observed somatic mutations due to varying sequence coverage and mutation burden. A methodological difference compared to HiLDA is that Diffsig can assess associations for continuous as well as categorical risk factors, and multiple risk factors can have associations estimated simultaneously, which allows for unbiased estimation of each risk factor while accounting for other covariates. Unlike many other methods, we focus on finding the association between risk factors and the latent contribution of mutational signatures to each sample’s set of somatic mutations; hence we omitted the process of detecting de novo signatures and instead rely on any set of predefined mutation signatures that have reproducibly been relevant for breast cancer. Our model was evaluated with simulated datasets based on liver and breast cancer-related COSMIC signatures found from previous studies and with a breast cancer dataset generated by The Cancer Genome Atlas (TCGA) Research Network5. We find our model can accurately capture biologically meaningful associations between risk factors and mutational signatures.

Materials and Methods

Bayesian Dirichlet-Multinomial Hierarchical Model

The goal of Diffsig is to estimate the associations between one or more risk factors and a predefined set of mutational signatures. In order to estimate the associations while accounting for sampling variance on heterogeneous samples, we use a Bayesian hierarchical Dirichlet-Multinomial model. As in the HiLDA model, we make use of a Dirichlet-Multinomial distribution; however, we do not include de novo signature detection as part of Diffsig, but instead only use pre-defined sets of signatures, which should be specified by the analyst based on literature or unsupervised analysis of the dataset. Further methodological differences between HiLDA and Diffsig are that 1) Diffsig can model associations with continuous risk factors in addition to categorical ones, and 2) Diffsig can model associations from multiple risk factors on somatic mutations simultaneously, which can reduce bias in the case that correlated risk factors have distinct processes affecting the distribution of somatic mutations. While a Multinomial distribution can be used to model count data, the Dirichlet-Multinomial distribution allows for potential differences in the latent contribution of mutational signatures for samples with the same risk factors.

The Diffsig model is built based on two main assumptions: 1) a sample’s mutational counts are composed of a specified set of mutational signatures, and 2) risk factors and mutational signatures have an association that can be approximated with a model that is linear in coefficients on the normalized exponential (or softmax, see Supplementary Methods) scale. The hierarchical model is based on these assumptions (Figure 1) and our goal is to accurately estimate the association between risk factors and mutational signatures.

Figure 1. Schematic of Diffsig model.

(A) Conceptual diagram of Diffsig, for detecting associations between risk factors and mutational signatures (with these target parameters defined as β). Each sample’s observed somatic mutations are composed of the association (β) and the sample’s risk factor traits (X). Associations between samples and signatures are defined as α. Even for samples with the same risk factor values, there exists variance among their signature distribution, here the center of a distribution per sample is diagrammed. The signature contribution is then multiplied by known transition probabilities for a set of reference signatures (C). Finally the resulting mutation probability π underlies the observed mutation counts (Y). (B) Diffsig estimates the associations β (yellow dots: point estimates, colored bars: 80% credible intervals, black lines: 95% credible intervals) for the left example. (C) β‘s effect on ϕ across a range of values of X: β can be interpreted by the strength and direction in which the central tendency of ϕ moves across the space between signatures with differing values of X. From the ternary plots, we can see that as the X increases, the contribution of signature 1 and 2 decreases and that of signature 3 increases. Signature 1 contribution decreases more drastically than signature 2, indicating a negative association of Signature 1 and positive values of X.

For our model, three types of data are required: observed risk factors $X$, mutation counts $Y_{N \times 96}$, and the pre-defined set of mutational signatures $C_{96 \times K}$. The mutational signatures matrix contains columns representing the conditional probabilities of observing each type of mutational transition conditioning on signature, such that these probabilities add up to 1 for each signature $MS_k$, $k \in 1, \ldots, K$, where $K$ is the number of signatures of interest. The observed risk factor data $X_{N,M}$ should be complete for all $N$ samples and $M$ risk factors ($M \geq 2$) including an intercept. In addition to the observed risk factors, we add an intercept that allows us to capture the baseline presence of mutational signatures ($X_{\cdot,0} = 1$). Risk factors can be in any form: binary, continuous, or categorical. For preprocessing, continuous variables were centered and scaled to have unit variance, and categorical variables with $p$ levels were encoded as $(p - 1)$ dummy variables.
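The preprocessing described above can be sketched as follows. This is an illustrative reading rather than the package's own code: the function names, the toy values, and the use of the sample standard deviation for scaling are assumptions.

```python
from statistics import mean, stdev


def standardize(values):
    """Center a continuous covariate and scale it to unit (sample) variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def dummy_encode(labels, reference):
    """Encode a p-level categorical variable as p - 1 indicator columns."""
    levels = sorted(set(labels) - {reference})
    rows = [[1.0 if lab == lev else 0.0 for lev in levels] for lab in labels]
    return levels, rows


def design_matrix(continuous, categorical, reference):
    """Build X with a leading intercept column of ones."""
    z = standardize(continuous)
    levels, dummies = dummy_encode(categorical, reference)
    rows = [[1.0, zi] + d for zi, d in zip(z, dummies)]
    return ["intercept", "score"] + levels, rows


# Toy values, purely hypothetical.
names, X = design_matrix(
    continuous=[10.0, 20.0, 30.0, 40.0],
    categorical=["LumA", "Basal", "LumB", "LumA"],
    reference="LumA",
)
print(names)
for row in X:
    print(row)
```

The reference level receives all-zero dummy columns, so its effect is absorbed by the intercept.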

With the given datasets, the association between risk factors and mutational signatures is modeled with $\beta$, where each element is assumed to be distributed as a Normal with mean 0 and standard deviation of $\sigma_\beta$. The linear combination of risk factors $X$ and the association coefficient matrix $\beta$ then yields an association between samples and mutational signatures $\alpha_{n,k}$, where $n \in 1, \ldots, N$ and $k \in 1, \ldots, K$.

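$$\beta \sim N(0, \sigma_\beta), \qquad \alpha = X\beta, \qquad \alpha_{n,k} \in (-\infty, \infty), \quad \beta_{m,k}, \quad m \in 0, 1, \ldots, M$$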

$\alpha_{N \times K}$ therefore provides the relatedness of the samples and signatures, but does not reside in a directly interpretable space. Hence, $\alpha_{n,\cdot}$ is mapped with a normalized exponential function, i.e. softmax, for each row $n$, such that the values are compositional within each sample. This choice is motivated by the observation that each somatic DNA mutation can be attributed to some process affecting DNA, and that multiple processes act over time to produce the observed spectra in a sample. A latent contribution for each sample $\phi_{n,\cdot}$ is modeled using a Dirichlet distribution on the result of the softmax function applied to rows of $\alpha$, with a precision parameter $\tau$ that controls the concentration of the Dirichlet distribution around a vector of contributions $\alpha_{n,\cdot}$. The matrix of latent contributions $\phi_{K \times N}$ incorporates potential differences in the latent contribution for samples with the same risk factors, that is, conditional on $X$, samples may nevertheless have different values of $\phi_{\cdot,n}$. The latent composition of the samples for each signature is then projected to the 96-dimensional mutational contexts by taking the product of $\phi$ and the fixed signature matrix $C$, giving the mutation probabilities $\pi_{96 \times N}$. Finally, we assume the observed counts $Y_{96 \times N}$ were sampled from a Multinomial distribution for each sample with probability of $\pi_{\cdot,n}$.

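$$\phi \mid \alpha, \tau \sim \mathrm{Dirichlet}\left(\mathrm{softmax}(\alpha^{T})\,\tau\right)$$

$$\pi = C \times \phi^{T}$$

$$Y \mid \pi \sim \mathrm{Multinomial}(\pi)$$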
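To make the generative direction of the hierarchy concrete, the sketch below simulates one sample's counts: a linear predictor, a softmax, a Dirichlet draw scaled by $\tau$, projection through $C$, and a Multinomial draw. It is written per sample (so that $\pi_n = C\phi_n$) and is not the Stan model used for inference. The 4-context, 2-signature matrix and all numeric values are hypothetical toy inputs standing in for a 96-row signature matrix.

```python
import math
import random


def softmax(row):
    """Normalized exponential of one row of alpha (numerically stabilised)."""
    top = max(row)
    exps = [math.exp(a - top) for a in row]
    total = sum(exps)
    return [e / total for e in exps]


def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalising independent Gamma variates."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_sample(x_row, beta, C, tau, n_mutations, rng):
    """Simulate (phi, pi, counts) for one sample.

    x_row: covariates including the intercept (length M)
    beta:  M x K coefficient matrix
    C:     contexts x K matrix whose columns each sum to 1
    """
    K = len(beta[0])
    alpha = [sum(x * beta[m][k] for m, x in enumerate(x_row)) for k in range(K)]
    center = softmax(alpha)
    phi = sample_dirichlet([tau * c for c in center], rng)
    pi = [sum(C[r][k] * phi[k] for k in range(K)) for r in range(len(C))]
    counts = [0] * len(C)
    for r in rng.choices(range(len(C)), weights=pi, k=n_mutations):
        counts[r] += 1
    return phi, pi, counts


rng = random.Random(1)
C = [[0.7, 0.1], [0.1, 0.2], [0.1, 0.3], [0.1, 0.4]]   # toy signatures
beta = [[0.5, -0.5], [1.0, -1.0]]                      # intercept row, one risk factor row
phi, pi, y = simulate_sample([1.0, 0.3], beta, C, tau=100, n_mutations=50, rng=rng)
print(phi, y)
```

Larger values of `tau` pull `phi` closer to the softmax center, so samples with identical covariates differ less.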

Additional details on model fitting are provided in Supplementary Methods, and a diagram of the relationship of parameters is provided in Supplementary Figure 1. All analysis for Diffsig was performed in R v4.1.0 with a Bayesian hierarchical model specified in the Stan probabilistic programming language and the posterior inference was performed using the rstan package v2.21.3 with 4 cores and 4 Markov chains in parallel with 1 node19.

Simulations

Simulated datasets were generated with number of donors (N) of 50, 100, 150, 200, and 1,000. In order to show that the model works with any type of risk factor, we generated 300 simulations in total: 20 simulations for each of three types of risk factor (binary, categorical, and continuous) at each of the five sample sizes. In order to evaluate if the model works in various contexts, we tested on mutational signatures known to be present in i) breast cancer (COSMIC SBS1, 2, 3, 5, 13) and ii) liver cancer (SBS 1, 4, 5, 12, and 29) found from previous studies4. Signatures were selected based on prevalence and highest contribution found by Alexandrov et al. from the respective ICGC and TCGA datasets collected in the PCAWG consortium4. Risk factors (X) in continuous and binary format were generated, sampling from either a normal or binomial distribution, respectively. To evaluate associations with categorical risk factors, a 3-level categorical variable was generated, sampling from a Multinomial distribution, which then was modeled with an intercept and two dummy variables for the non-reference levels. We set the over-dispersion hyperparameter τ=100, sampled associations β from a uniform distribution spanning [−2, 2], and centered the β values for the intercept and two non-reference level coefficients. The variance of the true associations was 4/3. The number of total mutations per donor, J, was sampled from a negative binomial distribution with mean 51 and dispersion parameter 1. These numbers were generated using the fitdistr function in the R package MASS on the preprocessed whole-exome sequenced TCGA breast cancer dataset (e.g., within 10–500, excluding hyper-mutated samples) used in our following sections20. Although this distribution nearly matched the distribution of the dataset, some samples were still assigned a very low number of mutations (i.e. less than 10), which would be filtered during preprocessing. In order to have all N simulated samples for analysis, values that were lower than 10 were replaced with a number sampled from a Poisson distribution with mean 17. The observed 96-dimensional mutation counts, Y, were then sampled from the model with fixed C, X, τ and sampled values for β and J. For each setting, 20 simulations were generated, and for each iteration a new set of β, X, and J was simulated.
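The sampling of the per-donor mutation totals J can be sketched as below. The sketch assumes that "dispersion parameter 1" corresponds to a negative binomial size of 1, in which case the distribution reduces to a geometric one and can be sampled by inversion. The helper names are hypothetical and the original analysis was done in R, not with this code.

```python
import math
import random


def sample_geometric_nb(mean, rng):
    """Negative binomial draw with size 1 (a geometric distribution) and the given mean."""
    p = 1.0 / (1.0 + mean)        # success probability, since mean = (1 - p) / p
    u = 1.0 - rng.random()        # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - p))


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small means such as 17."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_mutation_totals(n, rng, mean=51.0, floor=10, refill_mean=17.0):
    """Draw n totals; values below `floor` are replaced by a Poisson(refill_mean) draw."""
    totals = []
    for _ in range(n):
        j = sample_geometric_nb(mean, rng)
        if j < floor:
            j = sample_poisson(refill_mean, rng)
        totals.append(j)
    return totals


totals = sample_mutation_totals(200, random.Random(0))
print(min(totals), max(totals), sum(totals) / len(totals))
```

A Poisson replacement with mean 17 can itself occasionally fall below 10; the text does not say how such draws were handled, so the sketch leaves them in place.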

To validate that our model would produce accurate estimates on all types of risk factors, we tested on binary, categorical, and continuous risk factors. In addition, analogous to linear or logistic regression, excluding a risk factor that is associated with the presence of one or more signatures may induce bias in estimates, which can reduce the power of the test and/or lead to false positives. To empirically verify this possibility of bias from excluded covariates, we generated true values of $(\alpha, \beta, \phi, \pi)$, assuming that there were two continuous risk factors associated with the signatures, where the risk factors were positively correlated, as they were sampled from a multivariate normal distribution ($\beta \sim N(0, \Sigma_{2 \times 2})$, where $\sigma_{12} = \sigma_{21} = 0.7$). Two scenarios were tested: i) including the two continuous risk factors in $X$ and ii) only one of the two risk factors in $X$.

After fitting the model using rstan, convergence within and between the 4 Markov chains was examined with a standard MCMC diagnostic measure, the Gelman-Rubin statistic ($\hat{R}$), for each parameter19. The chains were considered to be well converged when $\hat{R}$ values were below 1.05. Posterior means of the parameters $\hat{\beta}$ were compared to the true parameters $\beta$, and coverage and width of 80% credible intervals were assessed.
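For intuition about the convergence check, the classic (non-split) form of the Gelman-Rubin statistic for a single parameter can be computed as below. This is a simplified sketch: rstan reports its own variant of the diagnostic, and the simulated chains here are stand-ins rather than Diffsig output.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic potential scale reduction factor for one parameter.

    chains: list of equal-length lists of post-warmup draws, one list per chain.
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)     # average within-chain variance
    between = n * variance(chain_means)            # between-chain variance
    var_plus = (n - 1) / n * within + between / n  # pooled variance estimate
    return (var_plus / within) ** 0.5


rng = random.Random(0)
# Four chains sampling the same distribution, then one chain stuck elsewhere.
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
stuck = mixed[:3] + [[rng.gauss(3, 1) for _ in range(1000)]]
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(rhat_mixed, rhat_stuck)
```

Values near 1 indicate that between-chain variation is negligible relative to within-chain variation, which is the condition the 1.05 threshold screens for.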

Testing for False Positives in Null Scenario

In addition to testing the sensitivity, we evaluated the specificity of Diffsig with our simulated data. We simulated data where the true association coefficients with signatures were set to zero. Note that the true values for the intercept were kept the same as in the original simulation. Then, we ran Diffsig to assess the false positive rate, defined as the proportion of 80% credible intervals not including 0. To further assess the accuracy, we calculated the RMSE and average error per simulation of the estimates.
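The false positive criterion can be illustrated with posterior draws: form an equal-tailed 80% interval per coefficient and count how often it excludes zero. The sketch below uses simple empirical quantiles and simulated draws; both are assumptions for illustration and not the package's interval code.

```python
import random


def credible_interval(draws, level=0.80):
    """Approximate equal-tailed interval from posterior draws via order statistics."""
    ordered = sorted(draws)
    lo_idx = int((1 - level) / 2 * (len(ordered) - 1))
    hi_idx = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


def fraction_excluding_zero(draws_per_coef, level=0.80):
    """Share of coefficients whose interval does not contain 0."""
    excluded = 0
    for draws in draws_per_coef:
        lo, hi = credible_interval(draws, level)
        if lo > 0 or hi < 0:
            excluded += 1
    return excluded / len(draws_per_coef)


rng = random.Random(0)
null_draws = [rng.gauss(0.0, 1.0) for _ in range(1000)]    # posterior centered on 0
signal_draws = [rng.gauss(3.0, 0.5) for _ in range(1000)]  # posterior far from 0
rate = fraction_excluding_zero([null_draws, signal_draws])
print(rate)
```

When every true coefficient is zero, as in the null scenario, this fraction is the false positive rate.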

TCGA Breast Cancer

TCGA mutation calls were retrieved from the NCI GDC website as an MC3 file in hg19 coordinates and were then subset to 1,201 breast tumor samples to evaluate our model on real data. We selected single nucleotide variants (SNVs). We omitted germline chromosomes and adjacent normals. We then used the R packages VariantAnnotation and Maftools to generate the mutation count matrix from the MAF file9 21. Hypermutated samples with mutation counts ranging from 563 to 5037 (n=13) were excluded from the dataset, leaving only samples with mutation counts in the range of approximately 10–500 mutations. Hypermutated samples were removed when COSMIC signatures were extracted using SigProfiler by Alexandrov et al. (2020) from the PCAWG Network to avoid the ‘signature bleeding’ effect4. Since COSMIC signatures were used as the reference, we removed hypermutated samples to be consistent with how the signatures were generated. In addition, hypermutation in general is known to be associated with specific genes (e.g. POLE mutations), so samples exhibiting this characteristic were not of interest when associating risk factors with the other mutational signatures. Breast cancer samples are commonly assigned to one of the five molecular intrinsic subtypes (PAM50), which were originally identified by clustering the expression of 50 genes: luminal A (LumA), luminal B (LumB), HER2-enriched, basal-like, and normal-like22. In our analysis, we removed normal-like subtypes and only kept the remaining four.

Before applying Diffsig on the TCGA breast cancer dataset, the data was evaluated to determine whether varying sequence coverage and mutational burden induce heterogeneous uncertainty on estimated contributions for mutational signatures. This was done by selecting one of the HER2-enriched samples with sample ID TCGA-3C-AALI-01, which has a total of 428 somatic mutation counts. This mutation count was downsampled using binomial sampling with a success probability of 0.1. We generated 100 different down-sampled mutation counts and estimated contributions on breast cancer COSMIC signatures using the non-negative least squares (NNLS) model from the MutationalPatterns R package12. Variance of the estimated contributions was assessed to see whether heterogeneous uncertainty exists.
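The downsampling step can be sketched as binomial thinning of a count vector, repeated 100 times. The four-element count vector below is a hypothetical stand-in that merely sums to 428, not the 96-context profile of the actual sample, and the NNLS fitting step is not reproduced here.

```python
import random


def downsample_counts(counts, keep_prob, rng):
    """Binomially thin each context count: every mutation is kept with keep_prob."""
    return [sum(1 for _ in range(c) if rng.random() < keep_prob) for c in counts]


def replicate_downsampling(counts, keep_prob=0.1, n_rep=100, seed=0):
    """Generate n_rep independent downsampled versions of one sample's counts."""
    rng = random.Random(seed)
    return [downsample_counts(counts, keep_prob, rng) for _ in range(n_rep)]


toy_counts = [120, 80, 150, 78]           # hypothetical counts, total 428
replicates = replicate_downsampling(toy_counts)
totals = [sum(rep) for rep in replicates]
print(min(totals), max(totals))
```

With a success probability of 0.1, each replicate retains roughly 43 of the 428 mutations on average, so contribution estimates fitted to the replicates reflect the extra noise of a low-count sample.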

We then applied Diffsig on TCGA samples to evaluate if the model would recover well-characterized associations. Along with the mutation counts, we obtained two risk factors for evaluation. One risk factor is the homologous recombination deficiency (HRD) scores, a metric that combines 3 different scores: HRD-LOH, LST (large-scale state transitions), and NtAI (number of telomeric allelic imbalances) scores, where HRD score > 42 is defined as HR deficiency23 24. Another risk factor we considered was the PAM50 molecular subtype for 915 TCGA breast samples (basal-like n=161, HER2-enriched n=74, luminal A n=487, luminal B n=193)25. Along with the categorical variable coding for molecular subtypes, we tested on a continuous subtype variable derived from the correlation of the samples to the PAM50 subtype centroids to see how the results compared to analysis using the categorical subtypes (available as Supplementary Data)26.

Risk factor variables were preprocessed based on the variable type as mentioned previously. However, for ease of comparing continuous to binary covariates, we did not scale the continuous HER2-enriched subtype variable as it already had similar variance to the binary version. The mutation count matrix was subset to the 907 samples that have HRD scores and PAM50 subtype information available. As with the simulated datasets, $\hat{R}$ values of each parameter were examined to determine MCMC convergence.

Model Evaluation

For simulations, we considered the model to accurately capture the true $\beta$s when i) the 80% credible intervals covered the true $\beta$ at their expected rate and ii) the root mean squared error (RMSE) of the estimated $\beta$ relative to the true $\beta$ was small. In addition, we measured the average value per simulation of $|\hat{\beta}_j - \beta_j|$ where $j$ = 10 or 15, averaging over the beta vector to show the extent of differences from the estimated to the true value per simulation.
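The three evaluation quantities named above (interval coverage, RMSE, and average absolute error) can be computed as in this sketch. The estimates, intervals, and true values are made-up numbers chosen only to exercise the functions.

```python
import math


def rmse(estimates, truths):
    """Root mean squared error of estimates relative to the true values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths))


def mean_abs_error(estimates, truths):
    """Average of |estimate - truth| over the coefficient vector."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)


def coverage(intervals, truths):
    """Fraction of (lower, upper) intervals that contain the true value."""
    hits = sum(1 for (lo, hi), t in zip(intervals, truths) if lo <= t <= hi)
    return hits / len(truths)


true_beta = [1.0, -0.5, 0.0, 2.0]
est_beta = [1.1, -0.4, 0.2, 1.8]
intervals = [(0.8, 1.3), (-0.6, -0.1), (0.05, 0.4), (1.5, 2.1)]

print(rmse(est_beta, true_beta))
print(mean_abs_error(est_beta, true_beta))
print(coverage(intervals, true_beta))
```

For a well-calibrated model, coverage of 80% intervals computed this way should sit near 0.8 when averaged over many simulated coefficients.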

For real data, the true β are not known, so instead, we evaluated the model qualitatively. The model was tested on two risk factors with previous knowledge to be associated with certain signatures. First, since a higher HRD score indicates homologous recombination deficiency, we tested the HRD score and COSMIC signatures, expecting a positive association between HRD score and SBS3 which is known to be associated with HRD. Second, Pitt et al. (2018) observed that tumors categorized as basal-like subtypes have significantly fewer APOBEC mutations compared to HER2-enriched subtypes27. Therefore, we inspected whether the estimated associations show that HER2-enriched tumors had higher association with APOBEC signatures in COSMIC (SBS2,13) compared to basal-like tumors.

Comparison to Linear Regression on Simulated Data

With simulated data, we have seen that Diffsig is capable of accurately estimating the associations between mutational signatures and different types of risk factors, both univariate and multivariate. It is natural to consider using a simple linear regression on estimated contributions to test such associations. For comparison, we regressed the observed transition counts on the COSMIC signatures, and subsequently regressed the estimated contributions on one or more risk factors. In order to compare coefficients from the Dirichlet-Multinomial model with linear regression coefficients, which are on different scales, we used the following three measures: i) Pearson correlation, ii) cosine similarity, and iii) sign accuracy (how well the direction or sign of the estimates match that of the simulated true association).
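Because the two sets of coefficients are on different scales, all three measures are chosen to be insensitive to a common rescaling. The sketch below implements them with hypothetical inputs; treating an exact zero as non-positive in the sign comparison is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def sign_accuracy(estimates, truths):
    """Fraction of estimates whose sign matches the sign of the true value."""
    matches = sum(1 for e, t in zip(estimates, truths) if (e > 0) == (t > 0))
    return matches / len(truths)


true_beta = [1.0, -2.0, 0.5, -0.5]
rescaled = [0.1, -0.2, 0.05, -0.05]      # same pattern on a different scale
one_flip = [0.1, -0.2, -0.05, -0.05]     # third coefficient has the wrong sign

print(pearson(rescaled, true_beta), cosine_similarity(rescaled, true_beta))
print(sign_accuracy(rescaled, true_beta), sign_accuracy(one_flip, true_beta))
```

An estimate vector that is a positive multiple of the truth scores perfectly on all three measures, which is why they suit a comparison across scales.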

Comparison with HiLDA

Unlike Diffsig, HiLDA estimates de novo signatures, which have a different format from the COSMIC signatures used in our analysis. Rather than summarizing in 96-dimension vectors, HiLDA estimates a signature with 3 sets of probabilities: probabilities of bases (A,C,G,T) in the flanking bases, and probabilities of mutations (C>A,C>G,C>T,T>A,T>C,T>G) (Supplementary Figure 2). Therefore, HiLDA signatures were qualitatively compared to COSMIC signatures by comparing the flanking bases and mutations of HiLDA signatures to those of COSMIC signatures. In addition, HiLDA includes two different tests: the global test and the local test. The global test assesses whether there is a difference in the mean contribution between two groups with a Bayes factor across any of the HiLDA signatures, and the local test shows whether there is a difference in the mean contribution between two groups for each HiLDA signature. Note the mean contribution, or what is referred to in the HiLDA method as exposure, is the relative contribution for each binary risk factor of one group. In our benchmark, we compared with local test results in order to compare the associations between the risk factor and each signature.

Data Availability

The results published here are in whole or part based upon data generated by The Cancer Genome Atlas Research Network: https://www.cancer.gov/tcga. The MC3 MAF file used to generate the somatic mutation count data in this paper can be found at the GDC Application Programming Interface (API): http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc. All TCGA risk factors used in the analysis can be found in the supplementary data which includes patient barcode, TCGA barcode, HER2 continuous score, PAM50 subtype, and HRD scores.

Diffsig is publicly available as an R package (https://github.com/jennprk/diffsig).

Results

Simulation

Simulation datasets were generated to validate that our model can accurately capture the associations between risk factors and signatures. The simulated numbers of mutations had a distribution comparable to that of the TCGA breast cancer samples. With maximum-likelihood fitting of negative binomial distributions (`fitdistr` function in the R package MASS), TCGA samples had mean 51.74 (sd=1.27) and size parameter (1/dispersion) 1.84 (sd=0.08), while the simulations had on average a mean of 53.56 (average sd=3.57) and an average size parameter of 1.69 (average sd=0.19) (Supplementary Figure 3)20. In simulating data from the model, we found that point estimates tended to have low error, and credible intervals (CIs) obtained their nominal coverage, across a range of risk factor types and sample sizes. We have seen that the mean coverage of the 80% CIs converges to 1 as the sample size increases for all three types of risk factors ($\hat{R} < 1.05$, Supplementary Figure 4). As the sample size increased, the mean width of the CIs and the average RMSE of the estimated $\beta$s decreased, as expected with a properly calibrated model (Figure 2A,B). Note that the average width of the 80% CIs becomes lower than 0.8 as the sample size increases, which indicates that Diffsig was behaving conservatively. As the sample size increases, the RMSE was near or below 0.1 and the average absolute error (or mean of $|\hat{\beta} - \beta|$) per simulation dropped to less than 0.1 on average (Supplementary Figure 4B). Increasing the sample size increased computation time: 50 samples required less than 5 minutes, 200 samples required around 25 minutes, and 1,000 samples required 5–7 hours per simulation iteration (Figure 2C). In addition to the RMSE, the average error of the estimated $\beta$ relative to the true $\beta$ was in most cases less than 0.01 regardless of the sample size (Supplementary Figure 4B,F).

Figure 2. Simulation results across a range of risk factor types and sample sizes.

A model including one risk factor (binary, categorical, or continuous) shows changes in (A) 80% credible interval coverage, (B) average root mean squared error (RMSE), (C) computation time (mins) as the number of samples increased. A model including all or only one of the risk factors shows changes in (D) 80% credible interval coverage, (E) average RMSE, (F) computation time (mins) per iteration as the number of samples increased.

Results showed that a model including both risk factors that influenced the presence of signatures had higher interval coverage (Supplementary Figure 4E) and significantly lower average RMSE, compared to a model that only included one of the risk factors. This showed that including multiple risk factors that are believed to be jointly associated with the signatures can lead to more accurate estimation of the associations compared to modeling with only one risk factor. Moreover, as the number of samples increased, the average width of the 80% credible interval and average RMSE decreased (Figure 2D, E). Within the same sample size, the average interval widths of simulations with two covariates were lower than those with only one covariate. Computation time was on average 14 to 20 minutes with two risk factors while it took on average 14 to 46 minutes with only one risk factor, showing the computation time was even lower when including both risk factors compared to only including one risk factor (Figure 2F).

In addition, we tested Diffsig on simulations with another set of mutational signatures associated with liver cancer4. Results were similar to what was observed with breast cancer signatures; again, including all potential risk factors was less biased than only including one of them (Supplementary Figure 4, 5).

Comparison to Linear Regression on Simulated Data

With the same simulated data, we compared Diffsig to simple linear regression. Based on the three measures, Pearson correlation, cosine similarity, sign accuracy, we observed that linear regression generally assigned correct signs in estimating the true βs, and the correlation between estimated and simulated betas was generally high (Figure 3). However, Diffsig had consistently superior results in all three criteria – mostly close to 1 with nearly zero variance.

Figure 3. Diffsig and simple linear regression compared on simulated datasets based on three measures.

(A) Pearson correlation, (B) cosine similarity, and (C) how well the methods estimated the correct direction (or sign) of the true beta (i.e., sign accuracy).

False Positive Evaluation

We observed that as the number of samples increased, particularly over 200 samples, RMSE was near or below 0.1 and the average absolute error became lower than 0.1 on average (Supplementary Figure 6). The coverage of the 80% credible interval was in most cases above 80%, indicating good control of type I error (false positives).

TCGA Breast Cancer

Diffsig has been developed to avoid the inherent sampling variability, instead allowing association tests with various types of risk factors using a hierarchical Bayesian model. Methods using point estimates for estimated contributions to binary dependent variables may underestimate the sampling variability. To empirically demonstrate such a drawback, we performed a naive analysis using NNLS on one of the basal-like TCGA breast cancer samples with five breast cancer-related signatures (SBS1, 2, 3, 5, and 13). It was observed that the relative contributions (signature X’s contribution/sum of all signatures’ contribution) varied across different down-sampled datasets (Figure 4A). For SBS13, the relative contribution ranged from nearly 0.2 to over 0.65 for certain down-sampled datasets.

Figure 4. Sampling variability in NNLS estimates and Diffsig applied to HRD scores samples.

(A) Relative contributions from NNLS in down-sampling experiments on a basal-like sample from the TCGA-BRCA dataset with breast cancer-related COSMIC signatures (red dots represent relative contributions generated with the original mutation counts); (B) Homologous Recombination Deficiency (HRD) score distribution in breast cancer subtypes; (C) Diffsig results, including median (yellow dots) and 80% and 95% credible intervals (blue and black lines), for the intercept (grey, top) and HRD score (blue, bottom).
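
The down-sampling experiment in panel A can be sketched as below. Several parts are assumptions made for illustration: the three signatures over six mutation categories are hypothetical toy values (not COSMIC signatures), the counts are invented, and the non-negative fit uses simple multiplicative updates as a stand-in for a dedicated NNLS solver.

```python
import random

# Hypothetical toy signatures over 6 mutation categories (not COSMIC values);
# each row sums to 1.
SIGNATURES = [
    [0.70, 0.10, 0.10, 0.05, 0.03, 0.02],
    [0.05, 0.05, 0.10, 0.10, 0.30, 0.40],
    [0.10, 0.40, 0.30, 0.10, 0.05, 0.05],
]

def fit_nonneg(counts, signatures, iters=500):
    """Approximate non-negative least squares by multiplicative updates
    (a stand-in for an NNLS solver; contributions stay non-negative)."""
    k, m = len(signatures), len(counts)
    h = [1.0] * k
    for _ in range(iters):
        fitted = [sum(h[j] * signatures[j][i] for j in range(k)) for i in range(m)]
        for j in range(k):
            num = sum(signatures[j][i] * counts[i] for i in range(m))
            den = sum(signatures[j][i] * fitted[i] for i in range(m))
            if den > 0:
                h[j] *= num / den
    return h

def relative_contributions(h):
    """Each signature's contribution divided by the sum over signatures."""
    total = sum(h)
    return [x / total for x in h]

def down_sample(counts, fraction, rng):
    """Keep a random subset of the observed mutations, without replacement."""
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    kept = rng.sample(pool, round(fraction * len(pool)))
    out = [0] * len(counts)
    for i in kept:
        out[i] += 1
    return out

rng = random.Random(0)
counts = [155, 65, 60, 30, 41, 49]  # hypothetical mutation counts for one sample
full = relative_contributions(fit_nonneg(counts, SIGNATURES))

# Refit on 20 down-sampled copies and record the range per signature
replicates = [
    relative_contributions(fit_nonneg(down_sample(counts, 0.25, rng), SIGNATURES))
    for _ in range(20)
]
spread = [(min(r[j] for r in replicates), max(r[j] for r in replicates))
          for j in range(len(SIGNATURES))]
print(full)
print(spread)
```

The width of each range in `spread` is the quantity of interest: a point estimate from the full counts alone gives no indication of how much the relative contribution would move under resampling.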

Moreover, Diffsig detected associations across various risk factors that would be expected from previous literature on somatic mutations and breast cancer. In a preliminary exploratory analysis of the TCGA breast cancer data, basal-like tumors had a higher average HRD score than the other subtypes (Figure 4B). The basal-like subtype has previously been shown to be strongly associated with the HRD mutational signature SBS3 (4,27). With this knowledge, we used Diffsig to test the association of HRD score with the five mutational signatures previously reported as common in breast cancer (Figure 4C). The estimated association β for the HRD score was much higher for SBS3 than for the other four signatures. Also, the high estimated β for the intercept of SBS5 indicated that this signature was prevalent in most cancer patients (4). All of the R̂ values for the estimates were below 1.05 (1.00) and the total computation time was around 12 hours for 931 samples.

To compare our performance with HiLDA, we tested both Diffsig and HiLDA on a binary risk factor (subtype). After selecting only the HER2-enriched and basal-like samples from TCGA, both HiLDA and Diffsig were run on a binary indicator of HER2-enriched samples. Although HiLDA does not produce 96-dimensional mutational signatures, we were able to qualitatively align the HiLDA signatures 1, 2, 3, 4, and 5 with the COSMIC signatures 1, 2, 3, 5, and 13 based on their dominant features (Figure 5A, Supplementary Figure 5).

Figure 5. Diffsig and HiLDA results on a subset of HER2-enriched and Basal-like subtypes from TCGA-BRCA.

(A) Each HiLDA signature was matched to a COSMIC signature, e.g. HiLDA1 has a composition similar to SBS1; (B) HiLDA results, with error bars representing the difference in mean exposures (in percentage) for each HiLDA signature; (C) Diffsig results, including median estimates (yellow dots), 80% credible intervals (grey/blue lines), and 95% credible intervals (black lines), for estimated associations between the binary HER2-enriched subtype indicator and COSMIC breast cancer signatures (SBS1, 2, 3, 5, 13); (D) Diffsig results for estimated associations between the continuous HER2-enriched subtype correlation and COSMIC breast cancer signatures.

Once we ordered the HiLDA signatures to match the order of the corresponding COSMIC signatures, we observed that the estimated differences in mean exposures (or contributions) from HiLDA aligned with the ranks and directions of the estimated associations from Diffsig (Figure 5B, C). Note that the x-axis scales differ because the models are inherently different: Diffsig measures associations, while HiLDA measures differences in contributions. HER2-enriched samples, when compared to basal-like samples, had a higher association with SBS2 and SBS13 than with other signatures. This agrees with previous knowledge that HER2-enriched samples tend to have higher APOBEC mutagenesis than basal-like samples (27). Both methods showed convergence (R̂ < 1.02), while computation was faster with Diffsig than with HiLDA (Diffsig 6.65 mins, HiLDA 78 mins with 242 samples).

Aside from considering subtype as a categorical variable, we considered continuous measures of subtype using the correlation to subtype centroids (26). Estimated associations for the continuous correlation to the HER2-enriched subtype were almost identical to those for binary HER2-enriched subtype status (Figure 5C, D). Estimates using the binary definition of subtype had a lower computational cost (26 mins with the continuous measure, 242 samples). In addition, the same process was tested on another set of breast cancer-related COSMIC signatures (SBS2, 3, 5, 8, 13) to assess whether the selected set of signatures matters (Supplementary Figure 7). The results were similar, with high associations of HRD score with SBS3 and of HER2-enriched samples with SBS2 and SBS13.

Discussion

A key task in cancer genomics is to identify the etiology of discovered mutational signatures by relating them to risk factor data. Diffsig allows researchers to quantify and perform inference on the associations of diverse risk factors with reference mutational signatures, leading to better understanding of the tumor development process and improved models of tumorigenesis. From simulation, we observed that when the sample size was sufficient, i.e. more than 200 samples given the simulation parameters, our model was able to correctly capture the estimates within the 80% credible intervals and with sufficiently low RMSEs and average errors. Real data results from TCGA support that our model is capable of estimating expected associations in a variety of contexts. Note that such results with high power require sufficient sample sizes in each subgroup if testing on discrete covariates.

Diffsig is also capable of estimating the associations in real data applications, as shown with the TCGA breast cancer dataset. These results aligned with what has been seen in previous studies of breast cancer related to HRD and APOBEC activity. Homologous recombination deficiency is generally assessed based on various indexes, e.g. mutational signatures and DNA-based measures of genomic instability (28). Previous breast cancer studies showed that patients with a high contribution of COSMIC signature 3 (SBS3), as well as those with high HRD scores, were considered homologous recombination deficient, which may indicate high sensitivity to certain treatments, e.g. PARP inhibitors (5,29,30). Diffsig showed consistent associations between these two measures. In addition, a study by Roberts et al. examined 14 cancer types, including breast cancer, using whole-genome and whole-exome sequencing datasets including TCGA (31). This study showed that HER2-enriched samples have significantly higher contributions of APOBEC-related mutations compared to other PAM50 subtypes. The contrast was most clearly seen between the HER2-enriched and basal-like subtypes, which corresponds to what we observed in our analysis.

Compared to previous methods, which have emphasized mutational signature discovery, Diffsig is focused on evaluating established signatures and adds key features not available in previous methods, including 1) the ability to model associations from continuous as well as categorical risk factors, and 2) the ability to estimate associations from multiple risk factors simultaneously, which can be important for accurate estimation with correlated risk factors. Diffsig does not involve a de novo signature identification step, but instead leverages the well-established mutational signatures from existing catalogs, e.g. COSMIC. To the extent that current catalogs contain the relevant signatures for the samples under examination, leveraging these catalogs facilitates the comparison of associations across studies. However, for data sets with a sufficiently large number of samples, NMF can be trained on the dataset instead of using existing reference signatures. Although we have limited our scope here to single-base substitutions (SBS), Diffsig is capable of taking other categories of mutations such as doublet-base substitutions (DBS) or short insertions and deletions (indels).

Diffsig is operationally well-aligned with current understanding of mutational signatures. Specifically, Diffsig uses a Dirichlet-Multinomial distribution for the observed counts following HiLDA, which reflects our belief that the mutations in a sample are a composite of multiple underlying mutational signatures. The Dirichlet layer of the model allows for overdispersion of the sample-signature contributions, such that two samples with the same risk factor values may have different latent contributions ɸ. One layer of our Diffsig model maps from the sample-signature associations α = Xβ to the sample-signature contributions ɸ, which sum to one for a given sample, across the signatures. These sample-signature contributions, ɸ, behave as other compositional data types, where an increase in the contribution of one signature necessitates decreases in the other signatures. Therefore, we recommend a compositional interpretation of estimated βs within each risk factor, while per-signature elements of β are not comparable across risk factors. In other words, our model offers inference regarding whether one signature is more or less associated with a risk factor compared to others, while controlling for influences of other covariates or risk factors.
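
The layers described above can be sketched as a small generative simulation. This is a hedged illustration rather than the Diffsig implementation: the softmax link from α to the mean contributions, the concentration parameter `tau`, the toy signatures, and all function names are assumptions introduced here.

```python
import math
import random

def softmax(v):
    """Map real-valued associations to proportions that sum to one."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def simulate_sample(x, beta, signatures, n_mut, tau, rng):
    """x: risk-factor row of length P (including intercept);
    beta: P x K coefficients; signatures: K x M, each row summing to 1."""
    K = len(beta[0])
    # Sample-signature associations: alpha = x beta
    alpha = [sum(x[p] * beta[p][k] for p in range(len(x))) for k in range(K)]
    mean_phi = softmax(alpha)  # assumed link to mean contributions
    # Dirichlet layer: overdispersed contributions around the mean
    phi = dirichlet([tau * m for m in mean_phi], rng)
    # Category probabilities are a phi-weighted mixture of the signatures
    M = len(signatures[0])
    probs = [sum(phi[k] * signatures[k][i] for k in range(K)) for i in range(M)]
    counts = [0] * M
    for i in rng.choices(range(M), weights=probs, k=n_mut):
        counts[i] += 1  # multinomial draw of n_mut mutations
    return phi, counts

rng = random.Random(42)
signatures = [[0.6, 0.2, 0.1, 0.1], [0.1, 0.1, 0.3, 0.5]]  # hypothetical, K=2, M=4
beta = [[0.0, 0.5], [1.0, -1.0]]  # rows: intercept, one risk factor
phi, counts = simulate_sample([1.0, 0.8], beta, signatures,
                              n_mut=300, tau=50.0, rng=rng)
print(phi, counts)
```

Under the assumed softmax link, adding the same constant to every element of α leaves the contributions unchanged, which is one way to see why the βs for a risk factor are interpreted relative to one another across signatures rather than on an absolute scale.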

Diffsig has potential to be generalized in various ways. Diffsig encompasses a statistical model rooted in the Dirichlet-Multinomial hierarchical framework. It can thereby be generalized across various applications aimed at assessing associations between potentially multiple risk factors and multiple sample categories (signatures), each of which emits distinct patterns of observations (transitions). To the extent that distinct mutational alterations (amplifications, deletions, mutations affecting coding sequence) affecting genes could be clustered by co-occurrence, these could also be used as a reference set for risk factor association with Diffsig. Here we focused on using trinucleotide-based signatures as we referenced COSMIC mutational signatures. We note that this could easily be extended to signatures with a different flanking length (e.g. two flanking basepairs) or size (number of contexts; 96 in COSMIC).
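
To make the context counts concrete, the sketch below enumerates substitution contexts for a given flanking length. The function name and the bracketed label format are illustrative choices; the count of 96 follows from six pyrimidine-referenced substitution classes times four possible bases on each side.

```python
from itertools import product

BASES = "ACGT"
# Six substitution classes, with the reference base written as the pyrimidine
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def mutation_contexts(flank=1):
    """List all substitution contexts with `flank` bases on each side."""
    contexts = []
    for sub in SUBSTITUTIONS:
        for left in product(BASES, repeat=flank):
            for right in product(BASES, repeat=flank):
                contexts.append("".join(left) + "[" + sub + "]" + "".join(right))
    return contexts

trinucleotide = mutation_contexts(flank=1)   # 6 * 4 * 4 = 96 contexts
pentanucleotide = mutation_contexts(flank=2)  # 6 * 16 * 16 = 1536 contexts
print(len(trinucleotide), len(pentanucleotide))
```

Moving from one to two flanking bases multiplies the number of categories by 16, so the mutation counts per category become much sparser for the same number of observed mutations.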

Acknowledgments

The authors would like to acknowledge the following individuals for helpful comments and suggestions on the work: Zhi Yang and Jean Fan. This work was supported by an NIH NIEHS grant P30-ES010126 to M.A. Troester, and an NIH NCI grant R01-CA253450 to K.A. Hoadley and M.A. Troester.

Footnotes

Conflicts of Interest: The authors declare no potential conflicts of interest.

References

1. Friedberg EC. A comprehensive catalogue of somatic mutations in cancer genomes [Internet]. DNA Repair. 2010. page 468–9. Available from: 10.1016/j.dnarep.2010.01.013
2. Nik-Zainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, Raine K, et al. Mutational processes molding the genomes of 21 breast cancers. Cell. 2012;149:979–93.
3. Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, Tian Ng AW, Wu Y, et al. The repertoire of mutational signatures in human cancer. Nature. 2020;578:94–101.
4. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR, Behjati S, Biankin AV, et al. Signatures of mutational processes in human cancer. Nature. 2013;500:415–21.
5. Forbes SA, Beare D, Boutselakis H, Bamford S, Bindal N, Tate J, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 2017;45:D777–83.
6. Rosales RA, Drummond RD, Valieris R, Dias-Neto E, da Silva IT. signeR: an empirical Bayesian approach to mutational signature discovery. Bioinformatics. 2017;33:8–16.
7. Fantini D, Vidimar V, Yu Y, Condello S, Meeks JJ. MutSignatures: an R package for extraction and analysis of cancer mutational signatures. Sci Rep. 2020;10:18217.
8. Gehring JS, Fischer B, Lawrence M, Huber W. SomaticSignatures: inferring mutational signatures from single-nucleotide variants. Bioinformatics. 2015;31:3673–5.
9. Mayakonda A, Lin D-C, Assenov Y, Plass C, Koeffler HP. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Res. 2018;28:1747–56.
10. Carlson J, Li JZ, Zöllner S. Helmsman: fast and efficient mutation signature analysis for massive sequencing datasets. BMC Genomics. 2018;19:845.
11. Lal A, Liu K, Tibshirani R, Sidow A, Ramazzotti D. De novo mutational signature discovery in tumor genomes using SparseSignatures. PLoS Comput Biol. 2021;17:e1009119.
12. Blokzijl F, Janssen R, van Boxtel R, Cuppen E. MutationalPatterns: comprehensive genome-wide analysis of mutational processes. Genome Med. 2018;10:33.
13. Cartolano M, Abedpour N, Achter V, Yang T-P, Ackermann S, Fischer M, et al. CaMuS: simultaneous fitting and de novo imputation of cancer mutational signature. Sci Rep. 2020;10:19316.
14. Fischer A, Illingworth CJR, Campbell PJ, Mustonen V. EMu: probabilistic inference of mutational processes and their localization in the cancer genome. Genome Biol. 2013;14:R39.
15. Shiraishi Y, Tremmel G, Miyano S, Stephens M. A Simple Model-Based Approach to Inferring and Visualizing Cancer Mutation Signatures. PLoS Genet. 2015;11:e1005657.
16. Yang Z, Pandey P, Shibata D, Conti DV, Marjoram P, Siegmund KD. HiLDA: a statistical approach to investigate differences in mutational signatures. PeerJ. 2019;7:e7557.
17. Petljak M, Alexandrov LB, Brammeld JS, Price S, Wedge DC, Grossmann S, et al. Characterizing Mutational Signatures in Human Cancer Cell Lines Reveals Episodic APOBEC Mutagenesis. Cell. 2019;176:1282–94.e20.
18. Letouzé E, Shinde J, Renault V, Couchy G, Blanc J-F, Tubacher E, et al. Mutational signatures reveal the dynamic interplay of risk factors and cellular processes during liver tumorigenesis. Nat Commun. 2017;8:1315.
19. Carpenter Gelman, Hoffman Lee. Stan: A probabilistic programming language. J Stat Econ Meth [Internet]. Available from: https://www.osti.gov/biblio/1430202
20. Venables WN, Ripley BD. Modern Applied Statistics with S. Springer Science & Business Media; 2013.
21. Obenchain V, Lawrence M, Carey V, Gogarten S, Shannon P, Morgan M. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics. 2014;30:2076–8.
22. Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, et al. Molecular portraits of human breast tumours. Nature. 2000;406:747–52.
23. Knijnenburg TA, Wang L, Zimmermann MT, Chambwe N, Gao GF, Cherniack AD, et al. Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas. Cell Rep. 2018;23:239–54.e6.
24. Walens A, Van Alsten SC, Olsson LT, Smith MA, Lockhart A, Gao X, et al. RNA-Based Classification of Homologous Recombination Deficiency in Racially Diverse Patients with Breast Cancer. Cancer Epidemiol Biomarkers Prev. 2022;31:2136–47.
25. Ciriello G, Gatza ML, Beck AH, Wilkerson MD, Rhie SK, Pastore A, et al. Comprehensive Molecular Portraits of Invasive Lobular Breast Cancer. Cell. 2015;163:506–19.
26. Parker JS, Mullins M, Cheang MCU, Leung S, Voduc D, Vickery T, et al. Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes. J Clin Oncol. 2023;41:4192–9.
27. Pitt JJ, Riester M, Zheng Y, Yoshimatsu TF, Sanni A, Oluwasola O, et al. Characterization of Nigerian breast cancer reveals prevalent homologous recombination deficiency and aggressive molecular features. Nat Commun. 2018;9:4181.
28. Takaya H, Nakai H, Takamatsu S, Mandai M, Matsumura N. Homologous recombination deficiency status-based classification of high-grade serous ovarian carcinoma. Sci Rep. 2020;10:2757.
29. Davies H, Glodzik D, Morganella S, Yates LR, Staaf J, Zou X, et al. HRDetect is a predictor of BRCA1 and BRCA2 deficiency based on mutational signatures. Nat Med. 2017;23:517–25.
30. Polak P, Kim J, Braunstein LZ, Karlic R, Haradhvala NJ, Tiao G, et al. A mutational signature reveals alterations underlying deficient homologous recombination repair in breast cancer. Nat Genet. 2017;49:1476–86.
31. Roberts SA, Lawrence MS, Klimczak LJ, Grimm SA, Fargo D, Stojanov P, et al. An APOBEC cytidine deaminase mutagenesis pattern is widespread in human cancers. Nat Genet. 2013;45:970–6.

Data Availability Statement

The results published here are in whole or part based upon data generated by The Cancer Genome Atlas Research Network: https://www.cancer.gov/tcga. The MC3 MAF file used to generate the somatic mutation count data in this paper can be found at the GDC Application Programming Interface (API): http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc. All TCGA risk factors used in the analysis can be found in the supplementary data which includes patient barcode, TCGA barcode, HER2 continuous score, PAM50 subtype, and HRD scores.

Diffsig is publicly available as an R package (https://github.com/jennprk/diffsig).
