Abstract
Numerous studies over the past generation have identified germline variants that increase specific cancer risks. Simultaneously, a revolution in sequencing technology has permitted high-throughput annotations of somatic genomes characterizing individual tumors. However, examining the relationship between germline variants and somatic alteration patterns is hugely challenged by the large numbers of variants in a typical tumor, the rarity of most individual variants, and the heterogeneity of tumor somatic fingerprints. In this article we propose statistical methodology that frames the investigation of germline-somatic relationships in an interpretable manner. The method uses meta-features embodying biological contexts of individual somatic alterations to implicitly group rare mutations. Our team has used this technique previously through a multi-level regression model to diagnose with high accuracy tumor site of origin. Herein, we further leverage topic models from computational linguistics to achieve interpretable lower-dimensional embeddings of the meta-features. We demonstrate how the method can identify distinctive somatic profiles linked to specific germline variants or environmental risk factors. We illustrate the method using The Cancer Genome Atlas whole-exome sequencing data to characterize somatic tumor fingerprints in breast cancer patients with germline BRCA1/2 mutations and in head and neck cancer patients exposed to human papillomavirus.
Keywords: Somatic mutations, germline mutations, germline-somatic associations, multilevel regression modeling, meta-features, topic models
Introduction
Decades of cancer susceptibility studies have identified numerous germline variants that increase cancer risk, including rare high-risk variants in cancer susceptibility genes (Rahman, 2014) and common single nucleotide polymorphisms (SNPs) with smaller effects (Erichsen and Chanock, 2004). These variants have been used to develop and refine approaches and tools for cancer risk stratification and prevention, such as genetic testing (American Society of Clinical Oncology, 2003) and risk prediction models (Chen et al., 2004; Lee et al., 2019; Mavaddat et al., 2019, 2019). In parallel to research on germline variants, many studies have investigated tumor etiology and progression through the analysis of somatic variants (Watson et al., 2013). The mutational processes that give rise to somatic variants can be endogenous, e.g. due to germline variants in DNA repair genes, or exogenous, e.g. due to ultraviolet (UV) exposure. Various patterns based on the nucleotide context categories of single base substitutions (SBS), called mutational signatures, have been extracted using methods such as non-negative matrix factorization (Alexandrov et al., 2020, 2013), and have provided insights into cancer etiology that can help predict prognosis and inform treatment (Brady et al., 2022). The mutational signature framework has also been used to characterize patterns of copy number alterations, which have been shown to play a key role in cancer development and progression (Steele et al., 2022).
Advances in sequencing technology have broadened the range of detectable variants and have led to the increasing availability of large-scale germline and somatic data. While analyses of germline and somatic variants have largely been conducted independently of each other, projects such as The Cancer Genome Atlas (TCGA) have sequenced both germline and somatic genomes, making it possible to investigate the influence of germline variants on somatic features. Studies integrating germline and somatic data have identified associations between germline variants and somatic mutation burden (Namba et al., 2023; Zhu et al., 2016), somatic events in cancer driver genes (Barfield et al., 2022; Carter et al., 2017; Chen et al., 2019; Puzone and Pfeffer, 2017; Zhang et al., 2018), and somatic SBS mutational signatures (Chen et al., 2019; Liu et al., 2022; Middlebrooks et al., 2016; Vali-Pour et al., 2022; Wang et al., 2019). One study has also found associations between germline expression quantitative trait loci and somatic variants in cancer driver genes (Liu et al., 2023). Germline-somatic studies that have examined somatic features beyond overall mutation burden have mainly focused on a limited set of cancer driver genes or predefined SBS signatures. These features capture only a small part of the vast somatic landscape, and there are likely broader somatic patterns driven by germline variation that have yet to be investigated. In this paper, we describe a flexible, supervised topic model approach for discovering novel somatic patterns linked with germline variants or other risk factors.
Our approach integrates multiple types/modalities of somatic variants (e.g. SBS and copy number alterations) and provides uncertainty quantification. To our knowledge, previous approaches for exploring germline-somatic interactions have only focused on predefined somatic features (e.g. correlating known cancer driver genes and established somatic mutational signatures with known germline risk factors). Our approach allows for the discovery of novel somatic signatures associated with germline risk factors by searching the space of genome-wide somatic mutations in a supervised way. While many signature discovery methods exist (Afsari et al., 2021; Alexandrov et al., 2020, 2013; Fischer et al., 2013; Gehring et al., 2015; Lee et al., 2022; Rosales et al., 2016; Steele et al., 2022), they are mostly unsupervised methods that are applied to a single modality of data at a time and they rarely include uncertainty quantification. We are aware of only one supervised method, a tree-based approach for SBS signatures (Afsari et al., 2021), and one multi-modality method, an unsupervised topic model (Funnell et al., 2019), neither of which quantifies the uncertainty associated with the discovered signatures.
Our approach is based on a framework previously developed by our team that uses context-based meta-features to aggregate rare somatic variants in a biologically interpretable way (Chakraborty et al., 2020, 2021b). We applied this framework to the problem of classifying tumor site of origin by fitting a multi-level regularized regression model embedding the meta-features of somatic variants (Chakraborty et al., 2021a). The model achieved high classification accuracy using somatic whole-exome sequencing (WES) data as well as whole-genome sequencing (WGS) data. Recently, we enhanced the framework by developing supervised topic models that induce interpretable dimension reduction of the meta-features, producing tumor site-specific topics analogous to mutational signatures (Chakraborty et al., 2022). In contrast to the most common mutational signature extraction approaches (Alexandrov et al., 2020; Steele et al., 2022), which are unsupervised, our supervised approach extracts patterns that are associated with a known risk factor. Our topic models permit full Bayesian inference (i.e., point estimation and uncertainty quantification) while achieving similar or better performance compared to our original regression approach. In this paper, we extend the topic model approach to allow for multiple modalities of somatic data (e.g. somatic mutations, copy number changes, etc.) and demonstrate how it can be adapted to identify the distinctive somatic profiles that are caused by specific germline mutations or other cancer risk factors, such as environmental exposures.
While our previous work investigated somatic patterns associated with different cancer types, focusing on problems in cancer diagnosis such as determining the primary site in metastatic cancers of uncertain primary origin (Chakraborty et al., 2022), this paper investigates somatic patterns associated with risk factors, with the goal of better understanding germline-somatic interactions and other etiological mechanisms. Using somatic WES data from TCGA, we illustrate our approach with examples characterizing the broad somatic features of tumors in breast cancer patients with germline mutations in susceptibility genes BRCA1 and BRCA2 and in head and neck cancer patients with human papillomavirus (HPV), an environmental risk factor. We also apply the approach to investigate patterns in lung cancer cases carrying a risk allele in CHRNA3, a nicotine receptor gene associated with lung cancer, and melanoma cases carrying a risk allele in MC1R, a pigmentation regulating gene associated with melanoma.
Methods
Bayesian hidden genome topic model: a summary.
Our approach employs the topical hidden genome model, originally formulated to distinguish the cancer sites of tumors based on their somatic profiles (Chakraborty et al., 2022). Below we summarize our methodology leveraging a schematic diagram (Figure 1). We note that our approach expands on Chakraborty et al. (2022) through the inclusion of multiple types of genomic modalities. A more detailed and technical overview of the model, including explanation of the inclusion of multiple modalities, is provided in the Supplementary Materials.
Figure 1:

Schematic plot for the topical hidden genome model for germline/environmental exposure risk prediction with two meta-feature sources: CNprofile (for copy number alteration) and SBS categories (for point mutations). The thin green lines visualize connections between individual variants (e.g., KRAS G12D T.A C>T) and their meta-feature categories (e.g., SBS T.A C>T). The meta-feature regression produces a mutation count matrix cataloging total mutation counts in each meta-feature category across all tumors for every meta-feature source. The resulting mutation count matrices are then used for extraction of meta-feature topics. Each paired solid red arrow represents a topic model step with the back tip of the arrow denoting the meta-feature mutation count matrix and the paired arrowheads denoting the resulting topic attribution and topic composition matrices. The topic modeling is done in a supervised manner, with the cancer risk factor status indicator (outcome variable) of the tumor informing construction of the topics. The blue solid arrow represents a logistic regression connection with the arrowhead pointing to the outcome variable, and the arrow back-tip pointing to a predictor matrix (before normalization by total mutation burden in a tumor). A zigzag symbol in the back-tip implies a “noisy” version of the predictor matrix being used.
Our model utilizes WES somatic mutation data cataloged by alterations of L > 1 types/modalities. In our examples we consider L = 2 modalities, viz., somatic point mutations and segment level copy number alterations. We label the somatic point mutations by their resulting variants and copy number alterations by a combination of chromosome position, log ratio gain or loss, and heterozygosity. The training dataset comprises somatic mutation fingerprints of N tumors whose risk factor status (e.g., presence/absence of germline BRCA1/2 mutation) is known. We focus on binary 0/1 risk factor status; however, our methodology can be directly extended to handle multinomial risk factors leveraging the multi-class topical hidden genome method of Chakraborty et al. (2022). To group the sparse, high-dimensional individual somatic alterations (including all rare variants) for dimensionality reduction, our approach employs mutation contexts embodied in meta-features – higher level features of the mutations themselves – that are available for all individual mutations. In this study we focus on two separate meta-features for each data modality, viz. the gene label itself and SBS categories for point mutations, and cytoband and segment-level total allele-specific CNclass labels (Steele et al., 2022) for copy number alterations (see Data Sources and Description).
The topical hidden genome model unifies two key steps – (a) a regression model linking the individual variants with the meta-features and (b) supervised topic modeling for the meta-feature categories – within a Bayesian multilevel logistic regression model for predicting binary cancer risk status on the basis of tumor somatic mutation fingerprints. The meta-feature regression step (a), visualized by thin green curves in Figure 1, decomposes individual variant effects as meta-feature effects plus residual effects, with the latter being substantial (non-zero) for only a handful of highly discriminative variants (determined via a mutual information-based feature screening approach). This step groups all individual variants sharing mutation contexts and enumerates total counts of mutations per meta-feature category (e.g., each of the 96 SBS categories) in each tumor. The resulting mutation counts across all meta-feature categories (along columns) and all tumors (along rows) are then stored as mutation count matrices tabulated separately for each meta-feature source. In step (b) these mutation count matrices are factorized to obtain topics for meta-feature categories separately for each meta-feature source. Here, a meta-feature topic refers to a discrete probability distribution/vector over all the categories of a specific meta-feature source (e.g., all the 96 SBS categories). Similar to word topics in text mining, which are latent lower dimensional embeddings of ‘similar’ words with common underlying semantics derived from a text corpus, the meta-feature topics represent embeddings of meta-feature categories with an appropriate sense of ‘similarity’.
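As a minimal illustration of the counting performed in step (a), the sketch below tabulates a mutation count matrix for a single meta-feature source. The toy records, the tumor identifiers, and the `count_matrix` helper are hypothetical and are not part of the authors' software; the residual variant effects and the feature screening are omitted.

```python
from collections import Counter

# Hypothetical toy records: (tumor id, individual variant label, SBS category).
records = [
    ("tumor_A", "variant_1", "T.A C>T"),
    ("tumor_A", "variant_2", "T.A C>T"),
    ("tumor_A", "variant_3", "A.G C>A"),
    ("tumor_B", "variant_4", "A.G C>A"),
]
# Only two of the 96 SBS categories are listed to keep the example small.
categories = ["T.A C>T", "A.G C>A"]


def count_matrix(records, categories):
    """Return (tumors, matrix): rows are tumors, columns are meta-feature categories."""
    counts = Counter((tumor, category) for tumor, _variant, category in records)
    tumors = sorted({tumor for tumor, _variant, _category in records})
    matrix = [[counts[(tumor, category)] for category in categories] for tumor in tumors]
    return tumors, matrix


tumors, matrix = count_matrix(records, categories)
for tumor, row in zip(tumors, matrix):
    print(tumor, row)
```

Distinct rare variants that share a mutation context fall into the same column, which is how the meta-features implicitly group them.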
For each meta-feature source the mutation count matrix in step (b) is decomposed into two matrix factors (visualized through red arrows in Figure 1), viz., the topic composition matrix cataloging the discrete probability distributions defining the various topics along its rows and the topic attribution matrix carrying the expected mutation burden in each tumor attributable to these latent topics along each row. This topic decomposition is done in a supervised manner, with a “noisy” version of the topic attribution matrix – a function of the topic composition, topic attribution, and the original count matrix for meta-feature mutations – being used as derived predictor in the (Bayesian lasso regularized) logistic regression for cancer risk prediction. The entire model is formulated as a multilevel Bayesian model (see Supplementary Materials) permitting coherent unification of the two steps described. Implementation of the model is aided by Markov chain Monte Carlo (MCMC) sampling, which facilitates principled point estimation and uncertainty quantification. (See Supplementary Figure 3 for visual exemplification of MCMC convergence and mixing.)
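The shapes and roles of the two matrix factors can be sketched with a toy example. This is only an illustration of the mean structure implied by the description above, assuming that the expected count matrix is the product of the attribution and composition matrices; the numbers are made up, and the actual likelihood, priors, and supervised "noisy" predictor are those given in the Supplementary Materials, not what is shown here.

```python
# Topic composition: one row per topic, each row a probability distribution
# over the meta-feature categories (three hypothetical categories here).
composition = [
    [0.7, 0.2, 0.1],  # topic 1
    [0.1, 0.1, 0.8],  # topic 2
]

# Topic attribution: one row per tumor, giving the expected mutation burden
# attributable to each topic (hypothetical values).
attribution = [
    [10.0, 0.0],  # tumor 1: all burden from topic 1
    [4.0, 6.0],   # tumor 2: mixed
]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Assumed mean structure: expected counts = attribution x composition.
expected_counts = matmul(attribution, composition)

for row_attr, row_exp in zip(attribution, expected_counts):
    # Because every composition row sums to 1, the total expected burden of a
    # tumor equals the sum of its topic attributions.
    print(row_exp, sum(row_exp), sum(row_attr))
```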
Note that the topic model introduces non-identified parameters into the model, making inference directly from the MCMC output challenging. To this end, we use a filtering and clustering based MCMC postprocessing step resembling the Bayesian finite mixture model postprocessing approach (Frühwirth-Schnatter, 2011, 2006; Malsiner-Walli et al., 2016) and consensus NMF approach (Kotliar et al., 2019) and detailed in Chakraborty et al. (2022) for post-hoc reidentification of the topic model parameters from the MCMC output. See Supplementary Materials and Chakraborty et al. (2022) for more detailed information.
After MCMC sampling and subsequent postprocessing, inferences on various model parameters, including the latent meta-feature topic composition and attribution parameters, and logistic regression odds ratio parameters (supervised effects) for the topics and residual variant effects, are obtained using the (postprocessed) posterior draws. Point estimators of the parameters are obtained through posterior means and medians, and interval estimates quantifying uncertainty in estimation are obtained via highest posterior density credible intervals.
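The posterior summaries described above can be computed from a vector of draws as in the following sketch. The draws are simulated here purely for illustration, and `hpd_interval` is a hypothetical helper that returns the shortest interval containing a given fraction of the draws, a common empirical approximation to the highest posterior density interval for unimodal posteriors; it is not taken from the authors' code.

```python
import math
import random
import statistics


def hpd_interval(draws, mass=0.8):
    """Shortest interval that contains a fraction `mass` of the draws."""
    xs = sorted(draws)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of draws the interval must cover
    # Slide a window of k consecutive order statistics and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


# Simulated stand-in for postprocessed posterior draws of one log odds ratio.
random.seed(0)
draws = [random.gauss(1.0, 0.5) for _ in range(2000)]

summary = {
    "posterior_mean": statistics.fmean(draws),
    "posterior_median": statistics.median(draws),
    "hpd_80": hpd_interval(draws, mass=0.8),
}
print(summary)
```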
Benchmarking SBS topics with COSMIC signatures and copy number heterozygosity CNprofile topics.
We note that our approach produces supervised (based on the cancer-risk factor status) meta-feature topics, and depending on the meta-features used, these topics can extend beyond SBS categories as traditionally considered in mutational signature analysis (Alexandrov et al., 2013) and the recently proposed copy number loss of heterozygosity-based CNclass/CNprofile signatures (Steele et al., 2022). Nonetheless, the existing literature on these signatures does provide a rigorous benchmark for our results. To benchmark the SBS topics, we focused on the COSMIC repository (version 3.4 – October 2023) which catalogs 86 different somatic SBS mutational signatures obtained from unsupervised non-negative matrix factorization analyses of TCGA tumors and subsequently shown to be strongly related to environmental risk factors such as smoking and UV radiation exposure or endogenous characteristics such as homologous recombination deficiency (including pathogenic germline mutations in BRCA1/2). To this list of 86 SBS signatures, we added two additional mixture signatures obtained by averaging COSMIC signatures 2 and 13 (both having AID/APOBEC as the proposed etiology) and signatures 4 and 92 (both having tobacco smoking as the proposed etiology). We computed the total variation distance (Levin and Peres, 2017) between the estimated SBS topics obtained from our model and each of the 88 COSMIC signatures/signature mixtures. Subsequently, we identified the “closest” COSMIC signature/signature mixture to each estimated SBS topic for comparison.
To benchmark the copy number heterozygosity-based CNprofile topics, we considered the 24 CNprofile copy number signatures published in Steele et al. (2022) and computed their total variation distances to each estimated topic. Subsequently, for each estimated topic we identified the closest Steele et al. (2022) signature for comparison.
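The benchmarking step can be sketched as follows. For two discrete distributions p and q over the same categories, the total variation distance is half the sum of absolute differences, so it ranges from 0 (identical) to 1 (disjoint support). The four-category vectors below are hypothetical placeholders, not actual COSMIC or Steele et al. (2022) signatures, and the helper names are illustrative.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mixture(p, q, weight=0.5):
    """Convex combination of two signatures (weight=0.5 gives a 50-50 mixture)."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(p, q)]


def closest_signature(topic, signatures):
    """Return (name, distance) of the reference signature nearest to `topic`."""
    name = min(signatures, key=lambda s: tv_distance(topic, signatures[s]))
    return name, tv_distance(topic, signatures[name])


# Hypothetical reference signatures over four categories.
signatures = {
    "sig_A": [0.7, 0.1, 0.1, 0.1],
    "sig_B": [0.1, 0.1, 0.1, 0.7],
}
signatures["sig_A+sig_B mixture"] = mixture(signatures["sig_A"], signatures["sig_B"])

# Hypothetical estimated topic (e.g. a posterior mean of a topic composition row).
topic = [0.45, 0.10, 0.05, 0.40]

best_name, best_distance = closest_signature(topic, signatures)
print(best_name, round(best_distance, 3))
```

Adding the mixture to the reference set lets a topic that blends two related signatures be matched to their average rather than to either component alone.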
Data Sources and Description
We performed four analyses using TCGA (The Cancer Genome Atlas) data: a BRCA1/2 analysis focusing on breast cancer, an HPV analysis focusing on head and neck cancer, a CHRNA3 (nicotine receptor gene) analysis focusing on lung cancer tumors, and an MC1R (skin pigmentation gene) analysis focusing on melanoma.
TCGA germline mutation data.
Pathogenic and likely pathogenic germline variants in BRCA1/2 were obtained from data released by TCGA’s PanCanAtlas working group (Huang et al., 2018), which performed variant calling and classification for TCGA germline WES data. Pathogenic and likely pathogenic variants in cancer susceptibility genes were annotated using the CharGer pipeline. Among TCGA participants with breast cancer, there were 38 cases who had at least one pathogenic or likely pathogenic mutation in BRCA1/2 and 918 non-carriers who were treated as controls.
For CHRNA3 and MC1R, we identified risk SNPs for lung adenocarcinoma and skin melanoma using the NHGRI-EBI GWAS Catalog (Sollis et al., 2022) and focused on two SNPs that were available in the TCGA SNP array data: CHRNA3 SNP rs8040868 (Byun et al., 2018; McKay et al., 2017; Xu et al., 2020) and MC1R SNP rs1805007 (Chahal et al., 2016; Galván-Femenía et al., 2018; Landi et al., 2020; Sarin et al., 2020), respectively. Estimated odds ratios for rs8040868 from lung cancer GWAS ranged from 1.29–1.33 (Byun et al., 2018; McKay et al., 2017) and estimated odds ratios for rs1805007 from cutaneous melanoma GWAS ranged from 1.46–1.63 (Chahal et al., 2016; Landi et al., 2020; Sarin et al., 2020). There were 136 cases (with at least one CHRNA3 rs8040868 risk allele) and 316 controls in the CHRNA3 lung cancer analysis, and 74 cases (with at least one MC1R rs1805007 risk allele) and 362 controls in the MC1R melanoma analysis.
TCGA clinical data.
For the HPV analysis we focused on the TCGA head and neck tumors and mapped their HPV status utilizing the supplementary information provided in Lawrence et al. (2015). There were 33 HPV-positive patients in the TCGA cohort whose tumor samples were used as cases and tumor samples from the remaining 222 patients were treated as controls.
TCGA somatic point mutation data.
We utilized the TCGA whole exome somatic point mutation datasets curated in the GDC Data Portal (https://portal.gdc.cancer.gov/), focusing our attention on breast adenocarcinoma, head and neck carcinoma, lung adenocarcinoma, and skin cutaneous melanoma tumors for the BRCA1/2, HPV, CHRNA3, and MC1R analyses respectively (subsetting on tumors each with >15 copy number probes).
TCGA somatic copy number alteration segmentation data.
The TCGA segmentation data files were parsed to identify copy number alterations at individual chromosome positions. We removed segmentation data rows with low (< 15) numbers of probes. Subsequently, for each tumor and for each chromosome location cataloged in the segmentation data file, we defined a binary 0 – 1 indicator of a copy number gain if the corresponding log ratio was ≥ 0.2 and of a copy number loss if the corresponding log ratio was ≤ −0.2, thus producing a pair of binary 0 – 1 indicators of a copy number gain or loss at that location. The resulting copy number alterations were used to find cytoband-specific counts of alterations (total gains and total losses) per tumor.
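The thresholding and counting can be sketched as below, using the usual convention that a positive log ratio indicates a gain and a negative log ratio a loss. The segment records are hypothetical, the mapping from chromosome position to cytoband is assumed to have been done beforehand, and `cytoband_counts` is an illustrative helper rather than the authors' parser.

```python
from collections import Counter

MIN_PROBES = 15   # segments with fewer probes are discarded
THRESHOLD = 0.2   # absolute log-ratio cut-off for calling a gain or a loss

# Hypothetical segment records: (tumor id, cytoband, number of probes, log ratio).
segments = [
    ("tumor_A", "8q24.21", 120, 0.45),
    ("tumor_A", "8q24.21", 40, 0.30),
    ("tumor_A", "17p13.1", 60, -0.35),
    ("tumor_A", "1p36.33", 8, 0.90),    # too few probes: dropped
    ("tumor_B", "17p13.1", 25, -0.05),  # within (-0.2, 0.2): neither gain nor loss
]


def cytoband_counts(segments):
    """Count gains and losses per (tumor, cytoband) after the probe filter."""
    counts = Counter()
    for tumor, cytoband, n_probes, log_ratio in segments:
        if n_probes < MIN_PROBES:
            continue
        if log_ratio >= THRESHOLD:
            counts[(tumor, cytoband, "gain")] += 1
        elif log_ratio <= -THRESHOLD:
            counts[(tumor, cytoband, "loss")] += 1
    return counts


counts = cytoband_counts(segments)
for key in sorted(counts):
    print(key, counts[key])
```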
TCGA allele-level copy number loss of heterozygosity (LOH) data.
We obtained the allele-level copy number heterozygosity CNclass categories for the TCGA tumors from Supplementary Table 1 of Steele et al. (2022). For each TCGA tumor, the total number of alterations in each of these 48 CNclass meta-feature categories was computed directly from this supplementary table (focusing on the TCGA ASCAT profiles sheet).
Meta-features.
We considered two separate sources of meta-features for each of the somatic point mutations and copy number alterations considered in this study. For somatic point mutations, we considered gene labels – based on a curated list of ~600 cancer genes (Chakravarty et al., 2017) – and single base substitution (SBS) – one of 96 possible categories of nucleotide changes (Alexandrov et al., 2013) – as two separate meta-feature sources. For copy number alterations we considered 782 cytobands (based on log ratio gain/loss), and 48 allele-level heterozygosity based CNclass/CNprofile categories (Steele et al., 2022) as two separate meta-feature sources.
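The count of 96 SBS categories follows from 6 pyrimidine-centered substitution types, each with 4 possible 5' and 4 possible 3' flanking bases (6 × 4 × 4 = 96). The sketch below enumerates them; the label layout `5'.3' REF>ALT` (e.g. `T.A C>T`) is an assumption inferred from the example label in the Figure 1 caption.

```python
from itertools import product

BASES = "ACGT"
# Substitutions are reported with a pyrimidine (C or T) as the reference base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# Assumed label layout: "<5' base>.<3' base> <substitution>".
sbs_categories = [
    f"{five_prime}.{three_prime} {substitution}"
    for substitution in SUBSTITUTIONS
    for five_prime, three_prime in product(BASES, repeat=2)
]

print(len(sbs_categories))
print(sbs_categories[:4])
```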
Benchmark: COSMIC repository of SBS signatures and Steele et al. LOH signatures.
The SBS topics and CNprofile topics considered in our study have important analogies with published SBS mutational signatures (Alexandrov et al., 2013; Alexandrov et al., 2020) and copy number signatures (Steele et al., 2022). To benchmark our results we compared the topics estimated in our study with the 86 SBS mutational signatures publicly deposited in the COSMIC repository (version 3.4 – October 2023) and the 24 CNprofile copy number signatures published in Steele et al. (2022). For each estimated SBS and CNprofile topic we obtained the corresponding “closest” signature – as measured through total variation distance – and noted the corresponding total variation distance.
Results
Below we focus primarily on the results from the BRCA1/2 and HPV analyses. The results from the CHRNA3 and MC1R analyses, which did not yield any somatic patterns with strong associations, are provided in Table 1 and Supplementary Figures 2 and 3. For each model, we evaluated cross-validation classification performance using the areas under the precision-recall curve (AUPRC) and receiver operating characteristic curve (AUROC). We also examined the estimated effect sizes of the identified topics and compared the topics to previously established SBS mutational signatures (Alexandrov et al., 2020, 2013) and copy number signatures (Steele et al., 2022).
Table 1:
AUROCs and AUPRCs for the 4 sets of analyses performed. In each analysis, two types of classifications are conducted with the full (imbalanced) data and random down sample (ensemble of random subsets of controls to ensure a 1:1 case:control ratio) data. In each classification, an overall AUROC, case and control-specific AUPRC, and average (over case and control-specific) AUPRC are computed. The AUROC and AUPRC values obtained across random replications are averaged. The numbers in parentheses denote the corresponding null baseline AUPRC values associated with a classifier that randomly classifies tumors into case and control groups; the analogous baseline value for all AUROC is 0.5.
| Analysis | Subset | AUROC | AUPRC:Case | AUPRC:Control | AUPRC:Average |
|---|---|---|---|---|---|
| BRCA1/2 | full data | 0.83 | 0.17 (0.04) | 0.99 (0.96) | 0.58 (0.50) |
| BRCA1/2 | random down sample | 0.85 | 0.80 (0.50) | 0.86 (0.50) | 0.83 (0.50) |
| HPV | full data | 0.96 | 0.86 (0.13) | 0.99 (0.87) | 0.93 (0.50) |
| HPV | random down sample | 0.96 | 0.96 (0.50) | 0.96 (0.50) | 0.96 (0.50) |
| CHRNA3 | full data | 0.48 | 0.29 (0.30) | 0.68 (0.70) | 0.49 (0.50) |
| CHRNA3 | random down sample | 0.47 | 0.49 (0.50) | 0.48 (0.50) | 0.48 (0.50) |
| MC1R | full data | 0.57 | 0.25 (0.17) | 0.86 (0.83) | 0.55 (0.50) |
| MC1R | random down sample | 0.56 | 0.57 (0.50) | 0.55 (0.50) | 0.56 (0.50) |
Somatic mutation patterns are associated with BRCA1/2 germline mutations and HPV status
BRCA1/2 Analysis.
For this analysis we focused on N = 956 breast cancer tumors in the TCGA data, each with at least 15 copy number alterations. There were 38 tumors with a BRCA1/2 germline mutation (cases) and the remaining 918 tumors were treated as controls. For robust assessment of the overall association between the collective somatic multi-modal-omics alterations and BRCA1/2 mutational status, we focused on the classification performance of the model under 5-fold cross-validation (10 independent replications) as measured by the AUPRC and AUROC. These measures were computed in each cross-validation replicate and subsequently averaged. The results are displayed in Table 1. The overall AUROC was 0.83, which is slightly higher than the AUROC of 0.76 observed in a previous TCGA study that predicted germline BRCA1/2 mutations based on homologous repair deficiency scores (Min et al., 2020). The AUPRC for BRCA1/2-positive cases was 0.17; for reference, the AUPRC of a null model that classifies patients at random would be 0.04 given the case:control ratio in the analysis. To better understand the association/classification performance under a balanced case:control setup, we considered a follow-up analysis utilizing balanced random subsets of the training dataset. Specifically, we considered 100 random subsets of the 918 control tumors, each of size 38, and in each subset we obtained the AUPRC and AUROC from 5-fold cross-validated fits of the proposed topical hidden genome model. The resulting AUPRCs and AUROCs were averaged across random subsets. As shown in Table 1, in the subset analysis we obtained, on average, an AUROC of 0.85 and an AUPRC of 0.83. We note that AUROCs are theoretically identical to AUPRC in balanced comparisons; however, small numerical differences are encountered in practice due to smooth approximations of the precision-recall curve. Together, these results demonstrate a strong classification performance, implying a strong somatic-germline association in BRCA1/2.
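The null baselines and the balanced down-sampling described above can be reproduced in a few lines. In this sketch the case and control counts are those reported for the four analyses, the null AUPRC for the case class is taken to be the proportion of cases, and the tumor identifiers, the `auroc` helper (a pairwise rank-based estimate), and the `balanced_subset` helper are illustrative rather than the authors' implementation.

```python
import random

# (cases, controls) as reported for each analysis.
cohorts = {
    "BRCA1/2": (38, 918),
    "HPV": (33, 222),
    "CHRNA3": (136, 316),
    "MC1R": (74, 362),
}


def null_auprc(n_cases, n_controls):
    """AUPRC of a random classifier for the case class: the case prevalence."""
    return n_cases / (n_cases + n_controls)


def auroc(scores, labels):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def balanced_subset(case_ids, control_ids, rng):
    """All cases plus an equally sized random sample of controls (1:1 ratio)."""
    return case_ids + rng.sample(control_ids, len(case_ids))


baselines = {name: round(null_auprc(*sizes), 2) for name, sizes in cohorts.items()}
print(baselines)

# Hypothetical tumor identifiers for one balanced replicate of the BRCA1/2 analysis.
rng = random.Random(0)
case_ids = [f"case_{i}" for i in range(38)]
control_ids = [f"control_{i}" for i in range(918)]
subset = balanced_subset(case_ids, control_ids, rng)
print(len(subset))
```

Repeating `balanced_subset` with different random draws and averaging the resulting metrics corresponds to the ensemble of 100 random subsets used in the down-sample analysis.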
We followed up with a complete data (i.e., not cross-validation partitioned) analysis, and focused on the estimated topics and their effects as measured through log odds ratio for BRCA1/2 for 1 standard deviation changes in the predictors (Figure 2).
Figure 2:

Somatic mutation patterns obtained from the BRCA1/2 analysis on breast tumors. Panel-A: BRCA1/2 germline mutation log odds ratios (displayed through posterior median and 80% quantile-based posterior credible interval) associated with somatic mutations from various cytoband copy number topics, gene point mutation topics, SBS point mutation topics, and individual variant residual effects displayed as a forest plot. Panel-B: The estimated SBS topic 6 obtained from our model. The posterior-mean-estimated categorical probability distribution over the 96 SBS categories defining topic 6 is displayed as vertical orange bars and the associated (element-wise) 80% posterior quantile-based credible intervals are displayed as whiskers. The COSMIC signature closest (with respect to TV distance) to the estimated topic 6, viz., signature 3, is displayed through blue vertical bars. Panel-C: The estimated CNprofile copy number topics 13 and 4 obtained from our model. The posterior-mean-estimated categorical probability distributions over the 48 CNclass categories defining these topics are displayed as vertical orange bars and the associated (element-wise) 80% posterior quantile-based credible intervals are displayed as whiskers. The Steele et al. (2022) signatures closest (with respect to TV distance) to these topics, viz., signatures CN13 and CN20, are displayed through blue vertical bars.
HPV analysis.
For the HPV analysis, we focused on N = 255 head and neck cancer tumors in the TCGA data. There were 33 HPV-positive patients (cases) and the remaining 222 patients were treated as controls. The results are shown in Table 1. The overall AUROC was 0.96 and the AUPRC was 0.86 for HPV-positive cases; for reference, the AUPRC of a null model that classifies patients at random would be 0.13 given the case:control ratio in the analysis. We also performed an ensemble random down-sample analysis with balanced cases and controls, obtaining an AUROC of 0.96 on average across 100 replicates. We performed a follow-up analysis with the complete data (i.e., not cross-validation partitioned), and focused on the estimated topics and their effects as measured through log odds ratio for HPV for 1 standard deviation changes in the predictors (Figure 3).
Figure 3:

Somatic mutation patterns obtained from the HPV analysis on head and neck tumors. Panel-A: HPV exposure log odds ratios (displayed through posterior median and 80% quantile-based posterior credible interval) associated with somatic mutations from various cytoband copy number topics, CNprofile copy number topics, gene point mutation topics, SBS point mutation topics, and individual variant residual effects displayed as a forest plot. Panel-B: The estimated SBS topics 13 and 4 obtained from our model. The posterior-mean-estimated categorical probability distributions over the 96 SBS categories defining these two topics are displayed as vertical orange bars and the associated (element-wise) 80% posterior quantile-based credible intervals are displayed as whiskers. The COSMIC v3.4 signatures closest (with respect to TV distance) to estimated topics 13 and 4, viz., a 50–50 mixture of signatures 2 and 13 (AID/APOBEC) and a 50–50 mixture of signatures 4 and 92 (tobacco smoking), are displayed through blue vertical bars. Panel-C: Copy number log ratio (CNLR) cytoband topics 6 and 11 estimated from our model. The posterior-mean-estimated categorical probability distributions over the cytoband categories defining these two topics (grouped by chromosome arms and copy number gains/losses) are displayed as vertical bars and the associated (element-wise) 80% posterior quantile-based credible intervals are displayed as grayed regions.
CHRNA3 and MC1R analyses.
For the CHRNA3 analysis we considered N=452 lung adenocarcinoma tumors, of which 136 tumors with at least one CHRNA3 rs8040868 risk allele were considered as cases and the remaining 316 tumors were treated as controls. For the MC1R analysis we focused on N=436 skin melanoma tumors, of which 74 were cases (with at least one MC1R rs1805007 risk allele) and 362 were controls. For robust assessment of overall somatic-germline association, in each analysis we performed the same cross-validation based AUROC and AUPRC evaluation as in the BRCA1/2 and HPV analyses, using the full data and an ensemble of 100 random subsets (each with a 1:1 case:control ratio). These metrics are displayed in Table 1, which shows poor AUROCs and AUPRCs for the CHRNA3 and MC1R analyses, implying poor somatic-germline associations for the SNPs of interest. We also performed complete data analyses analogous to those for BRCA1/2 and HPV. However, none of the estimated effects appeared to be strong (Supplementary Figures 2 and 3), as expected from the poor overall classification performance, and hence we did not pursue a more in-depth investigation of the estimated topics in these two analyses.
SBS patterns identified in BRCA1/2 and HPV resemble well-known somatic mutational signatures
BRCA1/2 Analysis.
We focused on inferring the most influential somatic alteration patterns, described via individual somatic alterations – point mutations and copy number alterations – as well as gene and SBS (point mutation meta-features) and cytoband (copy number meta-features) topics associated with BRCA1/2 germline mutations. For this, we computed the log odds ratio at 1 standard deviation unit of each explanatory variable (individual mutation or meta-feature topic) and summarized the associated (marginal) posterior distributions via posterior medians and 80% quantile-based posterior credible intervals as point and interval estimates. These estimates are displayed in Figure 2A as forest plots. The figure shows that, after taking into account all other somatic multi-modal omics alterations, one estimated SBS topic, viz. Topic 6, demonstrated high germline BRCA1/2 specificity as measured through these log odds ratios. On further inspection, this topic closely resembles (with “closeness” quantified by the total variation distance) SBS mutational signature 3 from the COSMIC repository. Signature 3, which indicates defective homologous recombination-based DNA damage repair, has been linked with BRCA1/2 germline mutations (Alexandrov et al., 2013; Nik-Zainal et al., 2016).
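The posterior summaries used in the forest plots can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `summarize_log_or`, the use of the sample standard deviation, the reading of "80% quantile-based" as the equal-tailed 10th to 90th percentile interval, and the simulated draws are all assumptions.

```python
import random
import statistics


def summarize_log_or(beta_draws, x_values):
    """Posterior median and equal-tailed 80% interval of the log odds ratio per 1 SD of x."""
    sd = statistics.stdev(x_values)          # sample SD of the explanatory variable
    scaled = [b * sd for b in beta_draws]    # log odds ratio at 1 SD unit, per draw
    deciles = statistics.quantiles(scaled, n=10, method="inclusive")
    return {
        "median": statistics.median(scaled),
        "lower": deciles[0],    # 10th percentile
        "upper": deciles[-1],   # 90th percentile
    }


# Toy inputs: simulated posterior draws for one coefficient and
# simulated per-tumor values of the matching explanatory variable.
rng = random.Random(0)
beta_draws = [rng.gauss(0.8, 0.2) for _ in range(2000)]
x_values = [rng.random() for _ in range(300)]

summary = summarize_log_or(beta_draws, x_values)
print(summary)
```

Scaling each draw by the standard deviation before summarizing puts mutation-level and topic-level effects on a common footing, which is what allows them to share one forest plot.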
HPV analysis.
In the HPV analysis, the estimated Topic 4 had a strong negative association with HPV status and was similar to the tobacco smoking SBS signatures (we considered a 50–50 mixture of COSMIC signatures 4 and 92, both having tobacco smoking as the proposed etiology). Since tobacco use is a major risk factor for head and neck cancer (Alexandrov et al., 2020; Lawrence et al., 2015), the tobacco smoking signature is expected to be prevalent in non-HPV head and neck cancers. Estimated Topic 13 was positively associated with HPV status and was similar to a 50–50 mixture of COSMIC SBS mutational signatures 2 and 13, the APOBEC signatures, which are closely related to HPV (Henderson et al., 2014). We emphasize, however, that these topics were estimated via our supervised modeling approach, unlike the COSMIC signatures, which emerged from an unsupervised strategy.
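The comparison of an estimated topic with reference signatures and their 50–50 mixtures via total variation distance can be sketched as below. The reference names (`ref_A`, `ref_B`, `ref_C`), the 4-category toy distributions (real SBS topics have 96 categories), and the exhaustive search over single references and pairs are illustrative assumptions; the article does not describe how candidate mixtures were chosen.

```python
from itertools import combinations


def tv_distance(p, q):
    """Total variation distance between two categorical distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mix(p, q, w=0.5):
    """Convex mixture w*p + (1-w)*q; w=0.5 gives the 50-50 mixture."""
    return [w * a + (1 - w) * b for a, b in zip(p, q)]


def closest_reference(topic, refs):
    """Return (names, distance) for the single reference or 50-50 pair nearest to topic."""
    candidates = {(name,): dist for name, dist in refs.items()}
    for (n1, d1), (n2, d2) in combinations(refs.items(), 2):
        candidates[(n1, n2)] = mix(d1, d2)
    return min(
        ((names, tv_distance(topic, dist)) for names, dist in candidates.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical reference signatures over 4 categories.
refs = {
    "ref_A": [0.7, 0.1, 0.1, 0.1],
    "ref_B": [0.1, 0.7, 0.1, 0.1],
    "ref_C": [0.25, 0.25, 0.25, 0.25],
}
topic = [0.4, 0.4, 0.1, 0.1]  # hypothetical estimated topic

names, dist = closest_reference(topic, refs)
print(names, round(dist, 4))
```

Because the total variation distance is bounded between 0 and 1, it gives a directly interpretable scale for how closely a supervised topic matches a reference signature.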
Estimated copy number CNprofile patterns show similarity with Steele et al. signatures
In the BRCA1/2 analysis, we also identified two prominent copy number CNprofile topics, viz. Topic 13 and Topic 4 with positive specificities for BRCA1/2. These two topics reflect high prevalence of LOH 1 and het 2 (Topic 13) and het 3–4 (Topic 4) alterations as defined in Steele et al. (2022), and have similarities with signatures CN13 and CN20 as identified in Steele et al. (2022). In the HPV analysis, two copy number log ratio cytoband topics were identified with strong positive effects toward HPV, but no HPV-specific CNprofile topics were identified.
Discussion
Detailed characterization of subgroups of tumors is an important contemporary challenge that has the potential to guide the biological understanding of the differential ways in which cancers evolve. This task is challenged by the vastness of the human genome and the large numbers of somatic alterations that accrue in individual tumors, the majority of which are extremely rare across tumors. In this work we have mapped out a practical computational strategy that is thoroughly grounded in classical statistical theory, but also merges this theory with the great contemporary interest in multidimensional mutational and copy number signatures. We have shown that the method is computationally feasible using modern computational Bayesian methods. Further, in our examination of somatic profiles specific to breast cancers characterized by BRCA1/2 germline mutations and to head and neck cancers characterized by HPV we have shown that the method, even in a study with very small sample sizes (e.g. 38 BRCA1/2 carriers), can produce point mutation and copy number patterns (topics) that are closely aligned with well-known signatures, giving us confidence in the promise of the methodology. In particular, the BRCA1/2 analysis successfully reproduced COSMIC signature 3, the homologous recombination deficiency signature, which is known to be challenging to distinguish due to its flat mutational spectrum (Alexandrov et al., 2020; Pancotti et al., 2023; Rosenthal et al., 2016). We emphasize that our topic estimation uses a supervised modeling approach that includes coherent uncertainty quantification, unlike the benchmark COSMIC signatures and copy number signatures that emerged from an unsupervised point estimation strategy (Alexandrov et al., 2020, 2013; Steele et al., 2022). 
Moreover, while the benchmark point mutation and copy number signatures are obtained from separate single data modality analyses, our approach allows for unified joint inference of topics from multiple data modalities simultaneously. Integrating multiple data modalities into a single model can potentially improve statistical efficiency and reduce confounding. Funnell et al. (2019) previously proposed a topic model integrating somatic point mutations and structural variation from WGS, but their approach is unsupervised and does not incorporate risk factor information that could potentially improve the identification of signatures associated with rare risk factors.
Besides BRCA1/2, we investigated two other germline risk factors, the rs8040868 SNP and the rs1805007 SNP, which have been linked to lung cancer and melanoma respectively, but these analyses did not yield any strong germline-somatic associations. Since these SNPs have relatively modest effect sizes, it is possible that larger sample sizes or use of combinations of SNPs instead of individual SNPs is necessary for detecting associations.
Topics for future research include extending the model to allow for continuous risk factors and investigating interactions between risk factors and other covariates. Our model currently handles categorical risk factors, but many risk factors are continuous, including polygenic risk scores that have more predictive power than the SNPs we examined in our analyses. In our analyses, we assumed that the molecular effects of the risk factors of interest are not modulated by demographic variables such as race/ethnicity, which is consistent with the current literature on mutational signatures. However, potential interactions between risk factors and other covariates could be investigated further, especially in more diverse cohorts, by including the other covariates as additional supervising variables. In general, examination of individual germline variants using our methodology will necessarily be challenged by small sample sizes since highly penetrant germline mutations are invariably rare and common germline mutations have low penetrance due to negative selection. An important future challenge is to investigate if rare germline variants can be meaningfully aggregated, analogous to our focus in this model on the aggregation of rare somatic variants, in order to permit broader analyses with greater statistical power. Of course, the strategy we presented also has applicability in investigating important environmental cancer risk factors for which sample size restrictions are not a concern, such as smoking, exposure to UV or other radiation, infections, etc., and for which the identification of characteristic mutational signatures can contribute to our understanding of biological mechanisms.
Supplementary Material
Supplementary Figure 1. Somatic mutation patterns obtained from the CHRNA3 analysis on lung adenocarcinoma tumors. CHRNA3 germline mutation log odds ratios (displayed through posterior median and 80% quantile-based posterior credible interval) associated with somatic mutations from various cytoband copy number topics, gene point mutation topics, SBS point mutation topics, and individual variant residual effects displayed as a forest plot.
Supplementary Figure 2. Somatic mutation patterns obtained from the MC1R analysis on skin melanoma tumors. MC1R germline mutation log odds ratios (displayed through posterior median and 80% quantile-based posterior credible interval) associated with somatic mutations from various cytoband copy number topics, gene point mutation topics, SBS point mutation topics, and individual variant residual effects displayed as a forest plot.
Supplementary Figure 3. MCMC trace plots to demonstrate convergence. Iteration-wise draws (post burn-in and thinning) for the first few elements of the identified topic model parameter function HW, and computed iteration-wise log-likelihoods (for the logistic and logistic + topic model layers in the multilevel model) from the 10 independent MCMC chains used in the HPV analysis are displayed as examples. All traces look reasonably stabilized, indicating convergence of the MCMC chains.
Acknowledgements
The research was supported by the National Cancer Institute, awards CA270428, CA251339 and CA008748.
Footnotes
Software Availability
Software implementing the model (specifically, a generalized version of the model handling multinomial outcomes), data analysis R scripts, associated documentation, and processed data used in our study are included as a supplementary zip file.
Data Availability
The data that support the findings of this study are collected from several openly available databases – see Data Sources and Description for detailed references. Processed data utilized in our analyses are available in the supplementary material of this article.
References
- Afsari B, Kuo A, Zhang Y, Li L, Lahouel K, Danilova L, Favorov A, Rosenquist TA, Grollman AP, Kinzler KW, Cope L, Vogelstein B, Tomasetti C, 2021. Supervised mutational signatures for obesity and other tissue-specific etiological factors in cancer. eLife 10, e61082. doi:10.7554/eLife.61082
- Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, Tian Ng AW, Wu Y, Boot A, Covington KR, Gordenin DA, Bergstrom EN, Islam SA, 2020. The repertoire of mutational signatures in human cancer. Nature 578, 94–101.
- Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Børresen-Dale AL, Boyault S, 2013. Signatures of mutational processes in human cancer. Nature 500, 415–421.
- Barfield R, Qu C, Steinfelder RS, Zeng C, Harrison TA, Brezina S, Buchanan DD, Campbell PT, Casey G, Gallinger S, Giannakis M, Gruber SB, Gsur A, Hsu L, Huyghe JR, Moreno V, Newcomb PA, Ogino S, Phipps AI, Slattery ML, Thibodeau SN, Trinh QM, Toland AE, Hudson TJ, Sun W, Zaidi SH, Peters U, 2022. Association between germline variants and somatic mutations in colorectal cancer. Sci Rep 12, 10207. doi:10.1038/s41598-022-14408-2
- Brady SW, Gout AM, Zhang J, 2022. Therapeutic and prognostic insights from the analysis of cancer mutational signatures. Trends in Genetics 38, 194–208.
- Byun J, Schwartz AG, Lusk C, Wenzlaff AS, Andrade M, Mandal D, Gaba C, Yang P, You M, Kupert EY, 2018. Genome-wide association study of familial lung cancer. Carcinogenesis 39, 1135–1140.
- Carter H, Marty R, Hofree M, Gross AM, Jensen J, Fisch KM, Wu X, DeBoever C, Nostrand EL, Song Y, Wheeler E, 2017. Interaction landscape of inherited polymorphisms with somatic events in cancer. Cancer discovery 7, 410–423.
- Chahal HS, Lin Y, Ransohoff KJ, Hinds DA, Wu W, Dai H-J, Qureshi AA, Li W-Q, Kraft P, Tang JY, 2016. Genome-wide association study identifies novel susceptibility loci for cutaneous squamous cell carcinoma. Nature Communications 7, 12048.
- Chakraborty S, Begg CB, Shen R, 2020. Using the “Hidden” genome to improve classification of cancer types. Biometrics. doi:10.1111/biom.13367
- Chakraborty S, Ecker BL, Seier K, Aveson VG, Balachandran VP, Drebin JA, D’Angelica MI, Kingham TP, Sigel CS, Soares KC, Vakiani E, Wei AC, Chandwani R, Gonen M, Shen R, Jarnagin WR, 2021a. Genome-Derived Classification Signature for Ampullary Adenocarcinoma to Improve Clinical Cancer Care. Clin Cancer Res. doi:10.1158/1078-0432.CCR-21-1906
- Chakraborty S, Guan Z, Begg CB, Shen R, 2022. Topical Hidden Genome: Discovering Latent Cancer Mutational Topics using a Bayesian Multilevel Context-learning Approach.
- Chakraborty S, Martin A, Guan Z, Begg CB, Shen R, 2021b. Mining mutation contexts across the cancer genome to map tumor site of origin. Nat Commun 12, 3051. doi:10.1038/s41467-021-23094-z
- Chakravarty D, Gao J, Phillips S, Kundra R, Zhang H, Wang J, Rudolph JE, Yaeger R, Soumerai T, Nissan MH, Others, 2017. OncoKB: a precision oncology knowledge base. JCO precision oncology 1, 1–16.
- Chen S, Wang W, Broman KW, Katki HA, Parmigiani G, 2004. BayesMendel: an R environment for Mendelian risk prediction. Statistical applications in genetics and molecular biology 3.
- Chen Z, Wen W, Beeghly-Fadiel A, Shu X, Díez-Obrero V, Long J, Bao J, Wang J, Liu Q, Cai Q, Moreno V, Zheng W, Guo X, 2019. Identifying Putative Susceptibility Genes and Evaluating Their Associations with Somatic Mutations in Human Cancers. Am J Hum Genet 105, 477–492. doi:10.1016/j.ajhg.2019.07.006
- Erichsen HC, Chanock SJ, 2004. SNPs in cancer research and treatment. British journal of cancer 90, 747–751.
- Fischer A, Illingworth CJ, Campbell PJ, Mustonen V, 2013. EMu: probabilistic inference of mutational processes and their localization in the cancer genome. Genome Biology 14, R39. doi:10.1186/gb-2013-14-4-r39
- Frühwirth-Schnatter S, 2011. Dealing with label switching under model uncertainty, in: Mengersen KL, Robert C, Titterington M (Eds.), Mixtures: Estimation and Applications. John Wiley & Sons and Citeseer, pp. 213–239.
- Frühwirth-Schnatter S, 2006. Finite mixture and Markov switching models. Springer.
- Funnell T, Zhang AW, Grewal D, McKinney S, Bashashati A, Wang YK, Shah SP, 2019. Integrated structural variation and point mutation signatures in cancer genomes using correlated topic models. PLOS Computational Biology 15, e1006799. doi:10.1371/journal.pcbi.1006799
- Galván-Femenía I, Obón-Santacana M, Piñeyro D, Guindo-Martinez M, Duran X, Carreras A, Pluvinet R, Velasco J, Ramos L, Aussó S, 2018. Multitrait genome association analysis identifies new susceptibility genes for human anthropometric variation in the GCAT cohort. Journal of Medical Genetics 55, 765–778.
- Gehring JS, Fischer B, Lawrence M, Huber W, 2015. SomaticSignatures: inferring mutational signatures from single-nucleotide variants. Bioinformatics 31, 3673–3675.
- Huang K-L, Mashl RJ, Wu Y, Ritter DI, Wang J, Oh C, Paczkowska M, Reynolds S, Wyczalkowski MA, Oak N, Scott AD, Krassowski M, Cherniack AD, Houlahan KE, Jayasinghe R, Wang L-B, Zhou DC, Liu D, Cao S, Kim YW, Koire A, McMichael JF, Hucthagowder V, Kim T-B, Hahn A, Wang C, McLellan MD, Al-Mulla F, Johnson KJ, Cancer Genome Atlas Research Network, Lichtarge O, Boutros PC, Raphael B, Lazar AJ, Zhang W, Wendl MC, Govindan R, Jain S, Wheeler D, Kulkarni S, Dipersio JF, Reimand J, Meric-Bernstam F, Chen K, Shmulevich I, Plon SE, Chen F, Ding L, 2018. Pathogenic Germline Variants in 10,389 Adult Cancers. Cell 173, 355–370.e14. doi:10.1016/j.cell.2018.03.039
- Kotliar D, Veres A, Nagy MA, Tabrizi S, Hodis E, Melton DA, Sabeti PC, 2019. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. eLife 8, e43803. doi:10.7554/eLife.43803
- Landi MT, Bishop DT, MacGregor S, Machiela MJ, Stratigos AJ, Ghiorzo P, Brossard M, Calista D, Choi J, Fargnoli MC, 2020. Genome-wide association meta-analyses combining multiple risk phenotypes provide insights into the genetic architecture of cutaneous melanoma susceptibility. Nature Genetics 52, 494–504.
- Lawrence MS, Sougnez C, Lichtenstein L, Cibulskis K, Lander E, Gabriel SB, Getz G, Ally A, Balasundaram M, Birol I, Bowlby R, Brooks D, Butterfield YSN, Carlsen R, Cheng D, Chu A, Dhalla N, Guin R, Holt RA, Jones SJM, Lee D, Li HI, Marra MA, Mayo M, Moore RA, Mungall AJ, Gordon Robertson A, Schein JE, Sipahimalani P, Tam A, Thiessen N, Wong T, Protopopov A, Santoso N, Lee S, Parfenov M, Zhang Jianhua, Mahadeshwar HS, Tang J, Ren X, Seth S, Haseley P, Zeng D, Yang Lixing, Xu AW, Song X, Pantazi A, Bristow CA, Hadjipanayis A, Seidman J, Chin L, Park PJ, Kucherlapati R, Akbani R, Casasent T, Liu W, Lu Y, Mills G, Motter T, Weinstein J, Diao L, Wang J, Hong Fan Y, Liu J, Wang K, Todd Auman J, Balu S, Bodenheimer T, Buda E, Neil Hayes D, Hoadley KA, Hoyle AP, Jefferys SR, Jones CD, Kimes PK, Liu Yufeng, Marron JS, Meng S, Mieczkowski PA, Mose LE, Parker JS, Perou CM, Prins JF, Roach J, Shi Y, Simons JV, Singh D, Soloway MG, Tan D, Veluvolu U, Walter V, Waring S, Wilkerson MD, Wu J, Zhao N, Cherniack AD, Hammerman PS, Tward AD, Sekhar Pedamallu C, Saksena G, Jung J, Ojesina AI, Carter SL, Zack TI, Schumacher SE, Beroukhim R, Freeman SS, Meyerson M, Cho J, Chin L, Getz G, Noble MS, DiCara D, Zhang H, Heiman DI, Gehlenborg N, Voet D, Lin P, Frazer S, Stojanov P, Liu Yingchun, Zou L, Kim J, Sougnez C, Gabriel SB, Lawrence MS, Muzny D, Doddapaneni H, Kovar C, Reid J, Morton D, Han Y, Hale W, Chao H, Chang K, Drummond JA, Gibbs RA, Kakkar N, Wheeler D, Xi L, Ciriello G, Ladanyi M, Lee W, Ramirez R, Sander C, Shen R, Sinha R, Weinhold N, Taylor BS, Arman Aksoy B, Dresdner G, Gao J, Gross B, Jacobsen A, Reva B, Schultz N, Onur Sumer S, Sun Y, Chan TA, Morris LG, Stuart J, Benz S, Ng S, Benz C, Yau C, Baylin SB, Cope L, Danilova L, Herman JG, Bootwalla M, Maglinte DT, Laird PW, Triche T, Weisenberger DJ, Van Den Berg DJ, Agrawal N, Bishop J, Boutros PC, Bruce JP, Averett Byers L, Califano J, Carey TE, Chen Z, Cheng H, Chiosea SI, Cohen E, Diergaarde B, Marie Egloff A, 
El-Naggar AK, Ferris RL, Frederick MJ, Grandis JR, Guo Y, Haddad RI, Hammerman PS, Harris T, Neil Hayes D, Hui ABY, Jack Lee J, Lippman SM, Liu F-F, McHugh JB, Myers J, Kwok Shing Ng P, Perez-Ordonez B, Pickering CR, Prystowsky M, Romkes M, Saleh AD, Sartor MA, Seethala R, Seiwert TY, Si H, Tward AD, Van Waes C, Waggott DM, Wiznerowicz M, Yarbrough WG, Zhang Jiexin, Zuo Z, Burnett K, Crain D, Gardner J, Lau K, Mallery D, Morris S, Paulauskis J, Penny R, Shelton C, Shelton T, Sherman M, Yena P, Black AD, Bowen J, Frick J, Gastier-Foster JM, Harper HA, Leraas K, Lichtenberg TM, Ramirez NC, Wise L, Zmuda E, Baboud J, Jensen MA, Kahn AB, Pihl TD, Pot DA, Srinivasan D, Walton JS, Wan Y, Burton RA, Davidsen T, Demchok JA, Eley G, Ferguson ML, Mills Shaw KR, Ozenberger BA, Sheth M, Sofia HJ, Tarnuzzer R, Wang Z, Yang Liming, Claude Zenklusen J, Saller C, Tarvin K, Chen C, Bollag R, Weinberger P, Golusiński W, Golusiński P, Ibbs M, Korski K, Mackiewicz A, Suchorska W, Szybiak B, Wiznerowicz M, Burnett K, The Cancer Genome Atlas Network, Genome sequencing centre: Broad Institute, Genome characterization and data analysis centres: BC Cancer Agency, Harvard Medical School/Brigham & Women’s Hospital/MD Anderson Cancer Center, The University of Texas MD Anderson Cancer Center, University of Kentucky, University of North Carolina at Chapel Hill, Broad Institute, Baylor College of Medicine, Memorial Sloan-Kettering Cancer Center, University of California Santa Cruz/Buck Institute, Johns Hopkins University/Sidney Kimmel Comprehensive Cancer Center, University of Southern California, Disease working group, Biospecimen core resource: International Genomics Consortium, Nationwide Children’s Hospital, Data coordinating centre, Project office, Tissue source sites: Analytical Biological Services, Fred Hutchinson Cancer Research Center, Georgia Regents University, Greater Poland Cancer Centre, International Genomics Consortium, 2015. 
Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature 517, 576–582. doi:10.1038/nature14129
- Lee A, Mavaddat N, Wilcox AN, Cunningham AP, Carver T, Hartley S, Villiers C, Izquierdo A, Simard J, Schmidt MK, Walter FM, 2019. BOADICEA: a comprehensive breast cancer risk prediction model incorporating genetic and nongenetic risk factors. Genetics in Medicine 21, 1708–1718.
- Lee D, Wang D, Yang XR, Shi J, Landi MT, Zhu B, 2022. SUITOR: Selecting the number of mutational signatures through cross-validation. PLoS Comput Biol 18, e1009309. doi:10.1371/journal.pcbi.1009309
- Levin DA, Peres Y, 2017. Markov chains and mixing times. American Mathematical Soc.
- Liu Y, Gusev A, Heng YJ, Alexandrov LB, Kraft P, 2022. Somatic mutational profiles and germline polygenic risk scores in human cancer. Genome Medicine 14, 1–16.
- Liu Y, Gusev A, Kraft P, 2023. Germline cancer gene expression quantitative trait loci are associated with local and global tumor mutations. Cancer Research 83, 1191–1202.
- Malsiner-Walli G, Frühwirth-Schnatter S, Grün B, 2016. Model-based clustering based on sparse finite Gaussian mixtures. Stat Comput 26, 303–324. doi:10.1007/s11222-014-9500-2
- Mavaddat N, Michailidou K, Dennis J, Lush M, Fachal L, Lee A, Tyrer JP, Chen TH, Wang Q, Bolla MK, Yang X, 2019. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. The American Journal of Human Genetics 104, 21–34.
- McKay JD, Hung RJ, Han Y, Zong X, Carreras-Torres R, Christiani DC, Caporaso NE, Johansson M, Xiao X, Li Y, 2017. Large-scale association analysis identifies new lung cancer susceptibility loci and heterogeneity in genetic susceptibility across histological subtypes. Nature Genetics 49, 1126–1132.
- Middlebrooks CD, Banday AR, Matsuda K, Udquim K-I, Onabajo OO, Paquin A, Figueroa JD, Zhu B, Koutros S, Kubo M, Shuin T, Freedman ND, Kogevinas M, Malats N, Chanock SJ, Garcia-Closas M, Silverman DT, Rothman N, Prokunina-Olsson L, 2016. Association of germline variants in the APOBEC3 region with cancer risk and enrichment with APOBEC-signature mutations in tumors. Nat Genet 48, 1330–1338. doi:10.1038/ng.3670
- Min A, Kim K, Jeong K, Choi S, Kim Seongyeong, Suh KJ, Lee K-H, Kim Sun, Im S-A, 2020. Homologous repair deficiency score for identifying breast cancers with defective DNA damage response. Sci Rep 10, 12506. doi:10.1038/s41598-020-68176-y
- Namba S, Saito Y, Kogure Y, Masuda T, Bondy ML, Gharahkhani P, Gockel I, Heider D, Hillmer A, Jankowski J, MacGregor S, Maj C, Melin B, Ostrom QT, Palles C, Schumacher J, Tomlinson I, Whiteman DC, Okada Y, Kataoka K, 2023. Common Germline Risk Variants Impact Somatic Alterations and Clinical Features across Cancers. Cancer Res 83, 20–27. doi:10.1158/0008-5472.CAN-22-1492
- Nik-Zainal S, Davies H, Staaf J, Ramakrishna M, Glodzik D, Zou X, Martincorena I, Alexandrov LB, Martin S, Wedge DC, Van Loo P, Ju YS, Smid M, Brinkman AB, Morganella S, Aure MR, Lingjærde OC, Langerød A, Ringnér M, Ahn S-M, Boyault S, Brock JE, Broeks A, Butler A, Desmedt C, Dirix L, Dronov S, Fatima A, Foekens JA, Gerstung M, Hooijer GKJ, Jang SJ, Jones DR, Kim H-Y, King TA, Krishnamurthy S, Lee HJ, Lee J-Y, Li Y, McLaren S, Menzies A, Mustonen V, O’Meara S, Pauporté I, Pivot X, Purdie CA, Raine K, Ramakrishnan K, Rodríguez-González FG, Romieu G, Sieuwerts AM, Simpson PT, Shepherd R, Stebbings L, Stefansson OA, Teague J, Tommasi S, Treilleux I, Van den Eynden GG, Vermeulen P, Vincent-Salomon A, Yates L, Caldas C, van’t Veer L, Tutt A, Knappskog S, Tan BKT, Jonkers J, Borg Å, Ueno NT, Sotiriou C, Viari A, Futreal PA, Campbell PJ, Span PN, Van Laere S, Lakhani SR, Eyfjord JE, Thompson AM, Birney E, Stunnenberg HG, van de Vijver MJ, Martens JWM, Børresen-Dale A-L, Richardson AL, Kong G, Thomas G, Stratton MR, 2016. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54. doi:10.1038/nature17676
- Pancotti C, Rollo C, Birolo G, Benevenuta S, Fariselli P, Sanavia T, 2023. Unravelling the instability of mutational signatures extraction via archetypal analysis. Frontiers in Genetics 13, 1049501.
- Puzone R, Pfeffer U, 2017. SNP variants at the MAP3K1/SETD9 locus 5q11.2 associate with somatic PIK3CA variants in breast cancers. Eur J Hum Genet 25, 384–387. doi:10.1038/ejhg.2016.179
- Rahman N, 2014. Realizing the promise of cancer predisposition genes. Nature 505, 302–308. doi:10.1038/nature12981
- Rosales RA, Drummond RD, Valieris R, Dias-Neto E, da Silva IT, 2016. signeR: an empirical Bayesian approach to mutational signature discovery. Bioinformatics 33, 8–16. doi:10.1093/bioinformatics/btw572
- Rosenthal R, McGranahan N, Herrero J, Taylor BS, Swanton C, 2016. DeconstructSigs: Delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biology 17, 1–11.
- Sarin KY, Lin Y, Daneshjou R, Ziyatdinov A, Thorleifsson G, Rubin A, Pardo LM, Wu W, Khavari PA, Uitterlinden A, 2020. Genome-wide meta-analysis identifies eight new susceptibility loci for cutaneous squamous cell carcinoma. Nature Communications 11, 820.
- Sollis E, Mosaku A, Abid A, Buniello A, Cerezo M, Gil L, Groza T, Güneş O, Hall P, Hayhurst J, Ibrahim A, Ji Y, John S, Lewis E, MacArthur JAL, McMahon A, Osumi-Sutherland D, Panoutsopoulou K, Pendlington Z, Ramachandran S, Stefancsik R, Stewart J, Whetzel P, Wilson R, Hindorff L, Cunningham F, Lambert SA, Inouye M, Parkinson H, Harris LW, 2022. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Res 51, D977–D985. doi:10.1093/nar/gkac1010
- Steele CD, Abbasi A, Islam SMA, Bowes AL, Khandekar A, Haase K, Hames-Fathi S, Ajayi D, Verfaillie A, Dhami P, McLatchie A, Lechner M, Light N, Shlien A, Malkin D, Feber A, Proszek P, Lesluyes T, Mertens F, Pillay N, 2022. Signatures of copy number alterations in human cancer. Nature 606, 984–991. doi:10.1038/s41586-022-04738-6
- Vali-Pour M, Lehner B, Supek F, 2022. The impact of rare germline variants on human somatic mutation processes. Nature Communications 13, 3724.
- Wang S, Pitt JJ, Zheng Y, Yoshimatsu TF, Gao G, Sanni A, Oluwasola O, Ajani M, Fitzgerald D, Odetunde A, Khramtsova G, 2019. Germline variants and somatic mutation signatures of breast cancer across populations of African and European ancestry in the US and Nigeria. International Journal of Cancer 145, 3321–3333.
- Watson IR, Takahashi K, Futreal PA, Chin L, 2013. Emerging patterns of somatic mutations in cancer. Nature reviews Genetics 14, 703–718.
- Xu K, Li B, McGinnis KA, Vickers-Smith R, Dao C, Sun N, Kember RL, Zhou H, Becker WC, Gelernter J, 2020. Genome-wide association study of smoking trajectory and meta-analysis of smoking status in 842,000 individuals. Nature Communications 11, 5302.
- Zhang X, Wang Y, Tian T, Zhou G, Jin G, 2018. Germline genetic variants were interactively associated with somatic alterations in gastric cancer. Cancer Med 7, 3912–3920. doi:10.1002/cam4.1612
- Zhu B, Mukherjee A, Machiela MJ, Song L, Hua X, Shi J, Garcia-Closas M, Chanock SJ, Chatterjee N, 2016. An investigation of the association of genetic susceptibility risk with somatic mutation burden in breast cancer. British journal of cancer 115, 752–760.
