MuSiC: Identifying mutational significance in cancer genomes

Nathan D Dees; Qunyuan Zhang; Cyriac Kandoth; Michael C Wendl; William Schierding; Daniel C Koboldt; Thomas B Mooney; Matthew B Callaway; David Dooling; Elaine R Mardis; Richard K Wilson; Li Ding

doi:10.1101/gr.134635.111

Genome Res. 2012 Aug;22(8):1589–1598. doi: 10.1101/gr.134635.111

PMCID: PMC3409272 PMID: 22759861

Abstract

Massively parallel sequencing technology and the associated rapidly decreasing sequencing costs have enabled systemic analyses of somatic mutations in large cohorts of cancer cases. Here we introduce a comprehensive mutational analysis pipeline that uses standardized sequence-based inputs along with multiple types of clinical data to establish correlations among mutation sites, affected genes and pathways, and to ultimately separate the commonly abundant passenger mutations from the truly significant events. In other words, we aim to determine the Mutational Significance in Cancer (MuSiC) for these large data sets. The integration of analytical operations in the MuSiC framework is widely applicable to a broad set of tumor types and offers the benefits of automation as well as standardization. Herein, we describe the computational structure and statistical underpinnings of the MuSiC pipeline and demonstrate its performance using 316 ovarian cancer samples from the TCGA ovarian cancer project. MuSiC correctly confirms many expected results, and identifies several potentially novel avenues for discovery.

The continued advancement of DNA sequencing technologies (Mardis 2011) now allows for the rapid sequencing of large sets of cancer cases (matched tumor and normal samples) for the purposes of mutation discovery. This technological progression has shifted the emphasis in cancer genomics from the analysis of a single patient sample to that of hundreds of patient samples across a broad range of tumor types. Such an expansion of scope facilitates the ascertainment of recurrent mutations within genes and functional pathways. Additionally, the increasing scope permits correlating mutations and pathways with clinical phenotypes where appropriate clinical data exist. The outcome of such correlation can include the identification of prognostic or diagnostic markers or the identification of actionable targets for developing therapeutic options that may inform clinical trials development.

To this end, we present a packaged suite of comprehensive, user-friendly tools designed to determine mutational significance in cancer (MuSiC). The primary goal of MuSiC is to use a variety of statistical methods to separate the significant events, which are likely drivers of disease, from the passenger mutations present in mutational discovery sets. This package provides unique practical advantages over existing software and requires only a few basic input elements: mapped reads in BAM format, predicted or validated single nucleotide variants (SNVs) and indels in mutation annotation format (MAF), a set of regions of interest (typically the boundaries of coding exons), and any relevant numeric and/or categorical clinical data. Usage is straightforward. With a single command, a user can (1) apply statistical methods across the cohort to identify significantly mutated genes, (2) identify significantly altered pathways and gene sets, (3) investigate the proximity of amino acid mutations within the same gene, (4) search for gene-based or site-based relationships and correlations between the mutations themselves, (5) correlate mutations to clinical features, and (6) cross-reference the findings with relevant databases, such as Pfam (Finn et al. 2010), COSMIC (Forbes et al. 2008), and OMIM (McKusick 1998). These functions can be accessed individually or run as an automated serial implementation.

To illustrate performance, we tested the MuSiC suite on an exome capture data set consisting of 316 cases of ovarian carcinoma (OV) that were previously described by the TCGA Research Network (The Cancer Genome Atlas Research Network 2011). The original analysis provided statistically supported lists of significantly mutated genes, therapeutically targetable copy number amplifications in several genes (e.g., MECOM, MAPK1, CCNE1, and KRAS), evidence of overlaps between DNA methylation clusters and gene expression subtypes, and confirmation of involvement of the NOTCH and FOXM1 signaling pathways in serous ovarian cancer pathophysiology. Using MuSiC, we complement the previous results with detailed descriptions of the correlations between mutation spectra and clinical data. We found evidence of mutual exclusion between mutations in two major tumor suppressors, TP53 and RB1, and strong statistical support for the correlation of germline BRCA1 point mutations and age at disease onset (Hall et al. 1990; Miki et al. 1994). MuSiC also helps us confirm the importance of mutations in the P53 DNA-binding domain (Sigal and Rotter 2000).

Results

Overview

The development of MuSiC was motivated by the rapidly expanding numbers of mutation data sets from a wide variety of tumor types. It is imperative during post-discovery analysis to separate the significant, or “driver,” mutations from the passenger mutations to more accurately pinpoint the key genes and pathways critical for disease initiation and progression. MuSiC is designed precisely to streamline this process into an easily accessible high-throughput software exercise.

MuSiC currently consists of seven analysis modules and an eighth execution module, “MuSiC Play,” which runs each analysis module sequentially (Fig. 1). MuSiC Play parses the input and output of each of the individual modules and then produces a composite summary of all executed modules. Table 1 lists the type of analysis performed and the types of variants considered by each individual MuSiC module. More detailed descriptions of the specific analysis algorithms performed by each module are given below.

Figure 1. — MuSiC flow diagram. MuSiC modules can either be implemented individually with various required input files or may be implemented in serial via one command where four inputs are used to execute the entire package of tools.

Table 1.

Analyses performed and the variants included for each MuSiC module

Significantly mutated gene tests

We use the concept of “significantly mutated genes” (SMG) to describe genes that show a significantly higher mutation rate than the background mutation rate (BMR) when multiple mutational mechanisms (coding indel and single nucleotide substitution, splice site mutation, etc.) are considered. Specialized measurements of the BMR may also be considered; BMRs in MuSiC are optionally calculated across the entire sample set, across particular subgroups of similarly mutated samples, or for each sample individually. For each BMR subgroup considered and for each category of mutational mechanism, the mutation rates are compared to the appropriate BMR, and a single P-value summarizing all considerations is generated for each gene. We refer to this summarization procedure as the significantly mutated gene (SMG) test.
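The per-category comparison against the BMR described above can be sketched as an upper-tail binomial probability: given the number of covered bases in a gene (summed over samples) and a background rate, how surprising is the observed mutation count? The binomial model, the function names, the category labels, and the numbers below are assumptions made for illustration; they are not taken from MuSiC's implementation.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

def upper_tail_pvalue(observed, covered_bases, bmr):
    """P(X >= observed) when each covered base mutates independently at rate bmr."""
    if observed <= 0:
        return 1.0
    expected = covered_bases * bmr
    total = 0.0
    for k in range(observed, covered_bases + 1):
        term = math.exp(log_binom_pmf(k, covered_bases, bmr))
        total += term
        # Past the expected count the terms shrink quickly; stop once negligible.
        if k > expected and term < total * 1e-16:
            break
    return min(1.0, total)

def category_pvalues(counts, coverage, bmr_by_category):
    """One P-value per mutation category for a single gene."""
    return {
        category: upper_tail_pvalue(counts.get(category, 0), coverage[category], bmr)
        for category, bmr in bmr_by_category.items()
    }

# Hypothetical numbers for one gene; none of these values come from the OV data set.
bmr = {"transitions": 1.0e-6, "transversions": 5.0e-7, "indels": 1.0e-7}
coverage = {"transitions": 400_000, "transversions": 400_000, "indels": 400_000}
counts = {"transitions": 3, "indels": 2}
for category, p in category_pvalues(counts, coverage, bmr).items():
    print(f"{category}: P = {p:.3g}")
```

Summing the tail directly, rather than computing one minus the lower tail, keeps very small P-values from being lost to floating-point rounding. The per-category values would then be summarized into a single P-value per gene.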

We assessed multiple methods of calculating summarized P-values, including a convolution test (CT), a Fisher's combined P-value test (FCPT), and the likelihood ratio test (LRT), using a partially simulated data set (this data set and the associated test simulations are described in the Supplemental Material). By this approach, we determined that the P-value distribution obtained using the CT method most closely resembled the uniform distribution expected under the null (in this case, the null is such that no gene is truly significantly mutated), while the FCPT and LRT methods produced slightly inflated or deflated P-values, respectively (Supplemental Fig. S1). During the SMG test, a false discovery rate (FDR) also is calculated. We evaluate our SMG test results by establishing a P-value or FDR threshold (threshold typically 0.2 or less for FDR), and then appropriately filtering the test output.
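Of the three summarization methods named above, Fisher's combined P-value test is the simplest to write down, so the following sketch combines per-category P-values with Fisher's method and then applies a false discovery rate adjustment. The Benjamini-Hochberg procedure is assumed here because the text does not name the FDR method; the gene names and P-values are hypothetical, and the convolution and likelihood ratio tests are not shown.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2k degrees of freedom."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one P-value")
    # Half of the test statistic; the floor guards against log(0).
    half_stat = -sum(math.log(max(p, 1e-300)) for p in pvalues)
    # Closed-form chi-square survival function for even degrees of freedom:
    # sf(x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-category P-values for three genes.
gene_category_pvalues = {
    "GENE_A": [1e-6, 0.2, 0.03],
    "GENE_B": [0.4, 0.7, 0.5],
    "GENE_C": [0.01, 0.02, 0.9],
}
genes = list(gene_category_pvalues)
combined = [fisher_combined_pvalue(gene_category_pvalues[g]) for g in genes]
fdr = benjamini_hochberg(combined)
# FDR threshold of 0.2, the upper end of the range mentioned in the text.
candidates = [g for g, q in zip(genes, fdr) if q <= 0.2]
print(candidates)
```

Under the null hypothesis, a well-calibrated combined P-value should be uniformly distributed, which is the property the comparison of CT, FCPT, and LRT examines.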

The results of MuSiC's SMG analysis for the ovarian cancer data set were previously reported (The Cancer Genome Atlas Research Network 2011). Briefly, there were 12 genes found to be significantly mutated in the data set. The CT, FCPT, and LRT P-values for these genes as well as the BMR for each mutational mechanism category in the ovarian data set are displayed in Figure 2. BRCA1 and BRCA2 are known ovarian cancer risk genes (King et al. 2003; Pal et al. 2005). In addition to 27 (BRCA1) and 25 (BRCA2) germline nonsense, splice site, and indel mutations, 11 and 10 nonsynonymous somatic mutations were discovered in this data set in BRCA1 and BRCA2, respectively (The Cancer Genome Atlas Research Network 2011).

Figure 2. — Mutation rates and SMGs in the OV data set. (A) The cohort-wide background mutation rates for all seven mutational mechanism categories are plotted for the OV data set. The overall BMR is also plotted, combining all types of mutations. (B) −Log₁₀(P) for the top 12 OV SMGs are plotted for all three SMG tests in order of decreasing convolution test P-value.

Significantly mutated pathway/gene set analysis

To identify known cellular pathways with significant accretions of somatic mutations in ovarian tumors, we integrated the PathScan algorithm (Wendl et al. 2011) as a module of the MuSiC pipeline. PathScan treats pathways as groups of genes defined by databases such as KEGG (Kanehisa and Goto 2000), BioCarta (Nishimura 2001), and Reactome (Joshi-Tope et al. 2005), with the KEGG definitions currently set as the default implementation. PathScan can be configured, however, to assess any grouping of genes, including groupings from nonpathway databases such as Pfam (Finn et al. 2010).

Using PathScan, we analyzed the OV somatic mutation data set in two ways. First, the entire data set was analyzed regardless of the frequency of mutation in specific genes. Secondly, due to the overwhelming abundance of TP53 mutations, we also performed the analysis using identical parameters but excluding TP53 mutations.

The most significant pathways identified in the first analysis were a collection of KEGG cancer pathways including “Thyroid Cancer” (hsa05216), “Bladder Cancer” (hsa05219), “Basal Cell Carcinoma” (hsa05217), “Nonsmall Cell Lung Cancer” (hsa05223), and “Melanoma” (hsa05218). In the midst of those significant cancer pathways sits the “p53 Signaling” pathway (hsa04115) at a P-value of 2.62 × 10⁻¹²⁶. Also, MuSiC found the “Apoptosis” pathway (hsa04210), including not only TP53 mutations but also nine phosphoinositide 3-kinase mutations, to be affected. This latter group of mutations includes two PIK3CA mutations, previously implicated in both breast and ovarian cancers (Levine et al. 2005).

Table 2.

PathScan results for the OV data set

In the second analysis, where TP53 mutations were excluded, the collection of KEGG cancer pathways was no longer identified as the most significant in the OV data set. Instead, this analysis identified the environmental information processing class “Receptors and Channels” pathway from the KEGG Brite database (hsa04000) as the most significant (P = 4.36 × 10⁻⁹¹) (Table 2). When the analysis scope was limited to the KEGG Pathway database, a similar pathway from the environmental information processing class was identified: the “Neuroactive Ligand-Receptor Interaction” pathway (hsa04080). This pathway incurred 266 mutations across the OV data set, yielding a P-value of 2.5 × 10⁻¹¹. The “Calcium Signaling” pathway (hsa04020, P = 4.9 × 10⁻⁸) also rose to the level of significance in the TP53-excluded KEGG analysis, which is interesting given the role of calcium signaling in many cellular processes, including cell death (Crompton 2000). The results from the OV data set feature 266 mutations throughout this pathway, highlighted by 35 mutations in voltage-dependent calcium-channel genes (CACNA1A-H and CACNA1S), and 25 mutations in the RYR1 and RYR2 genes, whose expression has been correlated with tumor grade in breast cancer (Abdul et al. 2008).

Mutation relation test

The mutation relation test (MRT) attempts to reveal correlations and mutual-exclusion relationships among significantly and highly mutated genes in a pairwise fashion. Positive correlations suggest that mutations and their associated pathways putatively function synergistically to promote carcinogenesis, while negative correlations imply that the alteration of a single component or pathway may be sufficient, wholly or in part, for carcinogenesis.

An example heat map of the MRT analysis for all 316 OV samples is shown in Figure 3. For this data set, we found examples of both concurrent and mutually exclusive mutations among the genes represented in the heat map. Co-mutators FAT3 and EMR3 (P = 0.0333) are both members of the “Receptors and Channels” KEGG pathway (hsa04000), identified as significantly mutated in the pathway analysis described above. There is also mild evidence of mutual exclusion between RB1 and TP53 mutations (P = 0.0141). Both of these genes are tumor suppressors, and both were found previously to be significantly mutated genes in this data set (The Cancer Genome Atlas Research Network 2011). This result is potentially meaningful if one considers the possibility that RB1 mutations could be driver events that act independently of TP53. It is well known that the rate of TP53 mutations in ovarian cancer is very high (Ahmed et al. 2010), and this holds for the OV data set used in this analysis (The Cancer Genome Atlas Research Network 2011). However, for the few cases that do not have a driver mutation in TP53, we speculate, based on our mutual exclusion results, that mutations in RB1 may represent an independent path to ovarian adenocarcinoma. Of course, due to the small numbers of mutations present in RB1 and EMR3 (six and five mutations, respectively), additional data would be required to confirm any hypotheses generated using these results.
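To make the idea of pairwise co-occurrence and mutual exclusion concrete, the sketch below scores one gene pair with one-sided hypergeometric (Fisher's exact) tails. This is a common choice for such 2 × 2 comparisons and is used here only as an illustration; it is not necessarily the test that the MRT module applies, and the sample identifiers are hypothetical.

```python
from math import comb

def hypergeom_pmf(x, n_total, n_a, n_b):
    """P(overlap = x) if n_b mutated samples are drawn at random from n_total,
    of which n_a carry a mutation in the other gene."""
    return comb(n_a, x) * comb(n_total - n_a, n_b - x) / comb(n_total, n_b)

def pair_relation(samples_a, samples_b, n_total):
    """Return (overlap, P for co-occurrence, P for mutual exclusion)."""
    a, b = set(samples_a), set(samples_b)
    overlap = len(a & b)
    lowest = max(0, len(a) + len(b) - n_total)
    highest = min(len(a), len(b))
    pmf = {x: hypergeom_pmf(x, n_total, len(a), len(b))
           for x in range(lowest, highest + 1)}
    # Many shared samples suggest co-occurrence; few suggest exclusion.
    p_cooccurrence = sum(v for x, v in pmf.items() if x >= overlap)
    p_exclusion = sum(v for x, v in pmf.items() if x <= overlap)
    return overlap, min(1.0, p_cooccurrence), min(1.0, p_exclusion)

# Hypothetical cohort of 20 samples labeled 0..19.
gene_x = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
gene_y = {12, 13, 14, 15, 16}
overlap, p_co, p_ex = pair_relation(gene_x, gene_y, n_total=20)
print(f"shared samples: {overlap}, P(co-occurrence) = {p_co:.3g}, "
      f"P(exclusion) = {p_ex:.3g}")
```

With only a handful of mutated samples in one gene, as is the case for RB1 and EMR3 above, such tail probabilities cannot become very small, which is one reason the text calls for additional data.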

Clinical correlation test

The clinical correlation test (CCT) can be used to determine relationships between clinical phenotypes and observed mutations. The input clinical data may be represented in either numeric or categorical (“class”) formats. For example, we obtained clinical data for 315 of the 316 OV samples. The numeric clinical data consisted of the patients' ages at disease diagnosis and their survival periods (in days); the categorical clinical data included each patient's race, tumor stage, tumor grade, outcome of primary therapy, and vital status. For both data types, the goal of the CCT is to determine whether specific mutations/genes are associated with a particular clinical feature. As these associations can sometimes be biased by covariate clinical features, MuSiC also offers a generalized linear model (GLM) analysis option within the CCT. This tool allows users to define any number of clinical traits as covariates to discovered mutations and, subsequently, to eliminate any possible biases introduced to the phenotype/mutation associations by the covariates' effects.

As a proof of principle, we have assessed the well-established relationship between the presence of a BRCA1 germline variant and a patient's age at disease diagnosis (Hall et al. 1990; Miki et al. 1994). In the OV data set, the CCT revealed that germline BRCA1 variants were significantly associated with earlier disease diagnosis (P = 2.456 × 10⁻⁵, Wilcoxon rank sum test), whereas somatic BRCA1 mutations showed no such association (P = 0.308). A boxplot of the ages at diagnosis for the OV samples (Fig. 4) clearly shows that the mean age of samples with a germline BRCA1 mutation (51.3 yr) is lower than that of samples with either wild-type BRCA1 (60.4 yr) or a somatic BRCA1 mutation (63.1 yr). Thus, the CCT correctly evaluated the relationship between germline variants in BRCA1 and ovarian cancer susceptibility.

Figure 4. — *BRCA1* variant status versus sample age of diagnosis for the OV data set. A boxplot of the age of diagnosis of 315 OV patients grouped by their *BRCA1* mutation status. Germline *BRCA1* variant status is correlated with a lower age of diagnosis via the CCT (P = 2.456 × 10⁻⁵).
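The Wilcoxon rank sum test used for the numeric comparison above can be sketched as follows. This version uses midranks for ties and the normal approximation without continuity correction, which is an assumption made for brevity; the ages are invented for illustration and are not drawn from the OV clinical data.

```python
import math

def midranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def rank_sum_test(group_x, group_y):
    """Return (U statistic for group_x, two-sided P-value)."""
    n1, n2 = len(group_x), len(group_y)
    combined = list(group_x) + list(group_y)
    ranks = midranks(combined)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    tie_counts = {}
    for v in combined:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    # Two-sided tail of the standard normal distribution.
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical ages at diagnosis for carriers and non-carriers of a variant.
carriers = [44, 47, 49, 51, 52, 55]
non_carriers = [53, 58, 60, 61, 63, 66, 70]
u_stat, p_value = rank_sum_test(carriers, non_carriers)
print(f"U = {u_stat}, P = {p_value:.3g}")
```

Because the test works on ranks, it makes no assumption that ages are normally distributed. For small groups, an exact P-value would be preferable to the normal approximation.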

Proximity analysis

In certain genes, mutations tend to cluster in close proximity within functional domains. In order to find these dense “clusters” of mutations within a mutation list, we have developed MuSiC's proximity analysis module. This module searches within fixed windows around each mutation, reporting the number of and distances to all neighboring mutations. The size of the fixed windows utilized for searching is user-configurable. In order to determine an appropriate default size for these windows, we have analyzed the distances between all neighboring mutations in version 54 of the COSMIC database (see Methods). Upon finding that over 25% of the nearest-neighbor mutations in COSMIC are within seven or fewer amino acids of each other, we chose to search 7 aa both upstream and downstream of each OV mutation for dense clusters of variants. TP53, with 302 total nonsilent somatic mutations in the OV data set, dominated the proximity analysis results. The average number of mutations within 7 aa of another TP53 mutation was 4.9, with the densest 14-aa window containing 26 nonsynonymous mutations.

The next-densest group of mutations occurred in DNAH5, where there were four mutations within a space of 3 aa. Several genes have mutations that occur in triplets within a space of 2 aa, including UBR4 and RB1CC1. Both UBR4 and RB1CC1 have relationships with RB1, a gene on the significantly mutated gene list for the OV data set and a gene also found to harbor copy-number alterations in the OV data set (The Cancer Genome Atlas Research Network 2011). UBR4 is a component of the N-end rule pathway that interacts with RB1, and RB1CC1 actually regulates the expression of RB1. RB1 itself has two mutations in close proximity (within 1 aa of each other). These high-density groups of RB1-related mutations, pictured in Figure 5, may support the hypothesis that RB1 could be an additional driver of ovarian cancer.

Figure 5. — Proximity analysis mutation diagrams. These mutation diagrams show recurrent triplet mutations in both *UBR4* and *RB1CC1*, both of which harbor a relationship with tumor suppressor gene *RB1*.
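The fixed-window search described in this section can be sketched in a few lines. The input layout (pairs of gene name and amino acid position), the function name, and the example positions are assumptions for illustration; only the default window of 7 aa comes from the text.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def proximity_report(mutations, window=7):
    """For each mutation, list distances (in aa) to other mutations in the
    same gene that lie within `window` residues on either side."""
    by_gene = defaultdict(list)
    for gene, aa_pos in mutations:
        by_gene[gene].append(aa_pos)
    report = []
    for gene, positions in by_gene.items():
        positions.sort()
        for i, pos in enumerate(positions):
            lo = bisect_left(positions, pos - window)
            hi = bisect_right(positions, pos + window)
            # Exclude the mutation itself but keep others at the same residue.
            distances = [abs(positions[j] - pos) for j in range(lo, hi) if j != i]
            report.append((gene, pos, len(distances), distances))
    return report

# Hypothetical mutation list: (gene, amino acid position).
example = [("GENE_A", 100), ("GENE_A", 103), ("GENE_A", 106),
           ("GENE_A", 250), ("GENE_B", 42)]
for gene, pos, count, distances in proximity_report(example):
    print(gene, pos, count, distances)
```

Sorting positions once per gene and using binary search keeps the cost low even for heavily mutated genes. A recurrent mutation at the same residue is reported as a neighbor at distance 0.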

COSMIC/OMIM query

Using the COSMIC/OMIM module of MuSiC, we attempted to find previously reported mutations matching the query set of somatic OV mutations. This type of analysis can provide a measure of recurrence, as these databases generally contain information about the studies from which their contents were derived, and the COSMIC database deals exclusively with the somatic mutations discovered in cancer studies. A summary of COSMIC/OMIM database comparisons for those significantly mutated genes listed above with at least one database match of any type is presented in Supplemental Figure S2. This summary, however, represents only a subset of all of the information made available via the database queries. We found 15 exact matches in genomic position and nucleotide change between the OV data set and COSMIC, including sites in NF1, RB1, and PIK3CA, all considered significantly mutated genes by the MuSiC pipeline. This type of match is only possible when comparing to the COSMIC database, since OMIM entries contain only amino acid coordinates. We identified another exact match in the FOXG1 gene, which encodes a forkhead transcription factor. Not only was the FOXM1 transcription factor network cited as significantly altered in 87% of samples in the previous TCGA study (The Cancer Genome Atlas Research Network 2011), but, additionally, some forkhead transcription factors were previously identified as therapeutic targets (Moumne et al. 2008; Wang et al. 2010b).

In addition to finding the above COSMIC variants which shared positions and identical nucleotide changes with OV mutations, our comparison of OV mutations to the COSMIC and OMIM databases also identified a large set of database mutations that altered the same amino acid in an identical manner as an OV mutation. Of 233 such matches from COSMIC (76 from OMIM), the overwhelming majority of these, 219 (56), were from the gene TP53. There were 229 (76) other “position” matches, defined as mutations which affect an identically positioned amino acid but which do not cause the same residue change as the previously reported mutation. Most of these matches are from genes that have been previously associated with ovarian cancer (The Cancer Genome Atlas Research Network 2011), such as BRAF, BRCA1, KRAS, and again, TP53.
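The three tiers of database match discussed above (identical nucleotide change, identical amino acid change, and same amino acid position with a different change) can be expressed as a small classifier. The record layout, field names, and example variants below are hypothetical; genomic fields are optional so that entries carrying only amino acid coordinates can still be compared.

```python
from typing import NamedTuple, Optional

class Variant(NamedTuple):
    gene: str
    aa_pos: int                  # amino acid position in the protein
    aa_ref: str                  # reference residue
    aa_alt: str                  # altered residue
    chrom: Optional[str] = None  # genomic fields stay None when unavailable
    pos: Optional[int] = None
    ref: Optional[str] = None
    alt: Optional[str] = None

RANK = {"none": 0, "position": 1, "amino_acid": 2, "nucleotide": 3}

def match_kind(query, entry):
    """Classify how closely one database entry matches the query variant."""
    if (query.pos is not None and entry.pos is not None
            and (entry.chrom, entry.pos, entry.ref, entry.alt)
            == (query.chrom, query.pos, query.ref, query.alt)):
        return "nucleotide"
    if entry.gene == query.gene and entry.aa_pos == query.aa_pos:
        if (entry.aa_ref, entry.aa_alt) == (query.aa_ref, query.aa_alt):
            return "amino_acid"
        return "position"
    return "none"

def best_match(query, database):
    """Strongest match type found for the query across all database entries."""
    return max((match_kind(query, e) for e in database), key=RANK.get, default="none")

# Hypothetical query and database entries.
query = Variant("GENE_A", 120, "R", "Q", chrom="1", pos=5000, ref="G", alt="A")
database = [
    Variant("GENE_A", 120, "R", "W"),                       # same residue, other change
    Variant("GENE_A", 120, "R", "Q"),                       # same amino acid change
    Variant("GENE_A", 300, "P", "L", "1", 5540, "C", "T"),  # unrelated site
]
print(best_match(query, database))
```

Reporting only the strongest match per query is a design choice of this sketch; a tally of all match types per gene would be built the same way.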

Pfam annotation

The Pfam annotation module of MuSiC groups genes based on the frequency of mutation in specific protein domains. Grouping mutations by their protein domain can serve to group genes according to putative function, since genes that share a domain are more likely to share related functions. We performed Pfam annotation on the 19,356 somatic variants identified in the 316 OV cases. Supplemental Table S1 reports the number of nonsynonymous mutations, synonymous mutations, and also the number of genes harboring mutations in each Pfam domain with at least five somatic events in the OV data set.

In the analysis of the OV data, many of the frequently mutated domains are also the most prevalent domains in the genome, including the seven-transmembrane G protein-coupled receptor domain, the protein kinase domain, and the zinc finger domain, as illustrated in Figure 6A. The amply mutated P53 domain, however, is not abundant genome-wide. Analysis of this Pfam annotation result correctly confirms the significance of the P53 domain mutations in this cohort. Figure 6A also shows that this domain is recurrently mutated but only in a small number of genes, unlike all of the other domains pictured. Lastly, Figure 6B shows that the P53 domain has an unusually high nonsynonymous:synonymous mutation ratio. All of these details correctly point to an important mutation hotspot in this cohort.

Figure 6. — Pfam domains affected by OV mutations. (A) A histogram of the most highly mutated domains in the OV data set next to the number of genes affected in each domain. (B) A stacked bar-graph where the value 100% represents the total number of mutations in a particular Pfam domain. Lighter and darker sections of the bars represent which proportions of the total mutations are nonsynonymous and synonymous, respectively.

The Pfam annotation module output may also be modified slightly and fed into the SMG test algorithm in order to produce a mathematical result describing “significantly mutated domains” (rather than significantly mutated genes) and their associated P-values. For a detailed explanation of this option, please see the Supplemental Material, including Supplemental Table S2. The results presented therein reaffirm the significance of the mutations in the P53 domain, as well as in the other frequently mutated domains.
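The per-domain tallies discussed in this section (nonsynonymous and synonymous counts, number of genes affected, and the nonsynonymous fraction) amount to a grouped count over annotated mutations. The sketch below assumes each mutation has already been assigned a domain name; the input layout, function name, and example records are hypothetical, while the minimum of five events mirrors the cutoff mentioned above.

```python
from collections import defaultdict

def domain_summary(annotated_mutations, min_events=5):
    """Summarize (gene, domain, is_nonsynonymous) records per domain.

    Returns rows of (domain, nonsynonymous, synonymous, genes affected,
    nonsynonymous fraction), most frequently mutated domain first.
    """
    stats = defaultdict(lambda: {"nonsyn": 0, "syn": 0, "genes": set()})
    for gene, domain, is_nonsynonymous in annotated_mutations:
        entry = stats[domain]
        entry["nonsyn" if is_nonsynonymous else "syn"] += 1
        entry["genes"].add(gene)
    rows = []
    for domain, entry in stats.items():
        total = entry["nonsyn"] + entry["syn"]
        if total >= min_events:
            rows.append((domain, entry["nonsyn"], entry["syn"],
                         len(entry["genes"]), entry["nonsyn"] / total))
    rows.sort(key=lambda row: row[1] + row[2], reverse=True)
    return rows

# Hypothetical annotated mutations: (gene, domain, is_nonsynonymous).
records = (
    [("GENE_A", "DomainX", True)] * 6
    + [("GENE_B", "DomainY", True), ("GENE_C", "DomainY", False),
       ("GENE_D", "DomainY", True), ("GENE_E", "DomainY", False),
       ("GENE_F", "DomainY", True)]
    + [("GENE_G", "DomainZ", True)]
)
for row in domain_summary(records):
    print(row)
```

A domain mutated many times in a single gene with a high nonsynonymous fraction, like the hypothetical DomainX here, shows the same pattern the text describes for the P53 domain.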

Discussion

There are several software tools available purporting to determine mutational significance. CHASM (Carter et al. 2010), for instance, uses a machine-learning technique to distinguish driver mutations from passengers based on a driver/passenger mutation training set. Mutation Assessor (Reva et al. 2011) provides a prediction of the functional impact of a mutation based on the specificity of multiple sequence alignments and conservation scores. And tools such as CanPredict (Kaminker et al. 2007) are database-driven, deriving multiple metrics for each variant, starting with stored information and models, and then making use of the metrics through a decision-tree analysis. Other tools, such as ANNOVAR (Wang et al. 2010a), provide detailed annotations of genetic alterations, much like MuSiC's Pfam annotation module. Although not currently publicly available, the “Firehose” pipeline does share some features with MuSiC, including an SMG analysis tool and a pathway analysis tool, PARADIGM (Vaske et al. 2010), but Firehose as a whole is heavily focused on orderly sequencing and quality control rather than post-discovery variant analysis.

MuSiC, on the other hand, is the first available set of combined tools that enables a complete, multidimensional statistical evaluation of next-generation-derived cancer data sets. No other publicly available tool suite currently incorporates clinical data along with coverage data and database references into the determination of the most significant mutations among a large mutation list full of passengers. MuSiC merges several methodological aspects of the above-mentioned tools with many novel additional algorithms, providing the capability of large-cohort, data-driven statistical analysis to the entire research community.

Use of MuSiC is straightforward. The simplicity of the input files and tool updating make this package extremely accessible. MuSiC also accommodates both large and small research organizations; although the entire suite of tools is capable of running sequentially on a single processor, the most CPU-intensive modules offer easily parsable options for parallelizing jobs across multiple machines or across a job cluster.

Future support for the MuSiC package will be devoted to the handling of additional file formats, such as the variant call format (VCF), and also to the development of a graphical user interface (GUI). We intend to design new tools aimed at incorporating a wider variety of biological and variant data types, such as copy number alteration data and 3D protein structures from RCSB's Protein Data Bank (Berman et al. 2000), the latter to be used for improved proximity analysis calculations. We also intend to implement recurrence tests across other data entities, such as significantly mutated transcripts and significantly mutated gene families. In addition, we plan to enhance the assessment of the functional impact of protein mutations in MuSiC through the integration of published solutions, such as Mutation Assessor (Reva et al. 2011) and PolyPhen-2 (Adzhubei et al. 2010), as well as through the design of new modules which take advantage of databases that categorize such effects, such as SIFT (Kumar et al. 2009). Lastly, a fuller integration of the results from various MuSiC modules, many of which are currently considered in isolation, will provide an even more comprehensive picture of cancer genomic mechanisms.

Methods

Sample data set used in this study

Alignment mapping files for the ovarian cancer cohort, as well as all mutational data in mutation annotation format (MAF) are available at The Cancer Genome Atlas (TCGA) website, http://cancergenome.nih.gov/, via dbGaP. The dbGaP study accession number is phs000178.v1.p1.