SUMMARY
A major challenge in cancer genomics is to identify genes with functional roles in cancer and uncover their mechanisms of action. We introduce an integrative framework that identifies cancer-relevant genes by pinpointing those whose interaction or other functional sites are enriched in somatic mutations across tumors. We derive analytical calculations that enable us to avoid time-prohibitive permutation-based significance tests, making it computationally feasible to simultaneously consider multiple measures of protein site functionality. Our accompanying software, PertInInt, combines knowledge about sites participating in interactions with DNA, RNA, peptides, ions, or small molecules with domain, evolutionary conservation, and gene-level mutation data. When applied to 10,037 tumor samples, PertInInt uncovers both known and newly predicted cancer genes, while additionally revealing what types of interactions or other functionalities are disrupted. PertInInt’s analysis demonstrates that somatic mutations are frequently enriched in interaction sites and domains and implicates interaction perturbation as a pervasive cancer-driving event.
In Brief
A fast, analytical framework called PertInInt enables efficient integration of multiple measures of protein site functionality—including interaction, domain, and evolutionary conservation—with gene-level mutation data in order to rapidly detect cancer driver genes along with their disrupted functionalities.
Graphical Abstract
INTRODUCTION
Large-scale, concerted oncogenomic consortia, such as the Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC), have sequenced an unprecedented number of tumor genomes from thousands of patients across tens of cancer types (International Cancer Genome Consortium et al., 2010; TCGA Research Network et al., 2013). Computational analyses of these datasets promise a revolution in precision oncology with additional insights into the genetic underpinnings of a staggeringly complex and heterogeneous disease (Chin and Gray, 2008). The recent, successful completion of these efforts, heralded as the “end of the beginning” of cancer genomics, has revealed a critical need for new methods that are able both to detect less frequent cancer-driving mutational events as well as to suggest the mechanistic, molecular impact of these mutations (Bailey et al., 2018; The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium et al., 2020). More broadly, the comprehensive detection of cancer-driving mutational events, coupled with an understanding of their biological mechanism of action, has the potential to expand our knowledge of altered cellular processes in tumors, to reveal actionable, genetic similarities between different cancer types, to inform how evolving, heterogeneous populations of tumor cells may impact therapeutic efficacy, and to further translational research and inform downstream clinical treatments (McGranahan and Swanton, 2017; Vogelstein et al., 2013).
Despite considerable efforts, the crucial first step toward these goals—differentiating the small fraction of somatic mutations with functional roles in cancer (“drivers”) from the preponderance of neutral “passenger” mutations—still poses a substantial computational obstacle (Porta-Pardo et al., 2017). While initial attempts to uncover cancer drivers at the gene level based on frequency of mutation across tumor samples have been fruitful (Dees et al., 2012; Lawrence et al., 2013), such gene-centric, recurrence-based approaches are inherently unable to detect infrequently mutated driver genes and also cannot distinguish among mutations within the same gene that may lead to distinct tumor phenotypes or clinical responses (Torkamani and Schork, 2008). Indeed, different positions within genes can contribute in varying degrees to different molecular functionalities; mutations falling within different gene positions can therefore have unequal impacts. In order to address the critical need to detect and interpret rare mutational driver events, an emerging class of “subgene” level approaches consider somatic mutations affecting genes within the context of information known about specific sites within their encoded proteins (Porta-Pardo et al., 2017). Existing subgene-level methods have derived such protein site functionality information from analyses of evolutionary conservation (Adzhubei et al., 2010; Ng and Henikoff, 2003; Reva et al., 2011), three-dimensional structure (Gao et al., 2017; Kamburov et al., 2015; Niu et al., 2016; Porta-Pardo et al., 2015; Ryslik et al., 2014a; Tokheim et al., 2016a), domains (Munro et al., 2018; Peterson et al., 2017; Porta-Pardo and Godzik, 2014), or post-translational modification (Reimand and Bader, 2013; Zhao et al., 2017). These methods, however, tend to identify cancer genes by considering whether somatic mutations alter just a single type of functionality (e.g., determining whether mutations are enriched only within protein domains [Porta-Pardo and Godzik, 2014] or only within phosphorylation sites [Reimand and Bader, 2013]), whereas somatic mutations within putative driver genes have been found to disrupt a broad range of protein functionalities. On the other hand, machine learning approaches to classify cancer drivers incorporate multiple types of information, but due to their “black box” nature, mechanistic interpretations of their predictions are not possible (Carter et al., 2009; Shihab et al., 2013).
We and others have previously demonstrated that detecting proteins that harbor somatic mutations in their interaction interfaces is a particularly effective approach to pinpoint infrequent driver mutations as well as reason about their molecular impacts and therapeutic sensitivities (Engin et al., 2016; Ghersi and Singh, 2014; Gress et al., 2016; Kamburov et al., 2015; Kar et al., 2009; Nishi et al., 2013; Porta-Pardo et al., 2015; Stehr et al., 2011). Indeed, several cancer driver genes, including TP53 and IDH1, are well known to harbor mutations within their interaction sites (Nishi et al., 2013). While traditionally interaction sites have been identified directly for the small fraction of human genes with actual or modeled co-complex structures, we have recently developed a domain-based approach that accurately detects residues that interact with DNA, RNA, peptides, ions, or small molecules across 63% of human genes (Kobren and Singh, 2019). A robust computational framework that utilizes this vastly expanded knowledge-base about interaction sites and explicitly integrates it with additional lines of evidence regarding subgene functionality would provide a powerful new approach not only to detect but also to interpret a wide range of mutations driving protein dysfunction in cancer.
Here, we introduce a fast, interpretable, and easily extendable framework that enables us to uncover whether somatic mutations within genes are enriched in sites associated with high measures of “functionality” as determined by multiple, possibly correlated, lines of evidence. Our implementation PertInInt (pronounced “pertinent,” perturbed in interactions) incorporates interaction site information, along with evolutionary conservation and domain membership information, as each of these measures informs which sites are important for protein functioning. We derive analytical calculations that obviate the need to perform time-prohibitive permutation-based significance tests, thereby making it feasible to integrate, in a principled manner, these distinct measures of subgene-level functionality. Further, we extend our framework to consider whole-gene mutation rates, as genes that are recurrently mutated across tumors are often found to be causally implicated in cancers (Forbes et al., 2010). While other approaches have combined the output of multiple programs post hoc (e.g., Bailey et al., 2018), PertInInt integrates multiple alternate sources of subgene resolution data with whole-gene mutational frequency within a single unifying framework in order to detect, evaluate, and infer the molecular impact of patterns of somatic mutations within all human genes.
We apply PertInInt to somatic missense mutation data arising from 10,037 tumor samples across 33 cancer types to identify genes with the most enriched mutational patterns. We find that while each source of information—interaction, domain, evolutionary conservation, and whole-gene mutation frequency—is individually predictive of cancer genes, PertInInt uncovers more comprehensive sets of cancer-relevant genes when considering all sources of information together. We demonstrate that PertInInt is able to identify even those cancer genes with relatively low overall mutation rates, and that PertInInt readily outperforms previous methods while revealing whether and what type of interaction potential is perturbed. PertInInt finds that numerous known oncogenes and tumor suppressors have an enrichment of somatic mutations within their interaction interfaces and, in addition, predicts new cancer-relevant genes along with their altered interaction functionalities. Altogether, PertInInt provides a highly effective integrative framework to analyze large-scale cancer somatic mutation data and further our understanding of the molecular mechanisms driving cancers.
RESULTS
Overview of the PertInInt Framework
PertInInt aggregates somatic mutational data observed across tumor samples and identifies for each gene whether certain types of its functional sites are enriched in somatic mutations and/or whether the gene exhibits a high mutation rate across its length. We briefly overview our approach next (see also Figure 1); more details can be found in the STAR Methods section and Figure S1.
Different measures of protein site functionality are modeled using distinct “tracks” where each position within a track has a corresponding 0 to 1 weight that reflects its importance with respect to the functionality being considered (Figure 1A). Though any type of annotated functional region can be incorporated into our framework, here, we consider four specific types of tracks. First, “interaction tracks” model various protein–ligand interaction interfaces, where higher positional weights indicate that those positions are more likely to participate in interactions with a ligand; each interaction track corresponds to the subset of protein positions where we have any knowledge about ligand-binding potential (Figures S1A–S1C). Second, “domain tracks” span the length of the protein and simply identify portions of the protein sequence that correspond to the domain of interest; weights are 1 for amino acid positions within the domain and 0 elsewhere (Figure S1D). Third, the “conservation track” is also the length of the protein sequence, and the weight of each position measures its conservation across vertebrate homologs; higher weights correspond to positions under more evolutionary constraint (Figure S1E). Finally, to determine whether a gene as a whole has more mutations than expected, we extend our framework to incorporate the “natural variation track,” which has a single entry per gene that reflects its background mutation rate, as estimated from the number of variants this gene has across healthy populations (Lek et al., 2016; Przytycki and Singh, 2017) (Figure S1F). Approximately 63% and 90% of human genes have per-site information about interactions or domains respectively, while all genes have per-site conservation values and background gene-level mutation rates. A gene may have numerous interaction and domain tracks (e.g., for different modeled interaction regions and for each of its identified domains) but has only a single conservation and natural variation track.
For each track, we consider the somatic mutations observed across tumor samples that fall within positions of that track and compute a per-track score as the sum of the per-track weights of the positions that each of the mutations fall into (Figure 1B); intuitively, a high score corresponds to the case where a large number of mutations fall within important track positions. To determine whether the score for a track is more than expected by chance, we could shuffle the mutations across the positions of the track, and use the mean and standard deviation computed from these permutations to compute a Z score; however, the mean and standard deviation for each track can be computed analytically. For each protein, we next combine the information from each of its tracks. Because tracks can overlap along the length of the protein sequence, and the somatic mutations that fall in each of them can also overlap, these tracks cannot be treated independently. Instead, for the background model we derive an approach to compute the covariance between tracks analytically and then use this covariance matrix to estimate a combined score (Figure 1C; see STAR Methods for derivations). We find that even when considering just a single track, our analytical formulation leads to >7× speedup over each empirical permutation (Figure S2). In practice, numerous shuffles are necessary to compute the mean and variance for a single track, and empirical calculations to estimate the covariance across all tracks is prohibitively slow, highlighting the power and necessity of our analytical formulation.
The final per-gene score output by PertInInt considers whether somatic mutations across samples are enriched (1) in positions with high ligand-binding potential for an interaction track, (2) within domain positions for a domain track, (3) within conserved sites for the conservation track, and (4) within the gene overall.
PertInInt Effectively Identifies Cancer Driver Genes via Integrating Multiple Sources of Information
We run PertInInt on somatic point mutation data aggregated across 10,037 pan-cancer tumor samples and 33 tumor types from TCGA (TCGA Research Network et al., 2013) (Figure S3 and Table S1). PertInInt’s analytical formulation enables the simultaneous consideration of multiple types of biological data regarding protein functionality. However, to first uncover to what extent each source of information—per-site interaction, domain, and conservation information as well as overall gene mutational frequency—is independently useful for identifying cancer-relevant genes, we run PertInInt on the pan-cancer dataset when restricted to each of these track types in turn. To validate the method in the absence of a complete gold standard, as we consider an increasing number of output genes, we compute how “enriched” this set is in genes from the Cancer Gene Census (CGC), a curated list of genes implicated in cancer (Futreal et al., 2004). In particular, enrichment is computed as the ratio between the fraction of CGC genes in the set of top-scoring genes considered (i.e., the precision) and the fraction of CGC genes in the whole set of genes (i.e., the precision you would expect to achieve if genes are randomly ordered).
We find that utilizing subsets of only interaction, only domain, only conservation, or only natural variation tracks in turn can recapitulate known CGC genes to varying degrees, with interaction tracks identifying the largest number of known driver genes while maintaining perfect precision relative to other track subsets (Figure 2A). Our integrative framework that incorporates all track types outperforms every version of our algorithm that considers only subsets of information; indeed, considering any two sources of biological information outperforms versions of PertInInt that utilize only one source, and considering any three sources of data tends to improve performance even further (Figure 2B). Attempts to combine per-track scores without accounting for between-track covariance (e.g., by summing or averaging track Z scores) not only are incorrect but also perform considerably worse in detecting cancer-relevant genes (Figure S4). Altogether, these results demonstrate the ability of our approach to effectively leverage the distinct contributions of multiple, complementary data sources regarding protein position and whole-gene functionality in order to uncover cancer driver genes. Furthermore, the enrichment of cancer genes among PertInInt’s top predictions remains when considering different gold standards (Figure S5).
We find low overlap between the sets of CGC genes identified when utilizing distinct track types, indicating that mutations within cancer genes tend to target a diverse array of functional elements (Figure 2C). Only a small minority of CGC genes (less than 10%) are identified by all four track types within the top 200 ranked genes. Mutations falling into known tumor suppressor PTEN, for instance, tend to hit evolutionarily conserved protein positions but do not alter known inferred interaction interfaces or domain regions more than expected by chance. In contrast, a small molecule-binding pocket in the IDH2 oncogene is recurrently mutated across cancers, and, thus, it is readily detected using interaction tracks alone but is less significantly ranked when PertInInt is restricted to other functionality data.
Lowly Mutated Genes Harbor Mutations that Preferentially Alter Functional Sites
We next show that PertInInt’s integrative approach can highlight genes with preferentially altered functional sites that may be lowly mutated overall; such “long tail” driver genes are easily missed by traditional frequency-based driver gene detection approaches. When run on the pan-cancer dataset utilizing all track types, PertInInt ranks highly several such infrequently mutated genes (Figure 3A). Of the top 35 genes ranked by PertInInt on the pan-cancer dataset, we find that 20 have a missense mutation rate less than one-twentieth of the maximum observed mutation rate (Figure 3B). These high-scoring long tail genes include novel genes with potential roles in cancer as well as known driver genes that cannot have been identified based solely on their relative mutation frequency (e.g., KMT2D and CIC, Figure 3B). Of the 20 highly ranked infrequently mutated genes, 18 harbor perturbed interaction sites, enabling immediate molecular insights regarding their roles in cancer. For example, among long tail genes that are highly ranked by PertInInt but have not yet been identified as cancer relevant, several have an enrichment of mutations in their DNA or small molecule interaction sites (e.g., MGA and GRIN2D, Figure 3B), in line with previous observations that many cancer driver genes exhibit these types of protein interaction perturbations (Delgado and León, 2006; Jeggo et al., 2016; Raimondi et al., 2017).
Mutations Are Distributed across Interaction Interfaces
For each protein with a significantly perturbed interaction interface, we next sought to determine whether mutations are found within a small number of interaction sites or across several interaction sites. We consider all sites within the protein with nonzero interaction track weights and use the frequency with which somatic mutations occur within each of them to compute a normalized Shannon entropy (Shannon, 1948). Higher entropies correspond to proteins with mutations spread across many interaction sites, whereas low entropies correspond to mutational patterns that can be uncovered by methods that look for mutation “hotspots” (Chang et al., 2016). As expected, PertInInt highly ranks several oncogenes that have previously been detected by hotspot detection algorithms due to their recurrent mutations in critical interaction positions (e.g., IDH1, BRAF, and NRAS). However, there are also many genes with significantly perturbed interaction interfaces where mutations are spread more widely across their interaction sites (Figure 3C). Known cancer genes DICER1, SMARCA4, CREBBP, and KMT2D, for instance, are among the top 35 genes ranked by PertInInt and contain significantly mutated interaction sites (combined score across interaction tracks > 6), each with several interaction sites that together harbor an enriched number of somatic mutations.
Notably, this analysis reveals that the top-ranked genes with significantly perturbed interaction interfaces include both oncogenes and tumor suppressor genes (TSGs), reflecting a dichotomy in the impact of binding interface mutations. Whereas some specific mutations within interaction sites have been linked to oncogenic activity (Stehr et al., 2011), other binding site mutations are known to entirely disrupt critical interactions and overall protein function (Cho et al., 1994). Although we model the interaction sites of similar numbers of oncogenes and TSGs (238 and 246 respectively), we find that among the 50 genes with the highest enrichment of mutations within their interaction sites, the enrichment of oncogenes is 2.36-fold greater than the enrichment of TSGs. Nevertheless, PertInInt uncovers perturbed interaction interfaces in many genes that have been previously identified as drivers due to nonsense, frameshift, or other relatively disruptive mutations typically associated with TSGs (e.g., RUNX1 and FOXO1). Indeed, enriched yet less common interaction-altering missense mutations uncovered by PertInInt may correspond to more subtle knockdown phenotypes or previously underappreciated oncogenic activities of genes traditionally characterized as TSGs.
PertInInt Outperforms Previous Methods in Detecting Cancer Genes
Having demonstrated that PertInInt can identify interaction interfaces enriched in mutations across tumor samples, and that this is highly predictive of cancer genes, we next turn to assessing PertInInt’s performance as compared with previously published methods (Chang et al.,2016; Lawrence et al., 2014; Melloni et al., 2016; Mularoni et al., 2016; Porta-Pardo et al., 2015; Porta-Pardo and Godzik, 2014; Przytycki and Singh, 2017; Ryslik et al., 2013, 2014b; Tamborero et al., 2013; Ye et al., 2010) for detecting cancer driver genes (see STAR Methods). These methods differ substantially in terms of their statistical models and overall goals and, unlike PertInInt, are largely unable to distinguish among the various types of interaction and other functional perturbations affecting the identified genes.
When applied to tumor samples from the pan-cancer dataset, our method has a greater enrichment for CGC genes than the other tested methods that we were able to run (Figure 4A). PertInInt also outperforms these other methods in terms of enrichment of CGC genes among top-ranked genes even after we exclude tumor samples from the six most highly mutated cancer types with 100+ missense mutations per patient on average, demonstrating that PertInInt’s superior pan-cancer performance is not driven by samples from cancer types that contribute large numbers of mutations (Figure S6). PertInInt’s greater enrichment as compared with other methods is observed as well for other lists of driver genes, and, furthermore, neither PertInInt nor any of the other tested methods show any enrichment for sets of genes that have been suggested to be unlikely to play roles in cancer (Figure S7; Table S2). Notably, the genes ranked highly by PertInInt differ substantially from those identified by other approaches (Figure 4B). Specifically, the set of genes identified by PertInInt has a consistently low Jaccard index (JI) with sets of genes ranked by alternate methods (JI < 0.5 across all methods for top 25 genes, JI < 0.25 across all methods for top 150 genes). Moreover, due to our analytical formulation, PertInInt can process the pan-cancer mutational data while considering multiple sources of data about protein functionality in 10 min on a single core of standard desktop; alternate methods each consider a limited set of mutational patterns and range in runtime from minutes to days (Table S3).
We also repeat our analysis on datasets restricted to samples from one cancer type, as many alternate methods that failed to run on the pan-cancer dataset are able to run on these substantially smaller subsets of tumor genomes. We find that in general across individual cancer datasets, PertInInt tends to achieve a higher area under the enrichment curve than other methods, including whole-gene methods, and a version of PertInInt that includes only subgene resolution tracks also outperforms other subgene methods (Figure S8). We note that since individual cancer types obviously have smaller total numbers of somatic mutations as compared with the combined pan-cancer dataset, the Z scores computed by PertInInt when run on individual cancers tend to be smaller than PertInInt’s pan-cancer Z scores; this trend is especially notable for cancer types with fewer samples and/or lower mutation rates. Similarly, fewer tracks can be evaluated for significance in the per-cancer analysis; this has a larger effect on interaction tracks in particular as they tend to involve fewer protein positions. Indeed, we find that the relative proportions of track types with positive Z scores for each cancer type are not notably different from each other, with the exception that the highly mutated cancer types (i.e., colorectal, lung, and stomach adenocarcinomas; skin cutaneous melanoma; and uterine corpus endometrial carcinoma, see Figure S3) each have a greater number of positively scoring domain and interaction tracks relative to other cancer types. Despite these differences when run on different cancer types, PertInInt is able to readily recover “cancer-specific” drivers (e.g., EIF1AX in uveal melanoma), and our analysis also reveals that genes that play dominant roles in certain cancer types may be important for smaller proportions of tumor samples in alternate cancer types as well (e.g., the same gene in thyroid carcinoma, Figure S9). Overall, our results show that PertInInt is a powerful method for evaluating mutational patterns across tumors of the same cancer type as well as across a pan-cancer dataset covering over 10,000 tumor genomes.
Distinct Perturbed Molecular Mechanisms Uncovered across Genes
Having shown that PertInInt is highly effective in identifying cancer genes, we next demonstrate its additional power to pinpoint which specific functional regions and mechanisms are perturbed by analyzing each track separately and determining which have positive Z scores (Table S4). Altogether, we find that 665 CGC genes have at least one subgene functionality track with a Z score ≥ 0.5, representing functional coverage of 93% of all CGC genes (Figure 5). Specifically, we find that DNA, RNA, peptide, ion, and small molecule interaction sites are enriched in mutations in 16%, 5%, 19%, 14%, and 22% of CGC genes, respectively; these numbers go up to 23%, 5%, 27%, and 24% of CGC genes if including those that are more broadly enriched in mutations across, respectively, DNA-binding, RNA-binding, peptide-binding, or metabolite-binding domains (as categorized in Pfam2Go; Mitchell et al., 2015). Up to 77% of CGC genes are enriched in mutations across at least one domain or interaction interface. We note that the perturbed nucleic acid and small molecule binding domains or interaction sites found across 45% of cancer genes would not be readily identified by analyses that focus exclusively on protein–protein interaction alterations (Porta-Pardo et al., 2015).
We now highlight a few of these genes that, though not present in the CGC, were uncovered by PertInInt as having significantly mutated interaction interfaces. For instance, transcription factors MGA and KLF5 harbor mutations within their basic helix-loop-helix and C2H2-ZF domains, respectively, that alter their DNA base-binding positions (Figure 6A), suggesting cancer-specific changes to normal DNA-binding and downstream regulatory activity. Indeed, KLF5’s E419Qmutation has recently been experimentally shown to change wild-type binding preferences and increase the expression of tumor progression genes in vivo (Zhang et al., 2018). Similarly, MGA normally subdues the activity of well-known oncogene MYC; its frequent deletion, truncation, or mutated binding properties across cancers further indicates its role as a tumor suppressor (Schaub et al., 2018). We also find that two RNA-binding genes DIS3 and SF1 exhibit significant mutations in their putative RNA-binding sites, with recurrent mutations in DIS3 altering multiple distinct RNA-contacting positions (Figure 6B). In support of our predictions, DIS3 is recurrently mutated in blood and skin cancers and has been identified as a candidate oncogene in colorectal cancer (de Groen et al., 2014). SF1 is recurrently mutated across cancers in a mutually exclusive fashion (i.e., indicating its analogous functionality) to RBM10, a gene found to drive aberrant splicing events in cancer (Seiler et al., 2018).
PertInInt also newly implicates a number of genes—present neither in the CGC nor on other lists of known cancer genes (Bailey et al., 2018; Kandoth et al., 2013; Lawrence et al., 2014; Vogelstein et al., 2013)—with mutations that appear to alter critical small molecule interaction sites (Figure 6C). The highly conserved kinase GSK3A for instance harbors a significant enrichment of mutations altering its ATP-binding positions. Supporting our prediction, suppression of this gene is associated with impaired growth and induction of apoptosis, and it has recently been proposed as a potential therapeutic target in acute myeloid leukemia (Banerji et al., 2012; McCubrey et al., 2014). We also find that the S-type lectin LGALS4 has an enrichment of mutations altering the β-galactoside sugar-binding positions in its galectin domains; indeed, LGALS4 has been linked to the regulation of the cancer-relevant Wnt signaling pathway and has been experimentally implicated as a tumor suppressor in colorectal cancer cells in vitro (Satelli et al., 2011).
To more broadly characterize PertInInt’s novel cancer gene predictions, we use gene set enrichment analysis (GSEA) (Mootha et al., 2003; Subramanian et al., 2005) with Reactome pathways (Jassal et al., 2020) on the ranked list of non-CGC genes output by PertInInt with combined score ≥ 1 (Table S5). GSEA uncovers 12 enriched Reactome pathways at false discovery rate (FDR) corrected p value (or q-value) < 0.2, including known cancer pathways such as Ras (q-value = 0.102), and signaling pathways mediated by known cancer genes, such as CREB1 (q-value = 0.042), NOTCH3 (q-value = 0.071), and PDGF (q-value = 0.195). In addition, a pathway related to cell adhesion, which is known to be disrupted especially in metastatic tumors, is also enriched (e.g., L1-ankyrin interactions, q-value = 0.078). Intriguingly, two enriched pathways suggest that nervous system-related functionalities may play roles across tumors (q-value < 0.2), an observation that has only very recently been explored (Zeng et al., 2019).
DISCUSSION
In this work, we have introduced a fast, integrative framework to detect cancer driver genes by uncovering whether somatic mutations across tumors are enriched in sites of different types of functionalities. Our method utilizing this framework, PertInInt, integrates knowledge from the largest set of protein–ligand interaction sites to date (Ghersi and Singh, 2014; Kobren and Singh, 2019) with additional biological data regarding subgene functionality and whole gene mutability (Figures 1 and S1). When applied to over 10,000 tumor samples from 33 cancer types, PertInInt reveals a broad range of perturbed functionalities in several known driver genes as well as in relatively rarely mutated genes with predicted tumorigenic roles (Figures 3A, 3B, and 5). Notably, PertInInt finds that mutations within many known driver genes are enriched in protein interaction interfaces (Figure 2A) and more broadly implicates interaction perturbation as a frequent phenomenon in cancer cells (Figure 5).
Analyses of predicted cancer-relevant coding mutations often involve—whenever possible—assessing their putative effect with respect to protein structure (Bailey et al., 2018; Chang et al., 2016; Kamburov et al., 2015; Niknafs et al., 2013; Raimondi et al., 2017). Although using structure directly to identify relevant mutations is rarely scalable in terms of runtime and coverage (Ryslik et al., 2013), PertInInt’s use of structurally predefined regions mediating protein interactions makes large-scale analyses in the context of protein structure feasible. Moreover, since cancer-driving genetic aberrations do not always involve mutation of protein–ligand interaction interfaces, a critical additional feature of PertInInt—that extends its coverage to all human genes—is that it seamlessly incorporates additional lines of evidence regarding protein site functionality. While here we have demonstrated that PertInInt effectively utilizes per-site evolutionary conservation and domain knowledge, we anticipate that encoding more sources of functional information within our framework (e.g., known phosphorylation sites or intrinsically disordered regions) will unearth other driver mutations and alternate mechanisms of action. Incorporating structurally resolved information from protein–protein interaction networks will also be a valuable direction for future work.
Genes that are frequently mutated across their lengths tend not to overlap genes that exhibit nonrandom patterns of mutations across individual protein positions, a pattern that has previously been leveraged to distinguish TSGs from oncogenes (Tokheim et al., 2016b; Vogelstein et al., 2013). By incorporating whole gene mutability information into our existing framework, we are able to uncover and profile a much more comprehensive set of both oncogenes and TSGs (Figures 2B and 3C). Although previous methods have also considered the frequency and spatial patterning of mutations within genes together (Korthauer and Kendziorski, 2015; Lawrence et al., 2013; Tokheim et al., 2016b), we also simultaneously infer specific perturbed molecular mechanisms within uncovered genes. We note that while mutation deleteriousness predictors—developed both in the context of cancer (Carter et al., 2009; Shihab et al., 2013) and otherwise (Adzhubei et al., 2010)—can evaluate the impact of somatic mutations, they tend to integrate multiple sources of protein site functionality information via complex statistical or machine learning approaches, where the contribution of each data source and thus subsequent mechanistic interpretations are obscured. In contrast, by determining mutational enrichments in specific types of functional sites, PertInInt is able not only to identify cancer-relevant genes but also to begin to explicitly reason about the biomolecular impacts of mutations. Indeed, uncovering the mechanisms of action for cancer-driving mutational events has been a major bottleneck in the critical step of translating this knowledge to improve patient care and outcomes (The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium et al., 2020).
Given the success of large-scale cancer genome sequencing consortia projects in expanding our knowledge of basic cancer biology (Bailey et al., 2018; Ding et al., 2018; Hoadley et al., 2018; Sanchez-Vega et al., 2018), coupled with the decreased cost of genome sequencing, it is clear that sequencing tumor genomes will be routine practice in both basic science and clinical settings, thereby rapidly increasing the number of sequenced tumors available for analysis. Importantly, PertInInt’s analytical framework enables it to efficiently process increasing numbers of tumor genomes; further, this speed is accompanied by better identification of cancer-relevant genes when run on larger numbers of tumor samples (Figure S10). Since PertInInt’s underlying analytical framework is general, we anticipate that it will also be effective in other settings. For example, because very few non-coding somatic mutations in cancer tend to be recurrent (Khurana et al., 2016), it may be especially powerful for identifying regulatory regions with an enrichment of mutations within sites associated with different measures of functionality (e.g., binding sites for different proteins).
In the future, one of the most tantalizing prospects of cancer genomics is its potential in transforming clinical practice. While identifying and linking cancer mutations to personalized treatments remains a daunting challenge, PertInInt dramatically accelerates the detection of rare mutational driver events from sequenced tumors while providing important information about their mechanisms of action, a key step in developing and customizing targeted therapeutic regimens.
STAR★METHODS
RESOURCE AVAILABILITY
Lead Contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Mona Singh (mona@cs.princeton.edu).
Materials Availability
This study did not generate new unique reagents or materials.
Data and Code Availability
All original code generated during this study are available at http://github.com/Singh-Lab/PertInInt.
METHOD DETAILS
Protein Site-Based Functional Tracks
Any pre-defined functional region of a protein can be encoded as a track in the PertInInt framework. Currently, we consider three types of per-site functional annotations—interaction, domain, and conservation—the former two of which may yield multiple subgene resolution tracks per protein. Each type of track is described in more detail below.
Interaction Tracks
Interaction tracks correspond to portions of a protein that are inferred to interact with ligands. These tracks arise in two ways.
First, we leverage sequence homology directly to transfer information from co-complex structures to human protein sequences as previously described (Ghersi and Singh, 2014). For proteins with one or more regions whose structure in complex with a ligand could be homology-modeled, we introduce a track for each contiguous homology-matched region. Per-position weights reflect the observed residue-to-ligand proximities, computed as the fraction of atoms in the amino acid residue found within 4.0Å of the ligand.
Second, we utilize the set of “confident” domain–ligand interactions from the InteracDome database (v0.3) (Kobren and Singh, 2019) to identify putative ligand-binding positions. We subset this collection to the 9,142 domain–ligand interactions across 1,850 domains that were characterized by InteracDome using at least five structural instances. Each position within an InteracDome domain is associated with a “binding frequency” between 0 and 1 that corresponds to the fraction of the time residues in this position were found to be in contact with the ligand of interest when analyzing co-complex structures. For each human protein, we identify instances of InteracDome domains using HMMER (v2.3.2 and v3.1b2), and require complete, high-scoring domain instances as previously described (Eddy, 2011; Finn et al., 2014; Kobren and Singh, 2019). Within a protein, there is a separate track for each domain–ligand instance within it; this track consists of the residues comprising the match states of the domain, and the weights of these residues are the binding frequencies for the ligand in the corresponding domain positions.
Finally, we note that some domain interactions are mediated not by individual domain instances but by repeating instances of the same domain family. To capture these interfaces, we also consider additional tracks encoding multiple instances of the same domain family in a protein; these “aggregate” tracks span noncontiguous intervals that correspond to the locations of individual domain instances, with track positions weighted according to the binding frequencies at corresponding domain match states as described above. Interaction domain tracks corresponding to domain families with 40+ instances in the same protein are replaced by their aggregate tracks.
Domain Tracks
For each Pfam-A (v31.0) domain instance within a protein sequence, there is a domain track that specifies which amino acids comprise the domain (Finn et al., 2014). Domain tracks span the length of the protein, and positions within and outside of the domain instance are respectively assigned weights of 1 and 0. We again also encode aggregate domain tracks as before to model functional regions mediated by repetitive domain families.
Conservation Tracks
Each protein has a single conservation track. We obtain the 100-vertebrate cross-species protein multiple sequence alignment from the UCSC Genome Browser (Meyer et al., 2013) and compute per-protein-position conservation-based functionality weights by multiplying the fraction of non-gap residues in the column by the Jensen-Shannon divergence (JSD) between those non-gap residues and a Blosum 62 background amino acid distribution (Capra and Singh, 2007).
Per-track Functional Mutation Scores
Suppose we have a protein sequence of length L spanning positions P = {p1,...,pL}. This protein is associated with multiple “tracks” W, each defined as W⊆P, where each position pi∈W is associated with a real-valued weight wi∈[0, 1] reflecting its functionality with respect to the track. Suppose there are n cancer somatic missense mutations across a cohort of tumor samples that fall in positions included in track W. For each mutation i, let zi∈{z1, …, zn} be the weight in track W of the position where that mutation lies. We further consider the case where each mutation i is associated with a value fi∈(0, 1]; here, each fi is set to the proportion of sequencing reads that contain the mutation (i.e., its subclonal fraction), which has previously been shown to be associated with a mutation’s relevance in cancer (McGranahan et al., 2015). The score of the somatic mutations with respect to track W is then defined as:
(Equation 1) |
Intuitively, this score reflects the extent to which somatic mutations are falling into functionally important positions within a track.
Per-track Analytical Z score Calculations
For a given score SW for a track, we next want to determine if this score is higher than we would expect by chance. One approach would be to repeatedly randomize the mutations within the positions of the track and use the distribution of resulting scores to compute an empirical p value. Here we show that we can determine the significance of these scores analytically, obviating the need for empirical mutation shuffles and dramatically improving runtime (Figure S2). Note that in the absence of any selective pressure, the values z1, …, zn are independent and identically distributed (i.i.d.) random variables. We leverage this observation to directly compute the significance of SW. First, we model all mutation locations zi as being drawn from the same background mutation model λ1, …, λL, where λi is the probability that a mutation affects position i. If every position i within a protein of length L is equally likely to harbor a missense mutation, λi = 1/L. Here, we incorporate codon-specific missense mutation probabilities as well as cancer-specific C/G-mutation biases into our background mutation model (see “Per-site background mutational model” STAR Methods section below). We linearly scale these values with respect to each track W such that . The expected weight of the position in which mutation i lies and its variance with respect to this null distribution are computed as
and
Because the total score of the set of mutations affecting track W (i.e., SW) is a sum of independent random variables (Equation 1), the expectation and variance of SW can also be calculated directly as
(Equation 2) |
and
(Equation 3) |
Finally, to determine the significance of the actual score SW, which indicates the propensity of somatic mutations to fall into highly weighted positions in a track, since the sum of independent random variables tends towards a normal distribution, we compute the mutational enrichment Z score for each track W as
(Equation 4) |
We note that if we (1) restricted each weight within a track to be 0/1 rather than real-valued, (2) restricted mutations to have equal fi values of 1, and (3) restricted the λi to be uniform across the track, we could determine per-track significance analytically using the binomial distribution. Note that with these restrictions, however, we would not be able to incorporate real-valued functionality weights from conservation or interaction tracks, subclonal mutation fractions, or mutational signatures.
Per-site Background Mutational Model
We model the likelihoods of protein positions p1, …, pL to harbor a missense mutation as λ1, …, λL such that
(Equation 5) |
where
and
Between-Track Analytical Covariance Calculation
In our framework, a single protein may be associated with multiple tracks, each representing a distinct aspect of protein functioning. Since tracks can share positions, the track scores with respect to a set of somatic mutations are not independent of each other, and thus we need to determine their covariance.
Suppose we consider two tracks V⊆P and W⊆P, where each position pi∈V is associated with a weight vi and each position pi∈W is associated with a weight wi. Suppose there are m mutations (with associated values ) that involve positions within track V, and n mutations (as before with associated values f1, …, fn) that involve positions within track W. Let y1, y2, …, ym be the weights of the positions that the m mutations in track V fall into, and let z1, z2, …, zn be the weights of the positions that the n mutations in track W fall into. Scores are thus calculated as before for tracks V and W as
and
Let X = V∩W. If the two tracks do not overlap (i.e., X = ∅), then the covariance between SV and SW is 0. Otherwise, note that , where and . Similarly, . Therefore, we can write covariance as
Because the covariance is bilinear, we can now expand this equation as
Finally, because mutations landing in track V outside of the overlap region X have no bearing on SW and vice versa, the first three covariance terms in the equation above will be evaluated as 0, leaving us with
In our framework, we compute covariance conditional on the q mutations observed to fall on positions shared by tracks V and W. Let , with the fi associated with the q mutations in X. With the number of mutations q fixed, we have
Note that the same mutations from tracks and land on the same position in the overlap region and simultaneously impact the SV and SW scores, whereas any other pair of mutations j ≠ k are independent. Hence,
(Equation 6) |
Analytical Formulation Enables Precomputation
Note that the per-track expectation, variance and covariance calculations (Equations 2, 3, and 6) can each be rewritten as or , where C is fixed per track. We therefore precompute the per-track expectations, variances, and cross-track covariances assuming a single mutation of value 1, and scale these precomputed values at runtime by the mutations observed to fall into each track; this allows PertInInt to achieve an additional 16–18× speedup at runtime (Figure S11).
Whole Gene Mutability Tracks
Using the same analytical formulation described above, we can also compute a Z score per gene reflecting whether the gene is more mutated overall than we might expect. We define a natural variation track of length L = 19,460 for each gene, where the entry corresponding to the gene of interest has a functionality weight of 1, and all other entries have weights of 0 (i.e., one-hot gene encodings). We then compute a corresponding background mutability probability distribution λ1,…λL based on how much each gene varies naturally across healthy human populations. Specifically, for each of 2,504 individuals included in the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2012), we first min-rank all protein-coding genes by their variant count, linearly scale these ranks to fall between 0 and 1, then round each normalized rank down to its nearest hundredth, which we refer to as its bin. We compute the expected bin value (across individuals) for each gene, and finally to derive the values of λ1,…, λL linearly scale these per-gene expected bin values such that they sum to 1 across all genes. For each track, we use this background mutation model and the n mutations observed to fall across all 19,460 genes to analytically compute per-gene expectations, variances, and Z scores as before. Note that PertInInt models 23,278 genes—of which 20,356 are on chromosomes 1–22, X or Y—but only 19,460 genes were profiled in the 1000 Genomes Project, and thus only this many genes have associated natural variation tracks. The covariance between the natural variation track and all subgene tracks is set to 0.
Since the whole-gene track W for gene Gj is a one-hot encoding, we can simplify Equation 1 as , Equation 2 as and Equation 3 as .
Because the number of mutations affecting all genes is often substantially larger than the number of mutations to affect any single gene, the whole-gene Z scores can be much larger than for the other tracks. We thus effectively subsample the total number of mutations by a factor s—set to in our implementation—to compute the whole-gene Z scores using the values below before combining them with other subgene Z scores:
Combining Per-track Z scores for Each Protein
We evaluate the significance of the scores for all tracks simultaneously using a multivariate normal distribution. Recall that our per-track somatic mutation functional scores (SW, Equation 1) and their analytically-derived Z scores (Equation 4) computed for random assignments of mutations are normally distributed when the number of mutations (n) is sufficiently large (i.e., by the Central Limit Theorem).
For each track W, we empirically determine this minimum n by randomly assigning up to 500 mutations to the track 1,000 times in accordance with the corresponding background mutation model (i.e., the λi’s) and recomputing SW each time. At each mutation count, we ask whether we can reject the null hypothesis that the mutation functional scores are normally distributed via the Shapiro-Wilk test with p value < 5e-5. We keep track of the minimum number of mutations per track where we could no longer confidently reject the normality assumption. Only scores derived from mutated tracks with the corresponding required minimum mutation count are modeled together in our multivariate Gaussian. We pre-compute this minimum mutation count value for each track (i.e., before evaluating any cancer somatic mutation data).
For each mutated protein, we compute a single combined score using a weighted Z-transform test with correlation correction (Zaykin, 2011) as
(Equation 7) |
where k is the number of tracks with their required minimum mutation count and positive Z scores, Zi corresponds to the Z score associated with track i, ci is a weight indicating the “confidence” of track i, and rij is the correlation between tracks i and j (i.e., rij = cov(Si, Sj)/σiσj ). In order to consider each type of functionality data equally, we assign per-track confidences ci such that the four functional track groups (i.e., interaction, domain, conservation, and natural variation; Figures 1A and S1) each contribute a quarter of the overall confidence. Within the interaction and domain track groups, where there may be 2+ tracks per group, confidence weights are assigned proportionally to , where m is the total number of mutations to fall into positively-weighted positions in the track. Finally, we assign a single score per gene by taking the maximum combined score achieved by any of its corresponding protein isoforms.
QUANTIFICATION AND STATISTICAL ANALYSIS
Cancer Mutation Data Preparation
We downloaded all open-access TCGA somatic exome mutation data and RNA-seq expression data from NCI’s Genomic Data Commons on July 15, 2017 (Fan et al., 2016; Grossman et al., 2016). We convert gene expression values (FPKM) to transcripts per million (TPM) and exclude mutations from genes that were expressed at <0.1 TPM in the corresponding tumor sample. For the 765 samples with missing expression data, we exclude mutations from genes that were expressed at <0.1 TPM on average across other tumor samples of the same tissue type. These steps resulted in a filtered set of 1,141,609 missense, 442,070 silent, and 94,813 nonsense mutations across 18,613 genes from 33 cancer types (Figure S3); note that we consider the unfiltered (by expression) set of 1,473,729 missense, 578,407 silent and 118,921 nonsense mutations across 19,550 genes when running alternate methods and when running PertInInt to compare to alternate methods. We combine COAD and READ cancer types into the COAD-READ group, and GBM and LGG cancer types into the GBMLGG group for per-cancer performance testing (Figures S8 and S9).
Runtime Analysis
PertInInt, as well as all algorithm variants of PertInInt and all alternate methods, are run as sole processes on single CPUs, each with a 2.4–2.7 Ghz processor and 30GB of RAM. Methods are timed using Python’s time package, and the real (i.e., “wall clock”) elapsed time is reported.
Testing Related Driver Detection Methods
We classify alternate cancer driver detection methods based on the mutational patterns they detect; these include whole gene enrichment, de novo linear clustering, enrichment in linear externally defined regions, de novo three-dimensional (3D) clustering, or enrichment in 3D externally defined regions (as in Porta-Pardo et al., 2017). We include methods from each of these five groups that require only mutational and/or structural input from the user and have open-source implementations that run locally on a 64-bit Linux machine using sample input. We test the whole gene methods DiffMut (Przytycki and Singh, 2017), MutSigCV (Lawrence et al., 2014), and OncodriveFML (Mularoni et al., 2016); the linear clustering methods Hotspot (Chang et al., 2016), OncodriveClust (Tamborero et al., 2013), and NMC (Ye et al., 2010); the linear externally defined regions methods ActiveDriver (Reimand and Bader, 2013), eDriver (Porta-Pardo and Godzik, 2014), and LowMACA (Melloni et al., 2016); the 3D clustering methods GraphPAC (Ryslik et al., 2014b), iPAC (Ryslik et al., 2013), and SpacePAC (Ryslik et al., 2014a); and the 3D externally defined regions method eDriver3D (Porta-Pardo et al., 2015). We note that in addition to overall mutation frequency, MutSigCV also considers linear clustering of mutations within genes and the functional impact of mutations based on evolutionary conservation.
All methods including PertInInt are run on the same mutation datasets before our filtering step of limiting to mutations from expressed genes. Additional data files required for individual methods are obtained from their most recent online repositories or otherwise from their original publications. For 3D clustering methods, we select a single structural template for each human protein wherever possible as suggested (i.e., preferring native over bound form, longer length, higher sequence identity, higher resolution, and smaller R-value; Berman et al., 2000). We note that because these 3D clustering methods only run on proteins with corresponding structural information, their results may be biased toward known cancer genes (i.e., 55.3% of cancer genes have structural templates whereas 28.1% of all genes have structural templates). Methods are run with default parameters, except GraphPAC and SpacePAC, where the significance threshold (α) is set to 1.0 to maximize the number of scored genes returned.
For each method, enrichment for known driver genes on increasingly larger sets of predictions is computed; enrichment is computed at each gene rank on pan-cancer results and at every tenth gene on per-cancer results to reduce the impact of minor reorderings of the relatively small number of known driver genes detected across these datasets. Specifically, we calculate enrichment as the fraction of known driver genes in the gene set (i.e., the precision) divided by the fraction of known driver genes in the whole gene set considered; unmutated driver genes with respect to each mutation dataset are excluded entirely. For this evaluation, we consider the CGC set of known driver genes as a gold standard, as well as alternate positive and negative sets of cancer genes in turn.
Note that LowMACA, NMC, and all three 3D clustering methods did not finish running without error within 30 days on the pan-cancer dataset. We were also unable to run and obtain results from NMC on an additional four individual cancer types (UCEC, SKCM, COADREAD, and LUSC), and thus exclude this method from our evaluation.
Supplementary Material
KEY RESOURCES TABLE.
Highlights.
Unified framework identifies cancer genes enriched in mutations in functional sites
Fast analytical calculations obviate the need for time-prohibitive permutation tests
Integration of structurally resolved interaction site data for >60% of human genes
Disrupted molecular functionalities revealed in known or newly predicted cancer genes
ACKNOWLEDGMENTS
Thanks to members of the Singh lab for their insight and comments. Thanks especially to Dario Ghersi, Anton Persikov, and Pawel Przytycki for helpful discussions regarding PDB files and TCGA mutation data. The results published here are in part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/. This work was funded in part by NIH grant R01-CA208148 (to M.S.) and NSF GRFP grant DGE-1148900 (to S.N.K.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Footnotes
DECLARATION OF INTERESTS
The authors declare no competing interests.
SUPPLEMENTAL INFORMATION
Supplemental Information can be found online at https://doi.org/10.1016/j.cels.2020.06.005.
REFERENCES
- Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, and Sunyaev SR (2010). A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bailey MH, Tokheim C, Porta-Pardo E, Sengupta S, Bertrand D, Weerasinghe A, Colaprico A, Wendl MC, Kim J, Reardon B, et al. (2018). Comprehensive characterization of cancer driver genes and mutations. Cell 173, 371–385.e18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Banerji V, Frumm SM, Ross KN, Li LS, Schinzel AC, Hahn CK, Kakoza RM, Chow KT, Ross L, Alexe G, et al. (2012). The intersection of genetic and chemical genomic screens identifies GSK-3α as a target in human acute myeloid leukemia. J. Clin. Invest 122, 935–947. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, and Bourne PE (2000). The Protein Data Bank. Nucleic Acids Res 28, 235–242. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Capra JA, and Singh M (2007). Predicting functionally important residues from sequence conservation. Bioinformatics 23, 1875–1882. [DOI] [PubMed] [Google Scholar]
- Carter H, Chen S, Isik L, Tyekucheva S, Velculescu VE, Kinzler KW, Vogelstein B, and Karchin R (2009). Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res 69, 6660–6667. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chang MT, Asthana S, Gao SP, Lee BH, Chapman JS, Kandoth C, Gao J, Socci ND, Solit DB, Olshen AB, et al. (2016). Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. Nat. Biotechnol 34, 155–163. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chin L, and Gray JW (2008). Translating insights from the cancer genome into clinical practice. Nature 452, 553–563. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cho Y, Gorina S, Jeffrey PD, and Pavletich NP (1994). Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations. Science 265, 346–355. [DOI] [PubMed] [Google Scholar]
- Davoli T, Xu AW, Mengwasser KE, Sack LM, Yoon JC, Park PJ, and Elledge SJ (2013). Cumulative haploinsufficiency and triplosensitivity drive aneuploidy patterns and shape the cancer genome. Cell 155, 948–962. [DOI] [PMC free article] [PubMed] [Google Scholar]
- de Groen FL, Krijgsman O, Tijssen M, Vriend LE, Ylstra B, Hooijberg E, Meijer GA, Steenbergen RD, and Carvalho B (2014). Gene-dosage dependent overexpression at the 13q amplicon identifies DIS3 as candidate oncogene in colorectal cancer progression. Genes Chromosomes Cancer 53, 339–348. [DOI] [PubMed] [Google Scholar]
- Dees ND, Zhang Q, Kandoth C, Wendl MC, Schierding W, Koboldt DC, Mooney TB, Callaway MB, Dooling D, Mardis ER, et al. (2012). MuSiC: identifying mutational significance in cancer genomes. Genome Res 22, 1589–1598. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Delgado MD, and León J (2006). Gene expression regulation and cancer. Clin. Transl. Oncol 8, 780–787. [DOI] [PubMed] [Google Scholar]
- Ding L, Bailey MH, Porta-Pardo E, Thorsson V, Colaprico A, Bertrand D, Gibbs DL, Weerasinghe A, Huang KL, Tokheim C, et al. (2018). Perspective on oncogenic processes at the end of the beginning of cancer genomics. Cell 173, 305–320.e10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eddy SR (2011). Accelerated profile HMM searches. PLoS Comput. Biol 7, e1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Engin HB, Kreisberg JF, and Carter H (2016). Structure-based analysis reveals cancer missense mutations target protein interaction interfaces. PLoS One 11, e0152929. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fan Y, Xi L, Hughes DST, Zhang J, Zhang J, Futreal PA, Wheeler DA, and Wang W (2016). MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol 17, 178. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. (2014). Pfam: the protein families database. Nucleic Acids Res 42, D222–D230. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Forbes SA, Tang G, Bindal N, Bamford S, Dawson E, Cole C, Kok CY, Jia M, Ewing R, Menzies A, et al. (2010). Cosmic (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res 38, D652–D657. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, and Stratton MR (2004). A census of human cancer genes. Nat. Rev. Cancer 4, 177–183. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gao J, Chang MT, Johnsen HC, Gao SP, Sylvester BE, Sumer SO, Zhang H, Solit DB, Taylor BS, Schultz N, and Sander C (2017). 3D clusters of somatic mutations in cancer reveal numerous rare mutations as functional targets. Genome Med 9, 4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ghersi D, and Singh M (2014). Interaction-based discovery of functionally important genes in cancers. Nucleic Acids Res 42, e18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gress A, Ramensky V, Büch J, Keller A, and Kalinina OV (2016). StructMAn: annotation of single-nucleotide polymorphisms in the structural context. Nucleic Acids Res 44, W463–W468. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grossman RL, Heath AP, Ferretti V, Varmus HE, Lowy DR, Kibbe WA, and Staudt LM (2016). Toward a shared vision for cancer genomic data. N. Engl. J. Med 375, 1109–1112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hoadley KA, Yau C, Hinoue T, Wolf DM, Lazar AJ, Drill E, Shen R, Taylor AM, Cherniack AD, Thorsson V, et al. (2018). Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell 173, 291–304.e6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- International Cancer Genome Consortium, Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, Bernabé RR, Bhan MK, Calvo F, Eerola I, et al. (2010). International network of cancer genome projects. Nature 464, 993–998. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jassal B, Matthews L, Viteri G, Gong C, Lorente P, Fabregat A, Sidiropoulos K, Cook J, Gillespie M, Haw R, et al. (2020). The reactome pathway KnowledgeBase. Nucleic Acids Res 48, D498–D503. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jeggo PA, Pearl LH, and Carr AM (2016). DNA repair, genome stability and cancer: a historical perspective. Nat. Rev. Cancer 16, 35–42. [DOI] [PubMed] [Google Scholar]
- Kamburov A, Lawrence MS, Polak P, Leshchiner I, Lage K, Golub TR, Lander ES, and Getz G (2015). Comprehensive assessment of cancer missense mutation clustering in protein structures. Proc. Natl. Acad. Sci. U S A 112, E5486–E5495. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, Xie M, Zhang Q, McMichael JF, Wyczalkowski MA, et al. (2013). Mutational landscape and significance across 12 major cancer types. Nature 502, 333–339. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kar G, Gursoy A, and Keskin O (2009). Human cancer protein-protein interaction network: a structural perspective. PLoS Comput. Biol. 5, e1000601. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Khurana E, Fu Y, Chakravarty D, Demichelis F, Rubin MA, and Gerstein M (2016). Role of non-coding sequence variants in cancer. Nat. Rev. Genet 17, 93–108. [DOI] [PubMed] [Google Scholar]
- Kobren SN, and Singh M (2019). Systematic domain-based aggregation of protein structures highlights DNA-, RNA- and other ligand-binding positions. Nucleic Acids Res 47, 582–593. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Korthauer KD, and Kendziorski C (2015). MADGiC: a model-based approach for identifying driver genes in cancer. Bioinformatics 31, 1526–1535. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lawrence MS, Stojanov P, Mermel CH, Robinson JT, Garraway LA, Golub TR, Meyerson M, Gabriel SB, Lander ES, and Getz G (2014). Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al. (2013). Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O’Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, et al. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McCubrey JA, Steelman LS, Bertrand FE, Davis NM, Sokolosky M, Abrams SL, Montalto G, D’Assoro AB, Libra M, Nicoletti F, et al. (2014). GSK-3 as potential target for therapeutic intervention in cancer. Oncotarget 5, 2881–2911. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McGranahan N, Favero F, de Bruin EC, Birkbak NJ, Szallasi Z, and Swanton C (2015). Clonal status of actionable driver events and the timing of mutational processes in cancer evolution. Sci. Transl. Med 7, 283ra54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McGranahan N, and Swanton C (2017). Clonal heterogeneity and tumor evolution: past, present, and the future. Cell 168, 613–628. [DOI] [PubMed] [Google Scholar]
- Melloni GEM, de Pretis S, Riva L, Pelizzola M, Céol A, Costanza J, Müller H, and Zammataro L (2016). LowMACA: exploiting protein family analysis for the identification of rare driver mutations in cancer. BMC Bioinformatics 17, 80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Meyer LR, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, Wong M, Sloan CA, Rosenbloom KR, Roe G, Rhead B, et al. (2013). The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res 41, D64–D69. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mitchell A, Chang HY, Daugherty L, Fraser M, Hunter S, Lopez R, McAnulla C, McMenamin C, Nuka G, Pesseat S, et al. (2015). The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res 43, D213–D221. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstråle M, Laurila E, et al. (2003). PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet 34, 267–273. [DOI] [PubMed] [Google Scholar]
- Mularoni L, Sabarinathan R, Deu-Pons J, Gonzalez-Perez A, and López-Bigas N (2016). OncodriveFML: a general framework to identify coding and non-coding regions with cancer driver mutations. Genome Biol 17, 128. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Munro D, Ghersi D, and Singh M (2018). Two critical positions in zinc finger domains are heavily mutated in three human cancer types. PLoS Comput. Biol 14, e1006290. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ng PC, and Henikoff S (2003). SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 31, 3812–3814. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Niknafs N, Kim D, Kim RG, Diekhans M, Ryan M, Stenson PD, Cooper DN, and Karchin R (2013). MuPIT interactive: webserver for mapping variant positions to annotated, interactive 3D structures. Hum. Genet 132, 1235–1243. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nishi H, Tyagi M, Teng S, Shoemaker BA, Hashimoto K, Alexov E, Wuchty S, and Panchenko AR (2013). Cancer missense mutations alter binding properties of proteins and their interaction networks. PLoS One 8, e66273. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Niu B, Scott AD, Sengupta S, Bailey MH, Batra P, Ning J, Wyczalkowski MA, Liang WW, Zhang Q, McLellan MD, et al. (2016). Protein-structure-guided discovery of functional mutations across 19 cancer types. Nat. Genet 48, 827–837. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Peterson TA, Gauran IIM, Park J, Park D, and Kann MG (2017). Oncodomains: a protein domain-centric framework for analyzing rare variants in tumor samples. PLoS Comput. Biol 13, e1005428. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pletscher-Frankild S, Pallejà A, Tsafou K, Binder JX, and Jensen LJ (2015). DISEASES: text mining and data integration of disease-gene associations. Methods 74, 83–89. [DOI] [PubMed] [Google Scholar]
- Porta-Pardo E, Garcia-Alonso L, Hrabe T, Dopazo J, and Godzik A (2015). A pan-cancer catalogue of cancer driver protein interaction interfaces. PLOS Comput. Biol 11, e1004518. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Porta-Pardo E, and Godzik A (2014). e-driver: a novel method to identify protein regions driving cancer. Bioinformatics 30, 3109–3114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Porta-Pardo E, Kamburov A, Tamborero D, Pons T, Grases D, Valencia A, Lopez-Bigas N, Getz G, and Godzik A (2017). Comparison of algorithms for the detection of cancer drivers at subgene resolution. Nat. Methods 14, 782–788. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Przytycki PF, and Singh M (2017). Differential analysis between somatic mutation and germline variation profiles reveals cancer-related genes. Genome Med 9, 79. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raimondi F, Singh G, Betts MJ, Apic G, Vukotic R, Andreone P, Stein L, and Russell RB (2017). Insights into cancer severity from biomolecular interaction mechanisms. Sci. Rep 6, 34490. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Reimand J, and Bader GD (2013). Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Mol. Syst. Biol 9, 637. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Reva B, Antipin Y, and Sander C (2011). Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 39, e118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ryslik GA, Cheng Y, Cheung KH, Modis Y, and Zhao H (2013). Utilizing protein structure to identify non-random somatic mutations. BMC Bioinformatics 14, 190. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ryslik GA, Cheng Y, Cheung KH, Bjornson RD, Zelterman D, Modis Y, and Zhao H (2014a). A spatial simulation approach to account for protein structure when identifying non-random somatic mutations. BMC Bioinformatics 15, 231. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ryslik GA, Cheng Y, Cheung KH, Modis Y, and Zhao H (2014b). A graph theoretic approach to utilizing protein structure to identify non-random somatic mutations. BMC Bioinformatics 15, 86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sanchez-Vega F, Mina M, Armenia J, Chatila WK, Luna A, La KC, Dimitriadoy S, Liu DL, Kantheti HS, Saghafinia S, et al. (2018). Oncogenic signaling pathways in The Cancer Genome Atlas. Cell 173, 321–337. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Satelli A, Rao PS, Thirumala S, and Rao US (2011). Galectin-4 functions as a tumor suppressor of human colorectal cancer. Int. J. Cancer 129, 799–809. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schaub FX, Dhankani V, Berger AC, Trivedi M, Richardson AB, Shaw R, Zhao W, Zhang X, Ventura A, Liu Y, et al. (2018). Pan-cancer alterations of the MYC oncogene and its proximal network across the Cancer Genome Atlas. Cell Syst 6, 282–300.e2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Seiler M, Peng S, Agrawal AA, Palacino J, Teng T, Zhu P, Smith PG, Cancer Genome Atlas Research Network, Buonamici S, and Yu L (2018). Somatic mutational landscape of splicing factor genes and their functional consequences across 33 cancer types. Cell Rep 23, 282–296.e4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shannon CE (1948). A mathematical theory of communication. Bell Syst. Tech. J 27, 379–423. [Google Scholar]
- Shihab HA, Gough J, Cooper DN, Day IN, and Gaunt TR (2013). Predicting the functional consequences of cancer-associated amino acid substitutions. Bioinformatics 29, 1504–1510. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Silverbush D, Cristea S, Yanovich-Arad G, Geiger T, Beerenwinkel N, and Sharan R (2019). Simultaneous integration of multi-omics data improves the identification of cancer driver modules. Cell Syst 8, 456–466.e5. [DOI] [PubMed] [Google Scholar]
- Stehr H, Jang SH, Duarte JM, Wierling C, Lehrach H, Lappe M, and Lange BM (2011). The structural impact of cancer-associated missense mutations in oncogenes and tumor suppressors. Mol. Cancer 10, 54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, and Mesirov JP (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U S A 102, 15545–15550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tamborero D, Gonzalez-Perez A, and López-Bigas N (2013). OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes. Bioinformatics 29, 2238–2244. [DOI] [PubMed] [Google Scholar]
- TCGA Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, and Stuart JM (2013). The Cancer Genome Atlas pan-cancer analysis project. Nat. Genet 45, 1113–1120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- The 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, and McVean GA (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium (2020). Pan-cancer analysis of whole genomes. Nature 578, 82–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
- The UniProt Consortium (2019). UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47, D506–D515. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tokheim C, Bhattacharya R, Niknafs N, Gygax DM, Kim R, Ryan M, Masica DL, and Karchin R (2016a). Exome-scale discovery of hotspot mutation regions in human cancer using 3D protein structure. Cancer Res 76, 3719–3731. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tokheim CJ, Papadopoulos N, Kinzler KW, Vogelstein B, and Karchin R (2016b). Evaluating the evaluation of cancer driver genes. Proc. Natl. Acad. Sci. U S A 113, 14330–14335. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Torkamani A, and Schork NJ (2008). Prediction of cancer driver mutations in protein kinases. Cancer Res 68, 1675–1682. [DOI] [PubMed] [Google Scholar]
- Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, and Kinzler KW (2013). Cancer genome landscapes. Science 339, 1546–1558. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ye J, Pavlicek A, Lunney EA, Rejto PA, and Teng CH (2010). Statistical method on nonrandom clustering with application to somatic mutations in cancer. BMC Bioinformatics 11, 11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zaykin DV (2011). Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J. Evol. Biol 24, 1836–1841. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zeng Q, Michael IP, Zhang P, Saghafinia S, Knott G, Jiao W, McCabe BD, Galván JA, Robinson HPC, Zlobec I, et al. (2019). Synaptic proximity enables NMDAR signalling to promote brain metastasis. Nature 573, 526–531. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang X, Choi PS, Francis JM, Gao GF, Campbell JD, Ramachandran A, Mitsuishi Y, Ha G, Shih J, Vazquez F, et al. (2018). Somatic superenhancer duplications and hotspot mutations lead to oncogenic activation of the KLF5 transcription factor. Cancer Discov 8, 108–125. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhao J, Cheng F, and Zhao Z (2017). Tissue-specific signaling networks rewired by major somatic mutations in human cancer revealed by proteome-wide discovery. Cancer Res 77, 2810–2821. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All original code generated during this study are available at http://github.com/Singh-Lab/PertInInt.