Abstract
A major challenge in cancer genomics is uncovering genes with an active role in tumorigenesis from a potentially large pool of mutated genes across patient samples. Here we focus on the interactions that proteins make with nucleic acids, small molecules, ions and peptides, and show that residues within proteins that are involved in these interactions are more frequently affected by mutations observed in large-scale cancer genomic data than are other residues. We leverage this observation to predict genes that play a functionally important role in cancers by introducing a computational pipeline (http://canbind.princeton.edu) for mapping large-scale cancer exome data across patients onto protein structures, and automatically extracting proteins with an enriched number of mutations affecting their nucleic acid, small molecule, ion or peptide binding sites. Using this computational approach, we show that many previously known genes implicated in cancers are enriched in mutations within the binding sites of their encoded proteins. By focusing on functionally relevant portions of proteins—specifically those known to be involved in molecular interactions—our approach is particularly well suited to detect infrequent mutations that may nonetheless be important in cancer, and should aid in expanding our functional understanding of the genomic landscape of cancer.
INTRODUCTION
Understanding how the genetic and epigenetic alterations acquired during tumorigenesis give rise to specific cancer phenotypes represents a major aim of cancer biology, and is an important motivation for profiling human cancers at the genomic level. The Cancer Genome Atlas (TCGA) and the Cancer Genome Project have already generated vast amounts of information (1), and have opened up unprecedented opportunities for studying the functional consequences of the molecular alterations found in human cancers. Because of emerging technologies such as exome sequencing (2,3), characterizing human cancers at the level of proteins in large cohorts of patients has now become feasible, with the prospect of even more data at lower cost in sight (4). Analyzing these data sets represents a promising avenue for furthering our understanding of cancer and for ultimately obtaining better patient stratification, refined prognostic tools and novel therapeutic targets (5).
Although numerous mutations are usually observed in each cancer genome (6), it has been proposed that the majority of them play no role in tumorigenesis (5,7), even when focusing on the protein coding regions of biologically plausible candidate genes (8). The mutational landscape of cancer is described as dotted for the most part by ‘hills’ (infrequently altered genes, some of which are functionally important), with only a few ‘mountains’ (i.e. genes altered in a high percentage of cases) (6).
Therefore, a major challenge in cancer genomics is in distinguishing ‘driver’ genes—with an active role in tumorigenesis—from genes with ‘passenger’ mutations; this is especially difficult in the case of infrequently mutated genes that are nonetheless important in cancers (9). Further, systematic analyses of cancer genomes are necessary due to the high degree of molecular heterogeneity displayed by tumors, even among patients diagnosed with the same cancer type. In fact, cancer heterogeneity goes beyond inter-patient variability, as at least some tumors have been shown to contain distinct clones, with complex and shifting dominance hierarchies (10,11). It has also been argued that mutations that are neutral with respect to the initial tumorigenesis may affect the way a patient will respond to a treatment, or the evolution of the disease in later stages (9).
Here we introduce an approach for uncovering genes that play a functional role in cancer by focusing on the distribution across patient samples of missense mutations in the nucleic acid, small molecule, ion and peptide binding sites of the proteins they encode. Because proteins accomplish most of their functions by interacting with other molecules, the residues that participate in these interactions and comprise their binding sites represent critical functional regions. The motivation behind our method is that if a binding site has an enriched number of mutations across patient samples as compared with the total number of mutations uncovered for the entire protein, it may play a functionally important role in cancer even if the protein itself is not frequently mutated overall.
A small number of previous studies have confirmed the critical functional impact of cancer somatic mutations on binding sites (12–15). Further, a number of tools have been developed to distinguish passenger from driver mutations by screening for genes with high mutation rates (16–18), by training classifiers for this task (19–22) or by applying methods developed for assessing the functional impact of SNPs (23–29). Other approaches exploit patterns of conservation in sequence alignments (30), or take into account the specific tolerance to variation exhibited by functional groups of genes (31). The distinct topological properties of cancer genes in protein interaction networks provide another discriminating feature for ‘true’ cancer genes (32–34), as does the gain or loss of phosphorylation sites (35,36), or aggregated mutation data at the level of individual domains (37). Homology modeling of protein–protein interactions has also been proposed as a way to discover novel cancer-related genes (38). Further, some methods integrate multiple features in a probabilistic framework (39), whereas others exploit pathway information (40–42). Despite a wealth of previous studies, a large-scale comprehensive analysis of the distribution of cancer mutations with respect to protein binding sites is still lacking.
We aim to fill this gap by providing a fully automated pipeline for mapping missense mutations across cancer (and other types of) exomes, and analyzing their distributions with respect to protein–nucleic acid, protein–peptide, protein–small molecule and protein–ion interactions. We show that this pipeline can map a significant fraction of human proteins onto structures with annotated binding information. We next demonstrate that binding site residues are more frequently affected by cancer missense mutations than are other residues, but are less affected by single nucleotide polymorphisms (SNPs) observed across populations. Finally, we show that by focusing on binding sites, our approach identifies many genes already known to be causally implicated in cancers. Our software is available online at http://canbind.princeton.edu as a web server to explore the data deposited in TCGA, and as a standalone package to study newly sequenced genomes.
MATERIALS AND METHODS
Reference sequence data sets
Somatic missense mutations were obtained from TCGA. Eight cancer types with unrestricted data (as of March 2013) were used for this analysis: breast cancer (BRCA, 775 samples), clear cell kidney cancer (KIRC, 219 samples), colon adenocarcinoma (COAD, 84 samples), endometrial cancer (UCEC, 247 samples), glioblastoma multiforme (GBM, 290 samples), lung squamous carcinoma (LUSC, 176 samples), ovarian cancer (OV, 151 samples) and rectal adenocarcinoma (READ, 38 samples).
Chromosomal coordinates of the missense mutations provided by TCGA were converted to protein sequence coordinates by building the transcripts according to the Genome Reference Consortium assembly (GRCh37.p10) and by mapping the mutated codons to the corresponding position in protein sequence space. For each gene, mutation data were mapped to all the isoforms reported in the assembly, and the isoform that allowed the largest number of mutated positions to be mapped was selected. In the case of ties, the first isoform reported in the Genome Reference Consortium assembly was retained. We refer to these protein sequences as ‘reference’ sequences. There are 17 556 reference protein sequences with a missense or synonymous mutation in at least one of the eight cancer data sets. Additionally, SNP data, already mapped to reference transcripts, were obtained from the Single Nucleotide Polymorphism Database (dbSNP) repository (43), release 137.
Assigning binding information to reference sequences
Binding information for structures deposited in the Protein Data Bank (44) was obtained from BioLip (45), a repository of biologically relevant protein–ligand interactions. We considered all the different types of interactions reported in BioLip (DNA/RNA, peptides, metals and small molecules), and extracted the sequences of the protein chains for which binding information was available. To increase coverage, we used the redundant version of BioLip (45).
The protein chain sequences were converted into a BLAST database using the BLAST +2.2.26 suite (46). For each reference sequence under consideration, we ran a BLASTP search with default parameters and an E-value <10−6. All hits with sequence identity >60% and coverage of the matching structure >80% were aligned to the initial reference sequence using Clustal Omega with default parameters (47). Using these pairwise alignments, the binding information contained in BioLip was transferred to the reference sequence only if the sequence identity in the binding residues was >90%.
We note that with the procedure outlined above, the same structure can map to multiple reference sequences, and more than one structure can map to the same reference sequence. To deal with this issue, for each reference sequence with multiple matching structures, we merged the mapped binding sites if they had at least one residue in common. Multiple binding sites within the same sequence with no residue in common were treated independently.
Analyzing the distribution of cancer mutations and SNPs in protein binding sites
For both TCGA and dbSNP sets, we calculated (i) the total number of positions with structural information, N; (ii) the total number of binding residues, K; (iii) the total number of positions with a mutation (or SNP, respectively) in a structural part, n; and (iv) the total number of binding positions hit by a mutation (or SNP, respectively), k. We then calculated the probability of observing a given number of positions affected by a cancer mutation (SNP, respectively), using the hypergeometric distribution, and assuming that affected binding positions have the same probability of being mutated or affected by a SNP as non-binding positions. The probability density curves shown in Figure 2 were obtained from the hypergeometric distribution:
(1) |
by using K, N and n from each data set, and varying the parameter k. The same procedure was followed to generate probability distributions for each type of binding site (Figure 2). P-values for the observed k were computed as for cancer missense mutations and for SNPs.
Assigning binding scores to protein sites
To capture the relative importance of a given residue within a binding site, we devised a score that uses the fraction of all heavy atoms of the residue within 4 Å of any ligand atom as a weight. In case of multiple structures matching the same TCGA sequence, the per-residue score was averaged over all the structures. That is, the binding score b for residue i in a given TCGA sequence was computed as:
(2) |
where N is the number of structures that match the sequence at position i; Aij is the set of heavy atoms of residue i in structure j; and is the subset of atoms of Aij within 4.0 Å of any ligand atom. All binding residues received a score between 0 and 1.
Selecting significantly mutated binding sites
For each human reference sequence with at least one reported mutation in a binding residue, we calculated the score si of each residue by multiplying its binding score bi [see Equation (2)] by the number of samples with a reported missense mutation in that residue. We then summed these si scores over all the mutated residues in the binding site, obtaining the total score per binding site sb. Next, we calculated the total number of mutations m within the sequence that affected residues with structural information, and to calculate empirical P-values, we generated 100 000 replicates, where for each replicate m residues with structural information are uniformly sampled with replacement. Finally, for each replicate, we calculated the binding score (as described above) using the sampled mutational data, and computed empirical P-values by counting the fraction of random samples with a more extreme score than that observed in the real case. In sequences with more than one binding site, we considered each site independently. We note that in the case of identical scores for all binding residues, this procedure would approximate sampling from a binomial distribution.
In Figure 3, we aggregated the mutations for all cancer types, whereas in Figure 4 we considered each cancer type separately. P-values of all binding sites in each cancer type were converted into false discovery rates (FDRs) using the Benjamini–Hochberg procedure (48).
RESULTS
We developed a computational approach (shown in Figure 1) that (i) takes as input mutations uncovered in exomes; (ii) maps them onto reference protein sequences; (iii) determines for each reference sequence if it is possible to transfer structural information from BioLip (45), a semi-manually curated database of protein–ligand interactions; (iv) extracts residues in the reference sequences that comprise binding sites for small molecules, DNA, RNA, peptides or ions, if structural information is available; (v) highlights genes with mutations that fall into a binding site of the corresponding proteins; and (vi) uncovers genes whose proteins are significantly enriched in mutations in at least one of their binding sites.
We applied this procedure to eight fully-available cancer data sets in TCGA, as well as to SNP data available from dbSNP, a catalog of both common and rare variants in nucleotide sequences. The cancer data sets consist of 1980 patient samples from either BRCA, KIRC, COAD, UCEC, GBM, LUSC, OV or READ.
More than 20% of human genes can be mapped to protein structures that have binding information
Given the increasing number of cancer genomes that are being sequenced, it is likely that a cancer mutation will eventually be observed in almost every human gene; thus, we first set out to determine how many human genes can be mapped to protein structures with binding information. We gathered all the Uniprot protein sequences for Homo sapiens (May 2013); this yielded 88 817 protein sequences, which mapped to 20 421 genes. For each Uniprot sequence, we performed a BLAST search against the BioLip database, and found that 4471 human genes (21.9%) are similar to at least one BioLip entry (E-value <10−6, coverage of the structural part >80% and sequence identity >60%).
In the TCGA data set, we found 17 379 genes with at least one missense mutation in at least one of the eight cancer types that we included in the analysis (see ‘Materials and Methods’ section for more details). Of these 17 379 genes, 3943 (22.7%) were similar to a BioLip entry, using the same criteria as above. When requiring ≥90% sequence identity in the binding sites between the human sequence and the protein structure, the number of mappable genes went down to 3656 (21%).
Common polymorphisms and cancer missense mutations show opposite trends in their propensity to affect protein binding sites
As a proof of concept, we first addressed the question of whether cancer missense mutations show a different propensity to affect protein binding sites as compared with SNPs obtained from dbSNP (43). In addition to missense mutations, we also considered synonymous mutations observed in TCGA.
We computed the total number of binding residues where cancer missense mutations, cancer synonymous mutations and population SNPs were observed. For this part of the analysis, only the number of affected positions (and not the frequency with which they were affected) was considered. An excess of cancer missense mutations was found in protein binding sites (, hypergeometric test, Figure 2a, left). In contrast, SNPs tended to avoid protein binding sites (, Figure 2a, middle). We hypothesized that if cancer missense mutations are enriched in binding sites because they play a functional role, we should not find this enrichment when considering the distribution of synonymous mutations found in the same cancer types. As expected, the total number of synonymous mutations in protein binding sites lay within the expected range (Figure 2a, right).
To determine whether the same trend could be observed for each type of binding sites, we repeated the analysis by considering in turn each of (i) nucleic acid binding residues, (ii) peptide binding residues and (iii) small molecule and ion binding residues. The global trends observed with aggregate binding data were recapitulated for each of the three types of binding sites, with cancer missense mutations hitting more than expected by chance nucleic acid binding residues (, Figure 2b), peptide binding residues (, Figure 2c) and small molecule and ion binding residues (, Figure 2d). In contrast, SNPs were less likely than expected to hit nucleic acid binding residues (, Figure 2b), peptide binding residues (, Figure 2c) and small molecule and ion binding residues (, Figure 2d).
We also set out to determine whether the same conclusions held when considering the frequency with which binding positions were affected. Because dbSNP does not directly provide the number of times a SNP was observed, we limited this analysis to cancer missense and synonymous mutations. We computed the mutation frequency in binding and non-binding residues and performed a binomial test, defining the probability of randomly hitting a binding residue as the fraction of residues with structual information that are binding residues. Across our entire data set, the probability of observing at least the observed number of cancer missense mutations in binding residues by chance alone was , whereas the probability of observing at least the observed number of cancer synonymous mutations in binding sites by chance alone was 0.28.
It is important to point out that the type of analysis shown in Figure 2 only reflects global tendencies, and biologically important exceptions for specific genes can be observed. For example, while SNPs tended to fall outside protein binding sites, we found that genes belonging to the major histocompatibility complex had between two and three times more binding positions with SNPs than expected, confirming the well-known allelic diversity of major histocompatibility complex genes (49). No other genes had more than two times the binding positions with SNPs than expected.
Genes that are significantly mutated in protein-binding sites are enriched in well-known cancer genes
Having confirmed that cancer missense mutations tend to occur in protein binding sites more frequently than expected, we set out to identify genes with a statistically significant excess of mutations in binding sites across the TCGA data, aggregating the mutations found in the eight cancer types. For each binding site, we calculated a mutation score, defined as the sum across the positions that comprise it of the number of mutations affecting each position weighted by the binding score of the position. The binding score is calculated as the fraction of all the heavy atoms of the residue in that position in the mapped structure that are in proximity (≤4.0 Å) of the ligand. We then computed an empirical P-value for each binding site using a permutation test (see ‘Materials and Methods’). We note that by using data obtained exclusively from large-scale DNA sequencing, we minimize the risk of study biases that would artificially inflate the mutation count for well-characterized binding sites.
To validate the method in the absence of a gold standard, we computed the enrichment for Cancer Census Genes (CCGs)—a curated list of genes causally implicated in cancer (50)—in gene sets with progressively smaller P-values. The results showed that smaller P-value thresholds yielded increasingly higher CCGs enrichment values (Figure 3). For example, 39 genes were in the top 2.5% of the P-value distribution, and 10 of them were CCGs; because the total number of binding sites with at least one mutation in a binding site was 1379, of which 104 were in CCGs, this is an enrichment factor of >3.40 (, hypergeometric test). As a control, we repeated the analysis on synonymous cancer mutations, obtaining—as expected—no enrichment in CCGs for small P-values (Figure 3).
Per-cancer type analysis
The analysis described above was carried out by considering all the reported somatic mutations in the eight cancer types together. Although the results suggest that genes whose corresponding proteins are significantly affected by mutations in their binding sites are enriched in those with a known role in tumorigenesis, they do not reveal what happens at the level of individual cancer types.
To address this question, we carried out a per-cancer type analysis, using only the mutations reported in each cancer type. Using an FDR < 0.1, we obtained 42 genes whose encoded proteins were significantly mutated within binding sites in at least one cancer type. Figure 4 shows the significant genes, with the total number of mutations in binding residues normalized by the number of samples. As expected, most of the CCGs appeared to be mutated at a higher frequency in multiple cancer types, whereas other genes displayed more cancer-specific patterns.
The contribution of different types of binding sites is also shown in Figure 4. We note that while mutations observed in small molecule or ion binding sites far outnumber the mutations in other binding sites, these binding sites were also the most highly represented in our structural data set. A small molecule or ion binding site was found in >82% of the structural entries, whereas nucleic acid and peptide binding sites were found in 14 and 23% of the entries, respectively, with some proteins having more than one type of binding site.
We performed Gene Ontology (GO) (51) biological process enrichment analysis on this set of 42 genes. Using a Bonferroni corrected P-value threshold of <0.05, the most specific GO-enriched terms were fibroblast growth factor receptor signaling pathway, epidermal growth factor receptor signaling pathway, response to insulin stimulus, neurotrophin TRK receptor signaling pathway and phosphate-containing compound metabolic process; we note that these terms are mostly related to signaling.
The original publications for seven of the eight cancer types considered here reported genes with significantly recurrent mutations (Supplementary Text S1). As expected, there was agreement between our approach and the list of recurrently mutated genes for several well-known players (e.g. AKT1, FBXW7, KRAS, PIK3CA and TP53). However, by focusing on protein-binding sites, our approach was able to uncover genes that were not detected as recurrently mutated, a few of which are discussed further below.
Examples of genes significantly mutated in binding sites
To showcase how our computational pipeline may assist in analyzing cancer proteomes, we briefly highlight a few proteins that were found to have an enriched number of mutations in their binding sites across cancer proteomes.
DICER1
Our approach suggests a role for DICER1 in endometrial cancer (Figures 4 and 5a), where six of the nine observed mutations cluster around two Mg2+ binding sites in the RNase IIIb domain. Mg2+ ions have been shown to play a role in DICER’s ability to bind RNA and cleave it (52), and are found in a negatively charged valley (Figure 5a). The six mutations in endometrial cancer replace the negatively charged amino acids (Asp and Glu) with either positively charged (Lys) or non-charged amino acids (Asn, Gly and Ala), thereby suggesting a potential loss of Mg2+ binding and RNase activity.
Supporting the potential role of DICER in cancer, we note that global microRNA (miRNA) downregulation is frequently observed in human cancers (57). Work by Martello et al. showed that the miRNA family miR103/107 (over-expressed in some breast cancers) can lead to less-differentiated cancer cells and a metastatic phenotype by targeting DICER, a crucial component of miRNA processing (58). More recently, recurrent somatic mutations of DICER1 around metal-binding residues were found in non-epithelial ovarian cancer (59).
AKT3
AKT consists of three isoforms (AKT1, AKT2 and AKT3) encoded by distinct genes but each containing a pleckstrin homology domain (60). AKT3—a gamma serine/threonine kinase in the phosphatidylinositol 3-kinase pathway—has only two mutations in clear cell kidney cancer (Figure 4). However, both these mutations (Gly16Val and Glu17Lys) fall in the binding site for phosphatidylinositol (3,4,5)-trisphosphate, in the pleckstrin homology domain of the kinase (Figure 5b).
It has previously been found that over-expression of AKT3 is a critical factor that correlates with cell proliferation in ovarian cancer (61). The Glu17Lys mutation seen in the binding site of AKT3 in kidney cancer was also found in the pleckstrin homology domain of AKT1 in breast, colorectal and ovarian cancers, and results in the activation of AKT1, followed by downstream signaling and cell transformation (62). The same gain-of-function mutation in AKT1 was also subsequently observed in squamous lung cancer, with a frequency of 0.6% (63).
MBD1
Methyl-CpG-binding domain protein 1 (MBD1) is a transcriptional repressor that functions by binding CpG islands in gene promoters (64). MBD1 has been shown to bind the promoters of known tumor suppressor genes (e.g. p16, VHL and E-cadherin) (64). Our pipeline found MBD1 to be significantly mutated in binding sites in the endometrial cancer data set (Figure 4) because of four somatic mutations affecting a binding site that recognizes methylated DNA (Figure 5c). The mutations replace two Arg residues in positions 18 and 22 with either His or Cys. Interestingly, another mutation (Arg17Cys) that falls just outside the binding site (>4 Å from DNA) but is next to Arg18 has been associated with Rett syndrome in the MeCP2 gene, which has an identical MBD domain (55).
FBXW11
FBXW11 (also known as HOS) is part of the SCF complex, which mediates the proteasome-dependent degradation of phosphorylated substrates (65). We found FBXW11 to be significantly enriched in binding site mutations in colon adenocarcinoma (FDR < 0.1), and of borderline significance in glioblastoma multiforme (FDR = 0.104) (Figure 4). Three of the four mutations observed in the two cancer types occur in proximity to a site that binds beta-catenin (Figure 5d), a well-studied protein involved in the Wnt signaling pathway, and implicated in several malignancies, such as colon cancer, melanoma, medulloblastoma and others (66). In the complex shown in Figure 5d, Arg370 is 2.4 Å away from a phosphorylated Ser in beta-catenin, and is replaced with a Gln in colon adenocarcinoma and a Trp in glioblastoma multiforme. Another mutation in colon adenocarcinoma affects Gly347, which is replaced with a negatively charged amino acid (Asp). We note that mutations in beta-catenin, FBXW11’s substrate, have also been observed in colon cancer, where they affect Ser and Thr residues that are essential for the phosphorylation-dependent degradation of beta-catenin (67).
DISCUSSION
Owing to the growing catalog of molecular events that occur during tumorigenesis, cancer biology has reached a turning point. A high resolution view of the mutational landscape in cancer is now becoming available and—as a consequence—almost every gene has been or will be found to be mutated in at least a few patient samples.
In this article, we have described a novel structural bioinformatics approach that aims to understand the effects of mutations in the broader context of a protein’s molecular interactions, and to assess the potential of mutations to disrupt these interactions. Our pipeline represents a complementary approach to existing methods, as it directly uses structural information in the context of large-scale cancer resequencing data, and is a step toward providing mechanistic interpretations of the effects of mutations. One important aspect of our approach is that it can highlight genes that may be infrequently mutated overall, but for which mutations preferentially occur in binding sites. It has previously been observed that genes that are mutated at low frequency can play important functional roles in cancer, and account (at least in part) for the high degree of clinical heterogeneity observed in many cancer types (10,11).
Given the complexity and diversity of cancer proteomes, it is not surprising that many computational methods to prioritize candidate genes have been developed over the years. Perhaps most similar to the work described here are earlier attempts to combine structural information about protein binding sites with cancer mutation data. Stehr et al. (14) characterized the structural differences between oncogenes and tumor suppressors, and their analysis included protein functional sites. However, their aim was to describe the structural impact of mutations in well-known and frequently mutated genes, and thus they analyzed 24 well-characterized genes. In contrast, our method analyzes thousands of genes and identifies those that may be infrequently mutated overall, but are nonetheless important. A more recent paper by Nishi et al. (15) studied the changes in binding energy caused by mutations in glioblastoma multiforme and further included protein–protein interfaces. Our method does not attempt to estimate the impact of any individual mutation on the binding energy, but simply highlights genes whose mutation patterns statistically deviate from expectation, thereby suggesting a selective process at work. The power of our method lies in its ability to detect genes with only a few mutations concentrated in a small functional part of the protein (i.e. its binding sites). For this reason, we excluded protein–protein interfaces from our analysis, which can contain a substantial fraction of a protein’s residues. For example, in the data set used in (15), on average, 41% of the residues within a protein take part in protein–protein interfaces (see Supplementary Text S2).
One limitation of our approach is that it requires structural information to accurately pinpoint the binding residues. As a consequence, a little over one-fifth of human genes can be studied at this time, although we expect this number to grow over time as our knowledge of protein structures increases. Moreover, we note that an excess of mutations in binding sites does not reveal whether a loss or a gain of function is at play, although it has been proposed that gain of function is more likely to occur in binding sites (14). Further work will be required to predict the functional implications at the cellular level of a mutational event in a binding site.
To conclude, by focusing on protein binding sites, we developed an automated approach that is particularly well suited to capture relatively rare mutations that are likely to perturb protein function. Therefore, our approach and the accompanying software should prove useful to the community as a hypothesis-generating tool, and as a bridge between detailed structural analyses of select genes and broad statistical screenings of cancer genomes.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
American-Italian Cancer Foundation Postdoctoral Fellowship (to D.G.). National Science Foundation [ABI-1062371 to M.S.] (in part); National Institutes of Health [GM076275 to M.S.] (in part); National Institute of Health Center of Excellence [P50 GM071508 to David Botstein], (in part); Forese Family Fund for Innovation. Funding for open access charge: [ABI-1062371 and GM076275].
Conflict of interest statement. None declared.
Supplementary Material
ACKNOWLEDGEMENT
The authors thank the members of the Singh laboratory for useful discussions.
REFERENCES
- 1.Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, Bernabe RR, Bhan MK, Calvo F, Eerola I, Gerhard DS, et al. International network of cancer genome projects. Nature. 2010;464:993–998. doi: 10.1038/nature08987. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Barbieri CE, Baca SC, Lawrence MS, Demichelis F, Blattner M, Theurillat JP, White TA, Stojanov P, Van Allen E, Stransky N, et al. Exome sequencing identifies recurrent SPOP, FOXA1 and MED12 mutations in prostate cancer. Nat. Genet. 2012;44:685–689. doi: 10.1038/ng.2279. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Liang H, Cheung LW, Li J, Ju Z, Yu S, Stemke-Hale K, Dogruluk T, Lu Y, Liu X, Gu C, et al. Whole-exome sequencing combined with functional genomics reveals novel candidate driver cancer genes in endometrial cancer. Genome Res. 2012;22:2120–2129. doi: 10.1101/gr.137596.112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Green ED, Guyer MS. Charting a course for genomic medicine from base pairs to bedside. Nature. 2011;470:204–213. doi: 10.1038/nature09764. [DOI] [PubMed] [Google Scholar]
- 5.Garraway LA, Lander ES. Lessons from the cancer genome. Cell. 2013;153:17–37. doi: 10.1016/j.cell.2013.03.002. [DOI] [PubMed] [Google Scholar]
- 6.Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Jr, Kinzler KW. Cancer genome landscapes. Science. 2013;339:1546–1558. doi: 10.1126/science.1235122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, Chen S, Karchin R, Kinzler KW, Vogelstein B, Nowak MA. Accumulation of driver and passenger mutations during tumor progression. Proc. Natl Acad. Sci. USA. 2010;107:18545–18550. doi: 10.1073/pnas.1010978107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C, et al. Patterns of somatic mutation in human cancer genomes. Nature. 2007;446:153–158. doi: 10.1038/nature05610. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458:719–724. doi: 10.1038/nature07943. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Inda MM, Bonavia R, Mukasa A, Narita Y, Sah DW, Vandenberg S, Brennan C, Johns TG, Bachoo R, Hadwiger P, et al. Tumor heterogeneity is an active process maintained by a mutant EGFR-induced cytokine circuit in glioblastoma. Genes Dev. 2010;24:1731–1745. doi: 10.1101/gad.1890510. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Keats JJ, Chesi M, Egan JB, Garbitt VM, Palmer SE, Braggio E, Van Wier S, Blackburn PR, Baker AS, Dispenzieri A, et al. Clonal competition with alternating dominance in multiple myeloma. Blood. 2012;120:1067–1076. doi: 10.1182/blood-2012-01-405985. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Kar G, Gursoy A, Keskin O. Human cancer protein-protein interaction network: a structural perspective. PLoS Comput. Biol. 2009;5:e1000601. doi: 10.1371/journal.pcbi.1000601. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Richardson CJ, Gao Q, Mitsopoulous C, Zvelebil M, Pearl LH, Pearl FM. MoKCa database–mutations of kinases in cancer. Nucleic Acids Res. 2009;37:D824–D831. doi: 10.1093/nar/gkn832. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Stehr H, Jang SH, Duarte JM, Wierling C, Lehrach H, Lappe M, Lange BM. The structural impact of cancer-associated missense mutations in oncogenes and tumor suppressors. Mol. Cancer. 2011;10:54. doi: 10.1186/1476-4598-10-54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Nishi H, Tyagi M, Teng S, Shoemaker BA, Hashimoto K, Alexov E, Wuchty S, Panchenko AR. Cancer missense mutations alter binding properties of proteins and their interaction networks. PLoS One. 2013;8:e66273. doi: 10.1371/journal.pone.0066273. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N, et al. The consensus coding sequences of human breast and colorectal cancers. Science. 2006;314:268–274. doi: 10.1126/science.1133427. [DOI] [PubMed] [Google Scholar]
- 17.Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008;455:1069–1075. doi: 10.1038/nature07423. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Youn A, Simon R. Identifying cancer driver genes in tumor genome sequencing studies. Bioinformatics. 2011;27:175–181. doi: 10.1093/bioinformatics/btq630. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Kaminker JS, Zhang Y, Waugh A, Haverty PM, Peters B, Sebisanovic D, Stinson J, Forrest WF, Bazan JF, Seshagiri S, et al. Distinguishing cancer-associated missense mutations from common polymorphisms. Cancer Res. 2007;67:465–473. doi: 10.1158/0008-5472.CAN-06-1736. [DOI] [PubMed] [Google Scholar]
- 20.Carter H, Chen S, Isik L, Tyekucheva S, Velculescu VE, Kinzler KW, Vogelstein B, Karchin R. Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res. 2009;69:6660–6667. doi: 10.1158/0008-5472.CAN-09-1133. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Capriotti E, Altman RB. A new disease-specific machine learning approach for the prediction of cancer-causing missense variants. Genomics. 2011;98:310–317. doi: 10.1016/j.ygeno.2011.06.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Shihab HA, Gough J, Cooper DN, Day IN, Gaunt TR. Predicting the functional consequences of cancer-associated amino acid substitutions. Bioinformatics. 2013;29:1504–1510. doi: 10.1093/bioinformatics/btt182. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–3814. doi: 10.1093/nar/gkg509. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Bao L, Cui Y. Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics. 2005;21:2185–2190. doi: 10.1093/bioinformatics/bti365. [DOI] [PubMed] [Google Scholar]
- 25.Karchin R, Diekhans M, Kelly L, Thomas DJ, Pieper U, Eswar N, Haussler D, Sali A. LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics. 2005;21:2814–2820. doi: 10.1093/bioinformatics/bti442. [DOI] [PubMed] [Google Scholar]
- 26.Yue P, Moult J. Identification and analysis of deleterious human SNPs. J. Mol. Biol. 2006;356:1263–1274. doi: 10.1016/j.jmb.2005.12.025. [DOI] [PubMed] [Google Scholar]
- 27.Wood LD, Parsons DW, Jones S, Lin J, Sjoblom T, Leary RJ, Shen D, Boca SM, Barber T, Ptak J, et al. The genomic landscapes of human breast and colorectal cancers. Science. 2007;318:1108–1113. doi: 10.1126/science.1145720. [DOI] [PubMed] [Google Scholar]
- 28.Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Le Calvez-Kelm F, Lesueur F, Damiola F, Vallee M, Voegele C, Babikyan D, Durand G, Forey N, McKay-Chopin S, Robinot N, et al. Rare, evolutionarily unlikely missense substitutions in CHEK2 contribute to breast cancer susceptibility: results from a breast cancer family registry case-control mutation-screening study. Breast Cancer Res. 2011;13:R6. doi: 10.1186/bcr2810. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 2011;39:e118. doi: 10.1093/nar/gkr407. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Gonzalez-Perez A, Deu-Pons J, Lopez-Bigas N. Improving the prediction of the functional impact of cancer mutations by baseline tolerance transformation. Genome Med. 2012;4:89. doi: 10.1186/gm390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Jonsson PF, Bates PA. Global topological features of cancer proteins in the human interactome. Bioinformatics. 2006;22:2291–2297. doi: 10.1093/bioinformatics/btl390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Rambaldi D, Giorgi FM, Capuani F, Ciliberto A, Ciccarelli FD. Low duplicability and network fragility of cancer genes. Trends Genet. 2008;24:427–430. doi: 10.1016/j.tig.2008.06.003. [DOI] [PubMed] [Google Scholar]
- 34.Ghersi D, Singh M. Disentangling function from topology to infer the network properties of disease genes. BMC Syst. Biol. 2013;7:5. doi: 10.1186/1752-0509-7-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Radivojac P, Baenziger PH, Kann MG, Mort ME, Hahn MW, Mooney SD. Gain and loss of phosphorylation sites in human cancer. Bioinformatics. 2008;24:i241–i247. doi: 10.1093/bioinformatics/btn267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Reimand J, Bader GD. Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Mol. Syst. Biol. 2013;9:637. doi: 10.1038/msb.2012.68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Nehrt NL, Peterson TA, Park D, Kann MG. Domain landscapes of somatic mutations in cancer. BMC Genomics. 2012;13(Suppl. 4):S9. doi: 10.1186/1471-2164-13-S4-S9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Hosur R, Xu J, Bienkowska J, Berger B. iWRAP: an interface threading approach with application to prediction of cancer-related protein-protein interactions. J. Mol. Biol. 2011;405:1295–1310. doi: 10.1016/j.jmb.2010.11.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Cho DY, Przytycka TM. Dissecting cancer heterogeneity with a probabilistic genotype-phenotype model. Nucleic Acids Res. 2013;41:8011–8020. doi: 10.1093/nar/gkt577. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Liu Y, Koyuturk M, Barnholtz-Sloan JS, Chance MR. Gene interaction enrichment and network analysis to identify dysregulated pathways and their interactions in complex diseases. BMC Syst. Biol. 2012;6:65. doi: 10.1186/1752-0509-6-65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Ng S, Collisson EA, Sokolov A, Goldstein T, Gonzalez-Perez A, Lopez-Bigas N, Benz C, Haussler D, Stuart JM. PARADIGM-SHIFT predicts the function of mutations in multiple cancers using pathway impact analysis. Bioinformatics. 2012;28:i640–i646. doi: 10.1093/bioinformatics/bts402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Vandin F, Upfal E, Raphael BJ. De novo discovery of mutated driver pathways in cancer. Genome Res. 2012;22:375–385. doi: 10.1101/gr.120477.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29:308–311. doi: 10.1093/nar/29.1.308. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Yang J, Roy A, Zhang Y. BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013;41:D1096–D1103. doi: 10.1093/nar/gks966. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Soding J, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 2011;7:539. doi: 10.1038/msb.2011.75. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B Methodol. 1995;57:289–300. [Google Scholar]
- 49.Sommer S. The importance of immune gene variability (MHC) in evolutionary ecology and conservation. Front. Zool. 2005;2:16. doi: 10.1186/1742-9994-2-16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR. A census of human cancer genes. Nat. Rev. Cancer. 2004;4:177–183. doi: 10.1038/nrc1299. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Takeshita D, Zenno S, Lee WC, Nagata K, Saigo K, Tanokura M. Homodimeric structure and double-stranded RNA cleavage activity of the C-terminal RNase III domain of human dicer. J. Mol. Biol. 2007;374:106–120. doi: 10.1016/j.jmb.2007.08.069. [DOI] [PubMed] [Google Scholar]
- 53.Provost P, Dishart D, Doucet J, Frendewey D, Samuelsson B, Radmark O. Ribonuclease activity and RNA binding of recombinant human Dicer. EMBO J. 2002;21:5864–5874. doi: 10.1093/emboj/cdf578. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Thomas CC, Deak M, Alessi DR, van Aalten DM. High-resolution structure of the pleckstrin homology domain of protein kinase b/akt bound to phosphatidylinositol (3,4,5)-trisphosphate. Curr. Biol. 2002;12:1256–1262. doi: 10.1016/s0960-9822(02)00972-7. [DOI] [PubMed] [Google Scholar]
- 55.Ohki I, Shimotake N, Fujita N, Jee J, Ikegami T, Nakao M, Shirakawa M. Solution structure of the methyl-CpG binding domain of human MBD1 in complex with methylated DNA. Cell. 2001;105:487–497. doi: 10.1016/s0092-8674(01)00324-5. [DOI] [PubMed] [Google Scholar]
- 56.Wu G, Xu G, Schulman BA, Jeffrey PD, Harper JW, Pavletich NP. Structure of a beta-TrCP1-Skp1-beta-catenin complex: destruction motif binding and lysine specificity of the SCF(beta-TrCP1) ubiquitin ligase. Mol. Cell. 2003;11:1445–1456. doi: 10.1016/s1097-2765(03)00234-x. [DOI] [PubMed] [Google Scholar]
- 57.Kumar MS, Lu J, Mercer KL, Golub TR, Jacks T. Impaired microRNA processing enhances cellular transformation and tumorigenesis. Nat. Genet. 2007;39:673–677. doi: 10.1038/ng2003. [DOI] [PubMed] [Google Scholar]
- 58.Martello G, Rosato A, Ferrari F, Manfrin A, Cordenonsi M, Dupont S, Enzo E, Guzzardo V, Rondina M, Spruce T, et al. A MicroRNA targeting dicer for metastasis control. Cell. 2010;141:1195–1207. doi: 10.1016/j.cell.2010.05.017. [DOI] [PubMed] [Google Scholar]
- 59.Anglesio MS, Wang Y, Yang W, Senz J, Wan A, Heravi-Moussavi A, Salamanca C, Maines-Bandiera S, Huntsman DG, Morin GB. Cancer-associated somatic DICER1 hotspot mutations cause defective miRNA processing and reverse-strand expression bias to predominantly mature 3p strands through loss of 5p strand cleavage. J. Pathol. 2013;229:400–409. doi: 10.1002/path.4135. [DOI] [PubMed] [Google Scholar]
- 60.Santi SA, Lee H. The Akt isoforms are present at distinct subcellular locations. Am. J .Physiol. Cell Physiol. 2010;298:C580–C591. doi: 10.1152/ajpcell.00375.2009. [DOI] [PubMed] [Google Scholar]
- 61.Cristiano BE, Chan JC, Hannan KM, Lundie NA, Marmy-Conus NJ, Campbell IG, Phillips WA, Robbie M, Hannan RD, Pearson RB. A specific role for AKT3 in the genesis of ovarian cancer through modulation of G(2)-M phase transition. Cancer Res. 2006;66:11718–11725. doi: 10.1158/0008-5472.CAN-06-1968. [DOI] [PubMed] [Google Scholar]
- 62.Carpten JD, Faber AL, Horn C, Donoho GP, Briggs SL, Robbins CM, Hostetter G, Boguslawski S, Moses TY, Savage S, et al. A transforming mutation in the pleckstrin homology domain of AKT1 in cancer. Nature. 2007;448:439–444. doi: 10.1038/nature05933. [DOI] [PubMed] [Google Scholar]
- 63.Do H, Salemi R, Murone C, Mitchell PL, Dobrovic A. Rarity of AKT1 and AKT3 E17K mutations in squamous cell carcinoma of lung. Cell Cycle. 2010;9:4411–4412. doi: 10.4161/cc.9.21.13654. [DOI] [PubMed] [Google Scholar]
- 64.Fujita N, Takebayashi S, Okumura K, Kudo S, Chiba T, Saya H, Nakao M. Methylation-mediated transcriptional silencing in euchromatin by methyl-CpG binding protein MBD1 isoforms. Mol. Cell. Biol. 1999;19:6415–6426. doi: 10.1128/mcb.19.9.6415. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Fuchs SY, Chen A, Xiong Y, Pan ZQ, Ronai Z. HOS, a human homolog of Slimb, forms an SCF complex with Skp1 and Cullin1 and targets the phosphorylation-dependent degradation of IkappaB and beta-catenin. Oncogene. 1999;18:2039–2046. doi: 10.1038/sj.onc.1202760. [DOI] [PubMed] [Google Scholar]
- 66.Morin PJ. beta-catenin signaling and cancer. Bioessays. 1999;21:1021–1030. doi: 10.1002/(SICI)1521-1878(199912)22:1<1021::AID-BIES6>3.0.CO;2-P. [DOI] [PubMed] [Google Scholar]
- 67.Polakis P. Wnt signaling and cancer. Genes Dev. 2000;14:1837–1851. [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.