Molecular Biology and Evolution. Letter. 2015 Apr 14;32(8):2181–2185. doi: 10.1093/molbev/msv083

The dJ/dS Ratio Test Reveals Hundreds of Novel Putative Cancer Drivers

Han Chen 1, Ke Xing 1,*, Xionglei He 1,2,3
PMCID: PMC4833070  PMID: 25873590

Abstract

Computational tools with a balanced sensitivity and specificity in the identification of candidate cancer drivers are highly desired. In this study, we propose a new statistical test, namely the dJ/dS ratio test, to compute the relative mutation rate of exon/intron junction sites (dJ) to synonymous sites (dS); observation of a dJ/dS ratio larger than 1 in cancer indicates positive selection for splicing deregulation, a signature of cancer driver genes. Using this method, we analyzed the data from The Cancer Genome Atlas and identified hundreds of novel putative cancer drivers. Interestingly, these genes are highly enriched in biological processes related to the development and maintenance of multicellularity, paralleling a previous finding that cancer evolves back to be unicellular by knocking down the multicellularity-associated genetic network.

Keywords: positive selection, cancer driver, dJ/dS ratio test


The most common approaches used to identify cancer drivers from cancer genomic data are based on the mutation frequency of a given gene relative to the background of the whole genome (Sjoblom et al. 2006; Ding et al. 2008; TCGA 2008; Dees et al. 2012; Brennan et al. 2013). These approaches are confounded by the striking heterogeneity of mutation rates among genomic regions and between different tumor samples, which is difficult to model and thus often results in extensive false positives (Lawrence et al. 2013). An alternative strategy (Wood et al. 2007; Hodis et al. 2012) that in principle can circumvent the problem of mutation rate heterogeneity and has been widely used in evolutionary studies (Li et al. 1985; Nei and Gojobori 1986) is the dN/dS (or Ka/Ks) ratio test, which compares the number of nonsynonymous mutations per nonsynonymous site (dN) and the number of synonymous mutations per synonymous site (dS). Observation of dN/dS > 1 indicates positive selection for functional alterations, a signature of driver genes. Unfortunately, this method lacks sensitivity because there is usually only a small fraction of coding sites on which significant function-altering mutations could occur to promote cancer. In a previous study (Chen et al. 2014), we studied truncating substitution mutations that often lead to a complete loss of gene function, and proposed the dT/dS ratio test, which computes the relative rate of truncating mutations (dT) to synonymous mutations (dS). Using The Cancer Genome Atlas (TCGA) data, we identified a large number of novel putative tumor suppressors that are subject to positive selection for null mutations during cancer (i.e., dT/dS > 1), demonstrating the power of focusing on those sites on which significant function-altering mutations could occur.
We reasoned that splice sites at the exon/intron junctions are another type of clearly defined site which, when mutated, would cause strong functional consequences, and previous studies have highlighted that aberrant mRNA splicing is frequently associated with cancer development (Ghigna et al. 2008; Cancer Genome Atlas Research Network 2014; Oltean and Bates 2014). Motivated by the success of the dT/dS test, in this study we propose the dJ/dS ratio test to compare the rate of splice-site mutations at the exon/intron junctions (dJ) and the rate of synonymous mutations (dS). Observation of a dJ/dS ratio significantly larger than 1 indicates positive selection for splice-site mutations, which can be regarded as evidence for cancer drivers. Because the dJ/dS test is a modified version of the classical dN/dS test or our previously proposed dT/dS test, all rationales and associated statistics of the three tests are essentially the same.

Results

A reliable mutation spectrum is critical for estimating the expected exon/intron junction sites (J sites) and synonymous sites (S sites) of a gene. For each cancer type, we analyzed the single-base substitution mutations at 4-fold degenerate sites of protein-coding genes to derive the mutation spectrum of that cancer. As some base contexts can heavily affect mutation rate in cancer (Lawrence et al. 2013), in addition to the six regular types of substitution mutations, mutations of TpA→TpT, CpG→TpG, TpC→TpX (X: A, T, and G), and TpCpG→TpTpG (the combination of TpC and CpG) were separately considered. Thus, we obtained for each cancer type its mutation spectrum, which comprises the rate U for each of the 12 mutation types (supplementary table S1, Supplementary Material online). The first two bases and the last two bases of an intron are considered as splice sites (or J sites). For a hypothetical gene comprising ten exons encoding 1,000 codons, there are 36 J sites from the 10 − 1 = 9 introns, with 36 × 3 = 108 J mutation possibilities; of the 9,000 mutation possibilities in the coding region (nine different mutation possibilities for each codon), only the synonymous (or S) mutation possibilities are considered. For each mutation possibility, the expected rate is the U of the corresponding mutation type. We then sum up the expected rates (i.e., all U) of all J mutation possibilities and all S mutation possibilities, respectively, and the expected J/S site rate is the former divided by the latter. The dJ/dS ratio is the observed J/S mutation rate divided by the expected J/S site rate, and the corresponding P values are calculated using a binomial test.
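
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the toy spectrum are hypothetical, and the form of the binomial test (one-sided, with each observed mutation falling on a J site with probability r/(1 + r), where r is the expected J/S site rate) is an assumption. The example counts are taken from the POLR2B row of table 1.

```python
from math import comb


def expected_js_rate(j_types, s_types, spectrum):
    """Expected J/S site rate: summed U over all J mutation possibilities
    divided by summed U over all S mutation possibilities."""
    exp_j = sum(spectrum[t] for t in j_types)
    exp_s = sum(spectrum[t] for t in s_types)
    return exp_j / exp_s


def dj_ds_test(n_j, n_s, exp_js_rate):
    """Return (dJ/dS, P value) for observed J and S mutation counts.

    Assumption: under neutrality each observed mutation falls on a J site
    with probability r / (1 + r), where r is the expected J/S site rate,
    and the P value is the one-sided binomial tail P(X >= n_j).
    """
    # The ratio is undefined when no synonymous mutation is observed.
    ratio = (n_j / n_s) / exp_js_rate if n_s > 0 else None
    p_j = exp_js_rate / (1.0 + exp_js_rate)
    n = n_j + n_s
    p_value = sum(comb(n, k) * p_j ** k * (1.0 - p_j) ** (n - k)
                  for k in range(n_j, n + 1))
    return ratio, p_value


# Hypothetical gene from the text: ten exons, hence nine introns.
n_j_sites = (10 - 1) * 4            # first two and last two bases of each intron
n_j_possibilities = n_j_sites * 3   # three alternative bases per site

# Toy spectrum with made-up rates, only to show the summation over U.
toy_spectrum = {"C>T": 2.0, "C>A": 1.0}
toy_rate = expected_js_rate(["C>T", "C>A"],
                            ["C>T", "C>T", "C>A", "C>A"],
                            toy_spectrum)

# POLR2B in BRCA (table 1): 2 synonymous, 6 splice mutations, Exp_JS rate 0.0950.
ratio, p_value = dj_ds_test(n_j=6, n_s=2, exp_js_rate=0.0950)
print(f"dJ/dS = {ratio:.1f}, P = {p_value:.2e}")
```

Under these assumptions the POLR2B inputs give a dJ/dS of about 31.6 and a P value of about 1.0E-5, close to the tabulated values. The P value remains defined when no synonymous mutation is observed, in which case the ratio itself is reported as NA.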

We first examined the solid tumor BRCA (Kandoth et al. 2013). We computed the dJ/dS ratio and the corresponding P value for genes with at least five mutations at the J and S sites. After controlling for multiple testing (Storey 2002), we obtained four genes, namely TP53, CDH1, POLR2B, and NCOR1, each with a clear signal of positive selection for J mutations (i.e., dJ/dS > 1; supplementary table S2, Supplementary Material online). Interestingly, POLR2B and NCOR1 are not annotated as cancer drivers according to the Cancer Gene Census (CGC) (Forbes et al. 2008), so they are likely to be newly identified cancer genes which, when certain splicing errors occur, promote cancer. To ensure that the J mutations considered here are truly functional, we examined RNA-seq reads of these genes mapped to their exon–exon junctions. Functional J mutations that compromise splicing efficiency are expected to reduce the number of RNA-seq reads mapped to the corresponding exon–exon junctions. Indeed, out of the 30 J mutations observed in the four genes, 28 result in a reduced density of reads mapped to the corresponding exon–exon junctions (P < 1E-5, chi-square test; fig. 1a), suggesting that these J mutations are largely functional. Note that any nonfunctional J mutations included here would only make the dJ/dS test more conservative. Furthermore, to evaluate the specificity of the dJ/dS test, we applied it to single nucleotide polymorphisms (SNPs) of the human population; specifically, we analyzed the SNP data derived from 6,515 sequenced exomes (Fu et al. 2013). A total of 14,777 genes, each with at least five J and S SNPs, were examined. We found no significant genes (i.e., dJ/dS > 1) under the statistical cutoff of q < 0.1. The result remained unchanged when we considered only SNPs with a minor allele frequency higher than 0.1%, in which case 6,163 genes and 59,348 SNPs were analyzed.
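
As a rough check of the read-density result, the 28-of-30 count can be run through a chi-square goodness-of-fit test. The text does not state the null hypothesis; the sketch below assumes that a nonfunctional J mutation is equally likely to increase or decrease the read density (a 50/50 split), and the function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_equal_split(n_reduced, n_total):
    """Chi-square goodness-of-fit test against an assumed 50/50 split.

    Returns (chi-square statistic, P value) with 1 degree of freedom.
    """
    expected = n_total / 2.0
    n_other = n_total - n_reduced
    chi2 = ((n_reduced - expected) ** 2 + (n_other - expected) ** 2) / expected
    # Survival function of a chi-square variable with 1 d.f.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


# 28 of the 30 J mutations in TP53, CDH1, POLR2B, and NCOR1 reduce read density.
chi2_brca, p_brca = chi_square_equal_split(28, 30)
print(f"chi2 = {chi2_brca:.2f}, P = {p_brca:.1e}")
```

With this assumed null the P value falls below 1E-5, in line with the value reported in the text.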

Fig. 1.

Expression and functional features of the genes with dJ/dS > 1. (a) J mutations observed in TP53, CDH1, POLR2B, and NCOR1 significantly reduce the density of reads mapped to the corresponding exon–exon junctions in BRCA. P < 1E-5, chi-square test. (b) The 302 novel cancer drivers are enriched in multicellularity-related GO terms. Arrows stand for “is_a,” and FDR is false discovery rate.

With the same approach we analyzed mutation data of 22 tumor types in TCGA. Under the cutoff of q < 0.1 we identified a total of 393 putative cancer drivers, of which 62 are CGC genes (supplementary table S2, Supplementary Material online). Because we considered a total of 18,681 human genes, of which 522 are CGC genes, results of the dJ/dS test showed a 5.6-fold enrichment in CGC genes (P < 1E-15, chi-square test), suggesting its generally good performance in recovering known cancer genes. With its focus on functionally important sites, the dJ/dS test is expected to perform well in the identification of minor cancer drivers that are beyond the detection resolution of the prevailing computational tools. Consistently, there were a total of 302 nonredundant genes that are not included in the CGC but showed significant dJ/dS signals in at least one tumor type. To save space, only the top three novel putative cancer genes of each cancer type are presented in table 1. It is expected that some of these genes are true cancer drivers whereas others are false positives. Due to the lack of a negative benchmark, it is difficult to directly estimate the rate of false positives. A previous study on a variety of experimental data revealed preferential inactivation of multicellularity-related genes during cancer (Chen et al. 2014); we can thus assess these candidates by looking at their functions. Gene Ontology (GO) analysis of the 302 genes showed that they are highly overrepresented in GO terms related to the development and maintenance of multicellularity (fig. 1b), suggesting that a substantial proportion of them are true cancer drivers. Similarly, the relative RNA-seq read density was also checked for each J mutation in these genes; out of 1,652 J mutations, 1,187 showed a reduced density of reads mapped to the corresponding exon–exon junctions (P < 1E-15, chi-square test; supplementary fig. S1, Supplementary Material online).
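
The CGC enrichment can be reproduced from the four counts given above. The following is a sketch only: the text says a chi-square test was used but not which variant, so a Pearson chi-square on the 2 × 2 table without continuity correction is assumed, and the function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi-square statistic, P value) with 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2.0))  # survival function for 1 d.f.
    return chi2, p_value


total_genes, total_cgc = 18681, 522   # all genes considered, CGC genes among them
hits, hits_cgc = 393, 62              # putative drivers, CGC genes among them

# Fraction of CGC genes among the hits relative to the genome-wide fraction.
fold = (hits_cgc / hits) / (total_cgc / total_genes)

a = hits_cgc                      # significant and in CGC
b = hits - hits_cgc               # significant, not in CGC
c = total_cgc - hits_cgc          # not significant, in CGC
d = total_genes - hits - c        # not significant, not in CGC
chi2, p_value = chi_square_2x2(a, b, c, d)
print(f"fold enrichment = {fold:.1f}, chi2 = {chi2:.0f}, P = {p_value:.1e}")
```

The fold enrichment evaluates to about 5.6, and under the assumed test the P value is far below 1E-15, consistent with the reported figures.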

Table 1.

Some Novel Putative Cancer Genes Revealed by the dJ/dS Ratio Test.

Cancer Type  Gene     Synonymous Mutations  Splice Mutations  Exp_JS Rate  dJ/dS     P Value   q         CGC
BLCA         MAML3    4                     4                 0.0062       1.61E+02  9.92E-08  2.82E-05  NO
BLCA         C8orf76  3                     3                 0.0235       4.26E+01  2.30E-04  3.92E-02  NO
BLCA         ANKMY1   2                     3                 0.0340       4.42E+01  3.37E-04  4.80E-02  NO
BRCA         POLR2B   2                     6                 0.0950       3.16E+01  1.03E-05  2.06E-03  NO
BRCA         NCOR1    2                     5                 0.0802       3.12E+01  4.17E-05  6.30E-03  NO
COAD         GATSL3   1                     5                 0.0504       9.92E+01  1.47E-06  1.31E-03  NO
COAD         TYRO3    4                     6                 0.0573       2.62E+01  4.41E-06  2.12E-03  NO
COAD         NPHP4    4                     4                 0.0499       2.00E+01  3.06E-04  9.81E-02  NO
GBM          FRYL     1                     4                 0.0902       4.43E+01  2.19E-04  2.67E-03  NO
GBM          KEL      2                     3                 0.0905       1.66E+01  5.02E-03  5.02E-02  NO
HNSC         TVP23C   0                     6                 0.0490       NA        1.04E-08  2.30E-06  NO
HNSC         MAML3    7                     5                 0.0112       6.38E+01  1.24E-07  2.06E-05  NO
HNSC         EPHA2    2                     4                 0.0435       4.59E+01  4.25E-05  5.85E-03  NO
KICH         FRG1     3                     6                 0.1363       1.47E+01  1.81E-04  2.08E-03  NO
KICH         PRSS3    5                     3                 0.0592       1.01E+01  7.89E-03  6.05E-02  NO
KIRC         SMG7     0                     12                0.0998       NA        3.11E-13  1.84E-11  NO
KIRC         BPIFC    0                     13                0.1297       NA        6.05E-13  2.68E-11  NO
KIRC         CCDC91   0                     13                0.1514       NA        3.52E-12  1.25E-10  NO
KIRP         HNRNPM   0                     10                0.1148       NA        1.34E-10  1.40E-08  NO
LGG          TVP23C   0                     7                 0.0386       NA        9.83E-11  7.05E-09  NO
LGG          MAML3    4                     6                 0.0117       1.29E+02  4.75E-10  2.55E-08  NO
LGG          TPTE     4                     10                0.1633       1.53E+01  1.72E-06  5.30E-05  NO
LIHC         TVP23C   1                     4                 0.0454       8.80E+01  1.72E-05  9.08E-03  NO
LIHC         NELFB    1                     4                 0.1012       3.95E+01  3.30E-04  8.69E-02  NO
LUAD         COL11A1  17                    26                0.2365       6.47E+00  2.80E-09  5.87E-07  NO
LUAD         TVP23C   2                     8                 0.0636       6.29E+01  6.58E-09  1.30E-06  NO
LUAD         RBM10    2                     9                 0.1065       4.22E+01  3.26E-08  6.07E-06  NO
PRAD         BCLAF1   0                     6                 0.0559       NA        2.21E-08  1.17E-06  NO
PRAD         TVP23C   0                     5                 0.0394       NA        7.80E-08  2.07E-06  NO
PRAD         MAML3    6                     2                 0.0119       2.80E+01  3.69E-03  6.51E-02  NO
SKCM         RPS27    0                     30                0.1038       NA        1.60E-31  3.05E-29  NO
SKCM         MRPS31   0                     18                0.0709       NA        5.93E-22  1.10E-19  NO
SKCM         NDUFB9   1                     15                0.0589       2.55E+02  2.28E-18  4.12E-16  NO
STAD         TVP23C   0                     48                0.0341       NA        7.26E-72  3.70E-69  NO
STAD         CBWD1    1                     37                0.1211       3.06E+02  5.91E-35  2.58E-32  NO
STAD         AGAP9    0                     8                 0.0471       NA        1.67E-11  5.67E-09  NO
UCEC         SMTNL2   2                     11                0.0258       2.13E+02  1.93E-16  1.98E-13  NO
UCEC         M6PR     0                     7                 0.0971       NA        4.25E-08  2.61E-05  NO
UCEC         RHCE     1                     5                 0.0638       7.83E+01  4.43E-06  2.26E-03  NO
UCS          TTN      8                     3                 0.0441       8.50E+00  9.65E-03  3.86E-02  NO

Note.—NA, not applicable.

Discussion

A previous study raised concerns with respect to the application of the dN/dS test or its derivatives to detecting positive selection using, instead of divergences between species, polymorphisms within a population; specifically, false negatives would be generated in the case of strong positive selection, because strong positive selection may lead to a rapid fixation of nonsynonymous polymorphisms and thus an underestimation of the nonsynonymous polymorphic level (Kryazhimskiy and Plotkin 2008). However, this potential bias does not affect the dJ/dS test in cancer genomics, because 1) the major concern of this test is not false negatives but false positives, and 2) cancer mutations are identified by comparing tumors to the normal genotype, which includes both polymorphic and fixed mutations. In addition, each tumor sample represents an independent population, which effectively reduces the effect of historical contingency on polymorphism distribution that could be strong in a single natural population.

Compared with MuSiC (Dees et al. 2012), MutSig (Brennan et al. 2013), and MutSigCV (Lawrence et al. 2013), the three major computational tools used for annotating TCGA data, the design of the dJ/dS test has unique strengths: First, the dS serves as an ideal control for the background mutation rate, eliminating the false positives resulting from the between-gene variation; second, the dJ focuses exclusively on the small fraction of sites that are able to introduce major function-altering mutations to a cancer gene, preventing the signal of positive selection from being diluted and thus ensuring the sensitivity of the method; third, it works even if the observed background mutation rate (i.e., the dS) is zero, with an exact P value being calculated for each gene, which is critical given the sparsity of the mutation data currently available. Thus, equipped with simple statistics and a balanced specificity and sensitivity, the dJ/dS test, together with our previously proposed dT/dS test, would serve as an important complement to the existing computational tools in cancer genomics.

Given the relatively good performance of the dJ/dS test in recovering CGC genes and the GO enrichment in multicellularity-related processes of the non-CGC positive selection signals, we reasoned that a large number of the novel putative cancer drivers might be true signals, although substantial false positives are possible (Lachance and Tishkoff 2013). The former scenario would challenge the view that the growth of the cancer gene list has reached a plateau (Vogelstein et al. 2013) and support the argument that cancer drivers are pervasive in the human genome (Ostrow et al. 2014). At any rate, these putative cancer drivers deserve priority in future experimental validation. It is conceivable that most of the newly identified (or to-be-identified) cancer genes are minor drivers, conferring small fitness advantages in primary tumors. Interestingly, because of clonal interference, in which large-effect beneficial mutations suppress small-effect beneficial mutations in an asexual population, major cancer genes would function primarily at early stages of a tumor, whereas minor ones may become particularly effective at late stages of cancer evolution. In this regard, novel cancer drivers are of special importance to the cancer research community.

Materials and Methods

For mutations we downloaded TCGA level 3 MAF files, and in total, 362 ACC, 396 BLCA, 1,768 BRCA, 671 CESC, 272 COAD, 282 GBM, 815 HNSC, 197 KICH, 726 KIRC, 902 KIRP, 1,264 LGG, 1,002 LIHC, 1,099 LUAD, 178 LUSC, 463 OV, 824 PRAD, 116 READ, 1,048 SKCM, 1,160 STAD, 811 THCA, 442 UCEC, and 170 UCS tumor samples were included in the dJ/dS computation. In addition, the human genome (hg19) sequence was downloaded from UCSC and gene annotations were from NCBI CCDS. The information on exon–exon junctions was extracted from the TCGA GAF file and the numbers of RNA-seq reads were from TCGA spljxn.quantification files.

To determine whether J mutations are truly functional, we calculated for each mutation its relative density of RNA-seq reads. Specifically, for a given J site of a gene, its normalized read number N is defined as the number of RNA-seq reads mapped to the corresponding junction divided by the total number of RNA-seq reads mapped to all junctions of the gene. The relative read density of a given J mutation is the normalized read number in the tumor sample with the J mutation divided by the average normalized read number in all tumor samples without the J mutation. Finally, a total of 1,652 J mutations, including 30 from BRCA, were checked, and J mutations without corresponding RNA-seq data were excluded.
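
The two definitions above translate directly into code. The sketch below is illustrative only: the function names, the dictionary layout (junction label mapped to read count, one dictionary per tumor sample), and the toy counts are all hypothetical and do not reflect the format of the TCGA files.

```python
def normalized_read_number(junction_reads, junction):
    """N: reads mapped to one junction divided by reads mapped to all
    junctions of the same gene in the same sample."""
    total = sum(junction_reads.values())
    return junction_reads[junction] / total


def relative_read_density(mutant_sample, other_samples, junction):
    """N in the sample carrying the J mutation divided by the average N
    across samples without the J mutation."""
    n_mutant = normalized_read_number(mutant_sample, junction)
    n_others = [normalized_read_number(s, junction) for s in other_samples]
    return n_mutant / (sum(n_others) / len(n_others))


# Toy read counts for a two-junction gene (made-up numbers).
mutant = {"exon1-exon2": 10, "exon2-exon3": 90}
others = [
    {"exon1-exon2": 50, "exon2-exon3": 50},
    {"exon1-exon2": 30, "exon2-exon3": 70},
]

density = relative_read_density(mutant, others, "exon1-exon2")
print(f"relative read density = {density:.2f}")
```

A value below 1 corresponds to a reduced density of reads at the junction in the sample carrying the J mutation.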

The GO analysis was conducted using BiNGO 2.44 (Maere et al. 2005) and Cytoscape 3.2.0 (Shannon et al. 2003).

Supplementary Material

Supplementary figure S1 and tables S1 and S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

The authors thank members of He lab for comments on the manuscript. This work was supported by the National Basic Research Program of China (no. 2014CB542005), the Marine Fisheries Science and Technology Promotion Project of Guangdong Province (no. A201301C09), and the Science and Technology Planning Project of Guangdong Province (no. 2012A080202006).

References

  1. Brennan CW, Verhaak RG, McKenna A, Campos B, Noushmehr H, Salama SR, Zheng S, Chakravarty D, Sanborn JZ, Berman SH, et al. The somatic genomic landscape of glioblastoma. Cell. 2013;155:462–477. doi: 10.1016/j.cell.2013.09.034.
  2. Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature. 2014;511:543–550. doi: 10.1038/nature13385.
  3. Chen H, Lin F, Xing K, He X. The degenerative evolution from multicellularity to unicellularity during cancer. Nat Commun. 2014;6:6367. doi: 10.1038/ncomms7367.
  4. Dees ND, et al. MuSiC: identifying mutational significance in cancer genomes. Genome Res. 2012;22:1589–1598. doi: 10.1101/gr.134635.111.
  5. Ding L, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008;455:1069–1075. doi: 10.1038/nature07423.
  6. Forbes SA, et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). Curr Protoc Hum Genet. 2008;Chapter 10:Unit 10.11.
  7. Fu W, et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013;493:216–220. doi: 10.1038/nature11690.
  8. Ghigna C, Valacca C, Biamonti G. Alternative splicing and tumor progression. Curr Genomics. 2008;9:556–570. doi: 10.2174/138920208786847971.
  9. Hodis E, et al. A landscape of driver mutations in melanoma. Cell. 2012;150:251–263. doi: 10.1016/j.cell.2012.06.024.
  10. Kandoth C, et al. Mutational landscape and significance across 12 major cancer types. Nature. 2013;502:333–339. doi: 10.1038/nature12634.
  11. Kryazhimskiy S, Plotkin JB. The population genetics of dN/dS. PLoS Genet. 2008;4:e1000304. doi: 10.1371/journal.pgen.1000304.
  12. Lachance J, Tishkoff SA. Population genomics of human adaptation. Annu Rev Ecol Evol Syst. 2013;44:123–143. doi: 10.1146/annurev-ecolsys-110512-135833.
  13. Lawrence MS, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499:214–218. doi: 10.1038/nature12213.
  14. Li WH, Wu CI, Luo CC. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol. 1985;2:150–174. doi: 10.1093/oxfordjournals.molbev.a040343.
  15. Maere S, Heymans K, Kuiper M. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics. 2005;21:3448–3449. doi: 10.1093/bioinformatics/bti551.
  16. Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986;3:418–426. doi: 10.1093/oxfordjournals.molbev.a040410.
  17. Oltean S, Bates DO. Hallmarks of alternative splicing in cancer. Oncogene. 2014;33:5311–5318. doi: 10.1038/onc.2013.533.
  18. Ostrow SL, Barshir R, DeGregori J, Yeger-Lotem E, Hershberg R. Cancer evolution is associated with pervasive positive selection on globally expressed genes. PLoS Genet. 2014;10:e1004239. doi: 10.1371/journal.pgen.1004239.
  19. Shannon P, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–2504. doi: 10.1101/gr.1239303.
  20. Sjoblom T, et al. The consensus coding sequences of human breast and colorectal cancers. Science. 2006;314:268–274. doi: 10.1126/science.1133427.
  21. Storey JD. A direct approach to false discovery rates. J R Stat Soc Series B Stat Methodol. 2002;64:479–498.
  22. TCGA. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455:1061–1068. doi: 10.1038/nature07385.
  23. Vogelstein B, et al. Cancer genome landscapes. Science. 2013;339:1546–1558. doi: 10.1126/science.1235122.
  24. Wood LD, et al. The genomic landscapes of human breast and colorectal cancers. Science. 2007;318:1108–1113. doi: 10.1126/science.1145720.

Articles from Molecular Biology and Evolution are provided here courtesy of Oxford University Press
