Author manuscript; available in PMC: 2014 Apr 17.
Published in final edited form as: IEEE Trans Nanobioscience. 2013 May 16;12(3):150–157. doi: 10.1109/TNB.2013.2263391

Adjusting for background mutation frequency biases improves the identification of cancer driver genes

Perry Evans 1, Stefan Avey 2, Yong Kong 3, Michael Krauthammer 4
PMCID: PMC3989533  NIHMSID: NIHMS499906  PMID: 23694700

Abstract

A common goal of tumor sequencing projects is finding genes whose mutations are selected for during tumor development. This is accomplished by choosing genes that have more non-synonymous mutations than expected from an estimated background mutation frequency. While this background frequency is unknown, it can be estimated using both the observed synonymous mutation frequency and the non-synonymous to synonymous mutation ratio. The synonymous mutation frequency can be determined across all genes or in a gene-specific manner. This choice introduces an interesting trade-off. A gene-specific frequency adjusts for an underlying mutation bias, but is difficult to estimate given missing synonymous mutation counts. Using a genome-wide synonymous frequency is more robust, but is less suited for adjusting biases. Studying four evaluation criteria for identifying genes with high non-synonymous mutation burden (reflecting preferential selection of expressed genes, genes with mutations in conserved bases, genes with many protein interactions, and genes that show loss of heterozygosity), we find that the gene-specific synonymous frequency is superior in the gene expression and protein interaction tests. In conclusion, the use of the gene-specific synonymous mutation frequency is well suited for assessing a gene’s non-synonymous mutation burden.

Keywords: cancer, sequencing, melanoma

I. Introduction

Genome-wide and exome-wide cancer sequencing projects focus on cataloguing the mutational landscape of tumor cohorts, with the goal of identifying genes important for cancer. A common approach to finding these genes is to select genes having more non-synonymous mutations than expected by chance. This process is confounded by mutation biases that cause genes to have more or fewer mutations than other genes, while not affecting a gene’s importance to tumors. These biases include genomic sequence context, gene expression, distance from the transcription start site, and the timing of gene replication [1], [2]. The factors mentioned here explain less than 40% of the mutation variation observed across genes [1], indicating that there are still other factors to be discovered.

Given this dilemma, it is impossible to model the background frequency of non-synonymous mutations using the biases themselves. However, synonymous mutations are affected by the same biases that influence non-synonymous mutations, making it possible to use synonymous mutations as a proxy for bias influence. Given the non-synonymous to synonymous mutation ratio, the expected non-synonymous mutations can be estimated by counting the synonymous ones. This approach was used to evaluate the non-synonymous mutation burden significance for 623 genes from 188 lung adenocarcinomas [3]. In that analysis, the synonymous mutation frequency was determined in two ways. First, in a global approach, the frequency was found across all genes, and every gene’s expected non-synonymous mutation count was derived from this constant frequency. Second, in a gene-specific approach, the synonymous frequency was found separately for each gene, allowing each gene’s expected non-synonymous mutation count to be estimated using the gene’s synonymous mutations. The global approach helps estimate frequencies for genes with under-sampled synonymous mutations, but it fails to capture individual gene biases. The gene-specific approach does account for these biases, but becomes problematic when a gene has no synonymous mutations. These two measures of non-synonymous frequencies, global and gene-specific, have never been compared to see which yields better lists of genes with significantly high non-synonymous mutations.

In this paper, we first examine mutation frequency biases resulting from gene expression, replication timing, and genomic context using the largest melanoma cohort to date. We show that unexpressed genes accumulate more mutations than expressed genes, and that late replicating genes have more mutations than early replicating genes. We also demonstrate that the genomic context hotspots that attract most melanoma mutations cause an increase in the number of synonymous mutations expected per gene.

Next, we address the benefits of using a gene-specific estimate of background mutation frequency over using an exome-wide estimate. We use four criteria to evaluate genes deemed to have more non-synonymous mutations than expected by the two frequency estimates. We consider a gene a good cancer driver candidate if it is expressed in melanoma, has mutations at conserved positions, shows loss of heterozygosity (LOH), and has many protein interactions with other genes, indicating importance in multiple cellular pathways. Based on our criteria, we conclude that the gene-specific synonymous mutation frequency yields the best list of candidate driver genes.

II. Results and Discussion

For this study, we used novel somatic mutations found by comparing exome-wide sequencing data from paired melanoma and normal skin or blood samples taken from 99 patients [4]. We searched for novel somatic mutations by ignoring variants catalogued in normal samples and dbSNP135 [5]. For each sample, we surveyed roughly 22 megabases covering about 15,000 genes. Somatic non-synonymous mutations in each sample ranged from zero to nearly two thousand. Somatic synonymous mutations in each sample ranged from zero to just over one thousand.

A. Mutation bias

Here we consider the mutation bias contributors gene expression, replication timing, and UV exposure. We show how these factors contribute to mutation bias by showing their correlation with mutation counts.

1) Gene expression bias

Expressed and unexpressed genes were determined using RNA-Seq data from two normal melanocyte samples. Of the genes with synonymous mutations, 3,281 genes were found to be expressed in normal melanocytes, while 3,224 were not. As seen in Figure 1, expressed genes were found to have fewer synonymous mutations than unexpressed genes (t-test p-value < 5e-11). This mutation bias is attributed in part to transcription-coupled repair, where mutations are corrected as genes are transcribed.

Fig. 1.


Gene expression mutation bias. Unexpressed genes have significantly more synonymous mutations than expressed genes (t-test p-value < 5e-11). Genes with somatic synonymous mutations were classified into expressed and unexpressed gene sets of roughly 3,000 genes each. For each gene, the somatic synonymous mutation frequency was determined by dividing the gene’s total somatic synonymous mutations by the gene’s coding length. Boxplots of mutation frequencies were plotted for expressed and unexpressed genes to show the mutation frequency difference between the gene sets. Outliers were removed from both gene groups. The lower and upper box hinges correspond to the first and third quartiles, the horizontal line within the box corresponds to the median, and the vertical lines span the minimum and maximum values.
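The per-gene frequency described in the caption, and a two-group comparison of those frequencies, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy counts are hypothetical, and Welch's unequal-variance form of the t statistic is an assumption, since the text only says "t-test".

```python
from math import sqrt
from statistics import mean, variance


def synonymous_frequency(syn_count, coding_length):
    """Somatic synonymous mutations per coding base for one gene."""
    return syn_count / coding_length


def welch_t(a, b):
    """Welch's t statistic for two samples (assumed test variant)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical (synonymous count, coding length) pairs, for illustration only.
expressed = [(1, 2000), (2, 3000), (1, 1500)]
unexpressed = [(4, 2000), (5, 2500), (3, 1000)]

freq_expressed = [synonymous_frequency(c, n) for c, n in expressed]
freq_unexpressed = [synonymous_frequency(c, n) for c, n in unexpressed]

# A positive statistic means the unexpressed group has the higher mean frequency.
t_statistic = welch_t(freq_unexpressed, freq_expressed)
```

A p-value would additionally require the t distribution, which is left out here to keep the sketch free of external dependencies.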

2) Replication timing bias

During S phase of the cell cycle, genes are replicated at different stages. Replication timing is believed to have an effect on mutation rate [6], with more variation appearing in late replicating genes than early ones. We gathered replication timing data from twelve cell types [7], and found genes that were consistently replicated late or early across all cell types. In total we had 2,081 early replicating genes and 258 late replicating genes with synonymous mutations. We found that late replicating genes had more synonymous mutations than early replicating genes (Wilcoxon test p-value < 3e-16, Figure 2).

Fig. 2.


Replication timing mutation bias. Genes with consistent late replication across tissues have significantly more synonymous mutations than genes with consistent early replication (Wilcoxon test p-value < 3e-16). Genes with somatic synonymous mutations were classified into sets of 2,081 early replicating genes and 258 late replicating genes. For both of these groups we determined the frequency of somatic synonymous mutations found per gene. As in Figure 1, we show boxplots of mutation frequencies to illustrate the synonymous mutation load difference between early and late replicating genes. Boxplots are formatted as in Figure 1.

3) UV exposure bias

UV mutagenesis causes C>T mutations at cytosines in a dipyrimidine context (a cytosine flanked by a pyrimidine, as in CC, TC, or CT), leading to a UV exposure mutation signature in sun-exposed tumors [2], [4]. We recreated this finding using somatic mutations taken from 61 sun-exposed tumors and 26 sun-shielded tumors (acral, mucosal, or uveal). Figure 3 demonstrates that dipyrimidine C>T mutations account for the majority of mutations in sun-exposed melanoma, while sun-shielded melanomas show less of a C>T bias.

Fig. 3.


UV exposure mutation bias. UV light causes a specific C>T mutation at cytosines in a dipyrimidine context. In sun-exposed melanomas, most of the somatic mutations (SNVs) are caused by UV exposure. Sun-shielded melanomas do not show this UV mutation bias.

The dipyrimidine C>T mutation bias in sun-exposed tumors affects the types of amino acid mutations observed in melanoma. Five amino acids (F, I, K, N, Y) can never be mutated by dipyrimidine C>T mutations. Since dipyrimidine C>T transitions make up the majority of melanoma mutations, these five residues are not often mutated in melanoma, and a gene enriched in these immutable residues will be biased to have a lower non-synonymous mutation frequency than other genes.

As the ratio of non-synonymous to synonymous mutations (NS:SN) is important for calculating the significance of the observed number of non-synonymous mutations for a gene, we consider how UV exposure changes NS:SN for melanomas from sun-exposed body sites. Figure 4 shows that UV induced dipyrimidine C>T mutations decrease NS:SN relative to what is expected from a random dipyrimidine cytosine mutation. This is because dipyrimidine C>T mutations result in a higher frequency of synonymous mutations than do other dipyrimidine cytosine mutations. Figure 5 demonstrates that increasing the fraction of immutable residues in a gene lowers NS:SN for dipyrimidine C>T mutations. The exome-wide distribution of NS:SN resulting from all possible dipyrimidine C>T mutations is shown in Figure 6. The mean NS:SN is 1.67. Table I shows NS:SN for selected genes implicated in melanoma as well as long genes. Ranking genes by NS:SN, and running Gene Set Enrichment Analysis [8], showed that genes with high NS:SN were involved in collagen formation, RNA processing, and extracellular matrix organization (Table II). Genes with low NS:SN were involved in olfactory signaling, G protein activity, and voltage gated channel activity. These differences in gene NS:SN are important to consider as we next discuss using them to estimate expected non-synonymous mutation counts for genes.

Fig. 4.


Effect of dipyrimidine C>T mutations on a gene’s non-synonymous:synonymous mutation ratio (NS:SN). Dipyrimidine C>T mutations lower a gene’s NS:SN relative to all possible mutations at dipyrimidine cytosines. The solid line shows the expected relation between dipyrimidine C>T mutations and all dipyrimidine cytosine mutations if all mutations resulted in the same NS:SN.

Fig. 5.


Effect of immutable amino acid composition on the non-synonymous to synonymous mutation ratio (NS:SN) caused by dipyrimidine C>T mutations. Genes with a higher fraction of immutable amino acids (F, I, K, N, Y) have lower NS:SN.

Fig. 6.


Exome-wide distribution of non-synonymous to synonymous ratios (NS:SN) caused by dipyrimidine C>T mutations. Few genes have amino acid compositions such that an equal number of non-synonymous and synonymous amino acid changes result from dipyrimidine C>T mutations.

TABLE I.

Dipyrimidine C>T non-synonymous to synonymous mutation ratios for selected genes chosen because they have been implicated in cancer in other studies, or are long genes that accumulate many mutations.

Gene     NS:SN   Feature
GRIN2A   1.35    candidate driver
RAC1     1.64    candidate driver
NRAS     1.88    candidate driver
ERBB4    1.97    candidate driver
PREX2    1.97    candidate driver
PPP6C    2.23    candidate driver
BRAF     2.27    candidate driver
TTN      2.46    long gene
MUC17    3.10    long gene
MUC7     3.38    long gene

TABLE II.

GSEA gene groups with significantly high or low non-synonymous:synonymous ratios.

GSEA Gene Group                        NS:SN Status
COLLAGEN FORMATION                     High
RNA SPLICING                           High
RNA PROCESSING                         High
EXTRACELLULAR MATRIX ORGANIZATION      High
TRANSPORT OF MATURE TRANSCRIPT         High
METABOLISM OF NON CODING RNA           High
NCAM1 INTERACTIONS                     High
SPLICEOSOME                            High
OLFACTORY SIGNALING                    Low
RHODOPSIN LIKE RECEPTOR ACTIVITY       Low
GPCR LIGAND BINDING                    Low
G ALPHA I SIGNALLING EVENTS            Low
G PROTEIN COUPLED RECEPTOR ACTIVITY    Low
POTASSIUM CHANNELS                     Low
GATED CHANNEL ACTIVITY                 Low
SECOND MESSENGER MEDIATED SIGNALING    Low
CYCLIC NUCLEOTIDE MEDIATED SIGNALING   Low
ION CHANNEL ACTIVITY                   Low
AMINE RECEPTOR ACTIVITY                Low
CHEMOKINE SIGNALING                    Low
RNA POL_I PROMOTER OPENING             Low
NEUROTRANSMITTER RECEPTOR ACTIVITY     Low
CALCIUM SIGNALING                      Low

B. Using synonymous mutations as a proxy for mutation bias

Having demonstrated that synonymous mutations are influenced by mutation biases, we now survey different methods for incorporating synonymous mutation counts into the gene burden mutation model to address these biases. We limit our analysis to 61 sun-exposed tumors because sun-shielded tumors have few mutations, and might have a different tumor biology.

In the assessment of non-synonymous mutation burden for a gene, a binomial test is used to gauge the significance of a gene’s non-synonymous mutations given the gene’s length, and the expected frequency of non-synonymous mutations for the gene. The expected non-synonymous mutation frequency is found using the non-synonymous to synonymous mutation ratio (NS:SN), and a frequency of synonymous mutations. NS:SN is found by simulating mutations across the genome, and recording NS:SN observed in total, and for individual genes. The expected non-synonymous mutation frequency for a gene is then estimated using the frequency of synonymous mutations accumulated across all genes, or the frequency of synonymous mutations for the gene in question. We address the question of which frequency is better by introducing three alternative methods, and evaluating their performance based on tumor gene expression, loss of heterozygosity (LOH), conservation of amino acids involved in mutations, and connectivity in the human protein interaction network.

We tested three approaches for finding genes with significant non-synonymous mutation burden. The Simple approach acted as a baseline measure, and did not attempt to correct for mutation biases. In this method, all non-synonymous mutations were assumed to occur at the same frequency. We derived this frequency by multiplying the exome-wide non-synonymous to synonymous mutation ratio (NS:SN) by the exome-wide synonymous mutation frequency. For each gene, we found the probability of its non-synonymous mutation count given the expected non-synonymous mutation frequency and its length.
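As a rough illustration of the Simple approach, the sketch below multiplies an exome-wide NS:SN by an exome-wide synonymous frequency and then computes a one-sided binomial tail probability for a gene. The function names are hypothetical, the one-sided (upper-tail) form of the test is an assumption, and treating the gene's length in bases as the number of binomial trials is our reading of "given ... its length".

```python
from math import exp, lgamma, log, log1p


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log1p(-p))
        total += exp(log_pmf)
    return min(total, 1.0)


def simple_method_pvalue(ns_count, gene_length, ns_sn_ratio, syn_freq):
    """Probability of seeing at least ns_count non-synonymous mutations.

    The expected non-synonymous frequency is the exome-wide NS:SN times the
    exome-wide synonymous mutation frequency, identical for every gene.
    """
    expected_ns_freq = ns_sn_ratio * syn_freq
    return binom_upper_tail(ns_count, gene_length, expected_ns_freq)
```

For example, `simple_method_pvalue(5, 2000, 1.93, 1e-4)` asks how surprising five non-synonymous mutations are in a 2,000-base gene; the synonymous frequency of 1e-4 is an invented value, while 1.93 is the overall NS:SN reported for sun-exposed melanomas in the methods.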

Our second approach is the Context method. This method differs from the Simple method because it attempts to correct for mutation biases caused by gene expression and UV exposure. We divided the genome into three genomic contexts, and split genes into expressed and unexpressed groups, as described in the methods section. For each of the six combinations of genomic context and expression status, we derived a synonymous mutation frequency using mutations across the exome. To determine the significance of a gene’s non-synonymous mutation count, we assessed the probabilities of non-synonymous mutations in each of the three genomic contexts, and combined probabilities using Fisher’s combined probability test. As in the Simple method, probabilities were determined given the expected frequency of non-synonymous mutations and the gene’s length. Unlike the Simple method, expected non-synonymous mutation frequencies and gene length were context specific, and each expected non-synonymous mutation frequency was determined by multiplying the context specific synonymous mutation frequency by the gene’s context specific NS:SN.
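The combination step can be sketched with Fisher's method written out directly. The statistic is minus twice the sum of the log p-values, which follows a chi-square distribution with 2k degrees of freedom for k independent p-values; because the degrees of freedom are even, the tail probability has a closed form and no statistics library is required. The function name is hypothetical, and all p-values are assumed to be strictly positive.

```python
from math import exp, log


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values (each > 0) with Fisher's method."""
    k = len(pvalues)
    half_stat = -sum(log(p) for p in pvalues)  # chi-square statistic / 2
    # Survival function of chi-square with 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


# Three hypothetical context-specific p-values for one gene.
overall = fisher_combined_pvalue([0.01, 0.20, 0.50])
```

With a single p-value the function returns that p-value unchanged, which is a convenient sanity check.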

The Context approach does not account for mutation frequency differences between genes that are not covered by gene expression or genomic context. Our final approach, the Gene method, addresses this problem by using a gene-specific synonymous mutation frequency where appropriate. The Gene method is identical to the Context method, except that the expected non-synonymous mutation frequency is estimated using the gene’s context specific synonymous mutation frequency. When a gene has no synonymous mutations for a genomic context, the global synonymous mutation frequency is used.
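The fallback rule of the Gene method can be written in a few lines. This is only a sketch under stated assumptions: the function and parameter names are hypothetical, and we assume the fallback value is supplied by the caller rather than derived inside the function.

```python
def gene_method_syn_frequency(gene_syn_count, context_length, global_syn_freq):
    """Synonymous frequency for one gene in one genomic context.

    Uses the gene's own synonymous count when it has one, and otherwise
    falls back to the global synonymous frequency (passed in by the caller).
    """
    if gene_syn_count > 0 and context_length > 0:
        return gene_syn_count / context_length
    return global_syn_freq


def expected_ns_frequency(syn_freq, ns_sn_ratio):
    """Expected non-synonymous frequency: synonymous frequency times NS:SN."""
    return syn_freq * ns_sn_ratio
```

A gene with three synonymous mutations over 1,500 context bases would use 0.002, while a gene with none would inherit whatever global value is passed in.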

In the remainder of this section, we evaluate the three methods against four criteria and show that the Gene method performs best on two of them. We find that using this method to determine genes with more non-synonymous mutations than expected produces a gene list in which most genes are expressed in tumors, indicating that the mutations in these genes are relevant to tumor development. We also find that genes picked by this method have more protein interactions than genes picked by the other methods.

1) Gene expression evaluation

Genes that are not expressed in tumors are unlikely to have mutations under selection, so gene lists with a higher fraction of expressed genes are more correct. We determined gene expression status based on microarray data from a panel of fifteen melanomas. Figure 7 shows that genes from the Gene method have the highest expression. Genes selected by the Context method have higher expression than those from the Simple method, indicating that correcting for mutation biases due to gene expression and genomic context produces a better gene list. The Gene method outperforms the Context method because it corrects for mutation biases other than genomic context and gene expression.

Fig. 7.


Expression evaluation. We found candidate cancer driver genes with more non-synonymous mutations than expected using three methods. The resulting gene lists were compared based on mean gene expression from microarray data taken from fifteen melanomas. Gene lists with higher expression were considered more correct. Boxplots are formatted as in Figure 1.

2) Loss of heterozygosity evaluation

Our second comparison of candidate driver gene lists uses LOH. When most of a gene’s mutations are in LOH, it is likely that the gene plays a role in cancer. For each candidate driver gene, we determined the fraction of non-synonymous mutations where the mutation is homozygous, and the gene is in an LOH region for the sample. Figure 8 compares LOH mutation percentages for the results obtained using the three methods. The Gene method performs slightly worse than the other two methods when comparing median values, but some of its genes have the highest fraction of mutations in LOH.

Fig. 8.


LOH evaluation. We found candidate cancer driver genes with more non-synonymous mutations than expected according to each of the three methods. The resulting gene lists were compared based on the fraction of a gene’s non-synonymous mutations that were in LOH across the samples. Gene lists with higher percentages of mutations in LOH were considered more correct. Boxplots are formatted as in Figure 1.

3) Conservation evaluation

For a third comparison, we examined the conservation of the residues at each driver gene’s non-synonymous mutation positions. We assumed that a higher conservation of a residue indicated more importance to the protein, and considered a gene with non-synonymous mutations at mostly conserved residues to be more relevant to driving tumor development than a gene with little conservation at its non-synonymous mutant residues. To evaluate mutant conservation for each gene, we compared the mean non-synonymous mutation phyloP score to the mean phyloP score for the whole gene. Figure 9 shows gene-wise ratios of the mean mutation score to the mean score for gene lists from each method. No method stands out as the best.

Fig. 9.


Conservation evaluation. We found candidate cancer driver genes with more non-synonymous mutations than expected according to each of the three methods. For each driver gene, we compared the mean of phyloP scores at non-synonymous mutations to the mean phyloP score across the gene using the ratio of the foreground non-synonymous mutation mean to the gene background mean. High ratios indicate that the gene is more likely to be a cancer driver. Boxplots are formatted as in Figure 1.

4) Protein interaction evaluation

For a final test, we gathered protein-protein interactions from the Human Protein Reference Database [9] to assess the connectivity of genes in the human interactome. Genes with many protein interactions were assumed to be important for cancer. Figure 10 shows that genes produced by the Gene method had the highest median connectivity.

Fig. 10.


Interaction degree evaluation. We found candidate cancer driver genes with more non-synonymous mutations than expected according to each of the three methods. For each gene, we found the degree of the gene to be the number of interacting proteins the gene’s product had. For each gene list, we made boxplots of gene degree. Genes with a high degree are better candidate cancer drivers. Boxplots are formatted as in Figure 1.

III. Conclusion

To accurately find genes important to tumor development, we need robust models of mutation to find genes with more mutations than expected. Developing these models is difficult due to the mutation biases that are both tumor type specific and common across all cancers. Here we demonstrated that mutation bias can be captured by synonymous mutation frequencies, relieving the modeler from accounting for all types of mutation bias. We showed that finding the expected number of non-synonymous mutations using synonymous mutation counts can be accomplished by using a gene-specific frequency, or a global frequency for all genes. We demonstrated that the gene-specific method produces the best gene list by delivering genes with 1) higher expression and 2) more interactions with other genes. We advocate the use of this gene-specific method when each gene has many synonymous mutations, but acknowledge that small sequencing projects will need to fall back on the global method when most genes are without synonymous mutations.

IV. Methods and Data

A. Paired tumor and normal samples

The melanoma tumors were from different patients and were excised to alleviate tumor burden. Specimens were collected with participants’ informed signed consent according to Health Insurance Portability and Accountability Act (HIPAA) regulations with a Human Investigative Committee protocol. The melanomas used for sequencing were from snap-frozen tumors or from short-term cultures [10]. Most of the melanoma cells were collected after 0–4 passages in culture, except for one sample that was collected after 14 passages in culture.

Whole exomes were enriched from genomic DNA by the solution-based SeqCap EZ Exome Library capture method following the manufacturer’s protocols (Roche/NimbleGen) at the Yale Center for Genome Analysis as follows. DNA was sheared by sonication and adaptors were ligated to the resulting fragments. The adaptor-ligated templates were fractionated by agarose gel electrophoresis and fragments of the desired size were excised. Extracted DNA was amplified by ligation-mediated PCR, purified, and hybridized to the SeqCap EZ Exome Library v1.0 using the manufacturer’s buffer. Hybridized DNA was pulled down with Streptavidin beads and amplified by ligation-mediated PCR. The resulting fragments were purified and subjected to DNA sequencing on the Illumina platform. Captured and non-captured amplified samples were subjected to quantitative PCR to measure the relative fold enrichment of the targeted sequence. Captured libraries were sequenced on the Illumina Genome Analyzer (GA) IIx, and on the Illumina HiSeq 2000, as 75 bp paired-end reads, following the manufacturer’s protocols. Image analysis and base calling were performed by Illumina pipeline versions 1.6 and 1.7 with default parameters.

B. Melanoma somatic mutation calling

We used exome sequence data from 99 matched tumor/germline pairs for the automated calling of somatic mutations. We started by determining novel mutations that occurred in the tumors. For the matched samples, we used sequencing of germline DNA to distinguish between somatic and inherited variants. We used the human reference genome GRCh37/hg19 for mapping Exome-Seq and RNA-Seq data. The RefSeq sequence database downloaded from NCBI on 2011-5-12 was used as our gene model and for determining amino acid substitutions.

The following procedure was used to call melanoma sequence variations: Reads were first trimmed based on their quality scores using the program BTrim [11]. The reads were then mapped against the reference genome using bwa [12]. SAMtools version 0.1.8–11 (r672) [13] was used for PCR duplicate removal and mutation calling. Annotations of mutations were performed with MU2A [14]. Annotation files were checked for adjacent pairs of mutations affecting the same codon. If present, sequencing reads were scanned for occurrence of both mutations on a single allele, and the amino acid change was predicted based on the simultaneous mutations. Mutations were filtered according to the following quality criteria: 1) mutant allele frequency of at least 13%; 2) SAMtools mapping score of at least 40; 3) at least one forward and one backward read; 4) a minimum coverage of 4 mutant and 8 total reads at the variant position; and 5) uniform mapping of reads with the variant allele across the mutation locus. Mutations were further filtered based on presence in repositories of common variations (dbSNP135 and 2,577 non-cancer exomes sequenced at Yale). Novel mutations were classified into somatic and inherited variations as follows: we first called the tumor mutation, and then used sequencing data in a matched germline DNA sample to determine the presence or absence of variant reads at the same position. The mutation was called somatic in the absence of variant reads in the germline DNA samples, tolerating one mutant read in normal, and expecting a sufficient variant to total read ratio in tumor and normal as assessed by Fisher’s exact test (p-value threshold of 0.001).
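The first four quality criteria lend themselves to a simple predicate, sketched below. The `VariantCall` record and its field names are hypothetical, and we assume that the forward and backward read requirement refers to reads carrying the variant allele. Criterion 5 (uniform mapping of variant reads across the locus) needs read-level positions and is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical per-variant summary; field names are illustrative."""
    mutant_reads: int
    total_reads: int
    mapping_score: int
    forward_mutant_reads: int
    reverse_mutant_reads: int


def passes_quality_filters(v: VariantCall) -> bool:
    """Apply criteria 1-4 from the text; criterion 5 is not covered."""
    if v.mutant_reads < 4 or v.total_reads < 8:
        return False  # criterion 4: minimum coverage
    if v.mutant_reads / v.total_reads < 0.13:
        return False  # criterion 1: mutant allele frequency of at least 13%
    if v.mapping_score < 40:
        return False  # criterion 2: mapping score of at least 40
    if v.forward_mutant_reads < 1 or v.reverse_mutant_reads < 1:
        return False  # criterion 3: support on both strands (assumed meaning)
    return True
```

Checking coverage first guarantees a non-zero denominator when the allele frequency is computed.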

C. Loss of heterozygosity

LOH regions were determined for each tumor/normal sample pair individually. First, heterozygous genomic positions in the normal sample were identified using the mutant allele frequency. A position whose mutant allele frequency was not significantly different from 0.5 according to the binomial test was considered heterozygous. For heterozygous normal positions, the corresponding tumor mutant allele frequency was tested for homozygosity using the binomial test. This resulted in a set of genomic locations with binary states: heterozygous in normal and tumor, or heterozygous in normal but homozygous in tumor. The R Bioconductor DNAcopy package [15] was used to split these states into continuous segments representing tumor regions with and without LOH.
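The heterozygosity check on normal-sample positions can be illustrated with an exact two-sided binomial test against an allele frequency of 0.5. This is a sketch only: the function names are hypothetical, and the significance level `alpha=0.05` is an assumption, since the text does not state the threshold that was used.

```python
from math import comb


def binom_two_sided_half(k, n):
    """Exact two-sided binomial p-value for H0: p = 0.5 (symmetric case)."""
    lower = sum(comb(n, i) for i in range(0, k + 1))  # weight of X <= k
    upper = sum(comb(n, i) for i in range(k, n + 1))  # weight of X >= k
    return min(1.0, 2 * min(lower, upper) / 2 ** n)


def is_heterozygous(mutant_reads, total_reads, alpha=0.05):
    """True when the mutant allele frequency is not significantly
    different from 0.5 at the (assumed) level alpha."""
    return binom_two_sided_half(mutant_reads, total_reads) >= alpha
```

For instance, 10 mutant reads out of 20 is consistent with heterozygosity, whereas 20 out of 20 is not.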

LOH regions were further filtered by treating each LOH region as one genomic location, and accumulating the mutant allele frequency for the LOH region. This combined mutant allele frequency was tested using a binomial test to ensure that it was significantly different from the expected 0.5 heterozygous mutant allele frequency.

D. Microarray gene expression

Whole genome gene expression was derived from hybridization to NimbleGen human whole genome expression microarrays. Array analysis was performed on 15 melanomas [16]. Data from the array analysis were used to identify expressed genes in melanomas. Genes with a median expression value of 550 and above were called expressed. These expression data were used to assess the performance of the three gene burden methods presented here.

E. RNA-Seq gene expression

RNA-Seq was performed on two independent normal human melanocyte cultures derived from newborn foreskins and adult skin. Total RNA was extracted using Trizol (Invitrogen) followed by DNase digestion and Qiagen RNeasy (Qiagen, Valencia, CA) column purification following the manufacturer’s protocol. The RNA integrity was verified using an Agilent Bioanalyzer 2100 (Agilent, Palo Alto, CA). One microgram of high-quality RNA was processed using an Illumina RNA-Seq sample prep kit following the manufacturer’s instructions (Illumina, San Diego, CA). Final RNA-Seq libraries were sequenced at 75 bp/sequence using an Illumina GAIIx sequencer. Reads were processed with TopHat and SAMtools. Mapping was performed against the reference genome. Reads were counted in bins of 100 base-pairs, and normalized with regard to the median. To calculate the expression value for a particular RefSeq transcript, we determined the transcript exon boundaries, and summed up all bin read values for bins within the boundaries. The transcript length-normalized and log-transformed value was used as the measure of gene expression. A two-component Gaussian mixture model was fit to the data, and a lower bound for expressed genes was chosen as two standard deviations away from the higher distribution mean. The RNA-Seq data were used to identify expressed genes in normal melanocytes for the gene burden analysis.

F. Replication timing

Replication timing data were taken from the ENCODE Repliseq tracks provided by the University of Washington [7]. We used the percent signal files of both replicates from the following cells: Bj, Bg02es, Gm06990, Gm12801, Gm12878, Helas3, Hepg2, Huvec, Imr90, Mcf7, Nhek, Sknsh. For early replication values, we used cell cycle stages G1b and S1. For late replication values, we used S4 and G2. For each gene and cell type, we found the log ratio of the average replication activity for early and late stages. We considered a gene to replicate early if the ratio was positive for all cells and all replicates. Similarly, a late replicating gene had a ratio that was consistently negative across all cells and replicates.
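The early/late classification rule can be sketched as below. The data layout (one dictionary of stage signals per cell type and replicate), the function names, and the "inconsistent" label are all hypothetical; the base of the logarithm is also an assumption, although it does not affect the sign of the ratio. Signals are assumed to be strictly positive.

```python
from math import log2
from statistics import mean

EARLY_STAGES = ("G1b", "S1")
LATE_STAGES = ("S4", "G2")


def early_late_log_ratio(signal):
    """Log ratio of mean early-stage to mean late-stage signal for one
    gene in one cell type and replicate (signal maps stage -> value)."""
    early = mean(signal[s] for s in EARLY_STAGES)
    late = mean(signal[s] for s in LATE_STAGES)
    return log2(early / late)


def classify_gene(signals):
    """Label a gene from its signals across all cell types and replicates."""
    ratios = [early_late_log_ratio(s) for s in signals]
    if all(r > 0 for r in ratios):
        return "early"
    if all(r < 0 for r in ratios):
        return "late"
    return "inconsistent"
```

A gene is labeled only when every cell type and replicate agrees, which mirrors the strict consistency requirement described above.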

G. Mutation burden analysis

To calculate a list of significantly mutated genes, i.e., genes with more mutations than expected by the background mutation frequency, we modified a recently established protocol [3]. We used the non-synonymous:synonymous (NS:SN) mutation ratio to estimate the non-synonymous background mutation frequency. This estimate is then used to determine whether some observed number of non-synonymous mutations in a gene is above the expected count. We also used insights into melanoma-specific mutation patterns to calculate mutation frequencies based on sequence contexts, and on expression of the gene locus. We measured an increase in mutation frequency when studying unexpressed versus expressed genes, and observed that most mutations occur at cytosines in the dipyrimidine context, as described before [2]. This led us to calculate the non-synonymous background mutation frequencies separately for expressed and unexpressed genes, and separately for the three following sequence contexts: 1) mutating Cs at dipyrimidines, 2) mutating Cs at non-dipyrimidines, and 3) mutating Ts, which stand for, respectively, mutations in cytosines with a flanking pyrimidine, mutations in cytosines without a flanking pyrimidine, and mutations in thymines with no restriction on the flanking bases.

We estimated the expected context-specific non-synonymous mutation frequencies from the context-specific NS:SN ratios and synonymous counts. For each gene and each context, we then performed a binomial test of whether the observed number of non-synonymous mutations is explained by the expected frequency. This yielded three p-values per gene, one for each context, which we treated as independent. We then used Fisher's combined probability test to generate an overall p-value measuring whether the number of non-synonymous mutations in a gene is greater than expected.
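The per-context binomial tests and their combination can be illustrated as follows. This is a sketch under stated assumptions rather than the original implementation: the binomial test is taken to be one-sided (more mutations than expected), the number of trials per context is taken to be the number of mutable positions considered, and all function names are hypothetical. Fisher's statistic is X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom for k independent p-values; for even degrees of freedom the tail probability has the closed form used below.

```python
import math
import sys


def _log_binom_pmf(i, n, p):
    """Log of the Binomial(n, p) probability mass at i, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))


def binom_upper_tail(k, n, p):
    """One-sided binomial p-value: P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        term = math.exp(_log_binom_pmf(i, n, p))
        total += term
        # Past the mode the terms only shrink; stop once they are negligible.
        if i > n * p and term <= total * 1e-17:
            break
    return min(1.0, total)


def fisher_combine(p_values):
    """Fisher's combined probability test for independent p-values."""
    tiny = sys.float_info.min  # clamp to avoid log(0); a choice of this sketch
    half_stat = -sum(math.log(max(p, tiny)) for p in p_values)  # X / 2
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, len(p_values)):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def gene_burden_p_value(contexts):
    """Combine per-context tests for one gene.

    `contexts` is a list of (observed_ns, n_trials, expected_ns_frequency)
    tuples, one per sequence context (hypothetical input layout).
    """
    return fisher_combine([binom_upper_tail(k, n, p) for k, n, p in contexts])
```

For example, `gene_burden_p_value([(4, 1000, 0.001), (1, 800, 0.001), (0, 1200, 0.0005)])` would combine the three context-level tests of a single gene; the counts and frequencies here are invented for illustration only.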

We estimated gene-specific NS:SN ratios in each of the three contexts. We proceeded as follows: we first identified all bases in a particular gene that are positioned in the context under consideration. We then performed an in-silico experiment in which we mutated each of these bases to every alternative base and recorded whether the change was non-synonymous or synonymous. The resulting counts of non-synonymous and synonymous changes were weighted according to the observed frequency of each particular base change before taking their ratio. The frequencies for each base change, in each context, were calculated from the observed synonymous and non-synonymous base changes, excluding non-synonymous changes in the 100 most mutated genes, which may be enriched for driver mutations. The top 100 genes were determined by dividing the number of observed somatic mutations by the gene length and ranking the resulting ratios. We determined an overall NS:SN ratio, across the three contexts and across all genes, of 1.93 in sun-exposed melanomas.
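The in-silico mutation experiment can be sketched as below. Several simplifications are assumptions of this illustration and not part of the described method: only the coding strand is examined (cytosines on the opposite strand would appear as guanines here), the input is a single in-frame coding sequence, and the base-change weights are supplied by the caller. The function and variable names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def is_c_at_dipyrimidine(seq, i):
    """True if seq[i] is a C with a pyrimidine (C or T) on either side."""
    if seq[i] != "C":
        return False
    left = seq[i - 1] if i > 0 else None
    right = seq[i + 1] if i + 1 < len(seq) else None
    return left in ("C", "T") or right in ("C", "T")


def ns_sn_ratio(cds, in_context, change_weights):
    """Weighted NS:SN ratio for one gene in one sequence context.

    cds            -- in-frame coding sequence (coding strand only)
    in_context     -- predicate (seq, index) -> bool selecting mutable bases
    change_weights -- {(ref, alt): observed frequency of that base change}
    Returns None when no weighted synonymous change is possible.
    """
    ns = sn = 0.0
    for i, ref in enumerate(cds):
        if not in_context(cds, i):
            continue
        start, offset = i - i % 3, i % 3
        codon = cds[start:start + 3]
        if len(codon) < 3:
            continue  # ignore a trailing partial codon
        for alt in "ACGT":
            if alt == ref:
                continue
            weight = change_weights.get((ref, alt), 0.0)
            mutated = codon[:offset] + alt + codon[offset + 1:]
            if CODON_TABLE[mutated] == CODON_TABLE[codon]:
                sn += weight
            else:
                ns += weight
    return ns / sn if sn > 0 else None
```

With uniform weights, `ns_sn_ratio("GCC", is_c_at_dipyrimidine, {("C", "A"): 1.0, ("C", "G"): 1.0, ("C", "T"): 1.0})` mutates the two cytosines of the alanine codon GCC: all three changes at the second position alter the amino acid, while all three at the third position are silent.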

The final gene burden ranks were matched against similar ranks that were generated by excluding the top 5% of mutated samples, in order to ensure robustness of the results. Only genes that were ranked high in both lists were retained.
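A minimal sketch of this robustness filter follows. The cutoff defining a "high" rank is not specified in the text, so the `top_n` parameter, like the function name, is an assumption of this example.

```python
def robust_top_genes(full_ranking, trimmed_ranking, top_n):
    """Keep genes ranked within the top `top_n` of both rankings.

    full_ranking    -- genes ordered by burden using all samples
    trimmed_ranking -- genes ordered by burden after excluding the most
                       heavily mutated samples
    The order of the full ranking is preserved in the result.
    """
    keep = set(trimmed_ranking[:top_n])
    return [gene for gene in full_ranking[:top_n] if gene in keep]
```

A gene whose high rank depends on a handful of hypermutated samples drops out of the trimmed ranking and is therefore filtered from the final list.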

Acknowledgments

The authors would like to thank Ruth Halaban and Douglas Brash. This work was supported by the Yale SPORE in Skin Cancer funded by National Cancer Institute grant number 1 P50 CA121974 (principal investigator, Ruth Halaban), the Melanoma Research Alliance (a Team award to Ruth Halaban and MK), and National Library of Medicine Training grant 5T15LM007056 (PE).

Contributor Information

Perry Evans, Department of Pathology, Yale University School of Medicine, New Haven, CT, 06511 USA.

Stefan Avey, Department of Pathology, Yale University School of Medicine, New Haven, CT, 06511 USA.

Yong Kong, Department of Molecular Biophysics and Biochemistry, and W.M. Keck Foundation Biotechnology Resource Laboratory, Yale University School of Medicine, New Haven, CT, 06511 USA.

Michael Krauthammer, Department of Pathology, Yale University School of Medicine, New Haven, CT, 06511 USA.

References

  • 1. Hodgkinson A, Chen Y, Eyre-Walker A. The large-scale distribution of somatic mutations in cancer genomes. Human Mutation. 2012;33(1):136–143. doi: 10.1002/humu.21616.
  • 2. Pleasance E, Cheetham R, Stephens P, McBride D, Humphray S, Greenman C, Varela I, Lin M, Ordóñez G, Bignell G, et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2009;463(7278):191–196. doi: 10.1038/nature08658.
  • 3. Ding L, Getz G, Wheeler D, Mardis E, McLellan M, Cibulskis K, Sougnez C, Greulich H, Muzny D, Morgan M, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008;455(7216):1069–1075. doi: 10.1038/nature07423.
  • 4. Krauthammer M, Kong Y, Ha B, Evans P, Bacchiocchi A, McCusker J, Cheng E, Davis M, Goh G, Choi M, et al. Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma. Nature Genetics. 2012. doi: 10.1038/ng.2359.
  • 5. Sherry S, Ward M, Kholodov M, Baker J, Phan L, Smigielski E, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research. 2001;29(1):308–311. doi: 10.1093/nar/29.1.308.
  • 6. Stamatoyannopoulos J, Adzhubei I, Thurman R, Kryukov G, Mirkin S, Sunyaev S. Human mutation rate associated with DNA replication timing. Nature Genetics. 2009;41(4):393–395. doi: 10.1038/ng.363.
  • 7. Hansen R, Thomas S, Sandstrom R, Canfield T, Thurman R, Weaver M, Dorschner M, Gartler S, Stamatoyannopoulos J. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proceedings of the National Academy of Sciences. 2010;107(1):139–144. doi: 10.1073/pnas.0912402107.
  • 8. Subramanian A, Kuehn H, Gould J, Tamayo P, Mesirov J. GSEA-P: a desktop application for gene set enrichment analysis. Bioinformatics. 2007;23(23):3251–3253. doi: 10.1093/bioinformatics/btm369.
  • 9. Prasad TK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al. Human protein reference database – 2009 update. Nucleic Acids Research. 2009;37(suppl 1):D767–D772. doi: 10.1093/nar/gkn892.
  • 10. Halaban R, Zhang W, Bacchiocchi A, Cheng E, Parisi F, Ariyan S, Krauthammer M, McCusker J, Kluger Y, Sznol M. PLX4032, a selective BRAFV600E kinase inhibitor, activates the ERK pathway and enhances cell migration and proliferation of BRAFWT melanoma cells. Pigment Cell & Melanoma Research. 2010;23(2):190–200. doi: 10.1111/j.1755-148X.2010.00685.x.
  • 11. Kong Y. Btrim: A fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics. 2011;98(2):152. doi: 10.1016/j.ygeno.2011.05.009.
  • 12. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–1760. doi: 10.1093/bioinformatics/btp324.
  • 13. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079. doi: 10.1093/bioinformatics/btp352.
  • 14. Garla V, Kong Y, Szpakowski S, Krauthammer M. MU2A-reconciling the genome and transcriptome to determine the effects of base substitutions. Bioinformatics. 2011;27(3):416–418. doi: 10.1093/bioinformatics/btq658.
  • 15. Seshan V, Olshen A. DNAcopy: A package for analyzing DNA copy data. 2010.
  • 16. Halaban R, Krauthammer M, Pelizzola M, Cheng E, Kovacs D, Sznol M, Ariyan S, Narayan D, Bacchiocchi A, Molinaro A, et al. Integrative analysis of epigenetic modulation in melanoma cell response to decitabine: clinical implications. PLoS One. 2009;4(2):e4563. doi: 10.1371/journal.pone.0004563.
