AMIA Annual Symposium Proceedings. 2008;2008:202–206.

Somatic Mutation Signatures of Cancer

Stephen R Piccolo 1, Lewis J Frey 1,2
PMCID: PMC2655983  PMID: 18999255

Abstract

The advancement of cancer diagnosis, prognosis, and treatment would be hastened via a robust method to identify patterns that indicate a tumor’s state. Prior research has established that sporadic, colorectal-cancer pathogenesis involves a series of genetic mutations that allow benign polyps to develop and eventually progress to malignant tumors in distinguishable patterns. Using a publicly available database of somatic mutations for many cancer types, we identified somatic-mutation signatures. Our results for colorectal cancer are consistent with extant biological models as described in the literature. This approach is potentially useful for identifying previously undiscovered patterns and generating hypotheses related to biological pathways. Such signatures could prove valuable for eventual translation into clinical practice.

Background

Cancer is a serious public health issue impacting millions of people worldwide each year. Robust methods are needed to accelerate the discovery of patterns that represent particular cancer types, sub-populations within those types, and states of cancer progression. Our research goal was to illustrate the potential for identifying genetic mutation signatures that could be used for these purposes. Such signatures could guide the efforts of biology researchers and ultimately lead to better cancer diagnosis, prognosis, and treatment.

Somatic genetic mutations occur sporadically during cell replication or as a result of environmental factors and are not inherited. Germline mutations are passed from one generation to the next and often cause cancer to segregate in families. Environmental factors such as diet and lifestyle also are thought to have a strong impact on risk, although this is controversial.1

Oncogenes encode proteins that control cell proliferation, cell death, or both2 and can contribute to tumorigenesis by promoting inappropriate cell behavior, even when only one allele is mutated.3 Tumor-suppressor genes encode proteins that play an important role in cell death, cell senescence, and signaling pathways by inhibiting growth of tumors4 and counteracting mutated oncogenes. Mismatch-repair genes maintain DNA stability by repairing nucleotide bases that are paired incorrectly.1

The World Health Organization estimates that colorectal cancer, the third most common cancer, causes 492,000 deaths annually.5 With its high incidence rate and the relative ease of studying the various phases of tumor progression, colorectal cancer has been the subject of much research.1 Genetic profiles of colorectal tissue exhibit predictable patterns as they progress along a path from adenomatous polyps to carcinoma.6 We draw on this fact to illustrate convergent support for our findings.

Cancer development depends not only on germline or somatic mutations to initiate the process but also on subsequent somatic events to drive progression.4 In 1990, Fearon and Vogelstein presented a seminal model on the genetic basis of colorectal pathogenesis, suggesting that when oncogenes are activated or tumor-suppressor genes inactivated through somatic mutations, benign adenomatous polyps can develop, some of which may progress to malignancy after accumulating at least four or five mutations. They explained that even though these mutations typically occur in a recurring sequence, the order of the mutations is less important than their cumulative effect.6 Mutations may occur over the course of years, slowly altering the tissue’s genomic profile, ultimately resulting in metastasis.4

In a recent study, Sjöblom et al. identified 69 “candidate cancer genes” for colorectal cancer, a list that included previously identified candidate genes but also many novel ones. The average number of candidate genes that were mutated in each tumor was nine.8 A follow-up study raised this estimate to 15 mutated candidate genes, adding strength to the hypothesis that tumors generally result from many mutational events.9

Given that colorectal pathogenesis follows predictable patterns of somatic mutations, one would expect these patterns to be detectable across a large number of tissue samples, despite confounding hereditary factors or environmental exposures.7 The same methodology used to identify these patterns might also be applied to tissue samples for other cancer types in an attempt to discover corresponding patterns. Such a result would be especially valuable in cancers that are less well defined and tissues in which stages of progression are more difficult to study.6

A recent effort to characterize patterns across the wide spectrum of known genetic diseases revealed that genes mutated in phenotypes such as cancer are more likely to encode proteins that act as “hubs,” which are expressed in a large number of tissue types and often play critical roles in cellular development and growth.7 This finding suggests that distinguishable mutation patterns should be identifiable across a variety of cancers. It was made possible because the authors expanded their view to a wide array of disease types: they identified patterns that would not have been found by studying individual diseases or genes alone.

We took a global view of somatic mutations that occur in many cancers. By applying machine-learning techniques to a publicly available data set of somatic mutations gleaned from the literature, we show that meaningful patterns do arise and coincide with what we expected for colorectal cancer.

Methods

Researchers at the UK’s Sanger Institute have curated a list of somatic mutations that have been identified through biomedical research and reported in the literature. This resource, the Catalogue of Somatic Mutations in Cancer (COSMIC),10 is a publicly available, actively updated database that can be downloaded and queried freely. The July 2007 version contains data for 3,298 genes, 196 cancer types, 230,838 tissue samples, and 515,535 mutation tests on those samples (2.23 tests per sample). A total of 48,911 somatic mutations were identified in those samples.

Using the full COSMIC database, we conducted an analysis with the goal of developing a method to identify signatures of somatic mutations indicative of particular tumor types and phases of cancer progression. We investigated patterns across all tumor types and aimed to identify sets of genes that would provide a robust signature to distinguish between tumor types.

We downloaded the entire COSMIC database (as an Oracle import file) and, based on the provided schema,11 used SQL queries to associate mutations and cancer types with genes for each reported tissue sample. After filtering out genes and cancer types for which no mutations had been identified (leaving 1,212 genes, 136 cancer types, and 404,033 tests), we stored the remaining data in a tab-delimited file in which each row represents a tissue sample, each column represents a gene, and binary values indicate whether a mutation occurred for the respective combination.
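The pivot from per-test records to a sample-by-gene binary matrix can be sketched as follows. This is only an illustration of the data layout described above, not the SQL that was run against COSMIC; the record shape `(sample_id, gene, mutated)`, the function names, and the example records are assumptions. For brevity the sketch filters out genes with no mutations but does not filter cancer types.

```python
import csv
import io

def build_binary_matrix(records):
    """Pivot (sample_id, gene, mutated) test records into a 0/1 matrix.

    Genes with no mutation in any sample are dropped, mirroring the
    gene-filtering step described in the text.
    """
    genes = sorted({gene for _, gene, mutated in records if mutated})
    samples = sorted({sample for sample, _, _ in records})
    hits = {(sample, gene) for sample, gene, mutated in records if mutated}
    rows = [[1 if (s, g) in hits else 0 for g in genes] for s in samples]
    return samples, genes, rows

def to_tsv(samples, genes, rows):
    """Serialize the matrix as tab-delimited text, one row per sample."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample"] + genes)
    for sample, row in zip(samples, rows):
        writer.writerow([sample] + row)
    return buf.getvalue()

# Hypothetical records: (sample_id, gene, mutation_found)
records = [
    ("S1", "APC", True),
    ("S1", "KRAS", False),
    ("S2", "KRAS", True),
    ("S2", "TP53", False),  # TP53 is never mutated here, so it is dropped
    ("S3", "APC", True),
]
samples, genes, rows = build_binary_matrix(records)
print(to_tsv(samples, genes, rows))
```

Note that a 0 in this layout does not distinguish "tested, no mutation found" from "not tested", which is one way the sparsity discussed under Limitations enters the data.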

Next we applied a machine-learning technique called feature selection, which has the general goal of reducing the features (genes in this case) in a data set to those most relevant for classifying instances (cancer types in this case). We used ReliefF (implemented in the Orange machine-learning framework12), a feature-selection algorithm that uses the k-nearest-neighbor approach,13 to estimate how well each gene could be used to separate cancer types within the data for the individual tissue samples. The output of this analysis was a continuous value for each gene, with higher values indicating a better ability to differentiate classes. Prior to clustering, we applied an ad hoc cutoff, retaining features whose ReliefF values were greater than 0.00 after rounding to two decimal places; this selected a subset of 46 genes (3.8%).

Because the ReliefF algorithm is designed to handle multiple classes and incomplete or noisy data and to account for dependencies between attributes,14 it is well suited for this particular data set, which is characterized by sparsity, disparate data-acquisition methods, and known relationships between genes.
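To make the scoring and cutoff concrete, the following is a minimal, simplified sketch of a multi-class ReliefF weight update on binary features. It is not the Orange implementation: it uses every instance as a reference point (rather than a random sample), Manhattan distance, and a small k, and the function name, gene labels, and toy data are all assumptions chosen for illustration.

```python
from collections import Counter

def relieff(X, y, k=2):
    """Simplified ReliefF: return one weight per feature.

    For each instance, the k nearest same-class neighbours (hits) lower the
    weight of features on which they differ; the k nearest neighbours from
    each other class (misses) raise it, scaled by that class's prior.
    """
    n, n_feat = len(X), len(X[0])
    prior = {c: count / n for c, count in Counter(y).items()}
    weights = [0.0] * n_feat

    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        by_class = {}
        for j in range(n):
            if j != i:
                by_class.setdefault(y[j], []).append(j)
        for c, idxs in by_class.items():
            nearest = sorted(idxs, key=lambda j: dist(X[i], X[j]))[:k]
            if c == y[i]:
                scale = -1.0                             # hits
            else:
                scale = prior[c] / (1.0 - prior[y[i]])   # misses
            for j in nearest:
                for f in range(n_feat):
                    diff = abs(X[i][f] - X[j][f])
                    weights[f] += scale * diff / (n * len(nearest))
    return weights

# Toy data: rows are tissue samples, columns are two hypothetical genes.
gene_names = ["GENE_A", "GENE_B"]
X = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
y = ["colorectal"] * 3 + ["breast"] * 3

weights = relieff(X, y, k=2)
# Cutoff analogous to the one in the text: keep values > 0.00 after rounding.
selected = [g for g, w in zip(gene_names, weights) if round(w, 2) > 0.00]
print(dict(zip(gene_names, weights)), selected)
```

In this toy example GENE_A separates the two classes perfectly and receives the highest weight, while GENE_B varies within classes and is removed by the cutoff.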

A simpler approach for selecting a subset of genes would have been to choose those with the highest mutation rates. We compared the 46-gene list identified by the ReliefF approach with a list of the same size containing the genes mutated most frequently across all cancers; the ReliefF approach identified 15 genes (32.6%) that would not have been selected with the latter approach.
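The comparison between the two selection strategies amounts to a set difference between equally sized gene lists. The sketch below shows that bookkeeping; the mutation counts and the ReliefF selection are hypothetical placeholders, and only the final line uses figures reported above (15 of 46 genes).

```python
def top_n_by_frequency(mutation_counts, n):
    """Return the n genes with the highest mutation counts (ties by name)."""
    ranked = sorted(mutation_counts, key=lambda g: (-mutation_counts[g], g))
    return ranked[:n]

def relieff_only(relieff_genes, frequency_genes):
    """Genes chosen by ReliefF that an equally sized frequency list misses."""
    only = sorted(set(relieff_genes) - set(frequency_genes))
    return only, len(only) / len(relieff_genes)

# Hypothetical counts and ReliefF selection, for illustration only.
counts = {"TP53": 900, "KRAS": 700, "APC": 400, "BRAF": 300, "SMAD4": 50}
relieff_selected = ["APC", "KRAS", "SMAD4"]

frequent = top_n_by_frequency(counts, len(relieff_selected))
only, share = relieff_only(relieff_selected, frequent)
print(only, f"{share:.1%}")

# Proportion reported in the text: 15 of the 46 selected genes.
print(f"{15 / 46:.1%}")
```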

Each cancer type had mutations in 1 to 33 of the selected genes. Twelve of these genes were mutated in 20 or more cancer types. For each combination of gene and cancer type, we calculated mutation rates (mr):

$$\mathit{mr} = \frac{\text{mutations per gene, cancer type}}{\text{mutations per cancer type}}$$

The values of mr (ranging from 0 to 1) were used as metrics for constructing mutation-rate vectors for each cancer type. Hierarchical clustering of these vectors was then used to create the somatic-mutation signatures.
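A mutation-rate vector of this kind can be computed directly from a list of observed mutations. The sketch below is an illustration under assumptions: the input shape (one `(cancer_type, gene)` pair per observed mutation), the function name, and the example data are hypothetical, and the denominator is taken to be all mutations supplied for the cancer type.

```python
from collections import defaultdict

def mutation_rate_vectors(mutations, genes):
    """Return {cancer_type: [mr for each gene in `genes`]}.

    mr = (mutations in this gene for this cancer type)
         / (all mutations recorded for this cancer type)
    """
    per_pair = defaultdict(int)
    per_type = defaultdict(int)
    for cancer_type, gene in mutations:
        per_pair[(cancer_type, gene)] += 1
        per_type[cancer_type] += 1
    return {
        cancer_type: [per_pair[(cancer_type, g)] / total for g in genes]
        for cancer_type, total in per_type.items()
    }

# Hypothetical observations, one tuple per mutation found.
mutations = (
    [("colorectal carcinoma", "APC")] * 3
    + [("colorectal carcinoma", "KRAS")]
    + [("breast carcinoma", "TP53")] * 2
)
genes = ["APC", "KRAS", "TP53"]
vectors = mutation_rate_vectors(mutations, genes)
print(vectors)
```

When every mutation for a cancer type falls in one of the listed genes, as here, the entries of its vector sum to 1.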

Hierarchical clustering groups instances in a tree-like structure. We investigated how the distances between the mutation-rate vectors of the cancer types grouped those cancer types according to their somatic-mutation signatures. We measured similarity between cancer types using the Orange framework12 and the distance measure of the basic ReliefF algorithm (which measures distances between instances based on feature values14).
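As an illustration of how mutation-rate vectors can be grouped into a tree, here is a small agglomerative clustering sketch. The linkage criterion used in the analysis is not stated above, so average linkage is an assumption, as are the Manhattan distance (chosen because ReliefF-style distances sum per-feature differences), the function names, and the example vectors.

```python
def manhattan(a, b):
    """Sum of absolute per-gene differences between two rate vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))

def average_linkage(vectors):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in sorted(vectors)]
    merges = []

    def linkage(c1, c2):
        total = sum(manhattan(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical mutation-rate vectors over four genes.
vectors = {
    "colorectal adenoma": [0.7, 0.2, 0.1, 0.0],
    "colorectal carcinoma": [0.5, 0.2, 0.3, 0.0],
    "breast carcinoma": [0.0, 0.0, 0.4, 0.6],
}
merges = average_linkage(vectors)
for left, right, distance in merges:
    print(left, right, round(distance, 2))
```

With these made-up vectors, the two colorectal types merge first and breast carcinoma joins last, which is the kind of grouping a dendrogram would display.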

Results

Figure 1 compares the mutation-rate distributions between colorectal adenoma/carcinoma and breast carcinoma for the ten genes with the highest mutation rates. The colorectal profiles show similarity to each other (with some expected differences), while the profile for breast carcinoma is quite different. This finding illustrates that some cancer types differ drastically in their somatic mutation profiles and coincides with the results of recent, genome-wide analyses of somatic alterations, which showed that breast and colorectal cancers differ substantially in their mutation profiles.8,9

Figure 1. Mutation rates for the ten genes having the highest average mutation rates across breast and colorectal cancers.21–22

The breadth of the COSMIC data allowed us to investigate patterns across a large number of cancer types. An intriguing finding from the cluster analysis was that, out of the 136 cancer types, the mutation-rate vectors of colorectal adenoma and carcinoma clustered together (see Figure 2). This finding suggests that even though the mutation signatures for these cancer types change as the disease progresses, their somatic-mutation patterns are similar enough to form a cluster that is distinguishable from other cancer types. This cluster appears to be more than an artifact caused by data sparsity. Table 1 shows the number of genes in common between colorectal carcinoma and various other cancer types. Colorectal carcinoma had mutations in 27 of the selected genes, of which 9 were also mutated in colorectal adenoma; however, colorectal carcinoma shared more mutated genes with several other cancer types (for example, 22 with lung carcinoma). If the clusters were principally a result of data sparsity, one would have expected colorectal carcinoma to cluster more closely with lung, breast, or prostate carcinoma.

Figure 2. Snapshot of hierarchical clustering output in which the somatic mutation signatures of colorectal adenoma and carcinoma clustered together.12 The colors have no significance other than to highlight clusters.

Table 1. Number of genes having mutations for various cancer types and the number in common with colorectal carcinoma.

| Cancer type | # Mutated genes | # In common with colorectal carcinoma |
|---|---|---|
| Colorectal carcinoma | 27 | 27 |
| Lung carcinoma | 33 | 22 |
| Breast carcinoma | 23 | 18 |
| Prostate carcinoma | 15 | 13 |
| Colorectal adenoma | 10 | 9 |
| Skin carcinoma | 10 | 8 |
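The sparsity argument can be checked with a quick calculation on the Table 1 counts. The sketch below ranks cancer types by their gene-set overlap with colorectal carcinoma using the Jaccard index; the Jaccard index is our choice for illustration and is not a measure used in the analysis itself.

```python
# Values copied from Table 1:
# (number of mutated genes, number in common with colorectal carcinoma)
TABLE_1 = {
    "Lung carcinoma": (33, 22),
    "Breast carcinoma": (23, 18),
    "Prostate carcinoma": (15, 13),
    "Colorectal adenoma": (10, 9),
    "Skin carcinoma": (10, 8),
}
CRC_GENES = 27  # mutated genes in colorectal carcinoma

def jaccard(n_a, n_b, n_common):
    """Size of the intersection divided by the size of the union."""
    return n_common / (n_a + n_b - n_common)

scores = {
    name: jaccard(CRC_GENES, n_genes, n_common)
    for name, (n_genes, n_common) in TABLE_1.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name:20s} {scores[name]:.2f}")
```

By set overlap alone, lung, breast, and prostate carcinoma all rank above colorectal adenoma, so the adenoma/carcinoma cluster reflects similarity in mutation rates rather than merely which genes happen to have data.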
Skin carcinoma 10 8

Additional, potentially interesting cancers that clustered closely were breast carcinoma with ovarian carcinoma, cervix carcinoma with prostate carcinoma, and small-intestine carcinoma with colorectal carcinoma/adenoma. While these findings are suggestive rather than conclusive, they represent an approach to objectively determine cancer types that have similar or complementary patterns of somatic activity and thus may provide a means to generate hypotheses in biological-pathway research and eventually guide clinical decisions.

Discussion

This work demonstrates that somatic mutation rates for a subset of genes identified with machine-learning techniques can be used to illustrate cancer progression for colorectal cancer and relationships between a large number of cancer types.

Figure 3 is a simplified depiction of colorectal-cancer progression from the KEGG PATHWAY database.15 It shows progression from normal tissue to colorectal adenoma to carcinoma as somatic mutations in particular genes accumulate. When APC has been mutated, it may fail to prevent benign polyps from forming. As oncogenes mutate, polyps may begin to form at an increasing rate. Ultimately, a second mutation in APC and/or additional mutations in tumor-suppressor genes such as SMAD4 and TP53 may allow these tumors to expand and turn cancerous.

Figure 3. Simplified version of the colorectal-cancer biological pathway in the KEGG database15 for genes with mutation data in COSMIC.

Figure 4 compares the relative mutation rates for colorectal adenoma and carcinoma for the genes in Figure 3, based on thousands of tissue samples in COSMIC. Each successive gene shows a decrease in the relative proportion of mutations for adenoma compared with carcinoma. This pattern coincides with Fearon and Vogelstein’s findings on how the somatic-mutation profile changes as colorectal adenoma progresses to carcinoma.6

Figure 4. Relative proportion of mutations in colorectal adenoma and carcinoma tumors for six genes that have data in COSMIC (pattern consistent with Fearon and Vogelstein6).

The extant method for classifying cancer types is based on the localization and appearance of the tissue.16 Clinicians use this information to predict prognosis and determine which therapeutic approach to take; however, patients with histologically similar tissue can have different clinical outcomes and respond differently to identical therapies.16 Consequently, some patients could be treated with therapies that are inappropriate for their individual needs.16

Fearon and Vogelstein discovered that a subset of patients with more than the median number of mutations in tumor-suppressor genes had a considerably worse prognosis than the remaining patients.6 They envisioned chemotherapeutic agents that could inactivate mutated oncogenes or “mimic or restore the normal biologic action of suppressor genes.”6 Goyette et al. experimentally determined that correcting only a single defect can have significant therapeutic effects;17 in fact, some drugs that inhibit the products of oncogenic kinase genes (e.g., EGFR, BRAF) that have gone awry due to somatic mutations have already been developed and are in clinical use or undergoing clinical trials.16

Once validated sufficiently, robust somatic mutation signatures could guide such personalized drug targeting. They could also help physicians refine diagnoses of disease severity and shed light on a patient’s prognosis in a personalized way, potentially resulting in decreased mortality.1

Various feature-selection techniques have been applied to genomic data sets.18 However, to our knowledge, such methods had not been applied previously to a somatic-mutation data set or to such a wide array of cancer types.

Limitations

An important goal of the COSMIC resource is not only to provide a way to look at well-known cancer types and genes that have been tested across many studies but also to increase access to data about less-common cancers and infrequently mutated genes as well as those in which mutations have not yet been found (negative data).10 However, one consequence of this broad coverage is data sparsity: most samples were tested for only a small subset of mutations. There is also a high likelihood of implicit publication bias, in which positive results are published more frequently than negative ones.10 Because the data were extracted from the literature, it is also likely that an ethnic bias exists, in which a disproportionate number of samples were analyzed using tissue from the USA and Europe.10 Furthermore, particular “hot spots” of interest in the genome and common cancers have been reported with much higher frequency than others.10 Nevertheless, we believe the value of such an aggregate resource will only increase as more data are added, especially from unbiased, genome-wide studies.

Future analyses should assess the robustness of the results using different cutoff thresholds and feature-selection techniques.

Conclusion

Because tissue from sporadic forms of cancer exhibits distinguishable patterns of somatic-mutation activity, identifying signatures for particular cancer types provides opportunities for understanding biological patterns of disease progression, improving prognostic approaches, and developing novel therapeutic methods.

Due to the availability of the COSMIC resource, we were able to illustrate the potential value of looking across many studies to discover patterns that might not be feasible to explore in individual studies.

This work enables the creation of clustered sets of cancer types with similar distinguishing gene patterns. For colorectal cancer, the set of genes is associated with known pathways of proliferation, apoptosis, and genetic instability, along with the known progression of the cancer. These findings suggest that this method can be used to find biologically relevant patterns of distinguishing genes among clustered sets of cancer types.

Acknowledgments

SP was funded by the National Library of Medicine training grant #1T15-LM007124. The work was additionally funded in part by an Incentive Seed Grant from University of Utah awarded to LF.

References

  • 1. de la Chapelle A. Genetic predisposition to colorectal cancer. Nat Rev Cancer. 2004;4:769–80. doi: 10.1038/nrc1453.
  • 2. Croce CM. Oncogenes and cancer. N Engl J Med. 2008;358:502–11. doi: 10.1056/NEJMra072367.
  • 3. Vogelstein B, Kinzler KW. The genetic basis of human cancer. New York: McGraw–Hill; 1998.
  • 4. Michor F, Iwasa Y, Nowak MA. Dynamics of cancer progression. Nat Rev Cancer. 2004;4:197–205. doi: 10.1038/nrc1295.
  • 5. Stewart BW, Kleihues P, editors. World cancer report. Lyon: IARC Press; 2003.
  • 6. Fearon ER, Vogelstein B. A genetic model for colorectal tumorigenesis. Cell. 1990;61:759–67. doi: 10.1016/0092-8674(90)90186-i.
  • 7. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL. The human disease network. Proc Natl Acad Sci USA. 2007;104(21):8685–8690. doi: 10.1073/pnas.0701361104.
  • 8. Sjöblom T, Jones S, Wood LD, et al. The consensus coding sequences of human breast and colorectal cancers. Science. 2006;314:268–74. doi: 10.1126/science.1133427.
  • 9. Wood LD, Parsons DW, Jones S, et al. The genomic landscapes of human breast and colorectal cancers. Science. 2007;318:1108–1113. doi: 10.1126/science.1145720.
  • 10. Forbes S, Clements J, Dawson E, et al. COSMIC 2005. Br J Cancer. 2006;94:318–22. doi: 10.1038/sj.bjc.6602928.
  • 11. COSMIC database schema. Available from: ftp://ftp.sanger.ac.uk/pub/CGP/cosmic/oracle_export/cosmic_export_feb_2006.pdf
  • 12. Demsar J, Zupan B, Leban G. Orange: From experimental machine learning to interactive data mining, white paper. Faculty of Computer and Information Science, University of Ljubljana. 2004. Available from: www.ailab.si/orange
  • 13. Bergadano F, de Raedt L, editors. Machine Learning: ECML-94: European Conference on Machine Learning. Springer; 1994.
  • 14. Liu H, Motoda H. Computational methods of feature selection. Boca Raton: Chapman & Hall; 2007.
  • 15. Kanehisa M, Goto S, Hattori M, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006;34:D354–7. doi: 10.1093/nar/gkj102.
  • 16. Bleeker FE, Bardelli A. Genomic landscapes of cancers: prospects for targeted therapies. Pharmacogenomics. 2007;8:1629–33. doi: 10.2217/14622416.8.12.1629.
  • 17. Goyette MC, Cho K, Fasching CL, et al. Progression of colorectal cancer is associated with multiple tumor suppressor gene defects but inhibition of tumorigenicity is accomplished by correction of any single defect via chromosome transfer. Mol Cell Biol. 1992;12:1387–95. doi: 10.1128/mcb.12.3.1387.
  • 18. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23:2507–17. doi: 10.1093/bioinformatics/btm344.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
