Skip to main content
Database: The Journal of Biological Databases and Curation logoLink to Database: The Journal of Biological Databases and Curation
. 2011 May 23;2011:bar018. doi: 10.1093/database/bar018

Data mining using the Catalogue of Somatic Mutations in Cancer BioMart

Rebecca Shepherd 1, Simon A Forbes 1, David Beare 1, S Bamford 1, Charlotte G Cole 1, Sari Ward 1, Nidhi Bindal 1, Prasad Gunasekaran 1, Mingming Jia 1, Chai Yin Kok 1, Kenric Leung 1, Andrew Menzies 1, Adam P Butler 1, Jon W Teague 1, Peter J Campbell 1, Michael R Stratton 1, P Andrew Futreal 1,*
PMCID: PMC3263736  PMID: 21609966

Abstract

Catalogue of Somatic Mutations in Cancer (COSMIC) (http://www.sanger.ac.uk/cosmic) is a publicly available resource providing information on somatic mutations implicated in human cancer. Release v51 (January 2011) includes data from just over 19 000 genes, 161 787 coding mutations and 5573 gene fusions, described in more than 577 000 tumour samples. COSMICMart (COSMIC BioMart) provides a flexible way to mine these data and combine somatic mutations with other biological relevant data sets. This article describes the data available in COSMIC along with examples of how to successfully mine and integrate data sets using COSMICMart.

Database URL: http://www.sanger.ac.uk/genetics/CGP/cosmic/biomart/martview/

Project description

COSMIC is a repository for somatic mutations and associated phenotype/clinical information, combining data from a number of sources. Primarily, data are manually curated from the scientific literature for genes selected from the Cancer Gene Census (http://www.sanger.ac.uk/genetics/CGP/Census/), a listing of genes that are known to be mutated in human cancer. The manual curation initially focused on known cancer genes that had a high proportion of coding point mutations. In recent years, this has been extended to include cancer genes that are also mutated by gene fusion events. With the advent of next-generation sequencing, COSMIC has been adapted to hold complete catalogues of somatic mutations for individual tumour samples.

The COSMIC website has been developed to overview the underlying database in a user-friendly manner. The website can be navigated by gene, cancer sample or tissue/histology type and has a series of graphical and tabular displays to summarize the content of the database.

However, the integration of biological data sets is still a major IT challenge. Although, the data in COSMIC can be viewed and downloaded in a number of useful ways, it is still challenging for users to generate their own custom queries and also to link to related resources. In order to facilitate the integration of the COSMIC data set, we have used the BioMart data mining software. BioMart uses a federated data model to allow integration of biological data from diverse databases (1). There is now a significant number of data resources that have set up their own BioMarts including Ensembl, UniProt and HGNC. The BioMart software also has interfaces that allow users to easily generate custom queries to obtain subsets of data from one or more BioMart data resources. We have successfully set up an instance of a BioMart, COSMICMart, which holds a summary of the somatic mutations and associated phenotype data from the COSMIC database.

Data content

COSMIC version 51 (January 2011) contains a full curation of the scientific literature for 91 cancer genes, mostly point-mutated, as well as 53 curated fusion gene pairs, with over 11 000 papers having been assessed. Weekly literature searches are carried out for each cancer gene and each abstract/article is manually evaluated. Checks are made to ensure a paper has not been curated already, whether the paper is a review or is reporting original data, if there are any inconsistencies in the data, and whether all the required information for full curation is present, e.g. information on the numbers of samples screened and sufficient mutation detail. Papers which pass the initial scan are submitted for detailed manual curation, while the remaining papers are ‘listed’ on the COSMIC website. Information is manually recorded on the actual mutation change, the detailed phenotype of the cancer samples and the original publication/study.

Mutation changes are checked for accuracy and recorded at the genome, transcript (to one reference transcript) and protein level using the mutation syntax standards developed by the Human Genome Variation Society (2). Negative results (sample screened does not have a mutation for a particular gene for a particular publication) are also recorded so prevalence statistics can be estimated for each gene. As part of the curation process, cancer sample tissue/histology descriptions from the original publication are recorded and then re-classified to a COSMIC standard tissue/histology ontology (See classification section of the COSMIC additional information page, http://www.sanger.ac.uk/genetics/CGP/cosmic/add_info/). The standardization of the main data types in COSMIC means the utility of the data is significantly enhanced and the data are easier to browse and incorporate into external data sets. When available, clinical and exposure data and therapeutic responses are also recorded. The full functionality of COSMIC has previously been described (3, 4).

The database coverage has recently been enhanced by the integration of somatic mutations from external data resources. A collaboration with the database curators at International Agency for Research on Cancer (IARC) has allowed the majority of mutations from IARC TP53 database R14 to be integrated into COSMIC (5). Somatic mutations from large-scale systematic cancer screens are curated in COSMIC to include studies on glioblastoma multiforme, breast, colorectal and lung cancer (6–9). Data are also directly curated from the data portals of large-scale projects to include the Cancer Genome Project (CGP) at the Sanger Institute and more recently validated somatic mutations from The Cancer Genome Atlas (TCGA) (6) and International Cancer Genome Consortium (ICGC) (10). COSMIC has been enhanced to accommodate somatic mutations from whole genome and exome screens. This includes all somatic and non-coding mutations, structural rearrangements and gene fusions. Currently, COSMIC has annotated 51 genomes and 332 exomes from a range of cancers to include lung, malignant melanoma, renal, AML, pancreatic and ovarian cancer. The current contents of the database (v51, January 2011) are displayed in Table 1.

Table 1.

Total contents in v51 of the COSMIC database, January 2011 release

Curated data type Curated data count
Experiments 2 946 792
Tumours 577 304
Mutations 167 193
References 11 062
Genes 19 000
Fusions 5573
Structural variants 2729
Whole-cancer genomes 51
Whole-cancer exomes 332

Query examples

COSMICMart allows data to be filtered on six different categories (Figure 1): cancer sample, gene, mutation, site of the tumour, histology and other (e.g. Ensembl Gene ID, Swissprot ID, Entrez Gene ID). The interface has a number of pre-selected filters and attributes; mutated samples are selected by default. Users can change these to suit their requirements. Results are displayed in tabulated form and are exportable in various formats for further analysis.

Figure 1.

Figure 1.

Example of how COSMICMart can be queried. This query searches for all cell lines with missense substitution mutations in the BRAF gene (A). Attributes can be selected (B) to display in the results table (C).

Query #1: ‘Find all missense substitution mutations for BRAF in cell lines, and display sample, mutation, site, and histology information’ (Figure 1, Table 2).

Table 2.

Data sets, filters and attributes selected for query #1

Data sets Filters Attributes
COSMIC51 Mutated sample: yes Sample name
Sample source: cell-line Sample source
Gene name: BRAF Gene name
AA mutation type: substitution—missense Cosmic mutation ID (COSM ID)
CDS mutation syntax
AA mutation syntax
Primary site
Primary histology
Tumour source
Pubmed ID

Missense mutations are the most common variant type in COSMIC; over 90% of mutations in BRAF are missense mutations at the p.V600 position. The results are returned as a tabular summary with links back to the COSMIC website. The sample name field links back to the COSMIC sample overview web page, and mutation ID (COSM ID) to the COSMIC mutation summary page (Figure 2). From the COSMIC mutation summary web page, there are links to the Ensembl contig view so the mutation can be viewed in a genomic context. There are also links to the GMOD’s GBrowse where COSMIC coding and non-coding mutations, gene footprints, structural rearrangements and copy number variants can be viewed (11).

Figure 2.

Figure 2.

The COSMIC sample (A) and mutation (B) summary pages are linked directly from COSMICMart output table.

Query #2: ‘Find all gene fusion mutations involving the FUS gene with a primary site of bone, and display mutation and sample information’ (Table 3).

Table 3.

Data sets, filters and attributes selected for query #2

Data sets Filters Attributes
COSMIC51 Mutated sample: yes Cosmic sample ID
Gene name: FUS Sample name
CDS mutation type: inferred breakpoint, observed mRNA Sample source
Primary site: bone Cosmic fusion mutation ID
CDS mutation syntax
Pubmed ID

Gene fusions have been associated with a number of specific tumour types including prostate and blood tumours. These biomarkers can be useful in diagnosis and as targets for drug therapies. COSMIC has annotations, for an increasing number of gene fusion mutations, which are viewable using COSMICMart. The COSMIC fusion mutation ID links to the gene fusion summary pages, which give a graphical view of different fusion structures observed. Many of the papers describing gene fusions have identified more than one gene fusion product for the same genes in a single sample. Observed mRNAs are the actual expressed products reported in the results. However, to aid display and website navigation, we have inferred the genomic breakpoint from the experimental data.

Query #3: ‘Find variation information in Ensembl for all genes from mutated samples with a primary site of breast, and display COSMIC gene, mutation and sample information along with Ensembl variation information’ (Table 4).

Table 4.

Data sets, filters and attributes selected for query #3

Data sets Filters Attributes
COSMIC51 Mutated sample: yes Cosmic sample ID
Primary site: breast Sample name
Sample source
Cosmic mutation ID (COSM ID)
CDS mutation syntax
AA mutation syntax
Ensembl: Homo sapiens genes Features: Ensembl gene ID
Features: Ensembl transcript ID
Variations: variation source
Variations: source description
Variations: reference ID
Variations: allele

COSMICMart is federated with Ensembl (12), which allows Biomart queries to return and integrate data from both resources. For instance, the linking of the two resources can allow the retrieval of variation data from both resources (somatic mutations from COSMIC and germline polymorphisms from Ensembl) for a particular gene or set of genes or cancer type. There is an increasing awareness of how genomic variation can affect a tumour’s sensitivity or resistance to anti-cancer agents. While this genetic variation can be familial or somatic, an understanding of common genetic variation around known cancer genes can be of much value to investigations searching for loci modifying a tumour’s response to drug therapy (13–15). This query is achieved by first selecting the filters/attributes in the COSMIC BioMart and then clicking the ‘Dataset’ link at the bottom of the left hand margin of the BioMart interface. An additional data set can then be selected from the drop down list, in this instance Ensembl, which allows a federated query between COSMIC and Ensembl. The filters/attributes are then set in the usual way using the Ensembl BioMart to produce an integrated query.

Future directions

COSMIC will continue to curate newly discovered cancer genes and is committed to update existing cancer genes with a data release every 2 months. This will ensure that the scientific community has an up-to-date catalogue of somatic mutations implicated in human cancer. COSMICMart is also automatically updated with each new COSMIC release, which allows the data set to be easily mined and integrated with other resources. COSMIC has been successfully adapted to hold complete catalogues of somatic mutations for individual cancer samples. Currently COSMIC holds genome-wide data on 383 tumour samples and we expect this to increase in the near future.

It is intended to federate COSMICMart with further BioMart-driven data resources in addition to the current link with Ensembl. Linking our data to PRIDE (16), UniProt (17) and InterPro (18) will allow COSMIC somatic mutation data to be linked to protein and peptide annotation, while the addition of the Reactome (19) database will allow the incorporation of pathway data. We also intend to create direct links between COSMIC and the ICGC Data Portal (http://dcc.icgc.org/) so somatic mutation data can be integrated between the two data resources.

Funding

Funding for open access charge: Wellcome Trust (grant reference 077012/Z/05/Z).

Conflict of interest. None declared.

References

  • 1.Haider S, Ballester B, Smedley D, et al. BioMart Central Portal–unified access to biological data. Nucleic Acids Res. 2009;1:37. doi: 10.1093/nar/gkp265. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Den Dunnen JT, Antonarakis SE. Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Hum. Mutat. 2000;15:7–12. doi: 10.1002/(SICI)1098-1004(200001)15:1<7::AID-HUMU4>3.0.CO;2-N. [DOI] [PubMed] [Google Scholar]
  • 3.Forbes SA, Tang G, Bindal N, et al. COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res. 2010;38:D652–D657. doi: 10.1093/nar/gkp995. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Forbes SA, Bhamra G, Bamford S, et al. The Catalogue of Somatic Mutations in Cancer (COSMIC) Curr. Protoc. Hum. Genet. 2008 doi: 10.1002/0471142905.hg1011s57. Chapter 10, 11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Petitjean A, Mathe E, Kato S, et al. Impact of mutant p53 functional properties on TP53 mutation patterns and tumor phenotype: lessons from recent developments in the IARC TP53 database. Hum. Mutat. 2007;28:622–629. doi: 10.1002/humu.20495. [DOI] [PubMed] [Google Scholar]
  • 6. Cancer Genome Atlas Research Network. (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455, 1061–1068.
  • 7.Sjöblom T, Jones S, Wood LD, et al. The consensus coding sequences of human breast and colorectal cancers. Science. 2006;314:268–274. doi: 10.1126/science.1133427. [DOI] [PubMed] [Google Scholar]
  • 8.Parsons DW, Jones S, Zhang X, et al. An integrated genomic analysis of human glioblastoma multiforme. Science. 2008;321:1807–1812. doi: 10.1126/science.1164382. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Ding L, Getz G, Wheeler DA, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008;455:1069–1075. doi: 10.1038/nature07423. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.International Cancer Genome Consortium. Hudson TJ, Anderson W, et al. International network of cancer genome projects. Nature. 2010;464:993–998. doi: 10.1038/nature08987. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Stein LD, Mungall C, Shu S, et al. The generic genome browser: a building block for a model organism system database. Genome Res. 2002;12:1599–1610. doi: 10.1101/gr.403602. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Flicek P, Aken BL, Ballester B, et al. Ensembl's 10th year. Nucleic Acids Res. 2010;38:D557–D562. doi: 10.1093/nar/gkp972. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Sharma SV, Haber DA, Settleman J. Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. Nat. Rev. Cancer. 2010;10:241–253. doi: 10.1038/nrc2820. [DOI] [PubMed] [Google Scholar]
  • 14.Jänne PA, Gray N, Settleman J. Factors underlying sensitivity of cancers to small-molecule kinase inhibitors. Nat. Rev. Drug Discov. 2009;8:709–723. doi: 10.1038/nrd2871. [DOI] [PubMed] [Google Scholar]
  • 15.McDermott U, Sharma SV, Settleman J. High-throughput lung cancer cell line screening for genotype-correlated sensitivity to an EGFR kinase inhibitor. Methods Enzymol. 2008;438:331–341. doi: 10.1016/S0076-6879(07)38023-3. [DOI] [PubMed] [Google Scholar]
  • 16.Vizcaíno JA, Reisinger F, Côté R, et al. PRIDE and "Database on Demand" as valuable tools for computational proteomics. Methods Mol. Biol. 2011;696:93–105. doi: 10.1007/978-1-60761-987-1_6. [DOI] [PubMed] [Google Scholar]
  • 17. UniProt Consortium. (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res., 38, D142–D148.
  • 18.Hunter S, Apweiler R, Attwood TK, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009;37:D211–D215. doi: 10.1093/nar/gkn785. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Croft D, O’Kelly G, Wu G, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011;39:D691–D697. doi: 10.1093/nar/gkq1018. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Database: The Journal of Biological Databases and Curation are provided here courtesy of Oxford University Press

RESOURCES