Abstract
The number of published reports using next-generation sequencing (NGS) technology in cancer research is increasing. These technologies generate large amounts of data that need to be presented appropriately and made available to other researchers for further use. Our goal was to create a comprehensive database of single nucleotide polymorphisms (SNPs) associated with different types of cancer and to integrate them into our bioinformatics tools. We reviewed more than 200 scientific papers and extracted relevant information on mutations detected by NGS technology. The current version of the database contains more than 100,000 mutations in more than 70 types of cancer. However, our review of NGS studies revealed great variation in how NGS data are presented in the scientific literature, with almost no effort to standardize the data format. NGS results are published in a variety of forms, which hinders the gathering of information. We therefore suggest a uniform format for presenting NGS data. This will allow faster database development, easier access, and data sharing between laboratories. The database will be a useful tool for many researchers in the field of cancer research and can serve as a basis for a range of studies, such as genome-wide association studies, microRNA target binding studies, and the development of cancer biomarkers.
Keywords: Cancer, database, next generation sequencing (NGS), single nucleotide polymorphism (SNP)
Next-generation sequencing (NGS), also known as massively parallel sequencing, is rapidly transforming biomedical and biological research from the single-gene to the genome scale [1-3]. Within only a few years of the advent of NGS technologies, researchers can now apply whole-exome sequencing (Exome-Seq), whole-genome sequencing (WGS), whole-transcriptome sequencing (RNA-Seq), or a combination of them to investigate individual genomes, especially those related to disease. Next-generation sequencing technologies have demonstrated their power in detecting disease-causing genetic variants in human diseases [2,4,5], especially in cancer [2,6,7]. The use of NGS technology in cancer studies and the number of resulting publications are rapidly increasing. Consequently, a massive amount of data is being generated. Several online databases already attempt to summarize these data and present them in a comprehensive way. The Next Generation Sequencing Catalog (NGS Catalog, http://bioinfo.mc.vanderbilt.edu/NGS/) [2] collects scientific literature on NGS studies, with hyperlinks to the publications. The International Cancer Genome Consortium Data Portal (ICGC, https://dcc.icgc.org/) provides tools for visualizing, querying, and downloading the data released by the consortium's member projects. These projects systematically and comprehensively characterize somatic mutations in 50 different cancer types and subtypes using high-throughput NGS technologies. Their goal is to bring these data rapidly to the cancer research community in order to accelerate studies on the causes of cancer, to enhance the accuracy of diagnosis, and to improve treatments. The Catalogue Of Somatic Mutations In Cancer (COSMIC, http://cancer.sanger.ac.uk/cosmic) [8] is designed to store and display somatic mutation information and related details for human cancers.
As the number of published articles reporting NGS findings in cancer grows, it is important that the data are collected and presented in a suitable way. Our goal was to create an organized and comprehensive database of mutations in different types of cancer, which will be integrated into our previously developed bioinformatics tools, such as the miRNA SNiPer, a tool that enables identification of polymorphisms residing within miRNA genes [9]. We reviewed the scientific literature and databases on NGS studies in cancer research and collected relevant information on mutations detected by NGS technology. Furthermore, our literature review revealed that papers from different research groups vary greatly and that researchers publish their findings in a variety of forms. This makes gathering information from them a considerable challenge; therefore, a uniform format for presenting the results is needed.
We collected over 200 papers from the following databases: PubMed, PubMed Central, ScienceDirect, and NGS Catalog, using keywords such as "cancer", "mutations", "whole genome sequencing", "exome sequencing", and "next generation sequencing". We gathered relevant information on mutations detected by NGS technology: gene name (approved by the HUGO Gene Nomenclature Committee), chromosome number, Ensembl gene ID, Entrez gene ID, transcript information, rs number (if available), position of the mutation in the genome, cDNA and protein (amino acid replacement in single-letter code), type of cancer, mutation variant, technology used (platform), digital object identifier (DOI), reference, and hyperlink to the publication. The data were collected in an Excel table. Based on the systematic review of over 200 papers describing the use of NGS in cancer research, we proposed a simple, intelligible, easily accessible, and ready-to-use database containing fifteen relevant pieces of information on each mutation and a hyperlink to the publication.
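The list of collected fields can be illustrated with a minimal Python sketch of one database row. The class name, field names, and CSV output are assumptions made for this example, not the layout of the actual Excel table, and every value in the placeholder record is invented and does not describe a real mutation.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List, Optional


@dataclass
class MutationRecord:
    """One row of the table; field names are hypothetical."""
    gene_name: str            # HGNC-approved gene symbol
    chromosome: str
    ensembl_gene_id: str
    entrez_gene_id: str
    transcript: str
    rs_number: Optional[str]  # None when no reference SNP ID is available
    genomic_position: str
    cdna_change: str
    protein_change: str       # amino acid replacement, single-letter code
    cancer_type: str
    mutation_variant: str
    platform: str             # sequencing technology used
    doi: str
    reference: str
    url: str                  # hyperlink to the publication


def records_to_csv(records: List[MutationRecord]) -> str:
    """Serialize records to CSV text, one column per field, one row per mutation."""
    column_names = [f.name for f in fields(MutationRecord)]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=column_names)
    writer.writeheader()
    for record in records:
        row = asdict(record)
        row["rs_number"] = row["rs_number"] or ""  # empty cell if unknown
        writer.writerow(row)
    return buffer.getvalue()


# Placeholder values only; none of them refers to a real gene or paper.
placeholder = MutationRecord(
    gene_name="GENE1", chromosome="1", ensembl_gene_id="ENSG_PLACEHOLDER",
    entrez_gene_id="0", transcript="TRANSCRIPT_PLACEHOLDER", rs_number=None,
    genomic_position="0", cdna_change="c.0A>G", protein_change="X0Y",
    cancer_type="example cancer", mutation_variant="missense",
    platform="example platform", doi="10.0000/placeholder",
    reference="Placeholder et al.", url="https://example.org/placeholder",
)
table_text = records_to_csv([placeholder])
```

Keeping the rs number optional reflects the text above, where it is recorded only if available.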
The use of NGS technology in cancer research is increasing, and as a result a growing amount of data is being generated. With this in mind, we created a comprehensive database of 109,028 mutations associated with more than 70 different types of cancer. Almost 12,000 SNPs in our database have a known reference SNP ID number (rs number). These will be integrated into the miRNA SNiPer tool for subsequent functional annotation of miRNA genes. Based on our extensive literature review, we also developed a template for standardizing the presentation of NGS data. Each column of the database represents a category of information (e.g., gene name, type of cancer, etc.) and each row represents a single mutation in a particular gene (Figure 1). The database includes diverse types of information on mutations and is designed to make the data easily accessible for further use.
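As an illustration of how the subset of SNPs with a known rs number might be selected from such a table, the following Python sketch filters rows by a well-formed reference SNP ID. The column name `rs_number`, the function name, and the example rows are assumptions for this sketch; the rs values are placeholders, not real identifiers from the database.

```python
import re
from typing import Dict, List, Optional

# A reference SNP ID is written as "rs" followed by digits.
RS_ID = re.compile(r"rs\d+")


def rows_with_rs_number(rows: List[Dict[str, Optional[str]]]) -> List[Dict[str, Optional[str]]]:
    """Keep only rows whose rs_number cell holds a well-formed rs ID."""
    selected = []
    for row in rows:
        value = (row.get("rs_number") or "").strip()
        if RS_ID.fullmatch(value):
            selected.append(row)
    return selected


# Placeholder rows; gene names and rs numbers are invented.
rows = [
    {"gene_name": "GENE1", "rs_number": "rs0000001"},
    {"gene_name": "GENE2", "rs_number": ""},
    {"gene_name": "GENE3", "rs_number": None},
    {"gene_name": "GENE4", "rs_number": " rs0000002 "},
]
annotated = rows_with_rs_number(rows)
```

Stripping whitespace before matching is a design choice for this sketch, since manually curated cells often carry stray spaces.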
Our literature review revealed that most research groups performing NGS studies pay little attention to data standardization, which is important for making the data available to other researchers. Papers from different research groups vary greatly and researchers publish their findings in a variety of forms, which makes gathering information from them challenging. For example, some publications have attached Excel files with clearly sorted data [10,11], which makes final data editing easier. However, in some cases the data are stored in .txt, .doc, or .pdf format, or in tables rotated by 90°. Data presented like this are difficult to manage, and each piece of information has to be extracted manually, which is very time-consuming. Some publications are very well written with easily accessible results, while others hinder further use of the results.
NGS data are extensive; therefore, a uniform and simple format for presenting the results is needed. Our database could serve as a template and an example for other researchers, making the increasing amount of data more transparent and easily available. Standardization of the format for NGS data presentation will facilitate further development of our database and help other researchers in NGS and cancer studies. The data in the database can be sorted by different types of information in order to create specialized databases, for example for individual cancer types of interest. The data can also be analyzed using various bioinformatics tools and can be useful in the search for new gene hubs and potential biological pathways. Our database will be useful to researchers involved in genome-wide association studies, development of cancer biomarkers, prioritization of genomic loci, and further functional studies or genomic overlap analyses, such as overlap with QTL, miRNA genes, and miRNA binding sites.
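Deriving specialized subsets from a uniformly formatted table can be sketched in a few lines of Python. This is a minimal illustration under assumed column names (`cancer_type`, `gene_name`) and with invented placeholder rows; it is not the procedure used to build the database.

```python
from collections import defaultdict
from typing import Dict, List


def split_by_column(rows: List[Dict[str, str]], column: str) -> Dict[str, List[Dict[str, str]]]:
    """Group rows by the value in `column`, sorting each group by gene name."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[column]].append(row)
    return {
        key: sorted(group, key=lambda r: r["gene_name"])
        for key, group in subsets.items()
    }


# Placeholder rows; cancer types and gene names are invented.
rows = [
    {"gene_name": "GENE2", "cancer_type": "cancer A"},
    {"gene_name": "GENE1", "cancer_type": "cancer A"},
    {"gene_name": "GENE3", "cancer_type": "cancer B"},
]
per_cancer = split_by_column(rows, "cancer_type")
```

The same function could group by any other column, such as platform or chromosome, because every row shares the same set of columns.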
Acknowledgments
This work was supported by the Slovenian Research Agency (ARRS) through the Research Program (P4-0220).
Footnotes
Conflict of interests: The authors declare no conflict of interest.
DISCOVERIES is a peer-reviewed, open access, online, multidisciplinary and integrative journal, publishing high impact and innovative manuscripts from all areas related to MEDICINE, BIOLOGY and CHEMISTRY
References
- 1. Challenges of sequencing human genomes. Koboldt Daniel C, Ding Li, Mardis Elaine R, Wilson Richard K. Briefings in bioinformatics. 2010;11(5):484–98. doi: 10.1093/bib/bbq016.
- 2. NGS catalog: A database of next generation sequencing studies in humans. Xia Junfeng, Wang Qingguo, Jia Peilin, Wang Bing, Pao William, Zhao Zhongming. Human mutation. 2012;33(6):E2341–55. doi: 10.1002/humu.22096.
- 3. Sequencing technologies - the next generation. Metzker Michael L. Nature reviews. Genetics. 2010;11(1):31–46. doi: 10.1038/nrg2626.
- 4. Unlocking Mendelian disease using exome sequencing. Gilissen Christian, Hoischen Alexander, Brunner Han G, Veltman Joris A. Genome biology. 2011;12(9):228. doi: 10.1186/gb-2011-12-9-228.
- 5. Next-generation human genetics. Shendure Jay. Genome Biology. 2011;12(Suppl 1):I14. doi: 10.1186/gb-2011-12-9-408.
- 6. Analysis of next-generation genomic data in cancer: accomplishments and challenges. Ding Li, Wendl Michael C, Koboldt Daniel C, Mardis Elaine R. Human molecular genetics. 2010;19(R2):R188–96. doi: 10.1093/hmg/ddq391.
- 7. Application of second-generation sequencing to cancer genomics. Robison Keith. Briefings in bioinformatics. 2010;11(5):524–34. doi: 10.1093/bib/bbq013.
- 8. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. Bamford S, Dawson E, Forbes S, Clements J, Pettett R, Dogan A, Flanagan A, Teague J, Futreal P A, Stratton M R, Wooster R. British journal of cancer. 2004;91(2):355–8. doi: 10.1038/sj.bjc.6601894.
- 9. Genome-wide in silico screening for microRNA genetic variability in livestock species. Jevsinek Skok D, Godnic I, Zorc M, Horvat S, Dovc P, Kovac M, Kunej T. Animal genetics. 2013;44(6):669–77. doi: 10.1111/age.12072.
- 10. Whole genome sequencing of matched primary and metastatic acral melanomas. Turajlic Samra, Furney Simon J, Lambros Maryou B, Mitsopoulos Costas, Kozarewa Iwanka, Geyer Felipe C, Mackay Alan, Hakas Jarle, Zvelebil Marketa, Lord Christopher J, Ashworth Alan, Thomas Meirion, Stamp Gordon, Larkin James, Reis-Filho Jorge S, Marais Richard. Genome research. 2012;22(2):196–207. doi: 10.1101/gr.125591.111.
- 11. Melanoma genome sequencing reveals frequent PREX2 mutations. Berger Michael F, Hodis Eran, Heffernan Timothy P, Deribe Yonathan Lissanu, Lawrence Michael S, Protopopov Alexei, Ivanova Elena, Watson Ian R, Nickerson Elizabeth, Ghosh Papia, Zhang Hailei, Zeid Rhamy, Ren Xiaojia, Cibulskis Kristian, Sivachenko Andrey Y, Wagle Nikhil, Sucker Antje, Sougnez Carrie, Onofrio Robert, Ambrogio Lauren, Auclair Daniel, Fennell Timothy, Carter Scott L, Drier Yotam, Stojanov Petar, Singer Meredith A, Voet Douglas, Jing Rui, Saksena Gordon, Barretina Jordi, Ramos Alex H, Pugh Trevor J, Stransky Nicolas, Parkin Melissa, Winckler Wendy, Mahan Scott, Ardlie Kristin, Baldwin Jennifer, Wargo Jennifer, Schadendorf Dirk, Meyerson Matthew, Gabriel Stacey B, Golub Todd R, Wagner Stephan N, Lander Eric S, Getz Gad, Chin Lynda, Garraway Levi A. Nature. 2012;485(7399):502–6. doi: 10.1038/nature11071.