Abstract
We previously presented YM500, an integrated database for miRNA quantification, isomiR identification, arm switching discovery and novel miRNA prediction from 468 human smRNA-seq datasets. In this updated database, YM500v2 (http://ngs.ym.edu.tw/ym500/), we focus on the cancer miRNome to make the database more disease-oriented. New miRNA-related algorithms developed after YM500 are included in YM500v2 and, more significantly, more than 8000 cancer-related smRNA-seq datasets (including those of primary tumors, paired normal tissues, PBMCs, recurrent tumors and metastatic tumors) have been incorporated. Novel miRNAs (miRNAs not included in miRBase R21) were not only predicted by three independent algorithms but also cleaned by a new in silico filtration strategy and validated by wet-lab data such as Cross-Linked ImmunoPrecipitation sequencing (CLIP-seq) to reduce the false-positive rate. A new function, ‘Meta-analysis’, allows users to identify differentially expressed miRNAs and arm-switching events in real time according to user-defined sample groups and dozens of clinical criteria curated by experienced clinicians. The cancer miRNAs identified hold potential for both basic research and biotech applications.
INTRODUCTION
MicroRNAs (miRNAs), a small RNA species of ∼22 nt in length, control gene expression and are involved in the regulation of a variety of biological processes including development, differentiation, cell proliferation, metabolism and inflammation, as well as in human diseases (1). It is estimated that more than 60% of human protein-coding genes contain miRNA target sites within their 3′ UTRs (2). miRNAs can function as oncogenes or tumor suppressors, as do miR-21 and let-7, respectively, and improved knowledge of them has been one of the defining developments in cancer research over the past decade. The abnormal expression of miRNAs and the dysregulation of factors that regulate miRNAs contribute to tumor progression (3). Owing to the stability of miRNAs in clinical samples, miRNAs have been regarded as potential prognostic indicators in cancers (4) and as biomarkers for cancer classification (5–8). Many miRNA signatures have been proposed for patient prognosis and clinical response and are being investigated in clinical trials (9). In addition, miRNAs have emerged as therapeutic targets for cancer treatment. The function of miRNAs can be efficiently and specifically restored by artificial miRNA mimics or inhibited by antagomirs, supporting their potential as novel therapeutic strategies for diseases (9–13). MRX34, a synthetic miR-34a mimic loaded in liposomes, is the first miRNA mimic to enter the clinic for cancer therapy, in 2013 (14).
In the past few years, because of the increasing use of next generation sequencing (NGS), enormous amounts of small RNA sequencing data have been generated from large-scale cancer projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC). These data help researchers to investigate the roles of miRNAs in cancer progression. However, translating such massive amounts of data into information that can be easily interpreted and accessed remains a challenge.
Previously, we developed YM500 (15), a database that includes integrated pipelines for miRNA quantification, isomiR identification, arm switching discovery and novel miRNA prediction from smRNA-seq. YM500 provides researchers with integrated miRNA-related information and various graphical visualization pages from 468 human and 141 mouse smRNA-seq datasets via a user-friendly web interface. No particular biological question or disease was emphasized in YM500. Here we present YM500v2, an updated version of the database. In addition to including more smRNA-seq results (>8000 TCGA cancer-related datasets), this version focuses on human cancer research. Other new features include novel miRNA identification and validation: novel miRNAs, i.e. those not included in miRBase R21, were identified from thousands of samples, and a new strategy was developed for filtering out false-positive discoveries. In particular, dozens of CLIP-seq datasets are provided as experimental evidence for in silico predicted novel miRNAs. Regarding isomiRs, two new statistical charts are provided for each isomiR. YM500v2 also contains a new function, ‘Meta-analysis’, which allows researchers to identify differentially expressed miRNAs and arm-switching events between two user-defined groups of samples.
DATA COLLECTION AND PRE-PROCESSING
YM500v2 incorporates 8105 smRNA-seq datasets from TCGA, including those of primary tumors, paired normal tissues, peripheral blood mononuclear cells (PBMCs), recurrent tumors and metastatic tumors (Supplemental Table S1). Raw data were downloaded from CGHub (https://cghub.ucsc.edu/) and were pre-processed by in-house scripts. To avoid interference from low-quality reads, we discarded reads that did not meet the following criterion: Phred score >30 in >90% of the sequence. About 30–55% of reads were excluded in this step, but millions of reads still remained for each dataset. As shown in Supplemental Figure S1, this filtration step dramatically improves data quality, which is important for novel miRNA and isomiR identification. The other pre-processing steps were the same as those followed in our previous study (15). The clinical information of each sample was manually curated by experienced clinicians based on clinical data obtained from TCGA. Each sample was annotated with 35 clinical characteristics.
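The read-quality criterion above can be illustrated with a minimal Python sketch. It is not the in-house script used for YM500v2. It assumes Phred+33 (Sanger/Illumina 1.8+) quality encoding and interprets the criterion as "more than 90% of the bases in a read have a Phred score above 30"; the function name and parameters are hypothetical.

```python
def passes_quality_filter(quality, min_phred=30, min_fraction=0.90, offset=33):
    """Return True if more than `min_fraction` of bases score above `min_phred`.

    `quality` is the quality line of one FASTQ record.
    Assumption: Phred+33 encoding (score = ASCII code - 33).
    """
    if not quality:
        return False  # an empty read cannot satisfy the criterion
    good_bases = sum(1 for ch in quality if ord(ch) - offset > min_phred)
    return good_bases / len(quality) > min_fraction


def filter_reads(records):
    """Keep (sequence, quality) pairs whose quality string passes the filter."""
    return [(seq, qual) for seq, qual in records if passes_quality_filter(qual)]


# Toy records: 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
reads = [
    ("TGAGGTAGTAGGTTGTATAGTT", "I" * 22),        # all bases high quality
    ("TAGCTTATCAGACTGATGTTGA", "I" * 18 + "####"),  # about 82% high quality
]
kept = filter_reads(reads)
```

Both inequalities are strict here because the text states ">30" and ">90%"; a pipeline that used inclusive thresholds would keep slightly more reads.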
NOVEL miRNA PREDICTION
In this version, 22 CLIP-seq datasets are incorporated as experimental evidence for the putative novel miRNAs, and a new strategy was developed to reduce the false-positive rate. It has been noted that there are numerous advantages in conducting novel miRNA prediction by performing a single analysis of the pooled data (16–18). After pre-processing, we pooled 8105 and 468 smRNA-seq datasets from TCGA and the Gene Expression Omnibus (human datasets in YM500), respectively, comprising approximately 11.1 billion reads from 193 million unique sequences. When analyzing the sequences, we found that 76.7% of sequences were supported by only one read across all samples, whereas only 0.41% of sequences were supported by more than 100 reads across the ∼8600 samples (Supplemental Figure S2). This result indicates that most sequences appear in only one sample. We assume that a sequence that genuinely exists in humans will be supported by multiple reads from multiple independent experiments. Thus, to reduce the false-positive rate, only sequences supported by >100 reads in >10 independent datasets were used for novel miRNA prediction according to the pipeline described in our previous study (15). In brief, the pooled dataset was used for novel miRNA prediction by miRDeep2 (19), mireap (20) and miRanalyzer (21). After unifying the prediction results of the three algorithms and filtering out novel miRNAs similar to known transcripts, 3467 putative novel miRNAs were predicted by at least one algorithm, and 1408 of those were supported by CLIP-seq data (Supplemental Figure S3). The alignment results of both smRNA-seq and CLIP-seq are provided in the web interface for each putative novel miRNA (Figure 1).
Figure 1.

A representative example of a ‘novel miRNA’ page, showing the alignment results of the reads from smRNA-seq (A) and CLIP-seq (B).
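The read-support criterion used in this section (>100 reads in >10 independent datasets) might be sketched as follows. This is an illustration rather than the YM500v2 pipeline: it assumes that the read count is summed over all pooled datasets and that a dataset counts as support when it contains at least one read of the sequence. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def select_supported_sequences(counts_per_dataset, min_reads=100, min_datasets=10):
    """Return sequences with > min_reads pooled reads in > min_datasets datasets.

    `counts_per_dataset` maps a dataset ID to a dict {sequence: read count}.
    Assumption: reads are pooled across datasets before thresholding.
    """
    total_reads = defaultdict(int)
    dataset_support = defaultdict(int)
    for counts in counts_per_dataset.values():
        for sequence, n_reads in counts.items():
            if n_reads > 0:
                total_reads[sequence] += n_reads
                dataset_support[sequence] += 1
    return {
        sequence
        for sequence in total_reads
        if total_reads[sequence] > min_reads
        and dataset_support[sequence] > min_datasets
    }


# Toy data: 11 datasets, each with 10 reads of sequence "A" and 1 read of "C";
# sequence "B" has 500 reads but appears in a single dataset only.
toy = {f"dataset_{i}": {"A": 10, "C": 1} for i in range(11)}
toy["dataset_0"]["B"] = 500
supported = select_supported_sequences(toy)
```

In this toy input only "A" survives: "B" is abundant but confined to one dataset, and "C" is widespread but too rare, which mirrors the reasoning that genuine sequences should recur across independent experiments.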
META-ANALYSIS
YM500v2 adds a new function, ‘Meta-analysis’, which allows researchers to identify differentially expressed miRNAs and arm-switching events between two user-defined sets of samples. We use an R/Bioconductor package, DESeq (22,23), and the algorithm described in our previous study (15) to identify differentially expressed miRNAs and arm-switching events, respectively. As shown in Supplemental Figure S4, users can select one or multiple datasets and define the sample type they would like to investigate, such as ‘Primary Solid Tumor’ or ‘Solid Tissue Normal’. In addition, we provide a list of clinical criteria, such as ICD-O-3 histology, tumor stage, distant metastasis and lymph node status, to help researchers select a subgroup of well-defined cancer samples according to one or multiple clinical parameters. After selecting two groups of samples, users can review the detailed clinical information of the two groups before submitting the job to the server for real-time calculation. The user then receives a notification email with a Result ID and, when the job is completed, can view a visualization of the results (Figure 2) in the ‘Result and Download’ section.
Figure 2.

Screenshot of the results page for ‘Meta-analysis’.
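The sample-selection step of ‘Meta-analysis’ can be pictured with a small Python sketch. It covers only the grouping of samples by sample type and clinical criteria; the differential expression test itself is performed with DESeq in YM500v2 and is not reproduced here. The record layout and field names such as `tumor_stage` are hypothetical and do not reflect the actual database schema.

```python
def select_samples(samples, **criteria):
    """Return IDs of samples whose annotations satisfy every criterion.

    Each criterion value may be a single value or a collection of accepted
    values. Field names are illustrative assumptions, not the YM500v2 schema.
    """
    selected = []
    for sample in samples:
        matches = True
        for field, accepted in criteria.items():
            if not isinstance(accepted, (set, list, tuple)):
                accepted = (accepted,)
            if sample.get(field) not in accepted:
                matches = False
                break
        if matches:
            selected.append(sample["sample_id"])
    return selected


# Toy annotations for four samples.
samples = [
    {"sample_id": "S1", "sample_type": "Primary Solid Tumor", "tumor_stage": "Stage I"},
    {"sample_id": "S2", "sample_type": "Primary Solid Tumor", "tumor_stage": "Stage III"},
    {"sample_id": "S3", "sample_type": "Solid Tissue Normal", "tumor_stage": None},
    {"sample_id": "S4", "sample_type": "Primary Solid Tumor", "tumor_stage": "Stage IV"},
]

# Group 1: late-stage primary tumors; group 2: normal tissues.
group_1 = select_samples(samples, sample_type="Primary Solid Tumor",
                         tumor_stage={"Stage III", "Stage IV"})
group_2 = select_samples(samples, sample_type="Solid Tissue Normal")
```

The two resulting ID lists correspond to the two groups whose clinical information a user would review before submitting the job.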
ISOmiRS
Two statistical charts have been added to the ‘IsomiR’ section to help researchers visualize the expression profile of an isomiR across distinct tissues. miR-21-5p+CA, an isomiR of miR-21-5p that is adenylated by PAPD5 and reported to lead to the degradation of miR-21-5p (24), is used as an example (Figure 3). Figure 3A shows the percentage of samples that expressed this isomiR in various tissues, and Figure 3B shows the mean expression of the isomiR in those tissues.
Figure 3.

Two new charts for isomiRs. Panel (A) shows the percentage of samples that expressed the isomiR in various tissues, and panel (B) indicates the mean expression of a specific isomiR.
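The two per-tissue statistics behind these charts can be computed as in the sketch below. It assumes that a sample "expresses" an isomiR when its expression value is greater than zero and that the mean is taken over all samples of a tissue, including non-expressing ones; the text does not state either detail. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def summarize_isomir(records):
    """Summarize one isomiR per tissue.

    `records` is an iterable of (tissue, expression) pairs, one per sample.
    Returns {tissue: {"percent_expressed": ..., "mean_expression": ...}}.
    Assumption: expression > 0 means the sample expresses the isomiR.
    """
    values_by_tissue = defaultdict(list)
    for tissue, expression in records:
        values_by_tissue[tissue].append(expression)

    summary = {}
    for tissue, values in values_by_tissue.items():
        n_expressed = sum(1 for value in values if value > 0)
        summary[tissue] = {
            "percent_expressed": 100.0 * n_expressed / len(values),
            "mean_expression": sum(values) / len(values),
        }
    return summary


# Toy expression values for one isomiR in two tissues.
records = [("breast", 0), ("breast", 10), ("lung", 4), ("lung", 6)]
summary = summarize_isomir(records)
```

With the toy input, both tissues have the same mean expression (5.0) but differ in the share of expressing samples (50% versus 100%), which is why the two charts are complementary.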
DISCUSSION
The decreasing cost of sequencing technology has led to large amounts of small RNA sequencing data from cancer-related studies. To make the most of these valuable yet massive amounts of data, we have updated our YM500 database to focus on the human cancer miRNome. More than 8000 cancer-related samples have been incorporated into YM500v2, and a new function, ‘Meta-analysis’, has been added so that researchers can fully utilize these thousands of cancer samples. With the updated database, researchers can, for example, identify a specific set of miRNAs relevant to a distinct question, provided that they define samples according to a clear biological or clinical goal. To this end, the ‘Meta-analysis’ function allows a user to select two groups of samples using dozens of clinical criteria and then identify differentially expressed miRNAs and arm-switching events between the two groups (Figure 2).
Putative novel miRNAs in YM500v2 were identified from thousands of samples and, more importantly, a new strategy was developed for filtering out false-positive discoveries. Compared with the previous version of YM500, YM500v2 contains fewer putative novel miRNAs (down from ∼11 000 to ∼3500) but more novel miRNAs identified by all three prediction algorithms (up from 90 to 189). Moreover, 40.61% (1408/3467) of the putative novel miRNAs are supported by CLIP-seq data, and the web interface displays the alignment results of both smRNA-seq and CLIP-seq.
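The counts quoted above (candidates predicted by at least one algorithm, candidates predicted by all three, and the CLIP-seq-supported fraction) can be tallied as in the following sketch. It assumes that candidates from different algorithms have already been unified under shared identifiers, which in practice would require matching by genomic location; the function and the candidate IDs are hypothetical.

```python
def tally_support(predictions, clip_supported):
    """Tally candidate miRNAs by algorithm and CLIP-seq support.

    `predictions` maps an algorithm name to the set of candidate IDs it
    predicted; `clip_supported` is the set of IDs with CLIP-seq evidence.
    Assumption: identical IDs denote the same candidate across algorithms.
    """
    algorithms_by_candidate = {}
    for algorithm, candidates in predictions.items():
        for candidate in candidates:
            algorithms_by_candidate.setdefault(candidate, set()).add(algorithm)

    total = len(algorithms_by_candidate)
    by_all = sum(1 for algos in algorithms_by_candidate.values()
                 if len(algos) == len(predictions))
    with_clip = sum(1 for c in algorithms_by_candidate if c in clip_supported)
    percent = round(100.0 * with_clip / total, 2) if total else 0.0
    return {"total": total, "by_all_algorithms": by_all,
            "clip_supported": with_clip, "clip_percent": percent}


# Toy predictions from the three tools named in the text.
predictions = {
    "miRDeep2": {"cand1", "cand2"},
    "mireap": {"cand1", "cand3"},
    "miRanalyzer": {"cand1"},
}
result = tally_support(predictions, clip_supported={"cand1", "cand3"})

# The reported fraction follows directly from the published counts.
reported_percent = round(100.0 * 1408 / 3467, 2)
```

The last line reproduces the 40.61% figure from the counts 1408 and 3467 given in the text.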
Accumulating evidence suggests that a miRNA locus can produce a series of isomiRs during miRNA biogenesis (24–27). IsomiRs contribute substantially to the dynamic miRNome and, at the same time, present a challenge for miRNA studies. A few isomiRs have been proven to be functional regulatory molecules (24,27–31), and more studies of isomiRs should be performed and taken into consideration (32). Several in silico tools have been developed for identifying isomiRs from smRNA-seq data (33–36), and such tools will greatly promote the discovery of isomiRs and their functional roles. When researchers discover significant isomiRs in their own studies, YM500v2 provides existing evidence for those isomiRs from a large collection of smRNA-seq datasets. A representative example of isomiRs is shown in Figure 3.
NGS has become the norm for large-scale cancer research, and cancer-related smRNA-seq data will accumulate rapidly in the future. YM500 will be updated periodically to incorporate new smRNA-seq data. We have developed a pipeline to process new data and to incorporate them into the database semiautomatically. Newly incorporated smRNA-seq data will also be re-annotated for meta-analysis. YM500 will continue to be an informative and valuable database for miRNA studies.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
Acknowledgments
We are grateful to the National Center for High-performance Computing for computer time and facilities and thank the TCGA research network for the availability of data.
Footnotes
The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.
FUNDING
Ministry of Science and Technology (MOST) [101-2320-B-010-059-MY3, 102-2314-B-010-045, 102-2321-B-010-010, 103-2622-B-010-002]; National Health Research Institutes [NHRI-EX102-10254SI]; Veterans General Hospitals University System of Taiwan (VGHUST) Joint Research Program; Tsou's Foundation [VGHUST102-G7-3-2]; UST-UCSD International Center for Excellence in Advanced Bioengineering sponsored by the Taiwan NSC I-RiCE Program [102-2911-I-009-101]; National Yang-Ming University (Ministry of Education, Aim for the Top University Plan). Funding for open access charge: MOST [101-2320-B-010-059-MY3, 102-2314-B-010-045, 102-2321-B-010-010, 103-2622-B-010-002]; National Health Research Institutes [NHRI-EX102-10254SI]; VGHUST Joint Research Program; Tsou's Foundation [VGHUST102-G7-3-2]; UST-UCSD International Center for Excellence in Advanced Bioengineering sponsored by the Taiwan NSC I-RiCE Program [102-2911-I-009-101]; National Yang-Ming University (Ministry of Education, Aim for the Top University Plan).
Conflict of interest statement. None declared.
REFERENCES
- 1. Mendell J.T., Olson E.N. MicroRNAs in stress signaling and human disease. Cell. 2012;148:1172–1187. doi: 10.1016/j.cell.2012.02.005.
- 2. Friedman R.C., Farh K.K., Burge C.B., Bartel D.P. Most mammalian mRNAs are conserved targets of microRNAs. Genome Res. 2009;19:92–105. doi: 10.1101/gr.082701.108.
- 3. Adams B.D., Kasinski A.L., Slack F.J. Aberrant regulation and function of microRNAs in cancer. Curr. Biol. 2014;24:R762–R776. doi: 10.1016/j.cub.2014.06.043.
- 4. Lu J., Getz G., Miska E.A., Alvarez-Saavedra E., Lamb J., Peck D., Sweet-Cordero A., Ebert B.L., Mak R.H., Ferrando A.A., et al. MicroRNA expression profiles classify human cancers. Nature. 2005;435:834–838. doi: 10.1038/nature03702.
- 5. Liu C., Kelnar K., Vlassov A.V., Brown D., Wang J., Tang D.G. Distinct microRNA expression profiles in prostate cancer stem/progenitor cells and tumor-suppressive functions of let-7. Cancer Res. 2012;72:3393–3404. doi: 10.1158/0008-5472.CAN-11-3864.
- 6. Andorfer C.A., Necela B.M., Thompson E.A., Perez E.A. MicroRNA signatures: clinical biomarkers for the diagnosis and treatment of breast cancer. Trends Mol. Med. 2011;17:313–319. doi: 10.1016/j.molmed.2011.01.006.
- 7. Farazi T.A., Ten Hoeve J.J., Brown M., Mihailovic A., Horlings H.M., van de Vijver M.J., Tuschl T., Wessels L.F. Identification of distinct miRNA target regulation between breast cancer molecular subtypes using AGO2-PAR-CLIP and patient datasets. Genome Biol. 2014;15:R9. doi: 10.1186/gb-2014-15-1-r9.
- 8. Garzon R., Volinia S., Liu C.G., Fernandez-Cymering C., Palumbo T., Pichiorri F., Fabbri M., Coombes K., Alder H., Nakamura T., et al. MicroRNA signatures associated with cytogenetics and prognosis in acute myeloid leukemia. Blood. 2008;111:3183–3189. doi: 10.1182/blood-2007-07-098749.
- 9. Hayes J., Peruzzi P.P., Lawler S. MicroRNAs in cancer: biomarkers, functions and therapy. Trends Mol. Med. 2014;20:460–469. doi: 10.1016/j.molmed.2014.06.005.
- 10. Yin H., Kanasty R.L., Eltoukhy A.A., Vegas A.J., Dorkin J.R., Anderson D.G. Non-viral vectors for gene-based therapy. Nat. Rev. Genet. 2014;15:541–555. doi: 10.1038/nrg3763.
- 11. van Rooij E., Kauppinen S. Development of microRNA therapeutics is coming of age. EMBO Mol. Med. 2014;6:851–864. doi: 10.15252/emmm.201100899.
- 12. Li Z., Rana T.M. Therapeutic targeting of microRNAs: current status and future challenges. Nat. Rev. Drug Discov. 2014;13:622–638. doi: 10.1038/nrd4359.
- 13. Bader A.G., Brown D., Winkler M. The promise of microRNA replacement therapy. Cancer Res. 2010;70:7027–7030. doi: 10.1158/0008-5472.CAN-10-2010.
- 14. Bouchie A. First microRNA mimic enters clinic. Nat. Biotechnol. 2013;31:577–577. doi: 10.1038/nbt0713-577.
- 15. Cheng W.C., Chung I.F., Huang T.S., Chang S.T., Sun H.J., Tsai C.F., Liang M.L., Wong T.T., Wang H.W. YM500: a small RNA sequencing (smRNA-seq) database for microRNA research. Nucleic Acids Res. 2013;41:D285–D294. doi: 10.1093/nar/gks1238.
- 16. Friedlander M.R., Lizano E., Houben A.J., Bezdan D., Banez-Coronel M., Kudla G., Mateu-Huertas E., Kagerbauer B., Gonzalez J., Chen K.C., et al. Evidence for the biogenesis of more than 1,000 novel human microRNAs. Genome Biol. 2014;15:R57. doi: 10.1186/gb-2014-15-4-r57.
- 17. Ladewig E., Okamura K., Flynt A.S., Westholm J.O., Lai E.C. Discovery of hundreds of mirtrons in mouse and human small RNA data. Genome Res. 2012;22:1634–1645. doi: 10.1101/gr.133553.111.
- 18. Friedlander M.R., Mackowiak S.D., Li N., Chen W., Rajewsky N. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res. 2012;40:37–52. doi: 10.1093/nar/gkr688.
- 19. Friedlander M.R., Chen W., Adamidi C., Maaskola J., Einspanier R., Knespel S., Rajewsky N. Discovering microRNAs from deep sequencing data using miRDeep. Nat. Biotechnol. 2008;26:407–415. doi: 10.1038/nbt1394.
- 20. Chen X., Li Q., Wang J., Guo X., Jiang X., Ren Z., Weng C., Sun G., Wang X., Liu Y., et al. Identification and characterization of novel amphioxus microRNAs by Solexa sequencing. Genome Biol. 2009;10:R78. doi: 10.1186/gb-2009-10-7-r78.
- 21. Hackenberg M., Sturm M., Langenberger D., Falcon-Perez J.M., Aransay A.M. miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Res. 2009;37:W68–W76. doi: 10.1093/nar/gkp347.
- 22. Anders S., Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106.
- 23. Anders S., McCarthy D.J., Chen Y., Okoniewski M., Smyth G.K., Huber W., Robinson M.D. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat. Protoc. 2013;8:1765–1786. doi: 10.1038/nprot.2013.099.
- 24. Boele J., Persson H., Shin J.W., Ishizu Y., Newie I.S., Sokilde R., Hawkins S.M., Coarfa C., Ikeda K., Takayama K., et al. PAPD5-mediated 3′ adenylation and subsequent degradation of miR-21 is disrupted in proliferative disease. Proc. Natl Acad. Sci. U.S.A. 2014;111:11467–11472. doi: 10.1073/pnas.1317751111.
- 25. Fukunaga R., Han B.W., Hung J.H., Xu J., Weng Z., Zamore P.D. Dicer partner proteins tune the length of mature miRNAs in flies and mammals. Cell. 2012;151:533–546. doi: 10.1016/j.cell.2012.09.027.
- 26. Wu H., Ye C., Ramirez D., Manjunath N. Alternative processing of primary microRNA transcripts by Drosha generates 5′ end variation of mature microRNA. PLoS One. 2009;4:e7566. doi: 10.1371/journal.pone.0007566.
- 27. Ma H., Wu Y., Choi J.G., Wu H. Lower and upper stem-single-stranded RNA junctions together determine the Drosha cleavage site. Proc. Natl Acad. Sci. U.S.A. 2013;110:20687–20692. doi: 10.1073/pnas.1311639110.
- 28. Hinton A., Hunter S.E., Afrikanova I., Jones G.A., Lopez A.D., Fogel G.B., Hayek A., King C.C. sRNA-seq analysis of human embryonic stem cells and definitive endoderm reveal differentially expressed microRNAs and novel isomiRs with distinct targets. Stem Cells. 2014;32:2360–2372. doi: 10.1002/stem.1739.
- 29. Tan G.C., Chan E., Molnar A., Sarkar R., Alexieva D., Isa I.M., Robinson S., Zhang S., Ellis P., Langford C.F., et al. 5′ isomiR variation is of functional and evolutionary importance. Nucleic Acids Res. 2014;42:9424–9435. doi: 10.1093/nar/gku656.
- 30. Xia J., Zhang W. A meta-analysis revealed insights into the sources, conservation and impact of microRNA 5′-isoforms in four model species. Nucleic Acids Res. 2014;42:1427–1441. doi: 10.1093/nar/gkt967.
- 31. Guo L., Zhao Y., Yang S., Zhang H., Chen F. A genome-wide screen for non-template nucleotides and isomiR repertoires in miRNAs indicates dynamic and versatile microRNAome. Mol. Biol. Rep. 2014;41:6649–6658. doi: 10.1007/s11033-014-3548-0.
- 32. Guo L., Chen F. A challenge for miRNA: multiple isomiRs in miRNAomics. Gene. 2014;544:1–7. doi: 10.1016/j.gene.2014.04.039.
- 33. Humphreys D.T., Suter C.M. miRspring: a compact standalone research tool for analyzing miRNA-seq data. Nucleic Acids Res. 2013;41:e147. doi: 10.1093/nar/gkt485.
- 34. Giurato G., De Filippo M.R., Rinaldi A., Hashim A., Nassa G., Ravo M., Rizzo F., Tarallo R., Weisz A. iMir: an integrated pipeline for high-throughput analysis of small non-coding RNA data obtained by smallRNA-Seq. BMC Bioinformatics. 2013;14:362. doi: 10.1186/1471-2105-14-362.
- 35. de Oliveira L.F., Christoff A.P., Margis R. isomiRID: a framework to identify microRNA isoforms. Bioinformatics. 2013;29:2521–2523. doi: 10.1093/bioinformatics/btt424.
- 36. Luo G.Z., Yang W., Ma Y.K., Wang X.J. ISRNA: an integrative online toolkit for short reads from high-throughput sequencing data. Bioinformatics. 2014;30:434–436. doi: 10.1093/bioinformatics/btt678.
