ABSTRACT
Long non-coding RNAs (lncRNAs) are an emerging class of non-coding RNAs and potent regulatory elements in living cells. High-throughput RNA sequencing analyses have generated a tremendous amount of transcript sequence data. A large proportion of these transcript sequences do not code for proteins and are known as non-coding RNAs. Among them, lncRNAs are a unique class of transcripts longer than 200 nucleotides with diverse biological functions and regulatory mechanisms. Recent studies and next-generation sequencing technologies indicate that plant genomes contain a substantial number of lncRNAs that are yet to be identified. The computational identification of lncRNAs from these transcripts is challenging because it involves a series of filtering steps. We have developed lncRNADetector, a bioinformatics pipeline for the identification of novel lncRNAs, especially from medicinal and aromatic plant (MAP) species. lncRNADetector has been used to analyse and identify 88,459 lncRNAs from 21 MAP species. To provide a knowledge resource for the plant research community towards elucidating the diversity of biological roles of lncRNAs, the information generated about MAP lncRNAs (after all filtering steps) through lncRNADetector has been stored and organized in the MAPslnc database (https://lncrnapipe.cimap.res.in). The lncRNADetector web server and the MAPslnc database have been developed to help researchers accurately identify lncRNAs from the next-generation sequencing data of different organisms for downstream studies. To the best of our knowledge, no database dedicated to MAP lncRNAs has been available to date.
KEYWORDS: Bioinformatics pipeline, long non-coding RNAs (lncRNAs), medicinal and aromatic plants (MAPs)
Introduction
Recent studies have revealed that RNAs play a regulatory role in diverse biological processes such as development and stress responses in plants [1]. Although it has been hypothesized on the basis of high-throughput genomic platforms that approximately 90% of a genome may be transcribed into RNA, only 1–2% of the transcripts code for proteins and the rest are considered to be non-coding [2]. A significant fraction of non-coding RNAs (ncRNAs) consists of different complex families, including housekeeping RNAs and regulatory RNAs. The housekeeping or structural ncRNAs, including tRNA, rRNA, snRNA and snoRNA, are expressed constitutively in the cell [3]. The regulatory RNAs include small RNAs (miRNAs, siRNAs) and long non-coding RNAs (lncRNAs). Earlier, lncRNAs were considered to be a source of transcriptional noise because of their lack of coding potential and lower expression compared with protein-coding transcripts [4]. More recently, lncRNAs have emerged as potent regulators of gene expression [5,6]. Like mRNAs, lncRNAs are transcribed by RNA polymerase II and spliced. Additionally, the number of lncRNAs found is much higher than the number of protein-coding RNAs (mRNAs) [7]. For example, the lncRNA gene Early nodulin40 (ENOD40) in Medicago truncatula has been reported to play a critical role in nodule formation [8]. In Arabidopsis thaliana, the COOLAIR and COLDAIR lncRNAs participate in the regulation of the FLOWERING LOCUS C (FLC) gene, which affects flowering time [9]. These prominent discoveries of lncRNAs in plants highlight the interest in exploring their biological functions, which in turn may be responsible for regulating important agronomic traits [10]. Beyond their regulatory function, the functional roles and conserved features of lncRNAs remain a matter of current debate and research in the scientific community. Their functional and conserved (evolutionary) aspects can be worked out only once they have been identified completely.
At present, it is difficult to identify all the lncRNAs present in an organism. Bioinformatics makes the task easier and faster. Reports are available on the different computational steps needed for the identification of lncRNAs. A stepwise multi-filtering approach has proved to be effective in discovering thousands of novel lncRNAs in various organisms [11,12]. For this, different tools and software packages have to be run one by one for the step-by-step analyses. There is therefore an urgent need for an easy-to-use bioinformatics pipeline that integrates the series of filtering steps into a single platform and facilitates the discovery of lncRNAs. Towards this end, we have developed lncRNADetector, a bioinformatics pipeline that combines all the analysis steps into a single platform. It takes the assembled transcript sequences in FASTA format as input and writes the output of every step to a separate file. The steps involve the use of CPC (Coding Potential Calculator), a Support Vector Machine-based tool [13], to discriminate between protein-coding and non-coding transcripts. CPC employs UniRef90 as its protein database [14]. Most currently available tools rely on CPC alone, which might categorize novel protein-coding transcripts as non-coding when no matching records exist in the protein database used by CPC. In lncRNADetector, CPC was chosen because it accepts FASTA input from multiple species, and it is combined with other filtering steps that enhance the accuracy of the results. A few resources have been reported earlier for analysing lncRNAs.
For example, iSeeRNA is an SVM-based classifier and standalone tool for discriminating between protein-coding transcripts and lincRNAs (long intergenic non-coding RNAs), a kind of lncRNA transcribed from the intergenic regions of a genome [15]; Sebnif is a bioinformatics pipeline used for identifying high-quality single-exonic lincRNAs [16]; and CPC2 is a fast and accurate coding potential calculator based on sequence intrinsic features [17]. However, a need was felt to further improve on the accuracy of lncRNA identification offered by the existing tools. In our analysis, all the identified novel lncRNAs were stored and managed in the form of the MAPslnc database. lncRNA databases for various species, including plants, have been built, but none of them focuses specifically on medicinal and aromatic plants (MAPs). Earlier, MAPs were considered to be orphan plants, but they are now in high focus and many of them, such as Catharanthus roseus (known for antineoplastic bisindole alkaloids), Papaver somniferum (known for morphinan alkaloids) and Artemisia annua (known for the antimalarial sesquiterpene lactone artemisinin), have acquired the status of model plants [18]. The MAPs that we have chosen are of high importance in terms of their medicinal/aromatic value, and very little information is available on their lncRNAs. The primary objective of developing the MAPslnc database is to provide a specific repository of lncRNAs from MAPs.
Materials and methods
Pipeline overview
The lncRNADetector is a comprehensive bioinformatics pipeline for the identification of lncRNAs from the assembled transcriptome sequencing (NGS) data. It integrates several filtering steps to identify novel lncRNAs (Fig. 1).
Figure 1.
The schematic representation of the workflow of lncRNADetector
Datasets used for developing lncRNADetector
Datasets of various MAP species were used. The input data for lncRNA identification through lncRNADetector were collected in FASTA format from the nucleotide database of the NCBI web portal (https://www.ncbi.nlm.nih.gov). Datasets were selected for MAP species that had a sufficient number of nucleotide sequences with a transcript length above 200 bp.
Standard input file formats
To be compatible with both de novo and reference-based assembly software, we have set FASTA as the default input file format for lncRNADetector. The FASTA format allows easy integration of transcriptome data analysis into the lncRNADetector workflow.
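As an illustration of the input format only, the following minimal Python sketch reads FASTA records into memory. It is not part of lncRNADetector (which is written in Perl); the function name `read_fasta` and the example records are hypothetical.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into an {identifier: sequence} mapping."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the identifier.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else f"unnamed_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequences may be wrapped over several lines; join them in upper case.
            records[current_id] += line.upper()
    return records


# Hypothetical two-record input, given as lines of text.
example_lines = [
    ">tx1 hypothetical transcript",
    "ACGTACGT",
    "acgt",
    ">tx2",
    "GGGCCC",
]
transcripts = read_fasta(example_lines)
print({name: len(seq) for name, seq in transcripts.items()})
```

The same function can be given an open file handle instead of a list, since it only iterates over lines.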
lncRNA identification
The Perl source code of a previously described lncRNA identification pipeline [11] was modified to integrate the standard tools used in the individual steps. The resulting pipeline, lncRNADetector, identifies lncRNAs from assembled transcripts and is hosted on a web server for proper functioning and storage of analysed data. In the first step, we filtered the assembled data by transcript length (>200 bp). Next, we analysed the transcripts for open reading frames (ORFs), and only sequences with ORFs shorter than 100 aa were considered further. The filtered sequences were analysed with blastx (BLAST 2.2.28+) [19] against the patented protein sequences (pataa) database [20], with the default E-value threshold set to 1e-3. Sequences reported as ‘No hits found’ in the blastx output were retained and used as input for the CPC tool [13]. The coding potential of these remaining sequences was calculated using CPC (0.9-r2), which is an integral part of lncRNADetector. Finally, the sequences marked as non-coding by CPC were re-analysed through blastx against the Pfam protein family database [21], which ultimately resulted in the identification of lncRNAs. As the pataa database contains sequences for protein structures not included in the non-redundant protein sequences (nr) database, the identified lncRNAs were further analysed through blastx against the nr database [20], with an E-value threshold of 1e-5, to enhance the accuracy of the identified lncRNAs [19].
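The first two filters (transcript length and ORF length) can be sketched in Python as follows. This is an illustrative sketch, not the Perl code of lncRNADetector. It assumes that an ORF runs from an ATG to the next in-frame stop codon on either strand and that ORFs lacking a stop codon are ignored; the text does not specify the ORF definition used by the pipeline, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_aa(seq: str) -> int:
    """Longest ATG-to-stop ORF over all six frames, in codons (stop excluded)."""
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None  # position of the currently open ATG, if any
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, (i - start) // 3)
                    start = None
    return best


def passes_size_and_orf_filters(seq: str, min_len: int = 200,
                                max_orf_aa: int = 100) -> bool:
    """Keep transcripts longer than min_len nt whose longest ORF is under max_orf_aa."""
    seq = seq.upper()
    return len(seq) > min_len and longest_orf_aa(seq) < max_orf_aa


# Hypothetical test sequences.
short_tx = "ATG" + "GCT" * 5 + "TAA"                  # 21 nt: fails the length filter
noncoding_like = "ATG" + "GCT" * 20 + "TAA" + "C" * 200  # 266 nt, 21-codon ORF: kept
coding_like = "ATG" + "GCT" * 120 + "TAA"             # 366 nt, 121-codon ORF: removed

for name, tx in [("short", short_tx), ("noncoding_like", noncoding_like),
                 ("coding_like", coding_like)]:
    print(name, len(tx), longest_orf_aa(tx), passes_size_and_orf_filters(tx))
```

The thresholds are exposed as parameters so that the cut-offs stated in the text (>200 nt, <100 aa) remain visible at the call site.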
Two recently launched species-neutral lncRNA identification tools, LGC [22] and the CPC2 web server [17], were chosen for validating the lncRNAs identified by our pipeline. The lncRNAs identified through lncRNADetector were used as input for the LGC and CPC2 web servers, and the results were compared.
Relational database of MAPs lncRNAs
Many public lncRNA databases covering humans, other vertebrates and plants have already been reported, but none of them focuses specifically on lncRNAs from MAPs. Therefore, we have developed the MAPs long non-coding RNA database (MAPslnc), which holds all the information on the lncRNAs identified from different MAP species. The lncRNAs were identified through lncRNADetector using all filtering steps. The resulting data generated by lncRNADetector (in FASTA/text files) are imported into the MAPslnc database (hosted on SQL Server) by mapping relational data fields through a .NET script.
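The import of tab-separated pipeline output into relational fields can be sketched as below. This is only an illustration under stated assumptions: MAPslnc itself is hosted on SQL Server and populated by a .NET script, whereas the sketch uses Python with an in-memory SQLite database. The table name, column names and the two example records are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical tab-separated output; sequences are shortened for brevity.
TSV_TEXT = (
    "accession\tspecies\tcoding_potential\tsequence\n"
    "TX0001\tOcimum basilicum\t-1.12\tACGTACGTAC\n"
    "TX0002\tMentha spicata\t-0.87\tGGGCCCAT\n"
)


def load_lncrnas(conn: sqlite3.Connection, tsv_text: str) -> int:
    """Create the (assumed) lncrna table if needed and insert all TSV rows."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS lncrna (
               accession        TEXT PRIMARY KEY,
               species          TEXT NOT NULL,
               length           INTEGER NOT NULL,
               coding_potential REAL NOT NULL,
               sequence         TEXT NOT NULL
           )"""
    )
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [
        (r["accession"], r["species"], len(r["sequence"]),
         float(r["coding_potential"]), r["sequence"])
        for r in reader
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO lncrna VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = load_lncrnas(conn, TSV_TEXT)
per_species = conn.execute(
    "SELECT species, COUNT(*) FROM lncrna GROUP BY species ORDER BY species"
).fetchall()
print(inserted, per_species)
```

Deriving the length column from the sequence at import time avoids storing a value that could disagree with the sequence itself.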
Results and discussion
We have implemented lncRNADetector, a bioinformatics pipeline for the identification of lncRNAs from assembled transcripts. It is hosted on a web server for proper functioning and storage of analysed data. It is capable of analysing lncRNAs, which may have important roles in various biological processes and systems in different organisms. The pipeline has been implemented as a web prediction tool available at https://lncrnapipe.cimap.res.in.
The majority of the lncRNAs predicted through lncRNADetector were also validated by two other tools, LGC and CPC2: overall, 98.66% and 99.35% of the predicted lncRNAs were validated by LGC and CPC2, respectively (Table 1). This corroborates the accuracy of lncRNADetector and shows it to be an efficient and accurate lncRNA prediction tool for MAPs.
Table 1.
Comparative validation of lncRNADetector-identified lncRNAs through LGC and CPC2
| S. No. | MAP species | Number of lncRNAs identified through lncRNADetector | Validated through LGC #: total | Validated through LGC #: percentage | Validated through CPC2 #: total | Validated through CPC2 #: percentage |
|---|---|---|---|---|---|---|
| 1 | Gloriosa superba | 12 | 12 | 100 | 12 | 100 |
| 2 | Lippia alba | 42 | 42 | 100 | 42 | 100 |
| 3 | Plantago ovata | 61 | 61 | 100 | 61 | 100 |
| 4 | Aloe vera | 56 | 56 | 100 | 56 | 100 |
| 5 | Mentha spicata | 99 | 99 | 100 | 99 | 100 |
| 6 | M. arvensis | 80 | 80 | 100 | 77 | 96.25 |
| 7 | Acorus calamus | 45 | 45 | 100 | 45 | 100 |
| 8 | Taxus baccata | 106 | 106 | 100 | 106 | 100 |
| 9 | Andrographis paniculata | 74 | 74 | 100 | 74 | 100 |
| 10 | Mentha x piperita | 225 | 225 | 100 | 223 | 99.11 |
| 11 | Plumbago zeylanica | 764 | 758 | 99.21 | 763 | 99.86 |
| 12 | Centella asiatica | 299 | 296 | 98.99 | 299 | 100 |
| 13 | Tinospora cordifolia | 555 | 542 | 97.65 | 548 | 98.73 |
| 14 | Hippophae rhamnoides | 1134 | 1131 | 99.73 | 1134 | 100 |
| 15 | Crocus sativus | 1684 | 1679 | 99.70 | 1679 | 99.70 |
| 16 | Mucuna pruriens | 29 | 29 | 100 | 29 | 100 |
| 17 | Ocimum basilicum | 2306 | 2301 | 99.78 | 2281 | 98.91 |
| 18 | Withania somnifera | 6376 | 6286 | 98.58 | 6366 | 99.84 |
| 19 | Rauvolfia serpentina | 13,598 | 13,487 | 99.18 | 13,586 | 99.91 |
| 20 | Catharanthus roseus | 16,507 | 16,330 | 98.92 | 16,464 | 99.73 |
| 21 | Papaver somniferum | 44,407 | 43,637 | 98.26 | 43,941 | 98.95 |
| Total | | 88,459 | 87,276 | 98.66 | 87,885 | 99.35 |
#Input = Total number of lncRNAs identified in each species through lncRNADetector.
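The percentages in Table 1 are the number of validated lncRNAs divided by the number identified through lncRNADetector. A minimal Python sketch of that arithmetic, using counts taken from the table, is given below; the function name is hypothetical, and the rounding convention (two decimals, round-half-even as in Python's `round`) is an assumption, so the last digit may differ from a tabulated value in some rows.

```python
def percent_validated(validated: int, identified: int) -> float:
    """Percentage of identified lncRNAs confirmed by a validation tool."""
    if identified <= 0:
        raise ValueError("identified must be a positive count")
    return round(100.0 * validated / identified, 2)


# (identified, validated by LGC, validated by CPC2), taken from Table 1.
table1_rows = {
    "M. arvensis": (80, 80, 77),
    "Mentha x piperita": (225, 225, 223),
    "Total": (88459, 87276, 87885),
}

for species, (identified, lgc, cpc2) in table1_rows.items():
    print(species, percent_validated(lgc, identified),
          percent_validated(cpc2, identified))
```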
Web server for lncRNADetector
We have developed and deployed a user-friendly web interface for lncRNADetector as a Windows service, using the C# language and ASP.NET (Fig. 2).
Figure 2.
Architecture of lncRNADetector
For the convenience of users, we have established a web portal at https://lncrnapipe.cimap.res.in. Briefly, the user submits a set of nucleotide sequences in FASTA format on the web portal, sets the E-value parameter for blastx if required, and uploads the sequences for the identification of lncRNAs. Depending on the size of the input data, the web server takes minutes to hours to provide the final output. lncRNADetector runs as a Windows service built with .NET, which executes the lncRNADetector Perl script when the user submits the input. When the user submits multiple nucleotide sequences, lncRNADetector processes them sequentially.
After the identification process finishes, the lncRNA results appear in the browser in tabular form, showing the accession number, description, coding potential of the transcript, length of the transcript, sequence and label; the table can be downloaded as a tab-separated file (.txt). Users can also click ‘Download’ to obtain all the intermediate files generated during the identification process as a ZIP archive. The lncRNADetector web server assigns a unique task ID to each job request submitted. Users can track the progress of a submitted task and retrieve its results by task ID on the ‘Result’ page, accessible from the homepage. In addition to using the web portal, users can download the stand-alone software package and install it on a Windows system. Complete installation and usage instructions are given on the ‘Help’ page of the web portal.
MAPslnc: repository of MAPs long non-coding (MAPslnc) RNAs
We have also created a repository of information on the lncRNAs identified through lncRNADetector, which is stored in a SQL Server database (Table 2). The MAPslnc database contains the sequences of the identified lncRNAs, their lengths and their coding potential. Presently, the identified lncRNA data of 21 MAP species are freely available from the online MAPslnc repository and can be downloaded as a tab-separated file for downstream analysis and utilization.
Table 2.
Summary of MAP lncRNAs in MAPslnc identified through step-wise filtering
| S. No. | MAP species | Total number of nucleotide sequences in the dataset | Steps 1 & 2: number of sequences left after transcript size selection + ORF filters | Step 3: number of sequences left after blastx against pataa | Step 4: number of sequences left after CPC filter | Step 5: number of sequences left after blastx against Pfam | Step 6: number of sequences left after blastx against nr |
|---|---|---|---|---|---|---|---|
| 1 | G. superba | 82 | 40 | 15 | 15 | 15 | 12 |
| 2 | L. alba | 84 | 73 | 43 | 43 | 43 | 42 |
| 3 | P. ovata | 208 | 143 | 62 | 61 | 61 | 61 |
| 4 | A. vera | 225 | 107 | 61 | 60 | 60 | 56 |
| 5 | M. spicata | 261 | 166 | 104 | 104 | 104 | 99 |
| 6 | M. arvensis | 286 | 196 | 100 | 100 | 100 | 80 |
| 7 | A. calamus | 517 | 153 | 49 | 49 | 49 | 45 |
| 8 | T. baccata | 530 | 282 | 107 | 107 | 107 | 106 |
| 9 | A. paniculata | 867 | 431 | 81 | 81 | 77 | 74 |
| 10 | Mentha x piperita | 1626 | 774 | 250 | 250 | 249 | 225 |
| 11 | P. zeylanica | 1899 | 1283 | 786 | 785 | 773 | 764 |
| 12 | C. asiatica | 4671 | 1544 | 344 | 337 | 334 | 299 |
| 13 | H. rhamnoides | 4991 | 3637 | 1191 | 1188 | 1178 | 1134 |
| 14 | T. cordifolia | 5713 | 1608 | 587 | 587 | 582 | 555 |
| 15 | C. sativus | 7216 | 4345 | 1766 | 1760 | 1745 | 1684 |
| 16 | M. pruriens | 15,818 | 1226 | 47 | 46 | 43 | 29 |
| 17 | O. basilicum | 24,171 | 9453 | 2603 | 2599 | 2501 | 2306 |
| 18 | W. somnifera | 74,567 | 18,589 | 7894 | 7882 | 7508 | 6376 |
| 19 | R. serpentina | 99,622 | 37,462 | 15,146 | 15,126 | 14,827 | 13,598 |
| 20 | C. roseus | 108,110 | 42,684 | 18,056 | 18,003 | 17,700 | 16,507 |
| 21 | P. somniferum | 184,411 | 67,916 | 46,033 | 45,925 | 44,893 | 44,407 |
| Total | | 535,875 | | | | | 88,459 |
The lncRNAs in the MAPslnc database repository have been obtained by applying the following filtering steps:
Step 1 (Transcript size selection): >200 nucleotides.
Step 2 (ORF): <100 amino acids.
Step 3 (blastx against patented protein sequences [pataa]): No hit in pataa database (E-value threshold 1e-3).
Step 4 (Coding potential calculation): Classified non-coding by CPC.
Step 5 (blastx against Pfam): No hit in Pfam database (E-value threshold 1e-3).
Step 6 (blastx against nr): No hit in nr database (E-value threshold 1e-5).
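Steps 3, 5 and 6 each keep only the queries for which blastx reports no hit. A minimal Python sketch of that selection is shown below. It assumes the default pairwise text report of BLAST+, in which each query block begins with a `Query=` line and a query without matches carries the line `***** No hits found *****`; the function name and the abbreviated example report are hypothetical, and this is not the Perl code used by lncRNADetector.

```python
from typing import Iterable, List


def queries_without_hits(report_lines: Iterable[str]) -> List[str]:
    """Return the identifiers of queries flagged 'No hits found' in a BLAST text report."""
    no_hit_ids: List[str] = []
    current_query = None
    for raw in report_lines:
        line = raw.strip()
        if line.startswith("Query="):
            # The identifier is the first token after 'Query='.
            tokens = line[len("Query="):].split()
            current_query = tokens[0] if tokens else None
        elif "No hits found" in line and current_query is not None:
            no_hit_ids.append(current_query)
            current_query = None  # guard against counting the same query twice
    return no_hit_ids


# Abbreviated, hypothetical report covering three queries.
example_report = """\
Query= tx1 hypothetical transcript

Length=250

***** No hits found *****

Query= tx2

Length=300

Sequences producing significant alignments:

Query= tx3

Length=410

***** No hits found *****
""".splitlines()

retained = queries_without_hits(example_report)
print(retained)
```

The retained identifiers can then be used to subset the FASTA records passed on to the next step.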
The results for lncRNAs obtained after step-wise filtering for all the 21 MAP species have been summarized in Table 2.
Conclusions
Owing to next-generation sequencing technology, novel lncRNAs of different organisms can be identified through various bioinformatics pipelines. As lncRNAs in plants play a critical role in many regulatory processes, it is of great importance to identify them with high accuracy. The proper storage of identified lncRNAs in a public database is also important for their downstream utilization by plant researchers. The non-redundant protein sequences (nr) database is very large (~65 Gb), and a blastx similarity search against it consumes considerable computational time. On the other hand, blastx against the pataa database, which is much smaller (~900 Mb), gives nearly comparable results in significantly less search time. This trade-off between time and benefit prompted us to incorporate blastx against the pataa database, but not the nr database, in the web portal of lncRNADetector. However, the online repository of identified MAP lncRNAs is based on all six filtering steps of lncRNADetector listed in Table 2. lncRNADetector demonstrates high prediction accuracy owing to the incorporation of a series of filtering steps, and offers a valuable tool for the identification of lncRNAs on a single-window platform. The MAPslnc repository will be regularly updated to add new lncRNA sequences from other MAP species.
Acknowledgements
The authors are grateful to the Director, CSIR-CIMAP, for his encouragement and for providing the necessary laboratory facilities. They also thank Dr Sumit K Bag, Principal Scientist, CSIR-NBRI, Dr Ashutosh Singh, Associate Professor, Shiv Nadar University, and Dr Vikrant Gupta, Senior Principal Scientist, CSIR-CIMAP, for their valuable suggestions. The help provided by Dr V. Sundaresan, Principal Scientist, CSIR-CIMAP, Research Centre, Bengaluru, in the form of original photographs of the target MAPs is duly acknowledged. Er. Manoj Semwal, Principal Scientist, CSIR-CIMAP, and Mr Sanjay Singh, Technical Officer, CSIR-CIMAP, are gratefully acknowledged for providing the ICT facility for hosting the web server. This work was financially supported by in-house funds provided by CSIR-CIMAP. SG was supported by the Science and Engineering Research Board (SERB), India, in the form of a National Post-Doctoral Fellowship, and GS was supported by an ICMR fellowship.
Funding Statement
This work was supported by the CSIR-Central Institute of Medicinal and Aromatic Plants, Lucknow, India [In-house Project].
Disclosure statement
The authors have declared that no competing interests exist.
Supplementary material
Supplemental data for this article can be accessed here.
References
- [1] Szakonyi D, Confraria A, Valerio C, et al. Editorial: Plant RNA Biology. Front Plant Sci. 2019;10:887.
- [2] Pertea M. The human transcriptome: an unfinished story. Genes (Basel). 2012;3:344–360.
- [3] Ponting CP, Oliver PL, Reik W. Evolution and functions of long noncoding RNAs. Cell. 2009;136:629–641.
- [4] Liu J, Wang H, Chua NH. Long noncoding RNA transcriptome of plants. Plant Biotechnol J. 2015;13:319–328.
- [5] Mercer TR, Dinger ME, Mattick JS. Long non-coding RNAs: insights into functions. Nat Rev Genet. 2009;10:155–159.
- [6] Wilusz JE, Sunwoo H, Spector DL. Long noncoding RNAs: functional surprises from the RNA world. Genes Dev. 2009;23:1494–1504.
- [7] Kornienko AE, Dotter CP, Guenzl PM, et al. Long non-coding RNAs display higher natural expression variation than protein-coding genes in healthy humans. Genome Biol. 2016;17:14.
- [8] Campalans A, Kondorosi A, Crespi M. Enod40, a short open reading frame–containing mRNA, induces cytoplasmic localization of a nuclear RNA binding protein in Medicago truncatula. Plant Cell. 2004;16:1047–1059.
- [9] Michaels SD, Amasino RM. FLOWERING LOCUS C encodes a novel MADS domain protein that acts as a repressor of flowering. Plant Cell. 1999;11:949–956.
- [10] Sanchita, Trivedi PK, Asif MH. Updates on plant long non-coding RNAs (lncRNAs): the regulatory components. Plant Cell Tiss Organ Cult. 2020;140:259–269.
- [11] Zhang W, Han Z, Guo Q, et al. Identification of maize long non-coding RNAs responsive to drought stress. PloS One. 2014;9:e98958.
- [12] Mu C, Wang R, Li T, et al. Long non-coding RNAs (lncRNAs) of sea cucumber: large-scale prediction, expression profiling, non-coding network construction, and lncRNA-microRNA-gene interaction analysis of lncRNAs in Apostichopus japonicus and Holothuria glaberrima during LPS challenge and radial organ complex regeneration. Mar Biotechnol (NY). 2016;18:485–499.
- [13] Kong L, Zhang Y, Ye ZQ, et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 2007;35(Web Server issue):W345–W349.
- [14] Suzek BE, Huang H, McGarvey P, et al. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282–1288.
- [15] Sun K, Chen X, Jiang P, et al. iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data. BMC Genomics. 2013;14 Suppl 2:S7.
- [16] Sun K, Zhao Y, Wang H, et al. Sebnif: an integrated bioinformatics pipeline for the identification of novel large intergenic noncoding RNAs (lincRNAs)--application in human skeletal muscle cells. PloS One. 2014;9:e84500.
- [17] Kang YJ, Yang DC, Kong L, et al. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 2017;45(W1):W12–W16.
- [18] Khanuja SPS, Shukla AK. Medicinal plant metabolomes: converging botany and chemistry into health opportunity. In: Iqbal M, Ahmad A, editors. Current trends in medicinal botany. New Delhi: I.K. International Publishing House Pvt. Ltd.; 2014. p. 346–370.
- [19] Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
- [20] Sayers EW, Agarwala R, Bolton EE, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2019;47(D1):D23–D28.
- [21] El-Gebali S, Mistry J, Bateman A, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019;47(D1):D427–D432.
- [22] Wang G, Yin H, Li B, et al. Characterization and identification of long non-coding RNAs based on feature relationship. Bioinformatics. 2019;35:2949–2956.