RNA Biology. 2021 Mar 18;18(12):2290–2295. doi: 10.1080/15476286.2021.1899673

lncRNADetector: a bioinformatics pipeline for long non-coding RNA identification and MAPslnc: a repository of medicinal and aromatic plant lncRNAs

Bhaskar Shukla a,b, Sanchita Gupta c, Gaurava Srivastava b,d, Ashok Sharma a,d, Ashutosh K Shukla b,d, Ajit K Shasany b,d
PMCID: PMC8632071  PMID: 33685383

ABSTRACT

Long non-coding RNAs (lncRNAs) are an emerging class of non-coding RNAs and potent regulatory elements in living cells. High-throughput RNA sequencing analyses have generated a tremendous amount of transcript sequence data. A large proportion of these transcript sequences do not code for proteins and are known as non-coding RNAs. Among them, lncRNAs are a unique class of transcripts longer than 200 nucleotides with diverse biological functions and regulatory mechanisms. Recent studies based on next-generation sequencing technologies indicate that plant genomes harbour a substantial number of lncRNAs that are yet to be identified. The computational identification of lncRNAs from these transcripts is challenging because it involves a series of filtering steps. We have developed lncRNADetector, a bioinformatics pipeline for the identification of novel lncRNAs, especially from medicinal and aromatic plant (MAP) species. lncRNADetector has been used to analyse and identify 88,459 lncRNAs from 21 MAP species. To provide a knowledge resource for the plant research community towards elucidating the diversity of biological roles of lncRNAs, the information generated about MAP lncRNAs (after all filtering steps) through lncRNADetector has been stored and organized in the MAPslnc database (https://lncrnapipe.cimap.res.in). The lncRNADetector web server and the MAPslnc database have been developed to help researchers accurately identify lncRNAs from the next-generation sequencing data of different organisms for downstream studies. To the best of our knowledge, no such database of MAP lncRNAs has been available to date.

KEYWORDS: Bioinformatics pipeline, long non-coding RNAs (lncRNAs), medicinal and aromatic plants (MAPs)

Introduction

Recent studies have revealed that RNAs play a regulatory role in diverse biological processes such as development and stress responses in plants [1]. Although it has been hypothesized on the basis of high-throughput genomic platforms that approximately 90% of a genome may be transcribed into RNA, only 1–2% of the transcripts code for proteins and the rest are considered to be non-coding [2]. Non-coding RNAs (ncRNAs) comprise different complex families, including housekeeping RNAs and regulatory RNAs. The housekeeping or structural ncRNAs, including tRNA, rRNA, snRNA and snoRNA, are expressed constitutively in the cell [3]. The regulatory RNAs include small RNAs (miRNAs, siRNAs) and long non-coding RNAs (lncRNAs). Earlier, lncRNAs were considered to be a source of transcriptional noise because of their lack of coding potential and lower expression compared with protein-coding transcripts [4]. More recently, lncRNAs have emerged as potent regulators of gene expression [5,6]. Like mRNAs, lncRNAs are transcribed by RNA polymerase II and spliced. Additionally, the number of lncRNAs found is much higher than the number of protein-coding RNAs (mRNAs) [7]. The lncRNA Early nodulin40 (ENOD40) gene in Medicago truncatula [8] has been reported to play a critical role in nodule formation. In Arabidopsis thaliana, the COOLAIR and COLDAIR lncRNAs participate in the regulation of the FLOWERING LOCUS C (FLC) gene, which affects flowering time [9]. These prominent discoveries of lncRNAs in plants signify the potential interest in exploring their biological functions, which in turn may be responsible for regulating important agronomic traits [10]. Besides their regulatory function, the functional roles and conserved features of lncRNAs are a matter of current debate and research in the scientific community. It may become possible to work out their functional and conserved (evolutionary) aspects once they have been identified completely.
At present, identifying all the lncRNAs present in an organism is a difficult task, which bioinformatics makes easier and faster. Several computational steps needed for the identification of lncRNAs have been reported. A stepwise multi-filtering approach has proved effective in discovering thousands of novel lncRNAs in various organisms [11,12]. For this, different tools and software packages have to be run one after another for the step-by-step analyses. There is therefore a need for an easy-to-use bioinformatics pipeline that integrates the series of filtering steps into a single platform and facilitates the discovery of lncRNAs. Towards this end, we have developed lncRNADetector, a bioinformatics pipeline that combines all the analysis steps in a single platform. It takes assembled transcript sequences in FASTA format as input and writes the output of every step to a separate file. The steps include the use of CPC (Coding Potential Calculator), a Support Vector Machine-based tool [13], to discriminate between protein-coding and non-coding transcripts. CPC employs UniRef90 as its protein database [14]. Most of the currently available tools rely on CPC alone, which might categorize novel protein-coding transcripts as non-coding because no matching records exist in the protein database used by CPC. In lncRNADetector, CPC is used because it accepts FASTA input from multiple species, and it is combined with other important steps that enhance the accuracy of the results.
A few resources for analysing lncRNAs have been reported earlier. For example, iSeeRNA is an SVM-based standalone classifier for discriminating between protein-coding transcripts and lincRNAs (long intergenic non-coding RNAs), a kind of lncRNA transcribed from the intergenic regions of a genome [15]; Sebnif is a bioinformatics pipeline for identifying high-quality single-exonic lincRNAs [16]; and CPC2 is a fast and accurate coding potential calculator based on sequence intrinsic features [17]. However, a need was felt to further improve on the accuracy of lncRNA identification offered by the existing tools. In our analysis, all the identified novel lncRNAs were stored and managed in the form of the MAPslnc database. lncRNA databases have been built for various species, including plants, but none of them focuses specifically on medicinal and aromatic plants (MAPs). The MAPs were earlier considered to be orphan plants but are now in high focus, and many of them, such as Catharanthus roseus (known for antineoplastic bisindole alkaloids), Papaver somniferum (known for morphinan alkaloids) and Artemisia annua (known for the antimalarial sesquiterpene lactone artemisinin), have acquired the status of model plants [18]. The MAPs that we have chosen are of high importance in terms of their medicinal/aromatic value, and very little information is available on their lncRNAs. The primary objective of developing the MAPslnc database is to provide a specific repository of lncRNAs from MAPs.

Materials and methods

Pipeline overview

The lncRNADetector is a comprehensive bioinformatics pipeline for the identification of lncRNAs from the assembled transcriptome sequencing (NGS) data. It integrates several filtering steps to identify novel lncRNAs (Fig. 1).

Figure 1. Schematic representation of the workflow of lncRNADetector

Datasets used for developing lncRNADetector

Datasets from various MAP species were used. The input data for lncRNA identification through lncRNADetector were collected in FASTA format from the nucleotide database of the NCBI web portal (https://www.ncbi.nlm.nih.gov). MAP species were selected on the basis of having a sufficient number of nucleotide sequences of transcript length >200 bp.

Standard input file formats

To be compatible with both de novo and reference-based assembly software, we have set FASTA as the default input file format for lncRNADetector. The FASTA format allows easy integration of transcriptome data analysis into the lncRNADetector workflow.
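As an illustration of the input format only, the following minimal Python sketch reads FASTA-formatted lines into a mapping of sequence identifier to sequence. It is not part of lncRNADetector (which is written in Perl); the function name `read_fasta` and the example records are hypothetical.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {sequence_id: sequence} mapping."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped; concatenate and normalise case.
            records[current_id] += line.upper()
    return records


# Hypothetical example records (not taken from the MAP datasets).
example = [
    ">tx1 hypothetical transcript",
    "ACGTACGT",
    "acgt",
    ">tx2",
    "GGGCCC",
]
transcripts = read_fasta(example)
for seq_id, seq in transcripts.items():
    print(seq_id, len(seq))
```

In practice an established parser (for example from a sequence-analysis library) would likely be preferable; the sketch only shows how little structure the format imposes.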

lncRNA identification

The source code of an lncRNA identification pipeline written in Perl [11] was modified to integrate the standard tools involved in the various steps of lncRNA identification. The resulting pipeline, lncRNADetector, identifies lncRNAs from assembled transcripts and is hosted on a web server for proper functioning and storage of the analysed data. In the first step, we filtered the assembled data by transcript length (>200 bp). Next, we analysed the open reading frames (ORFs) of the transcripts, and only the sequences with ORFs shorter than 100 aa were considered further. The filtered sequences were searched with blastx (2.2.28+) [19] against the patented protein sequences (pataa) database [20], with a default E-value threshold of 1e-3. The sequences reported as ‘No hits found’ in the blastx output file were retained and used as input for the CPC tool [13]. Thus, the coding potential of the sequences left over after blastx was calculated using CPC (0.9-r2), which is an integral part of lncRNADetector. Finally, the sequences marked as non-coding by CPC were analysed through blastx against the Pfam protein family database [21], which ultimately resulted in the identification of lncRNAs. As the pataa database contains sequences for protein structures not included in the non-redundant protein sequences (nr) database, the identified lncRNAs were further analysed through blastx against the nr database [20], with an E-value threshold of 1e-5, to enhance the accuracy of the identified lncRNAs [19].
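The first two filters (transcript length >200 nt and ORF <100 aa) can be sketched in Python as follows. This is a hedged illustration, not the authors' Perl code: the text does not specify how ORFs are defined, so the sketch assumes a complete ORF runs from an ATG to the next in-frame stop codon and scans all six reading frames. The names `longest_orf_aa` and `passes_size_and_orf` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf_aa(seq: str) -> int:
    """Length in amino acids of the longest complete ORF (ATG to stop).

    Assumption: ORFs without an in-frame stop codon are ignored.
    """
    seq = seq.upper()
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            current = None  # codons in the ORF being extended, None if outside an ORF
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if current is None:
                    if codon == "ATG":
                        current = 1  # the start codon counts as the first residue
                elif codon in STOP_CODONS:
                    best = max(best, current)
                    current = None
                else:
                    current += 1
    return best


def passes_size_and_orf(seq: str, min_len: int = 200, max_orf_aa: int = 100) -> bool:
    """Step 1: longer than min_len nt. Step 2: longest ORF shorter than max_orf_aa."""
    return len(seq) > min_len and longest_orf_aa(seq) < max_orf_aa


# Hypothetical sequences: one with a 99-aa ORF, one with a 101-aa ORF.
short_orf = "ATG" + "GCT" * 98 + "TAA"
long_orf = "ATG" + "GCT" * 100 + "TAA"
print(passes_size_and_orf(short_orf), passes_size_and_orf(long_orf))
```

Sequences that pass both checks would then move on to the blastx and CPC steps described above.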

Recently launched species-neutral lncRNA identification tools, namely the LGC [22] and CPC2 [17] web servers, were chosen for validating our identified lncRNAs. The lncRNAs identified through lncRNADetector were used as input for the LGC and CPC2 web servers, and the results were compared.

Relational database of MAPs lncRNAs

Many public lncRNA databases for human, vertebrates and plants have already been reported, but none of them focuses specifically on lncRNAs from MAPs. Therefore, we have developed the MAPs long non-coding RNA database (MAPslnc), which holds all the information on the lncRNAs identified from different MAP species. The lncRNAs were identified through lncRNADetector, including all its filtering steps. The data generated by lncRNADetector (as FASTA/text files) are imported into the MAPslnc database (hosted on SQL server) by mapping relational data fields through a .NET script.
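To make the idea of mapping pipeline output onto relational fields concrete, the sketch below loads lncRNA records into a small table. It deliberately swaps in Python's built-in `sqlite3` module for the SQL server and .NET script used by MAPslnc, and the table name, column names and example record are all hypothetical; the actual MAPslnc schema is not described in the text.

```python
import sqlite3

# In-memory database as a stand-in for the real database server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE lncrna (
        accession        TEXT PRIMARY KEY,
        species          TEXT NOT NULL,
        length           INTEGER NOT NULL,
        coding_potential REAL,
        sequence         TEXT NOT NULL
    )
    """
)


def import_records(connection, species, records):
    """Insert {accession: (sequence, coding_potential)} records for one species."""
    rows = [
        (accession, species, len(sequence), score, sequence)
        for accession, (sequence, score) in records.items()
    ]
    with connection:  # commits on success, rolls back on error
        connection.executemany("INSERT INTO lncrna VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


# Hypothetical record; the score is an invented placeholder value.
inserted = import_records(conn, "Hypothetical species", {"tx1": ("ACGT" * 60, -1.2)})
stored_length = conn.execute(
    "SELECT length FROM lncrna WHERE accession = ?", ("tx1",)
).fetchone()[0]
print(inserted, stored_length)
```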

Results and discussion

We have implemented lncRNADetector, a bioinformatics pipeline for the identification of lncRNAs from assembled transcripts. It is hosted on a web server for proper functioning and storage of analysed data. It is capable of analysing lncRNAs, which may have important roles in various biological processes and systems in different organisms. The pipeline has been implemented as a web prediction tool available on server interface https://lncrnapipe.cimap.res.in.

The majority of the lncRNAs predicted through lncRNADetector were also validated using other tools, namely LGC and CPC2. This corroborates the accuracy of lncRNADetector and shows it to be an efficient and accurate lncRNA prediction tool for MAPs (Table 1).

Table 1. Comparative validation of lncRNADetector-identified lncRNAs through LGC and CPC2

| S. No. | MAP species | lncRNAs identified through lncRNADetector | Validated through LGC# (total) | Validated through LGC# (%) | Validated through CPC2# (total) | Validated through CPC2# (%) |
|---|---|---|---|---|---|---|
| 1 | Gloriosa superba | 12 | 12 | 100 | 12 | 100 |
| 2 | Lippia alba | 42 | 42 | 100 | 42 | 100 |
| 3 | Plantago ovata | 61 | 61 | 100 | 61 | 100 |
| 4 | Aloe vera | 56 | 56 | 100 | 56 | 100 |
| 5 | Mentha spicata | 99 | 99 | 100 | 99 | 100 |
| 6 | M. arvensis | 80 | 80 | 100 | 77 | 96.25 |
| 7 | Acorus calamus | 45 | 45 | 100 | 45 | 100 |
| 8 | Taxus baccata | 106 | 106 | 100 | 106 | 100 |
| 9 | Andrographis paniculata | 74 | 74 | 100 | 74 | 100 |
| 10 | Mentha x piperita | 225 | 225 | 100 | 223 | 99.11 |
| 11 | Plumbago zeylanica | 764 | 758 | 99.21 | 763 | 99.86 |
| 12 | Centella asiatica | 299 | 296 | 98.99 | 299 | 100 |
| 13 | Tinospora cordifolia | 555 | 542 | 97.65 | 548 | 98.73 |
| 14 | Hippophae rhamnoides | 1134 | 1131 | 99.73 | 1134 | 100 |
| 15 | Crocus sativus | 1684 | 1679 | 99.70 | 1679 | 99.70 |
| 16 | Mucuna pruriens | 29 | 29 | 100 | 29 | 100 |
| 17 | Ocimum basilicum | 2306 | 2301 | 99.78 | 2281 | 98.91 |
| 18 | Withania somnifera | 6376 | 6286 | 98.58 | 6366 | 99.84 |
| 19 | Rauvolfia serpentina | 13,598 | 13,487 | 99.18 | 13,586 | 99.91 |
| 20 | Catharanthus roseus | 16,507 | 16,330 | 98.92 | 16,464 | 99.73 |
| 21 | Papaver somniferum | 44,407 | 43,637 | 98.26 | 43,941 | 98.95 |
| | Total | 88,459 | 87,276 | 98.66 | 87,885 | 99.35 |

#Input = Total number of lncRNAs identified in each species through lncRNADetector.
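The percentages in Table 1 are simply the validated count divided by the number of lncRNAs identified through lncRNADetector. A short Python sketch of that arithmetic, using the totals from the table (the function name `validated_percentage` is hypothetical):

```python
def validated_percentage(validated: int, identified: int) -> float:
    """Percentage of lncRNADetector-identified lncRNAs confirmed by another tool."""
    if identified <= 0:
        raise ValueError("identified must be a positive count")
    return round(100.0 * validated / identified, 2)


# Totals taken from the last row of Table 1.
identified_total = 88459
validated_totals = {"LGC": 87276, "CPC2": 87885}

for tool, validated in validated_totals.items():
    print(tool, validated_percentage(validated, identified_total))
```

The same function reproduces per-species entries, for example 77 of 80 M. arvensis lncRNAs validated by CPC2 gives 96.25.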

Web server for lncRNADetector

We have developed and deployed a user-friendly web interface for lncRNADetector as a Windows service using the C# language and ASP.NET (Fig. 2).

Figure 2. Architecture of lncRNADetector

For the convenience of users, we have established a web portal at https://lncrnapipe.cimap.res.in. Briefly, the user submits a set of nucleotide sequences in FASTA format on the web portal. The user can set the E-value as a parameter for blastx and upload the sequences for the identification of lncRNAs. Depending upon the size of the input data, the web server takes minutes to hours to provide the final output. lncRNADetector runs as a Windows service built with .NET, which executes the Perl script of lncRNADetector when the user submits the input. When a user submits multiple nucleotide sequences, lncRNADetector processes them sequentially.

After the identification process finishes, the lncRNA results appear in the browser in tabular form, showing the accession number, description, coding potential of the transcript, length of the transcript, sequence and label; the table can be downloaded as a tab-separated file (.txt). Users can further click ‘Download’ to obtain all the intermediate files generated during the identification process as a ZIP archive. The lncRNADetector web server also assigns a unique task ID to each job request submitted. Users can track the progress of a submitted task and retrieve its results by task ID on the ‘Result’ page, accessible from the homepage. In addition to using the web portal, users can download the stand-alone software package and install it on a Windows system. For complete installation and usage instructions, the user should consult the ‘Help’ page of the web portal.
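A downloaded tab-separated results file could be loaded for downstream analysis along the following lines. This is only a sketch: the column headers used here (`accession`, `description`, `coding_potential`, `length`, `sequence`, `label`) are assumptions based on the fields listed above, and the actual header names in the lncRNADetector output may differ. The sample row is invented.

```python
import csv
import io


def load_results(handle):
    """Read a tab-separated results table and convert numeric columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["length"] = int(row["length"])
        row["coding_potential"] = float(row["coding_potential"])
        rows.append(row)
    return rows


# Invented sample standing in for a downloaded .txt file.
sample = (
    "accession\tdescription\tcoding_potential\tlength\tsequence\tlabel\n"
    "tx1\thypothetical transcript\t-1.15\t240\tACGT\tnoncoding\n"
)
results = load_results(io.StringIO(sample))
print(results[0]["accession"], results[0]["length"])
```

For a real file, `io.StringIO(sample)` would be replaced by an opened file handle, for example `open(path, newline="")`.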

MAPslnc: repository of MAPs long non-coding (MAPslnc) RNAs

We have also created a repository of information on the lncRNAs identified through lncRNADetector, which is stored in the SQL server database (Table 2). The MAPslnc database contains the sequences of the identified lncRNAs, their lengths and their coding potential. Presently, the lncRNA data identified for 21 MAP species are freely available from the MAPslnc online repository and can be downloaded as a tab-separated file for downstream analysis and utilization.

Table 2. Summary of MAP lncRNAs in MAPslnc identified through step-wise filtering

| S. No. | MAP species | Total number of nucleotide sequences in the dataset | Steps 1 & 2: sequences left after transcript size selection + ORF filters | Step 3: sequences left after blastx against pataa | Step 4: sequences left after CPC filter | Step 5: sequences left after blastx against Pfam | Step 6: sequences left after blastx against nr |
|---|---|---|---|---|---|---|---|
| 1 | G. superba | 82 | 40 | 15 | 15 | 15 | 12 |
| 2 | L. alba | 84 | 73 | 43 | 43 | 43 | 42 |
| 3 | P. ovata | 208 | 143 | 62 | 61 | 61 | 61 |
| 4 | A. vera | 225 | 107 | 61 | 60 | 60 | 56 |
| 5 | M. spicata | 261 | 166 | 104 | 104 | 104 | 99 |
| 6 | M. arvensis | 286 | 196 | 100 | 100 | 100 | 80 |
| 7 | A. calamus | 517 | 153 | 49 | 49 | 49 | 45 |
| 8 | T. baccata | 530 | 282 | 107 | 107 | 107 | 106 |
| 9 | A. paniculata | 867 | 431 | 81 | 81 | 77 | 74 |
| 10 | Mentha x piperita | 1626 | 774 | 250 | 250 | 249 | 225 |
| 11 | P. zeylanica | 1899 | 1283 | 786 | 785 | 773 | 764 |
| 12 | C. asiatica | 4671 | 1544 | 344 | 337 | 334 | 299 |
| 13 | H. rhamnoides | 4991 | 3637 | 1191 | 1188 | 1178 | 1134 |
| 14 | T. cordifolia | 5713 | 1608 | 587 | 587 | 582 | 555 |
| 15 | C. sativus | 7216 | 4345 | 1766 | 1760 | 1745 | 1684 |
| 16 | M. pruriens | 15,818 | 1226 | 47 | 46 | 43 | 29 |
| 17 | O. basilicum | 24,171 | 9453 | 2603 | 2599 | 2501 | 2306 |
| 18 | W. somnifera | 74,567 | 18,589 | 7894 | 7882 | 7508 | 6376 |
| 19 | R. serpentina | 99,622 | 37,462 | 15,146 | 15,126 | 14,827 | 13,598 |
| 20 | C. roseus | 108,110 | 42,684 | 18,056 | 18,003 | 17,700 | 16,507 |
| 21 | P. somniferum | 184,411 | 67,916 | 46,033 | 45,925 | 44,893 | 44,407 |
| | Total | 535,875 | | | | | 88,459 |

The lncRNAs in the MAPslnc database repository have been obtained through application of the following filtering steps:

Step 1 (Transcript size selection): >200 nucleotides.

Step 2 (ORF): <100 amino acids.

Step 3 (blastx against patented protein sequences [pataa]): No hit in pataa database (E-value threshold 1e-3).

Step 4 (Coding potential calculation): Classified non-coding by CPC.

Step 5 (blastx against Pfam): No hit in Pfam database (E-value threshold 1e-3).

Step 6 (blastx against nr): No hit in nr database (E-value threshold 1e-5).
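The "no hit" filters in Steps 3, 5 and 6 can be sketched as follows. Note the swapped-in technique: the pipeline described here looks for ‘No hits found’ in the blastx output file, whereas this sketch assumes blastx was run with tabular output (BLAST+ `-outfmt 6`), in which the query ID is the first column and the E-value is the eleventh. The function names and sample lines are hypothetical.

```python
def ids_with_hits(tabular_lines, max_evalue):
    """Return query IDs having at least one hit with E-value <= max_evalue."""
    hits = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        # Column 1 is the query ID and column 11 the E-value in outfmt 6.
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def no_hit_ids(all_ids, tabular_lines, max_evalue):
    """Keep, in input order, the query IDs without any hit under the threshold."""
    hits = ids_with_hits(tabular_lines, max_evalue)
    return [seq_id for seq_id in all_ids if seq_id not in hits]


# Invented tabular lines: tx1 has a strong hit, tx2 only a weak one, tx3 none.
sample_hits = [
    "tx1\tP1\t90.0\t50\t5\t0\t1\t150\t1\t50\t1e-20\t100",
    "tx2\tP2\t30.0\t20\t14\t0\t1\t60\t1\t20\t0.5\t25",
]
retained = no_hit_ids(["tx1", "tx2", "tx3"], sample_hits, max_evalue=1e-3)
print(retained)
```

The same function could be reused with `max_evalue=1e-5` for the nr step; whether weak hits above the threshold appear in the file at all depends on the E-value passed to blastx itself.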

The results for lncRNAs obtained after step-wise filtering for all the 21 MAP species have been summarized in Table 2.
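As a worked reading of Table 2, the number of sequences removed at each stage is the difference between successive columns. The Python sketch below applies this to the P. somniferum row and computes the overall fraction of input sequences retained as lncRNAs from the table totals; the helper name `removed_per_step` is hypothetical.

```python
stages = ["input", "size + ORF", "blastx pataa", "CPC", "blastx Pfam", "blastx nr"]

# P. somniferum row of Table 2.
counts = [184411, 67916, 46033, 45925, 44893, 44407]


def removed_per_step(step_counts):
    """Sequences discarded between each pair of consecutive stages."""
    return [before - after for before, after in zip(step_counts, step_counts[1:])]


removed = removed_per_step(counts)
for stage, n in zip(stages[1:], removed):
    print(f"{stage}: removed {n}")

# Overall retention across all 21 species, from the 'Total' row of Table 2.
overall_retained_pct = round(100 * 88459 / 535875, 1)
print(overall_retained_pct)
```

For this species, the size and ORF filters account for most of the reduction, while the CPC step removes comparatively few sequences.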

Conclusions

Owing to next-generation sequencing technology, novel lncRNAs of different organisms can be identified through various bioinformatics pipelines. As lncRNAs play a critical role in many regulatory processes in plants, it is of great importance to identify them with high accuracy. The proper storage of identified lncRNAs in a public database is also important for their downstream utilization by plant researchers. The non-redundant protein sequences (nr) database is very large (~65 Gb), and a blastx similarity search against it is computationally expensive. On the other hand, blastx against the pataa database, which is much smaller (~900 Mb), gives nearly comparable results in significantly less search time. This time-cost trade-off prompted us to incorporate blastx against the pataa (but not the nr) database in the web portal of lncRNADetector. However, the online repository of identified MAP lncRNAs is based on all six filtering steps of lncRNADetector listed in Table 2. lncRNADetector demonstrates high prediction accuracy owing to the incorporation of a series of filtering steps and offers a valuable tool for the identification of lncRNAs on a single-window platform. The MAPslnc repository will be updated regularly to add new lncRNA sequences from other MAP species.

Acknowledgements

The authors are grateful to Director, CSIR-CIMAP, for his encouragement and providing necessary laboratory facilities. They also thank Dr Sumit K Bag, Principal Scientist, CSIR-NBRI, Dr Ashutosh Singh, Associate Professor, Shiv Nadar University, and Dr Vikrant Gupta, Senior Principal Scientist, CSIR-CIMAP, for their valuable suggestions. The help provided by Dr V. Sundaresan, Principal Scientist, CSIR-CIMAP, Research Centre, Bengaluru, in the form of original photographs of the target MAPs is duly acknowledged. Er. Manoj Semwal, Principal Scientist, CSIR-CIMAP and Mr Sanjay Singh, Technical Officer, CSIR-CIMAP, are gratefully acknowledged for providing ICT facility for hosting the web server. This work was financially supported by in-house funds provided by CSIR-CIMAP. SG was supported by the Science and Engineering Research Board (SERB), India, in the form of a National-Post Doctoral Fellowship and GS was supported by an ICMR fellowship.

Funding Statement

This work was supported by the CSIR-Central Institute of Medicinal and Aromatic Plants, Lucknow, India [In-house Project].

Disclosure statement

The authors have declared that no competing interests exist.

Supplementary material

Supplemental data for this article can be accessed here.

References

  • [1] Szakonyi D, Confraria A, Valerio C, et al. Editorial: Plant RNA Biology. Front Plant Sci. 2019;10:887.
  • [2] Pertea M. The human transcriptome: an unfinished story. Genes (Basel). 2012;3:344–360.
  • [3] Ponting CP, Oliver PL, Reik W. Evolution and functions of long noncoding RNAs. Cell. 2009;136:629–641.
  • [4] Liu J, Wang H, Chua NH. Long noncoding RNA transcriptome of plants. Plant Biotechnol J. 2015;13:319–328.
  • [5] Mercer TR, Dinger ME, Mattick JS. Long non-coding RNAs: insights into functions. Nat Rev Genet. 2009;10:155–159.
  • [6] Wilusz JE, Sunwoo H, Spector DL. Long noncoding RNAs: functional surprises from the RNA world. Genes Dev. 2009;23:1494–1504.
  • [7] Kornienko AE, Dotter CP, Guenzl PM, et al. Long non-coding RNAs display higher natural expression variation than protein-coding genes in healthy humans. Genome Biol. 2016;17:14.
  • [8] Campalans A, Kondorosi A, Crespi M. Enod40, a short open reading frame–containing mRNA, induces cytoplasmic localization of a nuclear RNA binding protein in Medicago truncatula. Plant Cell. 2004;16:1047–1059.
  • [9] Michaels SD, Amasino RM. FLOWERING LOCUS C encodes a novel MADS domain protein that acts as a repressor of flowering. Plant Cell. 1999;11:949–956.
  • [10] Sanchita, Trivedi PK, Asif MH. Updates on plant long non-coding RNAs (lncRNAs): the regulatory components. Plant Cell Tiss Organ Cult. 2020;140:259–269.
  • [11] Zhang W, Han Z, Guo Q, et al. Identification of maize long non-coding RNAs responsive to drought stress. PLoS One. 2014;9:e98958.
  • [12] Mu C, Wang R, Li T, et al. Long non-coding RNAs (lncRNAs) of sea cucumber: large-scale prediction, expression profiling, non-coding network construction, and lncRNA-microRNA-gene interaction analysis of lncRNAs in Apostichopus japonicus and Holothuria glaberrima during LPS challenge and radial organ complex regeneration. Mar Biotechnol (NY). 2016;18:485–499.
  • [13] Kong L, Zhang Y, Ye ZQ, et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 2007;35(Web Server issue):W345–W349.
  • [14] Suzek BE, Huang H, McGarvey P, et al. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282–1288.
  • [15] Sun K, Chen X, Jiang P, et al. iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data. BMC Genomics. 2013;14 Suppl 2:S7.
  • [16] Sun K, Zhao Y, Wang H, et al. Sebnif: an integrated bioinformatics pipeline for the identification of novel large intergenic noncoding RNAs (lincRNAs)--application in human skeletal muscle cells. PLoS One. 2014;9:e84500.
  • [17] Kang YJ, Yang DC, Kong L, et al. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 2017;45(W1):W12–W16.
  • [18] Khanuja SPS, Shukla AK. Medicinal plant metabolomes: converging botany and chemistry into health opportunity. In: Iqbal M, Ahmad A, editors. Current trends in medicinal botany. New Delhi: I.K. International Publishing House Pvt. Ltd.; 2014. p. 346–370.
  • [19] Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
  • [20] Sayers EW, Agarwala R, Bolton EE, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2019;47(D1):D23–D28.
  • [21] El-Gebali S, Mistry J, Bateman A, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019;47(D1):D427–D432.
  • [22] Wang G, Yin H, Li B, et al. Characterization and identification of long non-coding RNAs based on feature relationship. Bioinformatics. 2019;35:2949–2956.


Articles from RNA Biology are provided here courtesy of Taylor & Francis
