RNA Biology. 2020 Mar 19;17(7):966–976. doi: 10.1080/15476286.2020.1738703

Identification of alternatively spliced gene isoforms and novel noncoding RNAs by single-molecule long-read sequencing in Camellia

Zhikang Hu a,b,c, Tao Lyu a,c, Chao Yan a,d, Yupeng Wang a,b, Ning Ye b, Zhengqi Fan c, Xinlei Li c, Jiyuan Li c, Hengfu Yin a,c
PMCID: PMC7549672  PMID: 32160106

ABSTRACT

Direct single-molecule sequencing of full-length transcripts allows efficient identification of gene isoforms, which makes it well suited to analyses of alternative splicing (AS), polyadenylation, and long non-coding RNAs. However, the identification of gene isoforms and long non-coding RNAs with novel regulatory functions remains challenging, especially for species without a reference genome. Here, we present a comprehensive analysis of combined long-read and short-read transcriptome sequencing in Camellia japonica. Through a novel bioinformatic pipeline that reverse-traces the split-sites, we have uncovered 257,692 AS sites from 61,838 transcripts, and 13,068 AS isoforms have been validated by aligning the short reads. We have identified the tissue-specific AS isoforms along with 6,373 AS events that were found in all tissues. Furthermore, we have analysed the polyadenylation (polyA) patterns of transcripts and found that the preference for polyA signals differed between the AS and non-AS transcripts. Moreover, we have predicted the phased small interfering RNA (phasiRNA) loci through integrative analyses of transcriptome and small RNA sequencing. We have shown that a newly evolved phasiRNA locus from lipoxygenases generated 12 consecutive 21 bp secondary RNAs, which were responsive to cold and heat stress in Camellia. Our studies of the isoform transcriptome provide insights into gene splicing and functions that may facilitate the mechanistic understanding of plants.

KEYWORDS: Alternative splicing, single-molecule sequencing, phased small interfering RNA, polyadenylation, Camellia, lipoxygenase

Introduction

A complete transcript resource is fundamental to gene discovery and studies of genetic variation, especially in species that lack a reference genome. Alternative splicing (AS) is an evolutionarily critical character of genes in eukaryotes that increases the proteome diversity of cells. AS can also be an important layer of gene regulation in response to environmental changes [1,2]. High-throughput RNA sequencing (RNAseq) of short fragments (with read length usually less than 300 bp) can provide in-depth coverage of low-abundance transcripts, but the assembly of full transcripts by bioinformatic algorithms remains challenging, especially for genes that have undergone extensive AS [3]. In recent years, single-molecule sequencing platforms, such as those of Pacific Biosciences (PacBio) and Oxford Nanopore Technologies, have been characterized by long read lengths, high throughput, high accuracy, and the absence of amplification [4,5]. These platforms allow direct sequencing of full-length transcripts, which makes the identification of gene isoforms easier than with short-read RNA sequencing because there is no need to reconstruct the transcript variants [6,7].

Recent studies in several plant species have shown that single-molecule transcriptome sequencing is an effective means of identifying gene isoforms. For instance, using the Isoform Sequencing (Iso-Seq) method developed by PacBio, a study in maize has revealed 111,151 isoforms in six tissues, and tissue-specific isoforms have been identified and investigated [8]. Likewise, comparative analyses of long-read transcriptomes from maize and sorghum have uncovered evolutionarily conserved isoforms and species-specific AS patterns [9]. Studies of many other plant species, such as rice, clover, sugarcane, bamboo, and coffee [10–15], have also been reported, and these provide a comprehensive supplement for understanding the diversity of transcriptomes.

For species without a high-quality reference genome, however, accurate identification of AS isoforms remains difficult. Because AS patterns take different forms, sequence alignment of isoforms may not be efficient for revealing AS sites, especially for sequences with minor changes. In a study of Amborella trichopoda, Liu and colleagues have described a pipeline to identify AS isoforms without using the reference genome; this all-vs.-all approach has resulted in 428 pairs of AS isoforms with a validation rate of 66–76% [16]. Combined analysis using short-read and long-read sequencing of the transcriptome can be a robust means of characterizing the structures and expression profiles of AS isoforms in non-model species. One approach is to assemble reads and generate a reference to determine AS sites by re-alignment. For instance, the IDP-denovo tool takes both short and long reads to construct a ‘pseudo-genome’ for AS identification. This method has shown a substantial increase in efficiency for non-model species [17].

Single-molecule transcriptome sequencing produces a novel set of full-length isoforms that is well suited to downstream analyses, including the identification of long non-coding RNAs (lncRNAs), alternative polyadenylation (APA), and fusion transcripts [7,18]. Recently, lncRNAs have been found to be abundant in plant genomes, and they can play important regulatory roles in diverse biological processes [19–21]. Although the number of genomic databases of lncRNAs is increasing [22,23], the understanding of the function of lncRNAs in plants is still limited. Little conservation of lncRNAs at the sequence level is observed among distantly related species, which poses an obstacle to the computational prediction of lncRNA function. Nevertheless, a key aspect of the function of lncRNAs is their association with small regulatory RNAs (e.g. microRNA [miRNA], small interfering RNA [siRNA]). Mature miRNAs are mainly 20–21 bp RNAs that can direct the expression of their target transcripts at the post-transcriptional level [24]; miRNAs can also trigger the synthesis of secondary siRNAs, including trans-acting small interfering RNAs (tasiRNAs) and phased small interfering RNAs (phasiRNAs), which further modify gene expression [24–26].

With the development of more capable bioinformatic methods, single-molecule sequencing in non-model species will yield new insights into the regulation of gene expression. The genus Camellia is well known for its cultivars, which are used for making tea, as ornamentals, and for producing edible oil [27]. Recently, high-quality reference genomes have been released for two genotypes of Camellia sinensis [28,29]. Here, we have performed Iso-Seq analyses in five Camellia japonica tissues and designed a novel pipeline for identifying AS isoforms without using a reference genome; finally, we have identified a new phasiRNA locus that is responsive to temperature stresses.

Materials and methods

Plant materials and treatments

Camellia japonica plants were grown in the greenhouse of the Research Institute of Subtropical Forestry (Fuyang, Zhejiang, China). For sample preparation, the plant tissues were collected, immediately frozen in liquid nitrogen, and stored at −80°C until further use. For temperature treatment, small cuttings of C. japonica (10–15 cm) were kept in a growth chamber under long-day conditions (16-h light/8-h dark) at 24°C and 40% humidity. To perform the low-temperature treatment, a glass freezer was controlled by a temperature sensor (PURUI G6000, Ningbo, China). To perform the high-temperature treatment, an incubator was set to an appropriate temperature prior to the experiment to stabilize the internal temperature. To collect the pericarp and immature seed, a young fruit was sliced in half, and the tissues were removed and collected using a sharp scalpel.

RNA preparation, library construction, sequencing, and IsoSeq data processing

Total RNA was extracted from the tissues of C. japonica using an RNAprep Pure Plant Plus Kit (Transgen, Cat No. DP441, Beijing, China). The concentration and integrity of the total RNA were checked before library construction. A Nanodrop 2000 spectrophotometer (Thermo Fisher, CA, USA) was used to measure the RNA concentration, and samples with a concentration above 200 ng/µL and an optical density (OD) 260/280 ratio above 2.0 were used. Nearly equal amounts of RNA from the different tissue types were mixed according to the concentrations. To construct libraries for Iso-Seq analysis, high-quality mRNA was purified with Oligo dT beads (Invitrogen, Cat No. 61002) and then reverse-transcribed into cDNA using a SMARTer PCR cDNA Synthesis Kit (Clontech, Cat No. 634926). The cDNA fragments were size-selected with a BluePippin device (Sage Science, Beverly, MA, USA).

The Iso-Seq protocol was performed on a PacBio sequencer using the RSII platform, as previously described [5,18]. Raw reads with a read accuracy below 0.75 or a read length below 50 bp were first filtered out. The reads-of-insert (ROIs) were further divided into full-length and non-full-length reads based on the presence of 5ʹ and 3ʹ adapters. The full-length non-chimeric reads were clustered using the Iterative Clustering for Error Correction software to generate cluster consensus sequences representing high-quality isoforms (over 99% accuracy). The non-redundant isoforms were further retrieved using CD-hit [30]. All sequencing data were deposited with the National Centre for Biotechnology Information (NCBI) under the BioProject ID: PRJNA564707.
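The initial read filter can be pictured as a simple predicate applied to each read. The sketch below is only illustrative: the `Read` record and the function names are hypothetical and are not part of the PacBio software, and it assumes that a read failing either threshold is discarded.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical minimal record for one raw read."""
    name: str
    sequence: str
    accuracy: float  # predicted read accuracy between 0 and 1


MIN_ACCURACY = 0.75  # threshold stated in the text
MIN_LENGTH = 50      # bp, threshold stated in the text


def passes_filter(read: Read) -> bool:
    """Keep a read only if it meets both the accuracy and length thresholds."""
    return read.accuracy >= MIN_ACCURACY and len(read.sequence) >= MIN_LENGTH


def filter_reads(reads):
    """Return the reads that pass the initial quality filter."""
    return [r for r in reads if passes_filter(r)]
```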

Transcriptome sequencing and data processing

The Illumina HiSeq platform was used to perform RNA sequencing for each sample to generate 2 × 150 bp short reads. The methods of library construction, sequencing, and data processing were as described elsewhere [31]. All clean reads were deposited in the NCBI Short Read Archive mentioned above. The clean reads were mapped to the non-redundant isoforms by bowtie2 v2.1.0 [32], and the expression level of each transcript was calculated as FPKM by RSEM v1.2.15 [33]. The same clean reads of each sample were also used to validate the AS sites predicted by IsoSplitter. For differential analysis, transcripts with at least a two-fold change and an FDR less than 0.001 were identified using edgeR, as previously described [34].
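For orientation, the two quantities used in this step can be written out directly. This is a minimal sketch under stated assumptions and not the RSEM or edgeR implementation: the function names are hypothetical, the FPKM formula uses the plain transcript length (RSEM works with an effective length), and the thresholds are the ones quoted in this section.

```python
import math


def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


def is_det(log2_fold_change, fdr, min_fold=2.0, max_fdr=0.001):
    """Apply the fold-change and FDR cut-offs to one transcript."""
    return abs(log2_fold_change) >= math.log2(min_fold) and fdr < max_fdr
```

For example, 500 fragments on a 2,000 bp transcript in a library of 10 million mapped fragments give an FPKM of 25.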

AS identification and the IsoSplitter pipeline

To identify AS isoforms without a reference genome, we designed a sequence-alignment pipeline that uses the isoforms to predict and validate the AS sites (IsoSplitter, available at https://github.com/Hengfu-Yin/IsoSplitter). Briefly, IsoSplitter invokes a modified SIM4 program to find split-sites of transcripts [35], and each split-site is validated and quantified using the high-depth short reads. We changed the SIM4 ‘–word size’ of the core region to 15 (the default value is 12), which gives more stringent alignment results for further analyses. The detailed manual is available on the webpage. For this study, the non-redundant isoforms from the Iso-Seq sequencing were used for AS identification as follows: ‘IsoSplittingAnchor -i 95 -L 30bp longReadsFile’; to validate the AS sites, the clean reads of Illumina sequencing were used: ‘ShortReadsAligner -q longReadsFile ShortReadsFile Breakpoint_out’. For the quantification of AS isoforms, we further identified the isoform-specific reads by mapping the short reads across the ‘split-sites’, and the value of ‘average read counts per split per million reads’ (ACM) was obtained to reveal the isoform expression. The script for short-read mapping and quantification used in this study is available at https://github.com/Hengfu-Yin/IsoSplitter/scripts/ACM_quantification.py.
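One plausible reading of the ACM value is sketched below. It is an interpretation of the name ‘average read counts per split per million reads’ and not the code of the ACM_quantification.py script; the function name and arguments are hypothetical.

```python
def acm(junction_read_counts, total_reads):
    """Average junction-read count per split-site, scaled per million reads.

    junction_read_counts: reads spanning each split-site of one isoform.
    total_reads: total number of short reads in the library (assumed scaling).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not junction_read_counts:
        return 0.0
    mean_per_split = sum(junction_read_counts) / len(junction_read_counts)
    return mean_per_split / (total_reads / 1e6)
```

Under this reading, an isoform with three split-sites covered by 30, 50, and 40 junction reads in a library of 20 million reads has an ACM of 2.0.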

Small RNA expression analysis

The 21 bp small RNA sequences were used to design primers for quantitative expression analysis (Supplementary Table S1). The total RNA was prepared and normalized before reverse transcription with the Mir-X miRNA First-Strand Synthesis Kit (Clontech, Cat No. 638315, Dalian, China). To perform PCR analysis, a Mir-X miRNA qRT-PCR TB Green Kit (Clontech, Cat No. 638314, Dalian, China) was used according to the user’s manual. The U6 sequence was used as the internal reference, and miR167 [mature sequence: tgaagctgccagcatgatctg; 35] was also used as a control for gene expression analysis. Three biological replicates were obtained for expression analysis. To predict the targets of the secondary siRNAs, we chose a transcriptome assembly of C. japonica to reduce the complexity [31], and the psRNATarget server was used with default settings [37]. The gene-specific primers (Supplementary Table S1) for quantitative real-time PCR (qRT-PCR) analysis of potential target genes were designed with Primer Express 3.0.1 (Applied Biosystems), and the reactions were carried out using an SYBR Premix Ex Taq (Takara, Dalian, China) kit as described [31].

Bioinformatic and statistical analysis

To predict the coding sequences of transcripts, TransDecoder (http://transdecoder.sourceforge.net/) was used to identify the open reading frames (ORFs), and the ORFs with more than 100 codons were kept for annotation analysis. The NCBI nucleotide sequences, NCBI non-redundant protein sequences, and Swiss-Prot were used to annotate the derived protein sequences by BLAST 2.2.31+. To identify lncRNAs, PLEK (version 1.2, available at https://sourceforge.net/projects/plek/files/) was initially used [38] with the maize model (-model maize_ens_linli.model -range maize_ens_linli.range -minlength 300), and CPC software (cpc-0.9-r2 with default settings) was also used to find lncRNAs [39]. To obtain the final set of lncRNAs, we retained the sequences predicted by both tools. For homologous analysis of lncRNAs, the genome annotations of Populus trichocarpa version 3.0 [40], Vitis vinifera (http://www.genoscope.cns.fr/externe/GenomeBrowser/Vitis/) [41], and Camellia sinensis [28] were downloaded. We used sequence similarity alignment by BLASTN (version 2.2.31+, cut-off E-value: 1e-10) to identify homologous lncRNAs in different species. To determine the polyA signature, the 3ʹ UTR sequences were retrieved and searched with SIGNITRUTH to reveal the enriched signals [42].
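The consensus step for lncRNA calling amounts to a set intersection. The following sketch assumes that each tool's non-coding calls have already been parsed into collections of transcript identifiers; the function name and the explicit length check are illustrative assumptions, not part of PLEK or CPC.

```python
def consensus_lncrnas(plek_noncoding, cpc_noncoding, lengths, min_length=300):
    """Return transcript IDs called non-coding by both tools.

    plek_noncoding, cpc_noncoding: iterables of transcript IDs.
    lengths: mapping from transcript ID to sequence length in bp.
    min_length: minimum length to keep (assumed to mirror -minlength 300).
    """
    shared = set(plek_noncoding) & set(cpc_noncoding)
    return sorted(t for t in shared if lengths.get(t, 0) >= min_length)
```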

For Gene Ontology (GO) enrichment analysis of differentially expressed genes, the hypergeometric test was used to calculate the enrichment probability for each GO term in a differentially expressed transcript (DET) set, and the probabilities were corrected by the Benjamini-Hochberg method [43]. To visualize the relationships of the enriched GO terms, the top 30 GO terms from biological processes were grouped using REVIGO with the default settings and plotted using Cytoscape 3.1.1 [44]. To compare the polyA signal (PAS) sites between the AS and non-AS groups, a one-sided Fisher’s exact test was used to calculate the significance of enrichment of PAS sites. For the prediction of phasiRNA loci, previous small RNA sequencing data were used with UEA SRNA-Workbench version 3.2 [44,45].
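The enrichment statistics can be reproduced with the standard library alone. The sketch below is a generic implementation of the upper-tail hypergeometric probability and the Benjamini-Hochberg adjustment; the function names are hypothetical, and it is not the code used in this study.

```python
from math import comb


def hypergeom_upper_tail(k, population, annotated, drawn):
    """P(X >= k) when `drawn` items are sampled without replacement from a
    population of size `population` containing `annotated` successes."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The one-sided Fisher's exact test for a 2 × 2 table is the same upper-tail probability, with the table margins taking the roles of `population`, `annotated`, and `drawn`.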

Results

Extensive transcript isoforms from the Iso-Seq-based transcriptome in C. japonica

To construct a complete resource for gene discovery in C. japonica, we performed full-length transcriptome sequencing with the PacBio Iso-Seq technology. A mixed RNA sample from five different tissue types (Fig. 1A) was used for library construction with preferential sizes of 1–2 kb, 2–3 kb, 3–6 kb, and 5–10 kb. All libraries were sequenced on a PacBio SMRT sequencing platform. In total, 901,752 raw reads (around 10.2 billion bases) were generated; after filtering, 537,587 subreads representing 9.6 billion bases were obtained, including full-length and non-full-length transcripts (Fig. 1B, C). The size distribution of ROIs was consistent with the cDNA size selection used for library construction (Supplementary Fig. 1).

Figure 1.

An overview of Iso-Seq transcriptome sequencing in Camellia japonica. (A) the plant tissues used for the library construction. FB, floral bud; YL, young leaf; SK, seed kernel; PE, pericarp; IS, immature seed. The inset figure is a close-up image of fruit tissues. (B) the distribution of sequencing reads with 5ʹ and 3ʹ primers in libraries of different size. (C) the distribution of full-length and non-full-length sequences in libraries of different size.

To retrieve high-quality gene isoforms, the subreads were polished with Illumina sequencing reads from different tissues to correct sequencing errors, and then the redundant sequences with high similarity were filtered [30]. The resulting dataset (111,277 transcripts in total), which included multiple AS isoforms of transcripts, was used for further analysis. To annotate the transcripts, multiple public databases of gene resources were searched for sequence similarities (Supplementary Table S2). In total, 108,083 transcripts were annotated, and the majority of the transcripts were found in the Non-Redundant Protein database (Supplementary Table S2; Supplementary Dataset 1). To accurately quantify the expression levels of transcripts, short reads were generated on the Illumina sequencing platform for the five tissue types with three biological replicates. The average number of reads per library was about 46 M, and the average mapping rate of total reads was 75.57% (Supplementary Table S3). The expression levels were calculated by aligning short reads to the transcripts, and the distribution of fragments per kilobase of transcript per million mapped reads (FPKM) values of each library was calculated (Supplementary Fig. S2; Supplementary Dataset 2).

Identification of gene AS isoforms based on long and short reads without a reference genome

Due to the lack of a high-quality genome in C. japonica, the determination of AS isoforms was not trivial. We implemented a novel pipeline, called IsoSplitter, to identify AS sites based on the sequence alignment of transcripts. We adopted the alignment algorithms of SIM4 to determine the high-similarity regions for initial AS identification. As a tool designed to align cDNA to genomic DNA sequences, SIM4 determines the high-similarity regions (HSPs) by 12-mer screening followed by a dynamic programming algorithm. This has been shown to have high accuracy and efficiency [35]. To identify potential AS sites of transcripts, we designed a reverse-tracing method through the modified SIM4 program: the HSP regions were screened for ‘split-sites’ (sites that were adjacent and supported by another transcript) based on a core region of 15-mers; we then grouped potential gene isoforms and counted the occurrences of split-sites to reveal the transcript diversity (the details are presented in the Materials and Methods).
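The idea of a split-site can be illustrated with a toy comparison of two isoforms, one of which lacks an internal segment. The sketch below deliberately swaps in Python's `difflib.SequenceMatcher` in place of the modified SIM4 program, so it is not the IsoSplitter algorithm; the function name, the 15-base minimum block size (chosen to echo the 15-mer core), and the example sequences are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def find_split_sites(longer, shorter, min_block=15):
    """Report segments of `longer` that are absent from `shorter` while the
    flanking sequence matches on both sides.

    Returns a list of (start, end) coordinates on `longer`.
    """
    matcher = SequenceMatcher(None, longer, shorter, autojunk=False)
    # Keep only substantial matching blocks; this also drops the
    # zero-length sentinel block that difflib appends.
    blocks = [b for b in matcher.get_matching_blocks() if b.size >= min_block]
    sites = []
    for left, right in zip(blocks, blocks[1:]):
        gap_in_longer = right.a - (left.a + left.size)
        gap_in_shorter = right.b - (left.b + left.size)
        # A split-site: the shorter isoform is contiguous where the
        # longer isoform carries an extra segment.
        if gap_in_longer > 0 and gap_in_shorter == 0:
            sites.append((left.a + left.size, right.a))
    return sites


# Hypothetical toy isoforms: the middle segment is skipped in one of them.
exon1 = "ATGGCTAGCTAGGATCCGATCGATCGTACG"
exon2 = "TTTTTGGGGGCCCCC"
exon3 = "GATTACAGATTACACATGCATGCAAGCTTA"
isoform_long = exon1 + exon2 + exon3
isoform_short = exon1 + exon3
```

Calling `find_split_sites(isoform_long, isoform_short)` reports the coordinates of the skipped segment on the longer isoform.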

We aligned the short reads to validate the AS sites by screening for junction reads (reads that map partially next to a predicted AS site and split exclusively at that location). The IsoSplitter pipeline was efficient in identifying AS sites. In total, 61,838 of the above-mentioned 111,277 transcripts had at least one AS site (55.6%; Supplementary Dataset 3), and 257,692 AS sites were identified based on the SIM4 alignments (Fig. 2A, B; Supplementary Dataset 3). To further evaluate the AS sites, we mapped the short reads from the different tissue types; 13,068 transcripts with at least one AS site were validated, of which 6,373 were found in all tissues (Fig. 2C, D; Supplementary Dataset 4). There were 51,527 AS sites supported by the junction reads from all tissues, and 28,889 sites were uncovered in every tissue type (Fig. 2B). These results indicate that the IsoSplitter pipeline is effective at identifying gene AS sites without reference genome information.

Figure 2.

Identification and analysis of gene alternative splicing isoforms based on long and short reads in Camellia japonica. (A) the number of AS sites that were discovered and validated with Illumina sequencing reads using the IsoSplitter pipeline. (B) a Venn diagram of AS sites validated using Illumina sequencing reads from different tissue types. (C) the number of AS transcripts that were discovered and validated with Illumina sequencing reads using the IsoSplitter pipeline. (D) a Venn diagram of AS transcripts with at least one AS site validated using Illumina sequencing reads from different tissue types.

DETs and tissue-specific expression of AS isoforms in distinctive tissue types

We performed a statistical analysis to identify DETs between tissue types (Fig. 3A); in total, we identified 48,487 DETs in the five tissue types (FDR < 0.005; Fold-change > 2; Supplementary Dataset 5). The IS versus PE comparison had the smallest number of DETs among all comparisons of tissue types (Fig. 3A). The expression levels of all DETs were used to perform correlation analyses among the samples. All replicates were highly correlated, indicating the reproducibility of the gene expression analysis; IS and PE displayed a particularly high correlation, which agrees with the smallest number of DETs being found between the two (Supplementary Fig. 3). These results indicate that a long-read transcriptome coupled with short reads is suitable for gene expression studies.

Figure 3.

The distribution of DETs and functional enrichment analyses of tissue-specific isoforms in five distinctive tissue types. (A) the distribution of up- and down-regulated DETs between tissue types. (B) the GO enrichment analysis of seed kernel-specific AS isoforms. The GO terms were summarized using REVIGO (http://revigo.irb.hr/), and the degree of red colour indicates the significance (P-value) of the enrichment as listed in Supplementary Dataset 6. (C) the normalized expression (Z-score) of seed kernel-specific AS genes (ACM) that were annotated as lipid biosynthesis genes. (LUP, beta-amyrin synthase; PLD, Phospholipase D; SQE, Squalene Epoxidase; EH, Epoxide hydrolase; fadD, Long chain acyl-CoA synthetase).

To investigate the gene isoforms that are found specifically in individual tissues, we obtained the tissue-specific isoforms based on the IsoSplitter analysis and performed functional enrichment analysis. The isoforms that were supported by short ‘junction reads’ in a tissue type were used to reveal the tissue-specific AS events. We performed Gene Ontology (GO) enrichment to identify pathways that were related to tissue-specific gene isoforms (Fig. 3B; Supplementary Dataset 6). The 30 most enriched GO terms were analysed to reveal the biological processes with tissue-specific isoforms; we found that many enriched GO terms were consistent with the function of the tissues. For example, ‘photosynthetic electron transport’ was enriched in leaves, and ‘regulation of embryonic development’ was enriched in seed kernel (Fig. 3B; Supplementary Fig. 4; Supplementary Dataset 6). We further investigated the tissue-specific expression levels of enriched isoforms of seed kernel through the ACM quantification method (see Materials and Methods for details; Supplementary Dataset 7). We found that genes involved in lipid biosynthesis, including beta-amyrin synthase (LUP) and Squalene Epoxidase (SQE), were highly expressed in seed kernels (Fig. 3C).

Identification and characterization of lncRNAs in C. japonica

The obtained transcriptome was searched for lncRNAs; in total, 20,734 transcripts were identified as lncRNAs (Supplementary Dataset 8). The majority of lncRNAs were between 1 and 2 kb in length (63%, Fig. 4A). We then compared the lncRNAs from C. japonica with those of various plant species. Only a small number of lncRNAs displayed sequence homology to distantly related species: 318 and 513 lncRNAs were homologous to Populus and Vitis species, respectively (Fig. 4B), while a large number of lncRNAs (17,842, 86.1%) were homologous to those of a closely related species, Camellia sinensis (Fig. 4B). To identify potential miRNA-harbouring lncRNAs, we aligned the mature sequences of miRNAs from previous studies in Camellia species [36] and showed that 720 lncRNAs matched mature miRNAs, indicating that these are potential miRNA-harbouring lncRNAs. We examined the expression of lncRNAs in the tissue types and found that the average expression levels displayed minor variations (Fig. 4C).
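The screen for miRNA-harbouring lncRNAs can be approximated by an exact substring search. This is a simplified sketch, assuming exact sense-strand matches only (the alignment criteria used in the study are not specified here); the function name and the example lncRNA are hypothetical, while the miR167 sequence is the one given in the Materials and Methods.

```python
def mirna_harbouring(lncrnas, mirnas):
    """Map each lncRNA ID to the mature miRNAs found inside its sequence.

    lncrnas, mirnas: dicts of ID -> sequence (DNA or RNA alphabet).
    Only exact matches on the given strand are reported.
    """
    def normalise(seq):
        return seq.upper().replace("U", "T")

    mature = {name: normalise(seq) for name, seq in mirnas.items()}
    hits = {}
    for lnc_id, lnc_seq in lncrnas.items():
        target = normalise(lnc_seq)
        matched = sorted(name for name, seq in mature.items() if seq in target)
        if matched:
            hits[lnc_id] = matched
    return hits


mirnas = {"miR167": "tgaagctgccagcatgatctg"}
lncrnas = {
    "lnc_example_1": "GGCC" + "TGAAGCTGCCAGCATGATCTG" + "AATT",  # hypothetical
    "lnc_example_2": "ACGTACGTACGTACGTACGTACGTACGT",             # hypothetical
}
```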

Figure 4.

Characterization of lncRNA and their associated miRNAs in Camellia. (A) the distribution of length of lncRNAs. (B) the homologous lncRNA in different plant species and potential miRNA-harbouring lncRNAs in Camellia japonica. (C) the expression of lncRNA in different plant tissue types.

Polyadenylation patterns of the C. japonica transcriptome

To investigate the polyadenylation (polyA) site of transcripts, we first translated the transcripts to retrieve the coding sequences, and the 3ʹ UTR sequences were clipped for further analysis. The polyA tails were identified based on a sliding window scanning with 10 nucleotides in length containing at least nine As. The 50 bp sequences upstream of a polyA tail were retrieved, and sequences with the motif ‘AATAAA’ were kept for the identification of PAS. We used an exhaustive counting program to identify potential signatures [42]. We found that ‘AATAAA’ was most frequently discovered (Fig. 5A); and ‘ATAAAA’, ‘AAATAA’, and ‘ATAAAT’ were the abundant ones associated with the ‘AATAAA’ motif (Fig. 5A).
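The tail detection and hexamer counting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exhaustive counting program cited in the text: the function names and the toy 3ʹ UTR are hypothetical, and the first qualifying window is taken as the start of the tail.

```python
from collections import Counter


def find_polya_start(seq, window=10, min_a=9):
    """Return the start of the first window of `window` nt holding at least
    `min_a` adenines, or None if no such window exists."""
    seq = seq.upper()
    for i in range(len(seq) - window + 1):
        if seq[i:i + window].count("A") >= min_a:
            return i
    return None


def upstream_hexamers(seq, tail_start, upstream=50, k=6):
    """Count all k-mers in the `upstream` nt preceding the polyA tail."""
    region = seq.upper()[max(0, tail_start - upstream):tail_start]
    return Counter(region[i:i + k] for i in range(len(region) - k + 1))


# Hypothetical 3' UTR: spacer, canonical signal, spacer, then the tail.
utr = "GC" * 20 + "AATAAA" + "GCTGCTGCTGCTGCTG" + "A" * 20
tail_start = find_polya_start(utr)
hexamer_counts = upstream_hexamers(utr, tail_start)
```

Because one non-A base is tolerated per window, the reported start can sit one base upstream of the first A of the tail.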

Figure 5.

The analysis of PAS in Camellia japonica. (A) the distribution of PAS identified in Camellia japonica using transcripts. (B) the PAS sites enriched in AS isoforms. The red colour indicates the PAS sites that are significantly enriched compared with the non-AS transcripts.

Furthermore, we analysed the frequency of the top 15 hexamers in the AS transcripts. We tested the probability of occurrence of each hexamer in the group of AS genes and found that the frequency of the AATAAA motif was not significantly different from that of the whole gene set (Fig. 5B). We also showed that eight hexamers had higher probabilities of occurrence in AS isoforms, suggesting that AS transcripts might have different preferences for PAS sites (Fig. 5B).

Identification of a new phasiRNA locus potentially involved in cold and heat stresses of Camellia

The phased secondary small RNA loci were predicted using miRNA and the long-read transcriptome. We found that 182 transcripts were potential phasiRNA loci (Supplementary Table S4). Among these, certain loci, including auxin responsive factor, auxin signalling F-box, and zinc-finger domain containing protein, had been reported in other plant lineages, suggesting conserved evolutionary origins. We also noticed that 41 transcripts encoding lipoxygenase were predicted as potential phasiRNA loci (Supplementary Table S4); all of these transcripts contained a region of 252 bp that could potentially generate 12 siRNAs of 21 bp each (Fig. 6A). To further evaluate this region as a phasiRNA locus, we combined the small RNA datasets from C. japonica and C. azalea to identify the secondary siRNAs. We found that only five 21-bp siRNAs (exact matches) were obtained (Fig. 6A).
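The arithmetic behind the locus is that a 252 bp region divides into exactly 12 consecutive 21 bp registers. The sketch below cuts a region into such registers and counts exact matches among small RNA reads; it is a simplified illustration (sense strand only, no 2-nt overhang offset), the function names are hypothetical, and the sequence is randomly generated rather than taken from the lipoxygenase transcripts.

```python
import random
from collections import Counter


def phased_segments(region, phase=21):
    """Split a region into consecutive, non-overlapping `phase`-nt segments."""
    if len(region) % phase != 0:
        raise ValueError("region length must be a multiple of the phase size")
    return [region[i:i + phase] for i in range(0, len(region), phase)]


def count_exact_matches(segments, small_rna_reads):
    """Count reads that exactly match each phased segment (1-based index)."""
    read_counts = Counter(small_rna_reads)
    return {i + 1: read_counts[seg] for i, seg in enumerate(segments)}


# Hypothetical 252 bp region generated with a fixed seed for reproducibility.
rng = random.Random(0)
region = "".join(rng.choice("ACGT") for _ in range(252))
segments = phased_segments(region)
reads = [segments[0], segments[0], segments[4]]  # hypothetical small RNA reads
support = count_exact_matches(segments, reads)
```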

Figure 6.

Identification and expression analyses of a phasiRNA locus from lipoxygenases. (A) the 252 bp region was identified with support from small RNA sequencing data from Camellia japonica and Camellia azalea [31,36]. The numbers are counts of 21-bp siRNA fragments identified by deep sequencing. (B) the expression of each potential siRNA using probes for real-time quantitative PCR analysis in cold and heat treatments. The y-axis indicates the relative expression values. miR167 was used as a control. (C) a heatmap plot of correlations of expression of potential siRNAs.

It has been shown that lipoxygenases can be induced by thermal stresses. We reasoned that the production of secondary siRNAs from this locus might be responsive to stresses. We designed 12 short siRNA probes and performed an expression analysis under low- and high-temperature treatments. The short probes M1–M4 displayed a consistent induction of expression upon both cold and heat stresses (Fig. 6B), and probes M7, M10, and M12 were induced under the −5 and 42°C treatments (Fig. 6B). We performed a correlation analysis of the expression of the probes, and a high correlation among M1–M5 was observed (Fig. 6C), suggesting a ‘one-hit’ model at the 5ʹ region of the phasiRNA locus. However, high correlations between M6–M7 and M11–M12 were also observed (Fig. 6C), suggesting a complex origin of the secondary siRNA biosynthesis. To further investigate the functions of the secondary siRNAs, we predicted the potential targets using a transcriptome assembly of C. japonica [31]. We found that, in addition to the lipoxygenase genes, the secondary siRNAs were predicted to target many other genes, including protein phosphatase, glycosyl hydrolase, protein kinase, and more (Supplementary Dataset 9). We performed gene expression analysis of some potential targets and showed that RAN GTPase, Xyloglucan endotransglucosylase, and ATPase were differentially expressed in response to heat and cold stresses (Supplementary Fig. 5). These results suggest that the secondary siRNAs might regulate downstream gene expression in a trans-acting manner.
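A pairwise correlation matrix of probe expression, of the kind summarized in the heatmap, can be computed as follows. The text does not state which correlation coefficient was used, so Pearson's r is an assumption here; the function names and the expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(profiles):
    """Return a nested dict of pairwise correlations between named profiles."""
    names = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in names} for a in names}


# Hypothetical relative expression values across four treatments.
profiles = {
    "M1": [1.0, 2.0, 4.0, 8.0],
    "M2": [2.0, 4.0, 8.0, 16.0],
    "M3": [8.0, 4.0, 2.0, 1.0],
}
matrix = correlation_matrix(profiles)
```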

Discussion

Transcriptome sequencing found a great complexity of AS in plant cells. The roles of AS in gene regulation have been found to be closely related to plant development, growth, and stress resistance [1,47]. The recent development of single-molecule sequencing technologies has provided an efficient way to obtain complete transcripts that can be used for AS, PAS, and lncRNA analyses [7]. Additionally, the use of Iso-Seq in diverse plant species has the potential to be an important tool for studying the genomic basis of adaptations.

We combined long-read and short-read transcriptome sequencing approaches in Camellia japonica, which lacks a reference genome. With the development of novel bioinformatic pipelines, we characterized transcriptome-wide AS patterns, APA, and non-coding RNAs; the integrative analysis of Iso-Seq transcripts and small RNAs uncovered a new phasiRNA locus that may be involved in the response to temperature stresses.

IsoSplitter is a novel pipeline for AS identification using long-reads transcriptome for non-model species

Single-molecule transcriptome sequencing is a useful technology for unravelling AS isoforms, especially for species without reference genomes. Our design for IsoSplitter can efficiently identify AS sites by aligning isoform sequences. In this study, the screening of the transcriptome yielded 61,838 transcripts with at least one AS site out of 111,277 (Fig. 2C). For comparison with a previous analysis pipeline, we applied the method described by Liu et al. [16] (based on BLAST 2.2.31+ with a cut-off E-value of 1e-15) to search for homologous sequences and obtained 906 pairs of AS isoforms (not shown) in C. japonica; the discovery rate of the SIM4-based alignment algorithm was therefore considerably higher. Another key feature of IsoSplitter is that, if short-read RNAseq data are available, it can map the short reads and use the junction reads to validate the predicted AS sites. Using Illumina reads from five tissue types, we validated 13,068 transcripts with at least one AS site (Fig. 2B); based on the short-read analysis, tissue-specific AS isoforms were revealed for further analyses (Fig. 3B, C). Iso-Seq is commonly combined with short-read sequencing in plants (Wang et al., 2019b; Xu et al., 2015), so this pipeline can be a powerful way to determine AS sites and tissue-specific AS isoforms. This study of Camellia japonica provides a comprehensive example of an integrative analysis of both long-read and short-read transcriptomes to uncover AS sites when no reference genome is available.

Differences in APA between AS and non-AS transcripts

Polyadenylation is a key step in mRNA maturation, and it also plays an important role in the regulation of translation. The Iso-Seq pipeline has been shown to be an efficient means of identifying APA isoforms [7,18]. To investigate the PAS of transcripts in C. japonica, the 3ʹ-UTR sequences were retrieved based on the coding sequences. We showed that ‘AATAAA’ was the most frequent PAS, which is consistent with studies in corn and sorghum [9]. Some hexamers, including ‘ATATAT’ and ‘TATATA’, which are abundant in corn and sorghum, were not among the top 15 hexamers in C. japonica, suggesting a different PAS preference (Fig. 5A). However, because no reference genome is available, the identification of 3ʹ-UTRs and polyA tails might bias the selection of sequences, which could lead to the omission of some high-frequency PAS. Our enrichment analysis showed that eight hexamers were significantly enriched in the AS isoform group, which indicates that AS transcripts might have a distinct mechanism of polyA tail processing. PolyA processing has been found to be highly correlated with splicing [48], and this result suggests that genes with AS might produce different 3ʹ-UTR sequences.
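The hexamer comparison described above can be sketched in Python as follows. This is an illustrative outline, not the analysis code used in the study: the 50-base window upstream of the polyA site, the per-transcript presence counting, and the one-sided hypergeometric test are all assumptions made for the example, and the function names are hypothetical. A real analysis would also need a multiple-testing correction across all hexamers.

```python
from collections import Counter
from math import comb
from typing import Iterable


def hexamer_presence(utrs: Iterable[str], window: int = 50, k: int = 6) -> Counter:
    """Count how many 3'-UTRs contain each k-mer in their last `window` bases.

    Each k-mer is counted at most once per UTR.
    """
    counts = Counter()
    for seq in utrs:
        tail = seq.upper()[-window:]
        kmers = {tail[i:i + k] for i in range(len(tail) - k + 1)}
        counts.update(kmers)
    return counts


def hypergeom_upper_tail(hits_as: int, n_as: int,
                         hits_total: int, n_total: int) -> float:
    """P(X >= hits_as) when n_as transcripts are drawn at random from
    n_total transcripts, of which hits_total carry the hexamer."""
    max_hits = min(n_as, hits_total)
    numerator = sum(
        comb(hits_total, x) * comb(n_total - hits_total, n_as - x)
        for x in range(hits_as, max_hits + 1)
    )
    return numerator / comb(n_total, n_as)


# Toy input: two AS and two non-AS 3'-UTR tails (made-up sequences).
as_utrs = ["GGGGAATAAAGGGG", "CCAATAAACC"]
non_as_utrs = ["GGGGGGGGGGGG", "CCCCCCCCCCCC"]

as_counts = hexamer_presence(as_utrs)
all_counts = hexamer_presence(as_utrs + non_as_utrs)
p_value = hypergeom_upper_tail(
    as_counts["AATAAA"], len(as_utrs),
    all_counts["AATAAA"], len(as_utrs) + len(non_as_utrs),
)
print(as_counts["AATAAA"], p_value)
```

Counting presence per transcript rather than total occurrences keeps one long, repetitive 3ʹ-UTR from dominating the comparison; whether that choice suits a given dataset is a judgment call.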

A newly evolved phasiRNA locus in Camellia

A diverse range of plant lineages feature phasiRNAs, suggesting that they have a deep evolutionary origin [49,50]. Recent evidence indicates that both 21 bp and 24 bp secondary siRNA-producing loci are widely distributed in plants [51]. PhasiRNA-generating loci are found in both protein-coding and non-coding transcripts [25]. Using the isoform sequences and small RNA sequencing data, we predicted 183 transcripts as potential phasiRNA loci, including some conserved loci encoding myeloblastosis transcription factors, nucleotide-binding leucine-rich repeat proteins, pentatricopeptide repeat proteins, auxin-related F-box (AFB) proteins, and others [Supplementary Table 4; 24]. We also predicted a locus in lipoxygenase transcripts that contains a 252 bp region from which 12 consecutive 21 bp secondary siRNAs can potentially be produced (Fig. 6A). This locus appears to be a newly evolved phasiRNA transcript in Camellia, as it has not been discovered in other plant lineages. Using small RNA sequencing data from various plant tissues [36], we showed that the secondary siRNAs were expressed at low levels, and that heat and cold stresses increased their levels (Fig. 6). Previous studies have shown that lipoxygenases belong to a large gene family that is involved in several biotic and abiotic responses in C. sinensis [52,53]. Members of the lipoxygenase family in C. sinensis undergo extensive AS, which leads to the truncation of some proteins that might have regulatory functions in biotic and abiotic stress responses [53]. A potential siRNA-producing region was found in at least 42 transcripts in the isoform dataset (Supplementary Table 4), partly because of AS isoforms, which suggests that it may have important regulatory functions for downstream genes.
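The arithmetic behind the phased region (12 consecutive 21 bp registers spanning 252 bp) can be made concrete with a short Python sketch. It is a simplified illustration rather than the prediction method used in the study: it assumes that small RNA reads have already been mapped to one transcript, uses only their start positions, ignores strand, and reports the share of reads falling in the most common 21-base register. The function names and the example coordinates are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def phase_windows(start: int, n_cycles: int = 12,
                  phase_len: int = 21) -> List[Tuple[int, int]]:
    """Return consecutive (start, end) windows, end-exclusive, for a phased locus."""
    return [(start + i * phase_len, start + (i + 1) * phase_len)
            for i in range(n_cycles)]


def dominant_register(read_starts: Sequence[int],
                      phase_len: int = 21) -> Tuple[int, float]:
    """Return the most common register (start position modulo phase_len)
    and the fraction of reads that fall in it."""
    if not read_starts:
        raise ValueError("read_starts must not be empty")
    counts = Counter(pos % phase_len for pos in read_starts)
    register, hits = counts.most_common(1)[0]
    return register, hits / len(read_starts)


# Twelve 21-base windows starting at an arbitrary position 100.
windows = phase_windows(100)
print(len(windows), windows[-1][1] - windows[0][0])

# Four reads in phase with position 100 and one out-of-phase read.
print(dominant_register([100, 121, 142, 163, 110]))
```

A high in-phase fraction over many consecutive windows is the kind of pattern that points to a phased locus; dedicated tools apply a formal phasing score and treat reads from the two strands separately.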

Acknowledgments

We are grateful to two anonymous reviewers for critical comments on this work, and we thank Dr. Y-J Liu of the Chinese Academy of Forestry for helpful comments on the bioinformatic analyses.

Funding Statement

This work was supported by the Nonprofit Research Projects (CAFYBB2017SZ001) of the Chinese Academy of Forestry and the National Science Foundation of China (NSFC Grant 31870578).

Abbreviations

Iso-Seq: Isoform Sequencing
AS: Alternative Splicing
PolyA: Polyadenylation
phasiRNA: phased small interfering RNA
PacBio: Pacific Biosciences
lncRNAs: long non-coding RNAs
APA: alternative polyadenylation
miRNA: microRNA
siRNA: small interfering RNA
tasiRNAs: trans-acting small interfering RNAs
ROI: Reads of Insert
FPKM: Fragments Per Kilobase Million
HSPs: High Similarity Regions
DETs: Differentially Expressed Transcripts
GO: Gene Ontology
PAS: polyA signals
NCBI: National Center for Biotechnology Information

Author Contribution

HY and JL conceived the project and analysed the data. ZH, TL, and CY collected the samples and performed gene expression experiments. YW and NY were involved in the bioinformatic analysis and software development. XL and ZF carried out data processing and sample preparation. ZH and HY drafted the manuscript, and all authors edited the paper.

Data accessibility

All original data associated with this work are deposited in NCBI BioProject PRJNA564707.

Disclosure of potential conflicts of interest

No potential conflict of interest was reported by the authors.

Supplemental material

Supplemental data for this article can be accessed here.

References

  • [1]. Jabre I, Reddy ASN, Kalyna M, et al. Does co-transcriptional regulation of alternative splicing mediate plant stress responses? Nucleic Acids Res. 2019;47(6):2716–2726.
  • [2]. Rigo R, Bazin J, Crespi M, et al. Alternative splicing in the regulation of plant–microbe interactions. Plant Cell Physiol. 2019;60(9):1906–1916.
  • [3]. Wang Z, Gerstein M, Snyder M. RNA-Seq: A revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57.
  • [4]. Nakano K, Shiroma A, Shimoji M, et al. Advantages of genome sequencing by long-read sequencer using SMRT technology in medical area. Human Cell. 2017;30(3):149–161.
  • [5]. Rhoads A, Au KF. PacBio sequencing and its applications. Genomics Proteomics Bioinformatics. 2015;13(5):278–289.
  • [6]. Ameur A, Kloosterman WP, Hestand MS. Single-molecule sequencing: towards clinical applications. Trends Biotechnol. 2019;37(1):72–85.
  • [7]. Wang B, Kumar V, Olson A, et al. Reviving the transcriptome studies: an insight into the emergence of single-molecule transcriptome sequencing. Front Genet. 2019a;10:384. DOI: 10.3389/fgene.2019.00384.
  • [8]. Wang B, Tseng E, Regulski M, et al. Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nat Commun. 2016;7(1):11708.
  • [9]. Wang B, Regulski M, Tseng E, et al. A comparative transcriptional landscape of maize and sorghum obtained by single-molecule sequencing. Genome Res. 2018;28(6):921–932.
  • [10]. Chao Y, Yuan J, Li S, et al. Analysis of transcripts and splice isoforms in red clover (Trifolium pratense L.) by single-molecule long-read sequencing. BMC Plant Biol. 2018;18(1):300.
  • [11]. Cheng B, Furtado A, Henry RJ. Long-read sequencing of the coffee bean transcriptome reveals the diversity of full-length transcripts. GigaScience. 2017;6(11). DOI: 10.1093/gigascience/gix086.
  • [12]. Du H, Yu Y, Ma Y, et al. Sequencing and de novo assembly of a near complete indica rice genome. Nat Commun. 2017;8(1):15324.
  • [13]. Hoang NV, Furtado A, Mason PJ, et al. A survey of the complex transcriptome from the highly polyploid sugarcane genome using full-length isoform sequencing and de novo assembly from short read sequencing. BMC Genomics. 2017;18(1):395.
  • [14]. Wang T, Wang H, Cai D, et al. Comprehensive profiling of rhizome-associated alternative splicing and alternative polyadenylation in moso bamboo (Phyllostachys edulis). Plant J. 2017;91(4):684–699.
  • [15]. Zhang G, Sun M, Wang J, et al. PacBio full-length cDNA sequencing integrated with RNA-seq reads drastically improves the discovery of splicing transcripts in rice. Plant J. 2019;97(2):296–305.
  • [16]. Liu X, Mei W, Soltis PS, et al. Detecting alternatively spliced transcript isoforms from single-molecule long-read sequences without a reference genome. Mol Ecol Resour. 2017;17(6):1243–1256.
  • [17]. Fu S, Ma Y, Yao H, et al. IDP-denovo: de novo transcriptome assembly and isoform annotation by hybrid sequencing. Bioinformatics. 2018;34(13):2168–2176.
  • [18]. An D, Cao H, Li C, et al. Isoform sequencing and state-of-art applications for unravelling complexity of plant transcriptomes. Genes (Basel). 2018;9(1):43.
  • [19]. Kim E-D, Sung S. Long noncoding RNA: unveiling hidden layer of gene regulatory networks. Trends Plant Sci. 2012;17(1):16–21.
  • [20]. Liu J, Wang H, Chua NH. Long noncoding RNA transcriptome of plants. Plant Biotechnol J. 2015;13(3):319–328.
  • [21]. Wu H-J, Wang Z-M, Wang M, et al. Widespread long noncoding RNAs as endogenous target mimics for microRNAs in plants. Plant Physiol. 2013;161(4):1875–1884.
  • [22]. Jin J, Liu J, Wang H, et al. PLncDB: plant long non-coding RNA database. Bioinformatics. 2013;29(8):1068–1071.
  • [23]. Szcześniak MW, Rosikiewicz W, Makałowska I. CANTATAdb: A collection of plant long non-coding RNAs. Plant Cell Physiol. 2015;57(1):e8–e8.
  • [24]. Borges F, Martienssen RA. The expanding world of small RNAs in plants. Nat Rev Mol Cell Biol. 2015;16(12):727.
  • [25]. Fei Q, Xia R, Meyers BC. Phased, secondary, small interfering RNAs in posttranscriptional regulatory networks. Plant Cell. 2013;25(7):2400–2415.
  • [26]. Manavella PA, Koenig D, Weigel D. Plant secondary siRNA production determined by microRNA-duplex structure. Proc Nat Acad Sci. 2012;109(7):2461–2466.
  • [27]. Yan C, Lin P, Lyu T, et al. Unraveling the roles of regulatory genes during domestication of cultivated Camellia: evidence and insights from comparative and evolutionary genomics. Genes (Basel). 2018;9(10):488.
  • [28]. Wei C, Yang H, Wang S, et al. Draft genome sequence of Camellia sinensis var. sinensis provides insights into the evolution of the tea genome and tea quality. Proc Nat Acad Sci. 2018;115(18):E4151–E4158.
  • [29]. Xia E-H, Zhang H-B, Sheng J, et al. The tea tree genome provides insights into tea flavor and independent evolution of caffeine biosynthesis. Mol Plant. 2017;10(6):866–877.
  • [30]. Li W, Godzik A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659.
  • [31]. Li X, Li J, Fan Z, et al. Global gene expression defines faded whorl specification of double flower domestication in Camellia. Sci Rep. 2017;7(1):3197.
  • [32]. Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods. 2012;9(4):357.
  • [33]. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12(1):323.
  • [34]. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–140.
  • [35]. Florea L, Hartzell G, Zhang Z, et al. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998;8(9):967–974.
  • [36]. Yin H, Fan Z, Li X, et al. Phylogenetic tree-informed microRNAome analysis uncovers conserved and lineage-specific miRNAs in Camellia during floral organ development. J Exp Bot. 2016;67(9):2641–2653.
  • [37]. Dai X, Zhuang Z, Zhao PX. psRNATarget: a plant small RNA target analysis server (2017 release). Nucleic Acids Res. 2018;46(W1):49–54.
  • [38]. Li A, Zhang J, Zhou Z. PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme. BMC Bioinformatics. 2014;15(1):311.
  • [39]. Kong L, Zhang Y, Ye Z-Q, et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 2007;35(suppl_2):W345–W349.
  • [40]. Tuskan GA, Difazio S, Jansson S, et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science. 2006;313(5793):1596–1604.
  • [41]. Jaillon O, Aury J-M, Noel B, et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007;449:463.
  • [42]. Wu X, Ji G, Li QQ. Computational analysis of plant polyadenylation signals. In: Hunt AG, Li QQ, editors. Polyadenylation in plants: methods and protocols. New York: Springer New York; 2015. p. 3–11.
  • [43]. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc. 1995;57:289–300.
  • [44]. Kohl M, Wiese S, Warscheid B. Cytoscape: software for visualization and analysis of biological networks. In: Data mining in proteomics. Clifton (NJ): Springer; 2011. p. 291–303.
  • [45]. Stocks MB, Moxon S, Mapleson D, et al. The UEA sRNA workbench: a suite of tools for analysing and visualizing next generation sequencing microRNA and small RNA datasets. Bioinformatics. 2012;28(15):2059–2061.
  • [46]. Beckers M, Mohorianu I, Stocks M, et al. Comprehensive processing of high-throughput small RNA sequencing data including quality checking, normalization, and differential expression analysis using the UEA sRNA workbench. RNA. 2017;23(6):823–835.
  • [47]. Staiger D, Brown JW. Alternative splicing at the intersection of biological timing, development, and stress responses. Plant Cell. 2013;25(10):3640–3656.
  • [48]. Proudfoot NJ. Ending the message: poly (A) signals then and now. Genes Dev. 2011;25(17):1770–1782.
  • [49]. Xia R, Xu J, Arikit S, et al. Extensive families of miRNAs and PHAS loci in Norway spruce demonstrate the origins of complex phasiRNA networks in seed plants. Mol Biol Evol. 2015;32(11):2905–2918.
  • [50]. Xia R, Zhu H, An Y-Q, et al. Apple miRNAs and tasiRNAs with novel regulatory networks. Genome Biol. 2012;13(6):R47.
  • [51]. Xia R, Chen C, Pokhrel S, et al. 24-nt reproductive phasiRNAs are broadly present in angiosperms. Nat Commun. 2019;10(1):627.
  • [52]. Liu S, Han B. Differential expression pattern of an acidic 9/13-lipoxygenase in flower opening and senescence and in leaf response to phloem feeders in the tea plant. BMC Plant Biol. 2010;10(1):228.
  • [53]. Zhu J, Wang X, Guo L, et al. Characterization and alternative splicing profiles of the lipoxygenase gene family in tea plant (Camellia sinensis). Plant Cell Physiol. 2018;59(9):1765–1781.

Articles from RNA Biology are provided here courtesy of Taylor & Francis