PeerJ. 2019 Nov 15;7:e7933. doi: 10.7717/peerj.7933

Single-molecule real-time sequencing identifies massive full-length cDNAs and alternative-splicing events that facilitate comparative and functional genomics study in the hexaploid crop sweet potato

Na Ding 1,2, Huihui Cui 1, Ying Miao 1, Jun Tang 3,4, Qinghe Cao 3,4, Yonghai Luo 1,2
Editor: Yuriy Orlov
PMCID: PMC6859871  PMID: 31741783

Abstract

Background

Sweet potato (Ipomoea batatas (L.) Lam.) is one of the most important crops in many developing countries and provides a candidate source of bioenergy. However, neither a complete reference genome nor large-scale full-length cDNA sequences for this outcrossing hexaploid crop are available, which in turn impedes progress in research studies in I. batatas functional genomics and molecular breeding.

Methods

In this study, we sequenced full-length transcriptomes in I. batatas and its diploid ancestor I. trifida by single-molecule real-time sequencing and Illumina second-generation sequencing technologies. With the generated datasets, we conducted comprehensive intraspecific and interspecific sequence analyses and experimental characterization.

Results

A total of 53,861/51,184 high-quality long-read transcripts were obtained, which covered about 10,439/10,452 loci in the I. batatas/I. trifida genome. These datasets enabled us to predict open reading frames successfully in 96.83%/96.82% of transcripts and identify 34,963/33,637 full-length cDNA sequences, 1,401/1,457 transcription factors, 25,315/27,090 simple sequence repeats, 1,656/1,389 long non-coding RNAs, and 5,251/8,901 alternative splicing events. Approximately 32.34%/38.54% of transcripts and 46.22%/51.18% of multi-exon transcripts underwent alternative splicing in I. batatas/I. trifida. Moreover, we validated one alternative splicing event in each of 10 genes and identified tuberous-root-specific expressed isoforms from a starch-branching enzyme, an alpha-glucan phosphorylase, a neutral invertase, and several ABC transporters. Overall, the collection and analysis of large-scale long-read transcripts generated in this study will serve as a valuable resource for the I. batatas research community, which may accelerate the progress in its structural, functional, and comparative genomics studies.

Keywords: Single-molecular real-time sequencing, Comparative transcriptome analysis, Alternative splicing, Sweet potato

Introduction

Sweet potato (Ipomoea batatas (L.) Lam.) is the seventh most important crop in the world and it ensures food supply and safety in many developing countries. I. batatas is a hexaploid plant with a complex and heterozygous genome (2n = 6x = 90; genome size of 3–4 gigabase pairs (Magoon, Krishnan & Vijaya, 1970; Ozias-Akins & Jarret, 1994)). A preliminary genome estimate has revealed two genome polyploidization events occurring about 0.8 and 0.5 million years ago (Yang et al., 2017). Nevertheless, a complete reference genome of I. batatas is still lacking, which hinders progress in the molecular dissection of its evolutionary scenario and agronomically important traits. Moreover, I. batatas is a self-incompatible and thus obligate outcrossing species (Martin, 1965). It is almost impossible to develop typical mapping populations such as F2 and recombinant inbred lines for constructing high-density linkage maps and classical genetic analyses. To date, no successful investigation in forward genetics (i.e., quantitative trait locus mapping and subsequent map-based cloning) of I. batatas has been reported. Therefore, RNA sequencing (i.e., RNA-seq, whole transcriptome shotgun sequencing (Wang, Gerstein & Snyder, 2009)) has been widely used as an attractive alternative to whole genome sequencing for gene mining in I. batatas (Schafleitner et al., 2010; Wang et al., 2010; Nurit et al., 2013). However, all reported transcriptomes in I. batatas were derived from second-generation sequencing platforms, which generate relatively short reads (i.e., hundreds of base pairs per read) and are disadvantageous in obtaining full-length transcripts (Koren et al., 2012). To date, the collection and analysis of large-scale full-length cDNA sequences, which are fundamental to structural and functional genomics studies, have not been carried out in I. batatas.

Ipomoea trifida (H.B.K.) G. Don has been considered the diploid ancestor of I. batatas, and accumulating evidence supports this hypothesis (Srisuwan, Sihachakr & Siljak-Yakovlev, 2006; Wu et al., 2018). Nevertheless, the evolutionary scenario underlying the origin and domestication of I. batatas remains unclear. Unlike I. batatas, I. trifida does not form tuberous roots, and thus comparative analysis of I. batatas and I. trifida may provide insights into the evolution and domestication of I. batatas. Although the reference genome of I. trifida became available recently (Wu et al., 2018) and short-read transcriptomes of I. trifida have been analyzed in a few projects (Cao et al., 2016; Ponniah et al., 2017), no study involving the large-scale collection and analysis of full-length cDNA sequences in I. trifida has been reported.

Long-read or full-length cDNA sequences are fundamental to structural and functional genomics studies. First, they provide complete information on transcribed sequences, which is required for gene function analyses. Second, they facilitate accurate predictions of gene models (i.e., to define the proper orientation, order, and boundaries of exons). Third, they may be utilized in validating or correcting the scaffold assembly in genome sequencing projects. Fourth, they are particularly useful for analyzing alternative splicing of transcript isoforms, which is important for increasing the transcriptome diversity and adaptation potential of an organism. In the past, collecting full-length cDNA sequences was expensive, labor intensive, and time consuming (Seki et al., 2002; Shoshi et al., 2003). The advent of a third-generation sequencing platform (i.e., single-molecule real-time (SMRT) sequencing) has revolutionized DNA sequencing and thus genome/transcriptome studies (Eid et al., 2010). Long reads of up to 20 kb in size, albeit with a relatively high error rate, can be produced by SMRT sequencing (Roberts, Carneiro & Schatz, 2013; Au et al., 2013). Today, high-throughput sequencing combining second-generation sequencing (to generate short reads with high base quality) and SMRT sequencing (to produce long reads with a relatively high error rate) has become an attractive option in genome and transcriptome studies (Au et al., 2013; Sharon et al., 2013; Xu et al., 2015). In the present study, we performed SMRT sequencing to generate large-scale full-length or long-read transcripts from I. batatas and I. trifida. Comprehensive intraspecific and interspecific sequence analyses were conducted, providing a valuable resource for the research community to explore the origin of I. batatas.

Materials & Methods

Plant material and RNA preparation

Xushu18, one of the most widely cultivated I. batatas varieties in China, was selected for transcriptome sequencing in this study. Eight tissues of young leaves, mature leaves, apical shoots, mature stems, fibrous roots, initiating tuberous roots, expanding tuberous roots, and mature tuberous roots from one individual were collected and pooled together in approximately equivalent weights (Figs. 1A–1H). Similarly, tissues of young leaves, mature leaves, shoots, stems, and roots of a diploid I. trifida plant were collected and pooled. Collected samples were frozen in liquid nitrogen immediately after collection and stored at −80 °C until use.

Figure 1. Plant materials used in this study and summary of PacBio RS II single-molecule real-time (SMRT) sequencing.

(A–H) Photos showing the developmental stages and overall morphology of eight tissues in I. batatas used for SMRT sequencing in this study. (A) Young leaves; (B) mature leaves; (C) apical shoots; (D) mature stems; (E) fibrous roots; (F) initiating tuberous roots; (G) expanding tuberous roots; (H) mature tuberous roots. The photos were adopted from our previous report (Ding et al., 2017). Number and length distributions of 220,035 reads in I. batatas (I) and 195,188 reads in I. trifida (J) from different PacBio libraries (fractionated size: 1–2, 2–3, >3 kb); Proportion of different types of PacBio reads in I. batatas (K) and I. trifida (L).

Total RNAs were extracted using Tiangen RNA preparation kits (Tiangen Biotech, Beijing, China) following the provided protocol. RNA quality and quantity were determined using a Nanodrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and a 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). Qualified RNA samples were subsequently used in constructing PacBio cDNA or RNA-seq libraries.

PacBio cDNA library construction and SMRT sequencing

cDNA was synthesized using a SMARTer PCR cDNA Synthesis Kit, optimized for preparing full-length cDNA (Takara Clontech Biotech, Dalian, China). Size fractionation and selection (1–2 kb, 2–3 kb, and >3 kb) were performed using the BluePippin™ Size Selection System (Sage Science, Beverly, MA, USA). The SMRTbell libraries were constructed with the Pacific Biosciences DNA Template Prep Kit 2.0. SMRT sequencing was then performed on the Pacific Biosciences RS II platform using the provided protocol.

Illumina RNA-Seq library construction and sequencing

The RNA-Seq libraries were constructed using a NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Beverly, MA, USA), following the manufacturer’s protocol. Qualified libraries were subjected to transcriptome sequencing on an Illumina HiSeq 2500 (Illumina, San Diego, CA, USA) to generate 150-bp paired-end sequence reads (2 × 150 bp). High-throughput sequencing reported in this study was performed at Biomarker Technology Co. (Beijing, China).

Quality filtering and error correction of SMRT long reads

The SMRT subreads were filtered, and reads of insert (ROIs) were obtained, using the standard protocols in the SMRT Analysis software suite (http://www.pacificbiosciences.com; parameters: minFullPass=0, minPredictedAccuracy=75). After examination for poly(A) signals and 5′ and 3′ adaptors, reads were classified as full-length or non-full-length cDNA reads. Consensus isoforms were identified using the algorithm of iterative clustering for error correction and further polished to obtain high-quality consensus isoforms. The raw Illumina reads were filtered to remove adaptor sequences, ambiguous reads with 'N' bases, and low-quality reads. Afterward, error correction of low-quality isoforms was conducted using the Illumina reads with the software proovread 2.13.841 (parameters: –coverage=50 –overwrite, –no-sampling) (Hackl et al., 2014). Redundant isoforms were then removed to generate a high-quality transcript dataset for each species (i.e., Ib53861 for I. batatas and It51184 for I. trifida, respectively) using the program CD-HIT 4.6.142 (parameters: -c 0.99 -T 6 -G 0 -aL 0.90 -AL 100 -aS 0.99 -AS 30 -o) (Li & Godzik, 2006).

Functional assignment of transcripts

Functional annotations were conducted by using BLASTX (cutoff E-value ≤ 1e−5) against different protein and nucleotide databases: COG (Clusters of Orthologous Groups; https://www.ncbi.nlm.nih.gov/COG/), GO (Gene Ontology; http://geneontology.org/), KEGG (Kyoto Encyclopedia of Genes and Genomes; https://www.kegg.jp/), Pfam (a database of conserved protein families or domains; http://pfam.xfam.org/), Swiss-Prot (a manually annotated, non-redundant protein database; https://www.uniprot.org/), TrEMBL (an automatically annotated protein database; https://www.uniprot.org/), and NR (NCBI non-redundant proteins; https://www.ncbi.nlm.nih.gov/). For each transcript in each database search, the functional information of the best-matched sequence was assigned to the query transcript.
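The best-hit assignment described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the search results have already been parsed into tuples of (query, subject, E-value, bit score), and the function name `best_hits` and the toy identifiers are hypothetical. Only the E-value cutoff of 1e−5 comes from the text.

```python
def best_hits(rows, max_evalue=1e-5):
    """Keep, for each query, the hit with the lowest E-value.

    rows: iterable of (query_id, subject_id, evalue, bitscore) tuples.
    Ties on E-value are broken by the higher bit score (an assumption).
    """
    best = {}
    for query, subject, evalue, bitscore in rows:
        if evalue > max_evalue:
            continue  # hit fails the E-value cutoff stated in the text
        current = best.get(query)
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical parsed hits for three transcripts.
rows = [
    ("tx1", "P1", 1e-30, 200.0),
    ("tx1", "P2", 1e-50, 310.0),
    ("tx2", "P3", 1e-3, 40.0),   # above the cutoff, so tx2 stays unannotated
    ("tx3", "P4", 1e-8, 60.0),
]
best = best_hits(rows)
for query, (subject, evalue, bitscore) in sorted(best.items()):
    print(query, subject, evalue, bitscore)
```

Running the same function once per database and merging the results would give a per-transcript annotation table of the kind summarized in File S2.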

Predictions of open reading frames and simple sequence repeats

To predict putative open reading frames (ORFs) in transcripts, we used the package TransDecoder v2.0.1 (https://transdecoder.github.io/) to define coding sequences (CDS). The predicted CDS were searched and confirmed by BLASTX (E-value ≤1e−5) against the protein databases of NR, SWISS-PROT, and KEGG. Those transcripts containing complete ORFs as well as 5′- and 3′-UTR (untranslated regions) were designated as full-length transcripts. To identify putative simple sequence repeats (SSRs) in our sequences, the tool MISA (MIcroSAtellite identification tool; http://pgrc.ipk-gatersleben.de/misa) was employed. Only transcripts that were ≥500 bp in size were included in SSR detection.
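To make the SSR step concrete, the sketch below finds perfect mono-, di-, and trinucleotide repeats with regular expressions. It is not MISA and does not reproduce its output. The minimum repeat counts are placeholders, because the source does not report the settings used; only the 500-bp transcript length filter comes from the text. The names `find_ssrs` and `is_primitive` are hypothetical.

```python
import re

# Hypothetical thresholds (unit size -> minimum repeat count); the source does
# not report the MISA settings, so these values are for illustration only.
MIN_REPEATS = {1: 10, 2: 6, 3: 5}
MIN_TRANSCRIPT_LEN = 500  # from the text: only transcripts >= 500 bp are scanned


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    return not any(
        len(motif) % k == 0 and motif == motif[:k] * (len(motif) // k)
        for k in range(1, len(motif))
    )


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    if len(seq) < MIN_TRANSCRIPT_LEN:
        return []
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One unit captured, then at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):  # skip e.g. "TT" runs already counted as "T"
                hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


# Toy transcript: repeat-free flanks around one (AG)8 and one (T)12 repeat.
flank = "ACGTTGCA" * 35
toy = flank + "AG" * 8 + flank + "T" * 12 + flank
print(find_ssrs(toy))
```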

Identification of transcription factor gene families

This was done according to our previous publication (Ding et al., 2017). Briefly, for each transcription factor gene family, the Hidden Markov Model (HMM) profile of the Pfam domain (when available) was downloaded from the Pfam database (http://pfam.xfam.org) and used as a query to survey all predicted proteins from our transcript datasets using HMMER (http://www.hmmer.org). When no HMM profile was available for a gene family, all protein sequences belonging to the gene family in A. thaliana were downloaded (http://www.arabidopsis.org) and used as query sequences to search our predicted protein datasets using BLASTP (E-value ≤1e−10). When two proteins shared an amino acid identity of 97% or higher, one of them was removed as redundant. The presence of characteristic domains in all identified non-redundant proteins was confirmed by searching the NCBI Conserved Domain Database (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). The confirmed protein sequences as well as their corresponding transcripts were compiled (only those gene families containing more than 10 members in at least one of our transcriptomes are presented).

Prediction of long non-coding RNAs

To sort non-coding RNAs from putative protein-coding ones, we employed four computational approaches: CPC (Kong et al., 2007), CNCI (Liang et al., 2013), Pfam (Finn et al., 2016), and CPAT (Wang et al., 2013). Putative protein-coding RNAs were filtered out using a minimum length and exon number threshold according to the instructions of the programs. For each species, the intersection of the four resulting lists was taken as the final set of lncRNA candidates.
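The final step, taking the intersection of the four tools' calls, amounts to a set intersection. A minimal sketch, assuming each tool's output has been reduced to a set of transcript IDs judged non-coding; the function name and the toy IDs are hypothetical.

```python
def consensus_lncrnas(*tool_calls):
    """Return the IDs called non-coding by every tool (set intersection)."""
    if not tool_calls:
        return set()
    return set.intersection(*(set(ids) for ids in tool_calls))


# Hypothetical transcript IDs flagged as non-coding by each approach.
cpc = {"t1", "t2", "t3", "t5"}
cnci = {"t1", "t3", "t4", "t5"}
pfam = {"t1", "t3", "t5", "t6"}  # here: transcripts without a protein domain hit
cpat = {"t1", "t2", "t3"}

candidates = consensus_lncrnas(cpc, cnci, pfam, cpat)
print(sorted(candidates))
```

Requiring agreement among all four tools is conservative: a transcript that any single tool scores as coding is excluded from the candidate list.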

Identification and validation of alternative splicing

To identify alternative splicing (AS) events, all transcripts of Ib53861 and It51184 were mapped to the genomic contigs of I. batatas (Yang et al., 2017) and I. trifida (Wu et al., 2018), respectively, by using the program GMAP (Wu & Watanabe, 2005). The tool AStalavista v3.2 was employed to identify putative AS events (Foissac & Sammeth, 2007). Subsequently, 16 AS events were selected and 10 of them were successfully confirmed by RT-PCR. Total RNA was isolated from the eight tissues in an I. batatas cultivar (Xushu22) as described above. The cDNA was synthesized using a cDNA Synthesis Kit (ProbeGene, China) and used as the template for PCR amplification. Afterward, PCR products were visualized in agarose gel.

Results

SMRT sequencing and generation of full-length transcriptomes

To obtain large-scale long-read transcripts for I. batatas and I. trifida, SMRT sequencing was performed using a PacBio RS II sequencing platform. Eight different tissues collected from a single plant of each species were pooled and used in mRNA extraction. Three size-fractionated, full-length cDNA libraries were constructed and subsequently sequenced in four SMRT cells (Figs. 1I and 1J; 1–2 kb for one cell, 2–3 kb for two cells, and >3 kb for one cell). In I. batatas, we obtained 220,035 reads of insert (total bases: 701,923,565), which included 49.9% full-length non-chimeric and 46.6% non-full-length reads (Table 1, Fig. 1K), whereas in I. trifida, 195,188 reads of insert (total bases: 527,497,043) were generated, of which 52.1% and 43.9% were full-length non-chimeric and non-full-length reads, respectively (Table 1, Fig. 1L).

Table 1. Summary of PacBio sequencing in this study.

Metric | I. batatas | I. trifida
Reads of insert of PacBio sequencing | 220,035 | 195,188
Bases of insert of PacBio sequencing (bp) | 701,923,565 | 527,497,043
Reads of Illumina sequencing for correction | 71,360,785 | 39,372,131
Bases of Illumina sequencing for correction (bp) | 17,972,706,252 | 11,772,267,169
Number of non-full-length PacBio reads | 102,510 | 85,680
Number of full-length non-chimeric PacBio reads | 109,814 | 101,630
Average length of full-length non-chimeric PacBio reads (bp) | 8,641 | 8,488
Number of non-redundant transcripts after correction | 53,861 | 51,184
N50 of non-redundant transcripts after correction (bp) | 2,933 | 2,642
Mean of non-redundant transcripts after correction (bp) | 2,421 | 2,190
Number of non-redundant full-length transcripts after correction | 34,963 | 33,637
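
The read-type percentages quoted in the text can be recomputed from the counts in Table 1. The short sketch below does so; the dictionary keys and the function name are illustrative choices, and the "other" category is simply the remainder that Table 1 does not itemize.

```python
# Counts copied from Table 1: reads of insert (roi), full-length non-chimeric
# reads (flnc), and non-full-length reads (nfl).
table1 = {
    "I. batatas": {"roi": 220_035, "flnc": 109_814, "nfl": 102_510},
    "I. trifida": {"roi": 195_188, "flnc": 101_630, "nfl": 85_680},
}


def read_fractions(counts):
    """Return the percentage of each read type among reads of insert."""
    roi = counts["roi"]
    flnc = 100 * counts["flnc"] / roi
    nfl = 100 * counts["nfl"] / roi
    other = 100 - flnc - nfl  # remainder not itemized in Table 1
    return {"flnc": round(flnc, 1), "nfl": round(nfl, 1), "other": round(other, 1)}


fractions = {species: read_fractions(c) for species, c in table1.items()}
for species, values in fractions.items():
    print(species, values)
```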

Given that SMRT sequencing has a high error rate, it is necessary to perform error correction, which includes self-correction by iterative clustering of circular-consensus reads and correction with high-quality Illumina short reads. To this end, cDNA libraries were prepared from the same samples that were used for SMRT sequencing, and deep RNA sequencing was conducted using an Illumina HiSeq 2500 platform. A total of 71,360,785 and 39,372,131 clean reads (total bases: 17,972,706,252 and 11,772,267,169, respectively) were obtained and used to correct the SMRT reads in I. batatas and I. trifida, respectively (Table 1). After error correction, redundant transcripts were removed. Finally, we obtained 53,861 transcripts for I. batatas (named Ib53861; N50: 2,933 bp; mean: 2,421 bp) and 51,184 for I. trifida (named It51184; N50: 2,642 bp; mean: 2,190 bp). Those transcripts containing complete coding sequences (CDSs) as well as 5′- and 3′-UTRs (untranslated regions) were defined as full-length transcripts. In total, 34,963 and 33,637 full-length transcripts were identified for I. batatas (named Ib34963) and I. trifida (named It33637), respectively (File S1).
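For readers unfamiliar with the N50 statistic reported alongside the mean length, the sketch below shows the standard definition: the length L such that sequences of length at least L contain half or more of all bases. The transcript lengths are hypothetical toy values, not data from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:  # longest sequences now cover half of all bases
            return length
    raise ValueError("empty length list")


toy_lengths = [500, 1200, 2000, 3100, 4200]  # hypothetical transcript lengths (bp)
print("N50:", n50(toy_lengths))
print("mean:", sum(toy_lengths) / len(toy_lengths))
```

Because N50 weights sequences by their length, it is usually larger than the mean for a skewed length distribution, which matches the pattern in Table 1 (for example, 2,933 bp against 2,421 bp in I. batatas).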

Basic sequence analysis of the full-length transcriptomes

The transcripts of Ib53861 and It51184 were functionally assigned and classified according to sequence similarities using BLASTx or tBLASTx (E-value ≤1e−5) against different protein and nucleotide databases. Overall, we successfully identified homologous sequences for 97.25% of Ib53861 and 97.34% of It51184 in the public databases, and the rates of successful validation in a single database ranged from 41.67% to 96.46% (File S2). These results indicate that most of the transcripts in our datasets are truly transcribed sequences in I. batatas and/or I. trifida. Furthermore, from the datasets of Ib53861 and It51184, 104,540/94,174 open reading frames (File S1), 25,315/27,090 simple sequence repeats (Files S3–S5), 1,401/1,457 transcription factors (Files S6–S8), 1,656/1,389 long non-coding RNAs (Files S9 and S10), and 5,251/8,901 alternative splicing events (Files S11–S14) were identified. These data provide fundamental information for functional genomics study and molecular breeding in I. batatas and comparative biology study between I. batatas and I. trifida.

Analysis of long non-coding RNA

Recent studies have shown that lncRNAs act as key regulators in a wide range of biological processes. In the present study, we identified in silico 1,656 and 1,389 candidate lncRNAs from Ib53861 and It51184, respectively (Figs. 2A and 2B; Files S9 and S10). Among them, 421 I. batatas and 355 I. trifida transcripts could be recognized as sense, intergenic, intronic, or antisense lncRNAs (Fig. 2C). Notably, there were only 344 common candidate lncRNAs (i.e., homologs in sequences; cutoff: identity >200 bp & >90%) between the identified 1,656 I. batatas and 1,389 I. trifida transcripts, suggesting remarkable divergence in lncRNA biogenesis and thus their regulatory mechanisms between the two species (Fig. 2D). These data suggest that different lncRNA members may be involved in different tissue/organ developmental processes in I. batatas.

Figure 2. Analysis of putative long noncoding RNA (lncRNA).

Venn Diagrams of lncRNAs predicted from Ib53861 (A) and It51184 (B) by four programs (CPAT, CPC, CNCI, and Pfam). (C) Types and numbers of lncRNAs that could be clarified in our analysis. (D) Homolog relationship of predicted lncRNAs between I. batatas and I. trifida.

Analysis of alternative splicing

Alternative splicing (AS) is a posttranscriptional regulatory mechanism that increases transcriptome diversity, yet little is known about its roles in the development of tuberous roots and the evolution of I. batatas. In the present study, we identified 5,251 and 8,901 AS events among 10,562 and 17,826 transcript isoforms in I. batatas and I. trifida, respectively (Table 2; Files S11–S14). The AS events were divided into five major types: intron retention (IR), alternative 3′ splice site (A3SS), alternative 5′ splice site (A5SS), exon skipping (ES), and mutually exclusive exon (MEX; Fig. 3A). The proportion of each AS type was comparable between I. batatas and I. trifida and the majority of AS events were IR in either species (Fig. 3B). Overall, the alternatively spliced isoforms accounted for 32.34% or 38.54% of all isoforms successfully mapped to I. batatas scaffolds or the I. trifida genome (Table 2), which should have largely increased the complexity of transcriptomes in either species. Notably, 37% of the alternatively spliced isoforms in I. batatas were not alternatively spliced or not detected in I. trifida and so were 63% of the alternatively spliced isoforms in I. trifida, suggesting substantial divergence in AS biogenesis and thus their regulatory mechanisms between the two species (Fig. 3C). The isoform number per AS event ranged from 2 to 35 (mean, 4.98) in I. batatas and from 2 to 46 (mean, 4.55) in I. trifida (Table 2; Fig. 3D). In total, 2,047 loci in I. batatas and 3,640 in I. trifida were involved in the detected AS events (Table 2). The maximal number of AS events per locus was 45 (mean, 2.57) in I. batatas and 38 (mean, 2.45) in I. trifida (Table 2; Fig. 3E).

Table 2. Summary of alternative splicing analysis.

Metric | I. batatas | I. trifida
Number of isoforms of the datasets | 53,861 | 51,184
Number of isoforms mapped to genome sequences | 32,660 | 46,249
Number of isoforms (with multiple involvements) in AS events | 26,146 | 40,473
Number of isoforms (with one involvement) in AS events | 10,562 | 17,826
Number of detected AS events | 5,251 | 8,901
Maximal number of isoforms in a single AS event | 35 | 46
Mean number of isoforms per AS event | 4.98 | 4.55
Number of loci with AS events | 2,047 | 3,640
Maximal number of AS events in a single locus | 45 | 38
Mean number of AS events per locus | 2.57 | 2.45
Mean number of isoforms (with one involvement) per locus | 5.16 | 4.90
Proportion of isoforms undergoing AS | 32.34% | 38.54%
Number of estimated loci in the datasets | 10,439 | 10,452
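
The means and proportions in Table 2 follow from its raw counts. The sketch below recomputes them as a consistency check; the dictionary keys and the function name are illustrative, and the pairing of numerators and denominators is inferred from the row labels rather than stated in the source.

```python
# Raw counts copied from Table 2.
table2 = {
    "I. batatas": {"mapped": 32_660, "as_isoforms": 10_562,
                   "as_isoforms_multi": 26_146, "events": 5_251, "loci": 2_047},
    "I. trifida": {"mapped": 46_249, "as_isoforms": 17_826,
                   "as_isoforms_multi": 40_473, "events": 8_901, "loci": 3_640},
}


def derived(row):
    """Recompute the derived rows of Table 2 from the raw counts."""
    return {
        # isoforms in AS events (counted once) / isoforms mapped to the genome
        "pct_isoforms_as": round(100 * row["as_isoforms"] / row["mapped"], 2),
        # isoforms counted once per event they take part in / AS events
        "isoforms_per_event": round(row["as_isoforms_multi"] / row["events"], 2),
        "events_per_locus": round(row["events"] / row["loci"], 2),
        "isoforms_per_locus": round(row["as_isoforms"] / row["loci"], 2),
    }


stats = {species: derived(row) for species, row in table2.items()}
for species, values in stats.items():
    print(species, values)
```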

Figure 3. Analysis and validation of alternative splicing (AS).

(A) Diagrams showing five major AS types. (B) Proportions of major AS types predicted out of the dataset Ib53861 and It51184. (C) Homolog relationship of isoforms carrying putative AS events between I. batatas and I. trifida. (D) Proportion distribution of isoform number per AS event in I. batatas and I. trifida. (E) Proportion distribution of AS events per locus in I. batatas and I. trifida. (F) Diagram and (G) RT-PCR validation of AS events in ten I. batatas genes.

To assess our large-scale predictions of AS events, we manually examined 40 genes that were predicted to contain AS events and found that 8 of them were likely false candidates. We then designed primers to examine 16 AS events, each located in a different gene, by RT-PCR across eight tissues of an I. batatas variety (Xushu22), and successfully confirmed 10 of them (Figs. 3F and 3G). According to these results, we concluded that at least 50% of our AS predictions were valid. Given that we examined only one of multiple AS events in each gene and only in one I. batatas variety, this validation rate is likely an underestimate. Therefore, our large-scale AS analysis has provided a useful resource for studying biological functions of transcript isoforms and the regulatory mechanism of alternative splicing during the evolution of I. batatas.
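The two checks can be expressed as simple rates. The source does not state how the "at least 50%" figure was derived; the sketch below shows one possible reading, offered only as an assumption, in which the manual pass rate and the RT-PCR confirmation rate are multiplied. Variable names are illustrative.

```python
from fractions import Fraction

# Counts taken from the text.
manual_ok = Fraction(40 - 8, 40)  # 32 of 40 manually inspected genes looked genuine
rtpcr_ok = Fraction(10, 16)       # 10 of 16 RT-PCR-tested AS events were confirmed

# Assumption (not stated in the source): treat the two checks as successive
# filters and multiply their pass rates.
combined = manual_ok * rtpcr_ok

print("manual inspection:", float(manual_ok))
print("RT-PCR confirmation:", float(rtpcr_ok))
print("combined under the stated assumption:", float(combined))
```

Under this reading the combined rate equals one half, in line with the stated lower bound, while the RT-PCR rate on its own is 62.5%.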

For example, starch-branching enzymes (EC 2.4.1.18) are among the key enzymes involved in plant starch biosynthesis and sugar metabolism (Zeeman, Kossmann & Smith, 2010). In our analysis, we detected multiple AS events (i.e., one ES and one IR event) in a putative I. batatas starch-branching enzyme I and verified two AS isoforms, whose expression changed across tissues (Fig. 3G, Gene01). In aboveground tissues (i.e., T01 to T04) and fibrous roots (i.e., T05), the two isoforms were expressed at a similar level, whereas in tuberous roots (i.e., T06 to T08), the smaller isoform was specifically expressed (Fig. 3G, Gene01). Plant alpha-glucan phosphorylases, also named starch phosphorylases (EC 2.4.1.1), are another important family of enzymes involved in carbohydrate metabolism (Rathore et al., 2009). Our results revealed that distinct splicing mechanisms exist between aboveground and belowground tissues in the examined I. batatas alpha-glucan phosphorylase (Fig. 3G, Gene02). In addition, divergent gene-expression and splicing patterns were also observed in other investigated genes including a neutral invertase, an E3 ubiquitin-protein ligase, a pentatricopeptide repeat-containing protein, and a few ABC transporters (Fig. 3G, Gene03–10). These data suggest that alternative splicing and thus transcriptome regulation might play important roles during the development of tuberous roots in I. batatas.

Discussion

Understanding the genetic basis and evolutionary scenario underlying agronomically important traits is one of the central research themes in the hexaploid crop I. batatas. However, achieving this goal is bound to be challenging because of the complexity of its genome structure (Isobe, Shirasawa & Hirakawa, 2017). In the present study, we applied a hybrid sequencing approach to generate and analyze large-scale full-length or long-read transcripts and their expression profiles in I. batatas. Our study should benefit the I. batatas research community in at least the following aspects: gene cloning, gene family analysis, development of cDNA-derived markers for breeding, gene model prediction, genome assembly, and study of genetic variation within or among species. For example, we have demonstrated fast gene cloning and gene family analysis based on our transcriptome datasets (Ding et al., 2017). Overall, our study has provided a fundamental resource for functional genomics study in I. batatas, which should facilitate genetic dissection of the origin of tuberous roots as well as other traits.

AS commonly occurs in eukaryotes. In humans, more than 90% of genes were found to be alternatively spliced and the predominant AS type was exon skipping (Wang et al., 2008). In higher plants, the AS frequency in intron-containing genes ranged from approximately 33% to 60%, with intron retention as the major type (Filichkin et al., 2010; Zhang et al., 2010; Shen et al., 2014; Thatcher & Li, 2014). In our study, we observed an overall AS frequency of 32.34% in I. batatas isoforms (Table 2). Considering that about 30.03% of isoforms contained a single exon in our dataset, the AS frequency in intron-containing isoforms in I. batatas was approximately 46.22%. The estimated AS frequency in intron-containing isoforms in I. trifida was 51.18%, slightly higher than that of I. batatas. The major AS type was intron retention in both I. batatas and I. trifida, similar to that observed in other plants. These data highlight the prevalence of AS in both I. batatas and I. trifida, which would increase the complexity of their transcriptomes. In addition, we also examined the AS pattern across eight tissues in 10 I. batatas genes and found that many isoforms exhibited a tissue-specific expression pattern (Fig. 3G). These results imply that the generation of AS isoforms in a tissue-dependent manner may have contributed substantially to organ/tissue development and species evolution in I. batatas.
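The step from the overall AS frequency to the frequency among intron-containing isoforms is a simple rescaling, sketched below. It assumes that single-exon isoforms cannot be alternatively spliced and therefore only dilute the denominator; the function names are illustrative. The I. trifida single-exon share is not reported in the text, so the value printed for it is a back-calculation under the same assumption, not a reported result.

```python
def multi_exon_as_pct(overall_as_pct, single_exon_pct):
    """AS frequency among multi-exon isoforms, assuming single-exon
    isoforms contribute no AS events."""
    return 100 * overall_as_pct / (100 - single_exon_pct)


def implied_single_exon_pct(overall_as_pct, multi_exon_pct):
    """Invert the relation above to back-calculate the single-exon share."""
    return 100 * (1 - overall_as_pct / multi_exon_pct)


# I. batatas: both inputs are given in the text.
ib = multi_exon_as_pct(overall_as_pct=32.34, single_exon_pct=30.03)
print("I. batatas, multi-exon AS frequency (%):", round(ib, 2))

# I. trifida: the single-exon share is back-calculated, not reported.
it_single = implied_single_exon_pct(overall_as_pct=38.54, multi_exon_pct=51.18)
print("I. trifida, implied single-exon share (%):", round(it_single, 1))
```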

AS and gene/genome duplication are two fundamental biological processes contributing to transcriptome and proteome diversity. The relationship between these two evolutionary mechanisms remains debatable. Some studies have reported that the AS frequency decreased after gene duplication and genome polyploidization (Kopelman, Lancet & Yanai, 2005; Su et al., 2006). In contrast, other reports argued that the evolutionary relationship between AS and gene/genome duplication was more complex and must be interpreted with caution (Lin et al., 2008; Roux & Robinson-Rechavi, 2011; Iñiguez & Hernández, 2017). In this study, our transcriptome-wide AS analysis revealed comparable AS patterns between I. batatas and I. trifida, in terms of mean number of isoforms per AS event or per locus, mean number of AS events per locus, and proportion of isoforms undergoing AS (Table 2; Fig. 3). These data suggest that the overall AS frequency (as opposed to the frequency in specific duplicated gene pairs) was not evidently decreased after genome hexaploidization in I. batatas.

Conclusions

Although I. batatas is a global crop of great agronomic importance, advances in its functional genomics study and molecular breeding remain limited because of the complexity of its genome. Here we report the first collections and analyses of large-scale full-length or long-read transcripts in I. batatas and its putative diploid ancestor I. trifida using single-molecule real-time sequencing. By performing comprehensive intraspecific and interspecific sequence analyses, we provide a valuable resource for genetic marker development, gene discovery, and gene function study in I. batatas, as well as comparative biology study between I. batatas and I. trifida. Furthermore, we analyzed transcriptome-wide long non-coding RNA and alternative splicing, which revealed tissue-specific-expressed transcript isoforms and the importance of transcriptome regulation during the speciation and domestication of I. batatas.

Supplemental Information

File S1. Number and length distributions of predicted open reading frames.
DOI: 10.7717/peerj.7933/supp-1
File S2. Functional assignment.
DOI: 10.7717/peerj.7933/supp-2
File S3. Summary of predictions of simple sequence repeats (SSRs).
DOI: 10.7717/peerj.7933/supp-3
File S4. Information of SSRs predicted from Ib53861.
DOI: 10.7717/peerj.7933/supp-4
File S5. Information of SSRs predicted from It51184.
DOI: 10.7717/peerj.7933/supp-5
File S6. Identification of transcription factors.
DOI: 10.7717/peerj.7933/supp-6
File S7. List of transcription factors identified from Ib53861.
DOI: 10.7717/peerj.7933/supp-7
File S8. List of transcription factors identified from It51184.
DOI: 10.7717/peerj.7933/supp-8
File S9. List of lncRNAs predicted from Ib53861.
DOI: 10.7717/peerj.7933/supp-9
File S10. List of lncRNAs predicted from It51184.
DOI: 10.7717/peerj.7933/supp-10
File S11. Information of AS events in Ib53861.
DOI: 10.7717/peerj.7933/supp-11
File S12. Information of AS events in Ib53861 with GFF format.
DOI: 10.7717/peerj.7933/supp-12
File S13. Information of AS events in It51184.
DOI: 10.7717/peerj.7933/supp-13
File S14. Information of AS events in It51184 with GFF format.
DOI: 10.7717/peerj.7933/supp-14
Supplemental Information 15. Long-read transcript sequences of Ipomoea batatas, part one.
DOI: 10.7717/peerj.7933/supp-15
Supplemental Information 16. Long-read transcript sequences of Ipomoea batatas, part two.
DOI: 10.7717/peerj.7933/supp-16
Supplemental Information 17. Long-read transcript sequences of Ipomoea trifida, part one.
DOI: 10.7717/peerj.7933/supp-17
Supplemental Information 18. Long-read transcript sequences of Ipomoea trifida, part two.
DOI: 10.7717/peerj.7933/supp-18

Acknowledgments

We thank Prof. Dr. Daifu Ma and Prof. Dr. Dabing Zhang for their kind help in carrying out the project, and Ms. Xuan Shi, Ms. Ruyuan Wang, and Mr. Yunxiang Wu for their technical assistance.

Funding Statement

This study was jointly supported by the National Natural Science Foundation of China (Grant No. 31771855), the National Sweet Potato Industry and Research System (Grant No. CARS-11-B-02), the startup funding from Fujian Agriculture and Forestry University, and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Na Ding performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Huihui Cui performed the experiments, analyzed the data, prepared figures and/or tables, and approved the final draft.

Ying Miao, Jun Tang, Qinghe Cao and Yonghai Luo conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

DNA Deposition

The following information was supplied regarding the deposition of DNA sequences:

The assembled full-length transcriptome sequences are available at DDBJ/EMBL/GenBank under accession GHYO00000000.

Data Availability

The following information was supplied regarding data availability:

The PacBio SMRT reads and the Illumina short reads are available at the Genome Sequence Archive of Beijing Institute of Genomics, Chinese Academy of Sciences (https://bigd.big.ac.cn/gsa/browse/CRA000288).



Articles from PeerJ are provided here courtesy of PeerJ, Inc
