Data in Brief. 2023 Nov 20;52:109838. doi: 10.1016/j.dib.2023.109838

Datasets of Iso-Seq transcripts for decoding transcriptome complexity in four Leishmania species

Sandra González-de la Fuente, Jose M Requena, Begoña Aguado
PMCID: PMC10698239  PMID: 38076479

Abstract

The Iso-Seq technology, based on PacBio sequencing, enables the generation of high-quality, full-length transcripts, providing insights into transcriptome complexity. In this study, total RNA from promastigotes of four Leishmania species (Leishmania braziliensis, Leishmania donovani, Leishmania infantum and Leishmania major) was sequenced using the Single Molecule, Real-Time (SMRT) sequencing methodology (PacBio). The Iso-Seq transcripts were categorized as either complete or truncated according to the presence or absence, respectively, of the Spliced Leader (SL) sequence at their 5′-end. Moreover, only transcripts having a poly-A+ tail at their 3′-end were considered. The supplied datasets represent valuable information that may help to uncover novel transcripts and alternative splicing events in a parasite that regulates its gene expression at the post-transcriptional level. A better knowledge of gene expression regulation in Leishmania will open avenues for the development of new drugs to treat leishmaniasis, a devastating disease with worldwide distribution. Additionally, the bioinformatics pipeline followed here may guide the analysis of Iso-Seq data derived from related trypanosomatids such as Trypanosoma cruzi (the agent of Chagas disease) and Trypanosoma brucei (the agent of sleeping sickness).

© 2023 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

Keywords: RNA sequencing, Transcriptomics, L. braziliensis, L. donovani, L. infantum, L. major


Specifications Table

Subject: Bioinformatics, Transcriptomic and Molecular Biology (General)
Specific subject area: PacBio RNA sequencing (Iso-Seq) data
Type of data: Sequence reads obtained after PacBio sequencing of RNA samples from four Leishmania species: L. braziliensis, L. donovani, L. infantum and L. major.
How the data were acquired: The samples were sequenced using the SMRT v8.0 chemistry on the Sequel II sequencing platform, and raw data were subsequently analysed with the tools IsoSeq v3 (v3.4.0) from SMRT Link v6.0.0, ccs (v6.0.0), lima (v1.10.0), bamtools (2.5.1), LoRDEC v0.9, isONclust 0.0.6.1 (for clustering), minimap2 (2.17-r941), TAMA (v2020_12_14), the Cupcake ToFu program, and an in-house Python script to find Spliced Leader-derived nucleotides in the final sequences.
Data format: Raw, filtered and analyzed
Description of data collection: Total RNA samples (three biological replicates) from promastigotes of four Leishmania species (L. braziliensis, L. donovani, L. infantum and L. major) were separately reverse-transcribed into cDNA by oligo-dT priming. The resulting cDNA was sequenced using the Single Molecule, Real-Time (SMRT) long-read sequencing methodology and chemistry (PacBio). High-quality reads were used to generate circular consensus sequences (CCS), which were classified as complete or truncated transcripts depending on the presence or absence, respectively, of the spliced-leader (SL) sequence at their 5′-end.
Data source location: EASI Genomics Consortium, Genomics and NGS Facility (GENGS)
Institution: Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Consejo Superior de Investigaciones Científicas, Universidad Autónoma de Madrid
City: Madrid
Country: Spain
Data accessibility: All Iso-Seq raw data and sequences were deposited at the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/) under the umbrella Study accession number PRJEB60560. Independent ENA projects were generated for each species: PRJEB60500 (L. infantum, strain JPCM5), PRJEB60502 (L. donovani, strain HU3), PRJEB60504 (L. braziliensis, strain M2904) and PRJEB60505 (L. major, strain Friedlin).
FASTA files including the sequences for all complete transcripts can be downloaded from the Mendeley Data repository through the following links: L. braziliensis (https://data.mendeley.com/drafts/kmbhnnxksy/1), L. donovani (https://data.mendeley.com/datasets/75cxwyt29w/1), L. infantum (https://data.mendeley.com/drafts/h7czpsnp4z/1), and L. major (https://data.mendeley.com/drafts/xyd427tt4g/1).

1. Value of the Data

 

  • The provided datasets consist of raw data and nucleotide sequences of poly-A+ transcripts generated by the Iso-Seq technology (PacBio sequencing platform) for Leishmania braziliensis, Leishmania donovani, Leishmania infantum and Leishmania major.

  • As each Iso-Seq transcript derives from a single molecule, these datasets are suitable for identifying mRNA isoforms and for providing evidence of alternative splicing events.

  • These data contribute information that may be useful to decipher the transcriptional complexity of this group of parasites.

2. Objective

This study aimed to sequence, by PacBio methodology, poly-A+ RNA samples derived from promastigotes of four Leishmania species (L. braziliensis, L. donovani, L. infantum and L. major). Here, we provide the generated data together with complete information on the methodological details and bioinformatics procedures, to make these data easily reusable by the research community.

3. Data Description

In total, 647154 polymerase reads were generated by Iso-Seq sequencing on the PacBio platform. After quality control, 13432507 filtered subreads with a mean length of 1990 nucleotides were used to generate a total of 433021 CCS reads with a mean length of 2679 nucleotides (Table 1). Afterwards, CCS reads were demultiplexed into four groups according to the Leishmania species (source of the samples). Next, full-length non-concatemer (FLNC) sequences were subjected to error correction using Illumina RNA-seq reads generated for the same Leishmania species and described elsewhere [1], [2], [3]. Finally, between 73805 and 84913 high-quality (corrected and clustered) FLNC reads were obtained for each Leishmania species (Table 1).

Table 1.

Metrics of the provided datasets.

Iso-Seq data | Number | Minimum size (nt) | Maximum size (nt) | Average size (nt)
Subreads (total) | 13432507 | 50 | 244154 | 1990
Circular consensus sequencing (CCS) reads (total) | 433021 | 114 | 43757 | 2679
Corrected and clustered FLNC reads - L. infantum | 84913 | 86 | 17150 | 2441
Corrected and clustered FLNC reads - L. major | 73805 | 84 | 16379 | 2524
Corrected and clustered FLNC reads - L. donovani | 75716 | 95 | 16653 | 2462
Corrected and clustered FLNC reads - L. braziliensis | 80834 | 87 | 17142 | 2396
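
For readers who wish to recompute metrics of the kind listed in Table 1 from a downloaded FASTA file, the following is a minimal sketch in Python. It is not the procedure used by the authors; the function names and the toy records are hypothetical, and the mean is rounded to the nearest nucleotide as an assumption about how the table values were reported.

```python
from statistics import mean


def read_fasta_lengths(lines):
    """Yield the length of every sequence in an iterable of FASTA lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def length_metrics(lines):
    """Return number of sequences and minimum, maximum and mean size (nt)."""
    lengths = list(read_fasta_lengths(lines))
    if not lengths:
        raise ValueError("no sequences found")
    return {
        "number": len(lengths),
        "min_nt": min(lengths),
        "max_nt": max(lengths),
        "mean_nt": round(mean(lengths)),
    }


# Toy records (hypothetical), standing in for a real FASTA file.
example = """>read_1
ACGTACGT
ACGT
>read_2
ACG
>read_3
ACGTACGTAC
""".splitlines()

print(length_metrics(example))
```

For a real file, the same function can be called with an open file handle, for example `length_metrics(open(path))`, where `path` points to one of the deposited FASTA files.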

Next, species-specific FLNC reads were mapped against annotated gene models. As a result, transcripts for 5799, 5589, 5717, and 5371 gene loci were identified for L. infantum, L. major, L. donovani, and L. braziliensis, respectively. As the current numbers of annotated poly-A+ transcripts are 9646 in L. infantum [1], 9745 in L. major [2], 10893 in L. donovani [3], and 9932 in L. braziliensis (manuscript in preparation), the transcriptome coverages of the provided datasets would be 60.1%, 57.4%, 52.5%, and 54.1%, respectively. Moreover, more than two transcript isoforms were identified for 3726 (64.25%), 3631 (65%), 3570 (62.4%), and 3612 (67.25%) genes of the indicated species, respectively. Finally, a search for sequences derived from the Spliced Leader (SL) sequence was carried out. This sequence was found at the 5′-end of a subset of transcripts, which can therefore be categorized as complete transcripts: 874 for L. infantum, 808 for L. major, 790 for L. donovani, and 900 for L. braziliensis. The presence of a poly-A+ tail was a common feature of all transcripts (both complete and 5′-truncated ones).
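
The coverage figures quoted above follow from dividing the number of gene loci with Iso-Seq transcripts by the number of annotated poly-A+ transcripts. The short Python sketch below reproduces that arithmetic from the numbers given in the text. The variable and function names are illustrative, and the reading that the isoform percentages are relative to the detected loci is an inference from the numbers rather than something stated explicitly.

```python
# Values copied from the text of this section.
annotated = {"L. infantum": 9646, "L. major": 9745,
             "L. donovani": 10893, "L. braziliensis": 9932}
detected = {"L. infantum": 5799, "L. major": 5589,
            "L. donovani": 5717, "L. braziliensis": 5371}
multi_isoform = {"L. infantum": 3726, "L. major": 3631,
                 "L. donovani": 3570, "L. braziliensis": 3612}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Transcriptome coverage: detected loci over annotated transcripts.
coverage = {sp: percent(detected[sp], annotated[sp]) for sp in annotated}

# Isoform share: loci with several isoforms over detected loci (inferred).
isoform_share = {sp: percent(multi_isoform[sp], detected[sp]) for sp in detected}

for sp in annotated:
    print(f"{sp}: coverage {coverage[sp]:.1f}%, "
          f"multi-isoform loci {isoform_share[sp]:.2f}%")
```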

4. Experimental Design, Materials and Methods

4.1. Leishmania cell culture and RNA extraction

Cells from the following Leishmania species were used: L. infantum (strain JPCM5), L. donovani (strain HU3), L. braziliensis (strain M2904) and L. major (strain Friedlin). Logarithmically growing promastigotes (cell density between 6 × 10⁶ and 10⁷ per ml) were cultured in RPMI medium supplemented with 10% heat-inactivated fetal bovine serum. After harvesting, total RNA was isolated by the NucleoSpin RNA kit following the manufacturer's instructions (Macherey-Nagel). RNA samples from three biological replicates for each species were prepared. RNA integrity was checked in a bioanalyzer (Agilent 2100) before proceeding with cDNA synthesis.

4.2. Iso-Seq library construction and sequencing

Library preparation and Iso-Seq sequencing were performed by the EASI-Genomics consortium, which evolved from the ESGI (European Sequencing and Genotyping Infrastructure – EU FP7 2007-2013 Infrastructure Project, project number 262055). In brief, the NEBNext Single cell/low input cDNA synthesis & amplification module and the PacBio Iso-Seq express oligo kit were used to perform first-strand cDNA synthesis and PCR amplification. Barcoded forward and reverse primers (barcoded NEBNext Single Cell cDNA PCR Primer and barcoded Iso-Seq express cDNA PCR Primer) were used during PCR amplification of the cDNA samples so that they could be multiplexed during sequencing. Once barcoded, the amplified cDNA samples were purified using SMRTbell cleanup beads and pooled into a single SMRTbell library. An Agilent Bioanalyzer with a High Sensitivity DNA kit (Agilent Technologies) and a Qubit dsDNA HS kit (Life Technologies) were used for quality assessment of the library. Finally, the DNA library was sequenced using the Single Molecule, Real-Time (SMRT v8.0) chemistry on a Sequel II System.

4.3. Iso-Seq reads processing and analyses

The computational procedure and the bioinformatics tools used in the analysis are depicted in Fig. 1. Firstly, Iso-Seq reads were submitted to three successive steps: generation of circular consensus sequences (CCS), demultiplexing according to the Leishmania species, and refinement, by using the command-line IsoSeq v3 (v3.4.0), implemented in the IsoSeq GUI-based analysis application (SMRT Link v6.0.0). To generate one representative CCS for each zero-mode waveguide (ZMW), and hence for each sequenced transcript molecule, the ccs (v6.0.0) program was used with the --min-rq 0.9, --draft-mode winpoa and --disable-heuristics parameters. Barcode demultiplexing and primer removal were performed using lima (v1.10.0) with the --isoseq mode and the --peek-guess parameter to remove spurious false-positive signal. IsoSeq3 refine (option --require-polya) was used to select those reads having a 3′-end adenine (A) tract. After trimming of the poly(A) tails and identification of concatemers, the FLNC transcripts were generated.
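
The three steps just described can be summarized as command lines. The Python sketch below only assembles and prints them; it does not run anything. The options are those named in the text, while every file name is hypothetical and the positional argument order is an assumption based on the usual PacBio command-line conventions, so it should be checked against the documentation of the installed tool versions.

```python
import shlex


def ccs_command(subreads_bam, ccs_bam):
    """Consensus calling with the ccs options named in the text."""
    return ["ccs", subreads_bam, ccs_bam,
            "--min-rq", "0.9",
            "--draft-mode", "winpoa",
            "--disable-heuristics"]


def lima_command(ccs_bam, barcodes_fasta, demux_bam):
    """Barcode demultiplexing and primer removal in Iso-Seq mode."""
    return ["lima", ccs_bam, barcodes_fasta, demux_bam,
            "--isoseq", "--peek-guess"]


def refine_command(demux_bam, primers_fasta, flnc_bam):
    """Poly(A) selection/trimming and concatemer removal (FLNC reads)."""
    return ["isoseq3", "refine", demux_bam, primers_fasta, flnc_bam,
            "--require-polya"]


# Hypothetical file names. lima typically writes one BAM per barcode pair,
# so the input of the refine step is given explicitly here.
commands = [
    ccs_command("movie.subreads.bam", "movie.ccs.bam"),
    lima_command("movie.ccs.bam", "barcoded_primers.fasta", "movie.demux.bam"),
    refine_command("movie.demux.species_A.bam", "barcoded_primers.fasta",
                   "species_A.flnc.bam"),
]

for cmd in commands:
    print(shlex.join(cmd))
```

Building the commands as lists keeps each option next to its value and allows them to be passed unchanged to `subprocess.run` once the paths have been adapted.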

Fig. 1.

Diagram of the workflow used for Iso-Seq data processing. (A) Raw reads were analyzed using the PacBio Iso-Seq v3.4.0 pipeline to generate circular consensus sequences (CCS) and, finally, FLNC reads. Point sequence errors were corrected using the LoRDEC v0.9 method and high-quality (Q>25) Illumina RNA-seq reads. (B) Corrected FLNC reads were clustered using the isONclust algorithm and mapped to the reference genome with the minimap2 v2.17-r941 software (red box). (C) Transcript refinements were conducted with the TAMA software (using the collapse and merge steps) to reduce redundancy (bright green). Complete transcripts were identified by the Cupcake ToFu software together with a custom Python script designed to find Spliced Leader (SL)-derived nucleotides at the 5′-end of these transcripts (bright green boxes). The accuracy of complete transcripts was checked by visualization using the IGV tool (orange box).

The LoRDEC v0.9 software [4] was used for FLNC correction using the high-quality Illumina RNA-seq reads (cut-off value, 25) with the default parameters. These reads are available under independent ENA projects for each species (see Data accessibility section).

For clustering of FLNC reads, the isONclust algorithm [5] was used with the parameters --mapped_threshold 0.8 --aligned_threshold 0.5 --q 15.0. Then, the high-quality isoforms were aligned to the corresponding reference genome using the minimap2 software (v2.17-r941) [6]. To mitigate redundancy and enhance the accuracy of the transcriptome analysis, the TAMA collapse and TAMA merge steps [7] were employed; these steps consolidate overlapping transcripts from the same gene locus. Finally, an in-house Python script was designed to discriminate between complete and 5′-truncated transcripts, based on the presence at the 5′-end of the Iso-Seq molecules of 8 or more nucleotides derived from the Spliced Leader (SL) sequence (common to all Leishmania mRNAs).
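
The in-house script itself is not reproduced in this article. As an illustration of the criterion only, the Python sketch below labels a transcript as complete when its 5′-end starts with the last 8 or more nucleotides of the SL sequence. Several points are assumptions: the 39-nt SL sequence used as default is the one commonly reported for Leishmania and should be verified for each species, exact matching without mismatches is assumed, and all function names are hypothetical.

```python
# Assumption: commonly reported 39-nt Leishmania SL (mini-exon) sequence.
# Verify it against the reference of the species under study before use.
SL_SEQUENCE = "AACTAACGCTATATAAGTATCAGTTTCTGTACTTTATTG"
MIN_SL_NT = 8  # threshold stated in the text


def sl_nucleotides_at_5prime(transcript, sl=SL_SEQUENCE, min_nt=MIN_SL_NT):
    """Return how many 3'-terminal SL nucleotides open the transcript.

    The longest SL suffix of at least min_nt nucleotides that matches the
    start of the transcript exactly is reported; 0 means no such match.
    """
    transcript = transcript.upper()
    for k in range(len(sl), min_nt - 1, -1):
        if transcript.startswith(sl[-k:]):
            return k
    return 0


def classify(transcript, sl=SL_SEQUENCE, min_nt=MIN_SL_NT):
    """Label a transcript as 'complete' or 'truncated'."""
    if sl_nucleotides_at_5prime(transcript, sl, min_nt) > 0:
        return "complete"
    return "truncated"


# Toy transcripts (hypothetical): 10 and 7 SL-derived nucleotides at the 5'-end.
with_sl = SL_SEQUENCE[-10:] + "ATGGCCAAA"
short_sl = SL_SEQUENCE[-7:] + "ATGGCC"

print(classify(with_sl), classify(short_sl))
```

A production script would probably also need to tolerate sequencing errors and leftover primer bases upstream of the SL, which this sketch does not attempt.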

To create GFF3 files for every set of Iso-Seq molecules, these were processed with the Cupcake ToFu software v29.0.0 [https://github.com/Magdoll/cDNA_Cupcake/wiki/]. In addition, the complete Iso-Seq molecules (full transcripts) were grouped into FASTA files (see Data accessibility section).

CRediT authorship contribution statement

Sandra González-de la Fuente: Formal analysis, Software, Data curation, Writing – original draft, Writing – review & editing. Jose M. Requena: Methodology, Investigation, Conceptualization, Funding acquisition, Supervision, Writing – review & editing. Begoña Aguado: Conceptualization, Project administration, Funding acquisition, Supervision, Writing – review & editing.

Acknowledgments

The next-generation sequencing (NGS) data analysis was performed by the Genomics and NGS Core Facility (GENGS) at the Centro de Biología Molecular Severo Ochoa (CBMSO, CSIC-UAM). This research was supported by the Spanish Ministerio de Ciencia e Innovación (MICINN), Agencia Estatal de Investigación (AEI), grant number PID2020-117916RB-I00, and Instituto de Salud Carlos III, grant CB21/13/00018 (CIBERINFEC). An institutional grant from Fundación Ramón Areces is also acknowledged. Sequencing was conducted by the EASI-Genomics consortium, which received funding from the European Union's Horizon 2020 Research and Innovation Program under grant agreement No. 824110 (EASI-Genomics PID:7712).

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.

Contributor Information

Jose M. Requena, Email: jmrequena@cbm.csic.es.

Begoña Aguado, Email: baguado@cbm.csic.es.

References

  • 1. Camacho E., González-de la Fuente S., Solana J.C., Tabera L., Carrasco-Ramiro F., Aguado B., Requena J.M. Leishmania infantum (JPCM5) transcriptome, gene models and resources for an active curation of gene annotations. Genes. 2023;14(4):866. doi: 10.3390/genes14040866.
  • 2. Camacho E., González-de la Fuente S., Solana J.C., Rastrojo A., Carrasco-Ramiro F., Requena J.M., Aguado B. Gene annotation and transcriptome delineation on a de novo genome assembly for the reference Leishmania major Friedlin strain. Genes. 2021;12:1359. doi: 10.3390/genes12091359.
  • 3. Camacho E., González-de la Fuente S., Rastrojo A., Peiró-Pastor R., Solana J.C., Tabera L., Gamarro F., Carrasco-Ramiro F., Requena J.M., Aguado B. Complete assembly of the Leishmania donovani (HU3 strain) genome and transcriptome annotation. Sci. Rep. 2019;9:6127. doi: 10.1038/s41598-019-42511-4.
  • 4. Salmela L., Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014;30(24):3506–3514. doi: 10.1093/bioinformatics/btu538.
  • 5. Sahlin K., Medvedev P. De novo clustering of long-read transcriptome data using a greedy, quality value-based algorithm. J. Comput. Biol. 2020;27(4):472–484. doi: 10.1089/cmb.2019.0299.
  • 6. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–3100. doi: 10.1093/bioinformatics/bty191.
  • 7. Kuo R.I., Cheng Y., Zhang R., Brown J.W.S., Smith J., Archibald A.L., Burt D.W. Illuminating the dark side of the human transcriptome with long read transcript sequencing. BMC Genom. 2020;21(1):751. doi: 10.1186/s12864-020-07123-7.

Articles from Data in Brief are provided here courtesy of Elsevier
