Data on large cardamom transcriptome associated with Chirke disease

K Mary Mathew; Ranjanan Reshma; M Geethu; Varghese Rithin; Sasidharan Swapna; PP Gouthaman; KK Sabu; F Nadiya; Muhammad Ali Noushad; Soumya S Dharan; R Prakashkumar; AB Remashree

doi:10.1016/j.dib.2019.105047

. 2020 Jan 7;29:105047. doi: 10.1016/j.dib.2019.105047

Data on large cardamom transcriptome associated with Chirke disease

K Mary Mathew ^a, Ranjanan Reshma ^a, M Geethu ^a, Varghese Rithin ^a, Sasidharan Swapna ^a, PP Gouthaman ^b, KK Sabu ^b,^∗, F Nadiya ^b, Muhammad Ali Noushad ^b, Soumya S Dharan ^b, R Prakashkumar ^b, AB Remashree ^a

PMCID: PMC6970142 PMID: 31989006

Abstract

Large cardamom (Amomum subulatum Roxburg), is an ancient spice native to North-Eastern India and Southeast Asia, which belongs to the family Zingiberaceae under the order Scitaminae. Large cardamom is mostly affected by a viral disease termed Chirke caused by Large Cardamom Chirke Virus (LCCV). These disease has spread due to drastic changes in the ecosystem, inadequate rain in dry months and absence of good agricultural practices by the farmers resulting in aphid infestations. In the present study, using HiSeq™ 2000 RNA sequencing technology transcriptome sequencing was performed for both control (disease not expressed) and diseased large cardamom leaf tissues. RNA-seq generated 77260968 (7.72 GB) and 72239708 (7.22 GB) paired raw reads for large cardamom control and diseased samples respectively. The raw data were submitted to the NCBI SRA database under the accession numbers SRX2529373 and SRX2529372 and the assembled transcriptomes were submitted to TSA under the accession numbers GIAV01000000 and GIAW01000000 for the control and diseased samples respectively. The raw reads were quality trimmed and assembled de novo using TRINITY assembler which created 156822 (control) and 148953 (diseased) contigs with N50 values 2107 (control) and 2182 (diseased). The data were used to identify the significantly differentially expressed genes between control and diseased samples.

Keywords: Large cardamom, RNA sequencing, Transcriptome, Differential expression

Specifications Table

Subject	Agricultural and Biological Sciences
Specific subject area	Plant Science
Type of data	Text (FASTQ sequence files), table
How data were acquired	RNA sequencing data generated from Illumina HiSeq™ 2000
Data format	Raw data FASTQ format
Parameters for data collection	Freshly collected leaf samples from both control and naturally infected (diseased) large cardamom plants were used for RNA isolation.
Description of data collection	RNA seq libraries representing control and chirke disease stressed large cardamom were prepared, transcriptome sequencing was performed and de novo assembled to generate unigenes.
Data source location	Plants naturally infected at ICRI Regional Research Station, Gangtok in the East District of Sikkim, India (27° 18′ 41.724″ N, 88° 35′ 31.923″ E). Data was generated from Illumina HiSeq™ 2000
Data accessibility	Raw sequences of both control and disease stressed samples are available at NCBI SRA public repository: https://www.ncbi.nlm.nih.gov/sra/SRX2529373[accn] (control) https://www.ncbi.nlm.nih.gov/sra/SRX2529372[accn] (diseased) Transcriptome Shotgun Assembly for the control sample has been deposited at DDBJ/EMBL/GenBank under the accession GIAV00000000. The version described in this paper is the first version, GIAV01000000. Transcriptome Shotgun Assembly for the diseased sample has been deposited at DDBJ/EMBL/GenBank under the accession GIAW00000000. The version described in this paper is the first version, GIAW01000000.

Open in a new tab

Value of Data

•
Large cardamom is severely affected by many diseases, prominent among them is chirke viral infection which affects the crop productivity.
•
Expression profiling could unravel the over-expression of R genes or genes related to plant stress tolerance.
•
Transcriptome data generated from leaves of plants grown under specific conditions could provide information on the molecular mechanism underlying disease tolerance.
•
Differential expression analysis of control and disease stressed large cardamom could compare the expression variation of particular genes in healthy and infected plants and can be utilized for several downstream applications.

Open in a new tab

1. Data

The dataset contains raw sequencing data obtained through transcriptome sequencing of leaf samples of large cardamom (Amomum subulatum Roxburg). The data files were deposited at NCBI SRA database under project accession no. PRJNA369131. Information generated from the raw data and that of assembly are provided in Table 1 and Fig. 1.

Table 1.

Read and assembly statistics of control and infected large cardamom data.

Plant Material	Control	Diseased
Total number of raw reads	77260968	72239708
Total number of bases	7803357768	7296210508
Initial GC%	46	45
Read length	101	101
GC% after trimming	45.5	45
Reads after adapter removal and quality trimming	37733851	35199417
Total contigs	156822	148953
Largest contig	37547	23530
N50	2107	2182
L50	23639	22103
Total Length	172328012	167556334
GC% after assembly	41.97	42.11
Size of the assembly	168.3 MB	163.6 MB
Raw reads mapped to assembly (%)	97.70	97.17
Coverage	44.68	42.73
Scaffolds with any coverage (%)	98.83	99.00

Open in a new tab

Fig. 1 — Representation of numerical difference in gene and peptide count among the control and treatment.

2. Experimental design, materials, and methods

2.1. Plant material

Transcriptome sequencing was carried out in leaf samples of large cardamom (Amomum subulatum Roxburg). Large cardamom chirke virus (LCCV) was not expressed in one of the samples which served as the control whereas the disease was expressed in the other sample. Leaf tissues from both sets were collected followed by immediate freezing in liquid nitrogen.

2.2. Total RNA isolation and transcriptome sequencing

RNA extraction was done using a modified protocol of the RNeasy Plant Mini Kit (Qiagen) and CTAB method [1] RNA integrity and quality analysis were done using 2100 BioAnalyzer (Agilent Technologies). Illumina sequencing was performed using the HiSeq™ 2000 platform as per the manufacturer's instructions (Illumina, San Diego, CA). RNA-seq generated paired-end strand-specific 77260968 (101 bases) and 72239708 (101 bases) raw reads which correspond to 7.72 GB and 7.22 GB of sequence data for large cardamom control and diseased samples respectively.

2.3. De novo transcriptome assembly

Raw reads were first quality checked using the FastQC [2] tool and the different criteria were cross-checked to determine the integrity of the raw data and based on the quality control data it was determined to trim the raw reads of any adapters present in it. Adapter trimming was done using BBDuk [3] against Illumina universal adapters. Non-coding RNAs such as tRNAs, rRNAs, snRNAs, and snoRNAs were filtred using BBSplit [3] against all non-coding RNA sequences of viridiplantae collected from NCBI, based on further quality checking it was determined that the data was ready for assembly. De novo transcriptome assembly was performed using the Trinity [4] assembler program (Trinity Release v 2.8.5) utilizing three consecutive modules: Inchworm, Chrysalis, and Butterfly to generate contigs. The assembler created 156822 and 148953 contigs for control and infected large cardamom samples (Table 1). The assembled transcripts were converted into peptides using Transdecoder [5] and the peptides were clustered using cd-hit [6] to produce non-redundant and representative sequences. Further statistical data were generated from the assembly by means of the QUAST tool [7].

2.4. Confirmation of chirke virus genome sequences in the assembled transcriptome

Virus genome sequences were fetched from NCBI (https://www.ncbi.nlm.nih.gov/nuccore/?term=chirke) and found only 4 sequences for chirke (JN257715.1, MH899149.1, MH899148.1, and MH899147.1). These were aligned to both infected and control sequences using BLAST+ [8]. The Alignment generated 140 hits for the infected sequences. Whereas the control sequence showed one hit from all four of the sequences. This might be due to the dormant virus particles present in the control sequences or possible cross-contamination.

2.5. Quantification of peptides from the transcripts

A total of 156822 transcripts were generated from the control sample while 148953 were generated from the diseased. While converting the transcripts into peptides the control sample generated 76913 peptide sequences while the treatment generated 74060. The obtained peptides were clustered for non-redundancy which resulted in 30498 unique peptides being generated from control compared to the 29512 that were generated from the diseased (Fig. 1).

Acknowledgments

We acknowledge the financial support of Secretary, Spices Board of India, Ministry of Commerce and Industry, Government of India.

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Footnotes

^{Appendix A}

Supplementary data to this article can be found online at https://doi.org/10.1016/j.dib.2019.105047.

Appendix A. Supplementary data

The following is the Supplementary data to this article:

Multimedia component 1

mmc1.xml^{(974B, xml)}

References

1.Chomczynski P., Sacchi N. The single step method of RNA isolation by acid guanidinium thiocyanatephenol chloroform extraction: twenty something years on. Nat. Protoc. 2006;1:581–585. doi: 10.1038/nprot.2006.83. [DOI] [PubMed] [Google Scholar]
2.Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc Available online at:
3.Bushnell Brian. 2014. BBMap: A Fast, Accurate, Splice-Aware Aligner. United States: N. P. (Web) [Google Scholar]
4.Haas B.J., Papanicolaou A., Yassour M., Grabherr M., Blood P.D., Bowden J., Couger M.B., Eccles D., Li B., Lieber M., Macmanes M.D., Ott M., Orvis J., Pochet N., Strozzi F., Weeks N., Westerman R., William T., Dewey C.N., Henschel R., Leduc R.D., Friedman N., Regev A. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 2013 Aug;8(8):1494–1512. doi: 10.1038/nprot.2013.084. Open Access in PMC, Epub 2013 Jul 11. PubMed PMID:23845962. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Haas B.J., Papanicolaou A., Yassour M., Grabherr M., Blood P.D., Bowden J., Couger M.B., Eccles D., Li B., Lieber M., MacManes M.D., Ott M., Orvis J., Pochet N., Strozzi F., Weeks N., Westerman R., William T., Dewey C.N., Henschel R., LeDuc R.D., Friedman N., Regev A. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 2013 Aug;8(8):1494–1512. doi: 10.1038/nprot.2013.084. Epub 2013 Jul 11. PMID: 23845962; PMCID: PMC3875132. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Li Weizhong, Jaroszewski Lukasz, Godzik Adam. Clustering of highly homologous sequences to reduce the size of large protein database. Bioinformatics. 2001;17:282–283. doi: 10.1093/bioinformatics/17.3.282. [DOI] [PubMed] [Google Scholar]
7.Gurevich Alexey, Saveliev Vladislav, Vyahhi Nikolay, Glenn Tesler QUAST: a quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–1075. doi: 10.1093/bioinformatics/btt086. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., Madden T.L. BLAST+: architecture and applications. BMC Bioinf. 2009 Dec 15;10:421. doi: 10.1186/1471-2105-10-421. PMID: 20003500; PMCID: PMC2803857. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Multimedia component 1

mmc1.xml^{(974B, xml)}

[bib1] 1.Chomczynski P., Sacchi N. The single step method of RNA isolation by acid guanidinium thiocyanatephenol chloroform extraction: twenty something years on. Nat. Protoc. 2006;1:581–585. doi: 10.1038/nprot.2006.83. [DOI] [PubMed] [Google Scholar]

[bib2] 2.Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc Available online at:

[bib3] 3.Bushnell Brian. 2014. BBMap: A Fast, Accurate, Splice-Aware Aligner. United States: N. P. (Web) [Google Scholar]

[bib4] 4.Haas B.J., Papanicolaou A., Yassour M., Grabherr M., Blood P.D., Bowden J., Couger M.B., Eccles D., Li B., Lieber M., Macmanes M.D., Ott M., Orvis J., Pochet N., Strozzi F., Weeks N., Westerman R., William T., Dewey C.N., Henschel R., Leduc R.D., Friedman N., Regev A. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 2013 Aug;8(8):1494–1512. doi: 10.1038/nprot.2013.084. Open Access in PMC, Epub 2013 Jul 11. PubMed PMID:23845962. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] 5.Haas B.J., Papanicolaou A., Yassour M., Grabherr M., Blood P.D., Bowden J., Couger M.B., Eccles D., Li B., Lieber M., MacManes M.D., Ott M., Orvis J., Pochet N., Strozzi F., Weeks N., Westerman R., William T., Dewey C.N., Henschel R., LeDuc R.D., Friedman N., Regev A. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 2013 Aug;8(8):1494–1512. doi: 10.1038/nprot.2013.084. Epub 2013 Jul 11. PMID: 23845962; PMCID: PMC3875132. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] 6.Li Weizhong, Jaroszewski Lukasz, Godzik Adam. Clustering of highly homologous sequences to reduce the size of large protein database. Bioinformatics. 2001;17:282–283. doi: 10.1093/bioinformatics/17.3.282. [DOI] [PubMed] [Google Scholar]

[bib7] 7.Gurevich Alexey, Saveliev Vladislav, Vyahhi Nikolay, Glenn Tesler QUAST: a quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–1075. doi: 10.1093/bioinformatics/btt086. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] 8.Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., Madden T.L. BLAST+: architecture and applications. BMC Bioinf. 2009 Dec 15;10:421. doi: 10.1186/1471-2105-10-421. PMID: 20003500; PMCID: PMC2803857. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Data on large cardamom transcriptome associated with Chirke disease

K Mary Mathew

Ranjanan Reshma

M Geethu

Varghese Rithin

Sasidharan Swapna

PP Gouthaman

KK Sabu

F Nadiya

Muhammad Ali Noushad

Soumya S Dharan

R Prakashkumar

AB Remashree

Abstract

1. Data

Table 1.

Fig. 1.

2. Experimental design, materials, and methods

2.1. Plant material

2.2. Total RNA isolation and transcriptome sequencing

2.3. De novo transcriptome assembly

2.4. Confirmation of chirke virus genome sequences in the assembled transcriptome

2.5. Quantification of peptides from the transcripts

Acknowledgments

Conflict of Interest

Footnotes

Appendix A. Supplementary data

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Data on large cardamom transcriptome associated with Chirke disease

K Mary Mathew

Ranjanan Reshma

M Geethu

Varghese Rithin

Sasidharan Swapna

PP Gouthaman

KK Sabu

F Nadiya

Muhammad Ali Noushad

Soumya S Dharan

R Prakashkumar

AB Remashree

Abstract

1. Data

Table 1.

Fig. 1.

2. Experimental design, materials, and methods

2.1. Plant material

2.2. Total RNA isolation and transcriptome sequencing

2.3. De novo transcriptome assembly

2.4. Confirmation of chirke virus genome sequences in the assembled transcriptome

2.5. Quantification of peptides from the transcripts

Acknowledgments

Conflict of Interest

Footnotes

Appendix A. Supplementary data

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases