3 Biotech. 2021 Aug 2;11(8):393. doi: 10.1007/s13205-021-02943-0

Reference-based assembly of chloroplast genome from leaf transcriptome data of Pterocarpus santalinus

Shanmugavel Senthilkumar 1, Kandasamy Ulaganathan 2, Modhumita Ghosh Dasgupta 1
PMCID: PMC8329147  PMID: 34458062

Abstract

Chloroplast genome sequencing is an essential tool for understanding genome evolution and phylogenetic relationships. The available methods for constructing chloroplast genomes include chloroplast enrichment followed by long overlapping PCR, or extraction and assembly of chloroplast-specific reads from whole-genome datasets. In the present study, we propose an alternative strategy: extraction and assembly of chloroplast-specific reads from leaf transcriptome data of Pterocarpus santalinus using the Bowtie2 aligner. The assembled genome was compared with the published chloroplast genome of P. santalinus for genome size, number of predicted genes, microsatellite repeat motifs, and nucleotide repeats. A near-complete chloroplast genome was assembled from the transcriptome reads. The proposed method requires less computational time and know-how and limited virtual memory, and is cost-effective when compared to whole-genome sequencing. Assembly of the Cp genome from transcriptome data will enhance the resolution of phylogenetic studies through comparative plastome analysis, facilitate accurate species/genotype discrimination and accelerate the development of transplastomic plants with enhanced biotic and abiotic tolerance.

Supplementary Information

The online version contains supplementary material available at 10.1007/s13205-021-02943-0.

Keywords: Assembly, Chloroplast genome, Phylogeny, Repeat analysis, Transcriptome data

Introduction

Chloroplasts (Cp) are semi-autonomous organelles and the metabolic centre of plant life. Their genomes were among the first plant genomes to be sequenced and have been a valuable resource for deciphering phylogenetic relatedness between taxa and resolving evolutionary relationships (Daniell et al. 2016). By 2020, 4717 Cp genomes had been sequenced from plants (Zhong 2020). The Cp genome is predominantly a circular molecule, although recent studies have shown multi-branched linear structures of Cp DNA in a few angiosperms (Mower et al. 2018). Apart from conducting photosynthesis, the chloroplast also regulates other crucial biochemical processes, including the synthesis of biomolecules such as fatty acids, nucleotides, amino acids, phytohormones and vitamins, and plays a major role in plant responses to biotic and abiotic stresses (Daniell et al. 2016). The use of Cp genomes in timber forensics, crop improvement, and production of biopharmaceuticals is extensively documented (Bansal and Saha 2012; Daniell et al. 2016; Yu et al. 2020; Teske et al. 2020; Li et al. 2021).

The conventional method for Cp genome sequencing involves chloroplast enrichment using a sucrose or Percoll gradient, the high-salt method or proprietary kits (Chloroplast Isolation Kit from Sigma Aldrich, USA or Abcam, MA, USA). Subsequently, long overlapping PCR is conducted to sequence the genome (reviewed by Twyford and Ness 2017). The major limitations of this strategy are the cost of chloroplast isolation and the large quantity of starting material required, which is a constraint for samples sourced from herbaria or endangered species (Vieira Ldo et al. 2014). Another challenge is primer design for long-range PCR, which depends on sequence conservation across species. Differences in gene organization can severely hamper amplification success and thus affect genome assembly (Atherton et al. 2010). Alternatively, screening of bacterial artificial chromosome (BAC) or fosmid libraries using chloroplast-specific probes has also been reported (Daniell et al. 2006; Jansen et al. 2011), but these procedures are technically demanding. With the advent of next-generation sequencing (NGS) platforms, sequencing of enriched chloroplast DNA (Atherton et al. 2010) or use of whole-genome sequence datasets has emerged as a viable method for assembling Cp genomes (reviewed by Twyford and Ness 2017). The Illumina platform is considered the most suitable NGS platform for sequencing Cp genomes, since it allows rolling circle amplification products (Atherton et al. 2010). Software tools like IOGA (Baker et al. 2010), Fast-Plast (McKain and Wilson 2017), GetOrganelle (Jin et al. 2020), NOVOPlasty (Dierckxsens et al. 2017) and ChloroExtractor (Ankenbrand et al. 2018) were developed for extracting organellar reads from whole-genome datasets and assembling chloroplast genomes. A comparison of tools for assembling complete chloroplast genomes revealed that GetOrganelle performed best on both simulated and real data, followed by Fast-Plast (Freudenthal et al. 2020). However, sequencing the whole genome is cost-intensive, and assembling organellar genomes from these datasets requires high-end computational knowledge and infrastructure.

In the present study, an alternate approach of extracting and assembling chloroplast reads from transcriptome dataset was attempted and the method was demonstrated using the leaf transcriptome of Pterocarpus santalinus. This methodology demands less computational time and limited virtual memory and can be executed by researchers with limited knowledge in computational biology.

Total RNA was isolated from young leaves of P. santalinus using the RNAqueous®-Micro Total RNA Isolation Kit (Thermo Scientific, USA). The concentration of RNA was quantified using a Qubit fluorometer (Thermo Fisher Scientific, MA, USA) and TapeStation (Agilent Technologies Inc., Santa Clara, CA). The RNA integrity number equivalent (RINe) was determined on the TapeStation. Five hundred ng of total RNA was used to enrich mRNA using the NEBNext Poly(A) mRNA Magnetic Isolation Module, and the enriched mRNA was chemically fragmented, reverse transcribed, and cleaned. The cDNA was end-repaired, adapter-ligated, size selected, PCR amplified (12 cycles) and cleaned prior to library construction. The library was constructed using the NEBNext® Ultra™ II RNA Library Prep Kit following the manufacturer’s protocol, quantified using the Qubit fluorometer, validated on the TapeStation and sequenced on an Illumina HiSeq 2000 (Illumina Inc., San Diego, CA, USA) using 150 bp paired-end chemistry.

The raw RNA-seq data were quality checked using FastQC, and low-quality and adapter sequences were removed using Trimmomatic (Bolger et al. 2014). The processed reads were subsequently used as input for the Bowtie2 aligner, with the P. santalinus chloroplast sequence (Acc. No. MT249117.1; Hong et al. 2020) as reference. The reference Cp genome used in the present study was assembled from a whole-genome dataset generated using a hybrid strategy of short-read sequencing on Illumina HiSeq 4000 and long-read sequencing on PacBio Sequel (Hong et al. 2020).
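As a rough illustration of the kind of quality summary produced at this step, the sketch below computes the fraction of base calls at or above Q30 from FASTQ quality strings. It is not how FastQC or Trimmomatic work internally; the function names are hypothetical, and Phred+33 encoding and a per-base (rather than per-read) definition of Q30 are assumptions.

```python
def read_fastq_qualities(lines):
    """Yield the quality string (every fourth line) from FASTQ text lines."""
    for i, line in enumerate(lines):
        if i % 4 == 3:
            yield line.strip()


def fraction_q30(quality_strings, threshold=30, offset=33):
    """Return the fraction of base calls with Phred quality >= threshold.

    Assumes Phred+33 encoding (offset=33), which is an assumption about
    the input files rather than something stated in the text.
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


# Two toy reads: 'I' encodes Q40, '?' encodes Q30, '5' encodes Q20, '#' encodes Q2.
demo_fastq = ["@read1", "ACGT", "+", "II5#", "@read2", "GGCC", "+", "????"]
demo_fraction = fraction_q30(read_fastq_qualities(demo_fastq))  # 6 of 8 bases pass
```

For real data, the same functions could be fed lines from a file opened with `gzip.open(path, "rt")`.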

The SAM file from Bowtie2 was then converted to a coordinate-sorted BAM file, followed by generation of a consensus sequence using SAMtools (Li et al. 2009) and VCFtools (Danecek et al. 2011).

The commands used for constructing the Cp genome are given below:

Command for building reference index

$ bowtie2-build -f /path/to/reference.fasta /directory/path/to/write/reference/index/

Command for building Cp genome with reference sequence

$ bowtie2 --local -p 10 -x /path/to/reference/index/directory -1 /path/to/transcriptome/raw/reads/forward.fastq.gz -2 /path/to/transcriptome/raw/reads/reverse.fastq.gz -S output.sam

Conversion of SAM to sorted BAM file

$ samtools view -bS output.sam | samtools sort -o output.bam

Generation of consensus FASTA file from BAM file

$ samtools mpileup -uf /path/to/reference.fasta output.bam | bcftools call -c | vcfutils.pl vcf2fq > output.fastq
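The four steps above can be strung together in a small driver script. The following is a minimal sketch, assuming the option spellings `--local`, `-p 10`, `-bS`, `-uf` and `-c` are the intended readings of the commands listed here; the function name `build_pipeline` and the example file names are hypothetical. It only builds the command lines, so it can be inspected without the tools installed. Whether a given samtools/bcftools release still accepts these exact options is not verified here.

```python
import shlex


def build_pipeline(reference_fasta, index_prefix, fwd_reads, rev_reads,
                   threads=10, sam="output.sam", bam="output.bam",
                   consensus="output.fastq"):
    """Return the four shell command lines of the workflow, in order."""
    q = shlex.quote  # protect paths containing spaces or shell metacharacters
    return [
        # 1. Build the Bowtie2 index from the reference chloroplast genome.
        f"bowtie2-build -f {q(reference_fasta)} {q(index_prefix)}",
        # 2. Align the paired-end transcriptome reads in local mode.
        f"bowtie2 --local -p {int(threads)} -x {q(index_prefix)} "
        f"-1 {q(fwd_reads)} -2 {q(rev_reads)} -S {q(sam)}",
        # 3. Convert SAM to a coordinate-sorted BAM file.
        f"samtools view -bS {q(sam)} | samtools sort -o {q(bam)}",
        # 4. Call a consensus sequence against the reference.
        f"samtools mpileup -uf {q(reference_fasta)} {q(bam)} "
        f"| bcftools call -c | vcfutils.pl vcf2fq > {q(consensus)}",
    ]


commands = build_pipeline("reference.fasta", "cp_index",
                          "forward.fastq.gz", "reverse.fastq.gz")
for command in commands:
    print(command)
    # To execute instead of printing (pipes and redirection need a shell):
    # subprocess.run(command, shell=True, check=True)
```

Printing the commands first makes it easy to check paths before committing to a run.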

All analyses were carried out on a Dell Precision 3630 workstation (i7-8700K 3.2 GHz processor, 6 cores, 12 threads, 32 GB RAM, running Linux Ubuntu 20.10 LTS).

The resulting sequence was annotated using the GeSeq online tool (Tillich et al. 2017). The number of genes in the assembled and the reference Cp genomes was predicted using the same tool. REPuter (Kurtz et al. 2001) and MISA (Beier et al. 2017) were used with default parameters to identify nucleotide repeats and microsatellite repeats, respectively, in both the assembled and reference Cp genomes. The number of each nucleotide was determined using a Python script. mVISTA (available at http://genome.lbl.gov/vista/index.shtml) (Mayor et al. 2000) was used to visualize the alignment of the reference and assembled Cp genomes of P. santalinus and to identify sequence variations between the two assemblies.
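The authors' nucleotide-counting script is not given. A minimal sketch of how such a count could be done is shown below, assuming the sequence is available as FASTA-formatted text; the function name `count_nucleotides` and the demo sequence are hypothetical.

```python
from collections import Counter


def count_nucleotides(fasta_text):
    """Count each symbol in the sequence lines of FASTA-formatted text.

    Header lines (starting with '>') are skipped and bases are upper-cased,
    so soft-masked (lower-case) positions are counted with their upper-case
    equivalents. Ambiguity codes such as N are counted as their own symbol.
    """
    counts = Counter()
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        counts.update(line.upper())
    return counts


demo = ">demo\nACGTN\nacgga\n"
counts = count_nucleotides(demo)
for base in sorted(counts):
    print(base, counts[base])
```

Reporting every symbol, rather than only A, C, G and T, makes uncalled or ambiguous positions visible in the tally.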

The assembled and reference Cp genomes of P. santalinus along with 27 members of Fabaceae were used to construct the phylogenetic tree. Pterocarpus species including P. indicus, P. macrocarpus, P. marsupium, P. tinctorius and P. pedatus were included in the study to document their phylogenetic relatedness. Multiple sequence alignment was conducted using BioEdit (Hall 1999) and phylogenetic analysis was carried out in MEGA X (Kumar et al. 2018). A Neighbor-Joining (NJ) tree was constructed using the p-distance model with 1000 bootstrap iterations, and pairwise deletion was selected for gap treatment.
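To make the distance model concrete: the p-distance between two aligned sequences is the proportion of compared sites at which they differ, and pairwise deletion means a site is skipped only for the pair in which one sequence has a gap. The sketch below illustrates that calculation; it is not MEGA X code, the function names are hypothetical, and treating `-` and `?` as the gap/missing symbols is an assumption.

```python
def p_distance(seq_a, seq_b, gap_chars="-?"):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence carries a gap/missing symbol are skipped
    for this pair only (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue  # pairwise deletion: ignore this site for this pair
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differing / compared


def distance_matrix(alignment):
    """Return {(name_x, name_y): p-distance} for every pair in a dict of sequences."""
    names = list(alignment)
    return {(x, y): p_distance(alignment[x], alignment[y])
            for x in names for y in names}


toy_alignment = {"taxonA": "ACGT-A", "taxonB": "ACGA-T", "taxonC": "ACGTTA"}
matrix = distance_matrix(toy_alignment)
```

A matrix of this kind is the input to Neighbor-Joining clustering; bootstrap support is obtained by resampling alignment columns and repeating the calculation.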

The concentration of total RNA isolated from the leaf tissues was 43.2 ng/µl using Qubit fluorometer and the RNA integrity number equivalent (RINe) value was 7.8. The enriched sequencing library was quantified using both Qubit fluorometer and TapeStation and the concentration was 18.7 and 15.2 ng/µl respectively.

A total of 35,861,326 raw reads were generated with a read length of 150 bp and the percent of reads above Q30 was 89.04%. The reference-based assembly of P. santalinus from leaf RNA-seq raw reads generated a Cp genome of 158,966 bp (Fig. 1), similar to the genome reported by Hong et al. (2020). A total of 158 genes were identified in the assembled genome when compared to 159 genes predicted from the reference genome (Hong et al. 2020) (Table 1). The genome sequences were annotated using GeSeq and the list of genes annotated in both the Cp genomes, gene position, and gene length is presented in Table 1. The predicted genes and their numbers were comparable except for trnI-CAU, which was not predicted in the assembled genome. The comparative analysis indicated that a near-complete assembly of P. santalinus Cp genome was achievable using the present method.

Fig. 1.


Chloroplast genome of Pterocarpus santalinus assembled from leaf transcriptome data. The genes drawn outside and inside of the circle are transcribed in clockwise and counter clockwise directions, respectively. Genes are colored based on their functional groups

Table 1.

Comparative analysis of genes predicted from the assembled and reference chloroplast genomes of Pterocarpus santalinus using GeSeq

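Columns: Group; Gene name; Gene position (Assembled, Reference); Gene length in bp (Assembled, Reference); Total no. of genes (Assembled, Reference). Group names and gene totals are given on the first row of each group.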
ATP synthase atpA 52,185 52,185 1533 1533 7 7
atpB 7331 7331 1488 1488
atpE 8815 8815 402 402
atpF 50,786 50,786 145 145
atpF 51,703 51,703 407 407
atpH 50,123 50,123 246 246
atpI 48,266 48,266 744 744
NADH dehydrogenase ndhA 32,511 32,511 553 553 15 15
ndhA 34,305 34,305 542 542
ndhB 57,733 57,733 777 777
ndhB 59,195 59,195 756 756
ndhB 146,192 146,192 777 777
ndhB 147,654 147,654 756 756
ndhC 10,821 10,821 363 363
ndhD 37,775 37,775 1497 1497
ndhE 36,826 36,826 306 306
ndhF 42,562 42,562 2256 2256
ndhG 36,060 36,060 531 531
ndhH 31,328 31,328 1182 1182
ndhI 34,926 34,926 486 486
ndhJ 12,053 12,053 477 477
ndhK 11,153 11,153 744 744
Cytochrome b/f complex petA 64,436 64,436 963 963 8 8
petB 78,320 78,320 6 6
petB 79,143 79,143 642 642
petD 79,997 79,997 8 8
petD 80,714 80,714 475 475
petG 68,950 68,950 114 114
petL 68,689 68,689 96 96
petN 124,655 124,655 90 90
Photosystem I psaA 20,016 20,016 2253 2253 5 5
psaB 22,294 22,294 2205 2205
psaC 37,398 37,398 246 246
psaI 62,223 62,223 105 105
psaJ 70,142 70,142 135 135
Photosystem II psbA 157,594 157,594 1062 1062 14 14
psbB 75,816 75,816 1527 1527
psbC 130,754 130,754 1386 1386
psbD 129,709 129,709 1062 1062
psbE 91,561 91,561 252 252
psbF 91,822 91,822 120 120
psbH 77,953 77,953 222 222
psbI 102,726 102,726 111 111
psbJ 92,216 92,216 123 123
psbK 102,027 102,027 186 186
psbL 91,964 91,964 117 117
psbM 32,839 32,839 105 105
psbT 77,538 77,538 108 108
psbZ 132,836 132,836 189 189
Large subunit of ribosome rpl14 73,842 73,842 369 369 14 14
rpl16 72,140 72,140 9 9
rpl16 73,310 73,310 399 399
rpl2 68,957 68,957 391 391
rpl2 70,013 70,013 434 434
rpl2 157,416 157,416 391 391
rpl2 158,472 158,472 434 434
rpl20 86,804 86,804 360 360
rpl22 70,941 71,243 113 327
rpl23 68,657 68,657 276 276
rpl23 157,116 157,116 276 276
rpl32 117,190 117,190 147 147
rpl33 70,781 70,781 201 201
rpl36 75,419 75,419 114 114
Small subunit of ribosome rps11 76,054 76,054 417 417 18 18
rps12 56,100 56,100 232 232
rps12 56,864 56,864 26 26
rps12 144,559 144,559 232 232
rps12 145,323 145,323 26 26
rps12-fragment 85,828 85,828 114 114
rps14 24,622 24,622 303 303
rps15 30,944 30,944 273 273
rps16 58,227 58,227 40 40
rps16 59,166 59,166 230 230
rps18 71,243 71,254 327 84
rps19 70,507 70,507 279 279
rps2 47,307 47,307 711 711
rps3 71,322 71,322 657 657
rps4 15,958 15,958 606 606
rps7 56,947 56,947 468 468
rps7 145,406 145,406 468 468
rps8 74,578 74,578 405 405
RNA polymerase subunits rpoA 76,553 76,553 996 996 5 5
rpoB 36,680 36,680 3213 3213
rpoC1 39,919 39,919 432 432
rpoC1 41,095 41,095 1623 1623
rpoC2 42,888 42,888 4167 4167
Ribosomal RNA rrn16 16,558 16,558 1491 1491 10 10
rrn16 105,017 105,017 1491 1491
rrn23 20,448 20,448 2617 2617
rrn23 23,065 23,065 199 199
rrn23 108,907 108,907 2617 2617
rrn23 111,524 111,524 199 199
rrn4.5 23,362 23,362 104 104
rrn4.5 111,821 111,821 104 104
rrn5 23,690 23,690 121 121
rrn5 112,149 112,149 121 121
Transfer RNA genes trnA-UGC 19,418 19,418 38 38 44 45
trnA-UGC 20,256 20,256 35 35
trnA-UGC 107,877 107,877 38 38
trnA-UGC 108,715 108,715 35 35
trnC-GCA 123,449 123,449 71 71
trnD-GUC 32,339 32,339 74 74
trnE-UUC 31,613 31,613 73 73
trnF-GAA 145,553 145,553 73 73
trnfM-CAU 25,092 25,092 74 74
trnG-GCC 133,681 133,681 71 71
trnG-UCC 103,910 103,910 23 23
trnG-UCC 104,640 104,640 48 48
trnH-GUG 158,851 158,851 75 75
trnI-CAU 68,139 68,139 74 74
trnI-CAU - 156,598 - 74
trnI-GAU 18,335 18,335 37 37
trnI-GAU 19,324 19,324 35 35
trnI-GAU 106,794 106,794 37 37
trnI-GAU 107,783 107,783 35 35
trnK-UUU 154,633 154,633 37 37
trnK-UUU 157,243 157,243 35 35
trnL-CAA 60,528 60,528 81 81
trnL-CAA 148,987 148,987 81 81
trnL-UAA 144,543 144,543 35 35
trnL-UAA 145,116 145,116 50 50
trnL-UAG 118,203 118,203 80 80
trnM-CAU 149,521 149,521 73 73
trnN-GUU 45,663 45,663 72 72
trnN-GUU 134,122 134,122 72 72
trnP-UGG 89,449 89,449 74 74
trnQ-UUG 57,577 57,577 72 72
trnR-ACG 24,071 24,071 74 74
trnR-ACG 112,530 112,530 74 74
trnR-UCU 104,944 104,944 72 72
trnS-GCU 55,883 55,883 87 87
trnS-GGA 142,090 142,090 88 88
trnS-UGA 26,492 26,492 93 93
trnT-GGU 128,163 128,163 72 72
trnT-UGU 15,609 15,609 73 73
trnV-GAC 16,264 16,264 72 72
trnV-GAC 104,723 104,723 72 72
trnV-UAC 9629 9629 39 39
trnV-UAC 10,261 10,261 35 35
trnW-CCA 89,698 89,698 74 74
trnY-GUA 31,746 31,746 84 84
Miscellaneous group accD 60,176 60,176 1506 1506 10 10
ccsA 118,409 118,409 972 972
cemA 63,530 63,530 690 690
clpP1 83,619 83,619 71 71
clpP1 84,501 84,501 292 292
clpP1 85,385 85,385 228 228
infA 75,165 75,165 168 168
matK 155,389 155,389 1326 1326
pbf1 81,125 81,125 132 132
rbcL 152,413 152,413 1428 1428
Hypothetical chloroplast reading frames ycf1 25,227 25,227 5334 5334 8 8
ycf1 113,686 113,686 468 468
ycf2 3124 2458 6195 6861
ycf2 90,917 90,917 6861 6861
ycf3 17,138 17,138 124 124
ycf3 17,984 17,984 230 230
ycf3 18,995 18,995 153 153
ycf4 62,512 62,512 555 555
Total 158 159

Comparison of the assembled and reference Cp genomes with mVISTA showed high sequence similarity except for variability in the ycf genes (Supplementary Fig. 1). Sequence variability in these genes is well documented and is a target for Pterocarpus barcode development (Jiao et al. 2019).

Repeat analysis using REPuter predicted a total of 25 repeat regions in the forward vs forward comparison in the assembled genome, with 23 repeats between 22 and 65 bp and 2 repeats between 244 and 287 bp (Supplementary Fig. 2a). In the reference genome, the forward vs forward comparison predicted 11 repeats between 24 and 67 bp, one repeat between 68 and 111 bp and 2 repeats between 244 and 287 bp (Supplementary Fig. 2b). Similarly, in the forward vs reverse complement comparison, 34 repeats were identified between 26 and 1409 bp, one repeat between 1410 and 2792 bp, and two repeats in the 5560–6943 and 6944–8326 bp ranges in the assembled genome (Supplementary Fig. 2c). In the reference genome, the forward vs reverse complement comparison identified 17 repeats between 24 and 4301 bp and one repeat between 21,416 and 25,693 bp, giving a total of 32 repeat regions across both comparisons (Supplementary Fig. 2d).
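The two comparison types can be illustrated with a naive exact-match search: a forward repeat is the same substring occurring at two positions, while a reverse complement repeat is a substring whose reverse complement occurs elsewhere on the same strand. The sketch below is only an illustration of these definitions using fixed-length k-mers; it is not the REPuter algorithm, and the function names and toy sequence are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_repeats(seq, k):
    """Return (forward, revcomp) lists of start-position pairs for exact k-mers."""
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)

    forward, revcomp = [], []
    for kmer, positions in index.items():
        # Forward repeats: the same k-mer at two different positions.
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                forward.append((positions[a], positions[b]))
        # Reverse complement repeats: the k-mer's reverse complement
        # occurs at a later position (i < j avoids reporting a pair twice).
        rc = reverse_complement(kmer)
        if rc in index:
            for i in positions:
                for j in index[rc]:
                    if i < j:
                        revcomp.append((i, j))
    return sorted(forward), sorted(revcomp)


toy_forward, toy_revcomp = find_repeats("AAGCTTGGAAGC", 4)
```

In the toy sequence, `AAGC` occurs at positions 0 and 8 (a forward repeat) and its reverse complement `GCTT` occurs at position 2. Dedicated tools additionally extend matches to maximal length and tolerate mismatches, which this sketch does not attempt.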

The number of nucleotides in the assembled genome was A = 35,355, G = 17,851, T = 35,573, C = 17,608, while in the reference genome it was A = 50,633, G = 29,013, T = 50,615, C = 28,705. Microsatellite repeat analysis using MISA predicted 344 repeats (Fig. 2a) in the assembled Cp genome, with 268 mono-nucleotide (77.90%), 52 di-nucleotide (15.12%), 15 tri-nucleotide (4.36%), 5 tetra-nucleotide (1.45%), 3 penta-nucleotide (0.87%) and 1 hexa-nucleotide (0.29%) repeats. In comparison, a total of 349 microsatellite repeats were identified in the reference genome, with 272 mono-nucleotide (77.93%), 51 di-nucleotide (14.61%), 15 tri-nucleotide (4.29%), 5 tetra-nucleotide (1.43%), 5 penta-nucleotide (1.43%) and 1 hexa-nucleotide (0.28%) repeats. A total of 10 and 12 repeat types were predicted in the assembled and reference genomes, respectively, and AT/AT was the predominant repeat class in both (Fig. 2b).
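The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts reported in the text; the variable and function names are hypothetical. The reference base counts sum to the 158,966 bp genome length, whereas the four assembled counts sum to 106,387; the text does not state how the remaining positions of the assembled sequence were encoded, so no interpretation is built into the code. Percentages computed here are rounded to two decimals and may differ from the text in the last digit.

```python
GENOME_LENGTH_BP = 158_966  # assembled genome size reported in the text

# Base counts as reported in the text.
assembled_bases = {"A": 35_355, "G": 17_851, "T": 35_573, "C": 17_608}
reference_bases = {"A": 50_633, "G": 29_013, "T": 50_615, "C": 28_705}

# Microsatellite counts by motif length as reported in the text.
assembled_ssr = {"mono": 268, "di": 52, "tri": 15, "tetra": 5, "penta": 3, "hexa": 1}
reference_ssr = {"mono": 272, "di": 51, "tri": 15, "tetra": 5, "penta": 5, "hexa": 1}


def percentages(counts):
    """Return each count as a percentage of the total, rounded to 2 decimals."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 2) for key, value in counts.items()}


assembled_total = sum(assembled_bases.values())
reference_total = sum(reference_bases.values())

print("assembled A/C/G/T total:", assembled_total)
print("reference A/C/G/T total:", reference_total)
print("assembled SSR %:", percentages(assembled_ssr))
print("reference SSR %:", percentages(reference_ssr))
```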

Fig. 2.


a Number of microsatellite repeat motifs predicted in genic and intergenic regions of assembled and reference chloroplast genome of Pterocarpus santalinus. b Number of repeat types predicted in genic and intergenic regions of assembled and reference chloroplast genome of Pterocarpus santalinus

The phylogenetic tree grouped both Cp genomes of P. santalinus with 100% bootstrap support (Fig. 3). The other Pterocarpus species, P. pedatus, P. indicus, P. marsupium and P. macrocarpus, grouped into a single clade, while P. tinctorius formed a separate clade (Fig. 3). The phylogenetic grouping of the Pterocarpus species is consistent with the previous report by Hong et al. (2020). Hence, the comparative analysis of the two genomes indicates that the methodology proposed in the present study can effectively assemble a near-complete Cp genome from transcriptome datasets. Grouping of the reference and assembled genomes with 100% bootstrap support reiterates the feasibility of the method developed in the study.

Fig. 3.


Phylogenetic tree constructed using Neighbor-Joining (NJ) method from complete chloroplast genomes of 28 species belonging to Fabaceae. Numbers at the nodes indicate bootstrap values from 1000 iterations. Arrow indicates grouping of the assembled and reference chloroplast genome of Pterocarpus santalinus with 100% confidence

In land plants, the Cp DNA is highly conserved in structure, content, and gene order (Shaw et al. 2007). The genome size varies from 15,553 to 521,168 bp (Dobrogojski et al. 2020) and the total number of genes encoded by Cp genomes ranges from 120 to 140 (Rogalski et al. 2015). A typical Cp genome is arranged in a quadripartite structure, consisting of a large single copy (LSC 80–90 kbp) region and a small single copy (SSC 16–27 kbp) region separated by a pair of inverted repeats (IRs 20–30 kbp) (Wicke et al. 2011). Comparative chloroplast genomics revealed that the Cp DNA is highly variable at genome-scale (Whittall et al. 2010; Besnard et al. 2011) specifically in the non-coding intergenic spacer region (Daniell et al. 2006, 2016). Hence, recent studies have utilized the entire plastomes as ‘super barcodes’ enabling identification of hypervariable loci and lineage-specific InDels for efficient discrimination of plant species (Niu et al. 2017; Fu et al. 2019).

The use of Cp genome in evolutionary analysis, phylogenomics, barcoding, and meta-barcoding is well established (Li et al. 2015; Hollingsworth et al. 2016; Dormontt et al. 2018). In crop breeding, it has been used in the identification of cultivars, assessing hybrid purity, and understanding domestication history (Daniell et al. 2016; Teske et al. 2020). The translational application of chloroplast transformation in conferring biotic and abiotic stress tolerance in plants and production of biopharmaceuticals, biomaterials, enzymes, biofuels, and vaccines is also reported (reviewed by Bansal and Saha 2012; Daniell et al. 2016; Yu et al. 2020; Li et al. 2021). These transplastomic plants can integrate and express up to 10,000 copies of transgenes in contrast to nuclear genome, facilitating an extremely high level of transgene expression (Oey et al. 2009; Jin and Daniell 2015). Due to its maternal inheritance, it also minimizes the transgene escape, alleviating biosafety concerns (Daniell 2007; Boehm and Bock 2019).

RNA editing is a post-transcriptional process which generates RNA and protein diversity and regulates gene expression (Okuda et al. 2007). Land plants typically have 20–60 editing sites in chloroplast RNA (Ichinose and Sugita 2016) and the key editing target is the rbcL gene encoding the large subunit of ribulose bisphosphate carboxylase/oxygenase (RuBisCO). Transplastomic plants have facilitated the understanding of RNA editing and have been extensively used in the mapping of cis-acting elements, introduction of heterologous editing sites to characterize trans-acting specificity factors and expression of synthetic sequences (Ruf and Bock 2011; Avila et al. 2016). In a recent study, transplastomic tobacco expressing synthetic glycolate metabolic pathways was reported, and field evaluation of the transgenic lines revealed a 20% improvement in photosynthesis and up to a 37% increase in biomass. These lines were also tolerant to photorespiration stress (South et al. 2019). This study opens up a new vista in chloroplast genomics, indicating that gene editing in conjunction with synthetic biology can enhance the photosynthetic efficiency of crop plants, thereby enhancing productivity.

Cp genome sequencing has been successfully conducted either by chloroplast enrichment and sequencing or by assembly from whole-genome datasets (reviewed by Twyford and Ness 2017). Computational pipelines like Fast-Plast, GetOrganelle, NOVOPlasty and ChloroExtractor have been evaluated for their efficiency in assembling Cp genomes (Freudenthal et al. 2020). These tools vary in their hardware requirements and utilization, efficiency, repeatability and time consumption in processing whole-genome sequencing (WGS) reads. We used NOVOPlasty and GetOrganelle to assemble the Cp genome of P. santalinus from transcriptome data. Both programs generated fragmented contigs in the range of ~500 bp to ~21 kb (data not shown) and a successful assembly could not be achieved. Hence, a pipeline was developed to construct a near-complete Cp genome from the leaf RNA-seq reads. This alternative approach is more cost-effective and less labour intensive than chloroplast enrichment followed by NGS or whole-genome sequencing. An indicative cost of chloroplast enrichment and sequencing on the Illumina platform in P. santalinus is ~335 USD, while WGS with 30× coverage would be ~1272 USD. Genome skimming at 1.5× depth would cost ~536 USD. Transcriptome sequencing, which would cost ~340 USD, can be used both for expression studies and for retrieval of Cp-specific reads for genome assembly.

The pipeline developed in the present study offers several advantages over WGS, including limited requirements for computing time and know-how, and cost-effectiveness. One major benefit of using transcriptome data is the reduced size of the dataset, which is less than 5% of the entire genome (Pertea 2012). Further, the presence of fewer tandem repeat elements in transcriptome data reduces errors in sequence assembly when compared to WGS data (Tørresen et al. 2019). The near-complete Cp genome of P. santalinus assembled using the present method is highly encouraging, considering that the reference genome used for comparison was assembled from high-depth whole-genome sequencing. The minor gaps observed in the present assembly could be minimized by increasing the depth of RNA-seq or bridged using amplicon sequencing. This method can accelerate evolutionary and phylogenomic studies, enable species discrimination and hybrid validation in breeding programs, delineate cryptic species, assist timber forensics and advance chloroplast genomics in plants.


Acknowledgements

The authors acknowledge the National Biodiversity Authority, Government of India for funding support.

Author contributions

SS conducted Cp genome assembly, annotation, analysis and drafted the manuscript; KU conceptualized the pipeline; MGD conceptualized the research, obtained funding, conducted transcriptome sequencing, prepared and finalized the manuscript. All authors have approved the manuscript.

Funding

This study was funded by the National Biodiversity Authority, Government of India.

Availability of data and material

Not applicable.

Code availability

Not applicable.

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

References

  1. Ankenbrand MJ, Pfaff S, Terhoeven N, et al. chloroExtractor: extraction and assembly of the chloroplast genome from whole genome shotgun data. J Open Source Softw. 2018;3:464. doi: 10.21105/joss.00464.
  2. Atherton RA, McComish BJ, Shepherd LD, Berry LA, Albert NW, Lockhart PJ. Whole genome sequencing of enriched chloroplast DNA using the Illumina GAII platform. Plant Methods. 2010;6:22. doi: 10.1186/1746-4811-6-22.
  3. Avila ME, Gisby MF, Day A. Seamless editing of the chloroplast genome in plants. BMC Plant Biol. 2016;16:168. doi: 10.1186/s12870-016-0857-6.
  4. Baker P, Jackson P, Aitken K. Bayesian estimation of marker dosage in sugarcane and other autopolyploids. Theor Appl Genet. 2010;120:1653–1672. doi: 10.1007/s00122-010-1283-z.
  5. Bansal KC, Saha D. Chloroplast genomics and genetic engineering for crop improvement. Agric Res. 2012;1:53–66. doi: 10.1007/s40003-011-0010-6.
  6. Beier S, Thiel T, Münch T, et al. MISA-web: a web server for microsatellite prediction. Bioinformatics. 2017;33:2583–2585. doi: 10.1093/bioinformatics/btx198.
  7. Besnard G, Hernández P, Khadari B, Dorado G, Savolainen V. Genomic profiling of plastid DNA variation in the mediterranean olive tree. BMC Plant Biol. 2011;11:80. doi: 10.1186/1471-2229-11-80.
  8. Boehm CR, Bock R. Recent advances and current challenges in synthetic biology of the plastid genetic system and metabolism. Plant Physiol. 2019;179:794–802. doi: 10.1104/pp.18.00767.
  9. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170.
  10. Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330.
  11. Daniell H. Transgene containment by maternal inheritance: effective or elusive? Proc Natl Acad Sci USA. 2007;104:6879–6880. doi: 10.1073/pnas.0702219104.
  12. Daniell H, Lee SB, Grevich J, Saski C, Quesada-Vargas T, Guda C, et al. Complete chloroplast genome sequences of Solanum bulbocastanum, Solanum lycopersicum and comparative analyses with other Solanaceae genomes. Theor Appl Genet. 2006;112:1503–1518. doi: 10.1007/s00122-006-0254-x.
  13. Daniell H, Lin CS, Yu M, Chang WJ. Chloroplast genomes: diversity, evolution, and applications in genetic engineering. Genome Biol. 2016;17(1):134. doi: 10.1186/s13059-016-1004-2.
  14. Dierckxsens N, Mardulyn P, Smits G. NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res. 2017;45:18. doi: 10.1093/nar/gkw955.
  15. Dobrogojski J, Adamiec M, Lucinski R. The chloroplast genome: a review. Acta Physiol Plant. 2020;42:98. doi: 10.1007/s11738-020-03089-x.
  16. Dormontt EE, van Dijk K, Bell KL, Biffin E, Breed MF, Byrne M, Caddy-Retalic S, Encinas-Viso F, Nevill PG, Shapcott A, Young JM, Waycott M, Lowe AJ. Advancing DNA barcoding and metabarcoding applications for plants requires systematic analysis of herbarium collections—an Australian perspective. Front Ecol Evol. 2018;6:134. doi: 10.3389/fevo.2018.00134.
  17. Freudenthal JA, Pfaff S, Terhoeven N, et al. A systematic comparison of chloroplast genome assembly tools. Genome Biol. 2020;21:254. doi: 10.1186/s13059-020-02153-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Fu CN, Wu CS, Ye LJ, Mo ZQ, Liu J, Chang YW, Li DZ, Chaw SM, Gao LM. Prevalence of isomeric plastomes and effectiveness of plastome super-barcodes in yews (Taxus) worldwide. Sci Rep. 2019;9(1):1–11. doi: 10.1038/s41598-019-39161-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Hall TA. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser. 1999;41:95–98. [Google Scholar]
  20. Hollingsworth PM, Li D-Z, Van Der Bank M, Twyford AD. Telling plant species apart with DNA: from barcodes to genomes. Philos Trans R Soc B. 2016;371:20150338. doi: 10.1098/rstb.2015.0338. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Hong Z, Wu Z, Zhao K, Yang Z, Zhang N, Guo J, Tembrock LR, Xu D. Comparative analyses of five complete chloroplast genomes from the genus Pterocarpus (Fabacaeae) Int J Mol Sci. 2020;21(11):3758. doi: 10.3390/ijms21113758. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Ichinose M, Sugita M. RNA editing and its molecular mechanism in plant organelles. Genes. 2016;8(1):5. doi: 10.3390/genes8010005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Jansen RK, Saski C, Lee SB, Hansen AK, Daniell H. Complete plastid genome sequences of three rosids (Castanea, Prunus, Theobroma): evidence for at least two independent transfers of rpl22 to the nucleus. Mol Biol Evol. 2011;28:835–847. doi: 10.1093/molbev/msq261. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Jiao L, Lu Y, He T, Li J, Yin Y. A strategy for developing high-resolution DNA barcodes for species discrimination of wood specimens using the complete chloroplast genome of three Pterocarpus species. Planta. 2019;250(1):95–104. doi: 10.1007/s00425-019-03150-1. [DOI] [PubMed] [Google Scholar]
  25. Jin S, Daniell H. The engineered chloroplast genome just got smarter. Trends Plant Sci. 2015;20:622–640. doi: 10.1016/j.tplants.2015.07.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Jin JJ, Bin YuW, Yang JB, et al. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol. 2020;21:241. doi: 10.1186/s13059-020-02154-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Kumar S, Stecher G, Li M, et al. MEGA X: Molecular evolutionary genetics analysis across computing platforms. Mol Biol Evol. 2018;35:1547–1549. doi: 10.1093/molbev/msy096.
  28. Kurtz S, Choudhuri JV, Ohlebusch E, et al. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 2001;29:4633–4642. doi: 10.1093/nar/29.22.4633.
  29. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:207–209. doi: 10.1093/bioinformatics/btp352.
  30. Li XW, Yang Y, Henry RJ, Rossetto M, Wang YT, Chen SL. Plant DNA barcoding: from gene to genome. Biol Rev. 2015;90:157–166. doi: 10.1111/brv.12104.
  31. Li S, Chang L, Zhang J. Advancing organelle genome transformation and editing for crop improvement. Plant Commun. 2021;2:100141. doi: 10.1016/j.xplc.2021.100141.
  32. Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS, Dubchak I. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics. 2000;16:1046–1047. doi: 10.1093/bioinformatics/16.11.1046.
  33. McKain M, Wilson M (2017) Fast-Plast: rapid de novo assembly and finishing for whole chloroplast genomes. https://github.com/mrmckain/Fast-Plast
  34. Mower JP, Vickrey TL. Structural diversity among plastid genomes of land plants. In: Chaw S-M, Jansen RK, editors. Advances in botanical research. Cambridge: Academic Press; 2018. pp. 263–292.
  35. Niu Z, Xue Q, Zhu S, Sun J, Liu W, Ding X. The complete plastome sequences of four orchid species: insights into the evolution of the Orchidaceae and the utility of plastomic mutational hotspots. Front Plant Sci. 2017;8:715. doi: 10.3389/fpls.2017.00715.
  36. Oey M, Lohse M, Kreikemeyer B, Bock R. Exhaustion of the chloroplast protein synthesis capacity by massive expression of a highly stable protein antibiotic. Plant J. 2009;57:436–445. doi: 10.1111/j.1365-313X.2008.03702.x.
  37. Okuda K, Myouga F, Motohashi R, Shinozaki K, Shikanai T. Conserved domain structure of pentatricopeptide repeat proteins involved in chloroplast RNA editing. Proc Natl Acad Sci USA. 2007;104:8178–8183. doi: 10.1073/pnas.0700865104.
  38. Pertea M. The human transcriptome: an unfinished story. Genes (Basel). 2012;3(3):344–360. doi: 10.3390/genes3030344.
  39. Rogalski M, do Vieira NL, Fraga HP, Guerra MP. Plastid genomics in horticultural species: importance and applications for plant population genetics, evolution, and biotechnology. Front Plant Sci. 2015;6:586. doi: 10.3389/fpls.2015.00586.
  40. Ruf S, Bock R (2011) In vivo analysis of RNA editing in plastids. In: Aphasizhev R (ed) RNA and DNA editing. Methods in molecular biology, vol 718. Humana Press, Totowa. doi: 10.1007/978-1-61779-018-8_8
  41. Shaw J, Lickey EB, Schilling EE, Small RL. Comparison of whole chloroplast genome sequences to choose noncoding regions for phylogenetic studies in angiosperms: the tortoise and the hare III. Am J Bot. 2007;94(3):275–288. doi: 10.3732/ajb.94.3.275.
  42. South PF, Cavanagh AP, Liu HW, Ort DR. Synthetic glycolate metabolism pathways stimulate crop growth and productivity in the field. Science. 2019;363:aat9077. doi: 10.1126/science.aat9077.
  43. Teske D, Peters A, Möllers A, Fischer M. Genomic profiling: the strengths and limitations of chloroplast genome-based plant variety authentication. J Agric Food Chem. 2020;68(49):14323–14333. doi: 10.1021/acs.jafc.0c03001.
  44. Tillich M, Lehwark P, Pellizzer T, et al. GeSeq—versatile and accurate annotation of organelle genomes. Nucleic Acids Res. 2017;45:W6–W11. doi: 10.1093/nar/gkx391.
  45. Tørresen OK, Star B, Mier P, Andrade-Navarro MA, Bateman A, Jarnot P, Gruca A, Grynberg M, Kajava AV, Promponas VJ, Anisimova M, Jakobsen KS, Linke D. Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Res. 2019;47(21):10994–11006. doi: 10.1093/nar/gkz841.
  46. Twyford AD, Ness RW. Strategies for complete plastid genome sequencing. Mol Ecol Resour. 2017;17:858–868. doi: 10.1111/1755-0998.12626.
  47. Vieira Ldo LN, Faoro H, Rogalski M, Fraga HP, Cardoso RL, de Souza EM, et al. The complete chloroplast genome sequence of Podocarpus lambertii: genome structure, evolutionary aspects, gene content and SSR detection. PLoS ONE. 2014. doi: 10.1371/journal.pone.0090618.
  48. Whittall JB, Syring J, Parks M, Buenrostro J, Dick C, Liston A, et al. Finding a (pine) needle in a haystack: chloroplast genome sequence divergence in rare and widespread pines. Mol Ecol. 2010;19(Suppl 1):100–114. doi: 10.1111/j.1365-294X.2009.04474.x.
  49. Wicke S, Schneeweiss GM, de Pamphilis CW, Müller KF, Quandt D. The evolution of the plastid chromosome in land plants: gene content, gene order, gene function. Plant Mol Biol. 2011;76(3):273–297. doi: 10.1007/s11103-011-9762-4.
  50. Yu Y, Yu PC, Chang WJ, Yu K, Lin CS. Plastid transformation: how does it work? Can it be applied to crops? What can it offer? Int J Mol Sci. 2020;21(14):4854. doi: 10.3390/ijms21144854.
  51. Zhong X (2020) Assembly, annotation and analysis of chloroplast genomes. Doctoral thesis, The University of Western Australia

Data Availability Statement

Not applicable.

Articles from 3 Biotech are provided here courtesy of Springer