Abstract
Chloroplast genome sequencing is an essential tool for understanding genome evolution and phylogenetic relationships. The available methods for constructing chloroplast genomes include chloroplast enrichment followed by long overlapping PCR, or extraction and assembly of chloroplast-specific reads from whole-genome datasets. In the present study, we propose an alternate strategy of extracting and assembling chloroplast-specific reads from leaf transcriptome data of Pterocarpus santalinus using the Bowtie2 aligner. The assembled genome was compared with the published chloroplast genome of P. santalinus for genome size, number of predicted genes, microsatellite repeat motifs, and nucleotide repeats. A near-complete chloroplast genome was assembled from the transcriptome reads. The proposed method requires less computational time, know-how and virtual memory, and is cost-effective when compared to whole-genome sequencing. Assembly of chloroplast (Cp) genomes from transcriptome data will enhance the resolution of phylogenetic studies through comparative plastome analysis, facilitate accurate species/genotype discrimination and accelerate the development of transplastomic plants with enhanced biotic and abiotic stress tolerance.
Supplementary Information
The online version contains supplementary material available at 10.1007/s13205-021-02943-0.
Keywords: Assembly, Chloroplast genome, Phylogeny, Repeat analysis, Transcriptome data
Introduction
Chloroplasts (Cp) are semi-autonomous organelles and a metabolic centre of plant life. Their genome was among the first to be sequenced and has been a valuable resource for deciphering phylogenetic relatedness between taxa and resolving evolutionary relationships (Daniell et al. 2016). By 2020, 4717 Cp genomes had been sequenced from plants (Zhong 2020). The Cp genome is predominantly a circular molecule, although recent studies have shown multi-branched linear structures of Cp DNA in a few angiosperms (Mower et al. 2018). Apart from being the organelle that conducts photosynthesis, the chloroplast also regulates other crucial biochemical processes, including the synthesis of biomolecules such as fatty acids, nucleotides, amino acids, phytohormones and vitamins, and plays a major role in plant response to biotic and abiotic stresses (Daniell et al. 2016). The use of Cp genomes in timber forensics, crop improvement, and production of biopharmaceuticals is extensively documented (Bansal and Saha 2012; Daniell et al. 2016; Yu et al. 2020; Teske et al. 2020; Li et al. 2021).
The conventional method for Cp genome sequencing involves chloroplast enrichment using a sucrose or Percoll gradient, the high-salt method, or proprietary kits (Chloroplast Isolation Kit from Sigma Aldrich, USA or Abcam, MA, USA). Subsequently, long overlapping PCR is conducted to sequence the genome (reviewed by Twyford and Ness 2017). The major limitations of this strategy are the cost of chloroplast isolation and the large quantity of starting material required, which can be restrictive for samples sourced from herbaria or endangered species (Vieira Ldo et al. 2014). Another challenge is primer design for long-range PCR, which depends on sequence conservation across species. Differences in gene organization can severely hamper amplification success, thus affecting genome assembly (Atherton et al. 2010). Alternatively, screening of bacterial artificial chromosome (BAC) or fosmid libraries using chloroplast-specific probes has also been reported (Daniell et al. 2006; Jansen et al. 2011), but these are technically demanding procedures. With the advent of next-generation sequencing (NGS) platforms, sequencing of enriched chloroplast DNA (Atherton et al. 2010) or use of whole-genome sequence datasets has emerged as a viable method for assembling Cp genomes (reviewed by Twyford and Ness 2017). The Illumina platform is considered the most suitable NGS platform for sequencing Cp genomes, since it allows sequencing of rolling circle amplification products (Atherton et al. 2010). Software tools like IOGA (Baker et al. 2010), Fast-Plast (McKain and Wilson 2017), GetOrganelle (Jin et al. 2020), NOVOPlasty (Dierckxsens et al. 2017) and ChloroExtractor (Ankenbrand et al. 2018) were developed for extracting organellar reads from whole-genome datasets and have been used for assembling chloroplast genomes.
A comparison of different tools for assembling complete chloroplast genomes revealed that GetOrganelle performed best on both simulated and real data, followed by Fast-Plast (Freudenthal et al. 2020). However, sequencing the whole genome is cost-intensive, and assembling organellar genomes from these datasets requires advanced computational expertise and infrastructure.
In the present study, an alternate approach of extracting and assembling chloroplast reads from a transcriptome dataset was attempted, and the method was demonstrated using the leaf transcriptome of Pterocarpus santalinus. This methodology demands less computational time and limited virtual memory and can be executed by researchers with limited knowledge of computational biology.
Total RNA was isolated from young leaves of P. santalinus using the RNAqueous®-Micro Total RNA Isolation Kit (Thermo Scientific, USA). The concentration of RNA was quantified using a Qubit fluorometer (Thermo Fisher Scientific, MA, USA) and a TapeStation (Agilent Technologies Inc., Santa Clara, CA). The RNA integrity number equivalent (RINe) was determined on the TapeStation. Five hundred ng of total RNA was used to enrich mRNA using the NEBNext Poly(A) mRNA Magnetic Isolation Module, and the enriched mRNA was chemically fragmented, reverse transcribed, and cleaned. The cDNA was end-repaired, adapter-ligated, size selected, PCR amplified (12 cycles) and cleaned prior to library construction. The library was constructed using the NEBNext® Ultra™ II RNA Library Prep Kit following the manufacturer’s protocol, quantified using the Qubit fluorometer, validated on the TapeStation and sequenced on an Illumina HiSeq 2000 (Illumina Inc., San Diego, CA, USA) using 150 bp paired-end chemistry.
The raw RNA-seq data were quality checked using FastQC, and low-quality and adapter sequences were removed using Trimmomatic (Bolger et al. 2014). The processed reads were subsequently used as input for the Bowtie2 aligner, with the P. santalinus chloroplast sequence (Acc. No. MT249117.1; Hong et al. 2020) as the reference. The reference Cp genome used in the present study was assembled from a whole-genome dataset generated using a hybrid strategy of short-read sequencing on Illumina HiSeq 4000 and long-read sequencing on PacBio Sequel (Hong et al. 2020).
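A quality check of this kind is often summarized as the share of base calls at or above Q30. The sketch below shows how such a figure could be computed from FASTQ quality strings. It is only an illustration: the function name `q30_fraction` is hypothetical, Phred+33 encoding is assumed, and the study itself used FastQC rather than a custom script.

```python
def q30_fraction(quality_strings, threshold=30, offset=33):
    """Return the fraction of base calls with Phred score >= threshold.

    Assumes Phred+33 ASCII encoding (an assumption, not stated in the text).
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


# Toy quality strings: 'I' encodes Q40, '?' encodes Q30, '#' encodes Q2.
example = ["IIII####", "????IIII"]
print(f"{q30_fraction(example):.2%}")
```

In practice the quality strings would be read from every fourth line of the trimmed FASTQ files.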
The SAM file from Bowtie2 was then converted to a coordinate-sorted BAM file, followed by generation of the consensus sequence using SAMtools (Li et al. 2009) and VCFtools (Danecek et al. 2011).
The commands used for constructing the Cp genome are given below:
Command for building reference index
$ bowtie2-build -f /path/to/reference.fasta /directory/path/to/write/reference/index/
Command for building Cp genome with reference sequence
$ bowtie2 --local -p 10 -x /path/to/reference/index/directory -1 /path/to/transcriptome/raw/reads/forward.fastq.gz -2 /path/to/transcriptome/raw/reads/reverse.fastq.gz -S output.sam
Conversion of SAM to sorted BAM file
$ samtools view -bS output.sam | samtools sort -o output.bam
Generation of consensus FASTA file from BAM file
$ samtools mpileup -uf /path/to/reference.fasta output.bam | bcftools call -c | vcfutils.pl vcf2fq > output.fastq
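The last command writes the consensus as FASTQ (output.fastq), whereas annotation tools generally expect FASTA. A minimal sketch of that conversion is given below. The function names `read_fastq` and `to_fasta` are hypothetical and this is not the authors' script. The parser tolerates sequence and quality blocks wrapped over several lines. Upper-casing the sequence is a choice made here; some consensus tools use lower case to flag low-confidence bases, so drop that step to keep the distinction.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence) pairs; sequence/quality may span several lines."""
    line = handle.readline()
    while line:
        if not line.startswith("@"):
            line = handle.readline()
            continue
        header = line[1:].strip()
        name = header.split()[0] if header else ""
        seq_parts = []
        line = handle.readline()
        # Sequence lines run until the '+' separator line.
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        seq = "".join(seq_parts)
        # Consume quality lines until as many symbols as bases were read,
        # so a quality line beginning with '@' is not mistaken for a header.
        qual_len = 0
        while qual_len < len(seq):
            line = handle.readline()
            if not line:
                break
            qual_len += len(line.strip())
        yield name, seq
        line = handle.readline()


def to_fasta(name, seq, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    seq = seq.upper()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy record: wrapped sequence, and a quality line that starts with '@'.
demo = "@chr1\nACGTn\nacgt\n+\nIIIII\n@III\n"
for name, seq in read_fastq(io.StringIO(demo)):
    print(to_fasta(name, seq), end="")
```

For a real run, `io.StringIO(demo)` would be replaced by `open("output.fastq")` and the result written to a `.fasta` file.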
All analyses were carried out on a Dell Precision 3630 workstation (i7-8700K 3.2 GHz processor, 6 cores, 12 threads, 32 GB RAM) running Linux Ubuntu 20.10 LTS.
The consensus sequence thus generated was annotated using the GeSeq online tool (Tillich et al. 2017). The number of genes in the assembled and the reference Cp genomes was predicted using the same tool. REPuter (Kurtz et al. 2001) and MISA (Beier et al. 2017) were used with default parameters to identify the nucleotide repeats and microsatellite repeats, respectively, in both the assembled and reference Cp genomes. The number of each nucleotide was determined using a Python script. mVISTA (available at http://genome.lbl.gov/vista/index.shtml) (Mayor et al. 2000) was used to visualize the alignment of the reference and assembled Cp genomes of P. santalinus and to identify sequence variations between the two assemblies.
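The nucleotide-counting script itself is not shown in the text. The following is a hypothetical stand-in illustrating one straightforward way to tally bases in a FASTA file; the function name `count_bases` is an assumption.

```python
from collections import Counter
import io


def count_bases(fasta_handle):
    """Count every sequence character in a FASTA stream, case-insensitively."""
    counts = Counter()
    for line in fasta_handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        counts.update(line.upper())
    return counts


# Toy FASTA record; a real run would pass open("assembled_cp.fasta") instead.
demo = io.StringIO(">toy\nACGTacgt\nNNAT\n")
counts = count_bases(demo)
for base in "ACGTN":
    print(base, counts[base])
```

Reporting N alongside A, C, G and T makes it easy to check that the counts sum to the sequence length.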
The assembled and reference Cp genomes of P. santalinus along with 27 members from Fabaceae were used to construct the phylogenetic tree. Pterocarpus species including P. indicus, P. macrocarpus, P. marsupium, P. tinctorius and P. pedatus were included in the study to document their phylogenetic relatedness. Multiple sequence alignment was conducted using BioEdit (Hall 1999) and phylogenetic analysis was carried out in MEGA X (Kumar et al. 2018). A Neighbor-Joining (NJ) tree was constructed using the p-distance model with 1000 bootstrap iterations, and pairwise deletion was selected for gap treatment.
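The p-distance is the proportion of compared sites at which two aligned sequences differ. Under pairwise deletion, a site is dropped only for the pair in which one of the two sequences has a gap or missing character. The sketch below illustrates the calculation for one pair; it is not MEGA X's implementation, and the function name and the choice of `-` and `?` as gap symbols are assumptions.

```python
def p_distance(seq_a, seq_b, gap_chars="-?"):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence carries a gap/missing symbol are skipped
    for this pair only (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Ten aligned columns: two contain a gap, one of the remaining eight differs.
print(p_distance("ACGT-ACGTA", "ACGA-AC-TA"))  # 0.125
```

A distance matrix for NJ would be obtained by applying this function to every pair of aligned genomes.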
The concentration of total RNA isolated from the leaf tissues was 43.2 ng/µl using Qubit fluorometer and the RNA integrity number equivalent (RINe) value was 7.8. The enriched sequencing library was quantified using both Qubit fluorometer and TapeStation and the concentration was 18.7 and 15.2 ng/µl respectively.
A total of 35,861,326 raw reads were generated with a read length of 150 bp, and the percentage of reads above Q30 was 89.04%. The reference-based assembly of P. santalinus from leaf RNA-seq raw reads generated a Cp genome of 158,966 bp (Fig. 1), similar to the genome reported by Hong et al. (2020). A total of 158 genes were identified in the assembled genome, compared with 159 genes predicted from the reference genome (Hong et al. 2020) (Table 1). The genome sequences were annotated using GeSeq, and the genes annotated in both Cp genomes, with their positions and lengths, are listed in Table 1. The predicted genes and their numbers were comparable, except that one copy of trnI-CAU annotated in the reference genome was not predicted in the assembled genome. The comparative analysis indicated that a near-complete assembly of the P. santalinus Cp genome was achievable using the present method.
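A comparison of this kind can be automated by counting how often each gene name occurs in the two annotation lists. The sketch below uses short toy lists that mirror the trnI-CAU rows of Table 1 rather than the full annotations; the function name `copy_number_differences` is hypothetical.

```python
from collections import Counter


def copy_number_differences(assembled, reference):
    """Map gene name -> (assembled count, reference count) where they differ."""
    a_counts = Counter(assembled)
    r_counts = Counter(reference)
    diffs = {}
    for gene in sorted(set(a_counts) | set(r_counts)):
        if a_counts[gene] != r_counts[gene]:
            diffs[gene] = (a_counts[gene], r_counts[gene])
    return diffs


# Abbreviated toy lists, not the complete GeSeq output.
assembled_genes = ["atpA", "trnI-CAU", "trnM-CAU"]
reference_genes = ["atpA", "trnI-CAU", "trnI-CAU", "trnM-CAU"]
print(copy_number_differences(assembled_genes, reference_genes))
```

With full annotation exports, the two lists would be read from the gene-name column of each GeSeq result.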
Fig. 1.
Chloroplast genome of Pterocarpus santalinus assembled from leaf transcriptome data. The genes drawn outside and inside of the circle are transcribed in clockwise and counter clockwise directions, respectively. Genes are colored based on their functional groups
Table 1.
Comparative analysis of genes predicted from the assembled and reference chloroplast genomes of Pterocarpus santalinus using GeSeq
| Group | Gene name | Gene position (assembled) | Gene position (reference) | Gene length in bp (assembled) | Gene length in bp (reference) | Total no. of genes (assembled) | Total no. of genes (reference) |
|---|---|---|---|---|---|---|---|
| ATP synthase | atpA | 52,185 | 52,185 | 1533 | 1533 | 7 | 7 |
| | atpB | 7331 | 7331 | 1488 | 1488 | | |
| | atpE | 8815 | 8815 | 402 | 402 | | |
| | atpF | 50,786 | 50,786 | 145 | 145 | | |
| | atpF | 51,703 | 51,703 | 407 | 407 | | |
| | atpH | 50,123 | 50,123 | 246 | 246 | | |
| | atpI | 48,266 | 48,266 | 744 | 744 | | |
| NADH dehydrogenase | ndhA | 32,511 | 32,511 | 553 | 553 | 15 | 15 |
| | ndhA | 34,305 | 34,305 | 542 | 542 | | |
| | ndhB | 57,733 | 57,733 | 777 | 777 | | |
| | ndhB | 59,195 | 59,195 | 756 | 756 | | |
| | ndhB | 146,192 | 146,192 | 777 | 777 | | |
| | ndhB | 147,654 | 147,654 | 756 | 756 | | |
| | ndhC | 10,821 | 10,821 | 363 | 363 | | |
| | ndhD | 37,775 | 37,775 | 1497 | 1497 | | |
| | ndhE | 36,826 | 36,826 | 306 | 306 | | |
| | ndhF | 42,562 | 42,562 | 2256 | 2256 | | |
| | ndhG | 36,060 | 36,060 | 531 | 531 | | |
| | ndhH | 31,328 | 31,328 | 1182 | 1182 | | |
| | ndhI | 34,926 | 34,926 | 486 | 486 | | |
| | ndhJ | 12,053 | 12,053 | 477 | 477 | | |
| | ndhK | 11,153 | 11,153 | 744 | 744 | | |
| Cytochrome b/f complex | petA | 64,436 | 64,436 | 963 | 963 | 8 | 8 |
| | petB | 78,320 | 78,320 | 6 | 6 | | |
| | petB | 79,143 | 79,143 | 642 | 642 | | |
| | petD | 79,997 | 79,997 | 8 | 8 | | |
| | petD | 80,714 | 80,714 | 475 | 475 | | |
| | petG | 68,950 | 68,950 | 114 | 114 | | |
| | petL | 68,689 | 68,689 | 96 | 96 | | |
| | petN | 124,655 | 124,655 | 90 | 90 | | |
| Photosystem I | psaA | 20,016 | 20,016 | 2253 | 2253 | 5 | 5 |
| | psaB | 22,294 | 22,294 | 2205 | 2205 | | |
| | psaC | 37,398 | 37,398 | 246 | 246 | | |
| | psaI | 62,223 | 62,223 | 105 | 105 | | |
| | psaJ | 70,142 | 70,142 | 135 | 135 | | |
| Photosystem II | psbA | 157,594 | 157,594 | 1062 | 1062 | 14 | 14 |
| | psbB | 75,816 | 75,816 | 1527 | 1527 | | |
| | psbC | 130,754 | 130,754 | 1386 | 1386 | | |
| | psbD | 129,709 | 129,709 | 1062 | 1062 | | |
| | psbE | 91,561 | 91,561 | 252 | 252 | | |
| | psbF | 91,822 | 91,822 | 120 | 120 | | |
| | psbH | 77,953 | 77,953 | 222 | 222 | | |
| | psbI | 102,726 | 102,726 | 111 | 111 | | |
| | psbJ | 92,216 | 92,216 | 123 | 123 | | |
| | psbK | 102,027 | 102,027 | 186 | 186 | | |
| | psbL | 91,964 | 91,964 | 117 | 117 | | |
| | psbM | 32,839 | 32,839 | 105 | 105 | | |
| | psbT | 77,538 | 77,538 | 108 | 108 | | |
| | psbZ | 132,836 | 132,836 | 189 | 189 | | |
| Large subunit of ribosome | rpl14 | 73,842 | 73,842 | 369 | 369 | 14 | 14 |
| | rpl16 | 72,140 | 72,140 | 9 | 9 | | |
| | rpl16 | 73,310 | 73,310 | 399 | 399 | | |
| | rpl2 | 68,957 | 68,957 | 391 | 391 | | |
| | rpl2 | 70,013 | 70,013 | 434 | 434 | | |
| | rpl2 | 157,416 | 157,416 | 391 | 391 | | |
| | rpl2 | 158,472 | 158,472 | 434 | 434 | | |
| | rpl20 | 86,804 | 86,804 | 360 | 360 | | |
| | rpl22 | 70,941 | 71,243 | 113 | 327 | | |
| | rpl23 | 68,657 | 68,657 | 276 | 276 | | |
| | rpl23 | 157,116 | 157,116 | 276 | 276 | | |
| | rpl32 | 117,190 | 117,190 | 147 | 147 | | |
| | rpl33 | 70,781 | 70,781 | 201 | 201 | | |
| | rpl36 | 75,419 | 75,419 | 114 | 114 | | |
| Small subunit of ribosome | rps11 | 76,054 | 76,054 | 417 | 417 | 18 | 18 |
| | rps12 | 56,100 | 56,100 | 232 | 232 | | |
| | rps12 | 56,864 | 56,864 | 26 | 26 | | |
| | rps12 | 144,559 | 144,559 | 232 | 232 | | |
| | rps12 | 145,323 | 145,323 | 26 | 26 | | |
| | rps12-fragment | 85,828 | 85,828 | 114 | 114 | | |
| | rps14 | 24,622 | 24,622 | 303 | 303 | | |
| | rps15 | 30,944 | 30,944 | 273 | 273 | | |
| | rps16 | 58,227 | 58,227 | 40 | 40 | | |
| | rps16 | 59,166 | 59,166 | 230 | 230 | | |
| | rps18 | 71,243 | 71,254 | 327 | 84 | | |
| | rps19 | 70,507 | 70,507 | 279 | 279 | | |
| | rps2 | 47,307 | 47,307 | 711 | 711 | | |
| | rps3 | 71,322 | 71,322 | 657 | 657 | | |
| | rps4 | 15,958 | 15,958 | 606 | 606 | | |
| | rps7 | 56,947 | 56,947 | 468 | 468 | | |
| | rps7 | 145,406 | 145,406 | 468 | 468 | | |
| | rps8 | 74,578 | 74,578 | 405 | 405 | | |
| RNA polymerase subunits | rpoA | 76,553 | 76,553 | 996 | 996 | 5 | 5 |
| | rpoB | 36,680 | 36,680 | 3213 | 3213 | | |
| | rpoC1 | 39,919 | 39,919 | 432 | 432 | | |
| | rpoC1 | 41,095 | 41,095 | 1623 | 1623 | | |
| | rpoC2 | 42,888 | 42,888 | 4167 | 4167 | | |
| Ribosomal RNA | rrn16 | 16,558 | 16,558 | 1491 | 1491 | 10 | 10 |
| | rrn16 | 105,017 | 105,017 | 1491 | 1491 | | |
| | rrn23 | 20,448 | 20,448 | 2617 | 2617 | | |
| | rrn23 | 23,065 | 23,065 | 199 | 199 | | |
| | rrn23 | 108,907 | 108,907 | 2617 | 2617 | | |
| | rrn23 | 111,524 | 111,524 | 199 | 199 | | |
| | rrn4.5 | 23,362 | 23,362 | 104 | 104 | | |
| | rrn4.5 | 111,821 | 111,821 | 104 | 104 | | |
| | rrn5 | 23,690 | 23,690 | 121 | 121 | | |
| | rrn5 | 112,149 | 112,149 | 121 | 121 | | |
| Transfer RNA genes | trnA-UGC | 19,418 | 19,418 | 38 | 38 | 44 | 45 |
| | trnA-UGC | 20,256 | 20,256 | 35 | 35 | | |
| | trnA-UGC | 107,877 | 107,877 | 38 | 38 | | |
| | trnA-UGC | 108,715 | 108,715 | 35 | 35 | | |
| | trnC-GCA | 123,449 | 123,449 | 71 | 71 | | |
| | trnD-GUC | 32,339 | 32,339 | 74 | 74 | | |
| | trnE-UUC | 31,613 | 31,613 | 73 | 73 | | |
| | trnF-GAA | 145,553 | 145,553 | 73 | 73 | | |
| | trnfM-CAU | 25,092 | 25,092 | 74 | 74 | | |
| | trnG-GCC | 133,681 | 133,681 | 71 | 71 | | |
| | trnG-UCC | 103,910 | 103,910 | 23 | 23 | | |
| | trnG-UCC | 104,640 | 104,640 | 48 | 48 | | |
| | trnH-GUG | 158,851 | 158,851 | 75 | 75 | | |
| | trnI-CAU | 68,139 | 68,139 | 74 | 74 | | |
| | trnI-CAU | | 156,598 | | 74 | | |
| | trnI-GAU | 18,335 | 18,335 | 37 | 37 | | |
| | trnI-GAU | 19,324 | 19,324 | 35 | 35 | | |
| | trnI-GAU | 106,794 | 106,794 | 37 | 37 | | |
| | trnI-GAU | 107,783 | 107,783 | 35 | 35 | | |
| | trnK-UUU | 154,633 | 154,633 | 37 | 37 | | |
| | trnK-UUU | 157,243 | 157,243 | 35 | 35 | | |
| | trnL-CAA | 60,528 | 60,528 | 81 | 81 | | |
| | trnL-CAA | 148,987 | 148,987 | 81 | 81 | | |
| | trnL-UAA | 144,543 | 144,543 | 35 | 35 | | |
| | trnL-UAA | 145,116 | 145,116 | 50 | 50 | | |
| | trnL-UAG | 118,203 | 118,203 | 80 | 80 | | |
| | trnM-CAU | 149,521 | 149,521 | 73 | 73 | | |
| | trnN-GUU | 45,663 | 45,663 | 72 | 72 | | |
| | trnN-GUU | 134,122 | 134,122 | 72 | 72 | | |
| | trnP-UGG | 89,449 | 89,449 | 74 | 74 | | |
| | trnQ-UUG | 57,577 | 57,577 | 72 | 72 | | |
| | trnR-ACG | 24,071 | 24,071 | 74 | 74 | | |
| | trnR-ACG | 112,530 | 112,530 | 74 | 74 | | |
| | trnR-UCU | 104,944 | 104,944 | 72 | 72 | | |
| | trnS-GCU | 55,883 | 55,883 | 87 | 87 | | |
| | trnS-GGA | 142,090 | 142,090 | 88 | 88 | | |
| | trnS-UGA | 26,492 | 26,492 | 93 | 93 | | |
| | trnT-GGU | 128,163 | 128,163 | 72 | 72 | | |
| | trnT-UGU | 15,609 | 15,609 | 73 | 73 | | |
| | trnV-GAC | 16,264 | 16,264 | 72 | 72 | | |
| | trnV-GAC | 104,723 | 104,723 | 72 | 72 | | |
| | trnV-UAC | 9629 | 9629 | 39 | 39 | | |
| | trnV-UAC | 10,261 | 10,261 | 35 | 35 | | |
| | trnW-CCA | 89,698 | 89,698 | 74 | 74 | | |
| | trnY-GUA | 31,746 | 31,746 | 84 | 84 | | |
| Miscellaneous group | accD | 60,176 | 60,176 | 1506 | 1506 | 10 | 10 |
| | ccsA | 118,409 | 118,409 | 972 | 972 | | |
| | cemA | 63,530 | 63,530 | 690 | 690 | | |
| | clpP1 | 83,619 | 83,619 | 71 | 71 | | |
| | clpP1 | 84,501 | 84,501 | 292 | 292 | | |
| | clpP1 | 85,385 | 85,385 | 228 | 228 | | |
| | infA | 75,165 | 75,165 | 168 | 168 | | |
| | matK | 155,389 | 155,389 | 1326 | 1326 | | |
| | pbf1 | 81,125 | 81,125 | 132 | 132 | | |
| | rbcL | 152,413 | 152,413 | 1428 | 1428 | | |
| Hypothetical chloroplast reading frames | ycf1 | 25,227 | 25,227 | 5334 | 5334 | 8 | 8 |
| | ycf1 | 113,686 | 113,686 | 468 | 468 | | |
| | ycf2 | 3124 | 2458 | 6195 | 6861 | | |
| | ycf2 | 90,917 | 90,917 | 6861 | 6861 | | |
| | ycf3 | 17,138 | 17,138 | 124 | 124 | | |
| | ycf3 | 17,984 | 17,984 | 230 | 230 | | |
| | ycf3 | 18,995 | 18,995 | 153 | 153 | | |
| | ycf4 | 62,512 | 62,512 | 555 | 555 | | |
| Total | | | | | | 158 | 159 |
Comparison of the assembled and reference Cp genomes with mVISTA showed high sequence similarity except for variability in the ycf genes (Supplementary Fig. 1). The sequence variability in these genes is well documented and is a target for Pterocarpus barcode development (Jiao et al. 2019).
Repeat analysis using REPuter predicted a total of 25 repeat regions in the forward versus forward comparison in the assembled genome, with 23 repeats between 22 and 65 bp and 2 repeats between 244 and 287 bp (Supplementary Fig. 2a). In the reference genome, the forward versus forward comparison documented 11 repeats between 24 and 67 bp, one repeat between 68 and 111 bp and 2 repeats between 244 and 287 bp (Supplementary Fig. 2b). Similarly, in the forward versus reverse complement comparison, 34 repeats were identified between 26 and 1409 bp and one repeat between 1410 and 2792 bp, while two repeats were predicted in the 5560–6943 and 6944–8326 bp ranges in the assembled genome (Supplementary Fig. 2c). In the reference genome, the forward versus reverse complement comparison identified 17 repeats between 24 and 4301 bp and one repeat between 21,416 and 25,693 bp, bringing the total across both comparisons to 32 repeat regions (Supplementary Fig. 2d).
The number of nucleotides in the assembled genome was A = 35,355, G = 17,851, T = 35,573, C = 17,608, while in the reference genome it was A = 50,633, G = 29,013, T = 50,615, C = 28,705. Microsatellite repeat analysis using MISA predicted 344 repeats in the assembled Cp genome (Fig. 2a), with 268 mono-nucleotide (77.90%), 52 di-nucleotide (15.12%), 15 tri-nucleotide (4.36%), 5 tetra-nucleotide (1.45%), 3 penta-nucleotide (0.87%) and 1 hexa-nucleotide (0.29%) repeats. In comparison, a total of 349 microsatellite repeats were identified in the reference genome, with 272 mono-nucleotide (77.93%), 51 di-nucleotide (14.61%), 15 tri-nucleotide (4.29%), 5 tetra-nucleotide (1.43%), 5 penta-nucleotide (1.43%) and 1 hexa-nucleotide (0.28%) repeats. A total of 10 and 12 repeat types were predicted in the assembled and reference genomes, respectively, and AT/AT was the predominant repeat class in both Cp genomes (Fig. 2b).
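To make the microsatellite classes concrete, the sketch below finds perfect tandem repeats with a regular expression. It is a simplified illustration, not a re-implementation of MISA: compound repeats are not handled, and the minimum repeat counts (10 for mono-, 6 for di-, 5 for tri- to hexa-nucleotide motifs) are assumed to match commonly distributed MISA defaults, since the text only states that default parameters were used. All function names are hypothetical.

```python
import re

# Assumed thresholds: motif length -> minimum number of tandem copies.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself built from a shorter repeated unit."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (1-based start, motif, copies) for perfect SSRs."""
    sequence = sequence.upper()
    hits = []
    for unit, min_rep in min_repeats.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        pos = 0
        while True:
            m = pattern.search(sequence, pos)
            if m is None:
                break
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start() + 1, motif, len(m.group(0)) // unit))
                pos = m.end()
            else:
                # e.g. 'AA' found as a di-nucleotide: step on by one base so
                # an overlapping genuine repeat is not skipped.
                pos = m.start() + 1
    return sorted(hits)


# Toy sequence: A x12 and AT x6 qualify; AGC x4 is below the threshold.
demo = "GG" + "A" * 12 + "C" + "AT" * 6 + "G" + "AGC" * 4
print(find_ssrs(demo))
```

Class percentages such as those reported above follow by grouping the hits by motif length and dividing each count by the total.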
Fig. 2.
a Number of microsatellite repeat motifs predicted in genic and intergenic regions of assembled and reference chloroplast genome of Pterocarpus santalinus. b Number of repeat types predicted in genic and intergenic regions of assembled and reference chloroplast genome of Pterocarpus santalinus
The phylogenetic tree grouped both Cp genomes of P. santalinus with 100% bootstrap support (Fig. 3). The other Pterocarpus species, including P. pedatus, P. indicus, P. marsupium and P. macrocarpus, grouped into a single clade, while P. tinctorius formed a separate clade (Fig. 3). The phylogenetic grouping of the Pterocarpus species is consistent with the previous report by Hong et al. (2020). Hence, the comparative analysis of the two genomes indicates that the methodology proposed in the present study can effectively assemble a near-complete Cp genome from transcriptome datasets. Phylogenetic grouping of the reference and assembled genomes with 100% bootstrap support reiterates the feasibility of the method developed in the study.
Fig. 3.
Phylogenetic tree constructed using Neighbor-Joining (NJ) method from complete chloroplast genomes of 28 species belonging to Fabaceae. Numbers at the nodes indicate bootstrap values from 1000 iterations. Arrow indicates grouping of the assembled and reference chloroplast genome of Pterocarpus santalinus with 100% confidence
In land plants, the Cp DNA is highly conserved in structure, content, and gene order (Shaw et al. 2007). The genome size varies from 15,553 to 521,168 bp (Dobrogojski et al. 2020) and the total number of genes encoded by Cp genomes ranges from 120 to 140 (Rogalski et al. 2015). A typical Cp genome is arranged in a quadripartite structure, consisting of a large single copy (LSC 80–90 kbp) region and a small single copy (SSC 16–27 kbp) region separated by a pair of inverted repeats (IRs 20–30 kbp) (Wicke et al. 2011). Comparative chloroplast genomics revealed that the Cp DNA is highly variable at genome-scale (Whittall et al. 2010; Besnard et al. 2011) specifically in the non-coding intergenic spacer region (Daniell et al. 2006, 2016). Hence, recent studies have utilized the entire plastomes as ‘super barcodes’ enabling identification of hypervariable loci and lineage-specific InDels for efficient discrimination of plant species (Niu et al. 2017; Fu et al. 2019).
The use of Cp genomes in evolutionary analysis, phylogenomics, barcoding, and meta-barcoding is well established (Li et al. 2015; Hollingsworth et al. 2016; Dormontt et al. 2018). In crop breeding, they have been used to identify cultivars, assess hybrid purity, and understand domestication history (Daniell et al. 2016; Teske et al. 2020). The translational application of chloroplast transformation in conferring biotic and abiotic stress tolerance in plants and in the production of biopharmaceuticals, biomaterials, enzymes, biofuels, and vaccines has also been reported (reviewed by Bansal and Saha 2012; Daniell et al. 2016; Yu et al. 2020; Li et al. 2021). In contrast to nuclear transformants, these transplastomic plants can integrate and express up to 10,000 copies of a transgene, facilitating an extremely high level of transgene expression (Oey et al. 2009; Jin and Daniell 2015). Maternal inheritance of the chloroplast also minimizes transgene escape, alleviating biosafety concerns (Daniell 2007; Boehm and Bock 2019).
RNA editing is a post-transcriptional process that generates RNA and protein diversity and regulates gene expression (Okuda et al. 2007). Land plants typically have 20–60 editing sites in chloroplast RNA (Ichinose and Sugita 2016), and the key editing target is the rbcL gene encoding the large subunit of ribulose bisphosphate carboxylase/oxygenase (RuBisCO). Transplastomic plants have facilitated the understanding of RNA editing and have been extensively used in the mapping of cis-acting elements, the introduction of heterologous editing sites to characterize trans-acting specificity factors, and the expression of synthetic sequences (Ruf and Bock 2011; Avila et al. 2016). In a recent study, transplastomic tobacco expressing synthetic glycolate metabolic pathways was reported, and field evaluation of the transgenic lines revealed a 20% improvement in photosynthesis and up to a 37% increase in biomass. These lines were also tolerant to photorespiration stress (South et al. 2019). This study opens up a new vista in chloroplast genomics, indicating that gene editing in conjunction with synthetic biology can enhance the photosynthetic efficiency of crop plants, thereby enhancing productivity.
Cp genome sequencing has been successfully conducted either by chloroplast enrichment and sequencing or by assembly from whole-genome datasets (reviewed by Twyford and Ness 2017). Computational pipelines like Fast-Plast, GetOrganelle, NOVOPlasty and ChloroExtractor have been evaluated for their efficiency in assembling Cp genomes (Freudenthal et al. 2020). These tools vary in their hardware requirements and utilization, efficiency, repeatability and the time taken to process WGS reads. We used the NOVOPlasty and GetOrganelle programs to assemble the Cp genome of P. santalinus from transcriptome data. Both programs generated fragmented contigs in the range of ~500 bp to ~21 kb (data not shown) and successful assembly could not be achieved. Hence, a pipeline was developed to construct a near-complete Cp genome from the leaf RNA-seq reads. This alternate approach is more cost-effective and less labour-intensive than chloroplast enrichment followed by NGS or whole-genome sequencing. An indicative cost of chloroplast enrichment and sequencing on the Illumina platform in P. santalinus is ~335 USD, while WGS with 30× coverage would cost ~1272 USD. Genome skimming at 1.5× depth would cost ~536 USD. Transcriptome sequencing, which would cost ~340 USD, can be used for both expression studies and retrieval of Cp-specific reads for genome assembly.
The pipeline developed in the present study offers several advantages, including limited requirements for computing time and know-how, and cost-effectiveness when compared to WGS. One major benefit of using transcriptome data is the reduced size of the dataset, which is less than 5% of the entire genome (Pertea 2012). Further, the presence of fewer tandem repeat elements in transcriptome data reduces errors in sequence assembly when compared to WGS data (Tørresen et al. 2019). The near-complete Cp genome of P. santalinus assembled using the present method is highly encouraging, considering that the reference genome used for comparison was assembled from high-depth whole-genome sequencing. The minor gaps observed in the present assembly could be minimized by increasing the depth of RNA-seq or bridged using amplicon sequencing. This method can accelerate evolutionary and phylogenomic studies, enable species discrimination and hybrid validation in breeding programs, delineate cryptic species, assist timber forensics and advance chloroplast genomics in plants.
Acknowledgements
The authors acknowledge the National Biodiversity Authority, Government of India for funding support.
Author contributions
SS conducted Cp genome assembly, annotation, analysis and drafted the manuscript; KU conceptualized the pipeline; MGD conceptualized the research, obtained funding, conducted transcriptome sequencing, prepared and finalized the manuscript. All authors have approved the manuscript.
Funding
This study was funded by the National Biodiversity Authority, Government of India.
Availability of data and material
Not applicable.
Code availability
Not applicable.
Declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
References
- Ankenbrand MJ, Pfaff S, Terhoeven N, et al. chloroExtractor: extraction and assembly of the chloroplast genome from whole genome shotgun data. J Open Source Softw. 2018;3:464. doi: 10.2110/joss.00464. [DOI] [Google Scholar]
- Atherton RA, McComish BJ, Shepherd LD, Berry LA, Albert NW, Lockhart PJ. Whole genome sequencing of enriched chloroplast DNA using the illumina GAII platform. Plant Methods. 2010;6:22. doi: 10.1186/1746-4811-6-22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Avila ME, Gisby MF, Day A. Seamless editing of the chloroplast genome in plants. BMC Plant Biol. 2016;16:168. doi: 10.1186/s12870-016-0857-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baker P, Jackson P, Aitken K. Bayesian estimation of marker dosage in sugarcane and other autopolyploids. Theor Appl Genet. 2010;120:1653–1672. doi: 10.1007/s00122-010-1283-z. [DOI] [PubMed] [Google Scholar]
- Bansal KC, Saha D. Chloroplast genomics and genetic engineering for crop improvement. Agric Res. 2012;1:53–66. doi: 10.1007/s40003-011-0010-6. [DOI] [Google Scholar]
- Beier S, Thiel T, Münch T, et al. MISA-web: a web server for microsatellite prediction. Bioinformatics. 2017;33:2583–2585. doi: 10.1093/bioinformatics/btx198. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Besnard G, Hernández P, Khadari B, Dorado G, Savolainen V. Genomic profiling of plastid DNA variation in the mediterranean olive tree. BMC Plant Biol. 2011;11:80. doi: 10.1186/1471-2229-11-80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Boehm CR, Bock R. Recent advances and current challenges in synthetic biology of the plastid genetic system and metabolism. Plant Physiol. 2019;179:794–802. doi: 10.1104/pp.18.00767. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Daniell H. Transgene containment by maternal inheritance: effective or elusive? Proc Natl Acad Sci USA. 2007;104:6879–6880. doi: 10.1073/pnas.0702219104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Daniell H, Lee SB, Grevich J, Saski C, Quesada-Vargas T, Guda C, et al. Complete chloroplast genome sequences of Solanum bulbocastanum, Solanum lycopersicum and comparative analyses with other Solanaceae genomes. Theor Appl Genet. 2006;112:1503–1518. doi: 10.1007/s00122-006-0254-x. [DOI] [PubMed] [Google Scholar]
- Daniell H, Lin CS, Yu M, Chang WJ. Chloroplast genomes: diversity, evolution, and applications in genetic engineering. Genome Biol. 2016;17(1):134. doi: 10.1186/s13059-016-1004-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dierckxsens N, Mardulyn P, Smits G. NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res. 2017;45:18. doi: 10.1093/nar/gkw955. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dobrogojski J, Adamiec M, Lucinski R. The chloroplast genome: a review. Acta Physiol Plant. 2020;42:98. doi: 10.1007/s11738-020-03089-x. [DOI] [Google Scholar]
- Dormontt EE, van Dijk K, Bell KL, Biffin E, Breed MF, Byrne M, Caddy-Retalic S, Encinas-Viso F, Nevill PG, Shapcott A, Young JM, Waycott M, Lowe AJ. Advancing DNA barcoding and metabarcoding applications for plants requires systematic analysis of herbarium collections—an Australian p[erspective. Front Ecol Evol. 2018;6:134. doi: 10.3389/fevo.2018.00134. [DOI] [Google Scholar]
- Freudenthal JA, Pfaff S, Terhoeven N, et al. A systematic comparison of chloroplast genome assembly tools. Genome Biol. 2020;21:254. doi: 10.1186/s13059-020-02153-6.
- Fu CN, Wu CS, Ye LJ, Mo ZQ, Liu J, Chang YW, Li DZ, Chaw SM, Gao LM. Prevalence of isomeric plastomes and effectiveness of plastome super-barcodes in yews (Taxus) worldwide. Sci Rep. 2019;9(1):1–11. doi: 10.1038/s41598-019-39161-x.
- Hall TA. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser. 1999;41:95–98.
- Hollingsworth PM, Li D-Z, Van Der Bank M, Twyford AD. Telling plant species apart with DNA: from barcodes to genomes. Philos Trans R Soc B. 2016;371:20150338. doi: 10.1098/rstb.2015.0338.
- Hong Z, Wu Z, Zhao K, Yang Z, Zhang N, Guo J, Tembrock LR, Xu D. Comparative analyses of five complete chloroplast genomes from the genus Pterocarpus (Fabacaeae). Int J Mol Sci. 2020;21(11):3758. doi: 10.3390/ijms21113758.
- Ichinose M, Sugita M. RNA editing and its molecular mechanism in plant organelles. Genes. 2016;8(1):5. doi: 10.3390/genes8010005.
- Jansen RK, Saski C, Lee SB, Hansen AK, Daniell H. Complete plastid genome sequences of three rosids (Castanea, Prunus, Theobroma): evidence for at least two independent transfers of rpl22 to the nucleus. Mol Biol Evol. 2011;28:835–847. doi: 10.1093/molbev/msq261.
- Jiao L, Lu Y, He T, Li J, Yin Y. A strategy for developing high-resolution DNA barcodes for species discrimination of wood specimens using the complete chloroplast genome of three Pterocarpus species. Planta. 2019;250(1):95–104. doi: 10.1007/s00425-019-03150-1.
- Jin S, Daniell H. The engineered chloroplast genome just got smarter. Trends Plant Sci. 2015;20:622–640. doi: 10.1016/j.tplants.2015.07.004.
- Jin JJ, Yu WB, Yang JB, et al. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol. 2020;21:241. doi: 10.1186/s13059-020-02154-5.
- Kumar S, Stecher G, Li M, et al. MEGA X: Molecular evolutionary genetics analysis across computing platforms. Mol Biol Evol. 2018;35:1547–1549. doi: 10.1093/molbev/msy096.
- Kurtz S, Choudhuri JV, Ohlebusch E, et al. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 2001;29:4633–4642. doi: 10.1093/nar/29.22.4633.
- Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:207–209. doi: 10.1093/bioinformatics/btp352.
- Li XW, Yang Y, Henry RJ, Rossetto M, Wang YT, Chen SL. Plant DNA barcoding: from gene to genome. Biol Rev. 2015;90:157–166. doi: 10.1111/brv.12104.
- Li S, Chang L, Zhang J. Advancing organelle genome transformation and editing for crop improvement. Plant Commun. 2021;2:100141. doi: 10.1016/j.xplc.2021.100141.
- Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS, Dubchak I. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics. 2000;16:1046–1047. doi: 10.1093/bioinformatics/16.11.1046.
- McKain, M Wilson (2017) Fast-Plast: rapid de novo assembly and finishing for whole chloroplast genomes. https://github.com/mrmckain/Fast-Plast
- Mower JP, Vickrey TL. Structural diversity among plastid genomes of land plants. In: Chaw S-M, Jansen RK, editors. Advances in botanical research. Cambridge: Academic Press; 2018. pp. 263–292.
- Niu Z, Xue Q, Zhu S, Sun J, Liu W, Ding X. The complete plastome sequences of four orchid species: insights into the evolution of the Orchidaceae and the utility of plastomic mutational hotspots. Front Plant Sci. 2017;8:715. doi: 10.3389/fpls.2017.00715.
- Oey M, Lohse M, Kreikemeyer B, Bock R. Exhaustion of the chloroplast protein synthesis capacity by massive expression of a highly stable protein antibiotic. Plant J. 2009;57:436–445. doi: 10.1111/j.1365-313X.2008.03702.x.
- Okuda K, Myouga F, Motohashi R, Shinozaki K, Shikanai T. Conserved domain structure of pentatricopeptide repeat proteins involved in chloroplast RNA editing. Proc Natl Acad Sci USA. 2007;104:8178–8183. doi: 10.1073/pnas.0700865104.
- Pertea M. The human transcriptome: an unfinished story. Genes (Basel). 2012;3(3):344–360. doi: 10.3390/genes3030344.
- Rogalski M, do Vieira NL, Fraga HP, Guerra MP. Plastid genomics in horticultural species: importance and applications for plant population genetics, evolution, and biotechnology. Front Plant Sci. 2015;6:586. doi: 10.3389/fpls.2015.00586.
- Ruf S, Bock R (2011) In vivo analysis of RNA editing in plastids. In: Aphasizhev R (ed) RNA and DNA editing. Methods in molecular biology, vol 718. Humana Press, Totowa. doi: 10.1007/978-1-61779-018-8_8.
- Shaw J, Lickey EB, Schilling EE, Small RL. Comparison of whole chloroplast genome sequences to choose noncoding regions for phylogenetic studies in angiosperms: the tortoise and the hare III. Am J Bot. 2007;94(3):275–288. doi: 10.3732/ajb.94.3.275.
- South PF, Cavanagh AP, Liu HW, Ort DR. Synthetic glycolate metabolism pathways stimulate crop growth and productivity in the field. Science. 2019;363:eaat9077. doi: 10.1126/science.aat9077.
- Teske D, Peters A, Möllers A, Fischer M. Genomic profiling: the strengths and limitations of chloroplast genome-based plant variety authentication. J Agric Food Chem. 2020;68(49):14323–14333. doi: 10.1021/acs.jafc.0c03001.
- Tillich M, Lehwark P, Pellizzer T, et al. GeSeq—versatile and accurate annotation of organelle genomes. Nucleic Acids Res. 2017;45:W6–W11. doi: 10.1093/nar/gkx391.
- Tørresen OK, Star B, Mier P, Andrade-Navarro MA, Bateman A, Jarnot P, Gruca A, Grynberg M, Kajava AV, Promponas VJ, Anisimova M, Jakobsen KS, Linke D. Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Res. 2019;47(21):10994–11006. doi: 10.1093/nar/gkz841.
- Twyford AD, Ness RW. Strategies for complete plastid genome sequencing. Mol Ecol Resour. 2017;17:858–868. doi: 10.1111/1755-0998.12626.
- Vieira Ldo LN, Faoro H, Rogalski M, Fraga HP, Cardoso RL, de Souza EM, et al. The complete chloroplast genome sequence of Podocarpus lambertii: genome structure, evolutionary aspects, gene content and SSR detection. PLoS ONE. 2014;9:e90618. doi: 10.1371/journal.pone.0090618.
- Whittall JB, Syring J, Parks M, Buenrostro J, Dick C, Liston A, et al. Finding a (pine) needle in a haystack: chloroplast genome sequence divergence in rare and widespread pines. Mol Ecol. 2010;19(Suppl 1):100–114. doi: 10.1111/j.1365-294X.2009.04474.x.
- Wicke S, Schneeweiss GM, de Pamphilis CW, Müller KF, Quandt D. The evolution of the plastid chromosome in land plants: gene content, gene order, gene function. Plant Mol Biol. 2011;76(3):273–297. doi: 10.1007/s11103-011-9762-4.
- Yu Y, Yu PC, Chang WJ, Yu K, Lin CS. Plastid transformation: how does it work? Can it be applied to crops? What can it offer? Int J Mol Sci. 2020;21(14):4854. doi: 10.3390/ijms21144854.
- Zhong X (2020) Assembly, annotation and analysis of chloroplast genomes. Doctoral thesis, The University of Western Australia
Data Availability Statement
Not applicable.



