Data in Brief. 2023 Mar 4;47:109029. doi: 10.1016/j.dib.2023.109029

Complete chloroplast genome data of Shorea macrophylla (Engkabang): Structural features, comparative and phylogenetic analysis

Ivy Yee Yen Chew a, Hung Hui Chung a, Leonard Whye Kit Lim a, Melinda Mei Lin Lau a, Han Ming Gan b,c, Boon Siong Wee a, Siong Fong Sim a
PMCID: PMC10018430  PMID: 36936629

Abstract

Shorea macrophylla belongs to the genus Shorea in the family Dipterocarpaceae. It is a woody tree that grows in the rainforests of Southeast Asia. The complete chloroplast (cp) genome sequence of S. macrophylla is reported here. The cp genome of S. macrophylla is 150,778 bp long and has a circular structure with conserved constituent regions: a large single copy region (LSC, 83,681 bp), a small single copy region (SSC, 19,813 bp) and a pair of inverted repeats, each 23,642 bp long. It has 112 unique genes, including 78 protein-coding genes, 30 tRNA genes, and four rRNA genes. The genome exhibits GC content, gene order, structure, and codon usage similar to those of previously reported chloroplast genomes from other plant species. The chloroplast genome of S. macrophylla contains 262 SSRs, the most prevalent of which is A/T, followed by AAT/ATT. Furthermore, the sequence contains 43 long repeats, almost all of which are of the forward or palindromic type. The genome structure of S. macrophylla was compared with those of closely related species from the same family, and eight mutational hotspots were discovered. The phylogenetic analysis demonstrated a close relationship between Shorea and Parashorea species, indicating that Shorea is not monophyletic. The complete chloroplast genome sequence of S. macrophylla reported in this paper will contribute to further studies in molecular identification, genetic diversity, and phylogenetic research.

Keywords: Shorea macrophylla, Dipterocarpaceae, Chloroplast genome, Phylogenetic analysis, Monophyletic


Specifications Table

Subject: Biological Sciences
Specific subject area: Omics: Chloroplast Genome
Type of data: Sequencing raw reads, assembly, Table, Figure, Graph
How data were acquired: Sequencing
Data format: Raw Reads
Description of data collection: The leaf pieces were washed in sorbitol buffer (0.35 M sorbitol, 1% PVP 40,000, 100 mM Tris-HCl pH 8, 5 mM EDTA pH 8) before being resuspended in 500 μL homogenization buffer (150 mM NaCl, 50 mM Tris-HCl pH 8, 25 mM EDTA). Approximately 100 ng of the gDNA, as quantified by the Denovix high sensitivity kit (Denovix), was fragmented to 350 bp using a Bioruptor. Library preparation was then performed using the NEB Ultra II Illumina library preparation kit. The generated library was sequenced on a NovaSEQ6000 with a read configuration of 2 × 150 bp, yielding about 1 Gb of sequencing data.
Data source location: The engkabang leaves were collected with the permission of the Sarawak Forestry Corporation (Reference Number: SFC.810-4/6/1(2022)) and were provided by a ranger of the Sarawak Forestry Corporation. The leaves were collected at Semenggoh Wildlife Center, Kuching, Sarawak, Malaysia (1.402258002376039, 110.31446195505569).
Data accessibility: All relevant data are included in this manuscript. The chloroplast genome of S. macrophylla was deposited in GenBank under accession number ON321899 (https://www.ncbi.nlm.nih.gov/nuccore/ON321899.1).

Value of Data

  • A complete chloroplast genome is important for improving our understanding of chloroplast biology and for engineering chloroplast transgenes to enhance plant agronomic features or to develop high-value agricultural or biomedical products.

  • As the distribution of Shorea macrophylla is limited to some rainforests in Southeast Asia, the availability of genomic data for Shorea spp. is very valuable for future analyses.

  • The identification and analysis of the chloroplast genome of S. macrophylla could help researchers better understand the species’ diversity and evolutionary relationships.

1. Objectives

The primary objective of this study is to sequence, assemble and annotate the complete chloroplast genome of Shorea macrophylla, which is deemed an important species for reforestation and provides canopy for many primary rainforests in Southeast Asia. An improved understanding of the chloroplast genome allows deeper insight into the species' photosynthetic capacity and may serve as a template for identifying genetic markers, furthering our knowledge of the species' distribution and evolution.

2. Data Description

Shorea macrophylla, also known as Engkabang, is a woody rainforest tree found mainly in Kalimantan and Malaysian Borneo [1]. With 196 species, Shorea is one of the largest of the 16 genera in the Dipterocarpaceae family [2]. The chloroplast genome of S. macrophylla has been submitted to GenBank at the National Center for Biotechnology Information (NCBI) under accession number ON321899. The genome was assembled using NOVOPlasty. The chloroplast genome of S. macrophylla is 150,778 bp long and has a typical circular structure with conserved constituent regions (Fig. 1).

Fig. 1.

Gene map of the S. macrophylla chloroplast genome. The genes inside and outside the outer circle are transcribed in clockwise and anticlockwise directions, respectively. Genes belonging to different functional groups are shown in different colours. The inner circle indicates the four regions of the chloroplast genome. IRa, inverted repeat region A; IRb, inverted repeat region B; LSC, large single-copy region; SSC, small single-copy region. The grey line chart shows GC content along the genome.

The chloroplast genome is organized into four regions: a large single copy (LSC) region of 83,681 bp, a small single copy (SSC) region of 19,813 bp, and a pair of inverted repeats (IRs), each 23,642 bp long, that separate the SSC from the LSC. A total of 112 unique genes were identified, including 78 protein-coding genes, 30 tRNA genes and four rRNA genes. Seventeen genes are duplicated in the inverted repeat regions of the genome (six protein-coding genes, seven tRNA genes and four rRNA genes). Among all the genes, 12 have one intron and three have two introns (two copies of rps12 and one ycf3 gene). The gene content of the S. macrophylla chloroplast genome is shown in Table 1. The content of the four bases was A (31.75%) > T (30.96%) > G (19.02%) > C (18.27%). The genome's GC content is 37.3%, which is comparable to that of other previously published chloroplast genomes [3,4]. The IR regions have a higher GC content (43.3%) than the LSC (35.2%) and SSC (31.5%) regions. The high GC content of the IR regions is due to the high GC content of the four ribosomal RNA (rRNA) genes found there.
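The figures in this paragraph can be cross-checked with simple arithmetic: the four regions should add up to the reported genome size, and the G and C percentages should add up to the reported GC content. The sketch below uses only numbers quoted in the text; the variable names are illustrative.

```python
# Region lengths (bp) as reported in the text.
LSC = 83_681
SSC = 19_813
IR = 23_642  # length of each of the two inverted repeats

genome_size = LSC + SSC + 2 * IR
print(genome_size)  # 150778

# Base composition (%) as reported in the text.
base_percent = {"A": 31.75, "T": 30.96, "G": 19.02, "C": 18.27}
gc_content = round(base_percent["G"] + base_percent["C"], 1)
print(gc_content)  # 37.3
```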

Table 1.

Gene content in the chloroplast genome of Shorea macrophylla.

Category | Gene group | Gene names
RNA genes | Ribosomal RNA genes (rRNA) | rrn4.5*, rrn5*, rrn16*, rrn23*
RNA genes | Transfer RNA genes (tRNA) | trnA-UGC+*, trnC-GCA, trnD-GUC, trnE-UUC, trnF-GAA, trnG-GCC, trnG-UCC+, trnH-GUG, trnI-CAU*, trnI-GAU+*, trnK-UUU+, trnL-CAA*, trnL-UAA+, trnL-UAG, trnM-CAU, trnN-GUU*, trnP-UGG, trnQ-UUG, trnR-ACG*, trnR-UCU, trnS-GCU, trnS-UGA, trnS-GGA, trnT-UGU, trnT-GGU, trnV-GAC*, trnV-UAC+, trnW-CCA, trnY-GUA, trnfM-CAU
Ribosomal proteins | Small ribosomal subunit | rps2, rps3, rps4, rps7*, rps8, rps11, rps12++*, rps14, rps15, rps16+, rps18, rps19
Ribosomal proteins | Large ribosomal subunit | rpl2+*, rpl14, rpl16, rpl20, rpl22, rpl23*, rpl32, rpl33, rpl36
Transcription | DNA-dependent RNA polymerase | rpoA, rpoB, rpoC1+, rpoC2
Protein-coding genes | Photosystem I | psaA, psaB, psaC, psaI, psaJ, pafI/ycf3++
Protein-coding genes | Photosystem II | psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbT, psbZ, pbf1
Protein-coding genes | Subunit of cytochrome | petA, petB+, petD+, petG, petL, petN
Protein-coding genes | Subunit of ATP synthase | atpA, atpB, atpE, atpF, atpH, atpI
Protein-coding genes | Large subunit of rubisco | rbcL
Protein-coding genes | NADH dehydrogenase | ndhA+, ndhB+*, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK
Protein-coding genes | ATP-dependent protease subunit P | clpP1
Protein-coding genes | Chloroplast envelope membrane protein | cemA
Other genes | Maturase | matK
Other genes | Subunit of acetyl-CoA carboxylase | accD
Other genes | C-type cytochrome synthesis | ccsA
Other genes | Hypothetical proteins | ycf2*, ycf4/pafII
Other genes | Component of TIC complex | ycf1

+ Gene with one intron
++ Gene with two introns
* Gene with multiple copies

Contraction and expansion of the IR regions are evolutionary events that are assumed to be the major drivers of size differences among chloroplast genomes, which makes them useful for studying the phylogeny and evolution of chloroplast genomes in early land plant lineages. We compared the IR boundaries of six Shorea species, including S. macrophylla, and two closely related Parashorea species chosen on the basis of phylogenetic analyses. Fig. 2 shows that the majority of them share a similar structure. In all eight species, the psbA, trnH and rps19 genes lie entirely within the LSC region. The rpl2 gene is fully embedded in the IR region in S. henryana, S. leprosula, S. roxburghii and P. chinensis, whereas it extends into the LSC region in S. macrophylla, S. pachyphylla, S. zeylanica and P. macrophylla. The ndhF gene of S. leprosula, S. pachyphylla, S. zeylanica and P. macrophylla lies completely within the SSC region, but in some of the other species it extends into the IR region. The IRb/SSC boundary is the most conserved, since the ycf1 gene of every species examined is located entirely in the SSC region; however, the distance of ycf1 from the boundary varies among species, ranging from 51 to 600 bp.

Fig. 2.

Comparison of the borders of the LSC, SSC and IR regions among the eight chloroplast genomes.

Codon frequencies in the cp genome of S. macrophylla were estimated from the protein-coding genes. The coding sequences of the protein-coding genes were 78,654 bp long in total and encoded 26,218 codons. AUU (Ile) was the most common codon (1,075 codons), whereas UGC (Cys) was the least common (77 codons). This may be due to their high sensitivity to physiological and environmental factors. With the exception of methionine (AUG) and tryptophan (UGG), which are each encoded by a single codon and therefore have a relative synonymous codon usage (RSCU) value of one, the usage of most codons was biased. Five codons were overrepresented (RSCU > 1.6) while the majority were underrepresented (RSCU < 0.6). A codon with an RSCU value greater than one is used more frequently than expected and is considered preferred; a codon with a value less than one is used less frequently than expected and is considered less preferred. Table 2 shows the codon usage in further detail. The analysis of codon usage in the chloroplast genome of S. macrophylla revealed a preference for T and A at the third codon position (Table 3).
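To make the RSCU definition concrete, the following minimal sketch computes RSCU as the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid. The function name rscu is illustrative; the counts for phenylalanine and isoleucine are taken from Table 2, and the results agree with the RSCU values listed there.

```python
def rscu(counts):
    """Relative synonymous codon usage for one amino acid.

    counts maps each synonymous codon to its observed count.
    RSCU = observed count / mean count over the synonymous codons.
    """
    mean = sum(counts.values()) / len(counts)
    return {codon: round(n / mean, 2) for codon, n in counts.items()}

# Counts taken from Table 2.
phe = {"UUU": 1008, "UUC": 490}
ile = {"AUU": 1075, "AUC": 465, "AUA": 672}

print(rscu(phe))  # {'UUU': 1.35, 'UUC': 0.65}
print(rscu(ile))  # {'AUU': 1.46, 'AUC': 0.63, 'AUA': 0.91}
```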

Table 2.

Codon usage of the S. macrophylla chloroplast genome.

Amino acid Codon Count RSCU tRNA
Phe UUU 1008 1.35
Phe UUC 490 0.65 trnF-GAA
Leu UUA 844 1.83 trnL-UAA
Leu UUG 572 1.24 trnL-CAA
Leu CUU 586 1.27
Leu CUC 195 0.42
Leu CUA 377 0.82 trnL-UAG
Leu CUG 194 0.42
Ser AGU 943 1.51
Ser AGC 307 0.49 trnS-GCU
Ser UCU 581 1.73
Ser UCC 314 0.94 trnS-GGA
Ser UCA 394 1.17 trnS-UGA
Ser UCG 186 0.55
Tyr UAU 774 1.60
Tyr UAC 194 0.40 trnY-GUA
STOP UAA 41 1.48
STOP UAG 27 0.98
STOP UGA 41 1.48
Cys UGU 774 1.60
Cys UGC 194 0.40 trnC-GCA
Trp UGG 27 0.98 trnW-CCA
Pro CCU 425 1.56
Pro CCC 201 0.74
Pro CCA 309 1.14 trnP-UGG
Pro CCG 153 0.56
His CAU 468 1.52
His CAC 146 0.48 trnH-GUG
Gln CAA 719 1.51 trnQ-UUG
Gln CAG 231 0.49
Arg CGU 353 1.32 trnR-ACG
Arg CGC 116 0.43
Arg CGA 374 1.40
Arg CGG 99 0.37
Arg AGA 468 1.75 trnR-UCU
Arg AGG 191 0.72
Ile AUU 1075 1.46
Ile AUC 465 0.63 trnI-GAU
Ile AUA 672 0.91
Met AUG 582 1.00 trnM-CAU
Thr ACU 497 1.56
Thr ACC 235 0.74 trnT-GGU
Thr ACA 396 1.25 trnT-UGU
Thr ACG 143 0.45
Asn AAU 943 1.51
Asn AAC 307 0.49 trnN-GUU
Lys AAA 1043 1.47 trnK-UUU
Lys AAG 377 0.53
Val GUU 516 1.46
Val GUC 182 0.52 trnV-GAC
Val GUA 497 1.41 trnV-UAC
Val GUG 214 0.61
Ala GCU 628 1.77
Ala GCC 224 0.63
Ala GCA 382 1.08 trnA-UGC
Ala GCG 184 0.52
Asp GAU 884 1.58
Asp GAC 232 0.42 trnD-GUC
Glu GAA 1051 1.46 trnE-UUC
Glu GAG 388 0.54
Gly GGU 577 1.31
Gly GGC 176 0.40 trnG-GCC
Gly GGA 697 1.58 trnG-UCC
Gly GGG 310 0.70

Table 3.

The S. macrophylla chloroplast genome features.

Regions Positions Length (bp) T/U (%) C (%) A (%) G (%) AT/U (%)
Genome 150,778 31.7 19.0 31.0 18.3 62.7
LSC 83,681 32.9 18.1 31.7 17.3 64.6
IRa 23,642 27.8 20.4 28.8 23.0 56.6
SSC 19,813 34.5 16.8 34.0 14.7 68.5
IRb 23,642 28.8 23.0 27.8 20.4 56.6
Protein coding genes 78654 31.3 17.7 30.5 20.5 61.8
1st position 26218 23.5 18.9 30.4 27.2 53.9
2nd position 26218 32.4 20.0 29.8 17.8 62.2
3rd position 26218 37.9 14.1 31.6 16.4 69.5
tRNA 2791 24.9 23.6 22.1 29.4 47.0
rRNA 9050 18.7 23.6 26.1 31.6 44.8

SSRs were detected in abundance across the chloroplast genome of S. macrophylla. Chloroplast SSRs can be used to explore evolutionary relationships and population genetics because of their high polymorphism rates and consistent repetition. Most SSRs are made up of A or T units, contributing to the AT richness of the chloroplast genome. The chloroplast genome of S. macrophylla contained 262 SSRs, and their distribution was studied. The most common SSR was A/T (61.83%), followed by AAT/ATT (11.45%) and AAG/CTT (9.54%) (Table 4). Furthermore, the mononucleotide and dinucleotide SSRs are composed predominantly of A/T units. The details of the SSR annotations are available as a supplementary document (Supplementary table 1).

Table 4.

The frequency of identified microsatellite motifs in S. macrophylla chloroplast genome.

Repeats Number of SSRs identified
A/T 162
C/G 5
AT/AT 18
AAC/GTT 3
AAG/CTT 25
AAT/ATT 30
ACT/AGT 1
AGC/CTG 3
ATC/ATG 2
AAAT/ATTT 3
AATC/ATTG 1
AATG/ATTC 2
ATCC/ATGG 1
AAAAT/ATTTT 1
AATAT/ATATT 5
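The percentages quoted in the text follow directly from the counts in Table 4. The short sketch below recomputes the total, groups the SSRs by repeat-unit length and derives the A/T share. The dictionary holds the table values; the helper name percent is illustrative.

```python
# SSR counts per motif, from Table 4.
ssr_counts = {
    "A/T": 162, "C/G": 5, "AT/AT": 18,
    "AAC/GTT": 3, "AAG/CTT": 25, "AAT/ATT": 30,
    "ACT/AGT": 1, "AGC/CTG": 3, "ATC/ATG": 2,
    "AAAT/ATTT": 3, "AATC/ATTG": 1, "AATG/ATTC": 2, "ATCC/ATGG": 1,
    "AAAAT/ATTTT": 1, "AATAT/ATATT": 5,
}

total = sum(ssr_counts.values())

# Group by repeat-unit length (1 = mononucleotide, 2 = dinucleotide, ...).
by_unit_length = {}
for motif, n in ssr_counts.items():
    unit = len(motif.split("/")[0])
    by_unit_length[unit] = by_unit_length.get(unit, 0) + n

def percent(motif):
    """Share of all SSRs contributed by one motif, in percent."""
    return round(100 * ssr_counts[motif] / total, 2)

print(total)           # 262
print(by_unit_length)  # {1: 167, 2: 18, 3: 64, 4: 7, 5: 6}
print(percent("A/T"))  # 61.83
```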

Long repeat sequences fall into four types: forward (F), palindromic (P), reverse (R) and complement (C). Only repeats of at least 30 bp were considered. A total of 43 long repeat sequences were discovered in the chloroplast genome of S. macrophylla. Most of them are between 30 and 56 bp long, and almost all are F- or P-type long repeats, with only two R-type and one C-type long repeats.

Furthermore, the DnaSP sliding window analysis revealed numerous regions with highly variable nucleotide sequences among the complete chloroplast genomes of the six Shorea species and two Parashorea species. The IR regions have substantially less nucleotide variability than the SSC and LSC regions. The average nucleotide diversity (Pi) value was 0.01297. Seven mutational hotspots with remarkably high Pi values (>0.04) were found in the LSC region: ycf2, psbI-trnG, ndhA-ndhI, psbZ-trnG, rps12-ndhB, rpl33-rpoA and rpoA-psbB. No mutational hotspots were observed in the SSC region. The rpoC1 gene, on the other hand, is located in the IR regions and also has a high Pi value (>0.04). These regions approximate species-level nucleotide substitutions, allowing researchers to investigate and develop molecular markers that are critical for plant taxonomy and identification. Fig. 3 depicts the nucleotide diversity (Pi) graph.

Fig. 3.

Nucleotide diversity analysis of the whole chloroplast genome. Window length: 600 bp; Step size: 200 bp. X-axis: midpoint positions of a window; Y-axis: nucleotide diversity between S. macrophylla and five other Shorea and two Parashorea species.

PREPACT was used to predict RNA editing sites in the chloroplast genome. The first nucleotide of the first codon was used in all analyses. According to the findings, serine and leucine were the amino acids most frequently involved in conversions at the edited codon sites. With Gossypium hirsutum as a reference, the algorithm predicted 178 editing sites in the genome of S. macrophylla among 83 protein-coding genes, but no editing sites in the tRNA or rRNA genes. Table 5 lists the details of the RNA editing sites. The ycf1 gene has the most editing sites (20 sites), followed by rpoC2 with 13 editing sites, and ycf2 and matK, which both have ten editing sites. There are 50 genes with 1 to 8 editing sites. The remaining genes (psbA, psbK, psbI, atpH, atpI, petN, psbM, psbD, psbZ, ycf3, ndhJ, atpE, atpB, rbcL, psbL, psbE, petG, psaJ, rps18, rps12, clpP, petD, rps11, rpl14, rpl16, rps19, rpl2, rps15, ndhI, ndhE, psaC, rpl32) do not have predicted RNA editing sites in the first nucleotide of the first codon. Sixty (33.71%) RNA editing events were predicted to occur at the first position of the codon, 118 (66.29%) at the second position, and none at the third position. With 35 sites (19.66%), the conversion of serine (polar) to leucine (non-polar) was the most frequently observed amino acid change caused by RNA editing. This was followed by alanine to valine (25 sites, 14.04%) and serine to phenylalanine (24 sites, 13.48%). With Gossypium hirsutum as the reference genome, 50 genes were predicted to have RNA editing sites.

Table 5.

Conversion of amino acids at RNA editing sites.

Conversion of amino acid (Single-Letter Amino Acid Code) Conversion of amino acid Number of RNA editing sites
A -> V Alanine -> Valine 25
H -> Y Histidine -> Tyrosine 10
L -> F Leucine -> Phenylalanine 21
P -> F Proline -> Phenylalanine 3
P -> L Proline -> Leucine 15
P -> S Proline -> Serine 20
S -> F Serine -> Phenylalanine 24
S -> L Serine -> Leucine 35
R -> C Arginine -> Cysteine 4
R -> W Arginine -> Tryptophan 3
T -> I Threonine -> Isoleucine 13
T -> M Threonine -> Methionine 5
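The conversions in Table 5 can be related to codon changes with a small sketch. It assumes C-to-U editing, which is the kind of change consistent with the conversions listed, and the standard genetic code; both are assumptions made here for illustration and are not stated in the text. The function edit_c_to_u is hypothetical and does not reproduce how PREPACT predicts sites.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in UCAG order ('*' marks a stop codon).
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def edit_c_to_u(codon, position):
    """Return (before, after) amino acids for a C-to-U edit at a 0-based codon position."""
    if codon[position] != "C":
        raise ValueError(f"no C at position {position} of {codon}")
    edited = codon[:position] + "U" + codon[position + 1:]
    return CODON_TABLE[codon], CODON_TABLE[edited]

print(edit_c_to_u("UCA", 1))  # ('S', 'L')  serine -> leucine
print(edit_c_to_u("GCU", 1))  # ('A', 'V')  alanine -> valine
print(edit_c_to_u("CAU", 0))  # ('H', 'Y')  histidine -> tyrosine
```

Under these assumptions, an edit at the first codon position gives conversions such as H -> Y or P -> S, and an edit at the second position gives conversions such as S -> L or A -> V.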

Using the annotation of Shorea macrophylla as a reference, the mVISTA program was used to align the chloroplast genomes and visualise the pattern of sequence identity between S. macrophylla and closely related species, namely S. henryana, S. leprosula, S. pachyphylla, S. roxburghii, S. zeylanica, P. chinensis and P. macrophylla. Overall, coding regions were more conserved than non-coding regions, with most variation detected in non-coding sequences. The sequences of exons and UTRs were nearly identical across the eight taxa, which share a similar structure and gene order (Fig. 4). In addition, as shown in chloroplast genome studies of other taxa, the LSC and SSC regions diverged more than the IR regions [5], [6], [7]. In the LSC, most exons are conserved, whereas the intergenic spacers show a high degree of divergence (>50%, as indicated by the white space seen across the alignments).

Fig. 4.

Comparative analysis of the chloroplast genome of S. macrophylla with those of other Shorea and Parashorea species, aligned with the Shuffle-LAGAN program. The percentage of identity is shown on the vertical axis, which ranges from 50 to 100%, while the horizontal axis represents the position in the chloroplast genome. Each arrow indicates an annotated gene in the reference genome and the direction of its transcription. Genomic regions are colour-coded as exons (purple), UTRs (neon blue) and CNS (pink).

To investigate the evolutionary position of S. macrophylla within the Dipterocarpaceae family, protein-coding genes from 24 other Dipterocarpaceae species and two species from outgroup families (Gossypium thurberi and Bixa orellana) were chosen for phylogenetic analysis. The phylogenetic tree is shown in Fig. 5. In the tree, S. macrophylla was separated from its wild relatives. The outgroups were segregated from the Dipterocarpaceae species, which constituted a single clade. Hopea species made up the first subclade and Neobalanocarpus species the second. Shorea and Parashorea species made up the third and fourth subclades, respectively, while the fifth and sixth subclades consisted of Dipterocarpus and Vatica species. Among the 25 Dipterocarpaceae species, Dryobalanops aromatica was the most distant. Except for Shorea, which formed a clade with Parashorea species, each genus clustered into its own subclade. The complete chloroplast genome of S. macrophylla provides a resource for future evolutionary studies of this species, as well as a reference to support tree species conservation and the improvement of desirable traits in S. macrophylla breeding.

Fig. 5.

Phylogenetic relationships of S. macrophylla with other species of the Dipterocarpaceae family and two outgroups, based on their protein-coding genes. The bootstrap value based on 1000 replicates is shown at each node. The Shorea subclade is drawn in red and the Parashorea subclade in green. The result shows that the genus Shorea is not monophyletic.

3. Experimental Design, Materials and Methods

3.1. Plant material, DNA extraction, and sequencing

The DNA was extracted from 50 mg of Engkabang leaves. All experimental research and field studies on plants in this study, including the collection of plant material, complied with the relevant ethical standards and legislation. The leaf pieces were washed in sorbitol buffer (0.35 M sorbitol, 1% PVP 40,000, 100 mM Tris-HCl pH 8, 5 mM EDTA pH 8) before being resuspended in 500 μL homogenization buffer (150 mM NaCl, 50 mM Tris-HCl pH 8, 25 mM EDTA) [8]. A 1.5 mL microcentrifuge tube containing 2 mL steel beads was filled with 500 μL of the resuspended material. Homogenization took 30 minutes on a vortexer spinning at 4,000 rpm. After that, ⅔ volume of saturated NaCl was added to the homogenate, followed by a 5-minute incubation on ice to precipitate proteins [9]. The homogenate was then centrifuged for 10 minutes at 10,000 x g. To precipitate the DNA, the supernatant was combined with one volume of isopropanol and centrifuged at 10,000 x g for 10 minutes. The DNA pellet was washed twice with 75% ethanol and then resuspended in 0.1X TE buffer (1 mM Tris-HCl pH 8, 0.1 mM EDTA). Approximately 100 ng of the gDNA, as quantified by the Denovix high sensitivity kit (Denovix), was fragmented to 350 bp using a Bioruptor. Library preparation was then performed using the NEB Ultra II Illumina library preparation kit. The generated library was sequenced on a NovaSEQ6000 with a read configuration of 2 × 150 bp, yielding about 1 Gb of sequencing data. In total, 2 × 5.83 million paired-end reads were generated, amounting to 1.7 Gb. The FASTQ files contained adapter-free raw data that had not been quality trimmed. The filtered reads were assembled into a complete chloroplast genome using NOVOPlasty with the Gossypioides kirkii rbcL sequence as the seed sequence. The assembled chloroplast genome was then mapped to the complete chloroplast genome of Shorea pachyphylla.

3.2. Genome annotation and codon usage analysis

Using the complete chloroplast genome sequences of Shorea pachyphylla and Shorea zeylanica as references, GeSeq was used to annotate the chloroplast genome of Shorea macrophylla [10]. GeSeq was used to run BLAST searches against the S. pachyphylla NCBI RefSeq data. A comparative analysis with S. pachyphylla, S. zeylanica, S. roxburghii, S. henryana and S. leprosula was conducted to further verify the manual annotation. To ensure functionality, all protein-coding nucleotide sequences (CDS) were translated into amino acid sequences using MEGA X and checked so that premature and truncated stop codons could be fixed [11]. tRNA genes were identified with the tRNAscan-SE server [12]. Gene homologies and ontologies were verified using the Kyoto Encyclopedia of Genes and Genomes (KEGG) [13]. Organellar Genome DRAW (OGDRAW) was used to draw the chloroplast genome map of S. macrophylla and illustrate its structural characteristics [14]. MEGA 11.0 was used to examine the relative synonymous codon usage (RSCU) values, base composition and codon usage [11].

3.3. Simple sequence repeats (SSRs) and long repeat structure

The MIcroSAtellite (MISA) web tool was used to identify simple sequence repeats (SSRs) [15]. The minimum number of repeat units was set to eight for mononucleotide SSRs, five for dinucleotide SSRs, and three for trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide SSRs. Complement, palindromic, reverse and forward repeat sequences were identified and located using REPuter [16]. The minimum repeat size was set to 30 bp, with a sequence identity of over 90%.
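The thresholds above can be illustrated with a simplified regular-expression search for perfect SSRs. This is only a sketch under stated assumptions, not MISA's implementation: the function find_ssrs and the example sequence are hypothetical, compound and imperfect repeats are ignored, and a run is reported only under its shortest repeat unit.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {1: 8, 2: 5, 3: 3, 4: 3, 5: 3, 6: 3}

def find_ssrs(seq):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs in seq.

    start is 0-based. Motifs that are themselves repeats of a shorter
    unit (e.g. 'AA' or 'ATAT') are skipped so each locus is reported once.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs reducible to a shorter unit.
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(hits)

# Toy sequence: 10 x A, 6 x AT and 3 x AAG separated by short spacers.
example = "GGC" + "A" * 10 + "CG" + "AT" * 6 + "GC" + "AAG" * 3 + "C"
print(find_ssrs(example))  # [(3, 'A', 10), (15, 'AT', 6), (29, 'AAG', 3)]
```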

3.4. Inverted repeat contraction and expansion

The expansion and contraction of the IR regions (IRa and IRb) at the four junction sites (LSC/IRb, IRb/SSC, SSC/IRa and IRa/LSC) were compared between S. macrophylla and five other Shorea species and two Parashorea species from the Dipterocarpaceae family (S. henryana, S. leprosula, S. pachyphylla, S. roxburghii, S. zeylanica, P. chinensis and P. macrophylla). The junctions were verified and plotted manually.
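For orientation, the positions of the four junctions in S. macrophylla can be derived from the region lengths reported in Section 2. The sketch below assumes that coordinates start at the first base of the LSC and that the regions follow the order LSC, IRb, SSC, IRa implied by the junction names; the actual coordinates in the GenBank record may be laid out differently.

```python
from itertools import accumulate

# Region order and lengths (bp). The order follows the junction names in the
# text; starting the coordinates at the LSC is an assumption.
regions = [("LSC", 83_681), ("IRb", 23_642), ("SSC", 19_813), ("IRa", 23_642)]

ends = list(accumulate(length for _, length in regions))
names = [name for name, _ in regions]

# Each junction sits after the last base of a region; the final junction
# wraps around the circular genome back to the LSC.
junctions = {
    f"{names[i]}/{names[(i + 1) % len(names)]}": ends[i]
    for i in range(len(names))
}
print(junctions)
# {'LSC/IRb': 83681, 'IRb/SSC': 107323, 'SSC/IRa': 127136, 'IRa/LSC': 150778}
```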

3.5. Comparative and divergence analysis of chloroplast genomes of S. macrophylla and closely related species

The complete chloroplast genome of S. macrophylla was used as the reference genome for comparison with the S. henryana, S. leprosula, S. pachyphylla, S. roxburghii, S. zeylanica, P. chinensis and P. macrophylla sequences from GenBank. The mVISTA program was used to align the sequences in Shuffle-LAGAN mode [17,18]. Nucleotide variability was calculated using DnaSP software with a 200 bp step size and a 600 bp window length [19].
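The sliding-window calculation can be sketched as follows, taking nucleotide diversity (Pi) in a window as the mean proportion of differing sites over all pairs of aligned sequences. This is a simplified illustration with hypothetical function names; DnaSP's own treatment of gaps and ambiguous bases may differ, and here such columns are simply skipped pair by pair.

```python
from itertools import combinations

def pairwise_diff(a, b):
    """Proportion of differing sites between two aligned sequences, skipping gap/N columns."""
    compared = differing = 0
    for x, y in zip(a, b):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else 0.0

def sliding_pi(alignment, window=600, step=200):
    """Yield (midpoint, Pi) per window; midpoints are 0-based alignment positions."""
    length = len(alignment[0])
    pairs = list(combinations(range(len(alignment)), 2))
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in alignment]
        pi = sum(pairwise_diff(chunk[i], chunk[j]) for i, j in pairs) / len(pairs)
        yield start + window // 2, pi

# Toy alignment with a small window so the result can be checked by hand.
toy = ["ACGTACGT", "ACGTACGA", "ACGTTCGA"]
result = list(sliding_pi(toy, window=4, step=4))
print(result)  # first window has Pi = 0.0, second has Pi = 1/3
```

With real data, the defaults window=600 and step=200 correspond to the settings given in the text.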

3.6. RNA editing analysis

The PREPACT online tool was used to predict potential RNA editing sites in the CDS of S. macrophylla, using the default search options and Gossypium hirsutum as the reference genome [20].

3.7. Phylogenetic analysis

All protein-coding genes in the chloroplast genome of S. macrophylla, together with those of 24 other species in the Dipterocarpaceae family (Dipterocarpus alatus, Dipterocarpus gracilis, Dipterocarpus intricatus, Dipterocarpus turbinatus, Dryobalanops aromatica, Hopea chinensis, Hopea dryobalanoides, Hopea hainanensis, Hopea mollissima, Hopea odorata, Hopea reticulata, Neobalanocarpus heimii, Parashorea chinensis, Parashorea macrophylla, Shorea henryana, Shorea leprosula, Shorea pachyphylla, Shorea roxburghii, Shorea zeylanica, Vatica guangxiensis, Vatica mangachapoi, Vatica odorata, Vatica rassak, Vatica xishuangbannaensis) and two outgroup species (Gossypium thurberi and Bixa orellana), were selected as input for phylogenetic analysis. The sequences were first aligned using Clustal W in MEGA 11.0 [11]. A model test was then run in MEGA 11.0 to select the best model for maximum likelihood tree construction [11]. The maximum likelihood tree was constructed in MEGA 11.0 using the GTR + I + G nucleotide substitution model and 1000 bootstrap replications [11].
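The bootstrap step can be illustrated with a minimal sketch of how one pseudo-replicate is drawn: alignment columns are resampled with replacement until the replicate has the same length as the original alignment. This is a generic illustration of the resampling idea, not MEGA's implementation; the function name and toy alignment are hypothetical, and tree building on each replicate is left out.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap pseudo-replicate)."""
    n_sites = len(alignment[0])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in cols) for seq in alignment]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy = ["ACGTAC", "ACGTAA", "TCGTAA"]
replicates = [bootstrap_alignment(toy, rng) for _ in range(1000)]
print(len(replicates), len(replicates[0][0]))  # 1000 6
```

A tree would then be inferred from each replicate, and the bootstrap value of a node is the percentage of replicate trees that contain the corresponding clade.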

Ethics Statement

All experimental research and field studies on plants in this study, including the collection of plant material, complied with the relevant guidelines and legislation. The engkabang leaves were collected with the permission of the Sarawak Forestry Corporation (Reference Number: SFC.810-4/6/1(2022)) and were provided by a ranger of the Sarawak Forestry Corporation.

Availability of Data and Materials

All relevant data are included in this manuscript. The chloroplast genome of S. macrophylla was deposited in GenBank under accession number ON321899 (https://www.ncbi.nlm.nih.gov/nuccore/ON321899.1).

Credit Author Statement

Ivy Yee Yen Chew: Collected and analysed all data and wrote the manuscript; Hung Hui Chung: Supervised the experimentation process, edited the manuscript and provided funding for this research; Leonard Whye Kit Lim, Melinda Mei Lin Lau and Han Ming Gan: Assisted with the interpretation of results, and reviewed and edited the manuscript; Boon Siong Wee and Siong Fong Sim: Reviewed and edited the manuscript and supplied funding for this research.

Declaration of Competing Interest

The authors declare that they have no competing interests.

Acknowledgements

The authors acknowledge the Tun Zaidi Chair for fully funding this research under grant number F07/TZC/2164/2021, awarded to Dr. Chung Hung Hui.

Footnotes

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.dib.2023.109029.

Appendix. Supplementary materials

mmc1.docx (36.2KB, docx)

References

  • 1. Chai E.O.K. University of Edinburgh; Malaysia: 1998. Aspects of a Tree Improvement Programme for Shorea Macrophylla (de Vriese) Ashton in Sarawak.
  • 2. Ashton P.S. Dipterocarpaceae. Flora Malesiana. 1982;9:237–552.
  • 3. Yu Y., Han Y., Peng Y., Tian Z., Zeng P., Zong H., Zhou T., Cai J. Comparative and phylogenetic analyses of eleven complete chloroplast genomes of Dipterocarpoideae. Chinese Med. (United Kingdom) 2021;16:1–15. doi: 10.1186/s13020-021-00538-8.
  • 4. Alzahrani D., Albokhari E., Abba A., Yaradua S. The first complete chloroplast genome sequences in Resedaceae: Genome structure and comparative analysis. Sci. Prog. 2021;104:1–18. doi: 10.1177/00368504211059973.
  • 5. Qian J., Song J., Gao H., Zhu Y., Xu J., Pang X., Yao H., Sun C., Li X., Li C., Liu J., Xu H., Chen S. The Complete Chloroplast Genome Sequence of the Medicinal Plant Salvia miltiorrhiza. PLoS One. 2013;8 doi: 10.1371/journal.pone.0057607.
  • 6. Lim L.W.K., Chung H.H., Hussain H. Complete chloroplast genome sequencing of sago palm (Metroxylon sagu Rottb.): Molecular structures, comparative analysis and evolutionary significance. Gene Reports. 2020;19 doi: 10.1016/j.genrep.2020.100662.
  • 7. Gu L., Su T., An M.-T., Hu G.-X. The Complete Chloroplast Genome of the Vulnerable Oreocharis esquirolii (Gesneriaceae): Structural Features, Comparative and Phylogenetic Analysis. Plants. 2020;9 doi: 10.3390/plants9121692.
  • 8. Inglis P.W., Marilia de Castro R.P., Resende L.V., Grattapaglia D. Fast and inexpensive protocols for consistent extraction of high quality DNA and RNA from challenging plant and fungal samples for high-throughput SNP genotyping and sequencing applications. PLoS One. 2018;13:1–14. doi: 10.1371/journal.pone.0206085.
  • 9. Miller S.A., Dykes D.D., Polesky H.F. A simple salting out procedure for extracting DNA from human nucleated cells. Nucleic Acids Res. 1988;16:1215. doi: 10.1093/nar/16.3.1215.
  • 10. Tillich M., Lehwark P., Pellizzer T., Ulbricht-Jones E.S., Fischer A., Bock R., Greiner S. GeSeq - Versatile and accurate annotation of organelle genomes. Nucleic Acids Res. 2017;45:W6–W11. doi: 10.1093/nar/gkx391.
  • 11. Tamura K., Stecher G., Kumar S. MEGA11: Molecular Evolutionary Genetics Analysis Version 11. Mol. Biol. Evol. 2021;38:3022–3027. doi: 10.1093/molbev/msab120.
  • 12. Lowe T.M., Chan P.P. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes. Nucleic Acids Res. 2016;44:W54–W57. doi: 10.1093/nar/gkw413.
  • 13. Kanehisa M., Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
  • 14. Greiner S., Lehwark P., Bock R. OrganellarGenomeDRAW (OGDRAW) version 1.3.1: Expanded toolkit for the graphical visualization of organellar genomes. Nucleic Acids Res. 2019;47:W59–W64. doi: 10.1093/nar/gkz238.
  • 15. Beier S., Thiel T., Münch T., Scholz U., Mascher M. MISA-web: A web server for microsatellite prediction. Bioinformatics. 2017;33:2583–2585. doi: 10.1093/bioinformatics/btx198.
  • 16. Kurtz S., Choudhuri J.V., Ohlebusch E., Schleiermacher C., Stoye J., Giegerich R. REPuter: The manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 2001;29:4633–4642. doi: 10.1093/nar/29.22.4633.
  • 17. Frazer K.A., Pachter L., Poliakov A., Rubin E.M., Dubchak I. VISTA: Computational tools for comparative genomics. Nucleic Acids Res. 2004;32:273–279. doi: 10.1093/nar/gkh458.
  • 18. Brudno M., Malde S., Poliakov A., Do C.B., Couronne O., Dubchak I., Batzoglou S. Glocal alignment: Finding rearrangements during alignment. Bioinformatics. 2003;19 doi: 10.1093/bioinformatics/btg1005.
  • 19. Rozas J., Ferrer-Mata A., Sanchez-DelBarrio J.C., Guirao-Rico S., Librado P., Ramos-Onsins S.E., Sanchez-Gracia A. DnaSP 6: DNA sequence polymorphism analysis of large data sets. Mol. Biol. Evol. 2017;34:3299–3302. doi: 10.1093/molbev/msx248.
  • 20. Lenz H., Hein A., Knoop V. Plant organelle RNA editing and its specificity factors: Enhancements of analyses and new database features in PREPACT 3.0. BMC Bioinf. 2018;19:1–18. doi: 10.1186/s12859-018-2244-9.


Articles from Data in Brief are provided here courtesy of Elsevier