Abstract
More than 600 species in over 40 genera have been identified in the family Theaceae worldwide. Accurate identification of Theaceae plants helps maintain order in the market and plays a vital role in the sustainable utilization of germplasm resources. DNA barcoding, one of the most promising species identification technologies at present, enables rapid, accurate and repeatable discrimination of species. In this study, matK + ndhF + ycf1 was identified as the optimal combined candidate gene sequence for DNA barcodes by analyzing the genetic information of four single chloroplast DNA sequences (matK, rbcL, ndhF and ycf1) and six combined gene sequences. Subsequently, phylogenetic analysis based on genetic distance was performed to study the phylogenetic relationships of Theaceae plants and to evaluate the species identification accuracy of matK + ndhF + ycf1. Lastly, species-specific DNA barcodes were designed by searching for variable sites (one type of single nucleotide polymorphism site) for the accurate identification of Camellia amplexicaulis, Franklinia alatamaha, Gordonia brandegeei and Stewartia micrantha. The previous methods of screening and testing candidate gene sequences were optimized, and the process of making visual DNA barcodes was standardized. Besides, DNA barcoding technology increased the accuracy of species identification, and DNA barcoding was analyzed in accordance with theories of population genetics (e.g., the neutral theory of molecular evolution). The results of the study will lay a basis for the identification and protection of Theaceae species and germplasm resources.
Supplementary Information
The online version contains supplementary material available at 10.1007/s12298-022-01175-7.
Keywords: DNA barcoding, Theaceae plants, Chloroplast genes, Species identification
Introduction
There are nearly 40 genera and 600 species of Theaceae extensively grown in tropical and subtropical Asia and America (Zhang et al. 2020). Plant materials from Theaceae are employed for producing cosmetics (Camellia sinensis) (Koch et al. 2019), dietary supplements (Camellia) (Wang et al. 2016) and decorations (Camellia japonica) (Zhang et al. 2020). In addition, Theaceae plants are important biological resource reservoirs. Some species preserve quality germplasm resources which arise from interspecific breeding compatibility and ability to reproduce asexually (Wang et al. 2016). Over the past few decades, increasing counterfeit or adulterated Theaceae-related products and a wide variety of destructive activities by humans have threatened the diversity of Theaceae plant species and human health (Wang and Zhang 2018; Chen and Sun 2018). Accordingly, an efficient and stable method of identifying Theaceae plants is urgently required to standardize the market action and facilitate the protection of species diversity (Meng et al. 2019).
Most Theaceae plants are identified using a combination of conventional morphology and molecular technology (PCR) (Lu et al. 2012) or stoichiometric methods (Liu et al. 2020). These methods perform well for most species, but expensive materials and long processing times significantly decrease the efficiency of identification. DNA barcoding technology, initially reported by Paul Hebert in 2003 (Hebert et al. 2003), employs short sequence segments derived from ribosomal, mitochondrial or chloroplast DNA (cpDNA) sequences for the rapid, accurate and automated identification of species (Hebert et al. 2003; Stoeckle et al. 2011). Compared with biological identification technologies based on conventional taxonomy, DNA barcoding is capable of mining genetic information in depth and obtaining more species-specific molecular markers (Sheth and Thaker 2017). DNA barcodes are ultimately made available online, where they help researchers investigate the phylogenetic and evolutionary relationships of different populations (Kroymann et al. 2011), the operation of food webs (García-Robledo et al. 2013) and measures for global biodiversity investigation and protection (Fontaine et al. 2012). Mitochondrial DNA sequences have generally been employed for the identification of animal species (Gonzalez et al. 2009). Some ribosomal DNA segments (e.g., the ITS region in nuclear ribosomal DNA) are present in paralogous gene copies within the respective cell, and divergent copies may result in messy sequences in some groups (Hollingsworth 2011). Thus, researchers are inclined to screen appropriate DNA barcode regions from cpDNA sequences by evaluating repeatability and sequence quality (Kroymann et al. 2011). The matK gene sequence is characterized by high operational stability and repeatability (Newmaster et al. 2009).
rbcL gene sequence was initially employed as a single gene sequence for species identification with an accuracy of 90% (Hebert et al. 2003). However, as revealed by the results of recent studies, the species identification ability of rbcL is low (Hebert et al. 2003; Hollingsworth 2011). The experimental results (Zhang et al. 2015; Amar 2020) suggested that ndhF and ycf1 gene sequences have high mutation rates and low mutation saturation, thus becoming suitable for phylogenetic analysis and construction of DNA barcodes.
Existing cross-sectional studies of Theaceae species identification have primarily investigated morphological traits and ecological distribution (Zhang et al. 2020). Recently, researchers have begun constructing barcodes within the genera Camellia (Li et al. 2019) and Schima (Yu et al. 2017). However, to the best of our knowledge, few studies have determined the DNA barcodes and phylogenetic relationships of this important plant family. Furthermore, several attempts to construct barcodes have ignored tests of identification accuracy, and their results have rarely been applied in practice (Chaveerach et al. 2016).
This study aimed to construct DNA barcodes of Theaceae species and test the identification accuracy of barcodes. Some sequences on online databases are incomplete and have indels due to human impact. Genetic features and diversity of Theaceae species provide exact reference data to screen suitable gene sequences and confirm the long-term identification potential in the above sequences (Gong et al. 2021). Besides, the results of genetic distance and phylogenetic analysis confirmed that the screened barcodes accurately identify most Theaceae species (Chaveerach et al. 2016). Based on the above analysis, the above barcodes obtained will be more reliable, thus laying a foundation for the identification and protection of Theaceae species and germplasm resources.
In this study, a new methodology was proposed to construct specific cpDNA barcodes and increase the accuracy of species identification in barcodes. Four single gene sequences (matK, rbcL, ndhF and ycf1) and six combined sequences (matK + rbcL, matK + ndhF, matK + ycf1, rbcL + ndhF, rbcL + ycf1, ndhF + ycf1) were specially processed to obtain gene sequence databases based on previous investigations (Bhargava and Sharma 2013; Yu et al. 2016). Subsequently, the combined gene sequences were screened based on the results of major genetic features and the genetic diversity analysis. Lastly, the combined gene sequences with more accurate identification levels were investigated to test accuracy and reliability for obtaining specific DNA marker segments through the genetic distance and phylogenetic analysis.
Materials and methods
Multiple sequence alignments
All complete chloroplast single gene sequences (e.g., matK, rbcL, ndhF and ycf1 of Theaceae plants) were collected from the NCBI databases (Yu et al. 2016) (https://www.ncbi.nlm.nih.gov/) to establish species-specific gene sequence databases using MEGA11.0 software (https://www.megasoftware.net/) (Tamura et al. 2021). Synonymies in Theaceae plants were combined into a single species name when using barcodes for species identification (Zhang et al. 2021). Accordingly, The Plant List (TPL) (www.theplantlist.org) and Flora of China (FOC) (http://www.iplant.cn/foc) in English version were referenced to modify species names of Theaceae plants.
In accordance with four single gene sequences, six combined gene sequences (matK + rbcL, matK + ndhF, matK + ycf1, rbcL + ndhF, rbcL + ycf1, ndhF + ycf1) were obtained through directed assembly. Multiple sequence alignment was performed using Muscle in MEGA11.0 (Tamura et al. 2021). For large sequence databases and gene sequences with interior vacancies (e.g., matK), Muscle operated faster and more accurately (Edgar 2004).
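The combination step can be illustrated with a minimal Python sketch. It assumes that "directed assembly" means joining the aligned loci of the same sample end to end in a fixed order and keeping only samples present in every locus; the function name `concatenate_loci` and the toy sequences are hypothetical and are not taken from the study.

```python
from typing import Dict, List


def concatenate_loci(alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Join per-locus aligned sequences for samples found in every locus.

    Each element of `alignments` maps a sample name to its aligned sequence
    for one locus. The loci are joined in the order given.
    """
    shared = set(alignments[0])
    for aln in alignments[1:]:
        shared &= set(aln)  # keep only samples sequenced for all loci
    return {
        name: "".join(aln[name] for aln in alignments)
        for name in sorted(shared)
    }


# Toy aligned fragments (hypothetical data, not real Theaceae sequences).
matk = {"sp_A": "ATG-CA", "sp_B": "ATGACA", "sp_C": "ATGGCA"}
ycf1 = {"sp_A": "TTAGG", "sp_B": "TTCGG"}

combined = concatenate_loci([matk, ycf1])
for name, seq in combined.items():
    print(name, seq)
```

Under this assumption, a sample lacking any one locus (here `sp_C`) drops out of the combined set, which would be consistent with the combined datasets in Table 1 covering fewer species than the single-gene datasets.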
Data analysis
Major genetic features were summarized with MEGA11.0 and TBtools v1.09852 software (Chen et al. 2020a) for reference, including the proportions of bases, the number of C (Conserved), V (Variable), Pi (Parsimony informative) and S (Singleton) sites, as well as the base substitution saturation effect of all gene sequences after alignment. Genetic diversity analysis, including nucleotide diversity and neutrality tests, was performed with DnaSP v6 (http://www.ub.edu/dnasp/index_v5.html) (Fu 1997; Chen et al. 2020b). Genetic parameters in nucleotide diversity tests consisted of nucleotide substitution rate (θ) and haplotype diversity (Hd). Neutrality tests involved Tajima’s D tests (D) and Fu’s Fs tests (Fs), both computed from the relationship between segregating sites (S) and nucleotide diversity (π) (Chen et al. 2020b).
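As an illustration of the site categories, the following Python sketch classifies alignment columns using the commonly cited definitions: a variable site has at least two base types; it is parsimony informative when at least two of those types each occur in two or more sequences, and a singleton otherwise. Ignoring gaps and ambiguity codes is an assumption of this sketch, so the counts may differ from those reported by MEGA on real data.

```python
from collections import Counter


def classify_sites(seqs):
    """Count conserved (C), variable (V), parsimony-informative (Pi)
    and singleton (S) columns in a list of equal-length aligned sequences."""
    counts = {"C": 0, "V": 0, "Pi": 0, "S": 0}
    for column in zip(*seqs):
        # Assumption: gaps and ambiguity codes are ignored.
        bases = Counter(b.upper() for b in column if b.upper() in "ACGT")
        if len(bases) <= 1:
            counts["C"] += 1
            continue
        counts["V"] += 1
        repeated = sum(1 for n in bases.values() if n >= 2)
        if repeated >= 2:
            counts["Pi"] += 1  # two or more base types seen at least twice
        else:
            counts["S"] += 1   # variation carried by single sequences only
    return counts


# Toy alignment (hypothetical data).
toy_alignment = ["ACGTA", "ACGTA", "ATGCA", "ATGGA"]
site_counts = classify_sites(toy_alignment)
print(site_counts)
```

With these definitions every variable site is either parsimony informative or a singleton, so V equals Pi plus S.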
Genetic distance has been found as a vital standard to screen candidate gene sequences for making barcodes (Hall 2013). Thus, the difference between the interspecific, intraspecific and overall average distances of all gene sequences was compared using MEGA11.0. Phylogenetic trees constructed using the NJ, ML and Bayes methods according to the K2P model were evaluated based on MEGA 11.0 and MrBayes v3.2.6 (http://nbisweden.github.io/MrBayes/download.html) (Hebert et al. 2003; Ronquist et al. 2012). In addition, the bootstrap value was set above 75% to ensure the reliability of branches (Tamura et al. 2021). To reduce the complexity of trees, they were finally modified in iTOL (https://itol.embl.de/) (Letunic and Bork 2021) to clearly compare genetic characteristics.
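For reference, the K2P (Kimura two-parameter) distance between two aligned sequences is d = −½ ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of transitional and transversional differences. The sketch below is a plain Python illustration of that formula; skipping sites with gaps or ambiguity codes is an assumption and may not match the gap treatment selected in MEGA.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy pair with one transition and one transversion in ten sites.
d_toy = k2p_distance("AAAAAAAAAA", "GAAAAAAAAC")
print(round(d_toy, 4))
```

The corrected distance exceeds the raw proportion of differing sites because the model accounts for multiple substitutions at the same site.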
Construction of barcodes based on SNP sites
A multi-segment specific barcode was combined from 150-bp candidate gene sequence segments starting with V sites. Besides, the above appropriate segments were blasted into the NCBI databases (https://blast.ncbi.nlm.nih.gov/Blast.cgi) in accordance with the standards of the NCBI Blast output.
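The segment selection can be sketched as follows. This is a minimal Python illustration that assumes each candidate segment is a fixed-length window of the target species' aligned sequence beginning at a variable column, and that windows containing gaps or running past the end of the alignment are discarded. The source does not state how segments were filtered or combined, so the helper names and these rules are hypothetical.

```python
def variable_positions(seqs):
    """Return 0-based indices of alignment columns with more than one base type."""
    positions = []
    for index, column in enumerate(zip(*seqs)):
        bases = {b.upper() for b in column if b.upper() in "ACGT"}
        if len(bases) > 1:
            positions.append(index)
    return positions


def candidate_segments(target, seqs, length=150):
    """List (start, segment) windows of `target` that begin at a variable site."""
    segments = []
    for start in variable_positions(seqs):
        segment = target[start:start + length]
        # Assumption: keep only full-length, gap-free windows.
        if len(segment) == length and "-" not in segment:
            segments.append((start, segment))
    return segments


# Toy alignment (hypothetical data); a short window length keeps it readable.
alignment = ["ACGTACGTAC", "ACGTACGTAC", "ACTTACGAAC"]
windows = candidate_segments(alignment[2], alignment, length=4)
print(windows)
```

Each retained window would then be submitted to BLAST as described above; the default `length=150` mirrors the segment size stated in the text.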
Results
Basic information of gene sequences
In this study, 160 matK sequences (12 genera, 101 species), 133 rbcL sequences (11 genera, 96 species), 156 ndhF sequences (11 genera, 102 species), and 125 ycf1 sequences (10 genera, 92 species) of Theaceae species were retrieved from the NCBI databases (Table 1). After sequence alignment and editing, the average lengths of matK, rbcL, ndhF and ycf1 were 1533 bp, 1428 bp, 2314 bp and 5817 bp, respectively, and those of the combined gene sequences, including matK + rbcL sequences (11 genera, 96 species), matK + ndhF sequences (11 genera, 88 species), matK + ycf1 sequences (10 genera, 87 species), rbcL + ndhF sequences (11 genera, 89 species), rbcL + ycf1 sequences (10 genera, 88 species) and ndhF + ycf1 sequences (10 genera, 88 species), were 2691 bp, 3846 bp, 7354 bp, 3741 bp, 7239 bp and 8110 bp, respectively. Tables S1–S10 list the supplementary information, including accession numbers, version numbers, accepted names, synonyms and definitions of the above species-specific gene sequences in the NCBI online database.
Table 1.
Basic information of gene sequences
| Sequences | Genera | Species | Number of sequences | Average length in NCBI | Average length after alignment |
|---|---|---|---|---|---|
| matK | 12 | 101 | 160 | 1527 | 1533 |
| rbcL | 11 | 96 | 133 | 1428 | 1428 |
| ndhF | 11 | 102 | 156 | 2247 | 2314 |
| ycf1 | 10 | 92 | 125 | 5628 | 5817 |
| matK + rbcL | 11 | 96 | 130 | 2928 | 2691 |
| matK + ndhF | 11 | 88 | 118 | 3780 | 3846 |
| matK + ycf1 | 10 | 87 | 118 | 7215 | 7354 |
| rbcL + ndhF | 11 | 89 | 118 | 3708 | 3741 |
| rbcL + ycf1 | 10 | 88 | 118 | 7044 | 7239 |
| ndhF + ycf1 | 10 | 88 | 115 | 7941 | 8110 |
Major genetic features of gene sequences
The mutation rate at C and G sites was significantly higher than that at A and T sites, and the majority of base mutations in Theaceae plants were GC → AT transitions (Fig. 1). Among single gene sequences, rbcL had the highest GC content (44.0%) and ycf1 the lowest (29.8%), so rbcL was highly conserved, whereas the bases in ycf1 were more prone to mutation. Some combined gene sequences showed similar features, including matK + rbcL (GC content 38.3%) and ndhF + ycf1 (GC content 30.3%) (Fig. 1). Average AT and GC content at different coding positions of codons are listed in Table S11 for reference.
Fig. 1.

Average AT and GC content of candidate gene nucleotide sequences in Theaceae plants
ycf1 had the most V sites and the maximum proportion of V sites (1047, 18.1%) among single gene sequences, followed by ndhF (368, 16.2%), matK (219, 14.3%), rbcL (98, 6.9%) (Table 2). Besides, the proportion of Pi sites was the highest in matK (79.0%), followed by those in ndhF (75.5%), rbcL (73.5%) and ycf1(71.1%) (Table 2). The genetic stability of ycf1 with more mutations was relatively low. In contrast, matK had fewer mutations, while its genetic stability was 8% higher compared to that of ycf1. Accordingly, ycf1 and matK were key single gene sequences to affect the genetic features of combined gene sequences (e.g., ndhF + ycf1 with 1401 V sites and matK + rbcL with the highest proportion of Pi sites (76.9%)).
Table 2.
The number and proportion of Conserved (C) sites, Variable sites (V), Parsimony informative sites (Pi) and Singleton sites (S) in Theaceae plants
| Sequences | Conserved sites (number) | Conserved sites (% of total) | Variable sites (number) | Variable sites (% of total) | Parsimony informative sites (number) | Parsimony informative sites (% of variable sites) | Singleton sites (number) | Singleton sites (% of variable sites) |
|---|---|---|---|---|---|---|---|---|
| matK | 1314 | 85.7 | 219 | 14.3 | 173 | 79.0 | 46 | 21.0 |
| rbcL | 1330 | 93.1 | 98 | 6.9 | 72 | 73.5 | 26 | 26.5 |
| ndhF | 1909 | 83.8 | 368 | 16.2 | 278 | 75.5 | 84 | 22.8 |
| ycf1 | 4722 | 81.9 | 1047 | 18.1 | 744 | 71.1 | 303 | 28.9 |
| matK + rbcL | 2656 | 89.9 | 299 | 10.1 | 230 | 76.9 | 69 | 23.1 |
| matK + ndhF | 3249 | 85.5 | 549 | 14.5 | 418 | 76.1 | 125 | 22.8 |
| matK + ycf1 | 6057 | 83.0 | 1242 | 17.0 | 874 | 70.4 | 368 | 29.6 |
| rbcL + ndhF | 3247 | 87.8 | 452 | 12.2 | 329 | 72.8 | 117 | 25.9 |
| rbcL + ycf1 | 6063 | 84.2 | 1137 | 15.8 | 802 | 70.5 | 335 | 29.5 |
| ndhF + ycf1 | 6652 | 82.6 | 1401 | 17.4 | 958 | 68.4 | 433 | 30.9 |
As depicted in Fig. 2, ycf1, matK + ycf1, rbcL + ycf1 and ndhF + ycf1 had high-frequency base substitution mutations. All R values were higher than 1.0 (Fig. 3), which revealed that the main form of gene mutation in Theaceae plants was base transition. matK and ndhF had a lower saturation effect and less evolutionary noise, features that are conducive to constructing phylogenetic trees and collecting correct genetic information (Fig. 3). Moreover, matK and ndhF mitigated the saturation effect of ycf1 in combined gene sequences. Table S12 lists the number of base substitutions and R values at different coding positions of codons.
Fig. 2.

Significant correlation between GC content and overall mean genetic distance. Stacked Bar of the number of base substitution including base transversion and base transition in candidate nucleotide sequences of Theaceae plants
Fig. 3.
Analysis of ratios of transitional pairs to transversional pairs (R value). The Y axis takes y = 1.0 as the baseline, which suggests that all candidate nucleotide sequences show the effect of base substitution saturation
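The R value discussed above can be illustrated by counting transitional and transversional differences over all pairwise comparisons in an alignment. The Python sketch below uses a simple count ratio; this is an assumption for illustration and may differ from the model-based estimate reported by MEGA.

```python
from itertools import combinations

TRANSITION_PAIRS = ({"A", "G"}, {"C", "T"})


def ts_tv_ratio(seqs):
    """Return (transitions, transversions, ratio) over all sequence pairs."""
    ts = tv = 0
    for s1, s2 in combinations(seqs, 2):
        for a, b in zip(s1.upper(), s2.upper()):
            if a == b or a not in "ACGT" or b not in "ACGT":
                continue  # identical, gapped or ambiguous sites are skipped
            if {a, b} in TRANSITION_PAIRS:
                ts += 1
            else:
                tv += 1
    ratio = ts / tv if tv else float("inf")
    return ts, tv, ratio


# Toy alignment (hypothetical data).
ratio_result = ts_tv_ratio(["AACC", "GACA", "GATC"])
print(ratio_result)
```

A ratio above 1.0 means that transitions outnumber transversions, which is the pattern described for the Theaceae sequences.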
Genetic diversity analysis
ycf1 had high levels of nucleotide diversity with the maximum θ and Hd values (0.03826, 0.997) among single gene sequences (Table 3). Thus, the high mutation rate of ycf1 contributed to abundant genetic diversity and genetic resources within Theaceae plants. Besides, θ and Hd values of matK + ycf1 (0.03583, 0.997), rbcL + ycf1 (0.03314, 0.997) and ndhF + ycf1 (0.03644, 0.998) were significantly higher compared to that of other combined gene sequences (Table 3). As revealed by the above results, ycf1 gene sequences could be suitable for the species identification of Theaceae plants.
Table 3.
Genetic diversity of gene sequences
| Sequences | Nucleotide diversity: θ | Nucleotide diversity: Hd | Neutrality tests: S | Neutrality tests: π | Neutrality tests: Fu’s Fs | P value (Fu’s Fs) | Neutrality tests: D | P value (D) |
|---|---|---|---|---|---|---|---|---|
| matK | 0.02837 | 0.976 | 212 | 0.02063 | −0.98564 | P > 0.10 | −0.88178 | P > 0.10 |
| rbcL | 0.01371 | 0.931 | 98 | 0.00761 | −1.56362 | P > 0.10 | −1.43028 | P > 0.10 |
| ndhF | 0.03094 | 0.982 | 317 | 0.01779 | −1.71519 | P > 0.10 | −1.38427 | P > 0.10 |
| ycf1 | 0.03826 | 0.997 | 969 | 0.02571 | −1.85941 | P > 0.10 | −1.09645 | P > 0.10 |
| matK + rbcL | 0.02054 | 0.984 | 291 | 0.01324 | −1.32595 | P > 0.10 | −1.17354 | P > 0.10 |
| matK + ndhF | 0.02918 | 0.993 | 492 | 0.0203 | −1.19153 | P > 0.10 | −1.01837 | P > 0.10 |
| matK + ycf1 | 0.03583 | 0.997 | 1165 | 0.02467 | −1.84762 | P > 0.10 | −1.04623 | P > 0.10 |
| rbcL + ndhF | 0.02424 | 0.991 | 403 | 0.01595 | −1.53279 | P > 0.10 | −1.14323 | P > 0.10 |
| rbcL + ycf1 | 0.03314 | 0.997 | 1069 | 0.02232 | −1.85036 | P > 0.10 | −1.09645 | P > 0.10 |
| ndhF + ycf1 | 0.03644 | 0.998 | 1278 | 0.02466 | −1.97306 | 0.10 > P > 0.05 | −1.08892 | P > 0.10 |
In accordance with the P values of Fu’s Fs and D (Table 3), none of the neutrality tests was significant, so the mutational pattern of chloroplast genes in Theaceae plants was consistent with random mutation. Thus, Theaceae species were observed to be less susceptible to environmental, human, or other external factors. Among single gene sequences, rbcL, ndhF and ycf1 all facilitated the population expansion of Theaceae plants, with higher absolute values of Fu’s Fs (−1.56362, −1.71519, −1.85941) and D (−1.43028, −1.38427, −1.09645).
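To make the D statistic concrete, the following Python sketch computes Watterson's estimator and Tajima's D from the number of sequences, the number of segregating sites and the mean number of pairwise differences, using the standard constants from Tajima (1989). It illustrates the statistic only and is not the DnaSP implementation; note that the π values in Table 3 are per-site, whereas this function expects the mean pairwise difference on the same scale as S.

```python
import math


def tajimas_d(n, seg_sites, mean_pairwise_diff):
    """Return (Watterson's theta, Tajima's D).

    n                  -- number of sequences
    seg_sites          -- number of segregating sites (S)
    mean_pairwise_diff -- mean number of pairwise differences (same scale as S)
    """
    if n < 2 or seg_sites <= 0:
        raise ValueError("need at least two sequences and one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    theta_w = seg_sites / a1
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return theta_w, (mean_pairwise_diff - theta_w) / math.sqrt(variance)


# Textbook-style toy input (not data from this study).
theta_toy, d_stat = tajimas_d(n=10, seg_sites=16, mean_pairwise_diff=3.888889)
print(round(theta_toy, 3), round(d_stat, 3))
```

A negative D arises when the mean pairwise difference is smaller than Watterson's estimate, that is, when there is an excess of low-frequency variants, which is the direction of all D values in Table 3.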
Genetic distance analysis
There was no overlap between the maximum intraspecific genetic distance and the minimum interspecific genetic distance among single and combined gene sequences (Fig. 4), which revealed that chloroplast genes of Theaceae plants had significant genetic differences and could help obtain barcodes. Moreover, as indicated by the significant differences between intraspecific and interspecific distances of ycf1, matK + ycf1, rbcL + ycf1 and ndhF + ycf1, these gene sequences were more suitable for species recognition and identification. However, rbcL was less suitable for making barcodes, as shown by the smaller corresponding difference for the rbcL + ndhF gene sequence.
Fig. 4.
Bar chart of interspecific distances and intraspecific distances of Theaceae plants based on K2P genetic distance of gene sequences
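The comparison between intraspecific and interspecific distances can be expressed as a simple barcode-gap check. The Python sketch below assumes a symmetric pairwise distance matrix and one species label per sequence; the function name and the toy values are hypothetical.

```python
from itertools import combinations


def barcode_gap(labels, matrix):
    """Return (max intraspecific distance, min interspecific distance)."""
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            intra.append(matrix[i][j])
        else:
            inter.append(matrix[i][j])
    if not intra or not inter:
        raise ValueError("need both within- and between-species pairs")
    return max(intra), min(inter)


# Toy K2P distance matrix for two species with two sequences each.
labels = ["sp_A", "sp_A", "sp_B", "sp_B"]
matrix = [
    [0.000, 0.001, 0.020, 0.018],
    [0.001, 0.000, 0.015, 0.017],
    [0.020, 0.015, 0.000, 0.002],
    [0.018, 0.017, 0.002, 0.000],
]
max_intra, min_inter = barcode_gap(labels, matrix)
print(max_intra, min_inter, min_inter > max_intra)
```

A locus shows a barcode gap when the minimum interspecific distance exceeds the maximum intraspecific distance, which corresponds to the absence of overlap reported above.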
Phylogenetic analysis
In accordance with the results of major genetic features, genetic diversity and genetic distance analysis, matK, ndhF and ycf1 gene sequences were combined to construct NJ, ML and Bayes phylogenetic trees. The NJ tree contained 9 blocks, 10 genera and 90 different species (Fig. 5). Camellia was the largest genus in Theaceae plants, while Apterosperma, Eurya, Ternstroemia, Gordonia, and Franklinia were small genera. Eurya, Ternstroemia and Apterosperma underwent a long evolutionary process, as revealed by the estimated genetic distances of species at terminal branches. Camellia, Schima, Stewartia and Pyrenaria followed the same evolutionary transition pattern with similar genetic distances. Since Polyspora and Apterosperma were small genera, they were probably new genera evolved from Camellia and Pyrenaria. Moreover, Eurya and Ternstroemia shared a related ancestry with similar genetic distances.
Fig. 5.
The NJ tree of Theaceae plants derived from matK + ndhF + ycf1 gene sequences based on the K2P model. The bootstrap values of all tree branches were greater than 75%. Triangle marks (in red) on branches represent bootstrap values; the bigger the triangle, the higher the bootstrap value. Branch lengths, namely evolutionary distances, are displayed on the main branches to four significant digits
matK + ndhF + ycf1 gene sequences had excellent identification ability at genera level. There were obvious genetic nodes on branches of all genera in Theaceae plants for segmentation (Fig. 5). For species-level identification, species in monotypic genera (Apterosperma, Ternstroemia, Franklinia), small genera (Eurya, Gordonia) and species in large genera (species ≤ 10, Polyspora) had obvious internal branches. However, species in large genera (species > 10, Stewartia, Pyrenaria, Camellia, Schima) were reported with limitations in terms of species-level identification ability, especially for Camellia; their genetic distances at terminal branches were below 0.0001.
The same results were acquired in ML and Bayes phylogenetic trees (Figs. S1 and S2). Impacted by the algorithm, some branches with low bootstrap values in Bayes tree would be difficult to exclude. Supplementary information regarding matK + ndhF + ycf1 gene sequence database is listed in Table S13.
Specific barcodes based on matK + ndhF + ycf1 gene sequences
It was observed that when “Query Cover” and “Per.Ident” were both 100% for the candidate gene sequence segments, the accuracy of species-level identification increased markedly by repeated experiments (Table 4). Subsequently, species-specific barcodes based on the above segments effectively identified Camellia amplexicaulis (Ca), Franklinia alatamaha (Fa), Gordonia brandegeei (Gb) and Stewartia micrantha (Sm) from other species of Theaceae plants (Fig. 6) with the highest “Max score” and “Total score” values. Furthermore, alternative segments with the same identification ability were provided to increase efficiency (Fig. S3 and Table S14).
Table 4.
NCBI Blast output of candidate gene sequence segments
| Barcodes | Query cover (%) | Per.Ident (%) | Max score | Total score | E value | Accession on NCBI | Identified species |
|---|---|---|---|---|---|---|---|
| Ca-matK-1 | 100 | 100 | 278 | 278 | 2E−70 | NC_051559.1; MT317095.1 | Camellia amplexicaulis |
| Ca-ndhF-1 | 100 | 100 | 278 | 278 | 2E−70 | NC_051559.1; MT317095.1 | Camellia amplexicaulis |
| Ca-ycf1-1 | 100 | 100 | 278 | 556 | 2E−70 | NC_051559.1; MT317095.2 | Camellia amplexicaulis |
| Fa-matK-1 | 100 | 100 | 278 | 278 | 2E−70 | KY406774.1; AF380082.1 | Franklinia alatamaha |
| Fa-ndhF-1 | 100 | 100 | 278 | 278 | 2E−70 | KY406774.1; HM100319.1; HM164089.1 | Franklinia alatamaha |
| Fa-ycf1-1 | 100 | 100 | 278 | 278 | 2E−70 | KY406774.1 | Franklinia alatamaha |
| Gb-matK-1 | 100 | 100 | 278 | 278 | 2E−70 | KY406761.1; AF380084.1 | Gordonia brandegeei |
| Gb-ndhF-1 | 100 | 100 | 278 | 278 | 2E−70 | KY406761.1 | Gordonia brandegeei |
| Gb-ycf1-1 | 100 | 100 | 278 | 278 | 2E−70 | KY406761.1 | Gordonia brandegeei |
| Sm-matK-1 | 100 | 100 | 278 | 278 | 2E−70 | NC_041471.1; MH782186.1 | Stewartia micrantha |
| Sm-ndhF-1 | 100 | 100 | 278 | 278 | 2E−70 | NC_041471.1; MH782186.1 | Stewartia micrantha |
| Sm-ycf1-1 | 100 | 100 | 278 | 556 | 2E−70 | NC_041471.1; MH782186.1 | Stewartia micrantha |
Fig. 6.
Combinatorial DNA barcodes including one-dimensional and two-dimensional DNA barcodes of Theaceae plants based on matK, ndhF and ycf1. In one-dimensional DNA barcodes, base A, T, C, G in green, red, blue and black, respectively
Discussion
Theaceae-containing products, which include precious herbal medicines (Wang et al. 2016) and healthy food (Koch et al. 2019), have spread across the world. Moreover, germplasm resources of Theaceae plants with low genetic diversity should be identified and protected (Niu et al. 2019). Existing studies have confirmed that barcodes based on chloroplast gene segments distinguish plants at the species and population levels (Cui et al. 2020; Besse et al. 2021). Methods of constructing barcodes should not be limited to a single standard based on genetic algorithms, nor should they ignore tests of the identification accuracy of barcodes. Accordingly, as revealed by the results of barcode stability and accuracy analysis in this study, chloroplast barcodes based on matK + ndhF + ycf1 combined gene sequences efficiently distinguished Camellia amplexicaulis, Franklinia alatamaha, Gordonia brandegeei and Stewartia micrantha from other Theaceae species. With rapid sequencing technologies (Li et al. 2015), constructing species-specific barcode databases will become a more widely accepted approach to Theaceae species identification, exhibiting high stability and accuracy.
Design of specific barcodes
In accordance with previous barcodes regarding Rehmannia (Duan et al. 2019) and Orchidaceae species (Li et al. 2021), whether combined gene sequences increase the accuracy of identifying Theaceae plants should be further examined. Moreover, novel synonyms among species of Theaceae plants would lead to difficulty in making accurate species-level identifications (Skvarla et al. 2020). In this study, sequences with confused synonyms were deleted, or some synonyms were replaced with more common species names.
GC content in complete chloroplast genomes of Theaceae plants is 36–38% (Kim et al. 2017; Xu et al. 2021; Yang et al. 2021), which is similar to the 37.2% GC content in Orchidaceae plants (Trávníček et al. 2019) and the 37.7% GC content in Asteraceae plants (Lv et al. 2020; Xie et al. 2021). In fact, the population expansion of Orchidaceae and Asteraceae species has been affected by the GC content of their own sequences, while there is no such correlation for Theaceae species (Wang et al. 2016). The reason is probably that the severe growth environment of some Theaceae species, with relatively high temperatures, rather than population expansion, enriched genetic diversity in chloroplast genomes (Wang et al. 2022). Thus, as indicated by the 34.6% GC content of the four single gene sequences (matK, rbcL, ndhF and ycf1) (Fig. 1), some specific sequences with lower GC content, rather than complete chloroplast genomes, artificially improved the success of species identification. Studies on C, V, Pi and S sites revealed single-nucleotide polymorphisms (SNPs) in Theaceae plants to identify de novo mutations of sequences (Neininger et al. 2019). Recent studies have demonstrated that V sites are the most common SNP sites and serve as segregating sites for constructing barcode segments (Gong et al. 2021; Li et al. 2021), whereas barcodes based on Pi and S sites appear only occasionally owing to their rare genetic traits (Delabye et al. 2019; Smidt et al. 2020). In this study, S and Pi sites (the sum of Pi and S sites equals the V sites in Table 2) were used as the standards for screening gene sequences, and V sites were adopted to obtain barcode segments. ycf1 was confirmed as a new focus for making barcodes (Fig. 2) (Li et al. 2021), and matK and ndhF gene sequences could increase the identification accuracy in combination with ycf1.
Genetic distance refers to a critical factor of judging whether a combined gene sequence reduces identification accuracy because the four single gene sequence fragments share high sequence homology (Li et al. 2021). Besides, since Theaceae databases on NCBI have more limited datasets, this method was employed instead of “Best Close Match” (Collins and Cruickshank 2013; Li et al. 2021). In fact, tests for genetic distances have confirmed that matK (Chaveerach et al. 2016), ndhF (Korotkova et al. 2014) and ycf1 (Gogoi et al. 2020) cannot reduce identification accuracy of combined gene sequences.
Camellia amplexicaulis, a main raw material for new neolignan components, has significant potential for medicinal development (e.g., osteoblast differentiation treatment) (Tung et al. 2009). Franklinia alatamaha, used as an ornamental plant, is primarily distributed in the United States (Luna and Ochoterena 2004). The genetic traits of Gordonia brandegeei and Stewartia micrantha are highly feasible to use in creating germplasm resources of Theaceae species, since species of the genera Gordonia and Stewartia have a strong capacity for asexual propagation and interspecific crossing (Wang et al. 2016). Constructing barcodes of the above species could ensure the utilization and protection of biological resources of Theaceae plants.
Identification accuracy of barcodes
Results of major genetic features (Table 2) and genetic diversity (Table 3) have shown that ndhF and ycf1 have a positive significance for population expansion and facilitate the generation of abundant SNP sites, while matK + ndhF + ycf1 cannot effectively identify some species in large genera (such as Camellia) (Fig. 5), which is similar to other studies (Li et al. 2021). According to phylogenetic analyses of Piper species (Chaveerach et al. 2016) and Orchidaceae species (Li et al. 2021), some species with low distance values in large genera were difficult to distinguish. This occurred because nucleotide diversity among species in large genera was low (Feng et al. 2017). Accordingly, multiple SNP sites were screened in matK, ndhF and ycf1 gene sequences to expand the scope of species identification. Besides, Eurya, Ternstroemia and Apterosperma species were observed as rare species (Wang et al. 2016; Zhang et al. 2020), and within each of Camellia, Schima, Stewartia and Pyrenaria, the species had a similar evolutionary degree (Luna and Ochoterena 2004).
Several limitations of this study should be addressed. One limitation was that the small number of gene sequences on NCBI, together with ambiguous codes, indels and missing symbols in sequences, affected the genetic features and diversity estimated for Theaceae species. As a result, some significant SNP sites were omitted, thus causing a narrower spectrum of identifiable species. Furthermore, in Blast tests for screening candidate gene sequence segments, the recognition rate of numerous segments with lower max or total scores could not approach 100%, so we could derive candidate gene sequence segments from only a few Theaceae species.
Notwithstanding the above unavoidable limitations, this is the first study that constructs barcodes based on genetic information of Theaceae species. Since the emergence of next-generation sequencing technology facilitates the widespread use of online sequence databases and the pursuit of highly efficient species identification technology (Besse et al. 2021), we consider that DNA barcoding still applies to Theaceae species identification, and matK + ndhF + ycf1 will serve as a vital candidate combined gene sequence.
Lastly, DNA barcoding identifies species in accordance with the principle of differential alignment between sequences, which pertains to the branch of genetics in essence. Thus, this technology should be continuously optimized to ensure its innovation and universality, and it should be gradually combined with other theories and technologies of genetics to form a complementary data analysis system about genetics (e.g., the analysis of codon usage bias (Ganie et al. 2019), gene flow and genetic differentiation (Presgraves 2018)). Besides, we suggest that researchers upload complete chloroplast sequences to databases, and NCBI presents up-to-date sequence information and corresponding supplementary references for the protection of some important species.
Conclusions
In this study, the aim was to construct species-specific cpDNA barcodes of Theaceae plants and assess the identification accuracy of the barcodes. matK + ndhF + ycf1, with high accuracy and stability, distinguishes Camellia amplexicaulis, Franklinia alatamaha, Gordonia brandegeei and Stewartia micrantha from other Theaceae species. To the best of our knowledge, this study is the first comprehensive investigation of barcodes of Theaceae species. The methods of constructing barcodes used in this study are better supported theoretically by the genetic information analysis. Moreover, matK + ndhF + ycf1 was tested for its accuracy in comparison with other combined gene sequences through phylogenetic analysis. Lastly, the barcodes provided in this study can serve as references for identifying Theaceae species and optimizing existing barcoding technologies to protect and make rational use of biological resources.
Supplementary Information
Below is the link to the electronic supplementary material.
Fig. S1 The ML tree of Theaceae plants inferred from matK + rbcL + ycf1 sequences based on the K2P model. The bootstrap values of all tree branches were greater than 75%. Red triangles on branches represent bootstrap values; the larger the triangle, the higher the bootstrap value. Branch lengths, i.e. evolutionary distances, are displayed on the main branches to four significant digits. (TIF 2747 kb)
Fig. S2 The Bayes tree of Theaceae plants inferred from matK + rbcL + ycf1 sequences based on the K2P model. The bootstrap values of all tree branches were greater than 75%. Red triangles on branches represent bootstrap values; the larger the triangle, the higher the bootstrap value. Branch lengths, i.e. evolutionary distances, are displayed on the main branches to four significant digits. (TIF 3597 kb)
Fig. S3 Combinatorial super DNA barcodes employed as substitute sequences, including one-dimensional and two-dimensional DNA barcodes of Theaceae plants based on matK, ndhF and ycf1. In the one-dimensional DNA barcodes, bases A, T, C and G are shown in green, red, blue and black, respectively. (TIF 6484 kb)
Table S1 Accession numbers, version numbers, accepted names, synonyms and definition of matK (single gene sequence) in Theaceae plants on the NCBI online database. (XLSX 19 kb)
Table S2 Accession numbers, version numbers, accepted names, synonyms and definition of rbcL (single gene sequence) in Theaceae plants on the NCBI online database. (XLSX 17 kb)
Table S3 Accession numbers, version numbers, accepted names, synonyms and definition of ndhF (single gene sequence) in Theaceae plants on the NCBI online database. (XLSX 19 kb)
Table S4 Accession numbers, version numbers, accepted names, synonyms and definition of ycf1 (single gene sequence) in Theaceae plants on the NCBI online database. (XLSX 17 kb)
Table S5 Accession numbers, version numbers, accepted names, synonyms and definition of matK + rbcL (combined gene sequence) in Theaceae plants on the NCBI online database. (XLSX 17 kb)
Table S6 Accession numbers, version numbers, accepted names, synonyms and definition of matK + ndhF (combined gene sequence) in Theaceae plants on the NCBI online database. (XLSX 16 kb)
Table S7 Accession numbers, version numbers, accepted names, synonyms and definition of matK + ycf1 (combined gene sequence) in Theaceae plants on the NCBI online database. (XLSX 16 kb)
Table S8 Accession numbers, version numbers, accepted names, synonyms and definition of rbcL + ndhF (combined gene sequence) in Theaceae plants on the NCBI online database. (XLSX 16 kb)
Table S9 Accession numbers, version numbers, accepted names, synonyms and definition of rbcL + ycf1 (combined gene sequence) in Theaceae plants on the NCBI online database. (XLSX 16 kb)
Table S10 Accession numbers, version numbers, accepted names, synonyms and definition of ndhF + ycf1 (combined gene sequence) in Theaceae plants on the NCBI online database. (XLSX 16 kb)
Table S11 Average AT and GC content at different coding positions of codons in Theaceae plants. (DOCX 17 kb)
Table S12 The mean number of identical pairs, base transitions and transversions at different coding positions of codons and the ratio of transitional pairs to transversional pairs (R value) of Theaceae plants. (DOCX 21 kb)
Table S13 Accession numbers, version numbers, accepted names, synonyms and definition of matK + ndhF + ycf1 (combined gene sequence) in Theaceae plants on the NCBI online database. (XLSX 15 kb)
Table S14 NCBI Blast output of alternative candidate gene sequence segments. (DOCX 18 kb)
Acknowledgements
The authors are grateful to Dr. Shuai Hu, School of Life Sciences, Tsinghua University, for critical review of this manuscript.
Abbreviations
- cpDNA: Chloroplast DNA
- C sites: Conserved sites
- V sites: Variable sites
- Pi sites: Parsimony informative sites
- S sites: Singleton sites
- SNP sites: Single nucleotide polymorphism sites
- ii: Identical pairs
- si: Transitional pairs
- sv: Transversional pairs
- R value: Transitional pairs/Transversional pairs
- θ: Nucleotide substitution rate
- Hd: Haplotype diversity
- S: Segregating sites
- π: Nucleotide diversity
- Fu’s Fs: Representative value of neutrality tests
- Tajima’s D: Representative value of neutrality tests
- P value: Hypothetical probability
- ML tree: Maximum likelihood tree
- NJ tree: Neighbor joining tree
- Ca: Camellia amplexicaulis
- Fa: Franklinia alatamaha
- Gb: Gordonia brandegeei
- Sm: Stewartia micrantha
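As an illustration of the ii, si, sv and R value entries above, the following Python sketch counts identical, transitional and transversional pairs for a single pair of aligned sequences. It is a simplified example under stated assumptions, not the procedure used to produce the values in this study: the function names and the two toy sequences are hypothetical, and positions with gaps or ambiguous bases are simply skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def pair_counts(seq_a, seq_b):
    """Count identical (ii), transitional (si) and transversional (sv) pairs.

    Both sequences are assumed to be aligned and of equal length. Positions
    where either sequence has a character other than A, C, G or T are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    ii = si = sv = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        if a == b:
            ii += 1
        elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            si += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            sv += 1  # purine <-> pyrimidine
    return ii, si, sv


def r_value(si, sv):
    """R = transitional pairs / transversional pairs (undefined when sv is 0)."""
    if sv == 0:
        raise ZeroDivisionError("no transversional pairs; R is undefined")
    return si / sv


# Hypothetical aligned sequences, for illustration only.
ii, si, sv = pair_counts("ATGCCGTA", "GTGACATA")
print(ii, si, sv, r_value(si, sv))
```

For the toy pair, five positions are identical, two differ by a transition (A/G at two positions) and one by a transversion (C/A), giving R = 2.0. Values reported for a whole data set would be averaged over all sequence pairs rather than taken from one pair.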
Authors’ contributions
XHG and YLL designed the research. SJ, FLC, PQ, HX, GP and YLL collected and analyzed data. SJ, FLC, YLL and XHG wrote the main manuscript text. SJ prepared all figures and tables. All authors read and approved the manuscript.
Funding
The authors acknowledge funding from the Key Research & Development Project of Hunan Provincial Department of Science and Technology (2019NK2081), the National Natural Science Foundation of China (31872866) and the National Key Research and Development Program of China (2017YFF0210301) to carry out this work.
Data availability
The data supporting the findings of this study are provided in the manuscript and its supplementary material.
Declarations
Conflict of interest
The authors declare that they have no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Shuai Jiang and Fenglin Chen have contributed equally to this work.
Contributor Information
Yongliang Li, Email: 1608956268@qq.com.
Xinhong Guo, Email: gxh@hnu.edu.cn.
References
- Amar MH. ycf1-ndhF genes, the most promising plastid genomic barcode, sheds light on phylogeny at low taxonomic levels in Prunus persica. J Genet Eng Biotechnol. 2020;18(1):42. doi: 10.1186/s43141-020-00057-3.
- Besse P, Da Silva D, Grisoni M. Plant DNA barcoding principles and limits: a case study in the genus Vanilla. Methods Mol Biol. 2021;2222:131–148. doi: 10.1007/978-1-0716-0997-2_8.
- Bhargava M, Sharma A. DNA barcoding in plants: evolution and applications of in silico approaches and resources. Mol Phylogenet Evol. 2013;67(3):631–641. doi: 10.1016/j.ympev.2013.03.002.
- Chaveerach A, Tanee T, Sanubol A, Monkheang P, Sudmoon R. Efficient DNA barcode regions for classifying Piper species (Piperaceae). PhytoKeys. 2016;70:1–10. doi: 10.3897/phytokeys.70.6766.
- Chen G, Sun W. The role of botanical gardens in scientific research, conservation, and citizen science. Plant Divers. 2018;40(4):181–188. doi: 10.1016/j.pld.2018.07.006.
- Chen C, Chen H, Zhang Y, Thomas HR, Frank MH, He Y, Xia R. TBtools: an integrative toolkit developed for interactive analyses of big biological data. Mol Plant. 2020;13(8):1194–1202. doi: 10.1016/j.molp.2020.06.009.
- Chen LP, Zheng FY, Bai J, Wang JM, Lv CY, Li X, Zhi YC, Li XJ. Comparative analysis of mitogenomes among six species of grasshoppers (Orthoptera: Acridoidea: Catantopidae) and their phylogenetic implications in wing-type evolution. Int J Biol Macromol. 2020;159(1):1062–1072. doi: 10.1016/j.ijbiomac.2020.05.058.
- Collins RA, Cruickshank RH. The seven deadly sins of DNA barcoding. Mol Ecol Resour. 2013;13(6):969–975. doi: 10.1111/1755-0998.12046.
- Cui N, Liao BS, Liang CL, Li SF, Zhang H, Xu J, Li XW, Chen SL. Complete chloroplast genome of Salvia plebeia: organization, specific barcode and phylogenetic analysis. Chin J Nat Med. 2020;18(8):563–572. doi: 10.1016/S1875-5364(20)30068-6.
- Delabye S, Rougerie R, Bayendi S, Andeime-Eyene M, Zakharov EV, deWaard JR, Hebert PDN, Kamgang R, Le Gall P, Lopez-Vaamonde C, Mavoungou JF, Moussavou G, Moulin N, Oslisly R, Rahola N, Sebag D, Decaëns T. Characterization and comparison of poorly known moth communities through DNA barcoding in two Afrotropical environments in Gabon. Genome. 2019;62(3):96–107. doi: 10.1139/gen-2018-0063.
- Duan H, Wang W, Zeng Y, Guo M, Zhou Y. The screening and identification of DNA barcode sequences for Rehmannia. Sci Rep. 2019;9(1):17295. doi: 10.1038/s41598-019-53752-8.
- Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–1797. doi: 10.1093/nar/gkh340.
- Feng C, Pettersson M, Lamichhaney S, Rubin CJ, Rafati N, Casini M, Folkvord A, Andersson L. Moderate nucleotide diversity in the Atlantic herring is associated with a low mutation rate. Elife. 2017;6:e23907. doi: 10.7554/eLife.23907.
- Fontaine B, Achterberg K, Alonso-Zarazaga MA. New species in the old world: Europe as a frontier in biodiversity exploration, a test bed for 21st century taxonomy. PLoS ONE. 2012;7(5):e36881. doi: 10.1371/journal.pone.0036881.
- Fu YX. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics. 1997;147(2):915–925. doi: 10.1093/genetics/147.2.915.
- Ganie SA, Molla KA, Henry RJ, Bhat KV, Mondal TK. Advances in understanding salt tolerance in rice. Theor Appl Genet. 2019;132(4):851–870. doi: 10.1007/s00122-019-03301-8.
- García-Robledo C, Erickson DL, Staines CL, Erwin TL, Kress WJ. Tropical plant-herbivore networks: reconstructing species interactions using DNA barcodes. PLoS ONE. 2013;8(1):e52967. doi: 10.1371/journal.pone.0052967.
- Gogoi B, Wann SB, Saikia SP. DNA barcodes for delineating Clerodendrum species of North East India. Sci Rep. 2020;10(1):13490. doi: 10.1038/s41598-020-70405-3.
- Gong L, Zhang D, Ding X, Huang J, Guan W, Qiu X, Huang Z. DNA barcode reference library construction and genetic diversity and structure analysis of Amomum villosum Lour. (Zingiberaceae) populations in Guangdong Province. PeerJ. 2021;9:e12325. doi: 10.7717/peerj.12325.
- Gonzalez MA, Baraloto C, Engel J, Mori SA, Pétronelli P, Riéra B, Roger A, Thébaud C, Chave J. Identification of Amazonian trees with DNA barcodes. PLoS ONE. 2009;4(10):e7483. doi: 10.1371/journal.pone.0007483.
- Hall BG. Building phylogenetic trees from molecular data with MEGA. Mol Biol Evol. 2013;30(5):1229–1235. doi: 10.1093/molbev/mst012.
- Hebert PDN, Cywinska A, Ball SL, deWaard JR. Biological identifications through DNA barcodes. Proc R Soc Lond B Biol Sci. 2003;270:313–321. doi: 10.1098/rspb.2002.2218.
- Hollingsworth PM. Refining the DNA barcode for land plants. Proc Natl Acad Sci USA. 2011;108(49):19451–19452. doi: 10.1073/pnas.1116812108.
- Kim SH, Cho CH, Yang M, Kim SC. The complete chloroplast genome sequence of the Japanese Camellia (Camellia japonica L.). Mitochondrial DNA B Resour. 2017;2(2):583–584. doi: 10.1080/23802359.2017.1372719.
- Koch W, Zagórska J, Marzec Z, Kukula-Koch W. Applications of tea (Camellia sinensis) and its active constituents in cosmetics. Molecules. 2019;24(23):4277. doi: 10.3390/molecules24234277.
- Korotkova N, Nauheimer L, Ter-Voskanyan H, Allgaier M, Borsch T. Variability among the most rapidly evolving plastid genomic regions is lineage-specific: implications of pairwise genome comparisons in Pyrus (Rosaceae) and other angiosperms for marker choice. PLoS ONE. 2014;9(11):e112998. doi: 10.1371/journal.pone.0112998.
- Kroymann J, de Groot GA, During HJ, Maas JW, Schneider H, Vogel JC, Erkens RH. Use of rbcL and trnL-F as a two-locus DNA barcode for identification of NW-European ferns: an ecological perspective. PLoS ONE. 2011;6(1):e16371. doi: 10.1371/journal.pone.0016371.
- Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021;49(W1):W293–W296. doi: 10.1093/nar/gkab301.
- Li X, Yang Y, Henry RJ, Rossetto M, Wang Y, Chen S. Plant DNA barcoding: from gene to genome. Biol Rev Camb Philos Soc. 2015;90(1):157–166. doi: 10.1111/brv.12104.
- Li W, Zhang C, Guo X, Liu Q, Wang K. Complete chloroplast genome of Camellia japonica genome structures, comparative and phylogenetic analysis. PLoS ONE. 2019;14(5):e0216645. doi: 10.1371/journal.pone.0216645.
- Li H, Xiao W, Tong T, Li Y, Zhang M, Lin X, Zou X, Wu Q, Guo X. The specific DNA barcodes based on chloroplast genes for species identification of Orchidaceae plants. Sci Rep. 2021;11(1):1424. doi: 10.1038/s41598-021-81087-w.
- Liu HL, Zeng YT, Zhao X, Ye YL, Wang B, Tong HR. Monitoring the authenticity of pu'er tea via chemometric analysis of multielements and stable isotopes. Food Res Int. 2020;136:109483. doi: 10.1016/j.foodres.2020.109483.
- Lu H, Jiang W, Ghiassi M, Lee S, Nitin M. Classification of Camellia (Theaceae) species using leaf architecture variations and pattern recognition techniques. PLoS ONE. 2012;7(1):e29704. doi: 10.1371/journal.pone.0029704.
- Luna I, Ochoterena H. Phylogenetic relationships of the genera of Theaceae based on morphology. Cladistics. 2004;20(3):223–270. doi: 10.1111/j.1096-0031.2004.00024.x.
- Lv ZY, Zhang JW, Chen JT, Li ZM, Sun H. The complete chloroplast genome of Soroseris umbrella (Asteraceae). Mitochondrial DNA B. 2020;5:637–638. doi: 10.1080/23802359.2019.1711223.
- Meng XH, Li N, Zhu HT, Wang D, Yang CR, Zhang YJ. Plant resources, chemical constituents, and bioactivities of tea plants from the genus Camellia section Thea. J Agric Food Chem. 2019;67(19):5318–5349. doi: 10.1021/acs.jafc.8b05037.
- Neininger K, Marschall T, Helms V. SNP and indel frequencies at transcription start sites and at canonical and alternative translation initiation sites in the human genome. PLoS ONE. 2019;14(4):e0214816. doi: 10.1371/journal.pone.0214816.
- Newmaster SG, Ragupathy S, Janovec J. A botanical renaissance: state-of-the-art DNA barcoding facilitates an Automated Identification Technology system for plants. Int J Comput Appl Technol. 2009;35(1):50–60. doi: 10.1504/IJCAT.2009.024595.
- Niu S, Song Q, Koiwa H, Qiao D, Zhao D, Chen Z, Liu X, Wen X. Genetic diversity, linkage disequilibrium, and population structure analysis of the tea plant (Camellia sinensis) from an origin center, Guizhou plateau, using genome-wide SNPs developed by genotyping-by-sequencing. BMC Plant Biol. 2019;19(1):328. doi: 10.1186/s12870-019-1917-5.
- Presgraves DC. Evaluating genomic signatures of "the large X-effect" during complex speciation. Mol Ecol. 2018;27(19):3822–3830. doi: 10.1111/mec.14777.
- Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 2012;61(3):539–542. doi: 10.1093/sysbio/sys029.
- Sheth BP, Thaker VS. DNA barcoding and traditional taxonomy: an integrated approach for biodiversity conservation. Genome. 2017;60(7):618–628. doi: 10.1139/gen-2015-0167.
- Skvarla M, Kramer M, Owen CL, Miller GL. Reexamination of Rhopalosiphum (Hemiptera: Aphididae) using linear discriminant analysis to determine the validity of synonymized species, with some new synonymies and distribution data. Biodivers Data J. 2020;8:e49102. doi: 10.3897/BDJ.8.e49102.
- Smidt EDC, Páez MZ, Vieira LDN, Viruel J, De Baura VA, Balsanelli E, De Souza EM, Chase MW. Characterization of sequence variability hotspots in Cranichideae plastomes (Orchidaceae, Orchidoideae). PLoS ONE. 2020;15:e0227991. doi: 10.1371/journal.pone.0227991.
- Stoeckle MY, Gamble CC, Kirpekar R, Young G, Ahmed S, Little DP. Commercial teas highlight plant DNA barcode identification successes and obstacles. Sci Rep. 2011;1:42. doi: 10.1038/srep00042.
- Tamura K, Stecher G, Kumar S. MEGA11: molecular evolutionary genetics analysis version 11. Mol Biol Evol. 2021;38(7):3022–3027. doi: 10.1093/molbev/msab120.
- Trávníček P, Čertner M, Ponert J, Chumová Z, Jersáková J, Suda J. Diversity in genome size and GC content shows adaptive potential in orchids and is closely linked to partial endoreplication, plant life-history traits and climatic conditions. New Phytol. 2019;224:1642–1656. doi: 10.1111/nph.15996.
- Tung NH, Ding Y, Choi EM, Minh CV, Kim YH. New neolignan component from Camellia amplexicaulis and effects on osteoblast differentiation. Chem Pharm Bull (Tokyo). 2009;57(1):65–68. doi: 10.1248/cpb.57.65.
- Wang M, Zhang Y. Adulteration detection of tea samples based on plant rbcL gene sequencing. Sheng Wu Gong Cheng Xue Bao. 2018;34(2):275–281. doi: 10.13345/j.cjb.170375.
- Wang Y, Yang Y, Wei C, Wan X, Thompson HJ. Principles of biomedical agriculture applied to the plant family Theaceae to identify novel interventions for cancer prevention and control. J Agric Food Chem. 2016;64(14):2809–2814. doi: 10.1021/acs.jafc.6b00719.
- Wang J, Tang X, Chu Q, Zhang M, Zhang Y, Xu B. Characterization of the volatile compounds in Camellia oleifera seed oil from different geographic origins. Molecules. 2022;27(1):308. doi: 10.3390/molecules27010308.
- Xie L, Zhao J, Liu R. The complete chloroplast genome of Pseudognaphalium affine (D.Don) Anderb. (Asteraceae). Mitochondrial DNA B Resour. 2021;6(11):3276–3277. doi: 10.1080/23802359.2021.1993104.
- Xu Y, Liu Y, Jia X. Complete chloroplast genome of a cultivated oil camellia species, Camellia gigantocarpa. Mitochondrial DNA B Resour. 2021;7(1):43–45. doi: 10.1080/23802359.2021.2008836.
- Yang M, Xie F, Li J, Zhang Y, Li X, Yin H, Li J. The complete chloroplast genome of Camellia fluviatilis (Theaceae), a wild oil-Camellia species. Mitochondrial DNA B Resour. 2021;6(12):3511–3512. doi: 10.1080/23802359.2021.2005482.
- Yu N, Gu H, Wei Y, Zhu N, Wang Y, Zhang H, Zhu Y, Zhang X, Ma C, Sun A. Suitable DNA barcoding for identification and supervision of Piper kadsura in Chinese medicine markets. Molecules. 2016;21(9):1221. doi: 10.3390/molecules21091221.
- Yu XQ, Drew BT, Yang JB, Gao LM, Li DZ. Comparative chloroplast genomes of eleven Schima (Theaceae) species: Insights into DNA barcoding and phylogeny. PLoS ONE. 2017;12(6):e0178026. doi: 10.1371/journal.pone.0178026.
- Zhang L, Wang Y, Chen Q, Luo Y, Zhang Y, Tang HR, Wang XR. Phylogeny of Rubus in China based on ndhF sequences. Acta Horticult Sin. 2015;42(1):19–30.
- Zhang Y, Meng Q, Wang Y, Zhang X, Wang W. Climate change-induced migration patterns and extinction risks of Theaceae species in China. Ecol Evol. 2020;10(10):4352–4361. doi: 10.1002/ece3.6202.
- Zhang M, Tang YW, Xu Y, Takahiro Y, Shao Y, Wang YG, Song ZP, Yang J, Zhang WJ. Concerted and birth-and-death evolution of 26S ribosomal DNA in Camellia L. Ann Bot. 2021;127(1):63–73. doi: 10.1093/aob/mcaa169.