ABSTRACT
The role of transposable elements (TEs) in genome evolution and phenotypic diversification in Camellia remains poorly understood. Here, we present an integrated analysis of genome resequencing data from 237 Camellia accessions and 11 de novo genome assemblies representing all major floral colour types. We constructed a comprehensive phylogenetic framework for the genus and suggest that the most recent common ancestor likely had white flowers. Comparative genomic analyses reveal structural variants across species that overlap with numerous transposable elements and contribute to genome content divergence. Using a graph‐based genome to characterise these structural variants, we find that lineage‐specific TE amplifications drive regulatory network rewiring, which modulates homoeologous gene expression and influences flower colour diversification. Further experimental validation identifies a lineage‐specific, high‐frequency presence variation mediated by a TIR transposon that regulates MYB60 expression, suppressing anthocyanin biosynthesis and leading to large‐scale floral colour divergence. These findings highlight the central role of TE‐mediated regulatory innovation in the evolution of flower colour in Camellia and offer broader insights into the molecular mechanisms driving phenotypic diversification in plants.
Keywords: camellia, flower colour, genome, structural variant, transposable element
1. Introduction
Extensive genome‐wide association studies have revealed substantial genomic diversity underlying intraspecific phenotypic variation, including differences in coding sequences, gene expression and regulatory elements (Cheng et al. 2025; Fan et al. 2024; Gong et al. 2022). Beyond intraspecific analyses, studying phenotypic evolution across species under long‐term natural selection can uncover valuable genetic variations from wild gene pools (Wang et al. 2022), offering promising solutions to modern breeding bottlenecks. Such evolution often stems from sequence and expression changes driven by single nucleotide polymorphisms (SNPs) or structural variations (SVs). As a major source of genetic diversity, SVs, including insertions, deletions and inversions, play a pivotal role in plant evolution (Shi et al. 2024). Recent studies show that they regulate key traits such as flowering time, fruit quality and inflorescence meristem maintenance (Wang et al. 2021; Chen, Wang, Kong, et al. 2023; Li et al. 2024).
Transposable elements (TEs) are abundant in plant and animal genomes. In some species, such as Camellia, maize and lily, TEs comprise over 80% of the genome (Hu, Fan, et al. 2024; Chen, Wang, Tan, et al. 2023a; Liang et al. 2025). TE insertions and movements drive SVs, including rearrangements, duplications and inversions, making them a key force in genome evolution (Zhang et al. 2025). TEs influence gene expression through multiple mechanisms: introducing transcription factor binding sites, triggering epigenetic modifications (e.g., DNA methylation) (Li et al. 2024) and participating in alternative splicing to generate new transcript isoforms that may alter coding sequences. These mechanisms provide raw material for the emergence of novel traits during evolution (Tian et al. 2025). Additionally, TE insertions in untranslated regions can posttranscriptionally modulate gene expression, contributing to phenotypic variation (Drongitis et al. 2019). Understanding how TE‐mediated genomic variation regulates phenotypic evolution is thus crucial for unravelling genome plasticity and trait development in plants.
The Camellia genus is an ideal model for studying TE‐driven phenotypic diversification. Comprising ~200 species—including ornamental camellias, tea and oil camellias—this Theaceae family member holds significant economic value and is primarily distributed across subtropical Asia (Zan et al. 2023). Genomic analyses show TEs constitute > 80% of the Camellia genome, coinciding with its remarkable genome‐wide phenotypic diversity (Zhang et al. 2021). However, this variation has also caused taxonomic ambiguity. Traditional morphology‐based systems divide the genus into 12–18 sections, yet inconsistencies persist (Chang 1998; Ming 1999; Sealy 1958). While molecular phylogenetic studies have clarified some taxonomic controversies, limited sampling and genomic data continue to hinder progress, impeding effective utilisation of the genus's genetic resources.
Among phenotypic traits, flower colour is an excellent model for studying evolution due to its visible variation and well‐defined biochemical basis. In Camellia, flower colour varies strikingly: Section Camellia species typically have red petals, Section Chrysantha species are yellow and many others bear white flowers—differences largely attributed to variations in flavonoid composition (Fan et al. 2022; Fan et al. 2023; Jiang et al. 2024). Here, we use Camellia flower colour to investigate TE's role in phenotypic evolution. Using whole‐genome data, we constructed a comprehensive phylogenetic framework and analysed flower colour evolution alongside geographic distribution. We generated high‐quality genomes for two Camellia species with contrasting petal colours and integrated them with nine published genomes to build a graph‐based pangenome, identifying structural variation across 172 population samples. Our analyses revealed that TE‐associated SVs drive gene expression divergence and genome evolution, providing new insights into the molecular mechanisms linking genomic dynamics to plant phenotypic diversity.
2. Results
2.1. Phylogeny and Petal Colour Diversity
To investigate the relationship between phylogeny and flower colour, we collected and sequenced the genomes of diverse representative Camellia species from China, Vietnam and Japan, the main regions of natural distribution. A total of 237 re‐sequencing datasets were used for further analysis, including 82 species newly sequenced in this study (Table S1). We first reconstructed the Camellia phylogenetic framework based on 4 182 517 single nucleotide polymorphisms (SNPs) (Figure 1a and Figures S1, S2), identifying seven well‐supported clades. Clade 1, comprising Sect. Corallina, Calpandria, Brachyandra, Tuberculata, Longipedicellata, Pseudocamellia, Luteoflora and most species from Vietnam, is located at the base of the phylogenetic tree, indicating early divergence. These species are distributed in southwestern China and northern Vietnam, regions that may represent the origin of Camellia. Clade 2 is primarily composed of Sect. Chrysantha species and is closely related to Clade 1, also showing early divergence. Furthermore, the tree topology was corroborated by population structure analysis, in which seven populations (K = 7) was the optimal solution, and by principal component analysis (PCA) (Figure 1b and Figure S3). The extensive nucleotide sequence data enabled the reclassification of some taxa. C. hongkongensis appeared as an early‐derived species within Clade 4, which comprises Sect. Furfuracea, and shares traits such as a brown, rough capsule surface (similar to C. furfuracea), red petals and oblong leaves.
FIGURE 1.

Phylogeny and phenotypic analysis of flower colour in Camellia. (a) Phylogenetic tree of 237 Camellia accessions. Branch colours indicate seven major clades. Pie charts show the proportion of species from different sections within each clade. (b) Population structure of 237 Camellia accessions. The bar plots indicate membership proportions (q) for each accession at K = 2, 3, 6 and 7. (c) Geographic distribution and petal colour variation among 237 Camellia accessions. Representative flower colours are shown for species from each region. The yellow dashed area indicates the distribution of early‐diverging species.
Phenotypically, the phylogenetic tree assigns a unique basal position to C. pilosperma, which bears white petals. Clade 1, which diverged early, includes three flower colours (yellow, white and red), whereas later‐diverging clades lack the yellow type. Notably, yellow‐flowered Camellia species are restricted to the putative ancestral regions, while red‐ and white‐flowered species have a wider distribution (Figure 1c). Ancestral character states for flower colour were reconstructed using maximum parsimony and maximum likelihood methods; both gave similar results, inferring white as the ancestral flower colour of Camellia (Figure S4). We also observed multiple potential transitions between red and white petals during evolution, such as within Clade 4, which includes red‐petaled C. hongkongensis and white‐petaled species from Sect. Furfuracea, as well as within the evolutionary branches of Sect. Camellia and Sect. Paracamellia.
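To illustrate the idea behind parsimony‐based ancestral state reconstruction, the following minimal sketch applies the classic Fitch algorithm to a toy tree. It is not the software or the tree used in this study: the nested‐tuple tree, the tip names (`tipA` to `tipE`) and their colours are hypothetical and chosen only to show how a root state and a minimum number of colour changes are obtained.

```python
def fitch(tree, tip_states):
    """Fitch parsimony on a binary tree given as nested 2-tuples of tip names.

    Returns (candidate_states_at_this_node, minimum_number_of_changes).
    """
    if isinstance(tree, str):                  # a tip: its observed state
        return {tip_states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, tip_states)
    right_states, right_changes = fitch(right, tip_states)
    shared = left_states & right_states
    if shared:                                 # children agree: no extra change
        return shared, left_changes + right_changes
    # children disagree: keep the union and count one state change
    return left_states | right_states, left_changes + right_changes + 1


# Hypothetical toy data (not taken from the study).
toy_tree = ("tipA", (("tipB", "tipC"), ("tipD", "tipE")))
toy_colours = {
    "tipA": "white",
    "tipB": "white",
    "tipC": "yellow",
    "tipD": "red",
    "tipE": "white",
}

root_states, n_changes = fitch(toy_tree, toy_colours)
print(root_states, n_changes)
```

In this toy case the root is reconstructed as white with two colour changes along the tree. Maximum likelihood reconstruction additionally uses branch lengths and a transition model, which this sketch deliberately omits.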
2.2. Metabolic Basis of Red Floral Colour Variation
The petal coloration in Camellia species primarily includes varying shades of red, white and yellow. Our previous studies revealed that anthocyanins and flavonols are the main pigments in red‐like and yellow‐like petals respectively (Fan et al. 2023; Fan et al. 2022; Jiang et al. 2024). To understand the metabolic basis underlying flower colour variation, UPLC‐MS analysis was conducted on 29 red‐like petal species from Sect. Camellia, Sect. Paracamellia, Sect. Tuberculata and Sect. Longipedicellata, covering all sections with red‐petaled species (Table S2). We identified 23 anthocyanins in red petals, with significant variation in their accumulation among species. Each species contained 2–15 anthocyanins. Cyanidin was detected in all analysed species and accounted for over 90% of the total anthocyanin content. Trace amounts of delphinidin were also detected in C. hongkongensis, C. rubituberculata and C. calcicola. Notably, Cy3GEpc was more abundant in purple‐red petals, while Cy3G predominated in red petals (Figure S5). These findings further support that differences in flavonoid metabolite levels form the biochemical basis for flower colour variation in Camellia.
2.3. High‐Quality Genome Assemblies of Camellia Species
To explore the relationship between genome evolution and flower colour diversity in Camellia, we analysed 11 diploid genomes from Sect. Camellia, Paracamellia, Thea, Furfuracea and Chrysantha, representing the three major flower colours (red, yellow and white). This included two endangered species (C. hongkongensis and C. chrysanthoides) whose genomes were de novo assembled in this study, while the remaining genomes were obtained from previous studies (Chen, Wang, Kong, et al. 2023; Hu, Fan, et al. 2024; Shen et al. 2022; Wang et al. 2025). The two new genomes were sequenced using a combination of HiFi reads (average sequencing depth of 40× per genome), Illumina short reads (coverage depth of 70× per genome) and high‐throughput chromosome conformation capture data (Hi‐C, average coverage depth of 111× per genome). The final assembled genome sizes were 2.85 and 2.66 Gb, with contig N50 values of 88.44 and 86.78 megabases (Mb) respectively, consistent with the results of flow cytometry (Figure S6c,d) and genome survey analyses (Figure S7a,b). A total of 99.5% and 99.9% of contigs were anchored to 15 chromosomes, with only 24 and 19 gaps remaining respectively (Table 1).
TABLE 1.
Summary statistics from assembly and annotation of GH1 and JH3 genomes.
| Species | C. hongkongensis | C. chrysanthoides |
|---|---|---|
| Assembly size (Gbp) | 2.85 | 2.66 |
| Anchor ratio (%) | 99.5 | 99.9 |
| Number of contigs | 51 | 38 |
| Contig N50 (Mbp) | 88.44 | 86.78 |
| Scaffold N50 (Mbp) | 202.95 | 184.84 |
| Number of gaps | 24 | 19 |
| Repeat ratio (%) | 81.77 | 80.36 |
| GC content (%) | 38.88 | 38.66 |
| Assembly completeness (BUSCO, %) | 98.1 | 98.3 |
| Consensus quality value (Merqury QV) | 50.8 | 53.31 |
| Gene number | 54 091 | 52 815 |
| Annotation completeness (BUSCO, %) | 96 | 96.9 |
Multiple genome assessments confirmed the high quality of the GH1 and JH3 assemblies. First, Benchmarking Universal Single Copy Orthologue (BUSCO) analysis revealed completeness scores of 98.1% and 98.3% respectively (Table 1). The consensus quality values (QVs) were 50.8 and 53.31 (Figure S7c,d), while clipping reveals assembly quality (CRAQ) analysis, which evaluates the regional and structural error rates, yielded regional assembly quality indicator (R‐AQI) scores of 99.42 and 98.16 and overall structural assembly quality indicator (S‐AQI) scores of 97.96 and 99.04, further supporting assembly accuracy. Additionally, Illumina read mapping achieved genome coverages of 99.92% and 99.96%. Hi‐C heatmaps also displayed strong diagonal signals, confirming proper spatial proximity of genomic regions (Figure S7e,f).
To minimise pipeline‐specific artefacts, we uniformly annotated all 11 genomes using a combined approach (ab initio, homology‐based and transcriptome‐based prediction). This identified 44 356–54 214 protein‐coding genes per genome, with BUSCO completeness ranging from 91.2% to 97.2% (Table S3). Over 93% of predicted genes had homologues in at least one major database (GO, KEGG, COG, Pfam, Swiss‐Prot, TrEMBL, InterProScan or NR).
2.4. TE Expansion Affects Genome Architecture in Camellia
A species tree constructed using single‐copy orthologous genes revealed phylogenetic relationships among the 11 genomes consistent with those observed in the population data (Figure 2a). TEs accounted for 76.6%–84.5% of the sequences in these genomes (Table S4). Long terminal repeat retrotransposons (LTRs), particularly Gypsy elements, were the most abundant, comprising over 47% of each genome (Figure 2b). To investigate the role of TEs in the genomic evolution of the Camellia genus, we analysed their insertion times and distributions across the phylogeny. Between 10 194 and 24 196 full‐length LTRs (fl‐LTRs) were identified per genome, with evidence of continuous Copia and Gypsy expansion over the past 6 million years (Figure 2c). TEs were mainly distributed within 2 kb surrounding gene bodies (Figure 2d), though Gypsy elements showed a broader range, primarily between 2 and 5 kb (Figure 2e,h), suggesting a wider involvement in gene regulation. TE content was significantly higher in species‐specific regions than in shared (homologous) regions (Figure 2g), highlighting their role in genome differentiation. Moreover, closely related species had highly similar numbers of fl‐LTRs (Figure S8b), as expected given their recent shared ancestry. On average, each genome contained approximately 11 400 TE‐associated genes (Figure 2f), with significant enrichment (p < 0.05) in some Kyoto Encyclopedia of Genes and Genomes (KEGG) terms related to secondary metabolism, including ‘Caffeine metabolism’, ‘Isoquinoline alkaloid biosynthesis’, ‘Anthocyanin biosynthesis’ and ‘Flavone and flavonol biosynthesis’ (Figure 2i). These findings suggest an important role for TEs in the evolution of metabolic diversity and stress responses.
FIGURE 2.

Transposable element (TE) features in Camellia genomes. (a) Phylogenetic tree inferred from single‐copy orthologs using ASTRAL. (b) Bar plot showing the proportion of different TE types. (c) Insertion time distribution of full‐length Copia and Gypsy retrotransposons. (d) TE density per 100 bp in gene bodies and 2 kb flanking regions across 11 Camellia genomes. (e, h) Distribution of different TE types in 10 kb flanking regions of genes. (g) The comparison of TE proportion in shared and species‐specific genomic regions. (f, i) Schematic of a gene‐containing LTR retrotransposon and the top 10 significantly enriched KEGG terms of these genes.
To further investigate the impact of TEs on gene evolution after species formation, we classified genes from 11 species/varieties into four categories using previously reported strategies (Fang et al. 2024; Yang et al. 2022). A total of 46 979 gene families were grouped as Core (13 862, 29.5%), Softcore (7977, 16.9%), Dispensable (21 862, 46.5%) and Private (3278, 6.9%). The proportion of each gene category was similar across species, with average gene counts of 22 260 (43.0%), 12 448 (24.0%), 13 299 (25.4%) and 3933 (7.4%) respectively (Figure 3a,b). Compared with core genes, dispensable and private genes showed higher nonsynonymous‐to‐synonymous substitution ratios (Ka/Ks) (Figure 3c), shorter gene and coding sequence lengths, fewer exons (Figure 3d–f) and more TE insertions (Figure 3g), suggesting that TE accumulation is a characteristic of the dispensable genome compartment. These results further support the role of TEs in genome divergence within the Camellia genus.
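The four‐way classification of gene families can be sketched as a simple rule on the number of genomes in which a family is present. The thresholds below are an assumption based on a common pangenome convention (Core: all genomes; Softcore: at least 90% of genomes but not all; Private: exactly one genome; Dispensable: everything else); the exact cut‐offs of the cited strategies are not stated in this section, and the function name and toy counts are hypothetical.

```python
import math
from collections import Counter


def classify_family(n_present, n_genomes=11, softcore_fraction=0.9):
    """Assign a pan-gene category from the number of genomes carrying a family.

    Thresholds are assumed (a common convention), not taken from the study.
    """
    if not 1 <= n_present <= n_genomes:
        raise ValueError("n_present must be between 1 and n_genomes")
    if n_present == n_genomes:
        return "Core"
    if n_present == 1:
        return "Private"
    if n_present >= math.ceil(softcore_fraction * n_genomes):
        return "Softcore"
    return "Dispensable"


# Hypothetical presence counts for six gene families across 11 genomes.
toy_presence_counts = [11, 11, 10, 7, 2, 1]
summary = Counter(classify_family(n) for n in toy_presence_counts)
print(summary)
```

With 11 genomes and the assumed 90% cut‐off, only families present in exactly 10 genomes fall into the Softcore class; a different cut‐off would shift the boundary between Softcore and Dispensable.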
FIGURE 3.

Pan‐gene analysis across Camellia genomes. (a) Histogram showing the frequency distribution of pan‐gene families across genomes. The pie chart shows the proportion of different pan‐gene family types. (b) Presence/absence variation of all pan‐gene families in 11 genomes. (c–g) Comparisons of Ka/Ks values, gene length, exon number, CDS length and TE insertion numbers in genes. Different letters indicate significant differences at p < 0.05. *** represents a significant difference (p < 0.001).
3. TEs Contribute to Most Structural Variations
To explore large‐scale genomic variations among Camellia species, we aligned the 11 chromosome‐level genomes to the GH1 reference. Consistent with the phylogenetic tree, C. crapnelliana and C. hongkongensis exhibited a higher number of syntenic regions, indicating a closer relationship (Figure S9). In total, 326 392 SVs were identified, including presence/absence variations (PAVs) (24 bp to 1.501 Mb) and 15 074 inversions (100 bp to 71.462 Mb), accounting for 12.2%–25% of each genome (Figure 4a–c). The numbers of presence and absence variations were nearly equal (Table S5). Most SVs were located in intronic and upstream/downstream regions (> 23%), while significantly fewer were found in exonic and UTR regions (Figure 4d), suggesting that SVs affecting coding regions may be under stronger selective pressure. Notably, SVs were mainly concentrated within 2 kb upstream and downstream of genes (Figure 4e). On average, 79.1% of SVs overlapped with TEs, and Gypsy retrotransposons were the most abundant type, accounting for an average of 45% of all TE‐associated SVs (Figure 4f and Figure S10), highlighting the central role of TEs in driving structural variation. We randomly selected 25 SVs for manual validation, and these were confirmed using long reads (Figure S11).
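The proportion of SVs that overlap TEs is, in essence, an interval‐overlap count. The sketch below shows one simple way to compute such a fraction; it is an illustration only, not the pipeline used here. The coordinates are hypothetical, intervals are assumed to be half‐open `(chromosome, start, end)` tuples, and any overlap of at least 1 bp is counted, whereas a real analysis may require a minimum overlap length or fraction.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def fraction_overlapping(svs, tes):
    """Fraction of SV intervals that overlap at least one TE interval.

    svs and tes are lists of (chromosome, start, end) tuples.
    """
    tes_by_chrom = {}
    for chrom, start, end in tes:
        tes_by_chrom.setdefault(chrom, []).append((start, end))
    hits = sum(
        1
        for chrom, start, end in svs
        if any(overlaps(start, end, t_start, t_end)
               for t_start, t_end in tes_by_chrom.get(chrom, []))
    )
    return hits / len(svs) if svs else 0.0


# Hypothetical coordinates (not taken from the study).
toy_svs = [("chr1", 100, 200), ("chr1", 500, 600),
           ("chr2", 50, 80), ("chr2", 300, 400)]
toy_tes = [("chr1", 150, 250), ("chr2", 10, 60), ("chr2", 400, 500)]

toy_fraction = fraction_overlapping(toy_svs, toy_tes)
print(toy_fraction)
```

The nested scan is quadratic per chromosome and only suitable for small examples; for hundreds of thousands of SVs, sorted intervals or an interval tree would be the usual choice.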
FIGURE 4.

Characteristics of structural variations among 11 Camellia genomes. (a) The distribution of GC content, gene count, TE density and SV number in the GH1 reference genome. (b) The number of different SV types across Camellia genomes. (c) Total SV size in each genome, including presence/absence variations and inversions. (d) SV counts in six genomic regions: downstream (+2 kb), exon, intergenic, intron, upstream (−2 kb) and UTR. (e) SV density per 100 bp in gene bodies and 5 kb flanking regions. (f) Proportion of SVs overlapping with TEs.
3.1. Structural Variations Affect Orthologous Gene Expression in Flowers
We identified 37 175 SV genes, defined as the nearest genes located within 10 kb of an SV. Among these, only 598 SV genes were shared across species, markedly fewer than species‐specific SV genes (Figure 5a). Notably, more SV genes were shared among closely related species, suggesting that SVs may drive species divergence by altering gene structure or expression. To explore the effects of SVs on the transcription of orthologous genes in floral tissues, we analysed petal transcriptomes from 10 Camellia species/varieties; the SV genes included 25 652 genes expressed in petals, and 24 950–26 388 collinear genes were used for subsequent analyses. Based on SV insertion sites, these genes were classified as exonic, intronic, upstream, downstream or intergenic. Differential expression analysis revealed that a substantial proportion of SVs influenced gene expression. Specifically, an average of 81% of SVs in coding regions affected expression, compared to 73% for intergenic SVs (Figure 5b). Among the SVs impacting gene expression, 98.8% were TE‐derived, with DNA transposons accounting for 40.3%, despite comprising only 12.4% of the genome's TE content (Figure 5c). A chi‐square test confirmed that this enrichment was highly significant (p = 1.1357 × 10⁻⁴⁴), underscoring the prominent role of DNA transposons in gene regulation.
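As an illustration of how an enrichment of this kind can be tested, the sketch below runs a two‐category chi‐square goodness‐of‐fit test (DNA transposon versus other TE) against an expected share. It is not a reproduction of the reported test: the layout of the original contingency table is not given in this section, and the total of 1000 SVs is a hypothetical count used only to turn the reported percentages (40.3% observed, 12.4% expected) into numbers. For one degree of freedom, the p‐value equals `erfc(sqrt(chi2 / 2))`, so only the standard library is needed.

```python
import math


def p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def chi_square_two_category(observed_hits, total, expected_fraction):
    """Goodness-of-fit test for a single proportion (two categories, df = 1)."""
    expected_hits = total * expected_fraction
    expected_rest = total - expected_hits
    observed_rest = total - observed_hits
    chi2 = ((observed_hits - expected_hits) ** 2 / expected_hits
            + (observed_rest - expected_rest) ** 2 / expected_rest)
    return chi2, p_value_df1(chi2)


# Hypothetical total; only the two percentages come from the text.
toy_total = 1000
toy_chi2, toy_p = chi_square_two_category(
    observed_hits=403, total=toy_total, expected_fraction=0.124)
print(toy_chi2, toy_p)
```

Because the statistic scales with sample size, the p‐value from this toy total is not comparable to the reported one; the sketch only shows the direction and form of the calculation.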
FIGURE 5.

The effect of SVs on gene expression. (a) Venn diagram showing overlap of SV‐associated genes across 10 Camellia genomes. Distinct petals represent unique SV genes; intersecting regions represent shared SV genes. (b) Expression of SV genes by SV location type. (c) Proportion of different TE types in expression‐associated SVs. (d) Schematic representation of presence/absence SV genotype groups. (e) The expression fold‐change of SV genes between the presence and absence genotype groups.
Following established methods, longer sequences (genomes with SV insertions) were classified as present, and shorter sequences (genomes with SV deletions) as absent. SV genes were grouped accordingly into presence and absence categories (Figure 5d). Expression levels of collinear genes in these groups were compared, with fold changes in affected SV genes ranging from −12.42 to −13.45 (Figure 5e). Additionally, a binomial test showed no significant bias toward either upregulation or downregulation of gene expression in petals (p = 0.06). In summary, these results highlight the complex and significant role of SVs in regulating petal gene expression.
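A test for directional bias of this kind can be sketched as an exact two‐sided binomial test against an equal probability of up‐ and downregulation. The counts below are hypothetical, since the numbers of up‐ and downregulated SV genes are not given in this section, and the function name is illustrative rather than taken from any analysis pipeline.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided binomial p-value for k successes in n trials at p = 0.5.

    With p = 0.5 the distribution is symmetric, so the two-sided p-value is
    twice the smaller tail, capped at 1.
    """
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    tail = min(k, n - k)
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


# Hypothetical example: 2 upregulated genes out of 10 affected genes.
toy_p = binomial_two_sided_p(2, 10)
print(toy_p)
```

A p‐value above the chosen significance level, as in this toy case, gives no evidence that insertions push expression preferentially in one direction.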
3.2. SVs Associated With Flower Colour Diversification
To investigate the role of SVs in flower colour diversification in Camellia species, we merged 326 392 presence/absence variations (PAVs) identified across all species into a nonredundant set of 207 853 SVs for analysis. A total of 176 individuals (45 red‐flowered, 25 yellow‐flowered and 106 white‐flowered samples from Sect. Thea, Sect. Camellia, Sect. Chrysantha and Sect. Paracamellia) were mapped to the graph genome, yielding 194 471 SVs across the population. The phylogenetic tree constructed from SV sets showed a topology largely consistent with SNP‐based results (Figure S12). We identified 3337 (Sect. Chrysantha), 3659 (Sect. Camellia) and 11 419 (Sect. Thea) subgroup‐specific high‐frequency SVs (frequency > 0.8 in the target subgroup and < 0.2 in others) in the three major subgroups (Figure 6a), involving 6152 SV genes (Figure 6b). The floral colour diversity observed among these species can be attributed to variations in both the composition and abundance of flavonoid compounds. To investigate the genetic basis of this variation, we identified key structural genes involved in the flavonoid biosynthetic pathway across these genomes by screening for conserved functional domains. No obvious gene losses were detected among the structural genes. Interestingly, no SVs were detected within the exonic regions of these key structural genes; however, SVs were identified within promoter regions (up to 2 kb upstream) of several core structural genes, including DFR, ANS and FLS. These findings suggest that SVs may influence flavonoid biosynthesis by directly or indirectly modulating the expression of structural genes.
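The frequency criterion for subgroup‐specific high‐frequency SVs (frequency > 0.8 in the target subgroup and < 0.2 in all others) can be written as a short filter. In this minimal sketch, the SV identifiers and frequencies are hypothetical, and the input is assumed to be a precomputed table of presence frequencies per subgroup; deriving those frequencies from genotype calls is not shown.

```python
def subgroup_specific_svs(frequencies, high=0.8, low=0.2):
    """Map each qualifying SV to the subgroup in which it is specific.

    frequencies: {sv_id: {subgroup_name: presence_frequency}}
    An SV qualifies if its frequency exceeds `high` in one subgroup and is
    below `low` in every other subgroup.
    """
    specific = {}
    for sv_id, by_group in frequencies.items():
        for target, freq in by_group.items():
            others = [f for group, f in by_group.items() if group != target]
            if freq > high and all(f < low for f in others):
                specific[sv_id] = target
                break  # at most one subgroup can satisfy the rule
    return specific


# Hypothetical frequencies (not taken from the study).
toy_frequencies = {
    "sv1": {"Chrysantha": 0.95, "Camellia": 0.05, "Thea": 0.10},
    "sv2": {"Chrysantha": 0.85, "Camellia": 0.40, "Thea": 0.05},
    "sv3": {"Chrysantha": 0.10, "Camellia": 0.15, "Thea": 0.90},
}

toy_specific = subgroup_specific_svs(toy_frequencies)
print(toy_specific)
```

Here `sv2` is excluded because its frequency in a second subgroup (0.40) is not below the lower threshold, which is the situation the two‐sided criterion is designed to filter out.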
FIGURE 6.

Structural variations potentially linked to flower colour divergence. (a) Heatmap showing high‐frequency SVs in Sect. Thea, Sect. Camellia and Sect. Chrysantha. Red and white indicate high and low frequency respectively. (b) The difference of SV and SV‐associated gene expression across three subgroups. (c) The frequency of accessions with presence or absence SV (associated with MYB114) genotype for Sect. Camellia and Sect. Thea accessions. (d) The comparison of MYB114 expression level and anthocyanin content between SV presence and absence species. (e) The frequency of accessions with the presence or absence SV (associated with MYB7) genotype for Sect. Camellia and Sect. Chrysantha accessions. (f) The comparison of MYB7 expression level and flavonoid content between SV presence and absence species.
Using petal transcriptomic data from 10 white‐, 5 yellow‐ and 8 red‐flowered species, we compared the expression of these genes and found 3503 with significant differential expression among colour types (FDR < 0.05, fold change > 1). Notably, 643 genes were upregulated in red petals; for example, MYB114 has been reported as a transcriptional activator of anthocyanin biosynthesis (Zhang et al. 2024). In Camellia species from the red‐flowered section, MYB114 exhibits higher expression levels in red petals, despite showing minimal differences in its coding sequence. However, a structural variant involving the insertion of a TIR‐type transposable element approximately 2.1 kb upstream of the gene was identified exclusively in these species. In contrast, white‐flowered species from Sect. Thea, which typically lack this structural variant, show markedly lower expression of MYB114 in their petals (Figure 6c,d). In addition, we identified 778 SV‐associated genes that were differentially expressed between yellow and red petals. For instance, a high‐frequency SV specific to Sect. Chrysantha was detected within the ninth intron of a MYB7 ortholog (Figure 6e). This SV consists of a 330 bp insertion comprising an LTR/Copia‐type transposable element. In Arabidopsis thaliana, AtMYB7 functions as a negative regulator of flavonol biosynthesis. Compared to genotypes lacking this SV, the expression of the MYB7 ortholog was significantly reduced in accessions carrying the insertion (Figure 6f). Furthermore, 388 genes were specifically upregulated in white petals, potentially suppressing anthocyanin biosynthesis. These findings provide important insights into the molecular mechanisms underlying flower colour evolution in Camellia.
Another high‐frequency presence SV was specific to Sect. Chrysantha (predominantly yellow‐flowered) but absent in Sect. Camellia (predominantly red‐flowered). This SV corresponds to a 240 bp insertion of a TIR‐type transposable element, located 108 bp downstream of an MYB60 transcription factor ortholog (Figure 7a,b). MYB60 expression was significantly higher in yellow‐flowered species than in red‐flowered ones (Figure 7c), and transient expression assays confirmed its strong suppression of anthocyanin biosynthesis in petals (Figure 7d–f). To test the effect of the SV, we cloned the gene and downstream regions of MYB60 from C. japonica (TE‐absent), C. chrysanthoides (TE‐present) and C. chrysanthoides ▲ (TE‐absent), and fused them to a LUC reporter. Dual‐luciferase assays showed that the fragment with the SV had significantly higher LUC/REN activity than those without it, confirming the SV's positive regulatory role in MYB60 expression (Figure 7g). These findings provide strong evidence for TE‐mediated regulation of homologous gene expression.
FIGURE 7.

An SV potentially associated with flower colour. (a) The frequency of accessions with presence or absence SV (associated with MYB60) genotype for Sect. Camellia and Sect. Chrysantha accessions. (b) Schematic diagram of SV affecting the MYB60 gene in yellow and red‐flowered Camellia species. (c) The comparison of MYB60 expression level and anthocyanin content between SV presence and absence species. (d) Phenotypes of camellia petals overexpressing CcMYB60. Scale bar: 1 cm. (e) Anthocyanin content in MYB60‐overexpressing camellia petals. Data are mean ± SD (n = 3). Asterisks indicate significant differences (Student's t‐test). (f) Relative expression of MYB60 in MYB60‐overexpressing camellia petals. Data are mean ± SD (n = 3). Asterisks indicate significant differences (Student's t‐test). (g) The activity of LUC constructs with or without the 240 bp SV region of MYB60 in transfected N. benthamiana. Each bar represents the mean ± SD from three independent experiments.
4. Discussion
Understanding how genome evolution shapes phenotypic diversity is a key goal in evolutionary biology. Increasing evidence indicates that TEs are more dynamic and functionally important than once believed (Tian et al. 2025; Wang et al. 2021). The genus Camellia, known for its high TE content and striking phenotypic variation, now has genome assemblies for over 10 species. Using these resources alongside population‐scale sequencing data, we identify conserved genomic features across lineages and phenotypes, including TE‐mediated sequence composition differences and expression divergence of orthologous genes.
Due to the genus's high species diversity and extensive morphological variation, species identification and classification within Camellia remain challenging. In this study, we analysed genome sequences from 237 accessions representing 129 species, covering all sections recognised in Flora of China (Chang 1998), 13 sections from Monograph of the Genus Camellia (Ming 1999) and 11 sections in the Sealy classification system (Sealy 1958), filling gaps in previous phylogenetic research. Using whole‐genome resequencing data, we conducted a comprehensive phylogenetic analysis. Our results show that C. hongkongensis and species from Sect. Furfuracea form a single clade, rather than clustering with Sect. Camellia, where they were previously placed by Chang and Ming. Notably, their semi‐persistent tepals and distinct three‐style morphology are atypical for Sect. Camellia. Additionally, we found that C. rubriflora is nested within Sect. Chrysantha, despite its solitary red flowers on young shoots—contrasting with the yellow flowers on mature branches typical of the section. However, it shares leathery, glabrous, oblong‐elliptic leaves with other Sect. Chrysantha species. Similarly, C. longissima falls within Sect. Thea, conflicting with traditional classifications by Chang (1998) and Ming (1999). Chang placed it and C. hekouensis in a separate Sect. Longissima, while Ming grouped them with C. longipetiolata in Sect. Longipedicellata. Recent phylogenomic studies also indicate that Sect. Longissima is not monophyletic and is nested within Sect. Thea (Zhang et al. 2023). These cases highlight the limitations of traditional taxonomy in the face of extensive morphological convergence and variability.
A well‐resolved phylogeny provides a robust foundation for studying phenotypic evolution. Based on our phylogeny, ancestral state reconstruction and distribution patterns, we infer that the most recent common ancestor of Camellia likely had white petals. At the genus level, an evolutionary trend is clear: gradual loss of yellow pigmentation followed by the emergence of red and pink flowers. Intriguingly, a recent study suggests that a comparable transition has occurred within Rosa, which also began diversifying around 6 million years ago (Cheng et al. 2025). This parallel pattern implies convergent natural selection on floral colour preference in both genera. Flower colour is primarily determined by petal pigments (Hu, Chen, et al. 2024). Our analysis across Camellia species revealed that anthocyanins are the main chromogenic compounds for red and pink petals. Previous work showed that yellow pigmentation in Sect. Chrysantha stems from flavonol derivatives rather than carotenoids (Jiang et al. 2024), suggesting that flavonoid biosynthesis genes play a central role in Camellia floral coloration. Using a domain‐based search strategy, we identified key structural genes in flavonoid biosynthesis across these genomes (Du et al. 2024), with no loss of core enzymatic genes. Thus, regulatory changes, such as cis‐regulatory element rearrangements, transcriptional rewiring and epigenetic modifications, likely drive shifts in structural gene expression and may explain floral colour transitions in Camellia. Additionally, we cannot exclude potential contributions from single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels).
Our genomic analysis of Camellia revealed that structural variants (SVs) can either activate or repress gene expression, consistent with findings in other plants (Chen, Wang, Kong, et al. 2023; Li et al. 2024). TEs are a major source of SVs and play a key role in shaping regulatory networks across plant and animal genomes (Wang et al. 2021; Wang et al. 2022). Most SVs in our study overlapped with TEs, primarily LTR retrotransposons, while DNA transposons were significantly enriched among SVs linked to expression changes. A similar enrichment of DNA transposons upstream of species‐specific genes was reported in Gossypium (Tian et al. 2025), suggesting that DNA transposon polymorphisms contribute to genome divergence and homoeolog gene expression divergence. In tomato, TE polymorphisms have also driven phenotypic variation during domestication (Domínguez et al. 2020).
Using a graph‐based pangenome and population resequencing data, we identified lineage‐specific high‐frequency SVs across Camellia lineages. Integrating petal transcriptomics revealed candidate genes for floral colour variation, where expression divergence likely stems from SVs. MYB transcription factors regulate flavonoid metabolism, development and stress responses (Wang et al. 2020; An et al. 2020). Our SV gene set included key flavonoid regulators such as MYB114 and MYB7 (He et al. 2023; Wang et al. 2017). Interestingly, the intronic insertion in MYB7 potentially encodes miR396, which has been reported to regulate floral development by targeting growth‐regulating factors (GRFs) (Yuan et al. 2020); its regulatory mechanisms in Camellia require further investigation and will be the subject of upcoming work. In addition, orthologs of AtMYB60, a negative regulator of anthocyanin biosynthesis in lettuce, carried a Chrysantha‐specific TIR‐type transposon insertion downstream, and transient assays suggested that CcMYB60 suppresses anthocyanin biosynthesis, possibly contributing to yellow flowers. In Brassica, SVs can enhance expression by introducing new transcription factor binding sites. Similarly, we found HD‐ZIP and WRKY transcription factors enriched in SV regions, implying that TEs remodel regulatory networks via novel binding motifs and thereby contribute to Camellia floral colour divergence. Our functional validation supports TE‐mediated SV regulation of homoeolog expression as a key driver of intra‐generic phenotypic diversification.
5. Conclusion
In summary, by integrating high‐quality genome assemblies with extensive population‐scale genomic data, we elucidated the role of SVs in Camellia flower colour evolution. Our study provides a comprehensive genus‐wide genetic and transcriptomic landscape, identifying lineage‐specific, high‐frequency SVs underlying phenotypic divergence. We show that TE‐mediated SVs likely drive regulatory network rewiring, modulating homoeologous gene expression and fuelling transcriptional divergence during the evolutionary diversification of Camellia species.
6. Materials and Methods
6.1. Plant Materials Collection
Genome sequencing data were collected for 237 Camellia germplasm accessions, including 82 species sequenced for the first time. These resources were collected over two decades across nearly the entire natural distribution range of the genus, mainly in China, Vietnam and Japan. All accessions are conserved at the Camellia Germplasm Resource Center, Research Institute of Subtropical Forestry, Chinese Academy of Forestry. They represent all 18 sections of Camellia in the Flora of China (Chang 1998) and exhibit broad phenotypic diversity, including 47 samples from Sect. Camellia, 21 samples from Sect. Chrysantha, 72 from Sect. Thea, 10 from Sect. Furfuracea, 1 from Sect. Archecamellia, 8 species from Vietnam and others. Sample details are provided in Table S1.
6.2. Genome Sequencing
Genomic DNA was extracted from tender leaves using the cetyltrimethylammonium bromide (CTAB) method. For next‐generation sequencing, 150 bp paired‐end reads were generated on the DNBSEQ‐T7 platform (Berry Genomics), yielding ~180 Gb per sample, with QV30 scores exceeding 94%. For circular consensus sequencing (CCS), PacBio HiFi libraries were prepared following the manufacturer's protocols (Pacific Biosciences) and sequenced on the Revio or Sequel II platforms, generating ~108.5 Gb of high‐fidelity (HiFi) reads per sample after quality control. Hi‐C libraries were constructed using the Illumina TruSeq DNA Sample Prep Kit and sequenced on the Illumina HiSeq platform, producing ~300 Gb per sample. To enhance genome annotation, multiple tissues (roots, stems, young and mature leaves, flowers and fruits) were collected from GH1 and JH3, pooled and subjected to CCS‐based transcriptome sequencing, with each composite sample yielding over 100 Gb of data.
6.3. Genome Assembly and Evaluation
Genome size for the two newly sequenced species was estimated using the k‐mer frequency method with Jellyfish (v2.3.0; Marcais and Kingsford 2011) and findGSE (Sun et al. 2018), and flow cytometry was performed to validate the estimates. Contig‐level genome assembly was conducted using Hifiasm (v0.24.0‐r703) with parameters ‘-l 3 -z 20’ (Cheng et al. 2021). Redundant contigs were removed using the Khaper algorithm with default settings (Chen, Wang, Kong, et al. 2023). For pseudomolecule construction, Hi‐C paired‐end reads were aligned to the respective contigs using bwa‐mem2 (v2.2.1) with the ‘-5SP’ parameter (Langarita et al. 2023). The resulting BAM files and primary contigs were processed through the HapHiC (v1.0.6) pipeline to generate scaffold assemblies (Zeng et al. 2024). Misassemblies and chimeric errors were corrected by manual curation using Juicebox (v1.11.08) (Robinson et al. 2018).
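The k‐mer frequency approach to genome size can be illustrated with a minimal sketch. It uses the simple classical estimator (total k‐mer count divided by the depth of the main peak), not the fitted model implemented in findGSE; the error cutoff `min_depth` and the toy histogram are hypothetical values chosen only for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth are treated as sequencing errors (assumed cutoff).
    """
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers above the error cutoff")
    # Depth carrying the largest number of distinct k-mers (main peak).
    peak_depth = max(kept, key=kept.get)
    # Total number of k-mer occurrences above the error cutoff.
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical): low-depth error k-mers plus a peak near 20x.
toy_hist = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100}
size, peak = estimate_genome_size(toy_hist)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

In highly heterozygous genomes the tallest peak may correspond to heterozygous k‐mers, so a real analysis would need to select the homozygous peak explicitly; this sketch ignores that case.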
Genome assembly quality was assessed using multiple approaches. The genome completeness was evaluated using 1614 single‐copy orthologs from the Embryophyta_odb10 database (Seppey et al. 2019). Quality values were estimated with Merqury (v.1.3) to determine base‐level accuracy (Rhie et al. 2020). Chimeric contigs and local misassemblies were detected using CRAQ (v.1.0.9) (Li et al. 2023). Additionally, short reads and HiFi reads were mapped to the genome to assess assembly consistency.
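For readers unfamiliar with k‐mer‐based quality values, the sketch below shows the relationship described for Merqury (Rhie et al. 2020): the per‐base error rate is derived from the fraction of assembly k‐mers that are absent from the read set, and QV is its Phred‐scaled value. The function name and the choice of k = 21 are assumptions made for illustration; the k actually used for QV estimation is not stated in the text.

```python
import math


def kmer_qv(asm_only_kmers, total_asm_kmers, k=21):
    """Phred-scaled consensus quality from k-mer counts.

    asm_only_kmers: k-mers found in the assembly but not in the reads.
    total_asm_kmers: all k-mers found in the assembly.
    """
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers: error rate estimated as zero
    # Probability that a single base is correct, given that a k-mer is
    # correct only if all k of its bases are correct.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    return -10 * math.log10(error_rate)


# Hypothetical counts: 2,100 unsupported k-mers out of 1,000,000,000.
print(round(kmer_qv(2_100, 1_000_000_000), 1))
```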
6.4. Transposable Element and Protein‐Coding Gene Annotation
For TE annotation, the EDTA pipeline, which integrates structural and homology‐based methods, was used with the ‘--sensitive 1’ parameter (Ou et al. 2019). For ‘LTR/unknown’ type transposons, TEsorter was used for reclassification (Zhang et al. 2022). In addition, transposable elements sharing over 90% sequence identity were defined as shared transposable elements. The same gene annotation pipeline was applied to each genome, combining transcript‐based, homology‐based and de novo prediction methods. Briefly, PacBio mRNA sequencing was performed on mixed samples of root, stem, leaf and flower from GH1 and JH3 (100 Gb of data per sample). mRNA data for the remaining genomes were downloaded from the National Genomics Data Center database. Reads were aligned to their respective genomes using Minimap2 (v2.26) or HISAT2 (v2.2.1) (Kim et al. 2019; Li 2018), followed by PASA and StringTie to generate full‐length sequences for de novo training in ANNEVO and Augustus (Bruna et al. 2020; Haas et al. 2003; Pertea et al. 2015). Protein sequences from grape, rice, maize, Arabidopsis, cabbage and cacao were used for homology‐based prediction using miniprot (Li 2023). Gene models were integrated using EvidenceModeler (Haas et al. 2008), and predicted proteins were functionally annotated using the Swiss‐Prot and InterPro databases. To identify structural genes in the anthocyanin pathway, we annotated the genomes using the GFAnno pipeline (v.1.4) (Du et al. 2024). This tool combines HMMSearch and BLASTP, which effectively filters out incomplete and aberrant sequences.
6.5. Whole‐Genome Resequencing and Variant Calling
Genomic DNA was extracted from leaf tissues of 94 samples. DNA libraries were prepared according to the manufacturer's instructions and sequenced on the DNBSEQ‐T7 platform (Berry Genomics), generating an average of 18× coverage per sample. Both new and previously published resequencing data were quality‐filtered using Trimmomatic (v0.39) with default settings (Bolger et al. 2014). Cleaned reads were aligned to the GH1 reference genome using BWA‐mem2 (v2.2.1), and duplicate reads were marked. Variant calling followed established protocols using GATK pipelines (McKenna et al. 2010): HaplotypeCaller was used to generate GVCF files for each sample, CombineGVCFs was used to merge individual GVCFs from 237 samples, and GenotypeGVCFs was used to call genotypes. Final variant filtering was performed using PLINK with the parameters ‘--biallelic-only --geno 0.01 --maf 0.05’.
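The effect of the three PLINK filters can be sketched in a few lines. This is an illustrative re‐implementation of the filtering logic under stated assumptions, not PLINK itself: genotypes are represented as alternate‐allele dosages (0, 1 or 2) with `None` for a missing call, and the function name and toy data are hypothetical.

```python
def passes_filters(genotypes, n_alt_alleles=1, max_missing=0.01, min_maf=0.05):
    """Return True if a variant is biallelic, has a missing-call rate of at
    most max_missing, and has a minor allele frequency of at least min_maf."""
    if n_alt_alleles != 1 or not genotypes:
        return False  # multiallelic site or no samples
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called or missing_rate > max_missing:
        return False
    # Each diploid sample contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf


# Toy variants (hypothetical dosages for four samples).
print(passes_filters([0, 1, 2, 0]))      # common biallelic SNP, no missing calls
print(passes_filters([0, 0, 0, 0]))      # monomorphic, fails the MAF filter
print(passes_filters([0, 1, None, 2]))   # 25% missing, fails the missingness filter
```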
6.6. Phylogenetic and Population Structure Analysis
For the population evolution analysis of the Camellia genus, SNPs with no missing data from 237 samples were used to construct a phylogenetic tree. Two methods were employed: the maximum likelihood‐based IQ‐TREE (v2.3.6) and the recently developed VCF2Dis, which rapidly builds distance‐based phylogenetic trees and was run with default parameters (Minh et al. 2020; Xu et al. 2025). The resulting trees were visualised using the online tool iTOL (Letunic and Bork 2021). For population structure analysis, 1 361 364 SNPs were retained after linkage disequilibrium‐based filtering and analysed using ADMIXTURE across different K values (Alexander et al. 2009). For the phylogenetic tree of 12 genomes, single‐copy orthologs were identified using OrthoFinder2 (v2.5.5) with default parameters (Emms and Kelly 2019). Gene sequences of each orthogroup (OG) were aligned using MAFFT v7.525 with the L‐INS‐i option (Katoh and Standley 2013), and poorly aligned regions were removed using trimAl v1.5.0 with the ‘-automated1’ setting (Capella‐Gutiérrez et al. 2009). Gene trees were constructed using IQ‐TREE (v2.3.6) with the parameter ‘-B 5000’ for ultrafast bootstrap analysis (Minh et al. 2020), and ASTRAL v1.54 was used to infer the final species tree by combining these gene trees (Zhang et al. 2020).
6.7. Ancestral State Reconstruction
Ancestral states for floral colour were reconstructed using Mesquite software (v.4.01) under both maximum parsimony and maximum likelihood methods. The trait was coded as a discrete character with three states: 0 for white, 1 for yellow and 2 for red/pink.
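To make the parsimony step concrete, the following sketch applies Fitch's algorithm for an unordered discrete character to a toy tree, using the same state coding (0 = white, 1 = yellow, 2 = red/pink). It is an illustration only: the tree topology and tip names are hypothetical, and whether Mesquite was run with an equivalent unordered model is an assumption.

```python
def fitch(tree, tip_states):
    """Bottom-up pass of Fitch parsimony on a binary tree.

    tree is either a tip name (str) or a (left, right) tuple.
    Returns (candidate state set at this node, minimum changes below it).
    """
    if isinstance(tree, str):  # a tip: its observed state, no changes
        return {tip_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, tip_states)
    right_set, right_cost = fitch(right, tip_states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra change
        return shared, left_cost + right_cost
    # Children disagree: keep the union and count one state change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical five-taxon tree and flower colour states.
toy_tree = ((("A", "B"), "C"), ("D", "E"))
toy_states = {"A": 2, "B": 2, "C": 0, "D": 0, "E": 1}
root_set, changes = fitch(toy_tree, toy_states)
print(root_set, changes)  # {0} 2
```

In this toy case the root is reconstructed as white with two state changes on the tree; the maximum likelihood reconstruction additionally provides relative probabilities for each state at each node.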
6.8. Gene Family Analysis
Homologous gene families across the 11 Camellia genomes were identified using OrthoFinder (v2.5.5). These were categorised as core (present in all 11 genomes), soft‐core (9–10 genomes), dispensable (2–8 genomes) and unique (present in only one genome). Genes without homologues and tandem duplications were defined as orphan genes.
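The categorisation rule above maps directly onto a small function. The sketch below is illustrative: the genome labels and orthogroup names are hypothetical, and only the thresholds (11, 9–10, 2–8, 1) are taken from the text.

```python
N_GENOMES = 11  # number of Camellia genomes compared in the text


def classify_family(n_present):
    """Assign a pan-genome category from the number of genomes
    in which a gene family is present."""
    if not 1 <= n_present <= N_GENOMES:
        raise ValueError("a family must occur in 1 to 11 genomes")
    if n_present == N_GENOMES:
        return "core"
    if n_present >= 9:
        return "soft-core"
    if n_present >= 2:
        return "dispensable"
    return "unique"


# Hypothetical presence table: orthogroup -> set of genomes containing it.
GENOMES = [f"G{i}" for i in range(1, N_GENOMES + 1)]
toy_families = {
    "OG_a": set(GENOMES),
    "OG_b": set(GENOMES[:9]),
    "OG_c": set(GENOMES[:4]),
    "OG_d": {"G1"},
}
categories = {og: classify_family(len(members)) for og, members in toy_families.items()}
print(categories)
```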
6.9. SV Identification and Validation
Whole‐genome alignments between the 10 genomes and the GH1 reference genome were performed using nucmer (v4.0.0rc1) with parameters ‘-c 100 -l 40’ (Marçais et al. 2018). Alignments were filtered using delta‐filter (‘-m -i 90 -l 100’), and coordinate files were generated using show‐coords (‘-THrd’). Divergent sequences within species‐specific regions were initially called using show‐diff. The putative variants were then aligned to the reference genome using minimap2, and only those alignments with < 80% sequence coverage were retained for downstream analysis. SVs were identified using both SyRI (v1.7.0) and SVIM‐asm with default settings (Goel et al. 2019; Heller and Vingron 2021), and only SVs supported by both tools were retained. In addition, SVs and TEs were considered overlapping if > 50% of the SV sequence coincided with a TE. For SV validation, Integrative Genomics Viewer (v2.16.0) was used to examine a subset of randomly selected SVs: third‐generation sequencing (TGS) reads were mapped to the genomes, and the SV regions were manually inspected.
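The > 50% SV–TE overlap criterion can be expressed as a short interval computation. The sketch below is one possible reading of that rule, assuming half‐open coordinates and that coverage is summed over the union of all TE annotations intersecting the SV (so nested or overlapping TEs are not counted twice); the function names and coordinates are hypothetical.

```python
def te_covered_fraction(sv_start, sv_end, te_intervals):
    """Fraction of the SV [sv_start, sv_end) covered by the union of TEs."""
    if sv_end <= sv_start:
        raise ValueError("SV must have positive length")
    # Clip each TE to the SV and drop TEs that do not intersect it.
    clipped = sorted(
        (max(s, sv_start), min(e, sv_end))
        for s, e in te_intervals
        if s < sv_end and e > sv_start
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start  # close the previous block
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)  # extend the current merged block
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (sv_end - sv_start)


def overlaps_te(sv_start, sv_end, te_intervals, min_fraction=0.5):
    """Apply the '> 50% of the SV sequence' rule from the text."""
    return te_covered_fraction(sv_start, sv_end, te_intervals) > min_fraction


# Hypothetical 100-bp SV with three partially overlapping TE annotations.
print(te_covered_fraction(100, 200, [(50, 130), (120, 160), (190, 300)]))  # 0.7
```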
6.10. Graph Genome Construction and Population SV Calling
A graph‐based pangenome was constructed using the vg graph tool (v1.56.0), incorporating merged insertion (INS) and deletion (DEL) structural variants into the linear GH1 reference genome (Garrison et al. 2018). XG and GCSA index files were generated using the vg index tool with default settings. Next‐generation sequencing (NGS) data from 176 samples were mapped to the graph genome using vg map. Low‐quality alignments were filtered with vg pack using parameters ‐Q 5 ‐s 5. Structural variants were genotyped for each sample using vg call with parameters ‐a ‐s. SV calls from all samples were then merged into a single VCF file using bcftools (v1.13) (Danecek et al. 2021).
6.11. Anthocyanin Measurement
Anthocyanin identification and quantification followed the method described in our previous study (Fan et al. 2023). Briefly, fresh petal powder was incubated in 5 mL of extraction buffer (methanol–water–formic acid–trifluoroacetic acid, 70:27:2:1, v/v) for 24 h in the dark. The extract was filtered through a 0.22‐μm membrane and stored at −20°C. The chromatographic elution gradient was: 0 min at 22% B, 15 min at 28% B and 35 min at 68% B, with detection at 525 nm. A cyanidin‐3‐O‐β‐glucoside (Cy3G) standard (Sigma, St. Louis, MO, USA) was used for quantification.
6.12. Transcriptome Analysis
Total RNA was extracted from petal samples stored in liquid nitrogen using the RN38 Kit (Aidlab, Beijing), following the manufacturer's protocol. NGS libraries were prepared and sequenced on an Illumina platform (Berry Genomics, Beijing). Low‐quality reads and adapters were trimmed with Trimmomatic (v0.39) using default settings. Filtered reads were mapped to reference genomes using HISAT2 with default parameters. Gene expression levels were quantified using featureCounts (Liao et al. 2014), and differential gene expression analysis was performed using the limma package (Ritchie et al. 2015).
6.13. Agrobacterium‐Mediated Transient Transformation
The full‐length MYB60 CDS was cloned into the pCAMBIA1300 vector to generate the CcMYB60‐GFP expression construct, which was transformed into Agrobacterium tumefaciens strain GV3101. After centrifugation, the bacterial pellet was resuspended in infiltration buffer to an OD600 of 1.0. Petal discs were vacuum‐infiltrated at −80 kPa for 3 min, repeated twice. Following infiltration, the petals were incubated in the dark at 18°C for 2 days, then transferred to light conditions for an additional 2 days before phenotypic analysis. Each treatment included three independent biological replicates, with ~20 petal discs per replicate.
6.14. Luciferase Reporter Assay
The target sequences were amplified by PCR and inserted into the LUC luciferase reporter vector using a ClonExpress ultra one step cloning kit (Vazyme, Nanjing). The resulting constructs were introduced into A. tumefaciens GV3101 (pSoup) and transiently expressed in Nicotiana benthamiana leaves via Agrobacterium‐mediated infiltration. Firefly and Renilla luciferase activities were measured using the Bio‐Lite Luciferase Assay System (Vazyme, Nanjing).
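Dual‐luciferase readouts are commonly summarised as the firefly/Renilla ratio per replicate, expressed relative to a control construct. The text does not describe the normalisation used, so the sketch below is an assumption about a typical analysis; all readings and names are hypothetical.

```python
from statistics import mean


def luc_ren_ratios(firefly, renilla):
    """Per-replicate firefly/Renilla ratio (Renilla as internal control)."""
    if len(firefly) != len(renilla):
        raise ValueError("firefly and Renilla readings must be paired")
    return [f / r for f, r in zip(firefly, renilla)]


def fold_change(treatment_ratios, control_ratios):
    """Mean treatment ratio relative to mean control ratio."""
    return mean(treatment_ratios) / mean(control_ratios)


# Hypothetical readings for three replicates each.
control = luc_ren_ratios([400, 440, 360], [100, 110, 90])
treatment = luc_ren_ratios([200, 220, 180], [100, 110, 90])
print(fold_change(treatment, control))  # 0.5
```

A fold change below 1 in such an analysis would indicate reduced reporter activity relative to the control construct.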
Author Contributions
M.F. conceived and designed the study, performed data analysis and wrote the manuscript. Y.Q., H. Jiang and Y.Z. performed the experiments. X.L. prepared the samples. Y.W. and X.L. revised the paper. All authors read and approved the final paper.
Conflicts of Interest
The authors declare no conflicts of interest.
Supporting information
Figure S1: Phylogenetic tree of 237 Camellia accessions inferred with IQ‐TREE based on SNP data. Tucheria hexalocularia and Polyspora speciosa were used as outgroups.
Figure S2: Phylogenetic tree of 237 Camellia accessions inferred with VCF2Dis based on SNP data. Tucheria hexalocularia and Polyspora speciosa were used as outgroups.
Figure S3: Principal component analysis of Camellia accessions. PC1 and PC2 account for 22.14% and 15.55% of the total variation respectively.
Figure S4: Ancestral state reconstruction for flower colour of Camellia, performed using both the maximum parsimony method (above the node of the tree) and the maximum likelihood method (below the node of the tree) for the backbone of nuclear phylogeny. The pie diagrams in the internal nodes represent the most likely ancestral character states and the relative probabilities of each alternative state. The red arrows point to the common ancestor nodes of Camellia, showing the inferred ancestral character states of flower colour.
Figure S5: Heatmap showing the relative content of different anthocyanins in red‐flowered species. Data were normalised to the mean of each row to highlight variations in abundance profiles. Red and blue indicate high and low relative abundance, respectively.
Figure S6: (a, b) Phenotypic characteristics of two newly sequenced species: C. hongkongensis (a) and C. chrysanthoides (b). (c, d) Genome size estimation of C. hongkongensis (GH1) and C. chrysanthoides (JH3) via flow cytometry, using Solanum lycopersicum (Heinz1706; 0.88 Gb) as an internal reference standard.
Figure S7: Assessment of genome assemblies. (a, b) Distribution of 21‐bp k‐mers in the corresponding genomes. The k‐mer abundance is used to calculate the estimated genome size. (c, d) Each plot displays the copy number spectrum of an individual genome with its corresponding quality value. (e, f) Heat maps showing the intensity of Hi‐C chromosome interaction signals.
Figure S8: (a) The phylogenetic relationship and estimated divergence times of Camellia and A. chinensis, R. simsii, V. vinifera, T. cacao and A. trichopoda. (b) The number of pairwise shared and still‐intact fl‐LTRs across species. The reading direction is column to row.
Figure S9: Representative synteny between GH1 and the other camellia assemblies. Gene synteny was assessed using the MCScanX program to identify collinear blocks of syntenic gene pairs.
Figure S10: Proportion of transposable elements within the identified structural variants. Bar graph showing the mean proportion (± SEM) of each transposable element type (Copia, Gypsy, TIR, helitron and other) aggregated from all analysed structural variants (SVs) (n = 10 samples). Different letters indicate significant differences at p < 0.05.
Figure S11: Example of manual structural variant (SV) validation using Integrative Genomics Viewer (IGV). A randomly selected deletion in the JH3 genome is shown, with JH3 long reads mapped to the GH1 genome.
Figure S12: Phylogenetic tree of 176 Camellia accessions based on structural variant (SV) data obtained from a graph‐based genome. Different colours indicate the distribution of the five sections.
Figure S13: Venn diagram showing the number of differentially expressed genes (DEGs) across red, white and yellow petals.
Data S1: Supporting Information
Acknowledgements
In loving memory of my supervisor, Professor Zhenyuan Sun. Your untimely passing left a void that cannot be filled. Though you cannot see the final form of this work, your guiding hand is evident in every chapter—your wisdom was the compass that steered me through its greatest challenges; thank you for everything. We also acknowledge Yunpeng Zhao (College of Life Sciences, Zhejiang University, China) for the technical assistance. We would like to thank Professor Zhonglang Wang, Shixiong Yang and Xiangqin Yu at the Chinese Academy of Sciences (Kunming Institute of Botany), and Wenju Zhang at the University of Fudan for their help in accession collection. We also acknowledge Guoren He for the experimental guidance.
Funding: This work was supported by the Zhejiang Science and Technology Major Program on Agricultural New Variety Breeding (2021C02071‐2).
Data Availability Statement
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
References
- Alexander, D. H., Novembre J., and Lange K. 2009. “Fast Model‐Based Estimation of Ancestry in Unrelated Individuals.” Genome Research 19: 1655–1664.
- An, J. P., Wang X. F., Zhang X. W., et al. 2020. “An Apple MYB Transcription Factor Regulates Cold Tolerance and Anthocyanin Accumulation and Undergoes MIEL1‐Mediated Degradation.” Plant Biotechnology Journal 18: 337–353.
- Bolger, A. M., Lohse M., and Usadel B. 2014. “Trimmomatic: A Flexible Trimmer for Illumina Sequence Data.” Bioinformatics 30: 2114–2120.
- Bruna, T., Lomsadze A., and Borodovsky M. 2020. “GeneMark‐EP Plus: Eukaryotic Gene Prediction With Self‐Training in the Space of Genes and Proteins.” NAR Genomics and Bioinformatics 2: lqaa026.
- Capella‐Gutiérrez, S., Silla‐Martínez J. M., and Gabaldón T. 2009. “trimAl: A Tool for Automated Alignment Trimming in Large‐Scale Phylogenetic Analyses.” Bioinformatics 25: 1972–1973.
- Chang, H. T. 1998. “Theaceae (1) Theoideae 1. Camellia.” In Flora Reipublicae Popularis Sinicae, 49, 3–195. Science Press.
- Chen, J., Wang Z., Tan K., et al. 2023a. “A Complete Telomere‐To‐Telomere Assembly of the Maize Genome.” Nature Genetics 55: 1221–1231.
- Chen, S., Wang P., Kong W., et al. 2023b. “Gene Mining and Genomics‐Assisted Breeding Empowered by the Pangenome of Tea Plant Camellia sinensis.” Nature Plants 9: 1986–1999.
- Cheng, B., Zhao K., Zhou M., et al. 2025. “Phenotypic and Genomic Signatures Across Wild Rosa Species Open New Horizons for Modern Rose Breeding.” Nature Plants 11: 775–789.
- Cheng, H., Concepcion G. T., Feng X., Zhang H., and Li H. 2021. “Haplotype‐Resolved de Novo Assembly Using Phased Assembly Graphs With Hifiasm.” Nature Methods 18: 170–175.
- Danecek, P., Bonfield J. K., Liddle J., et al. 2021. “Twelve Years of SAMtools and BCFtools.” GigaScience 10: giab008.
- Domínguez, M., Dugas E., Benchouaia M., et al. 2020. “The Impact of Transposable Elements on Tomato Diversity.” Nature Communications 11: 4058.
- Drongitis, D., Aniello F., Fucci L., and Donizetti A. 2019. “Roles of Transposable Elements in the Different Layers of Gene Expression Regulation.” International Journal of Molecular Sciences 20: 5755.
- Du, L., Lu C., Wang Z., Zou L., Xiong Y., and Zhang Q. 2024. “GFAnno: Integrated Method for Plant Flavonoid Biosynthesis Pathway Gene Annotation.” Beverage Plant Research 4: e008.
- Emms, D. M., and Kelly S. 2019. “OrthoFinder: Phylogenetic Orthology Inference for Comparative Genomics.” Genome Biology 20: 238.
- Fan, M., Li X., Zhang Y., et al. 2023. “Novel Insight Into Anthocyanin Metabolism and Molecular Characterization of Its Key Regulators in Camellia sasanqua.” Plant Molecular Biology 111: 249–262.
- Fan, M., Wei X., Song Z., et al. 2024. “GWAS‐Driven Gene Mining and Genomic Prediction of Ornamental Traits in Flowering Trees: A Case Study of Camellia japonica.” Horticulture Plant Journal 2468‐0141. 10.1016/j.hpj.2024.05.017.
- Fan, M., Zhang Y., Yang M., et al. 2022. “Transcriptomic and Chemical Analyses Reveal the Hub Regulators of Flower Color Variation From Camellia japonica Bud Sport.” Horticulturae 8: 129.
- Fang, Y., Xiao X., Lin J., et al. 2024. “Pan‐Genome and Phylogenomic Analyses Highlight Hevea Species Delineation and Rubber Trait Evolution.” Nature Communications 15: 7232.
- Garrison, E., Sirén J., Novak A. M., et al. 2018. “Variation Graph Toolkit Improves Read Mapping by Representing Genetic Variation in the Reference.” Nature Biotechnology 36: 875–879.
- Goel, M., Sun H., Jiao W. B., and Schneeberger K. 2019. “SyRI: Finding Genomic Rearrangements and Local Sequence Differences From Whole‐Genome Assemblies.” Genome Biology 20: 277.
- Gong, W., Xiao S., Wang L., et al. 2022. “Chromosome‐Level Genome of Camellia lanceoleosa Provides a Valuable Resource for Understanding Genome Evolution and Self‐Incompatibility.” Plant Journal 110: 881–898.
- Haas, B. J., Delcher A. L., Mount S. M., et al. 2003. “Improving the Arabidopsis Genome Annotation Using Maximal Transcript Alignment Assemblies.” Nucleic Acids Research 31: 5654–5666.
- Haas, B. J., Salzberg S. L., Zhu W., et al. 2008. “Automated Eukaryotic Gene Structure Annotation Using EVidenceModeler and the Program to Assemble Spliced Alignments.” Genome Biology 9: R7.
- He, G., Zhang R., Jiang S., Wang H., and Ming F. 2023. “The MYB Transcription Factor RcMYB1 Plays a Central Role in Rose Anthocyanin Biosynthesis.” Horticulture Research 10: uhad080.
- Heller, D., and Vingron M. 2021. “SVIM‐Asm: Structural Variant Detection From Haploid and Diploid Genome Assemblies.” Bioinformatics 36: 5519–5521.
- Hu, W., Chen Y., Xu Z., et al. 2024. “Natural Variations in the Cis‐Elements of GhRPRS1 Contributing to Petal Colour Diversity in Cotton.” Plant Biotechnology Journal 22: 3473–3488.
- Hu, Z., Fan Z., Li S., et al. 2024. “Genomics Insights Into Flowering and Floral Pattern Formation: Regional Duplication and Seasonal Pattern of Gene Expression in Camellia.” BMC Biology 22: 50.
- Jiang, L., Han L., Zhang W., et al. 2024. “Elucidation of the Key Flavonol Biosynthetic Pathway in Golden Camellia and Its Application in Genetic Modification of Tomato Fruit Metabolism.” Horticulture Research 12: uhae308.
- Katoh, K., and Standley D. M. 2013. “MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability.” Molecular Biology and Evolution 30: 772–780.
- Kim, D., Paggi J. M., Park C., Bennett C., and Salzberg S. L. 2019. “Graph‐Based Genome Alignment and Genotyping With HISAT2 and HISAT‐Genotype.” Nature Biotechnology 37: 907–915.
- Langarita, R., Armejach A., Ibanez P., Alastruey‐Benede J., and Moreto M. 2023. “Porting and Optimizing BWA‐MEM2 Using the Fujitsu A64FX Processor.” IEEE/ACM Transactions on Computational Biology and Bioinformatics 20: 3139–3153.
- Letunic, I., and Bork P. 2021. “Interactive Tree of Life (iTOL) v5: An Online Tool for Phylogenetic Tree Display and Annotation.” Nucleic Acids Research 49: W293–W296.
- Li, H. 2018. “Minimap2: Pairwise Alignment for Nucleotide Sequences.” Bioinformatics 34: 3094–3100.
- Li, H. 2023. “Protein‐To‐Genome Alignment With Miniprot.” Bioinformatics 39: btad014.
- Li, K., Xu P., Wang J., Yi X., and Jiao Y. 2023. “Identification of Errors in Draft Genome Assemblies at Single‐Nucleotide Resolution for Quality Assessment and Improvement.” Nature Communications 14: 6556.
- Li, X., Wang Y., Cai C., et al. 2024. “Large‐Scale Gene Expression Alterations Introduced by Structural Variation Drive Morphotype Diversification in Brassica oleracea.” Nature Genetics 56: 517–529.
- Liang, Y., Gao Q., Li F., et al. 2025. “The Giant Genome of Lily Provides Insights Into the Hybridization of Cultivated Lilies.” Nature Communications 16: 45.
- Liao, Y., Smyth G. K., and Shi W. 2014. “featureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30: 923–930.
- Marcais, G., and Kingsford C. 2011. “A Fast, Lock‐Free Approach for Efficient Parallel Counting of Occurrences of k‐Mers.” Bioinformatics 27: 764–770.
- Marçais, G., Delcher A. L., Phillippy A. M., Coston R., Salzberg S. L., and Zimin A. 2018. “MUMmer4: A Fast and Versatile Genome Alignment System.” PLoS Computational Biology 14: e1005944.
- McKenna, A., Hanna M., Banks E., et al. 2010. “The Genome Analysis Toolkit: A MapReduce Framework for Analyzing Next‐Generation DNA Sequencing Data.” Genome Research 20: 1297–1303.
- Ming, T. L. 1999. “A Systematic Synopsis of the Genus Camellia.” Acta Botanica Yunnanica 21: 149–159.
- Minh, B. Q., Schmidt H. A., Chernomor O., et al. 2020. “IQ‐TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era.” Molecular Biology and Evolution 37: 1530–1534.
- Ou, S., Su W., Liao Y., et al. 2019. “Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline.” Genome Biology 20: 275.
- Pertea, M., Pertea G. M., Antonescu C. M., Chang T. C., Mendell J. T., and Salzberg S. L. 2015. “StringTie Enables Improved Reconstruction of a Transcriptome From RNA‐Seq Reads.” Nature Biotechnology 33: 290–295.
- Rhie, A., Walenz B. P., Koren S., and Phillippy A. M. 2020. “Merqury: Reference‐Free Quality, Completeness, and Phasing Assessment for Genome Assemblies.” Genome Biology 21: 245.
- Ritchie, M. E., Phipson B., Wu D., et al. 2015. “Limma Powers Differential Expression Analyses for RNA‐Sequencing and Microarray Studies.” Nucleic Acids Research 43, no. 7: e47.
- Robinson, J. T., Turner D., Durand N. C., Thorvaldsdóttir H., Mesirov J. P., and Aiden E. L. 2018. “Juicebox.js Provides a Cloud‐Based Visualization System for Hi‐C Data.” Cell Systems 6: 256–258.
- Sealy, J. R. 1958. A Revision of the Genus Camellia. Royal Horticultural Society.
- Seppey, M., Manni M., and Zdobnov E. M. 2019. “BUSCO: Assessing Genome Assembly and Annotation Completeness.” Methods in Molecular Biology 1962: 227–245.
- Shen, T. F. , Huang B., Xu M., et al. 2022. “The Reference Genome of Camellia Chekiangoleosa Provides Insights Into Camellia Evolution and Tea Oil Biosynthesis.” Horticultural Research 9: uhab083. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shi, T. , Zhang X., Hou Y., et al. 2024. “The Super‐Pangenome of Populus Unveils Genomic Facets for Its Adaptation and Diversification in Widespread Forest Trees.” Molecular Plant 17: 725–746. [DOI] [PubMed] [Google Scholar]
- Sun, H. , Ding J., Piednoël M., and Schneeberger K.. 2018. “findGSE: Estimating Genome Size Variation Within Human and Arabidopsis Using k‐Mer Frequencies.” Bioinformatics 34: 550–557. [DOI] [PubMed] [Google Scholar]
- Tian, X. , Wang R., Liu Z., et al. 2025. “Widespread Impact of Transposable Elements on the Evolution of Post‐Transcriptional Regulation in the Cotton Genus Gossypium.” Genome Biology 26: 60. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang, M. , Li J., Qi Z., et al. 2022. “Genomic Innovation and Regulatory Rewiring During Evolution of the Cotton Genus Gossypium.” Nature Genetics 54: 1959–1971. [DOI] [PubMed] [Google Scholar]
- Wang, M. , Li J., Wang P., et al. 2021. “Comparative Genome Analyses Highlight Transposon‐Mediated Genome Expansion and the Evolutionary Architecture of 3D Genomic Folding in Cotton.” Molecular Biology and Evolution 38: 3621–3636. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang, N. , Xu H., Jiang S., et al. 2017. “MYB12 and MYB22 Play Essential Roles in Proanthocyanidin and Flavonol Synthesis in Red‐Fleshed Apple ( Malus sieversii f. Niedzwetzkyana).” Plant Journal 90: 276–292. [DOI] [PubMed] [Google Scholar]
- Wang, X. C., Wu J., Guan M., et al. 2020. “Arabidopsis MYB4 Plays Dual Roles in Flavonoid Biosynthesis.” Plant Journal 101, no. 3: 637–652.
- Wang, X. F., Liu T. J., Feng T., et al. 2025. “A Telomere‐To‐Telomere Genome Assembly of Camellia nitidissima.” Scientific Data 12: 815.
- Xu, L., He W., Tai S., et al. 2025. “VCF2Dis: An Ultra‐Fast and Efficient Tool to Calculate Pairwise Genetic Distance and Construct Population Phylogeny From VCF Files.” GigaScience 14: giaf032.
- Yang, T., Liu R., Luo Y., et al. 2022. “Improved Pea Reference Genome and Pan‐Genome Highlight Genomic Features and Evolutionary Characteristics.” Nature Genetics 54: 1553–1563.
- Yuan, S., Li Z., Yuan N., et al. 2020. “MiR396 Is Involved in Plant Response to Vernalization and Flower Development in Agrostis stolonifera.” Horticulture Research 7: 173.
- Zan, T., He Y. T., Zhang M., et al. 2023. “Phylogenomic Analyses of Camellia Support Reticulate Evolution Among Major Clades.” Molecular Phylogenetics and Evolution 182: 107744.
- Zeng, X., Yi Z., Zhang X., et al. 2024. “Chromosome‐Level Scaffolding of Haplotype‐Resolved Assemblies Using Hi‐C Data Without Reference Genomes.” Nature Plants 10: 1184–1200.
- Zhang, C., Scornavacca C., Molloy E. K., and Mirarab S. 2020. “ASTRAL‐Pro: Quartet‐Based Species‐Tree Inference Despite Paralogy.” Molecular Biology and Evolution 37: 3292–3307.
- Zhang, K., Yu H., Zhang L., et al. 2025. “Transposon Proliferation Drives Genome Architecture and Regulatory Evolution in Wild and Domesticated Peppers.” Nature Plants 11: 359–375.
- Zhang, Q., Folk R. A., Mo Z. Q., et al. 2023. “Phylotranscriptomic Analyses Reveal Deep Gene Tree Discordance in Camellia (Theaceae).” Molecular Phylogenetics and Evolution 188: 107912.
- Zhang, R. G., Li G. Y., Wang X. L., et al. 2022. “TEsorter: An Accurate and Fast Method to Classify LTR‐Retrotransposons in Plant Genomes.” Horticulture Research 9: uhac017.
- Zhang, X., Chen S., Shi L., et al. 2021. “Haplotype‐Resolved Genome Assembly Provides Insights Into Evolutionary History of the Tea Plant Camellia sinensis.” Nature Genetics 53: 1250–1259.
- Zhang, Y. Z., Liu X., Ma H. P., et al. 2024. “R2R3‐MYB Transcription Factor CjMYB114 Interacts With CjbHLH1 to Jointly Regulate Anthocyanins in Camellia japonica. L ‘Fendan’.” Scientia Horticulturae 328: 112897.
Supplementary Materials
Figure S1: Phylogenetic tree of 237 Camellia accessions inferred with IQ‐TREE from SNP data. Tutcheria hexalocularia and Polyspora speciosa were used as outgroups.
Figure S2: Phylogenetic tree of 237 Camellia accessions inferred with VCF2Dis from SNP data. Tutcheria hexalocularia and Polyspora speciosa were used as outgroups.
Figure S3: Principal component analysis of Camellia accessions. PC1 and PC2 account for 22.14% and 15.55% of the total variation, respectively.
Figure S4: Ancestral state reconstruction for flower colour of Camellia, performed using both the maximum parsimony method (above the node of the tree) and the maximum likelihood method (below the node of the tree) for the backbone of nuclear phylogeny. The pie diagrams in the internal nodes represent the most likely ancestral character states and the relative probabilities of each alternative state. The red arrows point to the common ancestor nodes of Camellia, showing the inferred ancestral character states of flower colour.
Figure S5: Heatmap showing the relative content of different anthocyanins in red‐flowered species. Data were normalised to the mean of each row to highlight variations in abundance profiles. Red and blue indicate high and low abundance, respectively.
Figure S6: (a, b) Phenotypic characteristics of two newly sequenced species: C. hongkongensis (a) and C. chrysanthoides (b). (c, d) Genome size estimation of C. hongkongensis (GH1) and C. chrysanthoides (JH3) via flow cytometry, using Solanum lycopersicum (Heinz1706; 0.88 Gb) as an internal reference standard.
Figure S7: Assessment of genome assemblies. (a, b) Distribution of 21‐bp k‐mers in the corresponding genome. The k‐mer abundance is used to estimate genome size. (c, d) Each plot displays the copy number spectrum of an individual genome with its corresponding quality value. (e, f) Heat maps showing the signal intensity of Hi‐C chromosome interactions.
Figure S8: (a) Phylogenetic relationships and estimated divergence times of Camellia and A. chinensis, R. simsii, V. vinifera, T. cacao and A. trichopoda. (b) The number of pairwise shared and still‐intact fl‐LTRs across species. The reading direction is column to row.
Figure S9: Representative synteny between GH1 and the other Camellia assemblies. Gene synteny was assessed using the MCScanX program to identify collinear blocks of syntenic gene pairs.
Figure S10: Proportion of transposable elements within the identified structural variants. Bar graph showing the mean proportion (± SEM) of each transposable element type (Copia, Gypsy, TIR, helitron and other) aggregated from all analysed structural variants (SVs) (n = 10 samples). Different letters indicate significant differences at p < 0.05.
Figure S11: Example of manual structural variant (SV) validation using Integrative Genomics Viewer (IGV). A randomly selected deletion in the JH3 genome is shown, with JH3 long reads mapped to the GH1 genome.
Figure S12: Phylogenetic tree of 176 Camellia accessions based on structural variant (SV) data obtained from a graph‐based genome. Colours indicate the five sections.
Figure S13: Venn diagram showing the number of differentially expressed genes (DEGs) across red, white and yellow petals.
Data S1: Supporting Information
Data Availability Statement
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
