Abstract
Tea plants, which are among the world’s most economically important beverage crops, exhibit extensive genetic diversity and are rich in secondary metabolites. While structural variations (SVs) drive phenotypic diversification, their regulatory roles in transcriptional networks and agronomic traits remain underexplored in this perennial crop. Here, we construct a pangenome from 22 representative tea accessions and their wild relatives. Genomic SV analysis reveals that 22% of the gene promoters contain variants influencing flavonoid, amino acid, and terpenoid biosynthesis. Population SV analysis of 275 tea accessions reveals three haplotypes in the ANS3 promoter, with Hap1, containing a 192 bp insertion, predominantly found in wild relatives but largely lost in modern cultivars. This insertion increases CtANS3 expression and anthocyanin content in wild relatives. Additionally, a 159 bp insertion in the CtLRR1 promoter reduces resistance to Colletotrichum gloeosporioides in wild relatives. Our findings underscore SVs as pivotal regulators of flavor differentiation and adaptive evolution during tea plant domestication.
Subject terms: Secondary metabolism, Natural variation in plants, Structural variation, Biotic
The regulatory roles of structural variations (SVs) in transcriptional networks and agronomic traits of tea plants are largely unexplored. Here, the authors assemble a pangenome from 22 representative tea accessions and their wild relatives and reveal SVs that drive changes in gene expression and the diversification of agronomic traits.
Introduction
Tea is among the world’s most valuable agricultural commodities, driving significant economic activity and playing a central role in international trade1. The two main cultivated tea varieties, Camellia sinensis var. sinensis (CSS) and Camellia sinensis var. assamica (CSA), are widely grown and contribute substantially to global tea production2. Additionally, Camellia taliensis (CT), a wild relative, is distributed primarily in southwestern China and exhibits extensive genetic diversity3,4. Despite the need for continuous genetic improvement to meet evolving consumer preferences, the domestication of tea plants has reduced genetic diversity and created a genetic bottleneck for breeding5. Compared with cultivated tea plants, wild relatives display higher genetic diversity and elevated phenolic acid content but lower flavan-3-ol levels6–8. This narrowing of genetic variation necessitates a deeper exploration of wild tea germplasm to improve breeding programs.
Structural variations (SVs), such as presence/absence variants (PAVs), inversions, translocations and copy number variants (CNVs), play critical roles in plant evolution, domestication, and breeding, influencing key traits such as flowering time9,10, fruit flavor11, stress resistance12, and environmental adaptability13. Recent studies have highlighted the impact of SVs on agronomic traits across various crops. For example, Wang et al. reported that a 209 bp noncoding RNA insertion within the intron of MMK2, a mitogen-activated protein kinase homolog, regulates fruit coloration in apple via SV-mediated expression changes14. Similarly, in soybean, a 10 kb PAV was identified as a key determinant of seed luster variation15. These findings underscore the importance of SVs in crop improvement and adaptive evolution.
However, traditional reference genome-based analyses struggle to fully capture SV diversity, leading to reference bias. While whole-genome resequencing effectively identifies SNPs and small indels, detecting SVs using short-read sequencing remains technically challenging16. The advent of high-throughput long-read sequencing has significantly improved SV detection across crops17. For instance, de novo assembly of 12 maize inbred genomes revealed extensive SVs that contributed to heterosis18, while comparative genomic analysis in tomato revealed beneficial SV-associated genes from wild relatives, enabling crop improvement19.
Despite recent advancements in tea plant pangenome research, existing studies remain limited. Prior efforts constructed pangenomes using continuous long reads (CLR) from 22 tea accessions20, but these efforts focused primarily on cultivated tea plants and lacked systematic SV analyses of wild relatives. Additionally, most genome assemblies have relied on CLR technology, limiting the resolution for detecting complex SVs such as large insertions and deletions. Current pangenomes also lack saturation, with inadequate representation of wild relatives from southwestern China and CSA accessions. Future research should integrate high-quality genomes with multiomics approaches to comprehensively characterize SVs and their functional impact, providing valuable genetic resources for tea breeding.
To address these gaps, we assemble five cultivated tea genomes and one wild relative genome. We then construct a comprehensive pangenome variation map together with 16 published genomes3,5,8,20–22. By integrating transcriptomic data, we identify SVs and demonstrate how PAVs within promoter regions modulate gene expression and drive phenotypic differentiation. Population SV analysis of 275 accessions reveals a 192 bp insertion in the CtANS3 promoter that increases anthocyanin accumulation in wild relatives. Additionally, a 159 bp insertion in the CtLRR1 promoter compromises the resistance of the wild relatives to Colletotrichum gloeosporioides. These findings highlight SVs as key regulators of gene expression and suggest that their enrichment in promoter regions drives phenotypic diversity during domestication.
Results
High-quality genome assembly of representative tea plants
To characterize the diversity of wild and cultivated tea plants, we selected two wild relatives (CT) and 20 cultivated tea plants (eight CSA; one Camellia sinensis var. pubilimba (CSP); and eleven CSS) for pangenome analysis (Table 1). High-quality chromosome-scale reference genomes were generated for five cultivated tea plants (three CSAs, namely, MHDY, QS3, and YK37; and two CSSs, namely, HJY and ZC102) using PacBio Circular Consensus Sequencing (CCS) technology, with sequencing depths ranging from 24.19 to 60.92× (Table 1 and Supplementary Table 1). Additionally, a genome for the wild relative (CT: DLC) was assembled, forming the foundation for comparative genomic analyses. The size of the HiFi-based assemblies averaged 3.18 Gb, with contig N50 values ranging from 131.14 to 142.81 Mb. An average of 96.46% of the assembled contigs were successfully anchored to 15 pseudochromosomes (Table 1). The genome assemblies were assessed using 2326 benchmarking universal single-copy orthologs (BUSCOs). The average completeness of the HiFi-based genome assemblies was 98.2% (Table 1 and Supplementary Data 1). Furthermore, the long terminal repeat (LTR) assembly index (LAI) ranged from 12.59 to 14.74, meeting high-quality reference genome standards. To construct haplotype-resolved assemblies, Hifiasm was used to integrate the HiFi reads with the Hi-C data (Supplementary Table 2), generating both haplotype 1 (hap 1) and haplotype 2 (hap 2) assemblies. The final 10 chromosome-scale haplotype-resolved sequences were aligned to the reference genome using RagTag, achieving an average contig N50 of 209 Mb, with a mean GC content of 39.1% and a BUSCO completeness score of 98.1% (Supplementary Table 3).
Table 1.
Summary of the assembly and annotation of 22 tea plant genomes
| Accession | Sample name | Species name | Size (Gb) | Contig N50 (Mb) | Anchoring rate (%) | Assembly BUSCO (%) | No. of genes | Annotation BUSCO (%) | Repeats (%) | LAI |
|---|---|---|---|---|---|---|---|---|---|---|
| DLC* | – | CT | 3.02 | 1.57 | 97.75 | 97.71 | 40,951 | 93.50 | 79.66 | 14.48 |
| DASZ | – | CT | 3.11 | 2.59 | 99.55 | 97.50 | 39,303 | 90.00 | 87.41 | 10.08 |
| MHDY* | Menghaidaye | CSA | 3.25 | 136.35 | 97.08 | 98.10 | 43,843 | 95.00 | 85.31 | 12.59 |
| QS3* | Qingshui3 | CSA | 3.11 | 131.14 | 96.52 | 98.00 | 43,160 | 95.10 | 84.19 | 14.45 |
| YK37* | Yunkang37 | CSA | 3.21 | 138.18 | 97.04 | 98.30 | 43,214 | 95.00 | 78.49 | 13.94 |
| ZJ | Zijuan | CSA | 3.06 | 94.85 | 97.00 | 97.90 | 40,580 | 94.80 | 72.58 | 9.19 |
| GH3H | Guihong3hao | CSA | 2.95 | 0.88 | 96.51 | 95.20 | 41,717 | 90.50 | 86.60 | 11.33 |
| YH9H | Yinghong9hao | CSA | 3.08 | 0.82 | 95.89 | 97.60 | 43,119 | 93.50 | 79.60 | 10.67 |
| L618 | – | CSA | 3.01 | 94.24 | 98.50 | 98.20 | 39,737 | 94.00 | 69.26 | 8.93 |
| LTDC | Lingtoudancong | CSA | 2.90 | 0.38 | 96.25 | 93.90 | 41,332 | 88.20 | 82.90 | 11.42 |
| DX5H | Danxia5hao | CSP | 3.11 | 0.71 | 96.68 | 95.50 | 40,449 | 91.40 | 79.75 | 11.69 |
| ZC102* | Zhongcha102 | CSS | 3.10 | 139.09 | 96.18 | 98.20 | 44,668 | 94.20 | 84.15 | 14.50 |
| HJY* | Huangjinye | CSS | 3.23 | 142.81 | 95.46 | 98.20 | 44,788 | 94.20 | 83.47 | 14.74 |
| BHZ | Baihaozao | CSS | 2.96 | 2.24 | 99.26 | 96.40 | 41,474 | 92.60 | 85.03 | 10.34 |
| AJBC | Anjibaicha | CSS | 3.24 | 62.73 | 88.50 | 98.20 | 40,637 | 94.30 | 73.77 | 10.07 |
| JMZ | Jinmingzao | CSS | 2.79 | 1.25 | 97.64 | 95.50 | 38,789 | 90.60 | 84.25 | 13.19 |
| JX | Jinxuan | CSS | 3.00 | 1.34 | 98.28 | 97.10 | 41,521 | 91.50 | 85.04 | 7.56 |
| FDDB | Fudingdabai | CSS | 3.10 | 0.92 | 97.50 | 96.40 | 38,789 | 91.40 | 84.44 | 11.85 |
| ZYQ | Zhuyeqi | CSS | 2.93 | 1.69 | 98.32 | 96.20 | 43,412 | 91.00 | 84.86 | 6.72 |
| SCZ | Shuchazao | CSS | 2.94 | 0.60 | 86.73 | 95.70 | 48,605 | 90.80 | 86.78 | 12.45 |
| HD | Huangdan | CSS | 2.94 | 3.61 | 99.83 | 97.40 | 42,709 | 92.90 | 70.75 | 16.16 |
| TGY | Tieguanyin | CSS | 3.11 | 1.94 | 98.96 | 96.50 | 44,177 | 93.90 | 78.20 | 10.17 |
*Genomes reported in this study.
The number of protein-coding genes varied considerably among previously published genomes20, possibly because of inconsistent annotation pipelines. To ensure consistency, we reannotated 22 genomes using a unified pipeline and evaluated annotation completeness with the eudicots_odb10 benchmark. This approach revealed that eight HiFi-sequenced genomes achieved an average annotation quality of 94.6% (Table 1 and Supplementary Data 2). On the basis of the reannotated protein sequences, we constructed a phylogenetic tree of two wild and 20 cultivated tea plants using 4354 single-copy orthologous genes, with Actinidia chinensis Planch23 as the outgroup. The results revealed clear genetic differentiation between wild and cultivated tea plants, which formed two distinct clades estimated to have diverged approximately 5.71 million years ago (MYA) (Fig. 1a). Within the cultivated tea plants, three subclades were identified (CSA, CSP, and CSS). CSP appeared as a distinct lineage nested within the CSS group, showing clear genetic divergence from typical CSS accessions but remaining more closely related to CSS than to CSA. This phylogenetic pattern is consistent with the findings of earlier studies on tea plant population genetics20,24.
Fig. 1. Phylogenetic analysis and transposable element dynamics in tea plant genomes.
a Phylogenetic analysis of 22 tea plant genomes using kiwifruit (ACH) as an outgroup, with a heatmap showing presence/absence of genes in pan-gene families. The genomes cluster into four clades: wild relative CT and cultivated groups (CSA, CSS, CSP). ACH Actinidia chinensis Planch. b LTR content proportions in the genomes assembled in this study, with Gypsy families (44.3%) predominating over Copia (9.6%). c Insertion time distribution of full-length Copia and Gypsy LTRs across the 22 genomes. Each line represents an individual genome from (a), with Gypsy elements shown in blue and Copia in purple. d The expression levels of genes with LTR insertions in their promoters (n = 1746) versus genes without such insertions (n = 42,922) across various tissues of C. sinensis cv. Zhongcha102. Box plots show the median (center line), first and third quartiles (box bounds), and whiskers extending to 1.5× the interquartile range, with individual data points overlaid. Statistical significance was assessed using two-tailed Student’s t-tests. TPM transcripts per million mapped reads. Source data are provided as a Source Data file.
Repetitive sequences accounted for 78.49–85.31% of the assembled genomes, with LTR retrotransposons (LTR-RTs) constituting the dominant fraction. Across the HiFi-assembled genomes, LTR-RTs made up an average of 70.9% of the genomic content (Fig. 1b). To further characterize these LTR-RTs, analysis using LTR_retriever revealed full-length LTR-RTs across all 22 genomes, ranging from 17,388 in the accession LTDC to 25,566 in the accession HJY (Supplementary Table 4). A comparative analysis of LTR insertion times revealed that the Gypsy and Copia families underwent continuous expansion approximately 4 MYA, followed by a pronounced increase approximately 0.25 MYA. The insertion activity of the Gypsy family was significantly greater than that of the Copia family, suggesting that the expansion of the Gypsy family was more extensive (Fig. 1c). We further analyzed genes associated with LTR insertions, with an average of 1490 genes per genome affected by LTRs (Supplementary Table 4). Transcriptomic analysis across various tissues of ZC102 revealed that genes without LTR insertions in their promoters presented significantly higher average expression levels compared with those with LTR insertions (Fig. 1d). The same phenomenon was also observed in CT and CSA accessions, in which genes exhibited lower expression levels in the presence of LTR insertion in the promoter than in the absence of insertion (Supplementary Fig. 1).
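As an illustration of how LTR insertion times such as those in Fig. 1c are commonly estimated, the sketch below converts the sequence identity between the two LTRs of one element into an age using the Jukes-Cantor distance K and T = K / (2μ). This is a generic approach and not necessarily the exact procedure used here; the function names are hypothetical, and the substitution rate `mu` is a placeholder that must be replaced with a rate appropriate for tea plants.

```python
import math


def jukes_cantor_distance(identity: float) -> float:
    """Jukes-Cantor corrected distance between the two LTRs of one element."""
    p = 1.0 - identity  # observed proportion of differing sites
    if not 0.0 <= p < 0.75:
        raise ValueError("identity must be in (0.25, 1.0] for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_time_years(identity: float, mu: float) -> float:
    """Estimated insertion time T = K / (2 * mu).

    mu is the substitution rate per site per year (assumed, user supplied).
    The factor 2 reflects that both LTRs accumulate substitutions independently.
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    return jukes_cantor_distance(identity) / (2.0 * mu)


# Hypothetical example: two LTRs that are 99% identical, with a placeholder rate.
example_age = insertion_time_years(0.99, mu=6.5e-9)
```

Because the estimate scales inversely with `mu`, the absolute ages depend strongly on the chosen rate, whereas the relative ordering of insertion bursts does not.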
Gene gain and loss during tea plant domestication
We constructed a pangenome across 22 genomes using OrthoFinder clustering on the annotated gene models. Pangenome simulation analysis revealed that the total number of genes and core genes in the pangenome gradually approached stability when more than 20 genomes were included (Fig. 2a). A total of 15,216 core families were shared across all 22 genomes, constituting 35.1% of the total clusters (Fig. 2b); the genes within these core families represented 55.2% of all the genes (Supplementary Fig. 2a). Softcore families, defined as those present in 20 or 21 genomes, made up 20.2% of the total clusters. The remaining clusters were classified as dispensable (present in 2–19 genomes) or private (unique to a single genome) gene families. Compared with dispensable genes, core and softcore genes exhibited longer coding sequences (CDSs) and higher expression levels, along with lower Ka/Ks ratios, indicating stronger functional constraints and evolutionary conservation (Supplementary Fig. 2b–d).
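The family categories defined above can be expressed as a simple rule on the number of genomes in which a family is present. The following is a minimal sketch under those stated thresholds; the function name and the toy family identifiers are hypothetical and are not taken from the analysis pipeline.

```python
from collections import Counter


def classify_family(n_present: int, n_genomes: int = 22) -> str:
    """Assign a pangenome category from the number of genomes containing a family."""
    if not 1 <= n_present <= n_genomes:
        raise ValueError("n_present must be between 1 and n_genomes")
    if n_present == n_genomes:
        return "core"          # present in all genomes
    if n_present >= n_genomes - 2:
        return "softcore"      # 20 or 21 of 22 genomes
    if n_present >= 2:
        return "dispensable"   # 2-19 genomes
    return "private"           # unique to a single genome


# Made-up presence counts for five illustrative families.
toy_counts = {"fam_a": 22, "fam_b": 21, "fam_c": 20, "fam_d": 7, "fam_e": 1}
toy_classes = {fam: classify_family(n) for fam, n in toy_counts.items()}
category_totals = Counter(toy_classes.values())
```

Expressing the softcore bound relative to `n_genomes` keeps the rule usable if the number of genomes in the pangenome changes.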
Fig. 2. Pangenome analysis and gene gain/loss patterns.
a Pan-gene and core gene family dynamics across sample sizes. Lines connect median values, with error bars representing the interquartile range (IQR) based on random sampling permutations (n = 22 genomes). b Composition of the tea pangenome. The histogram shows gene family distribution changes with increasing genome number, while the pie chart displays the relative proportions of different gene families. c KEGG pathway enrichment analysis for genes gained in CSA compared to CT. d KEGG pathway enrichment analysis for genes gained in CSS compared to CT. The heatmaps show enriched KEGG pathways with color intensity representing statistical significance (-log10(p-value)) calculated using hypergeometric test (one-tailed).
To study the dynamics of gene numbers during tea plant domestication, a collinearity analysis was performed between wild and cultivated tea plants using MCScan. Gene gains and losses were identified based on linear gene pairs within collinearity blocks. We found that 1132 and 1553 linear gene pairs were completely absent in CT compared with the CSA and CSS accessions, respectively (Supplementary Data 3 and 4). These genes were significantly enriched in biological processes, including anthocyanin biosynthesis, isoflavonoid biosynthesis, caffeine metabolism, terpenoid metabolism, photosynthesis, and plant-pathogen interactions (Fig. 2c, d). Conversely, the CT accessions presented higher gene copy numbers in 349 gene pairs (80% absent in CSA and CSS; Supplementary Data 5), which were enriched in the carotenoid biosynthesis, flavonoid biosynthesis, and amino acid metabolism pathways (Supplementary Fig. 3 and Supplementary Data 6). These findings indicated that the gain of these genes in cultivated tea plants might be linked to selective breeding for enhanced flavor quality during domestication, which is consistent with the superior flavor profile observed in cultivated tea plants25.
Extensive SVs within tea plant genomes
To identify large-scale SVs in the tea plant pangenome, we first aligned 26 genome assemblies (including haplotype-resolved genomes) to the reference genome (ZC102hap2) and then identified SVs using SyRI software (Fig. 3a and Supplementary Fig. 4). In total, 1,375,161 SVs were identified, comprising 939,060 PAVs (referring to >50 bp insertions and deletions in this study), 208,748 CNVs, 12,549 inversions, and 214,804 translocations (Supplementary Table 5). PAVs were the most prevalent SV class, accounting for 68.3% of the total SVs, followed by translocations (15.6%) and CNVs (15.2%), while inversions were the least frequent (approximately 1%). Furthermore, we found that PAVs had similar distribution frequencies and were widely distributed across each chromosome. In contrast, inversions and translocations were significantly less frequent and exhibited a random chromosomal distribution. Our analysis confirmed the substantial transposable element (TE) content in the tea plant genome, with most TEs inserted in intergenic regions (Fig. 3b).
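The class proportions follow directly from the SV counts reported above. As a quick arithmetic check, a short sketch such as the following recomputes each share from the counts; only the variable names are introduced here.

```python
# SV counts by class, as reported in the text (Supplementary Table 5).
sv_counts = {
    "PAV": 939_060,
    "CNV": 208_748,
    "inversion": 12_549,
    "translocation": 214_804,
}

total_svs = sum(sv_counts.values())

# Percentage share of each SV class in the total SV set.
sv_shares = {name: 100.0 * count / total_svs for name, count in sv_counts.items()}

for name, share in sorted(sv_shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1f}%")
```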
Fig. 3. Structural variations in the tea plant genomes.
a Distribution of five SV types (CNV copy number variation, INS insertion, DEL deletion, INV inversion, TRANS translocation) across tea plant genomes, with colors indicating population membership (CT, CSA, CSS, CSP). b SV distribution across 15 chromosomes in 26 genomes. c Proportion of hemizygous genes in each haplotype, with green representing CSA accessions and yellow representing CSS accessions. d Percentage overlap between PAVs and different genomic regions across 26 genomes. e Intersection of upregulated differentially expressed genes between CSA vs CT (the ratio of CSA to CT) and CSS vs CT (the ratio of CSS to CT), highlighting genes with deletion variants in promoter regions based on transcriptome analysis. f KEGG enrichment of genes identified in (e), statistical significance (−log10(p-value)) calculated using hypergeometric test (one-tailed). Source data are provided as a Source Data file.
The application of long-read sequencing has greatly improved SV detection, enabling a more comprehensive characterization of hemizygous genes resulting from structural rearrangements in heterozygous diploid genomes. We analyzed SVs between the hap1 and hap2 genomes, defining hemizygous genes as those harboring heterozygous PAV variants. The percentage of hemizygous genes ranged from 3.21% to 4.76%, with a mean frequency in the CSA accessions of 3.96% (n = 4), which was slightly greater than the 3.39% observed in the CSS accessions (n = 6) (Fig. 3c). GO enrichment analysis revealed that hemizygous genes were primarily associated with plant-type primary cell wall biogenesis, cellulose biosynthesis, and ovule development, suggesting that these genes played a role in male gametophyte development (Supplementary Fig. 5).
PAVs influence metabolic gene expression in tea plant domestication
SVs can modulate gene expression by altering gene sequences or modifying cis-regulatory elements26. Since PAVs were the predominant SV type identified in our analysis, we examined their genomic distribution to assess their functional impact. We found that most SVs occurred in noncoding regions, >79.3% of which were intergenic, >20.7% of which were intronic, and only <0.3% of which were in exonic regions (Fig. 3d and Supplementary Fig. 6). Transcriptome data from 21 accessions revealed that 22% of the gene promoters harbored PAVs (9304/42,623; Supplementary Data 7 and 8), which significantly influenced gene expression (Fig. 3e and Supplementary Fig. 7). In cultivated tea plants, upregulated genes were enriched in taste- and aroma-related pathways, including flavonoid, phenylpropanoid, and amino acid metabolism, as well as environmental adaptation and secondary metabolism, supporting their roles in domestication-related traits (Fig. 3f and Supplementary Fig. 7). These findings highlighted that PAVs in promoter regions played important roles in regulating the expression of genes associated with flavonoid, amino acid, and terpenoid metabolism in tea plants.
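Identifying genes whose promoters harbor PAVs amounts to an interval-overlap test between PAV coordinates and an upstream window of each gene. The sketch below illustrates this for a single chromosome. The 2 kb promoter length, the `Gene` record, and all coordinates are assumptions made for illustration, since the window actually used is not stated in this section.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Gene(NamedTuple):
    gene_id: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


PROMOTER_LEN = 2000  # assumed window length; not taken from the paper


def promoter_interval(gene: Gene, length: int = PROMOTER_LEN) -> Tuple[int, int]:
    """Return the upstream window of a gene, respecting its strand."""
    if gene.strand == "+":
        return max(1, gene.start - length), gene.start - 1
    return gene.end + 1, gene.end + length


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def genes_with_promoter_pav(genes: Iterable[Gene],
                            pavs: Iterable[Tuple[int, int]]) -> Set[str]:
    """IDs of genes whose promoter window overlaps at least one PAV."""
    pav_list = list(pavs)
    hits = set()
    for gene in genes:
        p_start, p_end = promoter_interval(gene)
        if any(overlaps(p_start, p_end, s, e) for s, e in pav_list):
            hits.add(gene.gene_id)
    return hits


# Toy data on one hypothetical chromosome.
toy_genes = [Gene("g1", 5000, 7000, "+"), Gene("g2", 10000, 12000, "-"),
             Gene("g3", 20000, 21000, "+")]
toy_pavs = [(4500, 4691), (12100, 12258), (30000, 30100)]
toy_hits = genes_with_promoter_pav(toy_genes, toy_pavs)
```

The nested scan is quadratic and is meant only to show the logic; for genome-scale data, a sorted sweep or an interval tree would be the usual choice.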
A 192 bp insertion in the CtANS3 promoter increases anthocyanin content
Despite increasing interest in tea genomics, large-scale SV analyses across diverse tea populations remain limited. We constructed a pangenome variation map by integrating variant data from multiple genomes. Using vg call, we genotyped SVs across 275 tea accessions (Supplementary Data 9). To identify selection signatures associated with domestication, we analyzed genomic regions under selection during tea plant domestication. We identified 522 shared selective intervals containing 886 SV sites, which narrowed to 91 candidate genes after promoter region mapping (Fig. 4a and Supplementary Data 10). Transcriptomic analysis across accessions revealed that the ANS3 gene (zc102_237396) on chromosome 14 was expressed at significantly higher levels in wild tea plants than in cultivated tea plants (Fig. 4b). We identified three SV-derived haplotypes in the ANS3 promoter region, namely Hap1 (192 bp insertion), Hap2 (deletion), and Hap3 (283 bp insertion), each showing a distinct distribution across populations (Fig. 4c). Hap1 was predominant in CT (95%) but absent in cultivated tea plants. In contrast, the cultivated CSA accessions were dominated by Hap2 (78%) with a moderate presence of Hap3 (12%), while the CSS accessions were primarily fixed for Hap3 (78%), with minimal retention of Hap2 (5%) (Fig. 4d). These results suggested that domestication led to the gradual loss of Hap1, whereas the transition from CSA to CSS involved a shift from Hap2 to Hap3 as the dominant haplotype.
Fig. 4. Selection signatures in tea populations based on SVs.
a FST distribution of SVs between wild relatives (CT) and cultivated tea plants (CSA, CSS), with the black dashed line indicating the top 5% threshold. b TPM expression levels of ANS3 across CT (n = 7), CSA (n = 7), and CSS (n = 7), measured in seven accessions per group with three biological replicates each. Box plots show the median (center line), first and third quartiles (box bounds), and whiskers extending to 1.5× the interquartile range. Statistical significance was determined by two-tailed Student’s t-tests. c Schematic representation of three haplotypes in the ANS3 gene structure. The 192 bp and 283 bp sequences are independent and unrelated segments. d Frequency distribution of three ANS3 haplotypes (Hap1INS192, Hap2DEL, and Hap3INS283) across 275 tea accessions. Source data are provided as a Source Data file.
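To make the FST-based scan in panel a more concrete, the sketch below computes a per-site FST for one biallelic SV from the allele frequencies in two populations and derives a top 5% cutoff from a list of per-site values. It uses Wright's definition, FST = (HT − HS) / HT, with equal population weights. This is an assumption made for illustration, because the estimator and software used for the scan are not specified in this section.

```python
from typing import Sequence


def wright_fst(p1: float, p2: float) -> float:
    """FST = (HT - HS) / HT for one biallelic site and two equally weighted populations."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0  # site is monomorphic across both populations
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def top_fraction_threshold(values: Sequence[float], fraction: float = 0.05) -> float:
    """Smallest value that still belongs to the top `fraction` of the list."""
    if not values:
        raise ValueError("values must not be empty")
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[k - 1]


# Illustration with the Hap1 frequencies reported in the text
# (95% in CT, absent in cultivated tea), treated as a presence/absence allele.
hap1_fst = wright_fst(0.95, 0.0)
```

Sites whose FST is at or above the cutoff returned by `top_fraction_threshold` would be treated as candidates, which mirrors the dashed-line threshold in the panel.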
The wild relatives exhibited distinct purple shoots, whereas cultivated tea plants typically displayed green shoots (Fig. 5a), a phenotypic difference that was likely governed by anthocyanin metabolism. A previous study reported that ANS participates in anthocyanin biosynthesis27. Therefore, we speculated that the promoter variations in ANS3 might contribute to the differences in shoot color between wild and cultivated tea plants. PCR amplification confirmed that most CT accessions contained a homozygous 192 bp insertion fragment in the ANS3 promoter region. In contrast, the CSS accessions harbored a heterozygous 283 bp insertion, whereas the CSA accessions showed no PAV sites (Fig. 5a). The two promoter insertions (192 bp and 283 bp) were shown to be structurally unrelated by alignment analysis (Supplementary Fig. 8). Distribution analysis based on resequencing data revealed that the 192 bp insertion occurred in wild tea plants including Camellia gymnogyna, C. kwangsiensis, C. crassicolumna, and C. tachangensis, but was absent in C. ptilophylla, C. fangchengensis, and C. grandibracteata (Supplementary Fig. 9). Subsequent PCR validation confirmed the insertion in C. gymnogyna, C. kwangsiensis, and C. crassicolumna (Supplementary Fig. 10). These findings suggested that the 192 bp insertion was an ancestral variant that had been preserved in certain tea populations. Quantitative real-time PCR (qRT-PCR) analysis revealed that the expression of ANS3 in CT accessions was significantly greater than that in cultivated tea plants (Fig. 5b, d). Similarly, anthocyanin compounds, including delphinidin and cyanidin, were significantly more abundant in CT than in cultivated tea plants (Fig. 5c, e and Supplementary Fig. 11). Furthermore, correlation analysis across the 39 tea accessions revealed a significant positive correlation between ANS3 expression levels and anthocyanin content (R = 0.79) (Fig. 5f). However, no significant differences in ANS3 expression levels or anthocyanin content were detected between CSS and CSA.
Fig. 5. Effects of SVs in the ANS3 promoter on gene expression.
a Genotyping of ANS3 in 15 accessions using a 2000 bp DNA marker. PCR analysis confirmed the presence of a 192 bp insertion in five CT accessions and a 283 bp insertion in the CSS accessions. The experiment was independently repeated three times with similar results. b qRT-PCR analysis of ANS3 expression in 39 tea accessions (CT, n = 11; CSA, n = 16; CSS, n = 12). Data are presented as mean ± SD (n = 3 biological replicates). c HPLC quantification of delphinidin in 39 accessions (CT, n = 11; CSA, n = 16; CSS, n = 12), with DW indicating dry weight. Data are presented as mean ± SD (n = 3 biological replicates). d, e Box plots showing that both ANS3 expression and delphinidin content are significantly elevated in CT accessions, but lowest in CSA and CSS accessions. Statistical significance between groups was assessed using two-tailed Student’s t-tests. f Correlations between ANS3 expression and delphinidin content across CT, CSA, and CSS accessions. Statistical significance was assessed using two-tailed Student’s t-tests. g Schematic of the dual-luciferase assay used to assess ANS3 promoter activity. h, i Luminescence images (h) and luciferase (LUC) activity (i) showing that proCtANS3 exhibits significantly higher promoter activity than proCsaANS3 and proCssANS3 in transient N. benthamiana leaf expression assays. Truncating the 192 bp insertion significantly reduces proCtANS3-192 activity. Relative LUC activity was normalized to REN activity. Empty represents the empty vector. Data are shown as mean ± SD (n = 4). Different lowercase letters indicate significant differences at P < 0.05 (one-way ANOVA with Waller–Duncan multiple-range test). Source data are provided as a Source Data file.
To further investigate the genetic basis of shoot color variation, we examined sequence variations in the UTR and CDS regions of ANS3 across the wild and cultivated tea genomes. We identified some SNPs in the CT, CSA, and CSS accessions, most of which were synonymous mutations. Additionally, the remaining nonsynonymous SNPs did not occur at the conserved enzymatic activity sites of the ANS3 protein (Supplementary Fig. 12). Importantly, these genomic variations were not clearly correlated with anthocyanin content or gene expression patterns. These findings suggested that promoter variations in the ANS3 region were more likely to act at the transcriptional level. Therefore, we performed promoter activity assays and found that the activity of the CtANS3 promoter was significantly greater than that of the CsANS3 promoter (Fig. 5g, h). Deletion of the 192 bp fragment substantially reduced promoter activity, confirming its crucial role in driving differences in ANS3 expression between CT and cultivated tea plants (Fig. 5i). Conversely, compared with the CSA accessions, the CSS accessions harbored a distinct 283 bp insertion (CssANS3), but this did not significantly affect ANS3 expression, anthocyanin content, or promoter activity (Fig. 5b–i). These findings underscored the functional significance of promoter SVs in CtANS3 from wild tea plants in shaping tea anthocyanin biosynthesis and domestication-driven phenotypic variation.
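In the dual-luciferase assays, promoter activity is expressed as firefly luciferase (LUC) signal normalized to Renilla luciferase (REN) signal and summarized as mean ± SD across replicates. A minimal sketch of that normalization is shown below; the function names and the readings are invented for illustration and are not measured values.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def relative_luc(luc: Sequence[float], ren: Sequence[float]) -> List[float]:
    """Per-replicate LUC/REN ratio; REN serves as the internal control."""
    if len(luc) != len(ren):
        raise ValueError("LUC and REN must have the same number of replicates")
    if any(r <= 0 for r in ren):
        raise ValueError("REN readings must be positive")
    return [l / r for l, r in zip(luc, ren)]


def summarize(ratios: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the normalized activity."""
    return mean(ratios), stdev(ratios)


# Invented readings for one construct with four replicates (n = 4).
toy_luc = [1200.0, 1350.0, 1100.0, 1250.0]
toy_ren = [400.0, 450.0, 380.0, 410.0]
toy_mean, toy_sd = summarize(relative_luc(toy_luc, toy_ren))
```

Normalizing each replicate before averaging, rather than dividing mean LUC by mean REN, keeps well-to-well differences in transformation efficiency from inflating the variance.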
Phenotypic and gene expression characterization of tea plant resistance to C. gloeosporioides infection
Anthracnose is a leaf disease caused by the fungus C. gloeosporioides that severely damages the growth of tea plants28. However, there is a knowledge gap regarding the association between the genome variation and resistance to anthracnose among different tea accessions. Thus, the mycelium of C. gloeosporioides was used to infect the leaves of CT, CSA, and CSS. Lesion area measurements revealed significantly larger lesions in CT accessions than in cultivated tea plants, whereas no significant difference was observed between CSA and CSS (Fig. 6a–c and Supplementary Fig. 13). Plant resistance to pathogens is largely mediated by resistance (R) genes29. Therefore, we performed a comprehensive analysis of resistance gene analogs (RGAs) across 22 genomes, categorizing them into seven families based on protein domain architecture (Fig. 6d). The most prevalent RGA family was RLK-LRR (34.4%), followed by NBS-LRR (25.3%) and CC-NBS-LRR (25.2%), whereas CC-NBS and NBS-TIR were the least represented (1.9% and 1.0%, respectively) (Supplementary Table 6). Cultivated tea plants (CSA and CSS) contained higher numbers of several RGA classes, including NBS-LRR-TIR, NBS-LRR, CC-NBS-LRR, and RLK-LRR, with NBS-LRR-TIR and NBS-LRR being 1.3-fold more abundant than in CT. Consistent with the collinearity analysis, cultivated tea plants exhibited gains of disease resistance-related genes. For example, one NBS-LRR family member (ZC102AChr03G019210.1) that was lost in CT was widely present in cultivated tea plants (Fig. 6e).
Fig. 6. Phenotypic and transcriptomic characterization of tea plant resistance to Colletotrichum gloeosporioides infection.
a Lesion phenotypes of CT, CSA, and CSS accessions after C. gloeosporioides (Cg) infection. CT2, CT4, CT5, and CT7 belong to the CT group; YK11, YK47, YK50, and YC1 belong to the CSA group; ZYQ, FDDB, JX, and RZWL belong to the CSS group. CK represents mock infection with water. b Statistical analysis of lesion areas in 18 tea accessions post-infection (Supplementary Fig. 13). Box plots show the median (center line), first and third quartiles (box bounds), and whiskers extending to 1.5× the interquartile range. Sample sizes: CT1 (n = 21), CT2 (n = 27), CT3 (n = 27), CT4 (n = 32), CT5 (n = 27), CT6 (n = 15), CT7 (n = 33), CT8 (n = 26), YK47 (n = 28), YK11 (n = 30), YK50 (n = 30), YK10 (n = 30), YC1 (n = 18), BHZ (n = 29), ZYQ (n = 33), FDDB (n = 26), RZWL (n = 30), and JX (n = 27). c Comparative lesion area analysis among CT (n = 8), CSA (n = 5), and CSS (n = 5). Data are presented as mean ± SD. Different lowercase letters indicate significant differences at P < 0.05 (one-way ANOVA with Waller–Duncan multiple-range test for post-hoc pairwise comparisons). Groups sharing the same letter are not significantly different. d Statistical classification of R genes across the CT, CSA, and CSS genomes, identifying seven RGA types. e Collinearity analysis showing the presence of the ZC102AChr03G019210.1 gene in cultivated tea genomes (CSS, CSA) and its absence in wild tea genomes (CT). f Venn analysis of two gene sets: DEGs of Cg (differentially expressed genes identified from transcriptome analysis of seven accessions post-Cg infection: CT1, CT4, CT7, YK11, YK47, FDDB, and ZYQ) and CT-SV genes (genes in the CT genome with SVs in their promoter regions). g KEGG pathway enrichment analysis of the 36 intersecting genes identified in (f), statistical significance (−log10(p-value)) calculated using hypergeometric test (one-tailed). Source data are provided as a Source Data file.
In addition to gene copy number variations, SV analysis of R genes revealed extensive genomic variation within this gene class. Among the 739 identified R genes, 398 (53.85%) harbored SVs within their mRNA regions, with an average of five SV events per gene locus (Supplementary Fig. 14 and Supplementary Data 11). Additionally, SVs were detected in the promoter regions of 326 R genes (44.11% of the total RGA repertoire), indicating that regulatory regions underwent frequent structural modifications that might influence gene expression and disease resistance responses.
A 159 bp insertion upregulates CtLRR1 expression and impairs pathogen defense
To elucidate the molecular mechanisms underlying disease susceptibility in wild tea relatives, we performed transcriptome sequencing on seven pathogen-infected accessions. This analysis revealed 674 differentially expressed genes (DEGs) in wild tea plants (CT) and 129 DEGs in cultivated tea plants. These DEGs were merged to create a nonredundant set of pathogen-responsive genes (Supplementary Fig. 15). To assess whether SVs contributed to disease susceptibility in CT accessions, we compared SV-associated genes (CT-SVs) with the pathogen-responsive gene set and identified 36 DEGs harboring SVs (Fig. 6f). KEGG enrichment analysis of these 36 DEGs revealed significant enrichment for leucine-rich repeat (LRR) domain-containing proteins, with ZC102AChr10G019190.1 (LRR1) being the sole representative gene in this category (Fig. 6g). Following pathogen infection, CtLRR1 expression in CT accessions was significantly higher than LRR1 expression in cultivated tea plants (Fig. 7a).
Fig. 7. Functional analysis of the 159 bp CtLRR1 promoter insertion in pathogen defense.
a qRT-PCR analysis of LRR1 expression in seven tea accessions. Data are presented as mean ± SD (n = 3 biological replicates). Different lowercase letters indicate significant differences at P < 0.05 (one-way ANOVA with Waller–Duncan multiple-range test for post-hoc pairwise comparisons). Groups sharing the same letter are not significantly different. b Schematic representation of the SV in the LRR1 promoter. c PCR verification of the LRR1 promoter SVs in seven tea accessions. The experiment was independently repeated three times with similar results. d Diagram illustrating the dual-luciferase assay used to assess LRR1 promoter activity. e, f Luminescence imaging (e) and luciferase (LUC) activity quantification (f) show that compared with the expression of CsaLRR1 and CssLRR1, the transient expression of CtLRR1 in N. benthamiana leaves results in increased promoter activity. Relative LUC activity was normalized to REN activity. Empty represents the empty vector control. Data are presented as mean ± SD (n = 4 biological replicates). Different lowercase letters indicate significant differences at P < 0.05 (one-way ANOVA with Waller–Duncan multiple-range test for post-hoc pairwise comparisons). Groups sharing the same letter are not significantly different. g CK represents the tea leaves of C. sinensis cv. Longjin43 infected with Agrobacterium carrying empty pTRV1 and pTRV2 vectors, while VIGS represents the tea leaves infected with Agrobacterium carrying empty pTRV1 and pTRV2-CsLRR1. Chlorophyll fluorescence imaging shows infected leaves. Larger black areas indicate stronger pathogen infection severity. h CsLRR1 expression levels in control and silenced samples. Data are presented as mean ± SD (n = 5 biological replicates). i Lesion area measurements following C. gloeosporioides (Cg) infection in the control and VIGS-silenced groups (n = 40 per group). Box plots display median (center line), quartiles (box bounds), whiskers (1.5× IQR). 
Statistical significance was assessed using two-tailed Student’s t-tests. Source data are provided as a Source Data file.
To further explore the differences in LRR1 expression between CT accessions and cultivated tea plants, we investigated the genetic variation in the LRR1 sequence in different tea genomes. Sequence characterization revealed a 159 bp insertion in the promoter region of CtLRR1 specific to CT accessions (Fig. 7b). PCR amplification confirmed that this insertion was heterozygous in CT, whereas cultivated tea plants were homozygous for its absence (Fig. 7c). In addition, PCR-based sequencing of the LRR1 promoter revealed the 159 bp insertion in other wild relatives, such as C. gymnogyna, C. kwangsiensis, and C. crassicolumna. C. gymnogyna and C. kwangsiensis presented homozygous insertions, whereas C. crassicolumna presented a heterozygous pattern, similar to that observed in CT (Supplementary Fig. 16). We next examined the variations in the UTR and CDS regions of LRR1 across different tea genomes. These variations showed no clear pattern: UTR variations did not cause frameshift mutations in the CDS of LRR1, and all SNPs identified in the CDS region were synonymous (Supplementary Fig. 17). Collectively, these results suggested that the presence or absence of the 159 bp insertion in the LRR1 promoter might be associated with its differential expression in CT accessions and cultivated tea plants. Luciferase reporter assays supported this interpretation, as the activity of the CtLRR1 promoter was significantly greater than that of the CsaLRR1 and CssLRR1 promoters (Fig. 7d–f). To investigate the functional role of LRR1, we employed virus-induced gene silencing (VIGS) to suppress its expression in tea plants. Compared with the pathogen-infected controls, the silenced plants exhibited significantly reduced leaf damage (Fig. 7g–i). These findings revealed a negative correlation between LRR1 expression and resistance to C. gloeosporioides, indicating that LRR1 acted as a negative regulator of the immune response pathway against this pathogen in tea plants. The 159 bp insertion in the CtLRR1 promoter increased promoter activity and elevated transcription levels; however, this heightened expression compromised the defense against C. gloeosporioides, underscoring its negative regulatory effect on resistance in CT accessions.
Discussion
Limited genomic data on wild relatives and CSA have hindered comprehensive research and the effective utilization of tea plant genetic resources. To address this issue, we assembled a wild relative genome and five haplotype genomes of cultivated tea plants (three CSAs and two CSSs). Compared with existing references, our genome assemblies exhibited superior contiguity (contig N50), completeness (BUSCO scores), and structural integrity (LAI scores). Using these high-quality resources, we constructed a comprehensive pangenome from 22 tea plant genomes, including previously published assemblies with robust quality and broad geographical representation20,30. Through extensive SV analysis, population-level SV studies, and transcriptome sequencing, we elucidated how SVs influence gene expression during tea domestication and shape morphological diversity across accessions.
The divergence time between wild and cultivated tea plants has been widely debated. A previous study estimated a divergence time of 24 MYA between DASZ and cultivated tea plants31; however, our analysis, incorporating a broader genomic dataset with higher-quality assemblies, refined this estimate to 5.71 MYA (Fig. 1a). Additionally, phylogenetic analysis of tea cultivars from South Korea (C. sinensis cv. Sangmok) and Japan (C. sinensis cv. Seimei) showed that they clustered with the Chinese cultivar Tieguanyin (C. sinensis cv. Tieguanyin) (Supplementary Fig. 18), indicating a shared genetic background and common origin among these cultivated tea plants from East Asia. However, genomic differences among tea cultivars from different countries warrant further investigation. Given the high heterozygosity and genetic diversity of C. sinensis, an accurate pangenome must encompass representatives with significant genetic divergence (e.g., CSA and its wild relatives). The pangenome constructed in this study was reasonably comprehensive, as the number of pan genes increased steadily as the number of genomes increased to 20. Genomic collinearity analysis revealed that cultivated tea plants exhibited gene gains in stress-related pathways compared to wild tea plants, particularly those involved in flavonoid biosynthesis and plant-pathogen interaction pathways. This expansion contrasts with the loss of defense-related genes observed during grape domestication32, suggesting that domestication of tea favored stress resistance.
SVs are key drivers of plant phenotypic diversity and genome evolution33–36. Given the differences between haplotypes, we combined haplotype-resolved genomes and identified SVs across 26 genomes using ZC102hap2 as the reference genome, detecting a total of 1,375,161 SVs. These SVs were predominantly large PAVs (68.3%), followed by translocations (15.6%), CNVs (15.1%), and inversions (1%). Unlike those of rice and broomcorn millet, where the SV frequencies of wild species are relatively high37,38, our data revealed that compared with other genomes, DASZ genomes generally exhibited lower SV frequencies. SyRI revealed numerous large-scale inversions in DASZ, which likely disrupted the alignability and thereby reduced the number of detectable SVs (Fig. 3a and Supplementary Fig. 4). Moreover, most SVs were located in noncoding regions, with only a minor proportion (< 0.3%) in exonic regions, which was consistent with the patterns observed in kiwifruit39. Transcriptomic analysis revealed that approximately 22% (9304/42,623) of the promoters harbored PAVs. These genes with promoter PAVs exhibited significantly different expression levels among populations, impacting flavonoid, amino acid, and terpenoid metabolic pathways, potentially contributing to tea flavor diversity. Comparable SV patterns have been reported in other crops; for example, wheat pangenome analysis revealed a large insertion/deletion block on chromosome 1RS with reduced pSc200 tandem repeats in cultivars, a region that is widely utilized in breeding programs40. These parallels underscore the importance of PAVs in crop improvement.
In diploid plants, SV-induced allele loss results in hemizygous genes, a largely unexplored phenomenon in tea plants41–43. By assembling 10 haplotype genomes, we identified hemizygous genes arising from SVs. The hemizygosity of the CSA accessions was slightly greater (3.96%) than that of the CSS accessions (3.39%), with these genes predominantly involved in primary cell wall biosynthesis and ovule development. This pattern mirrors findings in grape32, where cultivated accessions display reduced hemizygous gene frequencies.
To harness the genetic diversity captured in our pangenome, we integrated SV data from 275 tea accessions into a comprehensive variation map, constructing a graph-based reference genome encompassing both wild and cultivated tea plants. Such graph-based frameworks have proven invaluable for SV-based association studies in other crops, enabling the identification of key agronomic traits44–46. However, owing to the absence of extensive phenotypic datasets, we relied on selective sweep analyses to delineate domestication-associated genomic intervals. A total of 886 shared SV sites were identified during the domestication of the CSS accessions. Combined promoter region and transcriptomic analyses successfully localized the ANS3 gene. The SV in the promoter region of this gene formed three distinct haplotypes, with different proportions of each haplotype across different populations, demonstrating a gradual differentiation pattern from CT to CSA to CSS. By integrating genotype data, expression profiles, and functional validation, we demonstrated that a CT-specific 192 bp promoter insertion significantly increased the transcriptional activity of CtANS3, leading to increased anthocyanin accumulation. This regulatory mechanism, wherein promoter region SVs drive gene expression, has also been observed in apple14 and Brassica oleracea47.
Research on disease resistance in tea plants remains key to the genetic improvement of domesticated cultivars. Consistent with findings in rice37 and chickpea12, our study revealed a greater abundance of RGAs in cultivated tea accessions than in their wild relatives, with a notable 1.3-fold expansion in NBS-LRR-TIR class genes. This pattern parallels R gene enrichment during domestication in apple29, suggesting that artificial selection broadly drives the expansion of crop resistance genes. Consistent with the results of the collinearity analysis, R genes expanded in cultivated tea plants, as exemplified by the gene ZC102AChr03G019210.1, which is absent in wild tea genomes. However, since R genes typically exist in tandem clusters with multiple copies, whether the loss of individual genes affects overall disease resistance capacity requires further investigation48. Recent chickpea pangenome studies12 have shown that SVs are prevalent in resistance (R) genes, with an average of three SV events per gene. Similarly, we observed frequent SVs within R gene bodies and found that more than 44% of R gene promoter regions contained SVs. By integrating pathogen transcriptomic data from different tea accessions, we identified a heterozygous site in CT accessions caused by a 159 bp insertion in the CtLRR1 promoter. Functional validation demonstrated that while this insertion significantly upregulated CtLRR1 transcription in CT accessions, it paradoxically reduced resistance to C. gloeosporioides. In contrast, cultivated tea plants lacking this insertion exhibited enhanced resistance. Similar antagonistic regulatory effects have been reported in Arabidopsis49, underscoring the complexity of disease-resistance gene networks. These findings advance our understanding of disease resistance mechanisms in tea domestication and highlight the regulatory roles of SVs in crop evolution.
In summary, we present six chromosome-level reference genomes and construct a graph-based whole-genome SV map for tea plants. Our comprehensive SV analysis reveals differential gene expression patterns linked to key agronomic traits, including leaf coloration, flavor compound biosynthesis, and disease resistance. More importantly, by integrating pangenome and transcriptomic data, we systematically identify SVs across tea populations and demonstrate their bidirectional regulation of multiple gene expression levels. Many of these expression-modulating SVs have undergone strong selection pressure during domestication. The genomic resources and pangenome variation map provided in this study offer valuable tools for future crop improvement, paving the way for accelerated advancements in tea plant breeding.
Methods
Genome sequencing and assembly
We collected six tea plant samples, including one wild relative accession (C. taliensis, DLC) from Yunnan, China, and five cultivated tea plants representing two Camellia sinensis varieties: three C. sinensis var. assamica (MHDY, QS3, and YK37) from Xishuangbanna Prefecture, Yunnan, and two C. sinensis var. sinensis (HJY and ZC102) from Anhui, China. Furthermore, 16 tea plant genomes were obtained from two previous studies20,30. For genome sequencing, genomic DNA was extracted from fully expanded mature leaves of each sample. PacBio SMRT libraries were constructed using the SMRTbell Express Template Prep Kit 2.0 and sequenced on the PacBio Sequel II platform. The CCS mode50 was employed to generate high-quality long reads, with six SMRT cells allocated per sample. This sequencing strategy produced >16 K consensus reads per cell, resulting in an average genome coverage of 38.13×. Raw PacBio reads were processed using HiFiAdapterFilt51 with default parameters to remove adapters and low-quality sequences. Hi-C raw reads were filtered using SOAPnuke52 v2.1.0 with the parameters “-n 0.1 -l 20 -q 0.2 -i” to remove adapters and low-quality bases. Primary genome assembly was performed using hifiasm53 with default parameters to integrate the HiFi and Hi-C data, generating a draft genome assembly and two haplotype-resolved assemblies (hap1 and hap2). The draft genome was then scaffolded to the chromosome-level using the 3D-DNA54 pipeline v180922 and Juicer55 v1.6. First, BWA was used to construct an index for the draft genome. Subsequently, Juicer was employed to remove low-quality and duplicate reads, generating the merged_nodups.txt file containing deduplicated valid pairs. The 3D-DNA pipeline was then applied to the merged_nodups.txt file and draft genome for scaffolding and error correction, with the parameters “-r 2 -q 1” to enable iterative error correction. 
Finally, manual curation was performed using Juicebox v1.11.08 to generate a high-quality chromosome-level genome assembly. Additionally, we evaluated HapHiC56 as an alternative scaffolding approach, which demonstrated comparable assembly quality to the 3D-DNA/Juicer pipeline while offering a more streamlined workflow. To obtain haplotype-resolved assemblies, the haplotype contigs (hap1 and hap2) were aligned to the chromosome-level genome assembly using RagTag57 v2.1.0 with the parameters “-C 0.95 -T sr” to generate two chromosome-scale haplotype-resolved genomes. The detailed assembly procedure is provided in Supplementary Methods 1 and 2.
Whole-genome resequencing was conducted on 275 tea accessions, including 94 wild relatives, 91 C. sinensis var. assamica (CSA) accessions, and 90 C. sinensis var. sinensis (CSS) accessions. Of these, 32 accessions were from a previous study22, while 243 accessions were newly sequenced. Detailed accession information is provided in Supplementary Data 9.
Repeat sequence annotation
Repetitive sequences were annotated using both homology-based and de novo58 approaches. We utilized the RepBase59 database as well as RepeatMasker and RepeatProteinMask to identify and classify known repetitive elements for homology-based annotation. For de novo prediction, we constructed a repeat library using RepeatModeler60 and LTRharvest, followed by annotation with RepeatMasker. Tandem repeats were identified with Tandem Repeats Finder61. The detailed annotation method is provided in the Supplementary Method 3.
Gene prediction and functional annotation
To minimize the impact of inconsistent annotation pipelines on gene-level analyses, we used both homology- and transcriptome-based methods for the annotation of the 22 genomes. For homology-based annotation, protein-coding sequences (CDSs) from homologous species were individually aligned to each target genome using GeMoMa62. The homology-based annotation results were then merged using the GeMoMaPipeline GAF module with the parameters f = “evidence>=2”, c = 0.9, and m = 1. For transcriptome-based annotation, RNA-seq reads were aligned to the corresponding genome using HISAT263, generating SAM files that were converted to the BAM format and sorted using SAMtools64. StringTie265 was employed to perform transcriptome assembly on the sorted BAM files, producing GTF files. TransDecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder) scripts were used to extract cDNA sequences from the transcripts and convert the genomic coordinates. Incomplete coding genes were subsequently removed to generate the final transcriptome-based GFF files. The detailed gene annotation methodology is provided in Supplementary Method 4. Annotation quality was evaluated using BUSCO66 against the eudicots_odb10 reference database.
Functional annotation was performed by aligning gene sequences against SwissProt67, TrEMBL67, NR, and KOG using DIAMOND, followed by biological pathway annotation using KEGG68 and gene ontology annotation using GO69.
Phylogenetic tree construction and divergence time estimation
The Actinidia chinensis cv. Hong Yang (v3) genome (https://kiwifruitgenome.org/) was used as an outgroup for phylogenetic inference. Single-copy orthologs were identified using OrthoFinder (v2.5.5)70 with the “-S blast” parameter. Protein sequences of single-copy genes were aligned with MAFFT (v7.525)71, and codon alignments were generated using PAL2NAL (v14)72 with default parameters. Conserved alignment blocks were extracted using Gblocks (v0.91b) with parameters set to “-t=c” for codon-based selection and “-b5=h” for allowed gap positions. The resulting filtered alignments were then concatenated into a supermatrix for downstream analyses. Phylogenetic reconstruction was performed using IQ-TREE (v2.3.3)73 with the parameters “-st CODON -bb 1000 -m GY+F+G4”. Divergence time estimation was conducted using the MCMCtree program implemented in PAML (v4.9j)74. Two calibration points were employed: the divergence between Actinidia chinensis and Camellia sinensis (82.8–106 Ma), and the split between CT and Camellia sinensis (4.8–7.8 Ma), both taken from the TimeTree database (https://timetree.org/).
Analysis of the pangenome
According to the clustering results, we identified 43,444 nonredundant orthogroups, which were classified by their distribution across samples into core (present in all 22 samples), soft-core (20–21 samples), dispensable (2–19 samples), and private (unique to a single sample) gene families. Pangenome saturation curves were generated using PanGP (v1.0.1) (https://pangp.zhaopage.com/) to visualize the gene clustering results. CDS and protein sequence files were prepared for the genes, and initial sequence alignment was conducted using BLAST, followed by running the ParaAT.pl pipeline to calculate Ka/Ks ratios (the ratio of nonsynonymous to synonymous substitution rates). The analysis parameters were configured as follows: -m clustalw2 (multiple sequence alignment method), -f axt (output format), -g (gap handling), and -k (Ka/Ks calculation).
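The four pangenome categories follow directly from the number of samples in which an orthogroup is present. The following is a minimal Python sketch of that rule, assuming the per-orthogroup sample count has already been derived from the clustering output; the function name classify_orthogroup is hypothetical and is not part of the pipeline described here.

```python
def classify_orthogroup(n_present: int, n_total: int = 22) -> str:
    """Assign a pangenome category from the number of samples containing an orthogroup.

    Thresholds follow the text: core = all 22 samples, soft-core = 20-21,
    dispensable = 2-19, private = exactly one sample.
    """
    if not 1 <= n_present <= n_total:
        raise ValueError("n_present must be between 1 and n_total")
    if n_present == n_total:
        return "core"
    if n_present >= n_total - 2:   # 20 or 21 of 22 samples
        return "soft-core"
    if n_present >= 2:             # 2 to 19 of 22 samples
        return "dispensable"
    return "private"               # present in a single sample


# Hypothetical orthogroup IDs and sample counts, for illustration only.
counts = {"OG0000001": 22, "OG0000002": 21, "OG0000003": 7, "OG0000004": 1}
categories = {og: classify_orthogroup(n) for og, n in counts.items()}
```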
Synteny-based gene gain and loss analysis
Genomic synteny analysis was performed using the JCVI75 software package. Genomic GFF files were first converted to BED format using jcvi.formats.gff. Redundant transcripts were removed and the longest transcript for each gene was retained using jcvi.formats.bed. CDS sequences were then extracted from the processed BED files. Syntenic blocks between genomes were identified using jcvi.compara.catalog with the parameters “--cscore=0.99 --no_strip_names”. The linear relationships between genes were determined using jcvi.compara.synteny mcscan with the parameter “--iter=1”. Based on the syntenic relationships, genes were classified as lost genes if they were absent in ≥80% of the cultivated tea samples or were completely absent in the wild tea samples.
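The lost-gene criterion can be written as a small predicate. The sketch below is illustrative only: it assumes that presence or absence of a syntenic ortholog has already been tabulated per sample as booleans, and the function name is_lost_gene is hypothetical.

```python
def is_lost_gene(cultivated_present, wild_present, threshold=0.8):
    """Apply the lost-gene rule from the text.

    cultivated_present / wild_present: non-empty lists of booleans, one per
    sample, True when a syntenic ortholog was found in that sample.
    A gene is 'lost' if it is absent in >= 80% of cultivated samples
    or absent in every wild sample.
    """
    if not cultivated_present or not wild_present:
        raise ValueError("both sample lists must be non-empty")
    n_absent = sum(1 for present in cultivated_present if not present)
    absent_fraction = n_absent / len(cultivated_present)
    absent_in_all_wild = not any(wild_present)
    return absent_fraction >= threshold or absent_in_all_wild
```

For example, a gene missing from four of five cultivated samples meets the 80% threshold, while a gene found in every cultivated sample is still counted if no wild sample carries it.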
Identification of disease-resistance genes
To identify and classify disease resistance genes in the tea plant genome, we focused on two major types: NBS-LRR (nucleotide-binding site with a leucine-rich repeat) and RLK-LRR (receptor-like kinase with leucine-rich repeat) genes76,77. Using HMMER with default parameters, we searched the tea plant proteome against the Hidden Markov Model (HMM) of the NB-ARC domain (PF00931). The identified NBS-encoding genes were further classified by searching against TIR (PF01582) and LRR domain HMMs from the Pfam database. Additionally, CC domains within these NBS-encoding proteins were identified using NCOILS under default settings. For RLK-LRR gene identification, we first searched the tea plant proteome using the kinase HMM profile (PF00069) from the Pfam database, after which we scanned the resulting candidates against the LRR HMM profile (PF00560).
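Once the HMMER and NCOILS hits are collected per protein, class assignment reduces to a lookup on the set of detected domains. The Python sketch below illustrates one possible scheme; the class labels, the precedence of TIR over CC, and the function name classify_rga are assumptions made for illustration and may not match the seven RGA types reported in Fig. 6d.

```python
from typing import Optional, Set


def classify_rga(domains: Set[str]) -> Optional[str]:
    """Label a protein from its detected domains (illustrative scheme).

    Expected domain tags: "NB-ARC" (PF00931), "TIR" (PF01582), "LRR",
    "CC" (coiled coil from NCOILS), "Pkinase" (PF00069).
    Returns None when the protein fits neither the NBS nor the RLK-LRR group.
    """
    if "NB-ARC" in domains:
        parts = []
        if "TIR" in domains:       # assumption: TIR takes precedence over CC
            parts.append("TIR")
        elif "CC" in domains:
            parts.append("CC")
        parts.append("NBS")
        if "LRR" in domains:
            parts.append("LRR")
        return "-".join(parts)
    if "Pkinase" in domains and "LRR" in domains:
        return "RLK-LRR"
    return None
```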
Detection of genomic structural variants
The SV detection pipeline, including analytical methods and computational scripts, was adapted from the published tomato pan-genome study19. Briefly, whole-genome alignment between the 26 tea plant genomes and the ZC102-hap2 reference was performed using minimap2 (v2.17)78 with the parameters “-ax asm5 --eqx”, followed by BAM file sorting using SAMtools. Structural variants (SVs) were identified using SyRI (v1.6)79, including PAVs (insertions and deletions >50 bp), inversions, translocations, and CNVs (duplications). We filtered out SVs containing “N” sequences, those with ambiguous alignment margins, and variants showing poor synteny alignment at breakpoints. SURVIVOR (v1.0.7)80 was employed to merge SVs shorter than 50 kb across all analyzed genomes. Consolidation was performed using the parameter set “1000 2 1 0 0 50”, which specified a maximum breakpoint distance of 1000 bp for merging SVs from the input VCF files. The distribution of identified SVs was visualized using Circos (v0.69-8)81. Identification of hemizygous genes was performed according to the method described in Long et al.32. Briefly, hemizygous genes were identified by detecting SVs between hap1 and hap2 using minimap2 followed by SyRI analysis. Genes flanked by heterozygous insertion/deletion variants were classified as hemizygous candidates32,82, and their quantification was performed using bedtools (v2.30.0)83. All SV loci and genotyping data are provided in Supplementary Data 12–16.
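To make the merging criterion concrete, the following Python sketch groups SVs of the same type whose start and end breakpoints each lie within 1000 bp of a cluster's first member, and keeps clusters supported by at least two genomes. This is a deliberately simplified greedy illustration, not SURVIVOR's actual algorithm; the minimum support of two reflects our reading of the second value in the parameter set and should be treated as an assumption, as should the SV record layout and the name merge_svs.

```python
from collections import namedtuple

# Minimal SV record; field names are hypothetical.
SV = namedtuple("SV", "chrom start end svtype genome")


def merge_svs(svs, max_dist=1000, min_support=2):
    """Greedy breakpoint-distance merge (simplified illustration).

    An SV joins an existing cluster when it has the same chromosome and type
    as the cluster's first member and both breakpoints are within max_dist.
    Clusters seen in fewer than min_support distinct genomes are dropped.
    """
    clusters = []
    for sv in sorted(svs, key=lambda s: (s.chrom, s.svtype, s.start, s.end)):
        for cluster in clusters:
            rep = cluster[0]
            if (rep.chrom == sv.chrom and rep.svtype == sv.svtype
                    and abs(rep.start - sv.start) <= max_dist
                    and abs(rep.end - sv.end) <= max_dist):
                cluster.append(sv)
                break
        else:
            clusters.append([sv])  # no compatible cluster found
    return [c for c in clusters if len({s.genome for s in c}) >= min_support]


# Hypothetical calls from three genomes.
example = [
    SV("Chr01", 1000, 1200, "DEL", "g1"),
    SV("Chr01", 1400, 1650, "DEL", "g2"),
    SV("Chr01", 90000, 90300, "DEL", "g3"),
]
merged = merge_svs(example)
```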
SV-affected genes and associated expression analysis
SVs overlapping with gene regions, including upstream (2 kb), exons, introns, downstream (2 kb), and intergenic regions, were identified using bedtools intersect (“-wa -wb”). After removing redundancies, genes associated with SVs were classified as SV-affected. RNA sequencing was conducted on 21 tea accessions categorized into CT, CSA, and CSS groups. Transcriptomic data were used to determine gene expression levels, and SV-affected genes were compared across groups. Genes exhibiting >twofold expression changes and TPM > 1 were selected for heatmap visualization and enrichment analysis. Expression pattern and KEGG pathway enrichment analyses were performed using TBtools84.
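The upstream-region overlap can be illustrated without bedtools. The Python sketch below derives a strand-aware 2 kb upstream interval for a gene and tests whether an SV overlaps it, using 0-based half-open coordinates as in BED. It is a minimal illustration, not the pipeline used here; the function names and the example coordinates are hypothetical.

```python
def upstream_interval(gene_start, gene_end, strand, flank=2000):
    """Return the (start, end) of the upstream flank, 0-based half-open.

    For '+' genes the flank precedes gene_start; for '-' genes it follows
    gene_end, because transcription runs toward lower coordinates.
    """
    if strand == "+":
        return max(0, gene_start - flank), gene_start
    if strand == "-":
        return gene_end, gene_end + flank
    raise ValueError("strand must be '+' or '-'")


def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


# Hypothetical gene on the plus strand and a 192 bp insertion-sized interval.
promoter = upstream_interval(5000, 8000, "+")
hit = overlaps(promoter[0], promoter[1], 4700, 4892)
```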
Graph-based tea plant genome construction and SV genotyping
We generated a graph-based representation of the tea plant genome by integrating the linear reference sequence with SVs through the implementation of the “vg construct” pipeline85. The merged VCF files were compressed and indexed using BCFtools, followed by the implementation of “vg giraffe” through EVG86 installation. A genome graph was constructed and indexed using “vg autoindex” and “vg snarls”. Using “vg giraffe”, we mapped short paired-end reads derived from 275 tea accessions against the indexed graph genome, resulting in the generation of GAM format alignment files. The graph alignment results were processed using “vg pack” to generate coverage statistics, followed by SV genotyping of all 275 tea accessions through the implementation of “vg call” using default parameters.
Selective sweep analysis
All gVCF files were merged using the BCFtools “merge” function87. Nucleotide diversity (π) for the CT, CSA, and CSS tea populations was calculated separately using VCFtools88, and pairwise Fst comparisons were performed with sliding windows (parameters: “--fst-window-size 20000 --fst-window-step 2000”). The top 5% of Fst regions were identified as putative selective sweeps.
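Selecting the top 5% of windows is a simple ranking step once windowed Fst values are available. The Python sketch below assumes each window has been parsed into a (chromosome, start, end, Fst) tuple; the function name top_fst_windows and the tuple layout are hypothetical, and tie handling at the cutoff is not addressed.

```python
def top_fst_windows(windows, percent=5):
    """Return the highest-Fst windows making up the given percentage.

    windows: list of (chrom, start, end, fst) tuples.
    Integer arithmetic avoids floating-point rounding at the cutoff;
    at least one window is returned when the input is non-empty.
    """
    ranked = sorted(windows, key=lambda w: w[3], reverse=True)
    n_keep = max(1, len(ranked) * percent // 100)
    return ranked[:n_keep]


# Hypothetical windows: 100 consecutive 20 kb windows stepping by 2 kb.
windows = [("Chr01", i * 2000, i * 2000 + 20000, i / 100) for i in range(100)]
sweeps = top_fst_windows(windows)
```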
The extraction of anthocyanins
Anthocyanin extraction and quantification were performed according to the method of Xie et al.89. In brief, lyophilized tea leaves (50 mg) were ground into powder and suspended in 1 mL of 1% HCl/methanol (v/v). After sonication at 90% power for 30 min, the mixture was centrifuged at 2348 × g for 5 min, and the supernatant was collected. The residue was extracted twice with 500 μL of 1% HCl/methanol, and all three supernatants were combined. A 1:1:1 (v/v/v) mixture of the combined extract, ultrapure water, and chloroform was vortexed for 2 min, centrifuged at 2348 × g for 5 min, and the upper phase was collected. This phase was mixed with ultrapure water and ethyl acetate (1:1:1, v/v/v), vortexed for 2 min, centrifuged at 2348 × g for 5 min, and the lower phase containing anthocyanins was collected. To obtain anthocyanin monomers, 2 mL of the extract was transferred to a 15 mL centrifuge tube, combined with 200 μL of concentrated HCl, loosely capped, and incubated in a 100 °C water bath for 1 h. The extract was cooled on ice for 10 min, filtered through a 0.22 μm membrane, and analyzed by ultra-performance liquid chromatography (UPLC).
UPLC analysis
The anthocyanins in the tea samples were measured according to the method of Mei et al.90. Briefly, UPLC was performed using a Waters ACQUITY UPLC system (Waters, USA) equipped with a Phenomenex Kinetex C18 column (2.6 μm, 100 mm × 4.6 mm i.d.). The mobile phase consisted of 0.5% formic acid (v/v) in water (A) and acetonitrile (B), with a gradient elution program as follows: 8–20% B over 4 min, held at 20% B for 2.5 min, 20–40% B over 5.5 min, held at 40% B for 1 min, 40–90% B over 2 min, and re-equilibrated to 10% B over 5 min, totaling a 20-min run time. The flow rate was maintained at 0.8 mL/min. Detection was performed at 530 nm, with the column and autosampler temperatures set at 25 °C and 4 °C, respectively. Anthocyanin monomers were identified and quantified using standard curves generated from cyanidin, delphinidin. All measurements were performed with three biological replicates.
DNA extraction and PCR verification
Genomic DNA was extracted from 100 mg of tea leaves ground in liquid nitrogen. The lysate was prepared using 500 μL of extraction buffer and 10 μL β-mercaptoethanol, incubated at 65 °C for 30 min, followed by the addition of 600 μL chloroform/isoamyl alcohol (24:1 v/v) and centrifugation at 13,523 × g for 10 min. The supernatant was mixed with an equal volume of isopropanol, centrifuged at 13,523 × g for 5 min, and the resulting pellet was washed twice with 70% ethanol before resuspension in 100 μL of sterile water. Primers were designed based on syntenic regions across 22 genomes, and gene fragments were validated via PCR and agarose gel electrophoresis. All primer and gene sequences are listed in Supplementary Data 17, and sequencing data are provided in the Supplementary Data 18.
Quantitative real-time PCR
Total RNA was extracted from young tea leaves using a Plant RNA Kit (RC411-01, Vazyme, China), and first-strand cDNA was synthesized with a HiScript III RT SuperMix for qPCR Kit (R323-01, Vazyme, China). qRT-PCR was conducted using Taq Pro Universal SYBR qPCR Master Mix (Q712-02, Vazyme, China) on a Bio-Rad CFX96 Real-Time PCR Detection System (Bio-Rad, USA). The thermal cycling conditions were 95 °C for 3 min, followed by 38 cycles of 95 °C for 10 s and 57 °C for 30 s, then 65 °C for 5 s and 95 °C for 5 s. CsGAPDH was used as an internal reference, and relative gene expression was calculated using the 2^(−ΔCt) method. The same method was used for gene expression analysis of samples treated with C. gloeosporioides (with CK controls).
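For reference, the 2^(−ΔCt) calculation takes ΔCt as the Ct of the target gene minus the Ct of the internal reference (here CsGAPDH). A minimal Python sketch follows; the function name and the Ct values are hypothetical and serve only to illustrate the arithmetic.

```python
def relative_expression(ct_target, ct_reference):
    """Relative expression by the 2^(-dCt) method.

    dCt = Ct(target) - Ct(reference); a target that crosses the threshold
    two cycles later than the reference has 2**-2 = 0.25 relative expression.
    """
    delta_ct = ct_target - ct_reference
    return 2.0 ** (-delta_ct)


# Hypothetical Ct pairs (target, reference) for three biological replicates.
replicates = [(24.0, 20.0), (23.0, 20.0), (24.0, 21.0)]
values = [relative_expression(t, r) for t, r in replicates]
```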
Dual-luciferase reporter assay
The promoter sequences of LRR1 and ANS3 were cloned from different tea accessions, and the 192 bp insertion in the CtANS3 promoter was removed using fusion PCR. In total, seven promoters were obtained: proCssLRR1, proCsaLRR1, proCtLRR1, proCssANS3, proCsaANS3, proCtANS3, and proCtANS3-192. These promoter sequences were inserted into the pGreenⅡ-0800-LUC vector via homologous recombination (restriction sites: Kpn Ⅰ - Hind Ⅲ), with the empty pGreenⅡ-0800-LUC vector serving as a control.
The recombinant constructs were subsequently transformed into Agrobacterium tumefaciens strain EHA105 (pSoup) (Weidi, China), and positive transformants were selected on LB medium containing 50 mg/L kanamycin. The verified Agrobacterium cultures were grown, collected by centrifugation at 2348 × g for 10 min, and resuspended in infiltration buffer (10 mM MES, 10 mM MgCl₂, and 200 μM acetosyringone) to an OD600 of 0.4. The bacterial suspensions (100 μL per infiltration site) were introduced into the leaves of six-week-old Nicotiana benthamiana plants using a needleless syringe.
Following infiltration, the plants were kept in darkness for 48 h. Afterward, 100 mM potassium D-luciferin (Biolai, China) was applied to the abaxial leaf surface, and luminescence signals were captured using the NightShade LB 9851 in vivo imaging system (Berthold, Germany). Relative luminescence activity was quantified using the Promega E1500 Luciferase Assay System (Promega, USA), calculated as the Luc/Ren ratio.
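As a rough illustration of the Luc/Ren calculation, the Python sketch below computes the ratio per infiltration site and then expresses a construct's mean ratio relative to the empty-vector control. Normalizing to the control mean is an assumption made here for illustration; the text states only that the Luc/Ren ratio was calculated and that the empty vector served as a control. All function names and readings are hypothetical.

```python
from statistics import mean


def luc_ren_ratio(luc, ren):
    """Firefly (Luc) signal divided by Renilla (Ren) signal for one site."""
    if ren <= 0:
        raise ValueError("Renilla signal must be positive")
    return luc / ren


def relative_activity(construct_reads, control_reads):
    """Mean Luc/Ren of a construct divided by the mean Luc/Ren of the control.

    Each argument is a list of (luc, ren) tuples, one per infiltration site.
    """
    construct = mean([luc_ren_ratio(l, r) for l, r in construct_reads])
    control = mean([luc_ren_ratio(l, r) for l, r in control_reads])
    return construct / control


# Hypothetical luminometer readings (arbitrary units).
promoter_reads = [(200.0, 100.0), (400.0, 100.0)]
empty_vector_reads = [(100.0, 100.0), (100.0, 100.0)]
print(relative_activity(promoter_reads, empty_vector_reads))
```

Dividing by the Renilla signal corrects for differences in transformation efficiency between infiltration sites, which is why the ratio rather than the raw firefly signal is compared across promoters.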
Pathogen infection transcriptome analysis
Tea plant accessions with uniform growth were collected from the germplasm repository at the Yunnan Academy of Agricultural Sciences. The samples included eight CT accessions (CT1-8), five CSA cultivars (YK47, YK11, YK50, YK10, YC1), and five CSS cultivars (BHZ, ZYQ, FDDB, RZWL, JX). Leaves from these accessions were inoculated with C. gloeosporioides (Cg) mycelial plugs, whereas water-treated leaves served as controls (CK). Each treatment was performed in biological duplicate per leaf. Inoculated samples were sealed with preservative film and incubated in a growth chamber at 25 °C under a 16-h light/8-h dark cycle with relative humidity exceeding 80%. Disease progression was monitored daily, and lesions became distinct by day 6 post-inoculation. Leaves were then photographed, and lesion areas were quantified using ImageJ software.
For transcriptome sequencing, seven accessions (CT1, CT4, CT7, YK11, YK47, FDDB, ZYQ) were selected. Gene expression levels were quantified using the HISAT2 and StringTie pipeline with default parameters. Differentially expressed genes (DEGs) were identified using DESeq2, with genes showing fold changes > 2 and P < 0.05 retained for downstream analysis.
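The DEG thresholds can be made concrete with a small Python sketch. It assumes that "fold changes > 2" applies in both directions (more than 2-fold up or down, i.e. |log2 fold change| > 1) and that P is the value reported for each gene by the differential expression test; neither detail is stated in the text. The gene names and values are hypothetical.

```python
import math


def is_deg(fold_change, p_value, min_fold=2.0, max_p=0.05):
    """Return True if a gene passes the fold-change and P-value thresholds.

    fold_change is the treated/control ratio on a linear scale (must be > 0).
    Up- and down-regulation are treated symmetrically (an assumption).
    """
    if fold_change <= 0:
        raise ValueError("fold_change must be positive")
    passes_fold = abs(math.log2(fold_change)) > math.log2(min_fold)
    return passes_fold and p_value < max_p


def filter_degs(results):
    """Keep the names of genes that pass both thresholds."""
    return [r["gene"] for r in results if is_deg(r["fold_change"], r["p_value"])]


# Hypothetical Cg-vs-CK results.
results = [
    {"gene": "geneA", "fold_change": 4.0, "p_value": 0.01},   # up, significant
    {"gene": "geneB", "fold_change": 0.25, "p_value": 0.001}, # down, significant
    {"gene": "geneC", "fold_change": 1.5, "p_value": 0.001},  # change too small
    {"gene": "geneD", "fold_change": 8.0, "p_value": 0.2},    # not significant
]
print(filter_degs(results))
```

Working on the log2 scale keeps the filter symmetric, so a 4-fold decrease (fold change 0.25) is treated the same way as a 4-fold increase.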
Virus-induced gene silencing
Virus-induced gene silencing (VIGS) was performed following established protocols with modifications optimized for tea plants91–93. One-year-old tea seedlings were infiltrated with Agrobacterium suspensions carrying either pTRV1/pTRV2 or pTRV1/pTRV2-CsLRR1 via leaf injection. Infiltrated plants were incubated at 25 °C for 14 days, after which CsLRR1 transcript levels were assessed by qRT-PCR. After gene silencing was confirmed, the treated leaves were inoculated with C. gloeosporioides mycelial plugs. After 6 days, lesion areas were quantified using ImageJ, and disease symptoms were examined via chlorophyll fluorescence imaging.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Supplementary information
Description of Additional Supplementary Files
Source data
Acknowledgements
This work was supported by the National Natural Science Foundation of China (grant nos. U20A2045, 32260790, 32202542 and 32472791), the Project of Science and Technology of Yunnan Province (grant no. 202102AE090038), the Base of Introducing Talents for Tea Plant Biology and Quality Chemistry (D20026), the Project of Tea Plant Germplasm Resource Garden in Anhui, and the Independent Research Project of State Key Laboratory for Tea Plant Germplasm Innovation and Resource Utilization (grant no. SKLTEA-ZZ202502).
Author contributions
C.W., Y.Z., E.X. and X.W. conceived and designed the research. B.L., L.C., Y.H., Y.Z., Q.Z., S.H., H.C., Y.L., C.C., J.W. and L.T. participated in material collection. L.T., J.H., and Q.X. performed the genome assembly, annotation and evaluation. L.T. conducted structural variation, transcriptome and population analyses. L.T. and J.W. performed the molecular experiments. L.T. wrote the manuscript. J.Z., S.L., K.D., C.W., Y.Z. and E.X. revised the manuscript. All the authors have read, edited and approved the final manuscript.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Data availability
All the data generated in this study have been deposited in the National Genomics Data Center (NGDC) with the following BioProject accessions: PRJCA036376 (raw genomic and transcriptome data), PRJCA036909 (resequencing data), PRJCA050339 (genome assembly data), and PRJCA048375 (Hi-C raw data). The assembly and annotation data of the genome and haplotype genome are publicly available at Zenodo [https://zenodo.org/uploads/17174024]. The source data and image files underlying the figures in this study are publicly available through Figshare [10.6084/m9.figshare.30186544]. Source data are provided with this paper.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Lingling Tao, Junyan Zhu, Jianbing Hu, Qi Xu.
Contributor Information
Xiaochun Wan, Email: xcwan@ahau.edu.cn.
Enhua Xia, Email: xiaenhua@gmail.com.
Yongfeng Zhou, Email: zhouyongfeng@caas.cn.
Chaoling Wei, Email: weichl@ahau.edu.cn.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-025-67060-5.
References
- 1. Xia, E. et al. Tea plant genomics: achievements, challenges and perspectives. Hortic. Res. 7, 7 (2020).
- 2. Wei, C. et al. Draft genome sequence of Camellia sinensis var. sinensis provides insights into the evolution of the tea genome and tea quality. Proc. Natl. Acad. Sci. USA 115, E4151–E4158 (2018).
- 3. Zhang, W. et al. Genome assembly of wild tea tree DASZ reveals pedigree and selection history of tea varieties. Nat. Commun. 11, 3719 (2020).
- 4. Sun, W. et al. Genetic diversity analysis and core collection construction of tea plant from the Yunnan Province of China using ddRAD sequencing. BMC Plant Biol. 24, 1163 (2024).
- 5. Zhang, X. et al. Haplotype-resolved genome assembly provides insights into evolutionary history of the tea plant Camellia sinensis. Nat. Genet. 53, 1250–1259 (2021).
- 6. Yu, X. et al. Metabolite signatures of diverse Camellia sinensis tea populations. Nat. Commun. 11, 5586 (2020).
- 7. Wang, X. et al. Population sequencing enhances understanding of tea plant evolution. Nat. Commun. 11, 4447 (2020).
- 8. Kong, W. et al. Pan-transcriptome assembly combined with multiple association analysis provides new insights into the regulatory network of specialized metabolites in the tea plant Camellia sinensis. Hortic. Res. 9, uhac100 (2022).
- 9. Song, J.-M. et al. Eight high-quality genomes reveal pan-genome architecture and ecotype differentiation of Brassica napus. Nat. Plants 6, 34–45 (2020).
- 10. Li, H. et al. Graph-based pan-genome reveals structural and sequence variations related to agronomic traits and domestication in cucumber. Nat. Commun. 13, 682 (2022).
- 11. Qi, J. et al. A genomic variation map provides insights into the genetic basis of cucumber domestication and diversity. Nat. Genet. 45, 1510–1515 (2013).
- 12. Khan, A. W. et al. Cicer super-pangenome provides insights into species evolution and agronomic trait loci for crop improvement in chickpea. Nat. Genet. 56, 1225–1234 (2024).
- 13. Shi, T. et al. The super-pangenome of Populus unveils genomic facets for its adaptation and diversification in widespread forest trees. Mol. Plant. 17, 725–746 (2024).
- 14. Wang, T. et al. Pan-genome analysis of 13 Malus accessions reveals structural and sequence variations associated with fruit traits. Nat. Commun. 14, 7377 (2023).
- 15. Liu, Y. et al. Pan-genome of wild and cultivated soybeans. Cell 182, 162–176.e13 (2020).
- 16. Tong, W. et al. Genomic variation of 363 diverse tea accessions unveils the genetic diversity, domestication, and structural variations associated with tea adaptation. JIPB 66, 2175–2190 (2024).
- 17. Jayakodi, M. et al. Structural variation in the pangenome of wild and domesticated barley. Nature 636, 654–662 (2024).
- 18. Wang, B. et al. De novo genome assembly and analyses of 12 founder inbred lines provide insights into maize heterosis. Nat. Genet. 55, 312–323 (2023).
- 19. Li, N. et al. Super-pangenome analyses highlight genomic diversity and structural variation across wild and cultivated tomato species. Nat. Genet. 55, 852–860 (2023).
- 20. Chen, S. et al. Gene mining and genomics-assisted breeding empowered by the pangenome of tea plant Camellia sinensis. Nat. Plants 9, 1986–1999 (2023).
- 21. Wang, P. et al. Genetic basis of high aroma and stress tolerance in the oolong tea cultivar genome. Hortic. Res. 8, 107 (2021).
- 22. Xia, E. et al. The reference genome of tea plant and resequencing of 81 diverse accessions provide insights into its genome evolution and adaptation. Mol. Plant. 13, 1013–1026 (2020).
- 23. Wu, H. et al. A high-quality Actinidia chinensis (kiwifruit) genome. Hortic. Res. 6, 117 (2019).
- 24. Tong, X.-Y. et al. ddRAD sequencing of 1076 Camellia accessions reveals the genetic diversity and population introgression of the tea plant in China. Plant Divers. 10.1016/j.pld.2025.09.002 (2025).
- 25. Jiang, C., Moon, D.-G., Ma, J. & Chen, L. Characteristics of non-volatile metabolites in fresh shoots from tea plant (Camellia sinensis) and its closely related species and varieties. BPR 2, 1–8 (2022).
- 26. Zhang, Y. et al. Structural variation reshapes population gene expression and trait variation in 2,105 Brassica napus accessions. Nat. Genet. 56, 2538–2550 (2024).
- 27. Jun, J. H., Xiao, X., Rao, X. & Dixon, R. A. Proanthocyanidin subunit composition determined by functionally diverged dioxygenases. Nat. Plants 4, 1034–1043 (2018).
- 28. Tao, Y. et al. A positive regulator CsPR10-9 confers resistance to anthracnose (Colletotrichum gloeosporioides) is negatively regulated by CsMYB72 in tea plants. Plant Cell Environ. 48, 6965–6981 (2025).
- 29. Su, Y. et al. Phased telomere-to-telomere reference genome and pangenome reveal an expansion of resistance genes during apple domestication. Plant Physiol. 195, 2799–2814 (2024).
- 30. Tariq, A. et al. In-depth exploration of the genomic diversity in tea varieties based on a newly constructed pangenome of Camellia sinensis. Plant J. 119, 2096–2115 (2024).
- 31. Wang, F. et al. Chromosome-scale genome assembly of Camellia sinensis combined with multi-omics provides insights into its responses to infestation with green leafhoppers. Front. Plant Sci. 13, 1004387 (2022).
- 32. Long, Q. et al. Population comparative genomics discovers gene gain and loss during grapevine domestication. Plant Physiol. 195, 1401–1413 (2024).
- 33. Jin, S. et al. Structural variation (SV)-based pan-genome and GWAS reveal the impacts of SVs on the speciation and diversification of allotetraploid cottons. Mol. Plant. 16, 678–693 (2023).
- 34. Cai, X. et al. Impacts of allopolyploidization and structural variation on intraspecific diversification in Brassica rapa. Genome Biol. 22, 166 (2021).
- 35. Jayakodi, M. et al. The barley pan-genome reveals the hidden legacy of mutation breeding. Nature 588, 284–289 (2020).
- 36. Cheng, L. et al. Leveraging a phased pangenome for haplotype design of hybrid potato. Nature 640, 408–417 (2025).
- 37. Long, W. et al. Genome evolution and diversity of wild and cultivated rice species. Nat. Commun. 15, 9994 (2024).
- 38. Chen, J. et al. Pangenome analysis reveals genomic variations associated with domestication traits in broomcorn millet. Nat. Genet. 55, 2243–2254 (2023).
- 39. Wang, Y. et al. Graph-based pangenome of Actinidia chinensis reveals structural variations mediating fruit degreening. Adv. Sci. 11, 2400322 (2024).
- 40. Jiao, C. et al. Pan-genome bridges wheat structural variations with habitat and breeding. Nature 637, 384–393 (2025).
- 41. Peng, Y. et al. The genomic and epigenomic landscapes of hemizygous genes across crops with contrasting reproductive systems. Proc. Natl. Acad. Sci. USA 122, e2422487122 (2025).
- 42. Zhou, Y. et al. The population genetics of structural variants in grapevine domestication. Nat. Plants 5, 965–979 (2019).
- 43. Tang, D. et al. Genome evolution and diversity of wild and cultivated potatoes. Nature 606, 535–541 (2022).
- 44. Zhang, Y. et al. Telomere-to-telomere Citrullus super-pangenome provides direction for watermelon breeding. Nat. Genet. 56, 1750–1761 (2024).
- 45. Zhou, Y. et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature 606, 527–534 (2022).
- 46. Gao, L. et al. The tomato pan-genome uncovers new genes and a rare allele regulating fruit flavor. Nat. Genet. 51, 1044–1051 (2019).
- 47. Li, X. et al. Large-scale gene expression alterations introduced by structural variation drive morphotype diversification in Brassica oleracea. Nat. Genet. 56, 517–529 (2024).
- 48. Wang, L. et al. Large-scale identification and functional analysis of NLR genes in blast resistance in the Tetep rice genome sequence. Proc. Natl. Acad. Sci. USA 116, 18479–18487 (2019).
- 49. Zhang, Y. et al. An enhancer–promoter-transcription factor module orchestrates plant immune homeostasis by constraining camalexin biosynthesis. Mol. Plant. 18, 95–113 (2025).
- 50. Tang, L. Circular consensus sequencing with long reads. Nat. Methods 16, 958–958 (2019).
- 51. Sim, S. B., Corpuz, R. L., Simmonds, T. J. & Geib, S. M. HiFiAdapterFilt, a memory efficient read processing pipeline, prevents occurrence of adapter sequence in PacBio HiFi reads and their negative impacts on genome assembly. BMC Genomics 23, 157 (2022).
- 52. Chen, Y. et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience 7, 1–6 (2018).
- 53. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
- 54. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
- 55. Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98 (2016).
- 56. Zeng, X. et al. Chromosome-level scaffolding of haplotype-resolved assemblies using Hi-C data without reference genomes. Nat. Plants 10, 1184–1200 (2024).
- 57. Alonge, M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biol. 23, 258 (2022).
- 58. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).
- 59. Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110, 462–467 (2005).
- 60. Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA 117, 9451–9457 (2020).
- 61. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
- 62. Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. In Gene Prediction, Vol. 1962 (ed. Kollmar, M.) 161–177 (Springer New York, 2019).
- 63. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
- 64. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
- 65. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
- 66. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
- 67. Boeckmann, B. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370 (2003).
- 68. Ogata, H. et al. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 27, 29–34 (1999).
- 69. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
- 70. Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
- 71. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
- 72. Suyama, M., Torrents, D. & Bork, P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34, W609–W612 (2006).
- 73. Nguyen, L.-T., Schmidt, H. A., Von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
- 74. Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
- 75. Tang, H. et al. JCVI: A versatile toolkit for comparative genomics analysis. iMeta 3, e211 (2024).
- 76. Zhang, Q.-J. et al. Rapid diversification of five Oryza AA genomes associated with rice adaptation. Proc. Natl. Acad. Sci. USA 111, E4954–E4962 (2014).
- 77. Xia, E.-H. et al. The tea tree genome provides insights into tea flavor and independent evolution of caffeine biosynthesis. Mol. Plant. 10, 866–877 (2017).
- 78. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
- 79. Goel, M., Sun, H., Jiao, W.-B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 20, 277 (2019).
- 80. Jeffares, D. C. et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 8, 14061 (2017).
- 81. Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res 19, 1639–1645 (2009).
- 82. Shi, X. et al. The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding. Hortic. Res. 10, uhad061 (2023).
- 83. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
- 84. Chen, C. et al. TBtools: an integrative toolkit developed for interactive analyses of big biological data. Mol. Plant. 13, 1194–1202 (2020).
- 85. Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).
- 86. Du, Z.-Z., He, J.-B. & Jiao, W.-B. A comprehensive benchmark of graph-based genetic variant genotyping algorithms on plant genomes for creating an accurate ensemble pipeline. Genome Biol. 25, 91 (2024).
- 87. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
- 88. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
- 89. Xie, H. et al. An enhancer-transposable element from purple leaf tea varieties underlies the transition from evergreen to purple leaf color. Plant Commun. 6, 101176 (2025).
- 90. Mei, Y. et al. Metabolites and transcriptional profiling analysis reveal the molecular mechanisms of the anthocyanin metabolism in the “Zijuan” tea plant (Camellia sinensis var. assamica). J. Agric. Food Chem. 69, 414–427 (2021).
- 91. Shen, J. et al. Establishment and verification of an efficient virus-induced gene silencing system in forsythia. Hortic. Plant J. 7, 81–88 (2021).
- 92. Peng, K., Xue, C. & Huang, X. Enhancing virus-induced gene silencing efficiency in tea plants (Camellia sinensis L.) and the functional analysis of CsPDS. Sci. Hortic. 337, 113585 (2024).
- 93. Zulfiqar, S. et al. Virus-induced gene silencing (VIGS): a powerful tool for crop improvement and its advancement towards epigenetics. IJMS 24, 5608 (2023).







