Scientific Data. 2023 May 9;10:259. doi: 10.1038/s41597-023-02171-6

Haplotype-resolved genome assembly of Coriaria nepalensis, a non-legume nitrogen-fixing shrub

Shi-Wei Zhao 1,#, Jing-Fang Guo 1,#, Lei Kong 1,#, Shuai Nie 1, Xue-Mei Yan 1, Tian-Le Shi 1, Xue-Chan Tian 1, Hai-Yao Ma 1, Yu-Tao Bao 1, Zhi-Chao Li 1, Zhao-Yang Chen 1, Ren-Gang Zhang 2, Yong-Peng Ma 2, Yousry A El-Kassaby 3, Ilga Porth 4, Wei Zhao 5, Jian-Feng Mao 1,6
PMCID: PMC10167230  PMID: 37156769

Abstract

Coriaria nepalensis Wall. (Coriariaceae) is a nitrogen-fixing shrub which forms root nodules with the actinomycete Frankia. Oils and extracts of C. nepalensis have been reported to be bacteriostatic and insecticidal, and C. nepalensis bark provides a valuable tannin resource. Here, by combining PacBio HiFi sequencing and Hi-C scaffolding techniques, we generated a haplotype-resolved chromosome-scale genome assembly for C. nepalensis. This genome assembly is approximately 620 Mb in size with a contig N50 of 11 Mb, with 99.9% of the total assembled sequences anchored to 40 pseudochromosomes. We predicted 60,862 protein-coding genes of which 99.5% were annotated from databases. We further identified 939 tRNAs, 7,297 rRNAs, and 982 ncRNAs. The chromosome-scale genome of C. nepalensis is expected to be a significant resource for understanding the genetic basis of root nodulation with Frankia, toxicity, and tannin biosynthesis.

Subject terms: Plant sciences, Computational biology and bioinformatics

Background & Summary

Coriaria nepalensis Wall. (2n = 40)1, also known as Masuri Berry, is a shrub belonging to the genus Coriaria of the unigeneric Coriariaceae family, and is mainly distributed in the Himalayan region. C. nepalensis is a non-legume nitrogen-fixing plant that forms root nodules with the actinomycete Frankia2,3. The nitrogen-fixing ability of this species contributes to its capacity to rehabilitate nutrient-poor degraded land4,5; in combination with its osmotic adjustment function and drought tolerance6,7, C. nepalensis improves abiotic conditions and provides more suitable habitat for associated plant species8–10. Furthermore, essential oils and extracts from C. nepalensis are promising drug candidates owing to their antimicrobial11,12 and anti-convulsant activities13. Traditionally, C. nepalensis has been used in folk medicine to treat ailments such as toothaches and traumatic injuries13,14. The toxic and antibacterial properties of C. nepalensis provide an interesting opportunity for the development of a potent and environmentally friendly pesticide for pest management15. Moreover, C. nepalensis bark offers an important source of hydrolysable tannin16,17, an ideal agent for tanning leather16.

The phylogenetic position of Coriariaceae is still debated18. Previous analyses based on plastid rbcL gene sequences19–21 and the complete chloroplast genome14 placed Coriariaceae close to families in Cucurbitales. However, no nuclear genome has yet been sequenced in Coriariaceae, although genome assemblies of related taxa, such as in Datiscaceae22 and Begoniaceae23, have been published.

Molecular genetic investigation of non-legume nitrogen fixation and root nodulation with Frankia requires a high-quality genome assembly and functional annotation of the host plant. Such genomic resources may also be crucial for advancing the phylogenetics of the unigeneric Coriariaceae family and for efficiently exploring the valued natural products of C. nepalensis.

Here, we report a 620 Mb haplotype-resolved chromosome-scale assembly of C. nepalensis using a combination of high-quality PacBio HiFi (High Fidelity) long reads, Illumina reads, and Hi-C sequencing. The genome was assembled with a contig N50 length of 11 Mb and 40 haplotype-resolved pseudochromosomes. We predicted 60,862 protein-coding genes, of which 99.5% were functionally annotated. Furthermore, 939 tRNAs, 7,297 rRNAs, and 982 ncRNAs were annotated. The provided genomic resources will be helpful for future functional studies in C. nepalensis.

Methods

Sample collection, library construction, and genome size estimation

Leaf tissue samples for both genome and RNA sequencing were harvested in 2020 from a mature C. nepalensis individual growing in Kunming Botanical Garden, which had been transplanted from Songming county, Kunming, Yunnan province, China. Sampled leaves were immediately flash-frozen in liquid nitrogen and stored at −80 °C until further use. High-quality genomic DNA was extracted from leaf tissue using the DNeasy Plant Mini Kit (QIAGEN, Inc.) and purified using the Mobio PowerClean Pro DNA Clean-Up Kit (MO BIO Laboratories, Inc.). DNA integrity was assessed using an Agilent 4200 Bioanalyzer. Messenger RNA (mRNA), whose sequence information was later used in protein-coding gene structure prediction, was isolated from leaves using the NEBNext Poly(A) mRNA Magnetic Isolation Module, and RNA quality was determined with the Agilent 2100 BioAnalyzer.

We combined PacBio HiFi long-read sequencing, Illumina sequencing, and Hi-C scaffolding for C. nepalensis genome assembly. Genomic DNA fragments were prepared using g-Tubes and purified using AMPure PB beads for library construction and subsequent SMRT cell PacBio HiFi long-read sequencing. Fragment molecules were screened on the BluePippin system. The library was sequenced on the PacBio Sequel II platform, and ccs (https://github.com/PacificBiosciences/ccs) v6.2.0 was used to generate PacBio HiFi data. We obtained ~14.5 Gb (~40×) of HiFi sequencing data with an average read length of 19 kb and an N50 of 21 kb (Fig. 1a). For Illumina sequencing, 150 bp paired-end PCR-free libraries were prepared and sequenced on the Illumina HiSeq X Ten platform, and ~70 Gb (~200×) of Illumina raw data were obtained. We followed a standard procedure for Hi-C library preparation24. In brief, leaf tissues were fixed with formaldehyde and the cross-linked DNA was digested with the MboI restriction enzyme. Digested fragments were then biotinylated at 5′ overhangs and joined to form chimeric junctions. After biotin-containing fragments were enriched and sheared, we constructed paired-end sequencing libraries. The Hi-C libraries were sequenced on the Illumina HiSeq X Ten platform and ~67 Gb of Hi-C raw data were obtained. For RNA sequencing, we constructed one sequencing library using the NEBNext Ultra RNA Library Prep Kit and sequenced it on the Illumina HiSeq X Ten platform, acquiring ~7.5 Gb (50 M reads) of raw data. Then, fastp25 was used for quality control of the Illumina data, removing adapters, low-quality reads, and reads shorter than 60 bp. All clean reads were used for further genome assembly and gene predictions.

Fig. 1.

Length and quality of PacBio HiFi reads and genome size survey. (a) Read length and mean Phred score distribution of PacBio HiFi reads. (b) 19-mer frequency distribution estimated from PacBio HiFi sequences: observed K-mer (raw K-mer) frequencies (in grey), fitted K-mer frequencies (in blue) with a skew normal distribution model, and the overall fit (in red) that concatenates observed and fitted K-mer frequencies.

Genomic characteristics including genome size, repeat content, and heterozygous rate were estimated based on K-mer frequencies. Through K-mer analysis (K = 19) of PacBio HiFi data with Jellyfish26 v2.3.0, an overall C. nepalensis haplotype genome size of 313.1 Mb was estimated using findGSE v1.94.R27 (Fig. 1b).
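The principle behind K-mer based genome size estimation can be illustrated with a minimal sketch. This is not the findGSE algorithm (which fits a skew normal model, as shown in Fig. 1b); it assumes a plain depth histogram, a hypothetical error cutoff `min_depth`, and the simple relation "total K-mers divided by the depth of the main peak". The function name and toy values are illustrative only.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough genome size estimate from a K-mer depth histogram.

    histogram: dict mapping depth -> number of distinct K-mers at that depth.
    min_depth: hypothetical cutoff; K-mers below it are treated as errors.
    """
    kept = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no K-mers at or above min_depth")
    # The main peak approximates the sequencing depth of single-copy sequence.
    peak_depth = max(kept, key=kept.get)
    # Total number of K-mer occurrences outside the error tail.
    total_kmers = sum(depth * n for depth, n in kept.items())
    return total_kmers / peak_depth


# Toy histogram: an error tail at depths 1-2, a main peak at depth 40,
# and a small two-copy peak at depth 80.
toy = {1: 500_000, 2: 50_000, 38: 200, 39: 400, 40: 600, 41: 400, 42: 200, 80: 50}
size = estimate_genome_size(toy)
print(size)
```

In a heterozygous genome the histogram has separate heterozygous and homozygous peaks, so the choice of peak determines whether the estimate refers to the haploid or the diploid genome; model-based tools handle this explicitly.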

De novo genome assembly

De novo assembly involved three steps: primary assembly, Hi-C scaffolding, and polishing (Fig. 2). With PacBio HiFi reads and Hi-C reads as inputs, we used hifiasm28 v0.16.1 to assemble the genome into contigs and obtained a haplotype-resolved assembly with two haplotypes for subsequent analysis. The Hi-C reads were then mapped to the assembly using Juicer29 v1.6. 3D-DNA30 (-m haploid -i 150000 -r 0 --editor-repeat-coverage 5) was used for preliminary Hi-C-assisted chromosome assembly, and Juicebox31 (version 201008) was used to manually adjust chromosome segmentation boundaries and correct misassemblies, including switch errors. Afterwards, we used 3D-DNA to re-scaffold each chromosome separately and used Juicebox to manually correct any visible errors. We used TGS-GapCloser32 v1.0.1 (--min_match 1000 --minmap_arg ' -x asm20') to fill gaps with HiFi reads (24 gaps were filled), performed three rounds of polishing using NextPolish33 v1.4.0 based on Illumina reads, and removed redundant sequences identified by Redundans34 v0.13c. Finally, a haplotype-resolved chromosome-level assembly with a total length of 620 Mb was obtained (Table 1). We obtained 40 pseudochromosomes, consistent with the chromosome number reported in a previous karyotype study1. We named the chromosomes in descending order of their lengths. Furthermore, because this haplotype-resolved assembly was produced without parental information for phasing, we arbitrarily denoted the longer chromosome of each homologous pair as haplotype genome “a” (with the character “a” at the end of the chromosome ID) and the other as haplotype genome “b” (with the character “b”).
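The naming convention described above can be sketched in a few lines. This is an illustration under assumptions, not the authors' script: pairs are ordered here by the length of their longer homolog (the text does not say which length was used for ordering), and the function name and toy lengths are hypothetical.

```python
def name_haplotype_chromosomes(pairs):
    """Assign IDs such as chr01a/chr01b to homologous chromosome pairs.

    pairs: list of (length_1, length_2) tuples, one per homologous pair.
    Returns a list of (chromosome_id, length) tuples.
    """
    # Assumption: pairs are ranked by their longer homolog, longest first.
    ordered = sorted(pairs, key=max, reverse=True)
    named = []
    for index, pair in enumerate(ordered, start=1):
        longer, shorter = max(pair), min(pair)
        named.append((f"chr{index:02d}a", longer))   # longer homolog -> "a"
        named.append((f"chr{index:02d}b", shorter))  # shorter homolog -> "b"
    return named


toy_pairs = [(12_000_000, 12_500_000), (30_000_000, 28_000_000), (9_000_000, 9_100_000)]
named = name_haplotype_chromosomes(toy_pairs)
for chrom_id, length in named:
    print(chrom_id, length)
```

Because the "a"/"b" labels depend only on length, the two resulting haplotype sets are not expected to correspond to the two parental genomes.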

Fig. 2.

Pipeline overview of genome assembly (yellow), quality control (pink), repeat annotation (purple), protein-coding gene annotation (blue), and ncRNA annotation. Boxes with different color shading represent the different software used in each analytical step.

Table 1.

Statistics of the haplotype-resolved genome assembly of C. nepalensis.

Features Statistics
Sequencing
 Raw bases of WGS-PacBio HiFi (Gb) ~14.5
 Raw bases of WGS-Illumina (Gb) ~70
 Raw bases of Hi-C (Gb) ~67
 Raw bases of RNA-seq (Gb) ~7.5
Assembly
 Genome size (Mb) 620.52
 Number of pseudochromosomes 40
 Chloroplast genome assembly (bp) 158,558
 Mitochondria genome assembly (bp) 480,951
 N50 of contigs (Mb) 10.97
 L50 of contig 22
 N50 of scaffolds (Mb) 12.9
 L50 of scaffolds 11
 Number of gaps 62
 GC content (%) 34.78
 Complete BUSCOs 1,338 (93.0%)
Annotation
 Number of protein-coding gene 60,862
 Complete BUSCOs 1,400 (97.2%)
 Average length of protein-coding gene (bp) 2,892.7
 Average length of CDS (bp) 1,324
 Average number of exons per transcript 6.3
 Number of tRNA 939
 Number of rRNA 7,297
 Number of unclassified ncRNA 982
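
For readers unfamiliar with the contiguity statistics in Table 1, the following minimal sketch shows how N50 and L50 are commonly computed from a list of contig or scaffold lengths. The function name and toy lengths are illustrative and are not taken from the assembly itself.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths.

    N50: length of the sequence at which the running total, taken from the
         longest sequence downwards, first reaches half of the total length.
    L50: number of sequences needed to reach that point.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


toy_lengths = [10, 8, 6, 4, 2]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)
```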

The assemblies of chromosomes chr01-chr03 were substantially longer than those of the remaining chromosomes. These three pairs of chromosomes were also difficult to assemble, and their Hi-C chromatin contact profiles were distinct from the others (Fig. 3a,b). They contain a large number of gaps (60 in total) in the current assembly, whereas the other chromosomes have only 2 gaps in total. Previous karyotype analysis1 showed that C. nepalensis has three pairs of long chromosomes with extended heterochromatin regions, which is consistent with the three long chromosomes revealed in the present study. A high number (679,177) of tandem array repeats with the consensus sequence “ATCATTTGCAAGTTATGCACAAAAGTTGTGTCTGTAGTGCAAAACTAGAATTCGTTCGACTTGCTTTGAAATAAGTTATTGACTTGAAATGACTCATTGAAATGATTTTAAGGTTAAACGAATGCACACTTTCCTTGCAATG” were identified on the three long chromosomes chr01-chr03 (Fig. 3c) using TRF35 v4.09.1. We found the characteristic telomeric repeat “TTTAGGG” in most chromosomes (Fig. 3c), supporting the high quality of our genome assembly.
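A simple way to check for telomeric repeats at chromosome ends is sketched below. It is only an illustration: the window size and the function name are assumptions, the approach used for Fig. 3c is not described in the text, and the sketch counts the “TTTAGGG” motif together with its reverse complement “CCCTAAA” in the first and last bases of a sequence.

```python
import re

TELOMERE = "TTTAGGG"
TELOMERE_RC = "CCCTAAA"  # reverse complement of TTTAGGG
MOTIF = re.compile(f"{TELOMERE}|{TELOMERE_RC}")


def count_terminal_telomere_repeats(sequence, window=1000):
    """Count telomeric motif copies in the first and last `window` bases.

    window is a hypothetical choice; returns (count_at_start, count_at_end).
    """
    sequence = sequence.upper()
    head = sequence[:window]
    tail = sequence[-window:]
    # findall returns non-overlapping matches, scanned left to right.
    return len(MOTIF.findall(head)), len(MOTIF.findall(tail))


# Toy chromosome: 5 repeat copies at the start, 3 at the end.
toy_chromosome = "CCCTAAA" * 5 + "ACGT" * 100 + "TTTAGGG" * 3
counts = count_terminal_telomere_repeats(toy_chromosome, window=50)
print(counts)
```

A chromosome with repeats at both ends is a candidate telomere-to-telomere sequence, although a real analysis would also require a minimum number of consecutive copies.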

Fig. 3.

Hi-C density heatmaps, genomic features and evolutionary history of C. nepalensis. (a) Hi-C chromatin contact density heatmap with a low threshold parameter (minimal mapping quality = 0). (b) Hi-C chromatin contact density heatmap with a high threshold parameter (minimal mapping quality = 1). (c) Distribution of genomic features of C. nepalensis. I: sequencing depth distribution of PacBio HiFi reads. II–IV: Density of Copia LTR-RTs, Gypsy LTR-RTs, and Mutator TEs. V, VI: Distribution of tandem arrays and telomere sequences. VII, VIII: Density of protein-coding genes and GC content. (d) Phylogenetic tree. (e) Ks dot plots of C. nepalensis haplotype genome “a” and C. sativus.

In addition, a 158,558 bp chloroplast (Pt) genome and a 480,951 bp mitochondrial (Mt) genome were assembled from the short and long reads obtained from genome sequencing using GetOrganelle36 v1.7.5.0 (Table 1).

Repeat annotation

We performed de novo transposable element (TE) annotation using EDTA37 v1.9.3 (--sensitive 1 --anno 1), which integrates homology-based and structure-based approaches for TE identification (Fig. 2). A TE library was generated and used for further repeat annotation with RepeatMasker (http://www.repeatmasker.org/RepeatMasker/) (-no_is -xsmall). The resulting soft-masked genome sequence file was used for gene prediction. A total of 428 Mb (69.0%) of the assembly was annotated as TEs (Table 2), of which 61 Mb (9.9%) were long terminal repeat (LTR) retrotransposons. Mutator transposons, with a total length of 280 Mb (45.2%), showed the highest genome occupation and a distribution similar to that of the high-occupation tandem array mentioned above (Fig. 3c). Our further analysis revealed that the sequence motif of these tandem arrays is contained within the Mutator transposons.
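Because RepeatMasker's -xsmall option writes masked repeats as lowercase letters, the masked fraction of a soft-masked sequence can be checked by counting lowercase bases. The sketch below illustrates this under assumptions: sequences are passed in as plain strings, gap characters (N/n) are excluded from the denominator by choice, and the function name is hypothetical.

```python
def softmasked_fraction(sequences):
    """Fraction of non-gap bases that are lowercase (soft-masked).

    sequences: iterable of sequence strings from a soft-masked FASTA file.
    """
    masked = 0
    total = 0
    for sequence in sequences:
        for base in sequence:
            if base in "Nn":
                continue  # design choice: ignore gap characters
            total += 1
            if base.islower():
                masked += 1
    return masked / total if total else 0.0


toy_sequences = ["ACGTacgtNN", "ttttGGGG"]
fraction = softmasked_fraction(toy_sequences)
print(f"{fraction:.1%}")
```

The value obtained this way reflects everything RepeatMasker masked, so it need not match the TE percentages in Table 2 exactly.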

Table 2.

Statistics of repeat annotation of the C. nepalensis genome.

Superfamily Number Length (bp) Percent (%)
Class I 102,847 63,608,643 10.25
    LTR/Copia 46,213 29,513,519 4.76
    LTR/Gypsy 12,994 7,928,273 1.28
    LTR/unknown 39,113 24,243,803 3.91
    nonLTR/pararetrovirus 730 347,201 0.06
    nonLTR/LINE 3,797 1,575,847 0.25
Class II 675,085 314,548,320 50.69
    TIR/hAT 17,702 5,852,661 0.94
    TIR/CACTA 23,327 7,829,412 1.26
    TIR/PIF-Harbinger 17,220 4,264,302 0.69
    TIR/Mutator 563,911 280,243,005 45.2
    TIR/Tc1_Mariner 9,638 3,344,642 0.54
    Helitron 43,287 13,014,298 2.10
Other TEs 432,103 50,184,386 8.09
Total TEs 1,210,035 428,341,349 69.03
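
The internal consistency of Table 2 can be verified with a short calculation. The lengths below are copied from the table; the assembly size of 620.52 Mb is taken from Table 1 and is a rounded value, so the recomputed percentages are approximate. Variable and function names are illustrative.

```python
# Lengths in bp, copied from Table 2.
CLASS_I = {
    "LTR/Copia": 29_513_519,
    "LTR/Gypsy": 7_928_273,
    "LTR/unknown": 24_243_803,
    "nonLTR/pararetrovirus": 347_201,
    "nonLTR/LINE": 1_575_847,
}
CLASS_II = {
    "TIR/hAT": 5_852_661,
    "TIR/CACTA": 7_829_412,
    "TIR/PIF-Harbinger": 4_264_302,
    "TIR/Mutator": 280_243_005,
    "TIR/Tc1_Mariner": 3_344_642,
    "Helitron": 13_014_298,
}
OTHER_TES = 50_184_386
ASSEMBLY_SIZE = 620_520_000  # 620.52 Mb from Table 1 (rounded)


def percent(length_bp):
    """Percentage of the assembly covered, rounded to two decimals."""
    return round(100 * length_bp / ASSEMBLY_SIZE, 2)


class_i_total = sum(CLASS_I.values())
class_ii_total = sum(CLASS_II.values())
total_te = class_i_total + class_ii_total + OTHER_TES
ltr_total = sum(v for k, v in CLASS_I.items() if k.startswith("LTR/"))

print(class_i_total, percent(class_i_total))
print(class_ii_total, percent(class_ii_total))
print(total_te, percent(total_te))
print(ltr_total, percent(ltr_total))
```

The subtotals reproduce the Class I, Class II, and Total TEs rows, and the LTR subtotal corresponds to the 61 Mb (9.9%) quoted in the text.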

Protein-coding genes prediction and other annotations

We collected 139,950 non-redundant protein sequences of the closely related species Datisca glomerata22, Begonia fuchsioides22, Cucumis sativus38, Vitis vinifera39, Prunus persica40, and Arabidopsis thaliana41 as evidence for protein homology (Fig. 2). Three strategies were used to assemble RNA-seq reads into transcripts, which were then used as transcriptional evidence for gene annotation: (1) de novo assembly was performed using Trinity42 v2.13.2; (2) genome-guided assembly was performed using Trinity after reads were mapped to the genome assembly using HISAT243 v2.2.1; and (3) another genome-guided assembly was prepared using StringTie44 v2.2.0 with reads mapped using HISAT2. We combined these three sets of transcripts and obtained 77,555 transcript sequences after removing redundant sequences with CD-HIT45 v4.8.1. Gene structure was annotated using the PASA46 v2.5.0 pipeline based on transcriptional evidence. Then, full-length gene sequences were identified using protein homology evidence. Based on the full-length gene set, a gene model for ab initio gene structure prediction was trained and optimized using AUGUSTUS47 v3.4.0.

Furthermore, the MAKER248 pipeline was used to predict putative protein-coding gene structures. We performed ab initio predictions of gene structures using AUGUSTUS v3.4.0. The transcript evidence and homologous protein evidence were aligned to the genome with BLAST+49 v2.11.0 and refined with exonerate50 v2.4.0. AUGUSTUS was used to integrate gene models from the above-mentioned gene prediction. To further improve the annotation accuracy, EVidenceModeler51 (EVM) v1.1.1 and PASA were used to integrate and update the gene prediction results. We annotated a final set of 60,862 protein-coding genes (Table 1), of which 30,622 genes were predicted for the haplotype subgenome with the longer set of chromosomes (haplotype genome “a”) and 30,240 genes for haplotype subgenome “b”. We identified 26,489 putative gene families among C. nepalensis (haplotype genome “a”), Aquilegia coerulea52, Vitis vinifera39, Averrhoa carambola53, Populus trichocarpa54, Tripterygium wilfordii55, Malus domestica56, Datisca glomerata22, Begonia fuchsioides23, Benincasa hispida57, Cucumis sativus38, and Quercus acutissima58 with OrthoFinder59 v2.5.2 (Fig. 3d). Then, 1,199 orthogroups, in each of which at least 83.3% of the species had single-copy genes, were used to infer the species tree with STAG60, and the phylogenetic position of C. nepalensis was confirmed. Ks (synonymous substitutions) dot plots of haplotype genome “a” vs genome “a” and genome “a” vs C. sativus were generated with WGDI61 v0.62 (Fig. 3e), revealing one recent, unique WGD (whole genome duplication) that is distinct from the one found in C. sativus.
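The orthogroup selection criterion can be read as "keep an orthogroup if at least 83.3% of the species (10 of 12) contribute exactly one gene". The sketch below shows one plausible implementation of that reading; the data layout, function name, and toy values are assumptions and do not reproduce the internal logic of OrthoFinder or STAG.

```python
def select_mostly_single_copy(orthogroups, n_species, min_fraction=0.833):
    """Return names of orthogroups that are single-copy in most species.

    orthogroups: dict mapping orthogroup name -> {species: gene count}.
    n_species: total number of species in the analysis.
    min_fraction: minimum fraction of species with exactly one gene.
    """
    selected = []
    for name, counts in orthogroups.items():
        single_copy_species = sum(1 for count in counts.values() if count == 1)
        if single_copy_species / n_species >= min_fraction:
            selected.append(name)
    return selected


# Toy example with six hypothetical species (sp1..sp6).
toy_orthogroups = {
    "OG1": {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 1, "sp5": 1, "sp6": 1},
    "OG2": {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 1, "sp5": 1, "sp6": 2},
    "OG3": {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 1, "sp5": 0, "sp6": 0},
}
kept = select_mostly_single_copy(toy_orthogroups, n_species=6)
print(kept)
```

With twelve species, as in this study, the threshold corresponds to ten species: 10/12 ≈ 0.8333, which passes `min_fraction=0.833`.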

BUSCO62 was used for evaluating the completeness of the gene set. Out of 1,440 conserved genes, 1,400 (97.2%) were annotated, among which 1,365 (96.9%) were complete and duplicated BUSCO genes.

Three strategies were used for functional annotation of protein-coding genes (Fig. 2, Table 3): (1) we mapped gene sequences against the eggNOG63 5.0 database using eggNOG-mapper64 v2.1.6 (--target_taxa Viridiplantae) and annotated 98.1% of the genes, of which 55.7% and 49.4% were annotated with GO and KEGG items, respectively; (2) based on sequence similarity, we annotated 98.5% of the genes using DIAMOND65 v2.0.12 (--evalue 1e-5) against the following four protein databases: Swiss_Prot66 (78.2%), TrEMBL66 (98.4%), NR67 (98.3%), and Arabidopsis thaliana genes41 (94.1%); (3) we annotated 99.1% of the genes against 14 databases using InterProScan68 v5.52-86.0 (Table 3).

Table 3.

Statistics of protein-coding gene functional annotation.

Method Database Number Percent (%)
eggNOG-mapper eggNOG 59,758 98.09
GO 33,948 55.73
KEGG_KO 30,082 49.38
KEGG_Pathway 18,595 30.52
EC 12,864 21.12
eggNOG 56,164 92.19
COG 59,758 98.09
DIAMOND 59,981 98.46
Swiss_Prot 47,641 78.20
TrEMBL 59,933 98.38
NR 59,896 98.32
A.thaliana 57,323 94.10
InterProScan 60,396 99.14
Pfam 51,101 83.88
CDD 21,628 35.50
SUPERFAMILY 40,074 65.78
Interpro 53,945 88.55
PANTHER 59,108 97.03
Gene3D 42,846 70.33
PIRSF 4,336 7.12
PRINTS 8,886 14.59
Coils 10,214 16.77
TIGRFAM 6,982 11.46
MobiDBLite 26,460 43.43
TMHMM 14,313 23.49
Phobius 20,510 33.67
SMART 19,932 32.72
Total 60,637 99.54

For non-coding RNA (ncRNA) gene prediction (Fig. 2), we identified 939 tRNAs using tRNAscan-SE69 v2.0.8, 7,297 rRNAs using Barrnap v0.9 (https://github.com/tseemann/barrnap) (--kingdom euk), and 982 other ncRNAs using Rfam70,71 16.6.

We predicted the genes in the two organelle genomes using OGAP (https://github.com/zhangrengang/OGAP). A total of 131 genes (89 protein-coding genes, 8 rRNAs, and 34 tRNAs) were annotated for the chloroplast genome, and 63 genes (42 protein-coding genes, 3 rRNAs, and 18 tRNAs) for the mitochondrial genome.

Genome comparison between haplotype assemblies

minimap272 v2.24 was used to align the haplotype assemblies to each other, and SyRI73 v1.6 was used to identify syntenic regions and structural variations (e.g., duplications, inversions, and translocations). Plotsr74 v0.5.4 was used to visualize the identified structural rearrangements (Fig. 4a). The chr01-chr03 pairs showed remarkable structural variation, whereas the other homologous chromosome pairs were highly collinear with only a few rearrangements. Syntenic regions were larger than the various types of structural variations (Fig. 4b). Sequence differences (local variation, e.g., SNPs and indels) in syntenic regions were identified (Fig. 4c). Long, highly diverged regions were unevenly distributed among chromosome pairs, but the number of sequence differences was small. Large collinear fragments between unpaired chromosomes were also detected (Fig. 4a).

Fig. 4.

Structural variation and statistics between two haplotype genome assemblies of C. nepalensis. (a) Structural variation between haplotype genomes. Subgenome “a” (chr01a-chr20a) is used as the reference sequence and subgenome “b” (chr01b-chr20b) is the query. (b) Size distributions of different types of structural variation between two haplotype assemblies. (c) Numbers and lengths of sequence differences on the syntenic region for each chromosome pair.

Data Records

The raw data from PacBio HiFi, Illumina, and Hi-C sequencing were submitted to the SRA database (SRR22412655 (ref. 75), SRR22026041 (ref. 76), SRR22026042 (ref. 77), SRR22026043 (ref. 78)). The haplotype-resolved genome assembly was deposited at GenBank with accession numbers GCA_027190085.1 (ref. 79) and GCA_027186245.1 (ref. 80). The genome assembly and gene annotation results of C. nepalensis were deposited in the figshare database (ref. 81).

Technical Validation

We mapped DNA and RNA sequencing reads to the final genome assembly to evaluate assembly quality (Fig. 2). A high read mapping rate of 99.2% was obtained when PacBio HiFi reads were mapped onto the genome using minimap2, and the sequencing depth is shown in the circos plot in Fig. 3c. We mapped the Illumina reads to the final assembly using BWA82 v0.7.17 and obtained a read mapping rate of 98.7%, and a low SNP heterozygosity level of ~0.0027% was obtained after SNPs were identified with SAMtools83 v1.13. Furthermore, a single-base error rate of ~0.0011% was obtained, and a read mapping rate of 96.2% was obtained when RNA-seq reads were mapped onto the final genome assembly using HISAT2. These high mapping rates suggest that our genome assembly has high completeness and continuity.

We performed further genome assembly quality control with Merqury84 analysis (under K = 19) (Fig. 5, Table 4) based on PacBio HiFi reads. QVs (consensus quality values) for haplotype genome “a”, haplotype genome “b”, and the two genomes combined are 46.39, 45.86, and 46.12, respectively. K-mer completeness scores for genome “a”, genome “b”, and the two genomes combined are 94.12, 93.68, and 98.87%, respectively. These results again confirm the good quality and completeness of our haplotype-resolved genome assembly.

Fig. 5.

Genome quality assessment with Merqury spectrum plot. (a) Copy number spectrum plot for haplotype assemblies of C. nepalensis. (b) Assembly spectrum plot for evaluating K-mer completeness.

Table 4.

Statistics of Merqury analysis for genome quality assessment.

Assembly QV (quality value) Error rate Completeness (%)
Genome “a” 46.39 2.30e-05 94.12
Genome “b” 45.86 2.60e-05 93.68
Genome both “a” and “b” 46.12 2.44e-05 98.87
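
The QV and error rate columns in Table 4 are linked by the standard Phred relation, QV = −10 · log10(error rate). The short sketch below converts between the two; the function names are illustrative. Because the QVs in the table are rounded to two decimals, the recomputed error rates agree with the reported ones only up to rounding.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled consensus quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# QVs reported in Table 4.
table4_qv = {"a": 46.39, "b": 45.86, "both": 46.12}
for label, qv in table4_qv.items():
    print(label, f"{qv_to_error_rate(qv):.2e}")
```

A QV of about 46 therefore corresponds to roughly one expected consensus error per 40,000 bases.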

We further performed BUSCO assessments of the assembly (Table 1), which revealed that complete core genes (including single and multiple copies) accounted for 93.0%, while missing genes accounted for only 4.9%, underscoring the good gene integrity of the assembly.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (32171816) and the National Key R&D Program of China (2022YFD2200103).

Author contributions

Jian-Feng Mao and Wei Zhao conceived and designed the study; Yong-Peng Ma collected the samples; Shi-Wei Zhao, Jing-Fang Guo, Lei Kong, Shuai Nie, Xue-Mei Yan, Tian-Le Shi, Xue-Chan Tian, Hai-Yao Ma, Yu-Tao Bao, Zhi-Chao Li, Zhao-Yang Chen, Ren-Gang Zhang performed bioinformatics; Shi-Wei Zhao drafted the manuscript; Jian-Feng Mao, Yousry A. El-Kassaby and Ilga Porth revised the manuscript. Shi-Wei Zhao, Jing-Fang Guo and Lei Kong contributed equally to this work.

Funding

Open access funding provided by Umea University.

Code availability

All data processing commands and pipelines were carried out in accordance with the instructions and guidelines provided by the relevant bioinformatic software.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Shi-Wei Zhao, Jing-Fang Guo, Lei Kong

Contributor Information

Wei Zhao, Email: zhao.wei@umu.se.

Jian-Feng Mao, Email: jianfeng.mao@umu.se.

References

  • 1.Oginuma K, Nakata M, Suzuki M, Tobe H. Karyomorphology of Coriaria (Coriariaceae): Taxonomic implications. The Botanical Magazine Tokyo. 1991;104:297–308. doi: 10.1007/BF02488383. [DOI] [Google Scholar]
  • 2.Montserrat P. Root nodules of Coriaria. Nature. 1958;182:475–475. doi: 10.1038/182475a0. [DOI] [Google Scholar]
  • 3.Hu C, Zhou P, Zhou Q, Chen H, Akkermans ADL. Nodulation and molecular characterization of pure cultures isolated from root nodules of Coriaria nepalensis. Chinese Science Bulletin. 1998;43:695–698. doi: 10.1007/BF02883580. [DOI] [Google Scholar]
  • 4.Awasthi P, Bargali K, Bargali SS, Jhariya MK. Structure and functioning of Coriaria nepalensis dominated shrublands in degraded hills of Kumaun Himalaya. I. Dry matter dynamics. Land Degradation & Development. 2022;33:1474–1494. doi: 10.1002/ldr.4235. [DOI] [Google Scholar]
  • 5.Mourya NR, Bargali K, Bargali SS. Impacts of Coriaria nepalensis colonization on vegetation structure and regeneration dynamics in a mixed conifer forest of Indian Central Himalaya. Journal of Forestry Research. 2019;30:305–317. doi: 10.1007/s11676-018-0613-x. [DOI] [Google Scholar]
  • 6.Bargali K, Tewari A. Growth and water relation parameters in drought-stressed Coriaria nepalensis seedlings. Journal of Arid Environments. 2004;58:505–512. doi: 10.1016/j.jaridenv.2004.01.002. [DOI] [Google Scholar]
  • 7.Zeng XM, Xu XL, Yi RZ, Zhong FX, Zhang YH. Sap flow and plant water sources for typical vegetation in a subtropical humid karst area of southwest China. Hydrological Processes. 2021;35:e14090. doi: 10.1002/hyp.14090. [DOI] [Google Scholar]
  • 8.Tiwari M, Singh SP, Tiwari A, Sundriyal RC. Effect of symbiotic associations on growth of host Coriaria nepalensis and its facilitative impact on oak and pine seedlings in the Central Himalaya. Forest Ecology and Management. 2003;184:141–147. doi: 10.1016/S0378-1127(03)00209-3. [DOI] [Google Scholar]
  • 9.Fang SZ, Li HY, Xie BD. Decomposition and nutrient release of four potential mulching materials for poplar plantations on upland sites. Agroforestry Systems. 2008;74:27–35. doi: 10.1007/s10457-008-9155-0. [DOI] [Google Scholar]
  • 10.Yan K, et al. Current re-vegetation patterns and restoration issues in degraded geological phosphorus-rich mountain areas: A synthetic analysis of Central Yunnan, SW China. Plant Divers. 2017;39:140–148. doi: 10.1016/j.pld.2017.04.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Ahmad A, Khan A, Kumar P, Bhatt RP, Manzoor N. Antifungal activity of Coriaria nepalensis essential oil by disrupting ergosterol biosynthesis and membrane integrity against. Candida. Yeast. 2011;28:611–617. doi: 10.1002/yea.1890. [DOI] [PubMed] [Google Scholar]
  • 12.Kumar P, et al. Antimicrobial activities of essential oil and methanol extract of Coriaria nepalensis. Nat Prod Res. 2011;25:1074–1081. doi: 10.1080/14786419.2010.529545. [DOI] [PubMed] [Google Scholar]
  • 13.Zhao F, et al. New sesquiterpenes from the roots of Coriaria nepalensis. Tetrahedron. 2012;68:6204–6210. doi: 10.1016/j.tet.2012.05.067. [DOI] [Google Scholar]
  • 14.Fang HL, Shang FN, Qian J, Duan BZ. Phylogenetic relationship and characterization of the complete chloroplast genome of the Coriaria nepalensis Wall. in China, a least concern folk medicine. Mitochondrial DNA Part B-Resources. 2020;5:1718–1719. doi: 10.1080/23802359.2020.1749179. [DOI] [Google Scholar]
  • 15.Li ML, et al. Semisynthesis and antifeedant activity of new acylated derivatives of tutin, a sesquiterpene lactone from Coriaria sinica. Heterocycles. 2007;71:1155–1162. doi: 10.3987/COM-07-11021. [DOI] [Google Scholar]
  • 16.Guo LX, Qiang TT, Ma YM, Wang K, Du K. Optimisation of tannin extraction from Coriaria nepalensis bark as a renewable resource for use in tanning. Industrial Crops and Products. 2020;149:112360. doi: 10.1016/j.indcrop.2020.112360. [DOI] [Google Scholar]
  • 17.Guo LX, Qiang TT, Ma YM, Ren LF, Dai TT. Purification and characterization of hydrolysable tannins extracted from Coriaria nepalensis bark using macroporous resin and their application in gallic acid production. Industrial Crops and Products. 2021;162:113302. doi: 10.1016/j.indcrop.2021.113302. [DOI] [Google Scholar]
  • 18.Yokoyama J, Suzuki M, Iwatsuki K, Hasebe M. Molecular phylogeny of Coriaria, with special emphasis on the disjunct distribution. Mol Phylogenet Evol. 2000;14:11–19. doi: 10.1006/mpev.1999.0672. [DOI] [PubMed] [Google Scholar]
  • 19.Chase MW, et al. Phylogenetics of seed plants: An analysis of nucleotide sequences from the plastid gene rbcL. Annals of the Missouri Botanical Garden. 1993;80:528–580. doi: 10.2307/2399846. [DOI] [Google Scholar]
  • 20.Swensen SM, Mullin BC, Chase MW. Phylogenetic affinities of Datiscaceae based on an analysis of nucleotide sequences from the plastid rbcL gene. Systematic Botany. 1994;19:157–168. doi: 10.2307/2419719. [DOI] [Google Scholar]
  • 21.Swensen SM. The evolution of actinorhizal symbioses: Evidence for multiple origins of the symbiotic association. American Journal of Botany. 1996;83:1503–1512. doi: 10.1002/j.1537-2197.1996.tb13943.x. [DOI] [Google Scholar]
  • 22.Griesmann M, et al. Phylogenomics reveals multiple losses of nitrogen-fixing root nodule symbiosis. Science. 2018;361:eaat1743. doi: 10.1126/science.aat1743. [DOI] [PubMed] [Google Scholar]
  • 23.Li L, et al. Genomes shed light on the evolution of Begonia, a mega-diverse genus. New Phytol. 2022;234:295–310. doi: 10.1111/nph.17949. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Xie T, et al. De novo plant genome assembly based on chromatin interactions: a case study of Arabidopsis thaliana. Mol Plant. 2015;8:489–492. doi: 10.1016/j.molp.2014.12.015. [DOI] [PubMed] [Google Scholar]
  • 25.Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–i890. doi: 10.1093/bioinformatics/bty560.
  • 26.Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011.
  • 27.Sun H, Ding J, Piednoël M, Schneeberger K. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics. 2017;34:550–557. doi: 10.1093/bioinformatics/btx637.
  • 28.Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18:170–175. doi: 10.1038/s41592-020-01056-5.
  • 29.Durand NC, et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 2016;3:95–98. doi: 10.1016/j.cels.2016.07.002.
  • 30.Dudchenko O, et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 2017;356:92–95. doi: 10.1126/science.aal3327.
  • 31.Durand NC, et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 2016;3:99–101. doi: 10.1016/j.cels.2015.07.012.
  • 32.Xu M, et al. TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads. Gigascience. 2020;9:giaa094. doi: 10.1093/gigascience/giaa094.
  • 33.Hu J, Fan J, Sun Z, Liu S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics. 2020;36:2253–2255. doi: 10.1093/bioinformatics/btz891.
  • 34.Pryszcz LP, Gabaldon T. Redundans: an assembly pipeline for highly heterozygous genomes. Nucleic Acids Res. 2016;44:e113. doi: 10.1093/nar/gkw294.
  • 35.Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–580. doi: 10.1093/nar/27.2.573.
  • 36.Jin JJ, et al. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol. 2020;21:241. doi: 10.1186/s13059-020-02154-5.
  • 37.Ou S, et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 2019;20:275. doi: 10.1186/s13059-019-1905-y.
  • 38.Huang S, et al. The genome of the cucumber, Cucumis sativus L. Nat Genet. 2009;41:1275–1281. doi: 10.1038/ng.475.
  • 39.Jaillon O, et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007;449:463–467. doi: 10.1038/nature06148.
  • 40.International Peach Genome Initiative, et al. The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet. 2013;45:487–494. doi: 10.1038/ng.2586.
  • 41.Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–815. doi: 10.1038/35048692.
  • 42.Grabherr MG, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–652. doi: 10.1038/nbt.1883.
  • 43.Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12:357–360. doi: 10.1038/nmeth.3317.
  • 44.Pertea M, et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33:290–295. doi: 10.1038/nbt.3122.
  • 45.Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–3152. doi: 10.1093/bioinformatics/bts565.
  • 46.Haas BJ, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003;31:5654–5666. doi: 10.1093/nar/gkg770.
  • 47.Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008;24:637–644. doi: 10.1093/bioinformatics/btn013.
  • 48.Cantarel BL, et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18:188–196. doi: 10.1101/gr.6743907.
  • 49.Camacho C, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi: 10.1186/1471-2105-10-421.
  • 50.Slater GS, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005;6:31. doi: 10.1186/1471-2105-6-31.
  • 51.Haas BJ, et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 2008;9:R7. doi: 10.1186/gb-2008-9-1-r7.
  • 52.Filiault DL, et al. The Aquilegia genome provides insight into adaptive radiation and reveals an extraordinarily polymorphic chromosome with a unique history. Elife. 2018;7:e36426. doi: 10.7554/eLife.36426.
  • 53.Wu S, et al. The genome sequence of star fruit (Averrhoa carambola). Hortic Res. 2020;7:95. doi: 10.1038/s41438-020-0307-3.
  • 54.Tuskan GA, et al. The genome of black cottonwood, Populus trichocarpa. Science. 2006;313:1596–1604. doi: 10.1126/science.1128691.
  • 55.Tu L, et al. Genome of Tripterygium wilfordii and identification of cytochrome P450 involved in triptolide biosynthesis. Nat Commun. 2020;11:971. doi: 10.1038/s41467-020-14776-1.
  • 56.Duan N, et al. Genome re-sequencing reveals the history of apple and supports a two-stage model for fruit enlargement. Nat Commun. 2017;8:249. doi: 10.1038/s41467-017-00336-7.
  • 57.Xie D, et al. The wax gourd genomes offer insights into the genetic diversity and ancestral cucurbit karyotype. Nat Commun. 2019;10:5158. doi: 10.1038/s41467-019-13185-3.
  • 58.Fu R, et al. Genome-wide analyses of introgression between two sympatric Asian oak species. Nat Ecol Evol. 2022;6:924–935. doi: 10.1038/s41559-022-01754-7.
  • 59.Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20:238. doi: 10.1186/s13059-019-1832-y.
  • 60.Emms DM, Kelly S. STAG: Species tree inference from all genes. bioRxiv. 2018;267914.
  • 61.Sun P, et al. WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol Plant. 2022;15:1841–1851. doi: 10.1016/j.molp.2022.10.018.
  • 62.Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–3212. doi: 10.1093/bioinformatics/btv351.
  • 63.Huerta-Cepas J, et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2018;47:D309–D314. doi: 10.1093/nar/gky1085.
  • 64.Huerta-Cepas J, et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-Mapper. Mol Biol Evol. 2017;34:2115–2122. doi: 10.1093/molbev/msx148.
  • 65.Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60. doi: 10.1038/nmeth.3176.
  • 66.The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2020;49:D480–D489. doi: 10.1093/nar/gkaa1100.
  • 67.NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2013;42:D7–D17. doi: 10.1093/nar/gkt1146.
  • 68.Jones P, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30:1236–1240. doi: 10.1093/bioinformatics/btu031.
  • 69.Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–964. doi: 10.1093/nar/25.5.955.
  • 70.Kalvari I, et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 2020;49:D192–D200. doi: 10.1093/nar/gkaa1047.
  • 71.Kalvari I, et al. Non-coding RNA analysis using the Rfam database. Curr Protoc Bioinformatics. 2018;62:e51. doi: 10.1002/cpbi.51.
  • 72.Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191.
  • 73.Goel M, Sun H, Jiao WB, Schneeberger K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 2019;20:277. doi: 10.1186/s13059-019-1911-0.
  • 74.Goel M, Schneeberger K. plotsr: visualizing structural similarities and rearrangements between multiple genomes. Bioinformatics. 2022;38:2922–2926. doi: 10.1093/bioinformatics/btac196.
  • 75.2022. NCBI Sequence Read Archive. SRR22412655
  • 76.2022. NCBI Sequence Read Archive. SRR22026041
  • 77.2022. NCBI Sequence Read Archive. SRR22026042
  • 78.2022. NCBI Sequence Read Archive. SRR22026043
  • 79.2022. NCBI Assembly. GCA_027190085.1
  • 80.2022. NCBI Assembly. GCA_027186245.1
  • 81.Zhao SW, 2023. Haplotype-resolved genome assembly of Coriaria nepalensis, a non-legume nitrogen-fixing shrub associated with Frankia. figshare.
  • 82.Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997v2 (2013).
  • 83.Danecek P, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10:giab008. doi: 10.1093/gigascience/giab008.
  • 84.Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020;21:245. doi: 10.1186/s13059-020-02134-9.

Data Availability Statement

All data processing commands and pipelines were carried out in accordance with the instructions and guidelines provided by the relevant bioinformatic software.


Articles from Scientific Data are provided here courtesy of Nature Publishing Group