Scientific Data. 2025 Nov 17;12:1802. doi: 10.1038/s41597-025-06107-0

Chromosome-level genome assembly of a hemiparasitic plant, Scurrula parasitica (Loranthaceae)

Mingcheng Wang 1,2, Panyue Du 3, Jiabo Liu 2, Quanjun Hu 2, Kangshan Mao 2, Milne Richard 4, Ning Miao 2
PMCID: PMC12624072  PMID: 41249201

Abstract

Scurrula parasitica (Loranthaceae) is a widespread aerial hemiparasitic plant in southwest China, recognized for its ecological roles and broad host range. As a representative of mistletoes in Santalales, it serves as a model for studying the genomic basis of aerial hemiparasitism. Here, we present a high-quality chromosome-level genome assembly of S. parasitica using PacBio high-fidelity and Hi-C sequencing. The assembled genome spans 547.41 Mb with a contig N50 of 8.32 Mb, and 97.54% of the sequence is anchored to nine pseudochromosomes. Repetitive sequences account for 64.53% of the genome. We predicted 21,837 protein-coding genes, of which 20,974 (96.05%) received functional annotations. Additionally, we identified 1,271 transcription factor genes and 8,407 non-coding RNAs. This chromosome-level assembly provides a foundational resource for investigating gene family evolution, parasitic adaptation, and genome architecture in S. parasitica. The genome assembly and associated datasets have been deposited in public repositories, enabling future comparative and functional genomic studies in parasitic angiosperms.

Subject terms: Plant evolution, Molecular evolution

Background & Summary

Approximately 1% of angiosperms are parasitic plants, either fully or partially dependent on their host plants for carbon, nutrients, and water, which they obtain through specialized structures known as haustoria [1]. These plants exhibit diverse morphological and physiological adaptations, and parasitism has evolved multiple times independently in at least 16 angiosperm families [1,2]. Parasitic strategies range from hemiparasitism, in which plants retain photosynthetic ability, to holoparasitism, in which they rely entirely on their hosts [2]. This diversity suggests complex and lineage-specific genomic adaptations [3,4].

Recent advances in genome sequencing have enabled studies on several parasitic plant species. Published genomes include hemiparasites such as Santalum album [5], Malania oleifera [6], Striga asiatica [7], Phtheirospermum japonicum [8], and Pedicularis cranolopha [9], as well as holoparasites like Cuscuta campestris [10], C. australis [11], Orobanche cumana, Phelipanche aegyptiaca [12], and Sapria himalayana [13,14]. Comparative genomic analyses have revealed shared features such as extensive gene loss, plastome and mitogenome reduction, and horizontal gene transfer from host plants [10,11,13]. However, due to the limited number of high-quality genomes, broader evolutionary patterns across parasitic lineages remain poorly understood.

Mistletoes represent a major clade of hemiparasitic plants in Santalales, where aerial parasitism evolved independently multiple times from root-parasitic ancestors [15,16]. Scurrula parasitica (Loranthaceae) is a widespread mistletoe species in southwest China. Seeds dispersed by birds and mammals germinate on host branches and form haustoria to establish parasitic relationships [17]. In contrast to root-parasitic Santalales such as Santalum album and Malania oleifera, S. parasitica parasitizes woody branches and has a broad host range, including Osmanthus, Citrus, and Camellia, making it a valuable model for investigating the genomic basis of aerial hemiparasitism.

Here, we present a high-quality, chromosome-level genome assembly of S. parasitica using PacBio high-fidelity (HiFi) and Hi-C sequencing technologies. We comprehensively annotated the genome, including repetitive sequences, protein-coding genes (PCGs), transcription factor (TF) genes, and non-coding RNAs (ncRNAs). This genome provides a foundational resource for future comparative genomic studies to explore the genetic mechanisms underlying the evolution of hemiparasitism in Santalales and to understand both convergent and divergent genomic adaptations across parasitic angiosperms.

Methods

Plant sample preparation

We collected plant material from an S. parasitica individual parasitizing Osmanthus fragrans grown at the Wangjiang Campus of Sichuan University in Chengdu, Sichuan Province, southwest China (Fig. 1a). The freshly harvested leaves were promptly washed in distilled water, immediately frozen in liquid nitrogen, and stored at −80 °C until DNA extraction. Additionally, fresh flower, stem, leaf, and fruit tissues were collected from the same individual and frozen in liquid nitrogen for RNA sequencing (RNA-seq).

Fig. 1.

Genome survey of S. parasitica. (a) Photograph of an S. parasitica individual parasitizing Osmanthus fragrans. (b) K-mer frequency distribution derived from Illumina short-read sequencing data.

Genome survey

To perform genome survey analyses, we carried out whole-genome sequencing on an Illumina NovaSeq 6000 platform (Illumina Inc., San Diego, CA, USA). Following total DNA extraction via the CTAB method [18], paired-end ReSeq libraries were prepared with an average insert size of approximately 400 bp. A total of 39.00 Gb of Illumina reads were generated (Table 1). A 19-mer frequency distribution of these reads was generated using Jellyfish v2.2.9 [19]. This analysis identified 31,259,563,762 k-mers, with a primary peak observed at a k-depth of 57 (Fig. 1b). The haploid genome size of S. parasitica was estimated to be 548.41 Mb, with a high repeat content of 64.11% and a notably low heterozygosity rate of 0.07%.
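The reported genome size is consistent with the simplest k-mer relation, in which the total number of k-mers is divided by the depth of the main peak. The text does not state which estimator was used, so the following Python sketch is only an illustrative check under that assumption. The function names are hypothetical, and dedicated survey tools typically also correct for sequencing errors and heterozygosity.

```python
def estimate_genome_size(total_kmers: int, peak_depth: int) -> float:
    """Estimate haploid genome size (bp) as total k-mers / main-peak depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


def sequencing_coverage(total_bases: float, genome_size_bp: float) -> float:
    """Return fold coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


# Values reported in the text and in Table 1.
TOTAL_KMERS = 31_259_563_762  # 19-mers counted in the Illumina reads
PEAK_DEPTH = 57               # k-depth of the primary peak

genome_size_bp = estimate_genome_size(TOTAL_KMERS, PEAK_DEPTH)
illumina_cov = sequencing_coverage(39.00e9, genome_size_bp)

print(f"Estimated genome size: {genome_size_bp / 1e6:.2f} Mb")
print(f"Illumina coverage: {illumina_cov:.2f}x")
```

Under this assumption the estimate comes to about 548.41 Mb, and dividing each data volume in Table 1 by the same genome size gives the coverage values listed there.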

Table 1.

Summary of genome and transcriptome sequencing data for S. parasitica.

Data type | Tissue | Number of reads | Total data (Gb) | Sequence coverage (×)
Illumina | Leaf | 260,000,000 | 39.00 | 71.11
PacBio HiFi | Leaf | 1,514,722 | 20.44 | 37.27
Hi-C | Young leaf | 425,069,022 | 63.76 | 116.26
RNA-seq | (Total) | 218,291,752 | 32.28 | -
RNA-seq | Leaf | 48,682,578 | 7.20 | -
RNA-seq | Stem | 55,781,704 | 8.26 | -
RNA-seq | Fruit | 66,370,214 | 9.79 | -
RNA-seq | Flower | 47,457,256 | 7.03 | -

Genome assembly

For PacBio HiFi sequencing, we isolated high-molecular-weight DNA using a modified CTAB method and prepared SMRTbell libraries following the PacBio 15-kb protocol. Subsequently, circular consensus sequencing (CCS) was performed on a PacBio Sequel II sequencing platform (Pacific Biosciences, Menlo Park, CA, USA), resulting in 20.44 Gb of HiFi reads (37.3× coverage) with an N50 length of 13,486 bp (Table 1). The HiFi long reads were processed using the CCS workflow in SMRT Link v8.0 (PacBio) and assembled into contigs using hifiasm v0.14 [20] with default parameters, resulting in 878 contigs totaling 552.42 Mb. To improve assembly accuracy, Illumina sequencing reads were aligned to the contigs using BWA v0.7.17 [21], and contigs with anomalous GC content (>50%) or insufficient coverage (<5×) were identified and removed based on the alignments. This filtering step yielded 731 contigs spanning 547.29 Mb, which were used for downstream Hi-C scaffolding analyses.
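The contig filtering rule can be illustrated with a minimal Python sketch. Only the two thresholds (GC content above 50% and coverage below 5×) come from the text. How per-contig GC content and depth were actually computed is not described, so the mean-depth field, the class and function names, and the toy contigs below are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ContigStats:
    name: str
    sequence: str
    mean_depth: float  # assumed: mean Illumina read depth from the alignments


def gc_percent(sequence: str) -> float:
    """Percentage of G and C among unambiguous A/C/G/T bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def passes_filter(contig: ContigStats,
                  max_gc: float = 50.0,
                  min_depth: float = 5.0) -> bool:
    """Keep contigs that are neither GC-anomalous (>50%) nor under-covered (<5x)."""
    return gc_percent(contig.sequence) <= max_gc and contig.mean_depth >= min_depth


# Toy contigs, invented for illustration only.
contigs = [
    ContigStats("ctg_ok", "ATATGCATTA", 36.0),
    ContigStats("ctg_high_gc", "GCGCGGCCAT", 40.0),
    ContigStats("ctg_low_depth", "ATATGCATTA", 2.5),
]
kept = [c.name for c in contigs if passes_filter(c)]
print(kept)
```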

Hi-C sequencing was then performed to generate a chromosome-level genome assembly. Hi-C libraries were prepared from more than 2 g of young leaves from the same S. parasitica plant, following standard protocols for chromatin extraction, digestion, ligation, and DNA purification. Paired-end sequencing was performed on a NovaSeq 6000 sequencing platform, resulting in 63.76 Gb of Hi-C reads (116.3× coverage) (Table 1). The Hi-C reads were mapped to the contig-level assembly using Juicebox v1.8.8 [22]. Uniquely mapped reads were subsequently used to anchor contigs into pseudochromosomes with the 3D-DNA pipeline [23]. Hi-C contact maps were visualized and manually curated in Juicebox to correct misassemblies (Fig. 2), yielding a final chromosome-level assembly of 547.41 Mb (Fig. 3; Table 2). In total, 97.54% (533.93 Mb) of the genome was anchored to nine pseudochromosomes, ranging from 55.41 Mb to 64.98 Mb (Fig. 3; Table 3). The contig and scaffold N50 values were 8.32 Mb and 59.61 Mb, respectively (Table 2).

Fig. 2.

Hi-C contact heatmaps for nine pseudochromosomes of the S. parasitica genome.

Fig. 3.

Circos plot illustrating the genomic architecture of S. parasitica. Tracks display (a) GC content, (b) repeat density, (c) LTR/Gypsy density, (d) LTR/Copia density, (e) protein-coding gene density, and (f) syntenic regions within the genome.

Table 2.

Global statistics of S. parasitica genome assembly and annotation.

Assembly
Estimated genome size (Mb) | 548.41
Total length (Mb) | 547.41
Number of pseudo-chromosomes | 9
Total length of pseudo-chromosomes (Mb) | 533.93
Number of contigs | 731
Number of scaffolds | 489
Contig N50 (Mb) | 8.32
Scaffold N50 (Mb) | 59.61
Longest contig (Mb) | 34.36
Shortest contig (bp) | 12,021
LTR assembly index | 15.93
BUSCO completeness (%) | 93.87
Annotation
GC content (%) | 39.95
Repeat content (%) | 64.53
Number of protein-coding genes | 21,837
Average gene length (bp) | 4,561
Average coding sequence length (bp) | 1,283
Average exon length (bp) | 211
Average intron length (bp) | 644
Functionally annotated genes | 20,974
BUSCO completeness (%) | 89.59

Table 3.

Summary of nine pseudochromosomes of the final S. parasitica assembly.

Pseudochromosome | Length (bp) | Number of contigs | Number of genes
Chr01 | 64,981,491 | 14 | 2,921
Chr02 | 63,577,551 | 13 | 2,459
Chr03 | 61,532,724 | 15 | 2,322
Chr04 | 60,875,880 | 49 | 2,329
Chr05 | 59,613,188 | 4 | 2,329
Chr06 | 56,205,050 | 76 | 2,231
Chr07 | 55,927,557 | 13 | 2,354
Chr08 | 55,804,119 | 16 | 2,233
Chr09 | 55,407,819 | 16 | 2,272
Total | 533,925,379 | 216 | 21,450
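The scaffold N50 and anchoring rate quoted in the text can be cross-checked against the pseudochromosome lengths in Table 3. The Python sketch below is an illustrative check, not the authors' procedure. It assumes a total assembly length of 547,410,000 bp (the rounded 547.41 Mb figure, since the exact base count is not given) and relies on every unplaced scaffold being shorter than the shortest pseudochromosome, which must hold because the unplaced sequence totals only about 13.5 Mb. The function and constant names are hypothetical.

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], total_length: Optional[int] = None) -> int:
    """Length of the sequence at which the cumulative sum, taken from the
    longest sequence downward, first reaches half of the total length."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered) if total_length is None else total_length
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("supplied lengths do not reach half of total_length")


# Pseudochromosome lengths from Table 3 (bp).
CHROM_LENGTHS = {
    "Chr01": 64_981_491, "Chr02": 63_577_551, "Chr03": 61_532_724,
    "Chr04": 60_875_880, "Chr05": 59_613_188, "Chr06": 56_205_050,
    "Chr07": 55_927_557, "Chr08": 55_804_119, "Chr09": 55_407_819,
}
ASSEMBLY_LENGTH = 547_410_000  # assumed from the rounded 547.41 Mb total
GENES_ON_CHROMS = 21_450       # Table 3 total

anchored_bp = sum(CHROM_LENGTHS.values())
scaffold_n50 = n50(CHROM_LENGTHS.values(), total_length=ASSEMBLY_LENGTH)
anchored_pct = 100 * anchored_bp / ASSEMBLY_LENGTH
genes_per_mb = GENES_ON_CHROMS / (anchored_bp / 1e6)

print(f"Scaffold N50: {scaffold_n50:,} bp")
print(f"Anchored: {anchored_pct:.2f}%")
print(f"Gene density: {genes_per_mb:.1f} genes per Mb")
```

With these inputs the N50 falls on Chr05 (59,613,188 bp), matching the 59.61 Mb scaffold N50 in Table 2.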

Genome annotation

Genome annotation began with the identification of repetitive sequences. A de novo repeat library was constructed using RepeatModeler v2.0.1 [24] based on the genome assembly. This library was subsequently merged with the green plant repeat dataset from the Repbase database v22.11 [25]. We then used RepeatMasker v4.1.0 [26] to identify repetitive elements based on sequence homology. In total, we identified 353.26 Mb of repetitive sequences, accounting for 64.53% of the S. parasitica genome (Table 4). Among the identified repetitive elements, long terminal repeat retrotransposons (LTR-RTs) were the most abundant, comprising 251.45 Mb (45.93%) of the genome. Within the LTR-RT class, Gypsy and Copia elements were the most prominent, totaling 250.14 Mb. Additionally, 74.40 Mb (13.59%) of repeats remained unclassified, suggesting the presence of species-specific or novel repeat types. Furthermore, DNA transposons accounted for 22.01 Mb (4.02%) of the genome, long interspersed nuclear elements (LINEs) for 4.88 Mb, short interspersed nuclear elements (SINEs) for 1.16 Mb, and other repeat types for 0.42 Mb (Table 4).

Table 4.

Classifications of repetitive elements in the S. parasitica genome.

Type | Total length (bp) | % of genome
DNA | 22,008,454 | 4.02
LINE | 4,877,987 | 0.89
SINE | 1,161,050 | 0.21
LTR | 251,449,511 | 45.93
  Gypsy | 143,403,054 | 26.20
  Copia | 106,734,185 | 19.50
  Other | 1,312,272 | 0.24
Satellite | 220,708 | 0.04
Simple repeat | 193,952 | 0.04
Low complexity | 12,509 | 0.00
Unknown | 74,399,362 | 13.59
Total | 353,262,388 | 64.53

After masking all repetitive elements in the S. parasitica genome, we employed three complementary approaches to predict the PCGs. For transcriptome-based annotation, total RNA was extracted from all fresh tissues using the TRIzol reagent. The NEBNext Ultra II RNA Library Prep Kit was used to generate RNA-seq libraries after removing residual DNA. These libraries were then sequenced on an Illumina NovaSeq 6000 platform, generating 32.28 Gb of RNA-seq data (Table 1). The RNA-seq reads were de novo assembled into transcripts using Trinity v2.8.4 [27]. The resulting transcripts were aligned to the repeat-masked genome using PASA v2.3.3 [28], and the alignment results were used to generate gene structure predictions. For homologous protein annotation, we aligned protein sequences from several representative species (Santalum album [5], Malania oleifera [6], Arabidopsis thaliana [29], Populus trichocarpa [30], Vitis vinifera [31], and Theobroma cacao [32]) to the S. parasitica genome using TBLASTN v2.2.31 [33]. Gene models were then predicted based on these alignments using GeneWise v2.4.1 [34]. For ab initio gene prediction, high-confidence transcripts from PASA exceeding 1,500 bp in length and containing more than two exons were selected solely to train species-specific parameters for AUGUSTUS v3.2.3 [35]. The trained AUGUSTUS model was then applied to predict genes across the entire genome without applying any length or exon-number filters, ensuring that all potential PCGs were considered. Finally, we used EVidenceModeler v1.1.1 [36] to integrate gene models from the three approaches into a consensus, non-redundant gene set.
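The selection of AUGUSTUS training transcripts described above amounts to a simple two-condition filter, sketched below in Python. Only the two cut-offs (longer than 1,500 bp, more than two exons) come from the text. Measuring transcript length as the summed exon length is an assumption, and the class, function, and transcript names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transcript:
    name: str
    exons: List[Tuple[int, int]]  # 1-based inclusive (start, end) coordinates

    @property
    def length(self) -> int:
        """Spliced length, assumed here to be the sum of exon lengths."""
        return sum(end - start + 1 for start, end in self.exons)


def is_training_candidate(transcript: Transcript,
                          min_length: int = 1500,
                          min_exons: int = 2) -> bool:
    """True if the transcript exceeds min_length bp and has more than min_exons exons."""
    return transcript.length > min_length and len(transcript.exons) > min_exons


# Toy transcripts, invented for illustration only.
transcripts = [
    Transcript("t_long_3exons", [(1, 600), (801, 1400), (1601, 2200)]),
    Transcript("t_long_2exons", [(1, 1000), (1501, 2500)]),
    Transcript("t_short_3exons", [(1, 300), (501, 800), (1001, 1300)]),
]
training_set = [t.name for t in transcripts if is_training_candidate(t)]
print(training_set)
```

As the text notes, a filter of this kind applies only to the training set; the trained model is then run genome-wide without these restrictions.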

We predicted 21,837 PCGs in the S. parasitica genome, with 21,450 (98.23%) located on the nine pseudochromosomes at a density of 40.2 genes per Mb (Table 3). The average lengths of the predicted transcripts, coding sequences (CDSs), exons, and introns were 4,561 bp, 1,283 bp, 211 bp, and 644 bp, respectively (Table 2).

To investigate potential whole-genome duplication (WGD) events, we conducted an all-against-all BLASTP search using protein sequences from S. parasitica and Santalum album. Syntenic blocks were identified using MCScanX v1.1 [37] with default parameters, and non-synonymous (Ka) and synonymous (Ks) substitution rates were calculated for syntenic gene pairs using the ‘add_ka_and_ks_to_collinearity.pl’ script from MCScanX. We observed a major peak around 0.73 in the Ks distribution of orthologs between S. parasitica and Santalum album. This value is lower, and the corresponding divergence therefore younger, than the Ks peak of paralogs within S. parasitica (0.78), indicating that no independent WGD event occurred in the S. parasitica genome after its split from Santalum album (Fig. 4). The inter-chromosomal synteny shown in Fig. 3 therefore likely reflects ancient WGD events and more recent segmental duplications.

Functional annotation, performed by aligning the protein sequences against the Swiss-Prot, TrEMBL [38], InterPro [39], and KEGG [40] databases, successfully annotated 96.05% of the genes (Table 5). We identified 1,271 TF genes (5.82% of PCGs) using PlantTFDB v5.0 [41] (Fig. 5). Additionally, 8,407 ncRNAs with a total size of 0.98 Mb were identified using tRNAscan-SE v2.0 [42] for tRNAs, Infernal v1.1.2 [43] for miRNAs and snRNAs, and BLASTN v2.2.31 against the Rfam database [44] for rRNAs, comprising 3,821 tRNAs, 3,076 snRNAs, 1,447 rRNAs, and 63 miRNAs (Table 6).

Fig. 4.

Synonymous substitution rate distributions of paralogous and orthologous gene pairs in S. parasitica and Santalum album.
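The reasoning behind the Ks comparison can be made concrete with a small Python sketch. It locates the main peak of a Ks distribution as the most populated histogram bin and then compares the paralog and ortholog peaks. The bin width, function names, and toy Ks values are assumptions for illustration; only the two peak values (0.78 and 0.73) come from the text, and the comparison ignores differences in substitution rate between lineages.

```python
from collections import Counter
from typing import Iterable


def ks_peak(ks_values: Iterable[float], bin_width: float = 0.05) -> float:
    """Return the centre of the most populated Ks histogram bin."""
    counts = Counter(int(ks / bin_width) for ks in ks_values)
    if not counts:
        raise ValueError("no Ks values supplied")
    best_bin, _ = max(counts.items(), key=lambda item: item[1])
    return (best_bin + 0.5) * bin_width


def duplication_predates_split(paralog_peak: float, ortholog_peak: float) -> bool:
    """A paralog Ks peak larger than the ortholog peak suggests that the
    duplication is older than the divergence of the two species."""
    return paralog_peak > ortholog_peak


# Toy Ks values, invented to show how the peak is located.
toy_ks = [0.71, 0.72, 0.74, 0.31, 1.22]
print(f"Toy peak: {ks_peak(toy_ks):.3f}")

# Peak positions reported in the text.
PARALOG_PEAK = 0.78    # paralogs within S. parasitica
ORTHOLOG_PEAK = 0.73   # orthologs between S. parasitica and Santalum album
print(duplication_predates_split(PARALOG_PEAK, ORTHOLOG_PEAK))
```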

Table 5.

Functional annotation of protein-coding genes in the S. parasitica genome.

Category | Number of genes | Percentage (%)
Total | 21,837 | -
Annotated | 20,974 | 96.05
  InterPro | 20,824 | 95.36
  KEGG | 7,288 | 33.37
  SwissProt | 13,692 | 62.70
  TrEMBL | 18,799 | 86.09
  GO | 15,384 | 70.45
Unannotated | 863 | 3.95

Fig. 5.

Distribution of the top 30 transcription factor families identified in the S. parasitica genome.

Table 6.

Summary of non-coding RNAs in the S. parasitica genome.

Type | Number | Average length (bp) | Total length (bp)
miRNA | 63 | 133.02 | 8,380
tRNA | 3,821 | 101.50 | 387,827
rRNA | 1,447 | 178.18 | 257,833
  28S | 152 | 124.01 | 18,850
  18S | 105 | 953.20 | 100,086
  5.8S | 49 | 148.20 | 7,262
  5S | 1,141 | 115.37 | 131,635
snRNA | 3,076 | 107.52 | 330,720
  C/D-box | 2,961 | 106.19 | 314,418
  H/ACA-box | 46 | 131.72 | 6,059
  Splicing | 69 | 148.45 | 10,243

Data Records

The genome assembly of S. parasitica and the associated raw sequence data were made publicly available through the NCBI database under BioProject PRJNA1266877 [45]. The genome assembly is available in GenBank under accession number JBPAPV000000000 [46]. Raw sequencing data, including Illumina, PacBio HiFi, and Hi-C reads, are available in the Sequence Read Archive (SRA) under accession numbers SRR33755972 [47], SRR33776327 [48], and SRR33676195 [49], respectively. RNA-seq reads were deposited under the SRA accession numbers SRR33745685–SRR33745688 [50–53]. Genome assembly and annotations of repetitive elements, gene structures, and functional features have also been archived in Figshare [54].

Technical Validation

We employed a variety of approaches and metrics to determine the integrity and accuracy of the final S. parasitica genome assembly. First, using BUSCO v3.0.2 [55], we evaluated the presence of 1,614 conserved genes from the Embryophyta odb10 dataset. The results showed that 93.87% of complete BUSCO genes were identified at the assembly level, while 89.59% were detected at the protein level (Table 2). Second, we assessed assembly continuity by calculating the long terminal repeat (LTR) Assembly Index (LAI) using LTR_retriever v2.8 [56]. The assembly achieved an overall LAI score of 15.93, indicating reference-level genome quality (Table 2). Third, Illumina short-read data were mapped to the final assembly using BWA. The reads mapped at a rate of 99.69% and covered 99.96% of the genome, with 99.58% of the assembly covered at a depth of at least 20×. Finally, we searched for Arabidopsis-type telomeres (TTTAGGG)n [57] at the ends of each pseudochromosome, requiring a minimum of five tandem repeat copies. Seven of the nine pseudochromosomes contained telomeric sequences at both ends (Table 7). Based on this comprehensive set of evidence, we conclude that the S. parasitica genome assembly is of high quality and utility.

Table 7.

Summary of Arabidopsis-type telomeres at both ends of S. parasitica pseudochromosomes.

Pseudochromosome | Pseudochromosome length (bp) | Start position | End position | Telomere length (bp) | Type
Chr02 | 63,577,551 | 2 | 2,444 | 2,443 | CCCTAAA
Chr02 | 63,577,551 | 63,573,956 | 63,577,175 | 3,220 | TTTAGGG
Chr03 | 61,532,724 | 1,987 | 2,238 | 252 | CCCTAAA
Chr03 | 61,532,724 | 61,527,764 | 61,527,805 | 42 | TTTAGGG
Chr04 | 60,875,880 | 63,841 | 63,868 | 28 | CCCTAAA
Chr04 | 60,875,880 | 59,956,883 | 59,956,910 | 28 | TTTAGGG
Chr05 | 59,613,188 | 4 | 1,179 | 1,176 | CCCTAAA
Chr06 | 56,205,050 | 1,895,326 | 1,909,997 | 14,672 | CCCTAAA
Chr06 | 56,205,050 | 56,202,420 | 56,202,636 | 217 | TTTAGGG
Chr07 | 55,927,557 | 49,113 | 49,189 | 77 | CCCTAAA
Chr07 | 55,927,557 | 55,927,337 | 55,927,525 | 189 | TTTAGGG
Chr08 | 55,804,119 | 145,613 | 145,661 | 49 | TTTAGGG
Chr08 | 55,804,119 | 55,803,903 | 55,804,119 | 217 | TTTAGGG
Chr09 | 55,407,819 | 1,091 | 1,160 | 70 | CCCTAAA
Chr09 | 55,407,819 | 55,402,095 | 55,407,750 | 5,656 | TTTAGGG
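A telomere search of the kind summarized in Table 7 can be sketched in Python with a regular expression that looks for tandem runs of the TTTAGGG unit or its reverse complement CCCTAAA. The tool, search window, and any tolerance for imperfect repeats used in the study are not described, so this is only a minimal exact-match sketch. The function names and toy sequence are hypothetical, and the five-copy minimum follows the threshold stated in the text. Coordinates are reported as 1-based inclusive positions, the convention that matches the lengths in Table 7 (end minus start plus one).

```python
import re
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_telomere_runs(sequence: str,
                       unit: str = "TTTAGGG",
                       min_copies: int = 5) -> List[Tuple[int, int, int, str]]:
    """Return (start, end, length, motif) for each exact tandem run of the
    unit or its reverse complement with at least min_copies copies."""
    unit = unit.upper()
    rc_unit = reverse_complement(unit)
    pattern = re.compile(
        f"(?:{unit}){{{min_copies},}}|(?:{rc_unit}){{{min_copies},}}"
    )
    runs = []
    for match in pattern.finditer(sequence.upper()):
        motif = unit if match.group().startswith(unit) else rc_unit
        start, end = match.start() + 1, match.end()  # 1-based inclusive
        runs.append((start, end, end - start + 1, motif))
    return runs


# Toy sequence, invented for illustration: six CCCTAAA copies at the start,
# a short spacer, and five TTTAGGG copies at the end.
toy = "CCCTAAA" * 6 + "ACGTACGTAC" + "TTTAGGG" * 5
for run in find_telomere_runs(toy):
    print(run)
```

In practice the search would be restricted to a window near each pseudochromosome end, with the window size chosen by the analyst.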

Acknowledgements

This research was jointly funded by China’s National Natural Science Foundation (U24A20355) and the National Key Research & Development Program of China (2016YFD0600203).

Author contributions

N.M. and M.W. designed the study. M.W., P.D., J.L., and Q.H. performed the data analyses and drafted the manuscript. Q.H., K.M., R.M., and N.M. revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Data availability

The genome assembly of S. parasitica has been deposited in GenBank under accession number JBPAPV000000000. All associated raw sequencing data, including Illumina, PacBio HiFi, Hi-C, and RNA-seq reads, are available under NCBI BioProject PRJNA1266877. Genome assembly and annotations have also been archived in Figshare (10.6084/m9.figshare.29210405.v2).

Code availability

No specific script was generated in this study. All commands and pipelines for data analyses followed the manuals and protocols of the relevant bioinformatics software.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Associated Data


Data Citations

  1. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR33755972 (2025).
  2. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR33776327 (2025).
  3. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR33676195 (2025).
  4. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR33745685 (2025).
  5. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR33745686 (2025).
  6. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR33745687 (2025).
  7. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR33745688 (2025).
  8. Wang, M. Chromosome-level genome assembly of a hemiparasitic plant, Scurrula parasitica (Loranthaceae). Figshare. doi: 10.6084/m9.figshare.29210405.v2 (2025).



Articles from Scientific Data are provided here courtesy of Nature Publishing Group
