Scientific Data. 2024 Nov 26;11:1287. doi: 10.1038/s41597-024-04150-x

Chromosome-level genome assembly of Cryptosporidium parvum by long-read sequencing of ten oocysts

Yuancai Chen 1, Jianying Huang 1, Huikai Qin 1, Kaihui Zhang 1, Yin Fu 1, Junqiang Li 1, Rongjun Wang 1, Kai Chen 2, Jie Xiong 2,3, Wei Miao 2,4, Guangying Wang 2, Longxian Zhang 1,5,6
PMCID: PMC11599830  PMID: 39592642

Abstract

Cryptosporidium parvum is a zoonotic parasite of the intestine and poses a threat to human and animal health. However, it is difficult to obtain a large number of oocysts for genome sequencing using in vitro culture. To address this challenge, we employed whole-genome amplification of 10 oocysts followed by long-read sequencing and obtained a high-quality genome assembly of the C. parvum IIdA19G1 subtype isolated from a pre-weaning calf with diarrhea. The assembled genome was 9.13 Mb long and encompassed eight chromosomes, six of which were capped by telomeric sequences at one or both ends. In total, 3,915 protein-coding genes were predicted, and the assembly showed high completeness, with 98.2% complete single-copy BUSCO genes. To our knowledge, this is the first chromosome-level genome assembly of C. parvum achieved through the combined use of whole-genome amplification of 10 oocysts and long-read sequencing. This assembly advances our understanding of the genomic landscape of this zoonotic intestinal parasite and provides a valuable resource for comparative genomics and evolutionary analyses within the Cryptosporidium clade.

Subject terms: Genome informatics, Parasite genomics

Background & Summary

Cryptosporidium spp. are parasitic apicomplexans that cause moderate-to-severe diarrhea in humans and animals [1]. The lack of widely efficacious medications and the absence of a vaccine necessitate heavy reliance on infection prevention for the management of cryptosporidiosis, highlighting the urgent need for innovative interventions [2,3]. Cryptosporidium species have been detected in 155 mammalian species, including primates [4,5]. Currently, at least 44 species of Cryptosporidium have been identified [6]. Several species, including Cryptosporidium parvum, Cryptosporidium ubiquitum, and Cryptosporidium muris, exhibit wide host ranges, leading to zoonotic infections in conjunction with other Cryptosporidium spp. [7]. Whole-genome sequencing (WGS) and comparative genomic analysis have been employed to elucidate the genetic basis of variations in host range among different species of Cryptosporidium, as well as the process of host adaptation within each species [8-10]. WGS analysis has become more prevalent in the characterization of Cryptosporidium owing to the emergence of next-generation sequencing (NGS) technologies. A total of 15 species have been subjected to genome sequencing, including C. parvum, Cryptosporidium hominis, C. ubiquitum, Cryptosporidium meleagridis, and others. The majority of the available genomic sequence data (19 sequences) pertain to the zoonotic C. parvum, yet only two of these sequences have been annotated [11]. The first comprehensive genome assembly for C. parvum Iowa II, generated by random shotgun sequencing, was released in 2004. This approach yielded a total of 9.1 Mb of DNA sequence distributed across all eight chromosomes [12]. According to previous studies, the genetic divergence between C. parvum and C. hominis is approximately 3-5% at the DNA level [13].

One of the primary challenges in genomics research on Cryptosporidium spp. is the limited availability of adequately purified oocysts in sufficient quantities for NGS analysis, primarily because no in vitro culture system is capable of propagating the parasites. Previous WGS analyses of Cryptosporidium have been conducted using oocysts purified from infected laboratory animals [12,14,15]. Troell et al. [16] sequenced the genome of a single Cryptosporidium oocyst and carried out a comprehensive whole-genome analysis by comparison with a de novo assembly of the reference population genome. That work was a significant milestone because it established the feasibility of acquiring high-quality genomic data, with both extensive coverage and high accuracy, from single-celled eukaryotes [16]. However, previous single-oocyst studies of Cryptosporidium relied on NGS alone and did not assemble the genome at the chromosome level.

Here, we aimed to address this limitation by generating a reference genome for C. parvum using long-read sequencing data from the Oxford Nanopore Technologies (ONT) and PacBio high-fidelity (HiFi) sequencing platforms, along with error correction using short-read data. The assembled genome of C. parvum was 9.13 Mb in length and showed high completeness, with 98.2% complete single-copy BUSCO genes. A total of 3,915 protein-coding genes were predicted, of which 3,666 (93.6%) were functionally annotated. This study is an attempt to produce a high-quality chromosome-level genome assembly of a Cryptosporidium species by amplifying the DNA of 10 oocysts and applying long-read sequencing, a strategy that might also be effective for genome sequencing projects on other difficult-to-collect or uncultivable pathogens.

Methods

Sample collection and genome sequencing

The Cryptosporidium strain was isolated from a calf with pre-weaning diarrhea in Henan, China, and identified as C. parvum using the SSU rRNA gene [17]. It was then subtyped by sequence analysis of the 60 kDa glycoprotein gene [18] and identified as the IIdA19G1 subtype. Oocysts were purified using a three-step procedure (Fig. 1): filtration of raw feces through an 80-mesh iron sieve, sucrose gradient centrifugation, and cesium chloride gradient centrifugation [19,20]. Purified oocyst suspension (6 μL) was drawn up with a 10 μL pipette and dispensed onto a glass petri dish. Under an inverted Olympus microscope at 60 × (OLYMPUS-BX53, Japan), single C. parvum oocysts were isolated one at a time using a three-axis hydraulic micromanipulator (World Precision Instruments Inc., USA). In this study, 10 oocysts were selected and pooled into a PCR tube containing 4 μL PBS buffer (Fig. 1).

Fig. 1. The purification and collection process of oocysts. (Yellow arrow: C. parvum oocyst.)

The pooled 10-oocyst sample was then lysed and subjected to whole-genome amplification (WGA) using the REPLI-g Single Cell Kit (QIAGEN, Germany), which is based on multiple displacement amplification. The WGA products were purified using Agencourt AMPure XP beads (BECKMAN, USA) to remove dNTPs, primers, primer dimers, salt ions, and other impurities. The concentration of the WGA product measured with a NanoDrop One (Thermo Fisher Scientific, USA) was 762 ng/μL. The total quantity measured with a Qubit 3.0 (Invitrogen, USA) was 30 μg, and the Nc/Qc (NanoDrop/Qubit) ratio was 1.2.

The high-quality amplified DNA was used to construct the genomic library, which was size-selected using BluePippin (Sage Science, USA). The purified and size-selected library was then sequenced on the Pacific Biosciences Sequel II platform (HiFi) in continuous long-read mode (Pacific Biosciences, USA) and on the PromethION 48 sequencer (ONT, UK), in each case following the manufacturer's instructions. After removal of adaptors and chimeric reads, a total of 3.5 Gb (386 × coverage) of PacBio HiFi reads and 8.8 Gb (967 × coverage) of ONT long reads were obtained (Table 1). For short-read sequencing, the library was prepared from 50 ng of fragmented DNA using the MGIEasy Universal DNA Library Prep Kit (MGI, Shenzhen, China) and sequenced on the MGISEQ-2000 platform (BGI, Shenzhen, China). About 1.6 Gb (173 × coverage) of clean 150-bp paired-end reads were generated on the MGI platform (Table 1).

Table 1. Sequencing data used for the genome assembly of C. parvum.

| Sequencing technology | MGI | PacBio | ONT |
|---|---|---|---|
| Clean data (Gb) | 1.6 | 3.5 | 8.8 |
| Read length, mean (bp) | 150 | 4,949 | 5,807 |
| Read length, N50 (bp) | 150 | 5,105 | 6,535 |
| Read length, max (bp) | 150 | 25,327 | 92,140 |
| Depth (×) | 173 | 386 | 967 |
| GC content (%) | 32.2 | 31.1 | 31.9 |
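The depth values in Table 1 follow from dividing the sequenced bases by the assembly length. The sketch below reproduces that arithmetic; it assumes the per-platform base counts reported in Table 5 ("Total reads (bp)") correspond to the clean data in Table 1, and it takes the genome size from Table 2. The names `GENOME_SIZE_BP`, `CLEAN_BASES`, and `sequencing_depth` are illustrative and do not come from the study.

```python
# Assembly length in bp (Table 2, this study).
GENOME_SIZE_BP = 9_128_570

# Sequenced bases per platform (assumed equal to "Total reads (bp)" in Table 5).
CLEAN_BASES = {
    "MGI": 1_576_020_900,
    "PacBio": 3_519_749_056,
    "ONT": 8_825_522_949,
}


def sequencing_depth(total_bases: int, genome_size: int = GENOME_SIZE_BP) -> float:
    """Return the average fold coverage: sequenced bases / genome length."""
    return total_bases / genome_size


# Rounded to whole-fold coverage, as reported in Table 1.
depths = {platform: round(sequencing_depth(b)) for platform, b in CLEAN_BASES.items()}

for platform, fold in depths.items():
    print(f"{platform}: {fold}x")
```

Under these assumptions the rounded values agree with the 173 ×, 386 ×, and 967 × reported in the table.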

De novo assembly

We first used SACRA v.2.0 [21] to split chimeric long reads derived from multiple displacement amplification and fastp v.0.20.1 [22] to trim adapters and low-quality bases from the short reads. SACRA identified and split 486,818 chimera-containing reads in the PacBio data and 1,394,568 in the ONT data. The clean long reads from the ONT and PacBio platforms were independently assembled using NextDenovo v.2.5.2 (https://github.com/Nextomics) and Canu v.2.2.2 [23] with default parameters (Fig. 2). To improve assembly contiguity, the outputs for each platform were merged using Quickmerge v.0.3 with default parameters (https://github.com/mahulchak/quickmerge). The merged assembly was then polished in two rounds with Pilon v.1.24 (https://github.com/broadinstitute/pilon) using the clean short reads [24] (Fig. 2). For this, the short reads were first mapped to the assembly using BWA v.0.7.10 [25] with default parameters, and only reads with a mapping quality of at least 30 were used for polishing (--minmq 30). The polished assemblies from the two sequencing platforms were further merged using Quickmerge v.0.3. The final assembly had a total length of 9.13 Mb across eight contigs, six of which were capped by telomeric repeats (TTTAGG)n at one or both ends (Table 2).
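The telomere check described above can be illustrated with a short script that looks for tandem copies of the TTTAGG unit near each contig end. This is only a sketch of the idea, not the procedure used in the study: the 200-bp window, the three-copy threshold, and the function names are assumptions chosen for illustration, and both strands are searched because the repeat reads as CCTAAA on the opposite strand.

```python
import re

TELOMERE_UNIT = "TTTAGG"  # repeat unit reported in the text


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_telomere(end_seq: str, unit: str = TELOMERE_UNIT, min_copies: int = 3) -> bool:
    """True if end_seq holds at least min_copies tandem copies of unit on either strand."""
    end_seq = end_seq.upper()
    for motif in (unit, reverse_complement(unit)):
        # e.g. (?:TTTAGG){3,} matches three or more back-to-back copies
        if re.search(f"(?:{motif}){{{min_copies},}}", end_seq):
            return True
    return False


def telomere_status(contig: str, window: int = 200, min_copies: int = 3) -> tuple:
    """Return (start_has_telomere, end_has_telomere) for one contig sequence."""
    return (
        has_telomere(contig[:window], min_copies=min_copies),
        has_telomere(contig[-window:], min_copies=min_copies),
    )


# Toy contig: telomeric repeats at both ends around a non-telomeric body.
toy = "CCTAAA" * 5 + "ACGT" * 100 + "TTTAGG" * 5
print(telomere_status(toy))
```

Counting contigs whose status is `(True, True)` versus exactly one `True` would give the "two telomeres" and "one telomere" rows of a summary like Table 2.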

Fig. 2. Framework of genome assembly.

Table 2. Comparison between the assembled and published C. parvum reference genomes.

| Statistic | C. parvum (This study) | C. parvum (Iowa II [68]) | C. parvum (IOWA-ATCC [69]) |
|---|---|---|---|
| Number of contigs | 8 | 8 | 8 |
| Genome size (bp) | 9,128,570 | 9,102,324 | 9,122,263 |
| Largest contig (bp) | 1,336,160 | 1,344,712 | 1,332,634 |
| Contigs with two telomeres | 1 | 3 | 6 |
| Contigs with one telomere | 5 | 3 | 1 |
| N50 (bp) | 1,106,866 | 1,104,417 | 1,108,396 |
| GC (%) | 30.16 | 30.23 | 30.18 |
| Number of predicted genes | 3,915 | 3,886 | 4,424 |
| Complete BUSCOs (%) | 98.2 | 98.2 | 98.2 |
| Complete and single-copy BUSCOs (%) | 98.2 | 98.2 | 98.2 |
| Complete and duplicated BUSCOs (%) | 0.0 | 0.0 | 0.0 |
| Fragmented BUSCOs (%) | 0.4 | 0.4 | 0.6 |
| Missing BUSCOs (%) | 1.4 | 1.4 | 1.2 |
| Total lineage BUSCOs | 502 | 502 | 502 |

The assembly statistics, including contig lengths, N50, and GC content, were comparable to those of the published C. parvum reference genomes (Table 2). Benchmarking Universal Single-Copy Orthologs (BUSCO) v.5.4.6 [26] was used to evaluate the completeness of the C. parvum genome assembly against the coccidia_odb10 database.
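For readers unfamiliar with these metrics, the following sketch shows how contig count, total length, largest contig, N50, and GC content are conventionally computed from a set of contig sequences. It is a generic illustration on toy data, not the tool used in the study; the function name `assembly_stats` and the returned dictionary keys are hypothetical.

```python
def assembly_stats(contigs: list) -> dict:
    """Compute basic assembly statistics from a list of contig sequences."""
    lengths = sorted((len(seq) for seq in contigs), reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the running sum of lengths,
    # taken from longest to shortest, first reaches half the total.
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    # GC content: fraction of G and C bases over all contigs.
    gc_bases = sum(seq.upper().count("G") + seq.upper().count("C") for seq in contigs)
    gc_percent = round(100 * gc_bases / total, 2) if total else 0.0

    return {
        "num_contigs": len(lengths),
        "total_bp": total,
        "largest_bp": lengths[0] if lengths else 0,
        "n50_bp": n50,
        "gc_percent": gc_percent,
    }


# Toy example: three contigs of 50, 30, and 20 bp.
toy_contigs = ["A" * 50, "GC" * 15, "AT" * 10]
print(assembly_stats(toy_contigs))
```

With eight chromosome-scale contigs, as in Table 2, N50 simply reports the length of one of the mid-sized chromosomes, which is why the three assemblies give nearly identical values.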

Gene prediction and annotation

Protein-coding genes were predicted by integrating ab initio methods, homology alignment data, and transcriptomic data as described previously [27]. Briefly, the transcriptomic data [28] for gene model training and the protein data [29] for homology alignment of C. parvum were downloaded from CryptoDB (https://cryptodb.org). For the ab initio methods, PASA v.2.4.0 [30] was applied to produce candidate gene structures, which were used to obtain a set of gene structures for training SNAP (v.2013-11-29) [31], Augustus v.3.3.3 [32] (--genemodel=complete), GenomeThreader v.1.6.1 [33], and GlimmerHMM v.3.0.4 [34] using default parameters. Subsequently, Augustus v.3.3.3 [32] and GlimmerHMM v.3.0.4 [34] were used to predict gene structures with the trained gene models. Gene models derived from the ab initio and homology alignment approaches were finally integrated into a non-redundant gene set using EvidenceModeler v.1.1.1 [35], and 3,915 protein-coding genes were predicted (Table 2).

The predicted protein sequences were functionally annotated by searching against 18 databases using InterProScan v.5.45 [36]: CDD [37], Coils [38], Gene Ontology [39], Gene3D [40], Hamap [41], MobiDBLite [42], PANTHER [43], Pfam [44], Phobius [45], PIR [46], PRINTS [47], ProSite [48], SFLD [49], SignalP [50], SMART [51], SUPERFAMILY [52], TIGRFAM [53], and TMHMM [54] (Table 3). In total, 3,666 genes (93.6% of the total) were successfully annotated.

Table 3. Gene function annotation statistics of the assembled C. parvum genome.

| Database | Gene number | Percentage (%) |
|---|---|---|
| CDD | 1,027 | 26.2 |
| Coils | 1,076 | 27.5 |
| Gene Ontology | 1,963 | 50.1 |
| Gene3D | 2,161 | 55.2 |
| Hamap | 125 | 3.2 |
| MobiDBLite | 1,449 | 37.0 |
| PANTHER | 2,286 | 58.4 |
| Pfam | 2,299 | 58.7 |
| Phobius | 1,376 | 35.2 |
| PIR | 519 | 13.3 |
| PRINTS | 350 | 8.9 |
| ProSite | 1,687 | 43.1 |
| SFLD | 19 | 0.5 |
| SignalP | 577 | 14.7 |
| SMART | 1,050 | 26.8 |
| SUPERFAMILY | 2,039 | 52.1 |
| TIGRFAM | 216 | 5.5 |
| TMHMM | 854 | 21.8 |
| All annotated | 3,666 | 93.6 |

Non-coding RNA annotation

Non-coding RNAs are usually divided into several groups, including rRNA, tRNA, miRNA, and snRNA. The rRNA genes were identified with Barrnap v.0.9 [55] using default parameters. tRNAscan-SE v.2.0.12 [56] was used to predict tRNAs with eukaryote parameters. The miRNA genes were identified by searching the miRBase v.21 database [57] using default parameters. The snRNA genes were predicted using INFERNAL v.1.1 [58] based on the Rfam v.12.0 database [59] using default parameters. In total, 14 rRNAs, 45 tRNAs, no miRNAs, and 8 snRNAs were predicted (Table 4).

Table 4. Non-coding RNAs of the assembled genome.

| RNA classification | Number |
|---|---|
| rRNA | 14 |
| tRNA | 45 |
| miRNA | 0 |
| snRNA | 8 |

Data Records

The raw sequencing data, including MGI short reads (accession CRA013315 [60]), PacBio HiFi reads (accession CRA013316 [61]), and ONT long reads (accession CRA013320 [62]), and the whole-genome assembly (accession GWHEQBI00000000 [63]) of the C. parvum IIdA19G1 strain can be accessed through the National Genomics Data Center, China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (PRJCA020540 [64]). The genome assembly [65] has also been submitted to the NCBI database under the BioProject accession number PRJNA1045063. Moreover, the genomic annotation results have been deposited in the Figshare database [66].

Technical Validation

We evaluated the assembly using two criteria: the mapping of short and long sequencing reads, and BUSCO assessment. The reads from the short-insert library were re-mapped onto the assembly using BWA v.0.7.10 [25], while the PacBio HiFi and ONT long reads were aligned using minimap2 v.2.24 [67] with default parameters. Assembly completeness was evaluated using BUSCO v.5.4.6 [26] with the Coccidia dataset in genome mode (-l coccidia_odb10 -m geno). The mapping rate for short reads was 99.4%, while the mapping rates for HiFi and ONT long reads were 99.6% and 97.7%, respectively (Table 5). Moreover, 98.2% of the BUSCO genes were recovered as complete single copies in the assembled genome (Table 2). Together, these assessments support the accuracy and completeness of the genome assembly.
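As a quick consistency check on the BUSCO figures, the percentages in Table 2 can be converted back to whole numbers of orthologs out of the 502 in the lineage set. The counts below are inferred from the rounded percentages and are not reported in the article, so they should be read as an assumption; the names `REPORTED_PCT`, `implied_counts`, and `as_percent` are illustrative.

```python
TOTAL_BUSCOS = 502  # total lineage BUSCOs (Table 2)

# Percentages reported for this study's assembly (Table 2).
REPORTED_PCT = {
    "complete_single_copy": 98.2,
    "complete_duplicated": 0.0,
    "fragmented": 0.4,
    "missing": 1.4,
}


def implied_counts(pct: dict, total: int = TOTAL_BUSCOS) -> dict:
    """Nearest whole-number ortholog count for each reported percentage (inferred)."""
    return {category: round(value * total / 100) for category, value in pct.items()}


def as_percent(count: int, total: int = TOTAL_BUSCOS) -> float:
    """Percentage of the lineage set, rounded to one decimal as in the table."""
    return round(100 * count / total, 1)


counts = implied_counts(REPORTED_PCT)
print(counts)
print({category: as_percent(n) for category, n in counts.items()})
```

The inferred counts sum to the full lineage set and round back to the reported percentages, which suggests the table is internally consistent.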

Table 5. Results of long and short sequencing reads mapped to the assembled C. parvum genome.

| Sequencing platform | MGI | PacBio | ONT |
|---|---|---|---|
| Total reads (bp) | 1,576,020,900 | 3,519,749,056 | 8,825,522,949 |
| Mapped reads (bp) | 1,566,605,400 | 3,505,499,876 | 8,622,067,393 |
| Mapping rate (%) | 99.4 | 99.6 | 97.7 |
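The mapping rates in Table 5 are the mapped bases expressed as a percentage of the total bases for each platform. A minimal sketch of that calculation, using the values from the table (the names `TABLE5` and `mapping_rate` are hypothetical):

```python
# (total bp, mapped bp) per platform, copied from Table 5.
TABLE5 = {
    "MGI": (1_576_020_900, 1_566_605_400),
    "PacBio": (3_519_749_056, 3_505_499_876),
    "ONT": (8_825_522_949, 8_622_067_393),
}


def mapping_rate(total_bp: int, mapped_bp: int) -> float:
    """Mapped bases as a percentage of total bases, rounded to one decimal."""
    if total_bp <= 0:
        raise ValueError("total_bp must be positive")
    return round(100 * mapped_bp / total_bp, 1)


rates = {platform: mapping_rate(*values) for platform, values in TABLE5.items()}

for platform, rate in rates.items():
    print(f"{platform}: {rate}%")
```

The slightly lower ONT rate relative to MGI and HiFi falls directly out of the ratio of the two rows and does not depend on any assumption about read counts.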

Acknowledgements

This research was funded by the National Key Research and Development Plan Project (2022YFD1800200), NSFC-Henan Joint Fund Key Project (U1904203), and Leading Talents of the Central Plains Thousand Talents Program (19CZ0122). We thank the members of Protist 10,000 Genomes Project (P10K) consortium for their helpful suggestions. The bioinformatics analysis was supported by the Wuhan Branch, Supercomputing Center, Chinese Academy of Sciences, China. We also thank the LetPub Editor for editing the language of this manuscript.

Author contributions

Conceived and Designed: L.X.Z. and G.Y.W. Manuscript: Y.C.C. and L.X.Z. Analysis: Y.C.C., J.Y.H., G.Y.W., K.C. and J.X. Reagents/materials: H.K.Q., K.H.Z., Y.F., J.Q.L. and R.J.W. Supervision: J.Y.H., W.M., G.Y.W., and L.X.Z. All of the authors have read and approved the final manuscript.

Code availability

No custom code was used in this study. The data analyses used standard bioinformatic tools specified in the methods.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Guangying Wang, Email: wangguangying@ihb.ac.cn.

Longxian Zhang, Email: zhanglx8999@henau.edu.cn.

Data Citations

  1. Chen, Y. C. et al. Genome annotation data for the Cryptosporidium parvum IIdA19G1 subtype, figshare. Dataset, 10.6084/m9.figshare.26088349.v3 (2024).


Articles from Scientific Data are provided here courtesy of Nature Publishing Group
