Abstract
Haemonchus contortus is a globally distributed and economically important gastrointestinal pathogen of small ruminants and has become a key nematode model for studying anthelmintic resistance and other parasite-specific traits among a wider group of parasites including major human pathogens. Here, we report using PacBio long-read and OpGen and 10X Genomics long-molecule methods to generate a highly contiguous 283.4 Mbp chromosome-scale genome assembly including a resolved sex chromosome for the MHco3(ISE).N1 isolate. We show a remarkable pattern of conservation of chromosome content with Caenorhabditis elegans, but almost no conservation of gene order. Short and long-read transcriptome sequencing allowed us to define coordinated transcriptional regulation throughout the parasite’s life cycle and refine our understanding of cis- and trans-splicing. Finally, we provide a comprehensive picture of chromosome-wide genetic diversity both within a single isolate and globally. These data provide a high-quality comparison for understanding the evolution and genomics of Caenorhabditis and other nematodes and extend the experimental tractability of this model parasitic nematode in understanding helminth biology, drug discovery and vaccine development, as well as important adaptive traits such as drug resistance.
Subject terms: Non-model organisms, Genome informatics, Parasitic infection, Comparative genomics, Population genetics
Stephen Doyle et al. report the chromosome-scale genome assembly and transcriptome sequence data of Haemonchus contortus, a key parasitic nematode model. These data show nearly complete conservation of chromosome content with C. elegans and brings insight into transcriptional regulation and chromosome-wide genetic diversity in this important pathogen.
Introduction
The recent introduction of long-read DNA sequencing from Pacific Biosciences (PacBio) and Oxford Nanopore has completely changed the quality and contiguity of assemblies of both large and small genomes. The ability to generate highly contiguous and closed prokaryote genomes is becoming routine, even from metagenomic samples containing multiple strains or species1,2, and high-quality chromosome-scale genomes for many eukaryotic species are rapidly becoming available3–5. These technologies generally require microgram quantities of high molecular weight DNA, and the success of a genome assembly is highly dependent on the degree of polymorphism present in the DNA that is sequenced. Ideally, the DNA sequenced will be obtained from a single individual—so that the only polymorphism is heterozygosity between homologous chromosomes in the case of a diploid or polyploid individual—or from a clonal population, where multiple but genetically identical individuals can be pooled. However, for many small organisms, this is not possible; the only option is to sequence DNA from a pool of genetically distinct organisms and derive a consensus assembly. As modern assembly algorithms are designed and trained on haploid or diploid genomes from single individuals6–8, the excessive diversity presents a significant challenge to the assembly process, and consequently, typically results in fragmented assemblies that contain misassembled and excessive haplotypic sequences9.
Parasitic helminths are a diverse group of organisms for which significant efforts have been made to develop genomic resources10. For some species11–16, high-quality near-complete assemblies have been achieved using multiple complementary sequencing and mapping technologies, and usually with extensive manual improvement. Similar quality assemblies for a few other species are available from WormBase ParaSite17, but await formal description in the peer-reviewed literature. However, for many others, access to sufficient high-quality DNA, small organismal size, and high genetic diversity are significant challenges that need to be overcome to achieve highly contiguous assemblies. The nematode Haemonchus contortus is one such species. H. contortus is a major pathogen of sheep and goats worldwide and is recognised as a model parasite due to its experimental tractability for drug discovery18,19, vaccine development20,21 and anthelmintic resistance research22. Draft genome assemblies were published in 2013 for two anthelmintic-sensitive isolates, MHco3(ISE).N123 and McMaster24; both assemblies were produced using short-read, high-throughput sequencing technologies (including both short- and long-insert libraries), resulting in assembly sizes of 370 Mbp and 320 Mbp, respectively. However, direct comparison of these assemblies shows discordance, revealing clear differences in gene family composition25 and variation in assembly quality and gene content26. Although some of these differences reflect biological variation in this highly polymorphic species27–29, many differences are technical artefacts, as revealed by comparing fragmented genome assemblies from sequencing DNA derived from pools of a genetically diverse organism. These fragmented genome assemblies are less than ideal for many downstream uses of reference genomes, for example, in genetic experiments in which patterns of genome-wide variation are defined and interpreted30.
Here we present the chromosome-scale assembly of the MHco3(ISE).N1 isolate of H. contortus, representing the first high-quality assembly for any strongylid parasite, an important clade of parasitic nematodes that includes parasitic species of veterinary and medical importance. Using a hybrid assembly approach incorporating short- and long-read sequencing followed by manual finishing, we demonstrate how a highly contiguous assembly overcomes some of the critical limitations imposed by draft assemblies in interpreting large-scale genetic and functional genomic datasets. We provide insight into patterns of transcriptional change and putative co-regulation using a significantly improved genome annotation, derived de novo from both short- and long-read cDNA sequencing, and describe within- and between-population genetic diversity that shapes the genome. This chromosome-scale assembly now offers insight into genome evolution among a broad group of important parasite species and a robust scaffold for genome-wide analyses of important parasite traits such as anthelmintic resistance.
Results
Chromosome structure of Haemonchus contortus
We have built upon our previous assembly (version 1 (V1)23) using a hybrid approach, iteratively incorporating Illumina short-insert and 3 kbp libraries, PacBio long-read, OpGen optical mapping, and 10X Genomics linked-read data (Supplementary Fig. 1; Supplementary Data 1) to generate a largely complete, chromosomal-scale genome assembly. Consistent with the karyotype for H. contortus31,32, the assembly consists of five autosomal scaffolds and one sex-linked scaffold (Supplementary Fig. 2), each containing terminal telomeric sequences (sequence motif: TTAGGC), and a single mitochondrial contig. We assigned chromosome names based on broad-scale synteny with C. elegans chromosomes (Fig. 1a); over 80% of 7,361 one-to-one orthologous genes are shared on syntenic chromosomes between the two species demonstrating the high conservation of genes per chromosome (Fig. 1b; top), however, vast rearrangements are evident and very little conservation of gene order remains (Fig. 1b; bottom). We determined the extent of microsynteny by comparing conserved, reorientated, and rearranged gene pairs between the two species (Supplementary Fig. 3a); the distance between ortholog pairs in H. contortus and C. elegans is correlated up to ~100 kbp (rho = 0.469, P = 2.2E−16) in which pair order is conserved (Supplementary Fig. 3b), likely representing selective constraint to maintain evolutionarily conserved operons, but is largely lost above 100 kbp (rho = 0.017, P = 0.660) where a greater frequency of rearrangement between pairs is evident (Supplementary Fig. 3b, c). Beyond pairs of genes, synteny breaks down rapidly; although almost 50% of shared ortholog pairs are adjacent to each other, only a single group of 10 orthologs are colinear between the genomes of the two species (Supplementary Fig. 3d, e; Supplementary Data 2).
The V4 assembly is 283.4 Mbp in length (scaffold N50 = 47.3 Mbp; contig N50 = 3.8 Mbp), representing only 75.6% of the 369.8 Mbp V1 draft assembly length (Table 1); the reduction in assembly size is due to the identification and separation of redundant haplotypic sequences present in the polymorphic draft V1 and preliminary PacBio assemblies generated. The V4 assembly is highly resolved with only 185 gaps, a substantial reduction from the 41,663 gaps present in the V1 assembly (Table 1). We identified 242 of 248 (97.98%) universally conserved orthologs measured by the Core Eukaryotic Genes Mapping Approach (CEGMA; Supplementary Table 1), and 859 of 982 (87.4%) metazoan Benchmarking Universal Single-Copy Orthologs (BUSCOs). The remaining missing orthologs show phylogenetic structure among clade V nematodes (Supplementary Fig. 4), suggesting that many are truly missing from the genomes of these species rather than due to assembly artefacts. The V4 assembly contained 141 more full-length single-copy orthologs and 159 fewer duplicated BUSCOs than identified in the V1 genome (Supplementary Table 1); considering that we identified a similar average number of CEGMA orthologs per core gene in V4 and in the C. elegans genome (1.1 in V4, compared with 1.09 in C. elegans), erroneously-duplicated sequences have been reduced in V4 compared with the draft H. contortus assemblies (V1: 85.89% complete and 1.59 average orthologs per CEGMA gene; McMaster: 70.56% complete and 1.43 average orthologs CEGMA gene; Supplementary Table 1).
Table 1.
H. contortus V4 Chromosomes | H. contortus V4 Haplotypes | H. contortus V1a | H. contortus McMasterb | H. contortus New Zealandc | C. elegans WBd | |
---|---|---|---|---|---|---|
Genome size (bp) | 283,439,308 | 248,771,548 | 369,846,877 | 319,640,208 | 465,720,674 | 100,286,401 |
Scaffolds | 7 | 3,907 | 23,860 | 14,419 | 7 | 7 |
Scaffold N50 (bp) | 47,382,676 | 167,270 | 83,287 | 56,328 | 83,970,805 | 17,493,829 |
Scaffolds ≥N50 | 3 | 174 | 1,151 | 1,684 | 3 | 3 |
Contigs | 192 | 4,226 | 65,523 | 55,322 | 172,899 | 7 |
Contig N50 (bp) | 3,801,457 | 117,552 | 20,808 | 12,900 | 3,087 | 17,493,829 |
Contigs ≥N50 | 20 | 439 | 4,943 | 6,646 | 43,632 | 3 |
Scaffold N content (bp) | 5,699,711 | 14,755,378 | 23,804,39 | 20,495,720 | 17,288,295 | 0 |
Scaffold gaps | 185 | 323 | 41,663 | 40,903 | 172,892 | 0 |
An assembly of a New Zealand (NZ) H. contortus strain was recently made available33, which used our V4 chromosomal assembly to scaffold a highly fragmented draft assembly (contigs = 172,899) into seven chromosome-scale scaffolds. The expansion of the NZ assembly size (465 Mbp vs 283 Mbp in the V4 assembly; Table 1) in the context of elevated duplicated BUSCOs (7.3% vs 1.6% in the V4 assembly) and the increased proportion of orthologs detected, on average, per partial and complete CEGMA gene (1.42 and 1.71, respectively, vs 1.1 and 1.22 in the V4 assembly; Supplementary Table 1) suggest that contaminating haplotypic sequences have erroneously been incorporated into the NZ assembly. Similarly, we suspect that the lower CEGMA statistics together with the reported low average exon number (4 exons per gene vs 9 exons in the V4 assembly) and transcript lengths (666 bp vs 1,237 bp in the V4 assembly; see below) suggest that frameshift errors due to insertion and/or deletions from uncorrected PacBio reads and/or assembly gaps (n = 172,892) are affecting coding sequences broadly throughout the NZ assembly. While multiple chromosome-scale assemblies from the same species provide a unique opportunity for comparative analysis, the striking differences between the NZ and the V4 assemblies are unlikely to represent genuine biological variation between strains.
Resolving haplotypic diversity and repeat distribution within the chromosomes
We identified a dominant group of haplotypes in our PacBio assemblies which, together with Illumina read coverage derived from a single worm, was used to partially phase the chromosome assembly sequence by manually curation. Mapping of the additional haplotypes to the reference-assigned haplotype (Fig. 2a) highlighted both the diversity between the haplotype and chromosomally-placed sequences (Fig. 2b, c second circle) and coverage of the haplotypes, spanning approximately 66.8 ± 24.1% on average of the chromosome sequences (Fig. 2c third circle; range 85.1 ± 12.3% [chromosome I] to 51.8 ± 16.6% [X chromosome]). Whole-genome short-read sequencing of 11 single worms (10 male and 1 female, identified by full- or half-coverage across the X chromosome for XX females and XO males, respectively) followed by competitive read mapping revealed clear evidence of haplotype switching between the chromosomal reference and alternate haplotype assembly based on differences in the sequencing coverage throughout the genome and between samples (Fig. 2d). Haplotypic blocks are evident, including large regions on chromosome 2 and 5 that are devoid of haplotype diversity (full coverage of the reference; blue), as well as regions in which individuals lack any read coverage due to the absence of the reference haplotype in their genome (e.g. red region on chromosome 1, samples 09 & 10). These data suggest that, despite being semi-inbred through a limited number of single-pair matings34, complex haplotypes are segregating in this population, and that the frequency and diversity of these haplotypes are non-randomly distributed throughout the genome.
The use of multiple long-molecule technologies allowed a greater representation of repetitive sequences in the V4 assembly, accounting for approximately 36.43% (compared with 30.34% in V1) of the genome (Supplementary Table 2). Particular repeat classes, including LINEs, LTRs and DNA elements were enriched towards the middle of the chromosomes and the chromosome ends (Fig. 2c inner circle; Supplementary Fig. 5). This distribution pattern is negatively correlated with observed recombination rate domains in both H. contortus35 and C. elegans36, suggesting that transposons accumulate in regions of low recombination rate throughout the genome. On the X chromosome, this enrichment is striking in the sub-telomeric regions and, in particular, from approximately 8 to 35 Mbp along the chromosome (Fig. 2c inner circle; Supplementary Fig. 5). Finally, the pattern of repetitive elements throughout the chromosomes and not the coding sequences (shown below) correlates with and perhaps is driving the nucleotide composition profile throughout the genome (Fig. 2c outer circle).
Generation of a high-quality transcriptome annotation incorporating short and long reads
Considering the broad-scale changes from V1 to V4 assemblies, we undertook a de novo gene finding strategy to annotate the V4 genome. To do so, we used our Illumina short-read RNA-seq collection sampled from seven stages of the parasite life cycle (described in the analysis of the V1 genome23) and supplemented this with PacBio Iso-Seq long-read sequencing (Supplementary Table 3) of full-length cDNA from adult male, female, and juvenile L3 parasites. The overall annotation strategy is presented in Supplementary Fig. 6 and consisted of four key stages: (i) RNA-seq evidence was used to train ab initio gene finding using Braker37; (ii) full-length high-quality isoforms generated using Iso-Seq were used to identify candidate gene models using the annotation tool Program to Assemble Spliced Alignments (PASA38); (iii) evidence from steps (i) and (ii), together with 110 manually curated genes and orthologs from the C. elegans and H. contortus V1 genomes, were incorporated using EvidenceModeller; and finally (iv) UTRs and multi-evidence gene models were updated using the full-length isoforms once again using PASA. Together, this strategy achieved a high level of sensitivity and precision, and represented substantial improvement over the V1 annotation (Supplementary Table 4).
The V4 annotation comprises 19,489 nuclear genes encoding 20,987 transcripts with approximately 56.4% (23,687 of 41,974) of UTRs annotated. The increase in genome contiguity, together with the incorporation of full-length cDNA sequencing (Fig. 3a), has resulted in longer gene (40.5% increase), mRNA (41.9% increase), and exon (30.6% increase) models on average than the previous V1 genome annotation (Fig. 3b; Supplementary Table 5); our approach allowed better representation of the longest genes (447,146 bp compared with 91,953 bp in V1), and intriguingly, identified 43 of the longest 50 (86%) and 73 of the longest 100 gene models on the X chromosome alone. These longest genes were predominantly located toward the centre of the X chromosome, suggesting that gene length expansion is driven by the accumulation of intronic repetitive elements enriched in this region as described above (Fig. 2c inner circle; Supplementary Fig. 5). Concordantly, both the gene number and gene coverage of 100 kbp windows were significantly lower on the X chromosome relative to the autosomes (pairwise t-test between X and autosomes; P = 2.2E−16). The reduction in haplotypic sequences together with greatly improved gene models resolved 61% more one-to-one orthologs with C. elegans (n = 7,361) than between C. elegans and V1 (n = 4,529) (Supplementary Table 6). Surprisingly, a greater number of one-to-one orthologs are identified between V4 and Haemonchus placei (n = 9,970; https://parasite.wormbase.org/Haemonchus_placei_prjeb509/Info/Index/) than V4 and V1 (n = 9,595), likely reflecting the higher duplication rate and haplotypic nature of the V1 relative to the H. placei assembly, although this also highlights the close relationship between H. contortus and H. placei39, a gastrointestinal pathogen commonly associated with cattle. The increased orthology between close (H. placei) and more distantly (C. elegans) related species provides support for a more complete and biologically representative gene set in the V4 annotation than the previously described V1 and McMaster annotations.
Transcriptional dynamics throughout development and between sexes
Both 2013 publications describing draft assemblies of H. contortus provide extensive descriptions of the key developmental transitions between sequential pairs of life stages throughout the life cycle23,24. We revisited this using the current annotation and, for visual comparison, present the top 1,000 most variable transcripts across all life stage samples in the MHco3(ISE).N1 datasets (Fig. 4a; Supplementary Data 3). The transcriptional transition between the juvenile free-living life stages and the more mature parasitic stages is striking, reflecting not only the development toward reproductive maturity but also drastic changes in the parasite’s environment. To explore these developmental transitions more explicitly, we determined the co-expression profiles for all genes throughout the lifecycle, revealing 19 distinct patterns of co-expression containing 8,412 genes (41% of all genes), with an average cluster size of 442.7 genes (Supplementary Fig. 7; Supplementary Data 4). Consistent with the pattern of highly variable genes, two dominant clusters of genes were identified: (i) genes with higher expression in free-living stages that subsequently decreased in expression in parasitic stages (cluster 1; n = 1,550 genes; Fig. 4b) and (ii) genes with lower expression during free-living stages that subsequently increased in expression in parasitic stages (cluster 8; n = 1,542 genes). Functional characterisation of the cluster 1 genes revealed 45 enriched gene ontology (GO) terms (https://biit.cs.ut.ee/gplink/l/8jECkGwQQS) that predominantly described ion transport and channel activity, as well as transmembrane signalling receptor activity. For cluster 8, 22 GO terms were enriched (https://biit.cs.ut.ee/gplink/l/oJoq6Zh0Re) and included terms describing peptidase activity, as well as protein and organonitrogen metabolism. To provide further resolution, we tested C. elegans orthologs of this gene set, which revealed greater resolution on metabolic processes, including terms associated with glycolysis and gluconeogenesis, pyruvate and nitrogen metabolism (https://biit.cs.ut.ee/gplink/l/b3Bx8rhJR5). Finally, we explored whether co-expressed genes contained shared sequence motifs in their 5’ UTR that may be indicative of true transcriptional co-regulation; while broadly poorly represented, we did identify 70 enriched motif sequences of which 17 were associated with gene sets with enriched GO terms (Supplementary Data 5), for example, cysteine-type peptidase activity in cluster 8 (GATAAGR; motif E-value = 2.90E−02; GO P = 4.181E−13), DNA replication in cluster 15 (AAAAATVA; motif E-value = 3.50E−02; GO P = 6.269E−4), and molting cycle, collagen and cuticulin-based cuticle in cluster 16 (GTATAAGM; motif E-value = 2.30E−02; GO P = 1.926E-6), suggestive of either direct or indirect co-regulation of expression.
The greatest differences in gene expression between life stages are between L4 (mixed male and females) and either adult males or adult females, or between adult males and adult females, with a strong sex-bias towards increased gene expression in adult males (Supplementary Data 3). Consistent with C. elegans, sex-biased expression is not randomly distributed throughout the genome40,41; only 5% (146/2,923) of H. contortus male-biased genes are found on the X chromosome (relative to 14.3% of all genes found on the X chromosome; χ2 = 175.83, df = 1, P = 3.84E−40), whereas more female-biased genes are X-linked than expected (24.6% [110/476]; χ2 = 33.59, df = 1, P = 6.78E−9). Comparison of sex-biased genes on the X chromosome revealed that, while almost completely absent in males, X-linked female-biased expression identified C. elegans orthologs previously associated with development, including developmentally upregulated genes (e.g., smad4, transcription factor AP2, vab-3), genes involved in egg development (e.g., egg-5, vit-1, cpg-2, egg-1), as well as genes involved in developmental timing (e.g., lin-14, kin-2042) that are expressed downstream of the sex-determination regulator tra-1. Sex determination in C. elegans is mediated by the ratio of X chromosomes to autosomes, whereby organisms with a single X chromosome will develop as a male (X-to-autosome ratio = 0.5:1, i.e., XO) and those with two X chromosomes develop as a hermaphrodite (XX; X-to-autosome ratio = 1:1); similarly, H. contortus males are XO and females are XX31,32. The detection of the X-to-autosome copy number initiates a regulatory pathway that first activates a dosage compensation response that down-regulates gene expression in hermaphrodites by one-half, followed by sex determination mediated by the conserved transcription factor tra-1; in XX individuals tra-1 is activated, whereas tra-1 is repressed in XO individuals. Although the similarities in the XX/XO chromosome structures between C. elegans and H. contortus suggest that the mechanisms controlling sex determination may be similar, this has not been previously shown. Comparison of H. contortus RNA-seq data between males and females showed no significant difference in the distribution of expression between autosomes and X-linked genes; only 9.24% of genes (110 + 146 of 2,770 X-linked genes) are differentially expressed on the X chromosome between sexes suggests that X chromosome dosage compensation is active in XX females. To explore this further, we mapped H. contortus orthologs of C. elegans genes involved in the X chromosome sensing and dosage compensation response and subsequent sex determination pathway (Fig. 4c; Supplementary Data 5; Adapted from WormBook43). In C. elegans, chromosome dosage is determined by measuring the expression of specific genes on the X and autosomes; in H. contortus, only two of the C. elegans X-linked orthologs - fox-1 and sex-1 - are present, however, these are located on the autosomes, whereas the autosomal dosage associated genes sea-1, −2, and −3, are missing. Five of the six C. elegans orthologs associated with the dosage compensation complex (DCC) were found (dpy-26 is missing), however, the genes xol-1 and sdc-1, −2, and −3 that act to initiate the recruitment of the DCC are absent. Thus, while the DCC may still mediate dosage compensation in H. contortus, the mechanism used to determine the X-to-autosome ratio is likely different from C. elegans. Identification of many orthologs downstream of her-1, which receives X chromosome dosage signals to initiate sex determination, suggests greater conservation of this mechanism between species. Notable is the absence of H. contortus orthologs of tra-2 and fem-3, as they are key members of the pathway that receive regulatory signals promoting male or female developmental pathways, respectively, in C. elegans.
Transcriptome complexity is generated by extensive cis- and trans-splicing
Spliced leader (SL) trans-splicing has been extensively described in C. elegans44 and has been shown previously to operate in H. contortus45–47 (Fig. 5a). To explore this further in the V4 annotation (Supplementary Fig. 8a, b), we identified 7,293 (37.4%) and 1,139 (5.8%) genes with evidence of SL1 and SL2 trans-splicing, respectively (Supplementary Data 7). However, additional evidence of trans-splicing was observed; 7,895 and 1,222 annotated transcripts contained SL1/SL2 sequences, and a further 6,883 internal SL cut sites indicated additional non-annotated isoforms were present in the transcriptomes. While we likely underestimate the frequency of trans-splicing due to the modest RNA sequencing depth available, this approach reveals a higher frequency of trans-splicing than previously described and additional evidence for the prediction and annotation of novel gene isoforms. Further, we identify degeneracy in the nucleotide sequences of both SL1 and SL2 sequences (Supplementary Fig. 8c), and conservation in the genome sequence immediately upstream of the splice site, particularly for SL2 (Supplementary Fig. 8d). In C. elegans, SL1 sequences are typically associated with non-operon genes or the first coding sequence of an operon, whereas SL2 sequences are used for second and subsequent coding sequences in a polycistronic operon. The intercistronic region between SL2-spliced genes and their upstream genes in C. elegans are typically short, usually approximately 50 to 200 bp on average, with the majority less than 2 kbp44. In H. contortus, the intergenic distance upstream of genes with SL1 sequences is largely consistent with that of genes without SL1 sequences (Fig. 5b left plot; median = 8,557 bp [SL1-spliced], 8,603.5 bp [no SL]); however, a small proportion of SL1-spliced genes are found less than 100 bp downstream of the nearest gene. Conversely, while a larger proportion of genes with SL2 sequences are immediately downstream of the nearest gene as expected (median distance = 566.5 bp), a substantial proportion (432 [35.4%]) are found with a large intergenic distance (>2 kbp) to the nearest upstream gene (Fig. 5b right plot). These observations suggest that while broadly consistent with C. elegans, (i) H. contortus SL1-sequences can be found in downstream genes within a polycistron, and may reflect evidence of hybrid operons that contain a second, internal promoter to control the expression of a subset of genes within the operon48, (ii) SL2-spliced genes are not necessarily in a polycistron but may function independently, and/or (iii) recognition of SL2 splicing within an operon remains effective over large physical distances, for example, the 2,064 bp region between deg-3 (HCON_00039370) and des-2 (HCON_00039380) operon of H. contortus49. Although these variations have been described in C. elegans50, they seem to be more the norm rather than the exception in H. contortus.
We extended the splice variation analysis to characterise cis-splicing of transcripts and differential isoform usage between life stages. We identify 1,055 genes with differentially spliced transcripts impacting at least one intron, representing on average 283 genes (FDR < 0.05; range: 43–418) that switch between different isoforms in each pairwise comparison of life stages throughout the lifecycle (Fig. 5c; Supplementary Table 7 [summary], Supplementary Data 8 [gene lists]). For example, 231 of 1,068 genes are differentially spliced between the adult male and female samples. The most differentially spliced intron is located in HCON_00073480, a one-to-one ortholog of C. elegans daz-1 (Fig. 5d). DAZ-1 is an RNA-binding protein involved in oogenesis but not spermatogenesis in C. elegans51; in H. contortus, HCON_00073480 is predominantly expressed in females (average TPM = 636.22; highest expression in hermaphrodites in C. elegans), and is expressed in males, albeit approximately 23-fold less (average TPM = 26.71)(Fig. 5e). Only a single transcript is described in C. elegans, however, five isoforms are annotated in the H. contortus genome, with sex-specific expression of exon 9 that is present in males and absent in females. Further, differential splicing was found in the splice donor site of exon 8, and while both alternate splice sites are present in both sexes, the sites are used equally in males but the longer transcript is used in approximately 60% of transcripts in female worms (Fig. 5f).
Distribution of global genetic diversity throughout the chromosomes
H. contortus is a globally distributed parasite, characterised by large population sizes and in turn, high genetic diversity27–29,52. We have characterised the chromosomal distribution of genetic diversity by exploiting a recent large scale analysis of global genetic diversity27 together with data from 74 additional worms from previously unsampled regions including Pakistan, Switzerland, USA, and the United Kingdom (Fig. 6a; https://microreact.org/project/hcontortus_global_diversity). Principal component analysis (PCA) of mtDNA diversity (Fig. 6b; Supplementary Fig. 9) showed that relationships between isolates are congruent with those previously described27; further, we show that Pakistani worms were genetically positioned between South African and Indonesian individuals, whereas the US and the UK samples were found among globally distributed samples that likely shared a common ancestral population that was distributed globally via modern human movement27. The US and UK samples consisted of both laboratory-maintained strains and field isolates; notably, apart from MHco3(ISE).N1, the laboratory strains retained similar levels of genetic diversity to the field samples. MHco3(ISE), from which the reference strain MHco3(ISE).N1 was derived, is believed to have originated in East Africa; although we have no samples from this region, the MHco3(ISE).N1 population was genetically similar to one of the three South African populations (Supplementary Fig. 10; Kokstad population), perhaps reflecting its African origins.
Genome-wide nucleotide diversity (π) is broadly consistent chromosome-wide between the autosomes (Fig. 6c–g). The high frequency of SNPs private to only a single population (relative to variants shared between two or more populations) represents a considerable proportion of the overall variation (23.55% of total variation), consistent with previous reports describing high regional genetic diversity27–29. On the X chromosome, nucleotide diversity is much lower than neutral expectation (i.e., πX/πautosome ≃ 0.7553) at a πX/πautosome median ratio of 0.363 (Fig. 6h), which suggests a sex-ratio bias in favour of males. This observation, together with the distinct repeat patterns that were negatively correlated with recombination rate, suggests that recombination may be less frequent on the X chromosome than expected. Interestingly, variation in X chromosome coverage (Supplementary Fig. 11a) increases in populations that are more genetically distinct from the reference strain (Supplementary Fig. 11b).
Finally, we reveal multiple discrete regions in each chromosome with high FST among populations (Fig. 6c–h; bottom plots). Consistent with previous reports, strong genetic selection for anthelmintic resistance has impacted genome-wide variation; the most differentiated region is found on chromosome 1 surrounding the beta-tubulin isotype 1 locus (HCON_00005260; 7,027,492 to 7,031,447 bp), a target of benzimidazole-class anthelmintics for which known resistant mutations have been identified in these populations27, as well as a broad region of elevated FST from approximately 35 to 45 Mbp on chromosome 5 previously shown to be associated with ivermectin resistance in controlled genetic crosses54. The top 1% of FST outlier regions per chromosome (280 10 kbp regions in total; Supplementary Data 9) contains 269 genes enriched for 10 GO terms (7 MF & 3 BP; https://biit.cs.ut.ee/gplink/l/1EEZzm6USa), largely describing proteolytic and cysteine-type peptidase activity.
Discussion
The H. contortus MHco3(ISE).N1 V4 assembly presented represents the largest de novo chromosome-scale nematode genome to date and provides a critical genomic resource for a broad group of human and veterinary pathogens present among the Clade V nematodes. This assembly is complemented with a high-quality de novo genome annotation to provide a precise transcriptomic resource for within and between species comparison, improved UTR and isoform classification for functional genomics, and should help to improve the annotation of other nematode species. Finally, we provide a comprehensive description of genetic and transcriptomic diversity within individuals and across the species, which informs not only our understanding of key developmental trajectories but also the parasite’s capacity to adapt when faced with challenges such as drug exposure.
A key challenge to achieving this chromosomal assembly was the presence of extensive genetic diversity in the sequencing data. Although high-quality genome assemblies are becoming more prevalent through the use of long-read sequencing technologies3,4,55, these assemblies are typically derived from a single, few, or clonal individual(s) in which the genetic diversity is low or absent; most of the genetic diversity present in a single diploid individual can be phased and separated into distinct haplotypes6,7, and assembled separately. However, even under these scenarios, heterozygosity remains a key technical limitation towards achieving contiguous genome assemblies. Considering the small physical size of H. contortus, it was necessary to pool thousands of individuals to generate sufficient material for long-read sequencing and while we did so from a semi-inbred population, there was still high genetic diversity present in the population of worms sequenced. The generation of such diversity from a single mating pair is likely influenced by reproductive traits including polygamous mating and high fecundity of H. contortus female worms. How such haplotypic diversity is maintained in the population or is tolerated within single individuals remains unclear, however, it is likely that the holocentric chromosomes of H. contortus tolerate high genetic diversity during homologous pairing of chromosomes at meiosis56. Despite this diversity, we curated a major haplotype in the ISE.N1 population to represent the chromosomes and demonstrated how this haplotype and alternate haplotypes segregated in this semi-inbred population. The absence in the literature to date of such chromosome-scale assemblies for physically small and genetically diverse eukaryotic species likely reflects the necessary investment required to improve a draft assembly to chromosomes. However, as DNA input requirements for long-read sequencing decrease57, the ability and ease with which highly contiguous genomes can be generated for small and genetically diverse species will be greatly enhanced.
Completion of the X chromosome represents one of the key advancements in the current genome assembly. In doing so, we revealed striking differences between the autosomes and X chromosome in (i) the repeat distribution and content, (ii) an enrichment for the longest genes of the genome, (iii) lower gene density and gene coverage, (iv) the lack of apparent genetic diversity, and (v) differences in the gene content that regulates X chromosome dosage and sex determination relative to C. elegans. Although we currently have little empirical data on recombination rates on the X chromosome, less recombination is expected on the X chromosome relative to the autosomes due to recombination on X only occurring during female gametogenesis. Similarly, the distinct repeat domain structure does suggest that recombination patterns may be different from the autosomes and warrants further investigation. The difference in nucleotide diversity between the X chromosome and autosomes further emphasises the impact of putative sex biases on the evolution of the X chromosome. H. contortus is polyandrous31,35 and highly fecund, so that reproductive fitness is likely over-dispersed between males and females, potentially resulting in different effective population sizes for male and female parasites. These differences may be exacerbated if males and females are differentially affected by anthelmintic exposure as has been demonstrated for some compounds58. The dichotomy between the low nucleotide diversity and high variance between samples in the global cohort perhaps reflects a key limitation on the ability of a single linear reference genome to capture genome diversity in a species with high genetic diversity. Long-read sequencing data from Haemonchus populations and bioinformatics approaches designed to capture more complex variation should help reveal the full extent of genetic variation in this species and shed light on the role of structural variation in shaping this genetic variation.
Our high-quality annotation allowed us to extend previous pairwise transcriptomic comparisons and focus on the trajectories of genes with correlated expression throughout the life cycle. These data identified putatively co-regulated gene sets, providing greater insight into coordinated developmental programmes than has been described previously. The broad-scale coordinated gene expression associated with ion transport, channel activity, and transmembrane signalling receptor activity throughout the lifecycle was of interest, particularly as ion channels and receptors are the primary targets of several anthelmintic drugs used to control H. contortus and other helminth species of human and veterinary importance59. The downregulation of sensory and signalling pathway-related genes from pairwise comparisons of juvenile to adult stages has been described previously in both parasitic24,60–63 and free-living64,65 Clade V nematodes, which suggests this is a broadly conserved phenomenon throughout nematode development. However, a better understanding of these changes may explain why some parasites at particular life stages are less sensitive to these drugs66,67, and may provide the rationale for selective treatment based on the known expression of drug targets. We further explored transcriptome variation by defining and quantifying differential isoform usage throughout the life cycle, a feature which is increasingly implicated in the sensitivity of drug targets in nematodes such as H. contortus; for example, sex-dependent sensitivity of emodepside is mediated by differential-splicing of the potassium channel slo-158, and similarly, expression of truncated acetylcholine receptor subunits acr-8 and unc-63 isoforms are associated with levamisole resistance68,69. Our results almost certainly under-estimate the extent of cis-splice variation within and between life stages; stringent filtering resulted in only a subset of genes tested (range: 759–1,156 genes per pairwise comparison of life stages), and yet, in each pairwise comparison, approximately 27.32% of genes on average contained evidence of splice variation (Supplementary Table 14). Further, each pairwise comparison identified cryptic splice sites, indicating that additional novel isoforms undergoing differential splicing were missing from the genome annotation. Ongoing manual curation to the genome annotation to classify isoform variants, together with more fine-grained quantification of transcript usage, will provide additional insight into the novelty that differential isoform usage provides, particularly for those genes for which isoform switching, but not differential expression, occurs between key life-stage transitions. Beyond the life cycle, the impact of high genetic diversity on transcriptomic variation within or between wild isolates is not yet clear, although both technical and biological variation between three divergent strains has recently been described70. A greater understanding of this genetic-to-transcriptomic variation is required, especially if phenotypic traits associated with transcriptomic differences, e.g. increased gene expression of a drug transporter correlated with resistance, are to be correctly attributed to genetic selection.
Finally, the V4 assembly allowed the extent of genome rearrangements between C. elegans and H. contortus to be characterised, revealing numerous gene and repeat family expansions that were missed in the previous assembly. As further genomes are completed from the Strongylid clade, determining the rates of structural evolution within and between chromosomes in this important group of parasitic organisms should become possible. Investigating such evolutionary processes will be important for understanding genome-wide variation associated with trait variation, including adaptive traits such as response to climate change and anthelmintic resistance. H. contortus is an established model for understanding parasite response to anthelmintics and subsequent evolution of resistance to key drugs used to control parasitic nematodes22. Our chromosomal assembly has been instrumental in mapping genetic variation associated with ivermectin54 and monepantel71 resistance. In the case of ivermectin, the identification of the major genomic region on chromosome 5 associated with resistance would not have been possible without a chromosome-scale assembly and highlights the limitations of interpreting such variation using less-contiguous genome assemblies as a reference30.
Conclusions
Our chromosome-scale reference genome and comprehensive annotation provides a step forward toward resolving the complete genome composition and transcriptional complexity of H. contortus. The increasing availability of long-read sequencing together with advances in the sequencing of small diverse organisms holds great promise for the rapid generation of high-quality contiguous genomes for organisms such as H. contortus; future sequencing, including the generation of new genomes and transcriptomes de novo from individuals sampled throughout its geographic range, will provide great insight into how genetic variation is maintained and changes in response to persistent challenges, including anthelmintic exposure and climatic change, in this hyper-diverse globally important pathogen.
Materials and methods
Parasite material
All parasite material used in the genome assembly was isolated from the semi-inbred strain of H. contortus, MHco3(ISE).N1. This strain was derived from a single round of single-worm mating using the MHco3(ISE) strain (protocol and procedure described in Sargison et al.34), and maintained as previously described23.
Iterative improvement of the genome assembly
Our approach to improving the V1 genome to the final V4 chromosomal assembly is outlined in Supplementary Fig. 1. The sequence data generated for this study are described in Supplementary Data 1.
Initial manual improvement on the V1 genome focused on iterative scaffolding with SSPACE72 and gap-filling with IMAGE73 using Illumina 500 bp and 3 kbp libraries, with additional low-coverage data from 3, 8 and 20 kbp libraries generated using Roche 454 sequencing. These improvements were performed alongside breaking of discordant joins using Reapr74, and visual inspection using Gap575. Substantial genetic variation was present in the sequence data due to the sequencing of DNA derived from a pool of individuals, resulting in a high frequency of haplotypes that assembled separately and therefore present as multiple copies of unique regions in the assembly. We surmised that much of the assembly fragmentation was due to the inability of scaffolding tools to deal with the changing rates of haplotypic variation, so we attempted to solve this manually in Gap5. As we had insufficient information to correctly phase these haplotypes, we chose, instead, the longest scaffold paths available accepting that our scaffolds would not represent single haplotypes but would rather be an amalgamation of haplotypes representing single chromosomal regions. This approach was initially difficult and time-consuming and was further confounded by a large number of repetitive sequences present in each haplotype.
Significant improvements in scaffold length were gained by the integration of OpGen (http://www.opgen.com/) optical mapping data. Optical mapping was performed following methods described previously11 with the following exceptions: high molecular weight DNA was prepared by proteinase K lysis of pools of ~500 H. contortus L3 embedded in agarose plugs, after which one of three restriction enzymes (KpnI, AflII and KpnI) were used to generate three separate restriction map datasets. Initial attempts to generate a de novo assembly using optical molecules alone was unsuccessful, and therefore, optical contigs were generated using DNA sequence seeds from the genome assembly using GenomeBuilder (OpGen) and visualised and manually curated using AssemblyViewer (OpGen). Although this approach was successful, it was limited by the quality and integrity of the gap-dense scaffolds and arbitrary nature of the haplotype scaffolding.
Subsequent integration of PacBio long-read data alongside the optical mapping data resulted in major increases in contiguity. PacBio sequencing libraries were prepared and sequenced using the PacBio RSII system. A total of 32.3 Gbp raw subreads (n = 4,085,541, N50 = 10,299 bp) were generated from 33 flow cells, representing approximately 133.8× coverage (based on the estimated genome size of 283 Mbp). A de novo PacBio assembly was generated using Sprai (v0.9.9.18; http://zombie.cb.k.u-tokyo.ac.jp/sprai/index.html), which was mapped to the assembly and used to manually resolve many gaps and resolve some of the phasing between haplotypes. Using the Sprai PacBio de novo assembly, we were also able to incorporate contigs that were previously missing from the working assembly. The increase in contiguity of the PacBio assemblies, further improved using canu v1.376, revealed two major but diverse haplotype groups segregating at approximately 65 and 30% frequency in the pooled individuals sequenced; the presence of such diverse haplotypes resulted in a significantly expanded assembly over 500 Mbp. The major haplotype was more contiguous and therefore was chosen as the primary sequence to incorporate and improve the assembly. This approach was supported by competitive mapping of a single worm short read sequencing library (ENA: ERS086777), which was found to preferentially map to different scaffolds and/or different contigs within scaffolds that we inferred were different haplotypes present in this single worm. Regions in the chromosome with no ERS086777 coverage, but for which an alternate haplotype was identified with coverage from the PacBio assembly, were manually removed from the main chromosomes and replaced. Further, this selective mapping correlated well with the optical contigs, and once these sequences were removed, much better optical alignments were produced further improving the assembly. Alternate haplotypes from the PacBio assembly and those removed from the main assembly were analysed as a separate haplotype assembly.
The increase in contiguity and resolution of shorter repetitive regions using PacBio began to reveal chromosome-specific repetitive units. Although these repeats were highly collapsed in the assembly and were typically not spanned by optical molecules, we were able to iteratively identify and join contigs/scaffolds flanking large tandem repeats that had clear read-pair evidence that they only occurred in a single location in the genome (i.e., read pairs were all mapped locally once the join was made). These were further supported by local assemblies of subsets of PacBio reads that contained copies of the repeat regions followed by de novo assembly using canu76 to reconstruct the flanking unique regions surrounding a repeat. This iterative process resulted in the production of chromosome-scale scaffolds, each terminating with a 6 bp repeat consistent with being a telomeric sequence (sequence motif: TTAGGC).
The V3 assembly, used in the H. contortus genetic map35, consisted of five autosomal chromosomes with the X chromosome in two major scaffolds. The X chromosome-associated scaffolds were identified by synteny to the C. elegans X chromosome using promer77 and by the expected ~0.75× coverage relative to the autosomes in pooled sequencing (due to a mixture of male and females present) and ~0.5× in single male sequencing libraries. We resolved the X chromosome using linked read long-range sequencing using the 10X Genomics Chromium platform (www.10xgenomics.com). DNA prepared from pooled worms was used to generate 10X sequencing libraries, which were subsequently sequenced on a HiSeq 2500 using 250 bp PE reads. Reads were mapped using bwa mem (v0.7.17-r1188), followed by scaffolding using ARCS78 and LINKS79. A single scaffold was generated using this approach, merging the two major X-linked scaffolds as well as two short sequences putatively identified as being X-linked but previously unplaced (Supplementary Fig. 2). The completion of the X chromosome, together with the polishing of the genome first with arrow (https://github.com/PacificBiosciences/GenomicConsensus) followed by Pilon v1.2280, finalised the genome to produce the V4 version presented here.
Genome completeness
Assembly completeness was determined throughout the improvement process using Core Eukaryotic Genes Mapping Approach (CEGMA; v2.5)81 and Benchmarking Universal Single-Copy Orthologs (BUSCO; v3.0; –mode genome –lineage metazoa_odb9 –long)82 gene completeness metrics. For comparison, these metrics were also determined for the complete set of nematode genomes from the WormBase ParaSite version 12 release.
Haplotype analyses
The identification and discrimination of the major and minor haplotypes in our assembly offered an opportunity to characterise the distribution of haplotypes throughout the genome, and the relative diversity between them. To visualise the broad-scale variation between the chromosome and haplotypes, we generated an approximate genome graph using minigraph (https://github.com/lh3/minigraph; parameters: -xggs). The output GFA file was visualised using Bandage83; a representative set of haplotypes were chosen for further comparison, and was compared using nucmer and visualised using Genome Ribbon84. To quantify the distribution and diversity of haplotypes, we used minimap to align haplotypes per chromosome. Haplotype density was calculated in 1 Mbp windows throughout the genome using bedtools coverage85 and visualised using the R package circlize86. We determined the relative usage of detected haplotypes in the MHco3(ISE).N1 population by using Illumina sequencing of DNA from 11 individual worms and competitively mapping the sequencing data to both the reference and haplotype assemblies using bwa mem87. Relative sequence coverage was determined using bedtools coverage85, which was normalised per sample to enable equivalent comparison between samples.
Transcriptome library preparation and sequencing
We made use of the RNA-seq data generated for each of the life stages as previously described23 together with additional full-length cDNA (PacBio Iso-Seq) sequencing performed with the specific aim of generating evidence for the annotation. Three life stages were used in the Iso-Seq analyses: adult female, adult male, and L3. RNA was extracted by homogenising the samples in Trizol in Fast prep (MP bio) followed by chloroform extraction and ethanol precipitation of the RNA, after which 1.2 μg of total RNA was converted to cDNA using SmartSeq2 protocol88. Size fractionation of equimolar pooled cDNA of the different stages was performed using a SageELF electrophoresis system (Sage Sciences), from which cDNA size fraction pools spanning 400 to 6,500 bp were collected. To improve the coverage of transcripts that differed in abundance and also in length (and to help minimise library preparation and sequencing biases towards shorter and highly abundant transcripts), we used a cDNA normalisation method followed by size fractionation of the pooled normalised cDNA to recover discrete size fractions of transcripts. The cDNA normalisation, which exploits a duplex-specific nuclease (DSN), was performed using the Trimmer-2 cDNA normalisation kit (Evrogen) following the manufacturer’s instructions. Size fractionated cDNA pools were sequenced on five individual flow cells on the RSII instrument. To increase the yield of the sequencing data, a subsequent sequencing run was performed after pooling the size fractions using a single flow cell on the Sequel instrument. A comparison of the sequences generated by the RSII and Sequel systems is presented in Supplementary Table 5.
Generation of a high-quality non-redundant Iso-Seq dataset
Full-length Iso-Seq subreads were mapped to the V4 genome using minimap2 (version 2.6-r639-dirty89) with splice-aware mapping enabled (parameter: -ax splice). The generation of high-quality Iso-Seq reads was performed following the Iso-Seq3 pipeline (https://github.com/PacificBiosciences/IsoSeq3), which consisted of four steps: (i) the generation of circular consensus sequencing (CCS) reads from raw subreads (ccs subreads.bam subreads.ccs.bam –noPolish –minPasses 1), (ii) the removal of library primers, and reorientation of reads (lima subreads.ccs.bam primers.fa subreads.demux.bam –isoseq –no-pbi), (iii) clustering of transcripts, removal of concatemers and trimming of polyA tails (isoseq3 cluster subreads.demux.primer_5p–primer_3p.bam subreads.isoseq3_unpolished.bam), and finally, (iv) polishing of the transcripts to generate a high quality set of isoforms (isoseq3 polish merged_subreads.isoseq3_unpolished.bam merged_subreads.bam merged_subreads.isoseq3_polished.bam).
Gene model prediction and genome annotation
A schematic of the strategy used for genome annotation is presented in Supplementary Fig. 6. Existing RNA-seq data were individually mapped to the V4 genome using STAR v2.5.290 before combined into a single bam using samtools merge. The merged bam was used as input to Braker (version 2.0)37,91, which can exploit mapped RNA-seq information as hints to train the gene prediction tools GeneMark-EX and Augustus, substantially improving genome annotation. The use of exon, intron, and other hints derived from RNA-seq or proteins mapped to the genome can be used to train Augustus directly; initial attempts to incorporate mapped Iso-Seq transcripts as an input to Braker, or by training Augustus using exon and intron hints, were successful and produced superior gene models compared with Braker using RNA-seq alone. However, as Iso-Seq data encode full-length transcripts that include UTRs in a single sequence, and that in this data there was little discrimination between the depth of coverage between UTR and exon, many UTRs were incorrectly annotated as 5’ or 3’ coding exons. Accordingly, only short-read RNA-seq data were used as input to Braker, resulting in a primary annotation (GFF1; Supplementary Fig. 6).
To incorporate Iso-Seq data into the genome annotation, we used the PASA (v2.2.0) genome annotation tool92 to map the high-quality full-length transcripts to the genome using both blat and gmap, akin to the use of a de novo transcriptome assembly produced from, for example, trinity, to generate a transcript database (https://github.com/PASApipeline/PASApipeline/wiki/PASA_comprehensive_db). The Braker GFF1 was subsequently loaded into PASA and integrated with the mapped Iso-Seq transcripts to produce the first round merged annotation (GFF2; Supplementary Fig. 6).
To incorporate further coding evidence into the annotation, we used EvidenceModeller v1.1.138 to apply a differential weighting schema to multiple evidence sources before merging them into a single genome annotation. Evidence used included the GFF1 and GFF2 annotations from Braker and PASA as ab initio and transcript evidence, respectively, and protein sequences from C. elegans and H. contortus V1 genome assemblies (obtained from Wormbase ParaSite release 9 caenorhabditis_elegans.PRJNA13758.WBPS9.protein.fa.gz and haemonchus_contortus.PRJEB506.WBPS9.protein.fa.gz, respectively), as well as 110 curated genes; each protein dataset was mapped to the genome using exonerate v2.2.0 (parameters: –model protein2genome –percent 50 –showtargetgff93) followed by GFF extraction using Exonerate_to_evm_gff3.pl from EvidenceModeller. The weighting was applied as follows: PROTEIN (hc_v1 * curated models) = 2; PROTEIN (Ce) = 1; ABINITIO_PREDICTION (GFF1 from Braker) = 2; TRANSCRIPT (GFF2 from PASA) = 5. The merged output of EvidenceModeller (GFF3) was imported back into PASA to update gene models with the full-length transcript information, including updates to UTRs and evidence for alternative splicing. Consistent with the author’s recommendations, two rounds of iterative update and comparison were performed to maximise the use of the full-length transcript, i.e., Iso-Seq, data (GFF4; Supplementary Fig. 6). Gene IDs were subsequently renamed with the prefix HCON and numbered with an 8 digit identifier that incremented by a value of 10 to allow identifiers to be subsequently added for new features if required (GFF5; Supplementary Fig. 6).
The genome annotation, as well as available evidence used to generate the annotation (including per life stage RNA-seq, Iso-Seq [full-length high quality & CCS reads], coverage and repeat tracks), was uploaded to the web portal Apollo94 for further visualisation and manual curation. This platform has been used to provide a means to continually improve the H. contortus genome annotation via manual curation as new information becomes available. These annotation improvements will be subsequently incorporated into WormBase ParaSite.
Annotation statistics were determined using Genome Annotation Generator (GAG)95. Annotation comparison was only made between the previously published V1 draft assembly23, the final V4 chromosomal assembly presented here, the McMaster24 and NZ isolate33 draft assemblies, and the C. elegans reference; no comparable annotation was generated for the intermediate assemblies generated before V4 was finalised, including assembly versions V2 and V3. Annotation comparisons to a curated gene set to determine the sensitivity and specificity of the approaches used were performed using gffcompare v0.10.1 (http://ccb.jhu.edu/software/stringtie/gffcompare.shtml), and were further analysed for completeness using BUSCO (–mode proteins –lineage metazoa_odb9).
The repetitive content of the genome was analysed using RepeatModeller (v1.0.11) followed by RepeatMasker (v4.0.7). Further analysis of LTRs was performed using LTRharvest and LTRdigest96 from the GenomeTools v1.5.997 package.
Transcriptome analyses
Quantitative transcriptome analyses were primarily performed using kallisto v0.43.198, which quantifies transcripts by pseudo-aligning raw sequencing reads to a transcriptome reference. The transcriptome reference was derived from the reference FASTA and annotation GFF3 data using gffread (https://github.com/gpertea/gffread). Raw sequencing reads from each of the life-stages were pseudo-aligned to the reference using kallisto quant (parameters: –bias –bootstrap-samples 100 –fusion), before the differential expression between pairwise combinations of life-stages was determined using sleuth99. For each transcript, differential expression between life-stages was tested using a Wald test, for which a q-value less than 0.05 was deemed significant.
To explore coordinated gene expression throughout the life-stages, we used clust v1.8.4100 to automatically define and group transcripts by expression level profiles. Transcripts per million (TPM) counts per life-stage were generated from the kallisto analysis described above and used as input to clust. To explore putative regulatory sequences that may be shared by co-expressed genes, we searched in each group of clustered transcripts for conserved motifs in the 100 bp region immediately upstream of the start codon for each transcript using DREME101, after which motif-containing sequences were identified using FIMO102 (MEME v5.0.4). To perform gene set analyses, subsets of H. contortus genes together with associated C. elegans orthologs were obtained using WormBase ParaSite BioMart17. Functional enrichment analyses for both species were performed using g:Profiler (version e97_eg44_p13_d22abce), with a g:SCS multiple testing correction method applying a significance threshold of 0.05103. OrthoFinder v2.2.7104 was used to determine orthologous relationships between protein sequences inferred from the H. contortus V1 and V4, the McMaster H. contortus, as well as H. placei and C. elegans protein-coding gene annotations.
Trans-splicing: characterisation of splice leader usage and distribution
We characterised the number and distribution of genes for which mapped RNA-seq transcripts contained the canonical H. contortus SL1 (GGTTTAATTACCCAAGTTTGAG) and SL2 (GGTTTTAACCCAGTATCTCAAG) sequences. For each splice leader sequence, cutadapt v1.13105 was used to detect (allowing up to 10% divergence in sequence match [–error-rate = 0.1] and minimum match length of 10 bp [–overlap=10]) and subsequently trim the leader sequences from raw RNA-seq reads from each of the life stages, after which all trimmed reads were collated and mapped to the genome using hisat2 v2.1.0106. The relative coverage of the sequence between the trimmed splice leader site and (i) the start codon (using a window 200 bp upstream and 50 bp downstream of the start codon) and (ii) internal exons (excluding the 1st exon), was determined using bedtools coverage. Weblogo107 was used to visualise sequence conservation in the genomic sequence surrounding the splice leader site, and in the splice leader sequences themselves. The spliced leader sequence splice sites for all life stages were combined into a single dataset for plotting and quantitative analyses.
Cis-splicing: differential isoform usage using Leafcutter
To analyse the extent of differential spliced isoform usage between life-stages, we used leafcutter108(see http://davidaknowles.github.io/leafcutter/ for installation and scripts described below) to quantify the proportion of reads that span introns and then subsequently quantify differential intron usage across samples. First, raw RNA-seq reads were mapped to the reference genome using STAR90 (parameters: –twopassMode basic –outSAMstrandField intronMotif –outSAMtype BAM Unsorted) to generate BAM files per life-stage. Intron junctions (junc files) were generated from the BAM files, which were subsequently clustered using leafcutter_cluster.py (parameters: -m 30 -l 500000). Differential introns between pairwise life-stages using leafcutter_ds.R (parameters: –min_samples_per_intron=3 –min_samples_per_group=3 –min_coverage=10), which required the clustered introns and an exon file (derived from the annotation GFF3) as input. Preparation of the data for visualisation of differentially spliced introns was performed using prepare_results.R; an annotation GTF was required as input, which was generated from the annotation GFF3 using gffread. The visualisation was performed using run_leafviz.R.
Population genetic analyses
To understand how genetic variation was distributed throughout the chromosomes, we collated sequencing data from our recent global analysis27 together with in-house archival sequencing data to form the most complete sample cohort of H. contortus whole-genome data to date. Reanalysis of the global cohort data was necessary, as the previous analysis was performed using the V3 assembly. In total, 338 samples from 46 populations in 17 countries were analysed; the samples together with geolocation, mitochondrial phylogeny describing genetic connectivity between samples, and ENA references to individual sequencing datasets are available to explore using Microreact109 and can be visualised here: https://microreact.org/project/hcontortus_global_diversity.
Raw sequencing data were mapped to the V4 reference using bwa mem87 (parameters: -Y and -M options were used to use soft clipping for supplementary alignments, and to mark shorter hits as secondary, respectively), followed by identification of duplicate reads using Picard (v2.5.0; https://github.com/broadinstitute/picard). Samples for which multiple sequencing lanes were generated were mapped independently and subsequently combined into a single dataset using samtools merge. Mapping statistics were generated using samtools (v1.3) flagstat and bamtools (v 2.3.0) stats and compared using MultiQC v1.3110.
Genetic variation within and between samples was determined using GATK (v 4.0.3.0). GVCFs were first created using HaplotypeCaller; to improve efficiency, this process was performed in parallel for each chromosome for each sample separately, and subsequently combined by chromosome using CombineGVCFs. Variants were determined per chromosome using GenotypeGVCFs, after which the final raw variants were combined using vcftools concat. A hard filter was applied using GATK VariantFiltration to mark variants using the following criteria: QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < −12.5, and ReadPosRankSum < −8.0. Nucleotide diversity and F-stats within and between populations was determined using vcftools v0.1.14111 --window-pi and --weir-fst-pop, respectively, using a window size of 100 kbp.
To analyse mitochondrial DNA diversity, bcftools view was used to select single nucleotide variants with a read depth > 10, homozygous variant, and with minor allele frequency >0.01. To generate the phylogeny, variants were used to generate a consensus mitochondrial genome per sample, which were aligned using mafft v7.205112, and from which maximum likelihood phylogenies were generated using iqtree v1.6.5113 (parameters: -alrt 1000 -bb 1000 -nt 1) using automated ModelFinding114. Principal component analysis (PCA) of the mitochondrial DNA variation was performed using the poppr115 package in R.
Sequencing coverage variation between the autosomes and X chromosome for each BAM file was performed using samtools116 bedcov, using a BED file of 100 kbp windows generated using bedtools85 makewindows.
Statistics and reproducibility
All statistical analyses, data exploration, and visualisation were performed using R (v3.5.0; https://www.R-project.org/) unless otherwise stated.
Ethics approval
The use of experimental animals to maintain parasite populations for the purposes described in this manuscript was approved by the Moredun Research Institute Experiments and Ethics Committee and were conducted under approved British Home Office licenses in accordance with the Animals (Scientific Procedures) Act of 1986. The Home Office licence numbers were PPL 60/03899 and 60/4421, and the experimental identifiers for these studies were E06/58, E06/75, E09/36 and E14/30.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Supplementary information
Acknowledgements
We would like to acknowledge members of the BUG Consortium and Parasite Genomics group at the Wellcome Sanger Institute for insightful discussions over the course of this project. We also thank Pathogen Informatics and DNA Pipelines (WSI) for their support and expertise, and the Biosciences Division at the Moredun Research Institute for expert care and assistance with animals. Work described here was supported by Wellcome’s core funding of the Wellcome Sanger Institute (grant WT206194), a Wellcome Trust Project Grant to JSG (grant 067811), and the Biotechnology and Biological Sciences Research Council (BBSRC) grants (BB/M003949/1, BB/P024610/1 and BB/K020048/1). JDN was supported by a Graduate Mobility Award from McGill University.
Author contributions
J.A.C., J.S.G., M.B., E.D. and S.R.D designed the study. S.R.D. led the analysis and interpretation of the data, A.T., A.M., H.B., K.B., R.B. and J.D.N., analysed data. W.B., K.L.H, M.P. and F.H.R., provided novel software and computational resources. R.L., D.B., N.S., U.C., M.Z.S, J.W., E.D., C.B. and G.Sal., maintained parasites, collected and/or provided samples. G.San., M.A.Q., E.R. and K.M., performed laboratory work. N.H. coordinated samples and sequencing. S.R.D. and J.A.C. wrote the manuscript, with contribution and review of the final manuscript by all authors.
Code availability
Code and workflows used to analyse data and reproduce figures are available at https://github.com/stephenrdoyle/hcontortus_genome and Zenodo 10.5281/zenodo.4069270.
Data availability
The raw sequence data is available under the ENA accession PRJEB506, with reference to specific sequencing libraries described throughout the text, and/or in Table S3 of Laing et al. 2013. RNA-seq data is available from the ENA study ID PRJEB1360. The genome assembly has been made available at ENA (assembly accession: GCA_000469685.2) and WormBase ParaSite (https://parasite.wormbase.org/Haemonchus_contortus_prjeb506/Info/Index). A static version of the genome annotation used in this paper is available at ftp://ftp.sanger.ac.uk/pub/pathogens/sd21/HCON_V4_GENOME/ (signoff date: 25th Jan 2019), however, the most up-to-date version of the annotation can be accessed at WormBase ParaSite (https://parasite.wormbase.org/Haemonchus_contortus_prjeb506/Info/Index/17).
Genome variation data can be visualised using MicroReact (https://microreact.org/project/hcontortus_global_diversity), from which links to ENA accessions for individual sample sequencing data from the global collection can be obtained. All relevant data are available from the corresponding authors upon request.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Stephen R. Doyle, Email: stephen.doyle@sanger.ac.uk
James A. Cotton, Email: jc17@sanger.ac.uk
Supplementary information
Supplementary information is available for this paper at 10.1038/s42003-020-01377-3.
References
- 1.Nicholls SM, Quick JC, Tang S, Loman NJ. Ultra-deep, long-read nanopore sequencing of mock microbial community standards. Gigascience. 2019;8:giz043. doi: 10.1093/gigascience/giz043. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Stewart RD, et al. Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery. Nat. Biotechnol. 2019;37:953–961. doi: 10.1038/s41587-019-0202-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Yoshimura J, et al. Recompleting the Caenorhabditis elegans genome. Genome Res. 2019;29:1009–1022. doi: 10.1101/gr.244830.118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Low WY, et al. Chromosome-level assembly of the water buffalo genome surpasses human and goat genomes in sequence contiguity. Nat. Commun. 2019;10:260. doi: 10.1038/s41467-018-08260-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Belser C, et al. Chromosome-scale assemblies of plant genomes using nanopore long reads and optical maps. Nat. Plants. 2018;4:879–887. doi: 10.1038/s41477-018-0289-4. [DOI] [PubMed] [Google Scholar]
- 6.Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB. Direct determination of diploid genome sequences. Genome Res. 2017;27:757–767. doi: 10.1101/gr.214874.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Chin C-S, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods. 2016;13:1050–1054. doi: 10.1038/nmeth.4035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Koren S, et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 2018;36:1174–1182. doi: 10.1038/nbt.4277. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Barrière A, et al. Detecting heterozygosity in shotgun genome assemblies: Lessons from obligately outcrossing nematodes. Genome Res. 2009;19:470–480. doi: 10.1101/gr.081851.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.International Helminth Genomes Consortium. Comparative genomics of the major parasitic worms. Nat. Genet. 2018;51:163–174. doi: 10.1038/s41588-018-0262-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Tsai IJ, et al. The genomes of four tapeworm species reveal adaptations to parasitism. Nature. 2013;496:57–63. doi: 10.1038/nature12031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Cotton JA, et al. The genome of Onchocerca volvulus, agent of river blindness. Nat. Microbiol. 2016;2:16216. doi: 10.1038/nmicrobiol.2016.216. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Hunt VL, et al. The genomic basis of parasitism in the Strongyloides clade of nematodes. Nat. Genet. 2016;48:299–307. doi: 10.1038/ng.3495. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Li W, et al. The genome of tapeworm Taenia multiceps sheds light on understanding parasitic mechanism and control of coenurosis disease. DNA Res. 2018;25:499–510. doi: 10.1093/dnares/dsy020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Foster, J. M. et al. Sex chromosome evolution in parasitic nematodes of humans. Nat. Commun.11, 1964 (2020). [DOI] [PMC free article] [PubMed]
- 16.Olson, P. D. et al. Complete representation of a tapeworm genome reveals chromosomes capped by centromeres, necessitating a dual role in segregation and protection. Preprint at https://www.biorxiv.org/content/10.1101/2020.04.08.031872v1 (2020). [DOI] [PMC free article] [PubMed]
- 17.Howe KL, Bolt BJ, Shafie M, Kersey P, Berriman M. WormBase ParaSite − a comprehensive resource for helminth genomics. Mol. Biochem. Parasitol. 2017;215:2–10. doi: 10.1016/j.molbiopara.2016.11.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Kaminsky R, et al. A new class of anthelmintics effective against drug-resistant nematodes. Nature. 2008;452:176–180. doi: 10.1038/nature06722. [DOI] [PubMed] [Google Scholar]
- 19.Preston S, et al. Low cost whole-organism screening of compounds for anthelmintic activity. Int. J. Parasitol. 2015;45:333–343. doi: 10.1016/j.ijpara.2015.01.007. [DOI] [PubMed] [Google Scholar]
- 20.Bassetto CC, Amarante AFT. Vaccination of sheep and cattle against haemonchosis. J. Helminthol. 2015;89:517–525. doi: 10.1017/S0022149X15000279. [DOI] [PubMed] [Google Scholar]
- 21.Knox DP, et al. The nature and prospects for gut membrane proteins as vaccine candidates for Haemonchus contortus and other ruminant trichostrongyloids. Int. J. Parasitol. 2003;33:1129–1137. doi: 10.1016/S0020-7519(03)00167-X. [DOI] [PubMed] [Google Scholar]
- 22.Gilleard JS. Haemonchus contortus as a paradigm and model to study anthelmintic drug resistance. Parasitology. 2013;140:1506–1522. doi: 10.1017/S0031182013001145. [DOI] [PubMed] [Google Scholar]
- 23.Laing R, et al. The genome and transcriptome of Haemonchus contortus, a key model parasite for drug and vaccine discovery. Genome Biol. 2013;14:R88. doi: 10.1186/gb-2013-14-8-r88. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Schwarz EM, et al. The genome and developmental transcriptome of the strongylid nematode Haemonchus contortus. Genome Biol. 2013;14:R89. doi: 10.1186/gb-2013-14-8-r89. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Laing R, et al. Haemonchus contortus: genome structure, organization and comparative genomics. Adv. Parasitol. 2016;93:569–598. doi: 10.1016/bs.apar.2016.02.016. [DOI] [PubMed] [Google Scholar]
- 26.Wintersinger, J., Mariene, G. M. & Wasmuth, J. One species, two genomes: a critical assessment of inter isolate variation and identification of assembly incongruence in Haemonchus contortus. Preprint at https://www.biorxiv.org/content/10.1101/384008v1.abstract (2018).
- 27.Sallé G, et al. The global diversity of Haemonchus contortus is shaped by human intervention and climate. Nat. Commun. 2019;10:4811. doi: 10.1038/s41467-019-12695-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Troell K, Engström A, Morrison DA, Mattsson JG, Höglund J. Global patterns reveal strong population structure in Haemonchus contortus, a nematode parasite of domesticated ruminants. Int. J. Parasitol. 2006;36:1305–1316. doi: 10.1016/j.ijpara.2006.06.015. [DOI] [PubMed] [Google Scholar]
- 29.Blouin MS, Yowell CA, Courtney CH, Dame JB. Host movement and the genetic structure of populations of parasitic nematodes. Genetics. 1995;141:1007–1014. doi: 10.1093/genetics/141.3.1007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Doyle SR, Cotton JA. Genome-wide approaches to investigate anthelmintic resistance. Trends Parasitol. 2019;35:289–301. doi: 10.1016/j.pt.2019.01.004. [DOI] [PubMed] [Google Scholar]
- 31.Redman E, et al. Genetics of mating and sex determination in the parasitic nematode Haemonchus contortus. Genetics. 2008;180:1877–1887. doi: 10.1534/genetics.108.094623. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Bremner KC. Cytological polymorphism in the nematode Haemonchus contortus (Rudolphi 1803) Cobb 1898. Nature. 1954;174:704–705. doi: 10.1038/174704b0. [DOI] [PubMed] [Google Scholar]
- 33.Palevich N, Maclean PH, Baten A, Scott RW, Leathwick DM. The genome sequence of the anthelmintic-susceptible New Zealand haemonchus contortus. Genome Biol. Evol. 2019;11:1965–1970. doi: 10.1093/gbe/evz141. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Sargison ND, et al. A method for single pair mating in an obligate parasitic nematode. Int. J. Parasitol. 2018;48:159–165. doi: 10.1016/j.ijpara.2017.08.010. [DOI] [PubMed] [Google Scholar]
- 35.Doyle SR, et al. A genome resequencing-based genetic map reveals the recombination landscape of an outbred parasitic nematode in the presence of polyploidy and polyandry. Genome Biol. Evol. 2018;10:396–409. doi: 10.1093/gbe/evx269. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Rockman MV, Kruglyak L. Recombinational landscape and population genomics of Caenorhabditis elegans. PLoS Genet. 2009;5:e1000419. doi: 10.1371/journal.pgen.1000419. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics. 2016;32:767–769. doi: 10.1093/bioinformatics/btv661. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Haas BJ, et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 2008;9:R7. doi: 10.1186/gb-2008-9-1-r7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Chaudhry U, et al. Genetic evidence for hybridisation between Haemonchus contortus and Haemonchus placei in natural field populations and its implications for interspecies transmission of anthelmintic resistance. Int. J. Parasitol. 2015;45:149–159. doi: 10.1016/j.ijpara.2014.09.002. [DOI] [PubMed] [Google Scholar]
- 40.Albritton SE, et al. Sex-biased gene expression and evolution of the x chromosome in nematodes. Genetics. 2014;197:865–883. doi: 10.1534/genetics.114.163311. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Reinke V, Gil IS, Ward S, Kazmer K. Genome-wide germline-enriched and sex-biased expression profiles in Caenorhabditis elegans. Development. 2004;131:311–323. doi: 10.1242/dev.00914. [DOI] [PubMed] [Google Scholar]
- 42.Berkseth M, Ikegami K, Arur S, Lieb JD, Zarkower D. TRA-1 ChIP-seq reveals regulators of sexual differentiation and multilevel feedback in nematode sex determination. Proc. Natl Acad. Sci. USA. 2013;110:16033–16038. doi: 10.1073/pnas.1312087110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Ellis R. Sex determination in the germ line. WormBook. 2006 doi: 10.1895/wormbook.1.82.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Allen MA, Hillier LW, Waterston RH, Blumenthal T. A global analysis of C. elegans trans-splicing. Genome Res. 2011;21:255–264. doi: 10.1101/gr.113811.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Bektesh S, Van Doren K, Hirsh D. Presence of the Caenorhabditis elegans spliced leader on different mRNAs and in different genera of nematodes. Genes Dev. 1988;2:1277–1283. doi: 10.1101/gad.2.10.1277. [DOI] [PubMed] [Google Scholar]
- 46.Redmond DL, Knox DP. Haemonchus contortus SL2 trans-spliced RNA leader sequence. Mol. Biochem. Parasitol. 2001;117:107–110. doi: 10.1016/S0166-6851(01)00331-0. [DOI] [PubMed] [Google Scholar]
- 47.Laing R, et al. Annotation of two large contiguous regions from the haemonchus contortus genome using RNA-seq and comparative analysis with Caenorhabditis elegans. PLoS ONE. 2011;6:e23216. doi: 10.1371/journal.pone.0023216. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Huang P, et al. Identification and analysis of internal promoters in Caenorhabditis elegans operons. Genome Res. 2007;17:1478–1485. doi: 10.1101/gr.6824707. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Rufener L, Mäser P, Roditi I, Kaminsky R. Haemonchus contortus acetylcholine receptors of the DEG-3 subfamily and their role in sensitivity to monepantel. PLoS Pathog. 2009;5:e1000380. doi: 10.1371/journal.ppat.1000380. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Blumenthal, T. Operon and non-operon gene clusters in the C. elegans genome. WormBook 1–20, 10.1895/wormbook.1.175.1 (2014). [DOI] [PMC free article] [PubMed]
- 51.Maruyama R, Endo S, Sugimoto A, Yamamoto M. Caenorhabditis elegans DAZ-1 is expressed in proliferating germ cells and directs proper nuclear organization and cytoplasmic core formation during oogenesis. Dev. Biol. 2005;277:142–154. doi: 10.1016/j.ydbio.2004.08.053. [DOI] [PubMed] [Google Scholar]
- 52.Gilleard, J. S. & Redman, E. Genetic Diversity and Population Structure of Haemonchus contortus. Haemonchus contortus and Haemonchosis – Past, Present and Future Trends 31–68, 10.1016/bs.apar.2016.02.009 (2016). [DOI] [PubMed]
- 53.Wilson Sayres MA. Genetic diversity on the sex chromosomes. Genome Biol. Evol. 2018;10:1064–1078. doi: 10.1093/gbe/evy039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Doyle SR, et al. Population genomic and evolutionary modelling analyses reveal a single major QTL for ivermectin drug resistance in the pathogenic nematode, Haemonchus contortus. BMC Genomics. 2019;20:218. doi: 10.1186/s12864-019-5592-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Jain M, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 2018;36:338–345. doi: 10.1038/nbt.4060. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Mandrioli M, Manicardi GC. Unlocking holocentric chromosomes: new perspectives from comparative and functional genomics? Curr. Genomics. 2012;13:343–349. doi: 10.2174/138920212801619250. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Kingan SB, et al. A high-quality de novo genome assembly from a single mosquito using PacBio sequencing. Genes. 2019;10:62. doi: 10.3390/genes10010062. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Kashyap SS, et al. Emodepside has sex-dependent immobilizing effects on adult Brugia malayi due to a differentially spliced binding pocket in the RCK1 region of the SLO-1 K channel. PLoS Pathog. 2019;15:e1008041. doi: 10.1371/journal.ppat.1008041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Wolstenholme AJ. Ion channels and receptor as targets for the control of parasitic nematodes. Int. J. Parasitol. 2011;1:2–13. doi: 10.1016/j.ijpddr.2011.09.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Schwarz EM, et al. The genome and transcriptome of the zoonotic hookworm Ancylostoma ceylanicum identify infection-specific gene families. Nat. Genet. 2015;47:416–422. doi: 10.1038/ng.3237. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Tang YT, et al. Genome of the human hookworm Necator americanus. Nat. Genet. 2014;46:261–269. doi: 10.1038/ng.2875. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Wang Z, et al. Characterizing Ancylostoma caninum transcriptome and exploring nematode parasitic adaptation. BMC Genomics. 2010;11:307. doi: 10.1186/1471-2164-11-307. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Choi Y-J, et al. A deep sequencing approach to comparatively analyze the transcriptome of lifecycle stages of the filarial worm, Brugia malayi. PLoS Negl. Trop. Dis. 2011;5:e1409. doi: 10.1371/journal.pntd.0001409. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Gerstein MB, et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science. 2010;330:1775–1787. doi: 10.1126/science.1196914. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Kim DH, Grün D, van Oudenaarden A. Dampening of expression oscillations by synchronous regulation of a microRNA and its target. Nat. Genet. 2013;45:1337–1344. doi: 10.1038/ng.2763. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Basáñez M-G, et al. Effect of single-dose ivermectin on Onchocerca volvulus: a systematic review and meta-analysis. Lancet Infect. Dis. 2008;8:310–322. doi: 10.1016/S1473-3099(08)70099-9. [DOI] [PubMed] [Google Scholar]
- 67.Sabah AA, Fletcher C, Webbe G, Doenhoff MJ. Schistosoma mansoni: chemotherapy of infections of different ages. Exp. Parasitol. 1986;61:294–303. doi: 10.1016/0014-4894(86)90184-0. [DOI] [PubMed] [Google Scholar]
- 68.Neveu C, et al. Genetic diversity of levamisole receptor subunits in parasitic nematode species and abbreviated transcripts associated with resistance. Pharmacogenet. Genomics. 2010;20:414–425. doi: 10.1097/FPC.0b013e328338ac8c. [DOI] [PubMed] [Google Scholar]
- 69.Fauvin A, et al. cDNA-AFLP analysis in levamisole-resistant Haemonchus contortus reveals alternative splicing in a nicotinic acetylcholine receptor subunit. Mol. Biochem. Parasitol. 2010;170:105–107. doi: 10.1016/j.molbiopara.2009.11.007. [DOI] [PubMed] [Google Scholar]
- 70.Rezansoff AM, et al. The confounding effects of high genetic diversity on the determination and interpretation of differential gene expression analysis in the parasitic nematode Haemonchus contortus. Int. J. Parasitol. 2019;49:847–858. doi: 10.1016/j.ijpara.2019.05.012. [DOI] [PubMed] [Google Scholar]
- 71.Niciura SCM, et al. Extreme-QTL mapping of monepantel resistance in Haemonchus contortus. Parasit. Vectors. 2019;12:403. doi: 10.1186/s13071-019-3663-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011;27:578–579. doi: 10.1093/bioinformatics/btq683. [DOI] [PubMed] [Google Scholar]
- 73.Tsai IJ, Otto TD, Berriman M. Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol. 2010;11:R41. doi: 10.1186/gb-2010-11-4-r41. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Hunt M, et al. REAPR: a universal tool for genome assembly evaluation. Genome Biol. 2013;14:R47. doi: 10.1186/gb-2013-14-5-r47. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Bonfield JK, Whitwham A. Gap5—editing the billion fragment sequence assembly. Bioinformatics. 2010;26:1699–1703. doi: 10.1093/bioinformatics/btq268. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Koren S, et al. Canu: scalable and accurate long-read assembly via adaptive -mer weighting and repeat separation. Genome Res. 2017;27:722–736. doi: 10.1101/gr.215087.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Kurtz S, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12. doi: 10.1186/gb-2004-5-2-r12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Yeo S, Coombe L, Warren RL, Chu J, Birol I. ARCS: scaffolding genome drafts with linked reads. Bioinformatics. 2018;34:725–731. doi: 10.1093/bioinformatics/btx675. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Warren RL, et al. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. Gigascience. 2015;4:35. doi: 10.1186/s13742-015-0076-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Walker BJ, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE. 2014;9:e112963. doi: 10.1371/journal.pone.0112963. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics. 2007;23:1061–1067. doi: 10.1093/bioinformatics/btm071. [DOI] [PubMed] [Google Scholar]
- 82.Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–3212. doi: 10.1093/bioinformatics/btv351. [DOI] [PubMed] [Google Scholar]
- 83.Wick RR, Schultz MB, Zobel J, Holt KE. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics. 2015;31:3350–3352. doi: 10.1093/bioinformatics/btv383. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Nattestad, M., Aboukhalil R., Chin C.-S. & Schatz M. C., Ribbon: intuitive visualization for complex genomic variation, Bioinformatics, btaa680, 10.1093/bioinformatics/btaa680 (2020). [DOI] [PMC free article] [PubMed]
- 85.Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Gu Z, Gu L, Eils R, Schlesner M, Brors B. circlize implements and enhances circular visualization in R. Bioinformatics. 2014;30:2811–2812. doi: 10.1093/bioinformatics/btu393. [DOI] [PubMed] [Google Scholar]
- 87.Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
- 88.Picelli S, et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat. Methods. 2013;10:1096–1098. doi: 10.1038/nmeth.2639. [DOI] [PubMed] [Google Scholar]
- 89.Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.Dobin A, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91.Lomsadze A, Burns PD, Borodovsky M. Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res. 2014;42:e119. doi: 10.1093/nar/gku557. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Haas BJ, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003;31:5654–5666. doi: 10.1093/nar/gkg770. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005;6:31. doi: 10.1186/1471-2105-6-31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94.Dunn NA, et al. Apollo: democratizing genome annotation. PLoS Comput. Biol. 2019;15:e1006790. doi: 10.1371/journal.pcbi.1006790. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 95.Geib SM, et al. Genome annotation generator: a simple tool for generating and correcting WGS annotation tables for NCBI submission. Gigascience. 2018;7:1–5. doi: 10.1093/gigascience/giy018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9:18. doi: 10.1186/1471-2105-9-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 97.Gremme G, Steinbiss S, Kurtz S. GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Trans. Comput. Biol. Bioinform. 2013;10:645–656. doi: 10.1109/TCBB.2013.68. [DOI] [PubMed] [Google Scholar]
- 98.Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 2016;34:525–527. doi: 10.1038/nbt.3519. [DOI] [PubMed] [Google Scholar]
- 99.Pimentel H, Bray NL, Puente S, Melsted P, Pachter L. Differential analysis of RNA-seq incorporating quantification uncertainty. Nat. Methods. 2017;14:687–690. doi: 10.1038/nmeth.4324. [DOI] [PubMed] [Google Scholar]
- 100.Abu-Jamous B, Kelly S. Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data. Genome Biol. 2018;19:172. doi: 10.1186/s13059-018-1536-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 101.Bailey TL. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics. 2011;27:1653–1659. doi: 10.1093/bioinformatics/btr261. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27:1017–1018. doi: 10.1093/bioinformatics/btr064. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.Raudvere U, et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update) Nucleic Acids Res. 2019;47:W191–W198. doi: 10.1093/nar/gkz369. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104.Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20:238. doi: 10.1186/s13059-019-1832-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 105.Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. J. 2011;17:10. doi: 10.14806/ej.17.1.200. [DOI] [Google Scholar]
- 106.Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods. 2015;12:357–360. doi: 10.1038/nmeth.3317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 107.Crooks GE, Hon G, Chandonia J-M, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14:1188–1190. doi: 10.1101/gr.849004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 108.Li YI, et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet. 2017;50:151–158. doi: 10.1038/s41588-017-0004-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 109.Argimón S, et al. Microreact: visualizing and sharing data for genomic epidemiology and phylogeography. Micro. Genom. 2016;2:e000093. doi: 10.1099/mgen.0.000093. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 110.Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–3048. doi: 10.1093/bioinformatics/btw354. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 111.Danecek P, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 112.Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics. 2018;34:2490–2492. doi: 10.1093/bioinformatics/bty121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113.Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 2015;32:268–274. doi: 10.1093/molbev/msu300. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114.Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods. 2017;14:587–589. doi: 10.1038/nmeth.4285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 115.Kamvar ZN, Tabima JF, Grünwald NJ. Poppr: an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ. 2014;2:e281. doi: 10.7717/peerj.281. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 116.Li H, et al. The sequence alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Code and workflows used to analyse data and reproduce figures are available at https://github.com/stephenrdoyle/hcontortus_genome and Zenodo 10.5281/zenodo.4069270.
The raw sequence data is available under the ENA accession PRJEB506, with reference to specific sequencing libraries described throughout the text, and/or in Table S3 of Laing et al. 2013. RNA-seq data is available from the ENA study ID PRJEB1360. The genome assembly has been made available at ENA (assembly accession: GCA_000469685.2) and WormBase ParaSite (https://parasite.wormbase.org/Haemonchus_contortus_prjeb506/Info/Index). A static version of the genome annotation used in this paper is available at ftp://ftp.sanger.ac.uk/pub/pathogens/sd21/HCON_V4_GENOME/ (signoff date: 25th Jan 2019), however, the most up-to-date version of the annotation can be accessed at WormBase ParaSite (https://parasite.wormbase.org/Haemonchus_contortus_prjeb506/Info/Index/17).
Genome variation data can be visualised using MicroReact (https://microreact.org/project/hcontortus_global_diversity), from which links to ENA accessions for individual sample sequencing data from the global collection can be obtained. All relevant data are available from the corresponding authors upon request.