Abstract
The Pacific banana slug, Ariolimax columbianus, is endemic to the forests of the Pacific Northwest. Found throughout the coastal foothills and mountains of California, the hermaphroditic molluscs Ariolimax spp. are niche-constrained, hyper-localized, and phenotypically diverse. The evolutionary history, recent population history and environmental conditions leading to their phenotypic and genetic variation are not understood. To facilitate such research, we present the first high-quality de novo genome assembly of A. columbianus as part of the California Conservation Genomics Project. Pacific Biosciences HiFi long reads and Omni-C chromatin-proximity sequencing technologies were used to produce a de novo genome assembly, consistent with the standard California Conservation Genomics Project genome assembly protocol. This assembly comprises 401 scaffolds spanning 2.29 Gb, represented by a scaffold N50 of 94.9 Mb, a contig N50 of 3.7 Mb, and a benchmarking universal single-copy ortholog completeness score of 93.9%. Future work will use the A. columbianus genome to study Ariolimax spp. across California to understand patterns of population structure, genetic diversity, and the broader ecological connections with their habitat. These data will contribute to the California Conservation Genomics Project, expanding the knowledge about the partitioning of genomic variation across the different ecoregions of California.
Keywords: Ariolimax californicus, Ariolimax dolichophallus, California Conservation Genomics Project, CCGP, hyper-localized, hermaphroditic
1. Introduction
Banana slugs, Ariolimax, are found across the coastal foothills and mountains of Western North America from Southern California to Alaska. These hermaphroditic molluscs are synonymous with coastal redwood forests, but range broadly across wet forest habitats. They are the largest terrestrial slug in North America, with lengths exceeding 26 cm, and are notable for their striking yellow coloration (Harper 1988; von Proschwitz et al. 2017). Within California, the genus includes up to six species: Ariolimax brachyphallus, Ariolimax buttoni, Ariolimax californicus, Ariolimax columbianus, Ariolimax dolichophallus, and Ariolimax stramineus. The genus Ariolimax was established in 1859 with A. columbianus as the type species (Mörch 1859). The genus was initially characterized in California, with species boundaries primarily delineated by phenotypic differences in body size and genital morphology (Mead 1943). However, the Pacific banana slug, A. columbianus, is the only species found outside of California, with a range extending from the North Coast of California to Alaska. This species is notably phenotypically diverse; Pacific banana slugs display a range of geographically constrained colors and dappled patterns, including shades of yellow, green, brown, and black (Fig. 1) (Pearson et al. 2006). Consequently, the widely distributed A. columbianus may represent a polyphyletic group, a common occurrence in land snails (Nantarat et al. 2014; Neiber et al. 2015; Fontanilla et al. 2017; Chueca et al. 2018). The evolutionary and recent population histories, along with environmental conditions that contribute to the phenotypic and genetic variations within the genus, remain poorly understood.
Fig. 1.
Close up images of Ariolimax columbianus and their habitat. (A) The reference specimen (1044C). (B) This is an example of representative habitat for the Pacific banana slug. (C) A spotted specimen from the California North Coast. (D) A specimen from the foothills of the California Sierra Nevada.
Attempts to evaluate population structure, connectivity, and genetic diversity among A. columbianus have relied on mitochondrial markers (Leonard et al. 2011; von Proschwitz et al. 2017; Brune 2022), which greatly restrict genomic inference of population structure. A recently published genome for A. columbianus (NCBI:GCA_032357075.1), using Illumina short reads, comprised 535,985 scaffolds with a contig L50 of 114,321, indicating a discontinuous assembly (Edsinger et al. 2024). Here, we report the first high-quality, near-chromosome-level assembly for A. columbianus as part of the California Conservation Genomics Project (CCGP). The primary goal of the CCGP is to publish and analyze whole-genome sequencing data from a wide range of species, 230 in total, covering different ecoregions and habitats within California. This initiative seeks to address conservation challenges exacerbated by climate change, habitat loss, and other environmental threats to inform policy decisions on land use and resource management (Beninde et al. 2022; Fiedler et al. 2022; Shaffer et al. 2022; Toffelmier et al. 2022). In the context of the CCGP, this genome assembly will serve as an essential resource for the high-resolution characterization of genomic variation in Ariolimax, developing a better understanding of the frequency of self-fertilization, population connectivity, specific barriers to gene flow, and the genetic factors influencing their color polymorphism. More broadly, the addition of Ariolimax to the CCGP provides a resource for the study of a self-fertilizing and hyper-localized genus, contributing a unique asset to the broader study of factors influencing the genomic diversity landscape of species across California.
2. Methods
2.1 Biological materials
A Pacific banana slug was collected north of Crescent City, CA (41.821 N, 124.141 W) in August 2020. The individual was yellow with no spots and measured approximately 2 inches long (Fig. 1A). A cross-section of the body was used for DNA extraction and generation of the genome assembly.
2.2 High molecular weight genomic DNA isolation
High molecular weight (HMW) genomic DNA (gDNA) was extracted from a 30 mg flash-frozen posterior body cross-section. The tissue was homogenized in 1 ml of homogenization buffer (10 mM Tris-HCl, pH 8.0, 25 mM EDTA, 0.2% 2-Mercaptoethanol) using TissueRuptor II (Qiagen, Germany; Cat # 9002755). 1 ml of lysis buffer (10 mM Tris, 25 mM EDTA, 200 mM NaCl, 1% SDS, 0.2% 2-Mercaptoethanol) and proteinase K (100 µg/ml) was added to the homogenate and it was incubated overnight at room temperature. Lysate was treated with RNase A (20 µg/ml) at 37 °C for 30 minutes and was cleaned with equal volumes of phenol/chloroform using phase-lock gels (Quantabio, Beverly, MA; Cat # 2302830). DNA was precipitated by adding 0.4X volume of 5M ammonium acetate and 3X volume of ice-cold ethanol. The DNA pellet was washed twice with 70% ethanol and resuspended in an elution buffer (10 mM Tris, pH 8.0). DNA was additionally cleaned three times with 1X KAPA pure SPRI beads (KAPA Biosystems, Wilmington, MA; Cat # KK8002). The purity of gDNA was assessed using the NanoDrop ND-1000 spectrophotometer, where a 260/280 ratio of 1.80 and a 260/230 ratio of 2.09 were observed. DNA yield was 14.6 µg as quantified by Qubit 2.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA). The integrity of the HMW gDNA was verified on a Femto pulse system (Agilent Technologies, Santa Clara, CA), where 73% of DNA was observed in fragments above 20 kb.
2.3 HiFi library preparation and sequencing
The HiFi SMRTbell library was constructed using the SMRTbell prep kit 3.0 (Pacific Biosciences [PacBio], Menlo Park, CA; Cat. #102-182-700) according to the manufacturer’s instructions. HMW gDNA was sheared to a mode of 15 to 18 kb, with fragments ranging from 1 to 50 kb, using Diagenode’s Megaruptor 3 system (Diagenode, Belgium; cat. B06010003). The sheared gDNA was concentrated using 1X of SMRTbell cleanup beads provided in the SMRTbell prep kit 3.0 for the repair and A-tailing incubation at 37 °C for 30 minutes and 65 °C for 5 minutes, followed by ligation of overhang adapters at 20 °C for 30 minutes, cleanup using 1X SMRTbell cleanup beads, and nuclease treatment at 37 °C for 15 minutes. The SMRTbell library was size selected using 3.1X of 35% v/v diluted AMPure PB beads (PacBio, Cat. #100-265-900) to progressively remove SMRTbell templates <5 kb. The 15 to 18 kb average HiFi SMRTbell library was sequenced at UC Davis DNA Technologies Core (Davis, CA) using three 8M SMRT cells (PacBio, Cat #101-389-001), Sequel II sequencing chemistry 2.0, and 30-hour movies each on a PacBio Sequel IIe sequencer.
2.4 Omni-C library preparation and sequencing
The Omni-C library was prepared using the Dovetail Omni-C Kit (Dovetail Genomics, Scotts Valley, CA) according to the manufacturer’s protocol with slight modifications. First, specimen tissue (posterior body cross section) was thoroughly ground with a mortar and pestle while cooled with liquid nitrogen. Subsequently, chromatin was fixed in place in the nucleus. The suspended chromatin solution was then passed through 100 μm and 40 μm cell strainers to remove large debris. Fixed chromatin was digested under various conditions of DNase I until a suitable fragment length distribution of DNA molecules was obtained. Chromatin ends were repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter-containing ends. After proximity ligation, crosslinks were reversed and the DNA was purified from proteins. Purified DNA was treated to remove biotin that was not internal to ligated fragments. An NGS library was generated using an NEB Ultra II DNA Library Prep kit (New England Biolabs, Ipswich, MA) with an Illumina-compatible y-adaptor. Biotin-containing fragments were then captured using streptavidin beads. The post-capture product was split into two replicates prior to PCR enrichment to preserve library complexity, with each replicate receiving unique dual indices. The library was sequenced at Vincent J. Coates Genomics Sequencing Lab (Berkeley, CA) on an Illumina NovaSeq 6000 platform (Illumina, CA) to generate approximately 100 million 2 × 150 bp read pairs per Gb of genome size.
2.5 Nuclear genome assembly
We assembled the genome of an A. columbianus individual following the CCGP assembly pipeline version 5.0, as outlined in Table 1, which lists the tools and non-default parameters used in the assembly process. We removed remnant adapter sequences from the PacBio HiFi dataset using HiFiAdapterFilt (Sim et al. 2022) and generated an initial phased diploid assembly using HiFiasm (Cheng et al. 2022) in HiC mode with the filtered PacBio HiFi reads and the Omni-C short reads, a process that generates two assemblies, one per haplotype. We then aligned the Omni-C data to both assemblies following the Arima Genomics Mapping Pipeline (https://github.com/ArimaGenomics/mapping_pipeline) and then scaffolded both assemblies with SALSA (Ghurye et al. 2017, 2019).
Table 1.
Assembly pipeline and software used. Software citations are listed in the text.
| Assembly | Software and any non-default options | Version |
|---|---|---|
| Filtering PacBio HiFi adapters | HiFiAdapterFilt | Commit 64d1c7b |
| K-mer counting | Meryl (k = 21) | 1 |
| Estimation of genome size and heterozygosity | GenomeScope | 2 |
| De novo assembly (contiging) | HiFiasm (HiC Mode, --primary, output p_ctg.hap1, p_ctg.hap2) | 0.16.1-r375 |
| Scaffolding | ||
| Omni-C data alignment | Arima Genomics Mapping Pipeline | Commit 2e74ea4 |
| Omni-C Scaffolding | SALSA (-DNASE, -i 20, -p yes) | 2 |
| Gap closing | YAGCloser (-mins 2 -f 20 -mcc 2 -prt 0.25 -eft 0.2 -pld 0.2) | Commit 0e34c3b |
| Omni-C Contact map generation | ||
| Short-read alignment | BWA-MEM (-5SP) | 0.7.17-r1188 |
| SAM/BAM processing | samtools | 1.11 |
| SAM/BAM filtering | pairtools | 0.3.0 |
| Pairs indexing | pairix | 0.3.7 |
| Matrix generation | cooler | 0.8.10 |
| Matrix balancing | hicExplorer (hicCorrectMatrix correct --filterThreshold -2 4) | 3.6 |
| Contact map visualization | HiGlass | 2.1.11 |
| | PretextMap | 0.1.4 |
| | PretextView | 0.1.5 |
| | PretextSnapshot | 0.0.3 |
| Manual curation tools | Rapid curation pipeline (Wellcome Trust Sanger Institute, Genome Reference Informatics Team) | Commit 4ddca450 |
| Genome quality assessment | ||
| Basic assembly metrics | QUAST (--est-ref-size) | 5.0.2 |
| Assembly completeness | BUSCO (-m geno, -l metazoa) | 5.0.0 |
| | Merqury | 2020-01-29 |
| Contamination screening | ||
| Local alignment tool | BLAST+ (-db nt, -outfmt "6 qseqid staxids bitscore std", -max_target_seqs 1, -max_hsps 1, -evalue 1e-25) | 2.15 |
| General contamination screening | BlobToolKit (HiFi coverage, BUSCO=metazoa, NCBI Taxa ID=2067768) | 2.3.3 |
| Mitochondrial assembly | ||
| Mitochondrial genome assembly | MitoHiFi (-r, -p 90, -o 1, -a animal) Reference: Arion flagellus (NCBI:NC_073101.1) | 2.2 |
| Repeat analysis | ||
| Identification of repeat elements | RepeatModeler (Dfam database (version 3.8)) | 2.0.6 |
| Annotation of repeat elements | RepeatMasker | 4.1.7-p1 |
| Annotation of tandem repeats | TRF | 4.09.1 |
The assemblies were manually curated by iteratively generating and analyzing their corresponding Omni-C contact maps. In general, to generate the contact maps we aligned the Omni-C data with BWA-MEM (Li 2013), identified ligation junctions, and generated Omni-C pairs (Lee et al. 2022) using pairtools (Open2C et al. 2023). Then, we generated multi-resolution Omni-C matrices with cooler (Abdennur and Mirny 2020) and balanced them with hicExplorer (Ramírez et al. 2018). We used HiGlass (Kerpedjiev et al. 2018) and the PretextSuite (https://github.com/wtsi-hpag/PretextView; https://github.com/wtsi-hpag/PretextMap; https://github.com/wtsi-hpag/PretextSnapshot) to visualize the contact maps where we identified misassemblies and misjoins, and finally modified the assemblies using the Rapid Curation pipeline from the Wellcome Trust Sanger Institute, Genome Reference Informatics Team (https://gitlab.com/wtsi-grit/rapid-curation). Some of the remaining gaps (joins generated during scaffolding and/or curation) were closed using the PacBio HiFi reads and YAGCloser (https://github.com/merlyescalona/yagcloser). Finally, we checked for contamination using the BlobToolKit Framework (Challis et al. 2019).
2.6 Genome quality assessment
We generated k-mer counts from the PacBio HiFi reads using meryl (https://github.com/marbl/meryl). The k-mer counts were then used in GenomeScope2.0 (Ranallo-Benavidez et al. 2020) to estimate genome features including genome size, heterozygosity, and repeat content. To obtain general contiguity metrics, we ran QUAST (Gurevich et al. 2013). To evaluate genome quality and functional completeness, we used BUSCO (Manni et al. 2021) with the Metazoa ortholog database (metazoa_odb10), which contains 954 genes, and the Mollusca ortholog database (mollusca_odb10), which contains 5,295 genes. Assessment of base-level accuracy (QV) and k-mer completeness was performed using the previously generated meryl database and merqury (Rhie et al. 2020). We further estimated genome assembly accuracy via BUSCO gene set (metazoa_odb10) frameshift analysis using the pipeline described in Korlach et al. (2017). Measurements of the size of the phased blocks are based on the size of the contigs generated by HiFiasm in HiC mode. We follow the quality metric nomenclature established by Rhie et al. (2021), with the genome quality code x.y.P.Q.C, where x = log10[contig NG50]; y = log10[scaffold NG50]; P = log10[phased block NG50]; Q = Phred base accuracy QV (quality value); C = % genome represented by the first ‘n’ scaffolds, following a karyotype of 2n = 52 estimated as the mode of the number of chromosomes from the closely related species Arion vulgaris (NCBI:GCA_020796225.1; Chen et al. 2022). Quality metrics for the notation were calculated on haplotype 1.
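The contiguity terms of the quality code can be illustrated with a short sketch. It is not the QUAST implementation; the function names are hypothetical, and it assumes that the x, y, and P digits are the integer part of log10 of the corresponding NG50, which is consistent with the 6.7.P6 reported in Table 2.

```python
import math


def nx0(lengths, genome_size=None, fraction=0.5):
    """Return N50, or NG50 when an estimated genome size is supplied.

    N50 uses the assembly length as the denominator; NG50 uses the
    estimated genome size instead (here, the GenomeScope2.0 estimate).
    """
    total = genome_size if genome_size is not None else sum(lengths)
    threshold = total * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    # The assembly is shorter than the requested fraction of the genome.
    return 0


def quality_code_digit(ng50_bp):
    """Integer part of log10(NG50); an assumption about the rounding used."""
    return math.floor(math.log10(ng50_bp))


# Toy lengths (hypothetical) to show the difference between N50 and NG50.
toy = [8, 8, 4, 3, 3, 2, 2, 2]
print(nx0(toy))                  # N50 of the toy set
print(nx0(toy, genome_size=40))  # NG50 with a larger assumed genome size

# Digits for haplotype 1, using the NG50 values reported in Table 2.
print(quality_code_digit(3_670_449),   # contig NG50
      quality_code_digit(94_891_770),  # scaffold NG50
      quality_code_digit(3_863_613))   # phased block NG50
```

Because NG50 is normalized by the estimated genome size rather than the assembly length, it stays comparable between the two haplotypes even though their total lengths differ slightly.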
2.7 Mitochondrial genome
We assembled the mitochondrial genome of A. columbianus from the PacBio HiFi reads using the reference-guided pipeline MitoHiFi (Allio et al. 2020; Uliano-Silva et al. 2023). The mitochondrial sequence of the closely related Arion flagellus (NCBI:NC_073101.1) was used as the starting sequence. After completion of the nuclear genome, we searched for matches of the resulting mitochondrial assembly sequence in the nuclear genome assembly using BLAST+ (Camacho et al. 2009) and filtered out contigs and scaffolds from the nuclear genome with a percentage of sequence identity >99% and size smaller than the mitochondrial assembly sequence.
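The filtering rule described above can be sketched as follows. This is a minimal illustration rather than the pipeline's own code: the function name, the pre-parsed hit tuples, and the contig names are hypothetical, and only the two criteria stated in the text (identity >99% and length below that of the mitochondrial assembly) are applied.

```python
def contigs_to_remove(hits, contig_lengths, mito_length, min_identity=99.0):
    """Flag nuclear contigs or scaffolds that look like mitochondrial copies.

    hits: iterable of (contig_id, percent_identity) pairs, assumed to have
          been parsed beforehand from a BLAST+ search of the mitochondrial
          assembly against the nuclear assembly.
    contig_lengths: dict mapping contig_id to sequence length in bp.
    mito_length: length of the mitochondrial assembly in bp.
    """
    flagged = set()
    for contig_id, percent_identity in hits:
        is_similar = percent_identity > min_identity
        is_short = contig_lengths[contig_id] < mito_length
        if is_similar and is_short:
            flagged.add(contig_id)
    return sorted(flagged)


# Hypothetical example; contig names and hit values are made up.
example_hits = [("ctg_a", 99.8), ("ctg_b", 99.9), ("ctg_c", 97.0)]
example_lengths = {"ctg_a": 15_000, "ctg_b": 3_000_000, "ctg_c": 12_000}
print(contigs_to_remove(example_hits, example_lengths, mito_length=22_705))
```

Requiring both conditions keeps long nuclear scaffolds that merely contain a mitochondrial-like insertion, while dropping short sequences that are essentially redundant copies of the mitochondrial genome.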
2.8 Repeat annotation
We used a combination of de novo and known element annotation to identify repetitive regions on both assemblies of A. columbianus and other Stylommatophora. We performed a de novo identification of repetitive elements using the program RepeatModeler2 (Flynn et al. 2020). The resulting repeat libraries were each merged with the mollusca-specific subset of the Dfam database and used to annotate repetitive regions using RepeatMasker (Smit et al. 2013-2015; Storer et al. 2021). Tandem repeats were identified using TRF and the intersection of the two analyses was taken as the total repeat content (Benson 1999).
3. Results
3.1 Sequencing results
The Omni-C and PacBio HiFi sequencing libraries generated 152.5 million read pairs and 6.46 million reads, respectively. The latter yielded 32-fold coverage (N50 read length 12,941 bp; minimum read length 89 bp; mean read length 11,390 bp; maximum read length 65,721 bp) based on the GenomeScope2.0 genome size estimate of 2.29 Gb (Supplementary Fig. S1). Based on PacBio HiFi reads, we estimated a sequencing error rate of 0.112% and a nucleotide heterozygosity rate of 0.544% (Fig. 2A). The k-mer spectrum shows a bimodal distribution with two major peaks at 15- and 32-fold coverage (Fig. 2A).
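As a quick arithmetic check of the reported coverage, fold coverage can be approximated as total sequenced bases divided by estimated genome size. The sketch below uses the rounded read count and mean read length given above, so it only approximates the 32.11X listed in Table 2; the function name is illustrative.

```python
def fold_coverage(n_reads, mean_read_length_bp, genome_size_bp):
    """Approximate fold coverage as total sequenced bases / genome size."""
    return n_reads * mean_read_length_bp / genome_size_bp


# Rounded values from the text: 6.46 million HiFi reads, mean length
# 11,390 bp, and a GenomeScope2.0 genome size estimate of 2.29 Gb.
coverage = fold_coverage(6.46e6, 11_390, 2.29e9)
print(f"{coverage:.1f}x")
```

With these rounded inputs the result is about 32-fold, in line with the value reported in the text.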
Fig. 2.
Visual overview of genome assembly metrics. (A) K-mer spectrum output generated from PacBio HiFi data without adapters using GenomeScope2.0. (B) BlobToolKit Snail plot showing a graphical representation of the quality metrics presented in Table 2 for the Ariolimax columbianus primary assembly (xgAriColu1.0.hap1). The plot circle represents the full size of the assembly. From the inside-out, the central plot covers length-related metrics. The line at 180M represents the size of the longest scaffold; all other scaffolds are arranged in size-order moving clockwise around the plot and drawn in gray starting from the outside of the central plot. Dark and light orange arcs show the scaffold N50 and scaffold N90 values. The central light gray spiral shows the cumulative scaffold count with a white line at each order of magnitude. White regions in this area reflect the proportion of Ns in the assembly. The dark versus light blue area around it shows mean, maximum and minimum GC versus AT content at 0.1% intervals (Challis et al. 2019). Omni-C Contact maps for the primary (C) and alternate (D) genome assembly generated with PretextSnapshot. Omni-C contact maps translate proximity of genomic regions in 3D space to contiguous linear organization. Each cell in the contact map corresponds to sequencing data supporting the linkage (or join) between two of such regions.
3.2 Nuclear genome assembly
The final assembly (xgAriColu1) consists of two phased haplotypes that are similar in size to the estimated value from GenomeScope2.0 (Fig. 2A). Haplotype one consists of 401 scaffolds spanning 2.29 Gb with contig N50 of 3.67 Mb, scaffold N50 of 94.89 Mb, largest contig of 19.35 Mb and largest scaffold of 180.02 Mb. Haplotype two consists of 581 scaffolds spanning 2.28 Gb with contig N50 of 3.74 Mb, scaffold N50 of 93.6 Mb, largest contig of 22.21 Mb and largest scaffold of 176.61 Mb.
During manual curation, we generated a total of 607 joins and 265 breaks, where haplotype one had 284 joins and 117 breaks; and haplotype two had 323 joins and 148 breaks. In the gap closing step, we were able to close a total of 168 gaps, 83 on haplotype one and 85 on haplotype two. Finally, we filtered out 597 contigs (354 on haplotype one and 242 on haplotype two) not corresponding to the Phylum Gastropoda (Supplementary Table S1) and 5 contigs (3 on haplotype one and 2 on haplotype 2) corresponding to mitochondrial contamination. No further contigs were removed.
Haplotype one has a BUSCO completeness score of 93.9% using the Metazoa gene set, a per-base quality (QV) of 60.61, a k-mer completeness of 90.19, and a frameshift indel QV of 51.44. Haplotype two has a BUSCO completeness score of 92.4% using the same gene set, a per-base quality (QV) of 60.62, a k-mer completeness of 90.07, and a frameshift indel QV of 50.79.
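For readers less familiar with Phred-scaled values, the QV figures above can be converted to per-base error probabilities with the standard relation QV = -10 log10(p). The sketch below shows that conversion only; it does not reproduce how merqury derives QV from k-mers, and the function names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# Per-base and frameshift indel QV reported for haplotype one.
for label, qv in [("base QV", 60.61), ("indel QV", 51.44)]:
    print(f"{label} {qv}: error rate {qv_to_error_rate(qv):.2e}")
```

A QV of 60 corresponds to one expected error per million bases, so the per-base QV values of both haplotypes imply slightly fewer than one error per megabase.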
Assembly statistics are reported in Table 2, and a graphical representation of the assembly of haplotype one is in Fig. 2B. The Omni-C contact maps show that both assemblies are highly contiguous, and suggest that the A. columbianus genome is organized in 25 chromosomes based on the number of major bins along the diagonal axis of the plot (Fig. 2C and D). We have deposited scaffolds corresponding to both haplotypes to GenBank (see Table 2 and Data availability for details).
Table 2.
Sequencing and assembly statistics, and accession numbers.
| Bio Projects & Vouchers | CCGP NCBI BioProject | PRJNA720569 | |||||
|---|---|---|---|---|---|---|---|
| Genera NCBI BioProject | PRJNA766268 | ||||||
| Species NCBI BioProject | PRJNA766268 | ||||||
| NCBI BioSample | SAMN36908962 | ||||||
| Specimen identification | 1044C | ||||||
| NCBI Genome accessions | Haplotype 1 | Haplotype 2 | |||||
| Assembly accession | JAVGWX000000000 | JAVGWY000000000 | |||||
| Genome sequences | GCA_036924085.1 | GCA_036924075.1 | |||||
| Genome sequence | PacBio HiFi reads | Run | 1 PACBIO_SMRT (Sequel IIe) run: 6.5M spots, 73.7G bases, 41.6G bytes |||||
| Accession | SRX23901772 | ||||||
| Omni-C Illumina reads | Run | 2 ILLUMINA (Illumina NovaSeq 6000) runs: 152.5M spots, 46.1G bases, 15G bytes ||||||
| Accession | SRX23901773, SRX23901774 | ||||||
| Genome assembly quality metrics | Assembly identifier (Quality codea) | xgAriColu1(6.7.P6.Q.C97) | |||||
| HiFi Read coverageb | 32.11X | ||||||
| Haplotype 1 | Haplotype 2 | ||||||
| Number of contigs | 1,382 | 1,530 | |||||
| Contig N50 (bp) | 3,670,449 | 3,749,547 | |||||
| Contig NG50b | 3,670,449 | 3,700,100 | |||||
| Longest contigs | 19,357,573 | 22,212,580 | |||||
| Number of scaffolds | 401 | 581 | |||||
| Scaffold N50 | 94,891,770 | 93,600,053 | |||||
| Scaffold NG50b | 94,891,770 | 93,600,053 | |||||
| Largest scaffold | 180,022,102 | 176,612,698 | |||||
| Size of final assembly | 2,294,334,993 | 2,282,270,284 | |||||
| Phased block NG50b | 3,863,613 | 3,797,846 | |||||
| Gaps per Gbp (# Gaps) | 428(981) | 416(949) | |||||
| Indel QV (Frame shift) | 51.44 | 50.79 | |||||
| Base pair QV | 60.61 | 60.62 | |||||
| Full assembly = 60.61 | |||||||
| k-mer completeness | 90.19 | 90.07 | |||||
| Full assembly = 97.19 | |||||||
| BUSCO completeness (metazoa) n = 954 | Cc | Sc | Dc | Fc | Mc | ||
| H1d | 93.90% | 86.40% | 7.50% | 2.10% | 4.00% | ||
| H2d | 92.40% | 84.20% | 8.20% | 2.10% | 5.50% | ||
| BUSCO completeness (mollusca) n = 5295 | Cc | Sc | Dc | Fc | Mc | ||
| H1d | 87.80% | 70.10% | 17.70% | 2.40% | 9.80% | ||
| H2d | 87.50% | 70.00% | 17.50% | 2.10% | 10.40% | ||
| Organelles | Mitochondrial sequence | CM072560.1 | |||||
aAssembly quality code x.y.P.Q.C derived notation, from Rhie et al. (2021). x = log10[contig NG50]; y = log10[scaffold NG50]; P = log10[phased block NG50]; Q = Phred base accuracy QV (quality value); C = % genome represented by the first “n” scaffolds, following a karyotype of 2n = 52 estimated as the mode of the number of chromosomes from the closely related species Arion vulgaris (NCBI:GCA_020796225.1; Chen et al. 2022). The quality code for the whole assembly is given for the primary assembly (xgAriColu1.0.hap1).
bRead coverage and NGx statistics have been calculated based on the estimated genome size of 2.29 Gb.
cBUSCO Scores. Complete BUSCOs (C). Complete and single-copy BUSCOs (S). Complete and duplicated BUSCOs (D). Fragmented BUSCOs (F). Missing BUSCOs (M).
dH1: Haplotype 1 and H2: Haplotype 2 assembly values.
3.3 Mitochondrial genome assembly
We assembled a mitochondrial genome with MitoHiFi. The final mitochondrial genome is circular and spans 22,705 bp. The base composition of the final assembly version is A = 35.64%, C = 10.94%, G = 11.63%, T = 41.78%, and consists of 22 unique transfer RNAs, 10 protein-coding genes, and 1 rRNA. We have deposited the mitochondrial genome to GenBank (see Table 2 and Data availability for details).
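The base composition reported above can be computed from any sequence string with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the example sequence is a toy string rather than the deposited mitochondrial genome, and ambiguous bases are simply ignored.

```python
from collections import Counter


def base_composition(sequence):
    """Return the percentage of A, C, G, and T among unambiguous bases."""
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return {base: 100 * counts[base] / total for base in "ACGT"}


def gc_percent(composition):
    """GC content is the sum of the G and C percentages."""
    return composition["G"] + composition["C"]


# Toy sequence (hypothetical), not the A. columbianus mitogenome.
print(base_composition("AATTTACGTTAA"))

# GC content implied by the percentages reported in the text.
reported = {"A": 35.64, "C": 10.94, "G": 11.63, "T": 41.78}
print(f"{gc_percent(reported):.2f}% GC")
```

Applied to the reported percentages, C and G together account for 22.57% of the mitochondrial genome, which makes the strong AT bias of the sequence explicit.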
3.4 Repeat annotation
The repeat libraries generated using RepeatModeler2 for both assemblies of A. columbianus contain 2044 and 2093 repetitive elements for haplotype 1 and haplotype 2, respectively (Flynn et al. 2020). The repeat content estimate using RepeatMasker and TRF is 77.70% and 77.14% for haplotype 1 and haplotype 2, respectively (Supplementary Table S3).
4. Discussion
To be consistent with the CCGP data analysis pipeline and to ensure that our dataset is comparable with other species from CCGP, we generated a new genome assembly using a hybrid assembly that combines PacBio long-read data with Omni-C chromatin conformation data used for scaffolding. Compared with the previous assembly of A. columbianus (NCBI:GCA_032357075.1), the reference genome is significantly larger (2.29 Gb vs 1.54 Gb) and more contiguous, being composed of a considerably smaller number of scaffolds (401 vs 535,985) and smaller L50 (9 vs 114,321) highlighting the necessity of a dedicated effort to generate a high-quality reference genome. The slight size difference between haplotype 1 and haplotype 2 may be attributed to unresolved repetitive sequences in haplotype 2, which is further supported by its higher unknown repeat content and lower overall repeat content (Supplementary Table S3).
The primary assembly, haplotype 1, with BUSCO complete score of 93.9% suggests a high-quality assembly comparable to other CCGP assemblies and is more complete than a majority of other related species with at least scaffold-level assemblies (Supplementary Table S2). The genome size of A. columbianus is considerably larger than another species within the Arionidae family, Arion vulgaris (1.54 Gb, Supplementary Table S2), but within the range seen within the Stylommatophora order (1.36 to 5.40 Gb; Supplementary Table S2). The heterozygosity and repeat content of A. columbianus are within the range observed across related species (0.36% to 1.55% and 65.75% to 85.75%, respectively; Supplementary Tables S2 and S3), suggesting a high-quality genomic resource comparable to other Stylommatophora genomes.
Within the Arionidae family, A. columbianus has a lower heterozygosity and a higher content of transposable elements than Arion vulgaris (Supplementary Tables S2 and S3), which are suggestive of a smaller historical population size (De Kort et al. 2022). Additionally, there is a notable difference in GC content between A. columbianus (41.5%) and Arion vulgaris (38.5%). The mitochondrial genome of A. columbianus is significantly larger (22.71 kb) than the reference used to scaffold its assembly, Arion flagellus (14.29 kb, NCBI:NC_073101.1). With only the requisite 22 tRNAs, 10 protein-coding genes, and 1 rRNA, the increased size of A. columbianus is explained by expansion of non-coding regions which are frequent in molluscan mitogenomes (Ghiselli et al. 2021; Malkócs et al. 2022; Davison et al. 2024) and known to exhibit exceptional variation in size, structure, and content. Together the differences in heterozygosity, transposable element content, GC content, and mitochondrial genome size suggest A. columbianus has undergone a distinct evolutionary and demographic process from Arion.
This genome assembly will serve as an important resource for ongoing investigations using whole-genome resequencing efforts for populations spanning California as part of the CCGP goals (Beninde et al. 2022; Fiedler et al. 2022; Shaffer et al. 2022; Toffelmier et al. 2022). It will contribute not only to our understanding of connectivity within the genus Ariolimax but also, through its association with the broader landscape genomics project, to insights into how genetic diversity and gene flow are influenced by environmental factors and human activities across California’s diverse ecosystems.
Supplementary material
Supplementary material can be found at http://www.jhered.oxfordjournals.org/.
Acknowledgments
We thank the staff at the UC Santa Cruz Paleogenomics Laboratory and the UC Davis DNA Technologies and Expression Analysis Cores for the generation of the high-quality sequence data used for this assembly. We would also like to thank Brad Shaffer, Victoria Sork, and Erin Toffelmier of the CCGP leadership team. PacBio Sequel II/IIe library prep and sequencing was carried out at the DNA Technologies and Expression Analysis Core at the UC Davis Genome Center, supported by NIH Shared Instrumentation Grant 1S10OD010786-01. Deep sequencing of Omni-C libraries used the Novaseq S4 sequencing platforms at the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley, supported by NIH S10 OD018174 Instrumentation Grant.
Contributor Information
Maximilian Genetti, Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, United States; Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, United States.
Merly Escalona, Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, United States.
Cade Mirchandani, Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, United States; Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, United States.
Jonas Oppenheimer, Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, United States.
Eric Beraut, Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, United States.
Samuel Sacco, Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, United States.
William Seligmann, Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, United States.
Colin W Fairbairn, Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, United States.
Ruta Sahasrabudhe, DNA Technologies and Expression Analysis Core Laboratory, Genome Center, University of California-Davis, Davis, CA 95616, United States.
Mohan P A Marimuthu, DNA Technologies and Expression Analysis Core Laboratory, Genome Center, University of California-Davis, Davis, CA 95616, United States.
Oanh Nguyen, DNA Technologies and Expression Analysis Core Laboratory, Genome Center, University of California-Davis, Davis, CA 95616, United States.
Noravit Chumchim, DNA Technologies and Expression Analysis Core Laboratory, Genome Center, University of California-Davis, Davis, CA 95616, United States.
Russell Corbett-Detig, Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, United States; Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, United States.
Author contributions
Maximilian Genetti (Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Visualization, Writing - original draft), Merly Escalona (Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing - original draft), Cade Mirchandani (Writing - original draft), Jonas Oppenheimer (Writing - original draft), Eric Beraut (Investigation), Samuel Sacco (Investigation), William Seligmann (Investigation), Colin W. Fairbairn (Investigation), Ruta Sahasrabudhe (Investigation), Mohan Marimuthu (Investigation), Oanh Nguyen (Investigation), Noravit Chumchim (Investigation), and Russell Corbett-Detig (Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing - original draft)
Funding
This work was supported by the California Conservation Genomics Project, with funding provided to the University of California by the State of California, State Budget Act of 2019 (UC Award ID RSI-19-690224).
Conflict of interest statement. None declared.
Data availability
Data generated for this study are available under NCBI BioProject PRJNA986204. Raw sequencing data for sample 1044C (NCBI BioSample SAMN36908962) are deposited in the NCBI Sequence Read Archive under SRX23901772 for the PacBio HiFi sequencing data and SRX23901773-74 for the Omni-C Illumina sequencing data. GenBank accessions for the primary and alternate assemblies are GCA_036924085.1 and GCA_036924075.1, and those for the genome sequences are JAVGWX000000000 and JAVGWY000000000. The GenBank organelle genome assembly for the mitochondrial genome is CM072560.1. Assembly scripts and other data for the analyses presented can be found at the following GitHub repository: www.github.com/ccgproject/ccgp_assembly
References
- Abdennur N, Mirny LA. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. 2020:36:311–316. https://doi.org/10.1093/bioinformatics/btz540
- Allio R, Schomaker‐Bastos A, Romiguier J, Prosdocimi F, Nabholz B, Delsuc F. MitoFinder: efficient automated large‐scale extraction of mitogenomic data in target enrichment phylogenomics. Mol Ecol Resour. 2020:20:892–905. https://doi.org/10.1111/1755-0998.13160
- Beninde J, Toffelmier E, Shaffer HB. A brief history of population genetic research in California and an evaluation of its utility for conservation decision-making. J Hered. 2022:113:604–614. https://doi.org/10.1093/jhered/esac049
- Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999:27:573–580. https://doi.org/10.1093/nar/27.2.573
- Brune MA. An analysis on the population genetic structure of Ariolimax columbianus around Corvallis, Oregon. [Honors Baccalaureate of Science in Biochemistry and Molecular Biology, Oregon State University]. ScholarsArchive@OSU. 2022. https://ir.library.oregonstate.edu/concern/honors_college_theses/9p290j038
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: Architecture and applications. BMC Bioinf. 2009:10:421. https://doi.org/10.1186/1471-2105-10-421
- Challis R, Richards E, Rajan J, Cochrane G, Blaxter M. BlobToolKit – interactive quality assessment of genome assemblies. 2019. https://doi.org/10.1101/844852
- Chen Z, Doğan O, Guiglielmoni N, Guichard A, Schrödl M. Pulmonate slug evolution is reflected in the de novo genome of Arion vulgaris Moquin-Tandon, 1855. Sci Rep. 2022:12:14226. https://doi.org/10.1038/s41598-022-18099-7
- Cheng H, Jarvis ED, Fedrigo O, Koepfli K-P, Urban L, Gemmell NJ, Li H. Haplotype-resolved assembly of diploid genomes without parental data. Nat Biotechnol. 2022:40:1332–1335. https://doi.org/10.1038/s41587-022-01261-x
- Chueca LJ, Gómez-Moliner BJ, Madeira MJ, Pfenninger M. Molecular phylogeny of Candidula (Geomitridae) land snails inferred from mitochondrial and nuclear markers reveals the polyphyly of the genus. Mol Phylogenet Evol. 2018:118:357–368. https://doi.org/10.1016/j.ympev.2017.10.022
- Davison A, Chowdhury M, Johansen M, Uliano-Silva M, Blaxter M; Wellcome Sanger Institute Tree of Life programme. High heteroplasmy is associated with low mitochondrial copy number and selection against non-synonymous mutations in the snail Cepaea nemoralis. BMC Genomics. 2024:25:596. https://doi.org/10.1186/s12864-024-10505-w
- De Kort H, Legrand S, Honnay O, Buckley J. Transposable elements maintain genome-wide heterozygosity in inbred populations. Nat Commun. 2022:13:7022. https://doi.org/10.1038/s41467-022-34795-4
- Edsinger E, Kieras M, Pirro S. The genome sequences of 118 taxonomically diverse eukaryotes of the Salish sea. Biodiv Genom. 2024:2024:1–4. https://doi.org/10.56179/001c.118307
- Fiedler PL, Erickson B, Esgro M, Gold M, Hull JM, Norris JM, Shapiro B, Westphal M, Toffelmier E, Shaffer HB. Seizing the moment: the opportunity and relevance of the California conservation genomics project to state and federal conservation policy. J Hered. 2022:113:589–596. https://doi.org/10.1093/jhered/esac046
- Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci USA. 2020:117:9451–9457. https://doi.org/10.1073/pnas.1921046117
- Fontanilla IK, Naggs F, Wade CM. Molecular phylogeny of the Achatinoidea (Mollusca: Gastropoda). Mol Phylogenet Evol. 2017:114:382–385. https://doi.org/10.1016/j.ympev.2017.06.014
- Open2C, Abdennur N, Fudenberg G, Flyamer IM, Galitsyna AA, Goloborodko A, Imakaev M, Venev SV. Pairtools: From sequencing data to chromosome contacts. 2023. https://doi.org/10.1101/2023.02.13.528389
- Ghiselli F, Gomes-dos-Santos A, Adema CM, Lopes-Lima M, Sharbrough J, Boore JL. Molluscan mitochondrial genomes break the rules. Philos Trans R Soc London Ser B. 2021:376:20200159. https://doi.org/10.1098/rstb.2020.0159
- Ghurye J, Pop M, Koren S, Bickhart D, Chin C-S. Scaffolding of long read assemblies using long range contact information. BMC Genomics. 2017:18:527. https://doi.org/10.1186/s12864-017-3879-z
- Ghurye J, Rhie A, Walenz BP, Schmitt A, Selvaraj S, Pop M, Phillippy AM, Koren S. Integrating Hi-C links with assembly graphs for chromosome-scale assembly. PLoS Comput Biol. 2019:15:e1007273. https://doi.org/10.1371/journal.pcbi.1007273
- Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: Quality assessment tool for genome assemblies. Bioinformatics. 2013:29:1072–1075. https://doi.org/10.1093/bioinformatics/btt086
- Harper AB. The banana slug: a close look at a giant forest slug of western North America. Bay Leaves Press; 1988.
- Kerpedjiev P, Abdennur N, Lekschas F, McCallum C, Dinkla K, Strobelt H, Luber JM, Ouellette SB, Azhir A, Kumar N, et al. HiGlass: Web-based visual exploration and analysis of genome interaction maps. Genome Biol. 2018:19:125. https://doi.org/10.1186/s13059-018-1486-1
- Korlach J, Gedman G, Kingan SB, Chin C-S, Howard JT, Audet J-N, Cantin L, Jarvis ED. De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads. GigaScience. 2017:6:1–16. https://doi.org/10.1093/gigascience/gix085
- Lee S, Bakker CR, Vitzthum C, Alver BH, Park PJ. Pairs and Pairix: a file format and a tool for efficient storage and retrieval for Hi-C read pairs. Bioinformatics. 2022:38:1729–1731. https://doi.org/10.1093/bioinformatics/btab870
- Leonard JL, Pearse JSP, Elejalda A, Helsen S, Natalie Van Houtte B, Karin, Jordaens K, Backeljau T. Phylogeography and Rapid Evolution in Banana Slugs (Arionidae: Ariolimax spp.) (Annual Report for 2011, pp. 218–220). The Western Society of Malacologists; 2013.
- Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (arXiv:1303.3997). arXiv; 2013. http://arxiv.org/abs/1303.3997
- Malkócs T, Viricel A, Becquet V, Evin L, Dubillot E, Pante E. Complex mitogenomic rearrangements within the Pectinidae (Mollusca: Bivalvia). BMC Ecol Evol. 2022:22:29. https://doi.org/10.1186/s12862-022-01976-0
- Manni M, Berkeley MR, Seppey M, Simao FA, Zdobnov EM. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes (arXiv:2106.11799). arXiv; 2021. http://arxiv.org/abs/2106.11799
- Mead AR. Revision of the giant west coast land slugs of the genus Ariolimax moerch (Pulmonata: Arionidae). Am Midl Nat. 1943:30:675. https://doi.org/10.2307/2421208
- Mörch OAL. Beiträge zur Molluskenfauna Central-Amerika’s. Malakozoologische Blätter 1859;6:102–126.
- Nantarat N, Tongkerd P, Sutcharit C, Wade CM, Naggs F, Panha S. Phylogenetic relationships of the operculate land snail genus Cyclophorus Montfort, 1810 in Thailand. Mol Phylogenet Evol. 2014:70:99–111. https://doi.org/10.1016/j.ympev.2013.09.013
- Neiber MT, Hausdorf B. Molecular phylogeny reveals the polyphyly of the snail genus Cepaea (Gastropoda: Helicidae). Mol Phylogenet Evol. 2015:93:143–149. https://doi.org/10.1016/j.ympev.2015.07.022
- Pearson AK, Pearson OP, Ralph PL. Growth and activity patterns in a backyard population of the banana slug, Ariolimax columbianus. The Veliger 2006:48:143–150.
- Ramírez F, Bhardwaj V, Arrigoni L, Lam KC, Grüning BA, Villaveces J, Habermann B, Akhtar A, Manke T. High-resolution TADs reveal DNA sequences underlying genome organization in flies. Nat Commun. 2018:9:189. https://doi.org/10.1038/s41467-017-02525-w
- Ranallo-Benavidez TR, Jaron KS, Schatz MC. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun. 2020:11:1432. https://doi.org/10.1038/s41467-020-14998-3
- Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S, Uliano-Silva M, Chow W, Fungtammasan A, Kim J, et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature. 2021:592:737–746. https://doi.org/10.1038/s41586-021-03451-0
- Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020:21:245. https://doi.org/10.1186/s13059-020-02134-9
- Shaffer HB, Toffelmier E, Corbett-Detig RB, Escalona M, Erickson B, Fiedler P, Gold M, Harrigan RJ, Hodges S, Luckau TK, et al. Landscape genomics to enable conservation actions: the California conservation genomics project. J Hered. 2022:113:577–588. https://doi.org/10.1093/jhered/esac020
- Sim SB, Corpuz RL, Simmonds TJ, Geib SM. HiFiAdapterFilt, a memory efficient read processing pipeline, prevents occurrence of adapter sequence in PacBio HiFi reads and their negative impacts on genome assembly. BMC Genomics. 2022:23:157. https://doi.org/10.1186/s12864-022-08375-1
- Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. 2013–2015. [accessed 30 Oct 2024]. http://www.repeatmasker.org.
- Storer J, Hubley R, Rosen J, Wheeler TJ, Smit AF. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mobile DNA 2021:12:2. https://doi.org/10.1186/s13100-020-00230-y
- Toffelmier E, Beninde J, Shaffer HB. The phylogeny of California, and how it informs setting multispecies conservation priorities. J Hered. 2022:113:597–603. https://doi.org/10.1093/jhered/esac045
- Uliano-Silva M, Ferreira JGRN, Krasheninnikova K, Blaxter M, Mieszkowska N, Hall N, Holland P, Durbin R, Richards T, Kersey P, et al.; Darwin Tree of Life Consortium. MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinf. 2023:24:288. https://doi.org/10.1186/s12859-023-05385-y
- Von Proschwitz T, Reise H, Schlitt B, Breugelmans K. Records of the slugs Ariolimax columbianus (Ariolimacidae) and Prophysaon foliolatum (Arionidae) imported into Sweden. Folia Malacologica 2017:25:267–271. https://doi.org/10.12657/folmal.025.023