Journal of Heredity. 2025 May 5;116(6):826–834. doi: 10.1093/jhered/esaf026

A haplotype-resolved genome assembly of the bocaccio rockfish, Sebastes paucispinis

Rishi De-Kayne 1,a, Stacy Li 2,a, Merly Escalona 3,a, Runyang Nicolas Lou 4, Juan Manuel Vazquez 5, Gregory L Owens 6, Sree Rohit Raj Kolora 7, Conner Jainese 8, Katelin Seeto 9, Merit McCrea 10, Oanh Nguyen 11, Noravit Chumchim 12, Ruta Sahasrabudhe 13, Colin W Fairbairn 14, Richard E Green 15, William E Seligmann 16, Milton Love 17, Peter H Sudmant 18
Editor: Elizabeth Alter
PMCID: PMC12584591  PMID: 40323688

Abstract

Rockfishes (genus Sebastes) are one of the most diverse clades amongst teleosts (ray-finned fishes). The genus includes more than 110 species, which are distributed broadly across the North Pacific Ocean, North and South Atlantic Ocean, and Southeastern Pacific Ocean. Rockfishes exhibit particularly high diversity along the western coast of the United States, where their abundance plays a critical role in local marine ecosystems and fisheries. Sebastes paucispinis (“bocaccio”) is a rockfish species most commonly found off the coast of California. In 2005, bocaccio were federally declared overfished following massive depletion by commercial and recreational fisheries from the 1980s to early 2000s. Implementation of significant restrictions has bolstered recovery of critical rockfish populations along the California and Oregon coasts, but the impact of anthropogenic stressors on bocaccio, and other Sebastes species, has yet to be fully evaluated. Here, we present the first de novo reference-quality genome assembly of Sebastes paucispinis, as part of the California Conservation Genomics Project.

Keywords: bocaccio, California Conservation Genomics Project, CCGP, rockfish, Sebastes

Introduction

Sebastes paucispinis, also known as the bocaccio rockfish, is a critically endangered (IUCN v.2.3) member of the diverse genus of rockfishes, Sebastes, of which there are more than 110 species. Bocaccio are named for both their near absence of head spines (paucispinis meaning “few spines” in Latin) and large mouth (bocaccio meaning “big mouth” or “ugly mouth” in Italian; Love et al. 2002). Bocaccio have been described as being a variety of colors, ranging from olive-brown or silvery grey to red, pink, and even orange (Orr et al. 1998; Love et al. 2002; Fig. 1). In addition to this variable main body coloration, bocaccio may have patches of white as well as dark (cancerous) melanistic blotches that typically present in older, larger individuals (Orr et al. 1998; Love et al. 2002). Bocaccio are a long-lived species, with an estimated maximum lifespan of 70 years (Department of Fisheries and Oceans Canada, pers. comm. to M.L.), and are geographically distributed along the northeastern Pacific Ocean, spanning the western reaches of North America from Alaska to central Baja California (Love et al. 2021). Juveniles are found near the surface and in inshore waters, whereas adults are found at depths between 20 and 475 m. Bocaccio are found primarily over complex structures such as rocky reefs and oil platforms (where they form substantial aggregations; Love et al. 2002, 2005, 2009). They are most abundant off British Columbia and from northern California to at least northern Baja California (Love and Passarelli 2020), where they play a critical ecological role, representing a key species in marine food webs due to their broadly piscivorous diet.

Fig. 1.

A) images of Sebastes paucispinis (image credit SWFSC ROV Team) illustrating bocaccio color variation, and B) the read length histogram of HiFi reads used for the assembly, where the red line represents the mean read length of 13,283 bp.

Bocaccio are widely enjoyed for human consumption and as such represent an economically important species for both recreational and commercial fisheries (He and Field 2017). However, overfishing has driven a significant modern decline in historically robust populations along the Pacific coast, especially in the Puget Sound/Georgia Basin regions where bocaccio are now protected under the Endangered Species Act (Williams et al. 2010; https://www.fisheries.noaa.gov/species/bocaccio-protected). Populations were estimated to have declined by ~96% to 98% in 2005 compared to historic levels and were federally declared overfished, though the population has increased in recent years and the fishery reopened in 2018 (He and Field 2017). Several life history traits, namely, late-onset maturity and relative longevity, have made bocaccio particularly vulnerable to overfishing (Love et al. 2002). Although a number of genetic studies have aimed to assess the impacts, consequences, and future outlook for bocaccio populations, the lack of species-specific high-quality genetic resources has hindered the use of state-of-the-art conservation genomics approaches (Matala et al. 2004; Drake et al. 2010; Buonaccorsi et al. 2012). As a result, the genetic consequences of overfishing in the years preceding and following federal restrictions on fishing activities have yet to be comprehensively quantified.

As part of the California Conservation Genomics Project (CCGP) consortium (Shaffer et al. 2022), we present a complete Sebastes paucispinis reference-quality genome to aid in bocaccio conservation efforts, supporting the CCGP’s broader efforts in creating high-quality genomic resources for studying and monitoring species diversity.

Methods

Biological materials

A single Sebastes paucispinis specimen was collected in May 2019 from the Footprint State Marine Reserve, situated between the Anacapa and Santa Cruz islands west of the southern California coast (in accordance with NOAA permit LOA-3-2019). The specimen was euthanized, immediately dissected on ice, and the liver tissue was removed and flash-frozen.

High molecular weight gDNA isolation

High molecular weight (HMW) genomic DNA (gDNA) was extracted from 40 mg of frozen liver tissue (Voucher number—SEB-726) using the Nanobind Tissue Big DNA kit (Pacific BioSciences—PacBio, Menlo Park, CA) following the manufacturer’s instructions. Purity of the gDNA was assessed using a NanoDrop ND-1000 spectrophotometer, where a 260/280 ratio of 1.83 and a 260/230 ratio of 2.11 were observed. DNA yield was 16.6 µg as quantified by a Qubit 2.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA). Integrity of the HMW gDNA was verified on a Femto Pulse system (Agilent Technologies, Santa Clara, CA), where 85% of the DNA was observed in fragments above 100 kb.

HiFi library preparation and sequencing

We constructed a HiFi SMRTbell library using the SMRTbell Express Template Prep Kit v2.0 (PacBio, Cat. #100-938-900) following the manufacturer’s instructions. HMW gDNA was sheared to a target DNA size distribution between 15 kb and 18 kb using Diagenode’s Megaruptor 3 system (Diagenode, Belgium; cat. B06010003). The sheared gDNA was concentrated using 0.45× AMPure PB beads (PacBio, Cat. #100-265-900) for the removal of single-strand overhangs at 37 °C for 15 min, followed by further enzymatic steps of DNA damage repair at 37 °C for 30 min, end repair and A-tailing at 20 °C for 10 min and 65 °C for 30 min, and ligation of overhang adapters v3 at 20 °C for 60 min. The SMRTbell library was purified and concentrated with 1× AMPure PB beads for nuclease treatment at 37 °C for 30 min and size selection using the PippinHT system (Sage Science, Beverly, MA; Cat #HPE7510) to collect fragments greater than 7 kb to 9 kb. The 15 kb to 20 kb average HiFi SMRTbell library was sequenced at UC Davis DNA Technologies Core (Davis, CA) using two 8M SMRT cells on the Sequel IIe with sequencing chemistry 2.0 and 30-h movies.

Omni-C library preparation and sequencing

We prepared an Omni-C library using a Dovetail Omni-C Kit (Dovetail Genomics; Scotts Valley, CA) according to the manufacturer’s protocol with slight modifications. First, a second sample of S. paucispinis frozen liver tissue from the same individual was thoroughly ground with a mortar and pestle under liquid nitrogen. Subsequently, chromatin was fixed in place in the nucleus. The suspended chromatin solution was then passed through 100 and 40 μm cell strainers to remove large debris. Fixed chromatin was digested under various conditions of DNase I until a suitable fragment length distribution of DNA molecules was obtained. Chromatin ends were repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter-containing ends. After proximity ligation, crosslinks were reversed and the DNA was purified from proteins. Purified DNA was treated to remove biotin that was not internal to ligated fragments. An NGS library was generated using an NEB Ultra II DNA Library Prep kit (New England Biolabs [NEB], Ipswich, MA) with an Illumina-compatible y-adaptor. Biotin-containing fragments were then captured using streptavidin beads. The post-capture product was split into two replicates prior to PCR enrichment to preserve library complexity, with each replicate receiving unique dual indices. The library was sequenced at Vincent J. Coates Genomics Sequencing Lab (Berkeley, CA) on an Illumina NovaSeq 6000 platform (Illumina, San Diego, CA) to generate approximately 100 million 2 × 150 bp paired-end reads per gigabase of genome size.

Nuclear genome assembly

We assembled the genome of a Sebastes paucispinis individual following the CCGP assembly pipeline version 5.0 (www.github.com/ccgproject/ccgp_assembly), as outlined in Table 1, which lists the tools and non-default parameters used in the assembly process. We removed remnant adapter sequences from the PacBio HiFi dataset using HiFiAdapterFilt (Sim et al. 2022) and generated an initial diploid phased assembly using HiFiasm (Cheng et al. 2021) in Hi-C mode with the filtered PacBio HiFi reads and the Omni-C short reads, a process that generates two assemblies, one per haplotype. We then aligned the Omni-C data to both assemblies following the Arima Genomics Mapping Pipeline (https://github.com/ArimaGenomics/mapping_pipeline) and then scaffolded both assemblies with SALSA (Ghurye et al. 2017, 2019).
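
The initial assembly step can be sketched as a command builder. This is a minimal illustration, not the CCGP pipeline itself: the file names, output prefix, and thread count are hypothetical, and the command is only constructed, never run. The `--primary`, `--h1`, `--h2`, `-o`, and `-t` options are standard hifiasm options; the expected output names follow the `hic.hap1.p_ctg` and `hic.hap2.p_ctg` outputs listed in Table 1, with the `.gfa` extension assumed.

```python
from pathlib import Path


def hifiasm_hic_command(prefix, hifi_reads, omnic_r1, omnic_r2, threads=32):
    """Build (but do not run) a hifiasm command line for Hi-C mode.

    All input paths and the thread count are hypothetical placeholders.
    """
    return [
        "hifiasm",
        "--primary",            # non-default option listed in Table 1
        "-o", str(prefix),      # output prefix
        "-t", str(threads),     # number of CPU threads (assumed value)
        "--h1", str(omnic_r1),  # Omni-C read 1
        "--h2", str(omnic_r2),  # Omni-C read 2
        str(hifi_reads),        # adapter-filtered PacBio HiFi reads
    ]


def expected_haplotype_graphs(prefix):
    """Return the two per-haplotype contig graphs expected from Hi-C mode."""
    return [Path(f"{prefix}.hic.hap{i}.p_ctg.gfa") for i in (1, 2)]


cmd = hifiasm_hic_command(
    "fSebPau1", "hifi.filt.fastq.gz", "omnic_R1.fastq.gz", "omnic_R2.fastq.gz"
)
print(" ".join(cmd))
print([p.name for p in expected_haplotype_graphs("fSebPau1")])
```

Building the argument list separately from execution makes it easy to log the exact parameters used for each haplotype-resolved run.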

Table 1.

Assembly pipeline and software used. All software is cited in the text.

| Assembly step | Software and any non-default options | Version | Reference |
| --- | --- | --- | --- |
| Initial assembly | | | |
| Filtering PacBio HiFi adapters | HiFiAdapterFilt | Commit 64d1c7b | Sim et al. 2022 |
| k-mer counting | Meryl (k = 21) | 1 | https://github.com/marbl/meryl |
| Estimation of genome size and heterozygosity | GenomeScope (-l 50) | 2 | Ranallo-Benavidez et al. 2020 |
| De novo assembly (contiging) | HiFiasm (Hi-C Mode, --primary, output hic.hap1.p_ctg, hic.hap2.p_ctg) | 0.19.4-r575 | Cheng et al. 2022 |
| Scaffolding | | | |
| Omni-C data alignment | Arima Genomics Mapping Pipeline | Commit 2e74ea4 | https://github.com/ArimaGenomics/mapping_pipeline |
| Arima Genomics Mapping Pipeline (AGMP) | BWA-MEM | 0.7.17-r1188 | Li 2013 |
| | samtools | 1.11 | Danecek et al. 2021 |
| | filter_five_end.pl (AGMP) | Commit 2e74ea4 | https://github.com/ArimaGenomics/mapping_pipeline |
| | two_read_bam_combiner.pl (AGMP) | Commit 2e74ea4 | https://github.com/ArimaGenomics/mapping_pipeline |
| | picard | 2.27.5 | https://broadinstitute.github.io/picard/ |
| Omni-C Scaffolding | SALSA (-DNASE, -i 20, -p yes) | 2 | Ghurye et al. 2017, Ghurye et al. 2019 |
| Omni-C contact map generation | | | |
| Short-read alignment | BWA-MEM (-5SP) | 0.7.17-r1188 | Li 2013 |
| SAM/BAM processing | samtools | 1.11 | Danecek et al. 2021 |
| SAM/BAM filtering | pairtools | 0.3.0 | Open2C et al. 2024 |
| Pairs indexing | pairix | 0.3.7 | Lee et al. 2022 |
| Matrix generation | cooler | 0.8.10 | Abdennur and Mirny 2020 |
| Matrix balancing | hicExplorer (hicCorrectMatrix correct --filterThreshold -2 4) | 3.6 | Ramírez et al. 2018 |
| Contact map visualization | HiGlass | 2.1.11 | Kerpedjiev et al. 2018 |
| | PretextMap | 0.1.4 | https://github.com/wtsi-hpag/PretextMap |
| | PretextView | 0.1.5 | https://github.com/wtsi-hpag/PretextView |
| | PretextSnapshot | 0.0.3 | https://github.com/wtsi-hpag/PretextSnapshot |
| Manual curation tools | Rapid curation pipeline (Wellcome Trust Sanger Institute, Genome Reference Informatics Team) | Commit 7acf220c | https://gitlab.com/wtsi-grit/rapid-curation |
| Genome quality assessment | | | |
| Basic assembly metrics | QUAST (--est-ref-size) | 5.0.2 | Gurevich et al. 2013 |
| Assembly completeness | BUSCO (-m geno, -l actinopterygii_odb10) | 5.8.2 | Manni et al. 2021 |
| | Merqury | 2020-01-29 | Rhie et al. 2020 |
| Repeat content and transposable element diversity | RepeatMasker | 4.1.2-p1 | Smit et al. 2015 |
| | Dfam | 3.8 | Storer et al. 2021 |
| | CD-HIT | 4.8.1 | Fu et al. 2012 |
| Contamination screening | | | |
| Local alignment tool | BLAST+ (-db nt, -outfmt “6 qseqid staxids bitscore std”, -max_target_seqs 1, -max_hsps 1, -evalue 1e-25) | 2.15 | Camacho et al. 2009 |
| General contamination screening | BlobToolKit (HiFi coverage, BUSCO = actinopterygii, NCBI Taxa ID = 72093) | 2.3.3 | Challis et al. 2020 |

The assemblies for both haplotypes were manually curated by iteratively generating and analyzing their corresponding Omni-C contact maps. Briefly, to generate the contact maps we aligned the Omni-C data with BWA-MEM (Li 2013), identified ligation junctions, and generated Omni-C pairs (Lee et al. 2022) using pairtools (Open2C et al. 2024). Then, we generated multi-resolution Omni-C matrices with cooler (Abdennur and Mirny 2020) and balanced them with hicExplorer (Ramírez et al. 2018). We used HiGlass (Kerpedjiev et al. 2018) and the PretextSuite (https://github.com/wtsi-hpag/PretextView; https://github.com/wtsi-hpag/PretextMap; https://github.com/wtsi-hpag/PretextSnapshot) to visualize the contact maps. We identified misassemblies and misjoins in these contact maps, and modified the assemblies using the Rapid Curation pipeline from the Wellcome Trust Sanger Institute, Genome Reference Informatics Team (https://gitlab.com/wtsi-grit/rapid-curation). Some of the remaining gaps (joins generated during scaffolding and/or curation) were closed using the PacBio HiFi reads and YAGCloser (https://github.com/merlyescalona/yagcloser). We checked for contamination using the BlobToolKit Framework (Challis et al. 2020).

Genome quality assessment

We generated k-mer counts from the PacBio HiFi reads using meryl (https://github.com/marbl/meryl). The k-mer counts were then used in GenomeScope2.0 (Ranallo-Benavidez et al. 2020) to estimate genome features including genome size, heterozygosity, and repeat content. To obtain general contiguity metrics, we ran QUAST (Gurevich et al. 2013). To evaluate genome quality and functional completeness we used Benchmarking Universal Single-Copy Orthologs (BUSCO) (Manni et al. 2021) with the Actinopterygii ortholog database (actinopterygii_odb10), which contains 3,640 genes. Assessment of base level accuracy (QV) and k-mer completeness was performed using the previously generated meryl database and Merqury (Rhie et al. 2020). We further estimated genome assembly accuracy via BUSCO gene set frameshift analysis using the pipeline described in Korlach et al. (2017). Measurements of the size of the phased blocks are based on the size of the contigs generated by HiFiasm in Hi-C mode. We follow the quality metric nomenclature established by Rhie et al. (2021), with the genome quality code x.y.P.Q.C, where x = log10[contig NG50]; y = log10[scaffold NG50]; P = log10 [phased block NG50]; Q = Phred base accuracy QV (quality value); C = % genome represented by the first ‘n’ scaffolds, following a karyotype of 2n = 48 for this species, estimated as a mode from ancestral species number of chromosomes (Genome on a Tree—GoaT; tax_name(Sebastes paucispinis); Challis et al. 2023). Quality metrics for the notation were calculated on the haplotype one assembly.
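
The quality code notation can be illustrated with a short sketch. It assumes that each component is rounded down to an integer, which is our reading of the notation rather than something stated in the text. The inputs are the haplotype one values from Table 2, and the percentage for C is taken directly from the reported code because the underlying figure is not given separately.

```python
import math


def quality_code(contig_ng50, scaffold_ng50, phased_block_ng50, base_qv, pct_in_scaffolds):
    """Format an x.y.P.Q.C quality code.

    Assumption: every component is floored to an integer.
    """
    x = math.floor(math.log10(contig_ng50))        # contig NG50 exponent
    y = math.floor(math.log10(scaffold_ng50))      # scaffold NG50 exponent
    p = math.floor(math.log10(phased_block_ng50))  # phased block NG50 exponent
    q = math.floor(base_qv)                        # Phred base accuracy
    c = math.floor(pct_in_scaffolds)               # % of genome in first n scaffolds
    return f"{x}.{y}.P{p}.Q{q}.C{c}"


# Haplotype one values from Table 2; 99.0 is assumed from the reported code.
code = quality_code(23_814_533, 38_903_856, 23_644_358, 64.5533, 99.0)
print(code)
```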

Assembly validation and comparison

To confirm the quality of the assembly, we compared its BUSCO genome completeness assessment (Manni et al. 2021) with other Sebastes rockfish assemblies. Specifically, we compared it against the widow rockfish Sebastes entomelas (NCBI: GCA_045837885.1 and NCBI: GCA_045838235.1), the Acadian redfish Sebastes fasciatus (NCBI: GCA_043250625.1 and GCA_043250585.1), and the honeycomb rockfish Sebastes umbrosus (NCBI: GCF_015220745.1 and GCA_015220095.1).

Repeat content and transposable element diversity

To determine the diversity of repeat elements along the S. paucispinis genome, we produced a species-specific repeat library using RepeatModeler (Smit and Hubley 2008). We combined this species-specific library with ancestral repeats for S. paucispinis from the Dfam database (Storer et al. 2021) and reduced redundancy with cd-hit-est (Fu et al. 2012). We then masked repeats in the assembly using RepeatMasker (Smit et al. 2015), which provided a summary of transposable elements (TEs) and other repetitive elements along the genome.

Results

Sequencing data

The PacBio HiFi library generated 6.01 million reads and the Omni-C library generated 216.72 million read pairs. The PacBio HiFi sequences yielded ~79× genome coverage and had an N50 read length of 13,699 bp; a minimum read length of 96 bp; a mean read length of 13,283 bp; and a maximum read length of 58,255 bp (Fig. 1B). Based on the PacBio HiFi data, GenomeScope 2.0 estimated a genome size of 767.83 Mb, a 0.153% sequencing error rate, and 0.211% heterozygosity. The k-mer spectrum shows a bimodal distribution with a major coverage peak at ~100-fold coverage and a minor coverage peak at ~50-fold coverage (Fig. 2A).

Fig. 2.

A) A k-mer-based analysis of genome size and duplication level run using GenomeScope, B) a BlobToolKit Snail plot of the S. paucispinis assembly fSebPau1.0.hap1 illustrating assembly summary statistics. The full outer circle represents the entirety of the assembly. From the center of the snail plot, the red line indicates the length of the longest scaffold in the assembly, scaffolds are represented in grey and arranged in ascending length order clockwise around the plot. Light and dark orange sections of the plot represent the N90 and N50 values of the assembly, respectively, and light and dark blue regions on the outer edge of the plot represent AT and GC content, respectively. Dovetail Omni-C contact maps for C) haplotype 1 and D) haplotype 2.

Nuclear genome assembly

The final genome assembly (fSebPau1) consists of two phased haplotypes. Both assemblies are similar in size, and both are similar but not equal to the genome size estimated by GenomeScope2.0. The haplotype one assembly (fSebPau1.0.hap1) consists of 211 scaffolds spanning 806.86 Mb with a contig N50 of 14.26 Mb, a scaffold N50 of 34.62 Mb, the largest contig size of 31.4 Mb, and the largest scaffold size of 43.67 Mb. The haplotype two assembly (fSebPau1.0.hap2) consists of 115 scaffolds spanning 802.85 Mb with a contig N50 of 18.44 Mb, a scaffold N50 of 34.47 Mb, the largest contig size of 35.2 Mb, and the largest scaffold size of 43.52 Mb (Table 2 and Fig. 2B).
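
For readers less familiar with these contiguity metrics, the sketch below shows how N50 and NG50 are generally computed. It uses toy sequence lengths, not the bocaccio data, and is not the QUAST implementation. N50 uses the assembly's own total length as the reference, whereas NG50 uses an estimated genome size, which is why the two can differ when the assembly is larger or smaller than the estimate.

```python
def nx(lengths, reference_total, fraction=0.5):
    """Length of the sequence at which the cumulative sum of sequences,
    sorted longest first, reaches `fraction` of `reference_total`."""
    threshold = reference_total * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0  # the sequences never reach the threshold


def n50(lengths):
    """N50: reference is the total assembly length."""
    return nx(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: reference is an estimated genome size."""
    return nx(lengths, genome_size)


toy_lengths = [40, 30, 20, 10, 5, 5]  # toy assembly, total length 110
toy_n50 = n50(toy_lengths)
toy_ng50 = ng50(toy_lengths, genome_size=70)
print(toy_n50, toy_ng50)
```

In this toy case the assembly is longer than the assumed genome size, so NG50 is larger than N50, the same direction of difference seen between the N50 and NG50 values in Table 2.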

Table 2.

Summary of assembly statistics for the Sebastes paucispinis genome assembly.

| Section | Item | Value / Haplotype 1 | Haplotype 2 |
| --- | --- | --- | --- |
| Bio projects and vouchers | CCGP NCBI BioProject | PRJNA720569 | |
| | Genera NCBI BioProject | PRJNA765858 | |
| | Species NCBI BioProject | PRJNA777218 | |
| | NCBI BioSample | SAMN36697770 | |
| | Specimen identification | SEB-726 | |
| | NCBI Genome accessions | Haplotype 1 | Haplotype 2 |
| | Assembly accession | JAUPFX000000000 | JAUPFY000000000 |
| | Genome sequences | GCA_036937225.1 | GCA_036937175.1 |
| Genome sequence | PacBio HiFi reads: run | 1 PACBIO_SMRT (Sequel IIe) run: 6M spots, 80G bases, 48.9Gb | |
| | PacBio HiFi reads: accession | SRX25151838 | |
| | Omni-C Illumina reads: run | 2 ILLUMINA (Illumina NovaSeq 6000) runs: 216.7M spots, 65.4G bases, 21.5Gb | |
| | Omni-C Illumina reads: accession | SRX25151838-9 | |
| Genome assembly quality metrics | Assembly identifier (Quality code)* | fSebPau1(7.7.P7.Q64.C99) | |
| | HiFi Read coverage § | 79.95X | |
| | Number of contigs | 378 | 278 |
| | Contig N50 (bp) | 14,264,469 | 18,449,034 |
| | Contig NG50 § | 23,814,533 | 27,712,622 |
| | Longest Contigs | 31,405,881 | 35,207,844 |
| | Number of scaffolds | 211 | 115 |
| | Scaffold N50 | 34,620,701 | 34,476,039 |
| | Scaffold NG50 § | 38,903,856 | 38,870,289 |
| | Largest scaffold | 43,679,056 | 43,528,066 |
| | Size of final assembly | 806,866,550 | 802,857,081 |
| | Phased block NG50 § | 23,644,358 | 27,712,622 |
| | Gaps per Gbp (# Gaps) | 207(167) | 203(163) |
| | Indel QV (Frame shift) | 48.27606757 | 48.51125506 |
| | Base pair QV | 64.5533 | 64.5533 |
| | Base pair QV, full assembly | 64.2817 | |
| | k-mer completeness | 97.3282 | 97.3208 |
| | k-mer completeness, full assembly | 99.7538 | |

BUSCO completeness (actinopterygii_odb10), n = 3,640

| Haplotype | C** | S | D | F | M |
| --- | --- | --- | --- | --- | --- |
| H1 | 99.3% | 98.8% | 0.5% | 0.6% | 0.1% |
| H2 | 99.3% | 98.9% | 0.4% | 0.6% | 0.1% |

*Assembly quality code x.y.P.Q.C derived notation, from (Rhie et al. 2021). x = log10[contig NG50]; y = log10[scaffold NG50]; P = log10 [phased block NG50]; Q = Phred base accuracy QV (Quality value); C = % genome represented by the first “n” scaffolds, following a karyotype of 2n = 48 for this species, estimated as a mode from ancestral species number of chromosomes (Genome on a Tree—GoaT; tax_name(Sebastes paucispinis); Challis et al. 2023). Quality metrics for the notation were calculated on the haplotype one assembly.

§Read coverage and NGx statistics have been calculated based on the estimated genome size of 767.83 Mb.

(H1) Haplotype 1 and (H2) Haplotype 2 assembly values.

**BUSCO Scores. Complete BUSCOs (C). Complete and single-copy BUSCOs (S). Complete and duplicated BUSCOs (D). Fragmented BUSCOs (F). Missing BUSCOs (M).

During manual curation, we made a total of 91 joins (47 on haplotype one and 44 on haplotype two) and 6 breaks (4 on haplotype one and 2 on haplotype two) based on the Omni-C contact map signal, and were able to close a total of 11 gaps (5 on haplotype one and 6 on haplotype two). No further contigs were modified or removed. The Omni-C contact maps show highly contiguous assemblies, with chromosome-length scaffolds (Fig. 2C and D). Assembly statistics are reported in Table 2 and represented graphically in Fig. 2B. We have deposited the genome assembly on NCBI GenBank (see Table 2 and “Data availability” for details).

Assembly validation and comparison

The haplotype one assembly has a BUSCO completeness score for the Actinopterygii gene set of 99.3%, a base pair quality value (QV) of 64.02, a k-mer completeness of 97.32%, and a frameshift indel QV of 48.27. The haplotype two assembly has a BUSCO completeness score for the Actinopterygii gene set of 99.3%, a base pair QV of 64.55, a k-mer completeness of 97.32%, and a frameshift indel QV of 48.51. This is in keeping with other high-quality Sebastes rockfish assemblies, whose haplotype 1 assemblies each had between 98.9% and 99.3% of BUSCOs complete (Table 3). The completeness of the alternative (haplotype 2) S. paucispinis assembly exceeded that of the haplotype 2 assemblies of the other rockfish species (including those with comparable haplotype 1 completeness), with a BUSCO missingness of only 0.1%, the same as haplotype 1, compared to a range of 0.3% to 4.1% missing across the other haplotype 2 assemblies (Table 3).
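
To make the QV values easier to interpret, the following sketch converts between a Phred-scaled quality value and a per-base error probability using the standard definition QV = -10 log10(p). Under this definition a QV of 60 corresponds to one expected error per million bases, so the base pair QVs reported here imply fewer than one error per million bases. The function names are ours and are not part of Merqury.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# Values reported above for the haplotype two assembly.
base_error = qv_to_error_rate(64.55)   # base pair QV
indel_error = qv_to_error_rate(48.51)  # frameshift indel QV
print(f"base error rate:  {base_error:.2e} per bp")
print(f"indel error rate: {indel_error:.2e} per bp")
```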

Table 3.

Summary of BUSCOs across recent haplotype-resolved rockfish assemblies.

| Common name | Latin name | Genome assembly initiative | Haplotype—NCBI accession | BUSCO results (C = Complete, S = Single copy, D = Duplicated, F = Fragmented, M = Missing, n = number of BUSCOs) |
| --- | --- | --- | --- | --- |
| Bocaccio rockfish | Sebastes paucispinis | CCGP | hap 1 - GCA_036937225.1 | C:99.3%[S:98.8%,D:0.5%],F:0.6%,M:0.1%,n:3,640 |
| | | | hap 2 - GCA_036937175.1 | C:99.3%[S:98.9%,D:0.4%],F:0.6%,M:0.1%,n:3,640 |
| Widow rockfish | Sebastes entomelas | CCGP | hap 1 - GCA_045837885.1 | C:99.3%[S:98.7%,D:0.7%],F:0.5%,M:0.1%,n:3,640 |
| | | | hap 2 - GCA_045838235.1 | C:99.2%[S:98.5%,D:0.6%],F:0.5%,M:0.3%,n:3,640 |
| Acadian redfish | Sebastes fasciatus | VGP | hap 1 - GCA_043250625.1 | C:99.3%[S:98.7%,D:0.6%],F:0.5%,M:0.2%,n:3,640 |
| | | | hap 2 - GCA_043250585.1 | C:92.1%[S:88.9%,D:3.2%],F:3.7%,M:4.1%,n:3,640 |
| Honeycomb rockfish | Sebastes umbrosus | VGP | hap 1 - GCF_015220745.1 | C:98.9%[S:98.4%,D:0.5%],F:0.5%,M:0.6%,n:3,640 |
| | | | hap 2 - GCA_015220095.1 | C:95.3%[S:94.3%,D:1.0%],F:1.4%,M:3.3%,n:3,640 |
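
The BUSCO summary strings in Table 3 follow a compact notation that is straightforward to parse and sanity-check. The sketch below is an illustrative helper, not part of BUSCO. It extracts the percentages and checks that S + D matches C and that C + F + M sums to 100%, within a tolerance that we chose to absorb rounding to one decimal place.

```python
import re

BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>[\d,]+)"
)


def parse_busco(summary):
    """Parse a one-line BUSCO summary string into a dict of numbers."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary string: {summary!r}")
    scores = {key: float(match.group(key)) for key in "CSDFM"}
    scores["n"] = int(match.group("n").replace(",", ""))
    return scores


def is_consistent(scores, tolerance=0.15):
    """Check S + D == C and C + F + M == 100, allowing for rounding."""
    complete_ok = abs(scores["S"] + scores["D"] - scores["C"]) <= tolerance
    total_ok = abs(scores["C"] + scores["F"] + scores["M"] - 100.0) <= tolerance
    return complete_ok and total_ok


# Bocaccio haplotype 1 entry from Table 3.
hap1 = parse_busco("C:99.3%[S:98.8%,D:0.5%],F:0.6%,M:0.1%,n:3,640")
print(hap1, is_consistent(hap1))
```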

Repeat content and transposable element diversity

TEs and repeats spanned 45.66% of the genome in total (Fig. 3A). The most abundant category was Class II TEs (or retrotransposons), which span 23.26% of the genome, followed by unclassified repeats (11.97%), Class I TEs (or DNA transposons, 6.49%) and then simple repeats (2.23%; Fig. 3A). Our analysis of repeat element lengths shows that the median repeat length is around 100 bp, while the longest detected repeat spans over 500 kb (Fig. 3B). The length distribution varies slightly across repeat classes, with simple and low-complexity repeats being the shortest classes and satellites the longest.

Fig. 3.

A) Proportion of the Sebastes paucispinis genome spanned by different repeat and transposable element categories. B) A histogram of repeat elements binned by repeat element length in log scale.

Discussion

Here we present a haplotype-resolved chromosome-scale assembly of the bocaccio rockfish Sebastes paucispinis. The assembly comprises 211 scaffolds and spans 806.9 Mb, with a scaffold N50 of 34.6 Mb. This assembly is similar in quality to other recently assembled rockfish genomes, with 99.3% complete BUSCOs (and only 0.6% fragmented, and 0.1% missing), and represents a substantial step forward compared to rockfish assemblies produced with older sequencing and assembly technologies. The assembly highlights the high repeat content present in rockfish genomes, with over 45% of the genome being spanned by TEs and repeats.

Rockfish have become a focal clade for studying aging due to their broad variation in lifespan (Kolora et al. 2021). Previous comparative genomics investigations into the evolution of lifespan variation and longevity have been carried out using genomes produced with older, lower-fidelity sequencing technologies (Kolora et al. 2021), resulting in less contiguous and less accurate reference assemblies. Recent advances in sequencing technologies and genome assembly tools such as HiFiasm (Cheng et al. 2021), used to produce this bocaccio genome assembly, represent the state of the art and pave the way for more comprehensive investigations into aging. In addition, haplotype-resolved PacBio HiFi genomes will bolster ongoing work into a host of other questions in the rockfish clade, including studies into the basis of ecological speciation between rockfish species, instances of local adaptation, as well as more fundamental investigations into aspects of genomic variation, both in terms of nucleotide variation and structural variation (which relies heavily on having high-quality reference genomes).

Following their dramatic decline from 1960 to 1990, it is clear that bocaccio populations require careful management and monitoring to facilitate population/stock recovery (Love et al. 2002; He and Field 2017). As with other fish taxa, the designation and monitoring of new marine protected areas (MPAs), together with associated genetic/genomic monitoring and assessment, can be a powerful tool to aid in the recovery of wild populations. Genomic resources can help in a multitude of ways, for example by enabling the production of genetic panels for genetic diversity assessments and the identification of subpopulation structure across species ranges. In the case of bocaccio, little is known about population structure across their vast range, and while pre-genomic microsatellite and simple sequence repeat markers revealed little evidence of significant population structure (Matala et al. 2004; Drake et al. 2010; Buonaccorsi et al. 2012), these approaches use only a tiny fraction of the genome (10s of markers). High-quality chromosome-scale genomes allow the identification of millions of markers, dramatically increasing the genomic resolution with which we can identify fine-scale population variation. These insights can then be translated into improved knowledge of population structure and ultimately guide recovery and population management plans (Bernatchez et al. 2017; Bernos et al. 2020). The interesting spatial distribution of bocaccio, particularly their abundance and preference for schooling near oil platforms, also provides a unique opportunity to study human-wildlife interactions and the impacts of anthropogenic structures on wild populations (Love et al. 2005).

This chromosome-scale bocaccio assembly was produced as part of the CCGP and fulfills the mission of producing genomic resources that will assist in the protection of California wildlife in the face of ever-increasing anthropogenic stressors. This high-quality bocaccio assembly will aid in bocaccio conservation and act as a resource that the broader scientific community can use to address fundamental biology questions using rockfish as a focal clade, including ongoing investigations into the evolution of lifespan variation.

Acknowledgments

PacBio Sequel II library prep and sequencing were carried out at the DNA Technologies and Expression Analysis Cores at the UC Davis Genome Center, supported by NIH Shared Instrumentation Grant 1S10OD010786-01. Deep sequencing of Omni-C libraries used the Novaseq S4 sequencing platforms at the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley, supported by NIH S10 OD018174 Instrumentation Grant. We thank the staff at the UC Davis DNA Technologies and Expression Analysis Cores and the UC Santa Cruz Paleogenomics Laboratory for their diligence and dedication to generating high-quality sequence data.

Contributor Information

Rishi De-Kayne, Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA.

Stacy Li, Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA.

Merly Escalona, Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA.

Runyang Nicolas Lou, Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA.

Juan Manuel Vazquez, Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA.

Gregory L Owens, Department of Biology, University of Victoria, Victoria, BC, Canada.

Sree Rohit Raj Kolora, Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA.

Conner Jainese, Marine Science Institute, University of California, Santa Barbara, Santa Barbara, CA, USA.

Katelin Seeto, Marine Science Institute, University of California, Santa Barbara, Santa Barbara, CA, USA.

Merit McCrea, Marine Science Institute, University of California, Santa Barbara, Santa Barbara, CA, USA.

Oanh Nguyen, DNA Technologies and Expression Analysis Core Laboratory, Genome Center, University of California, Davis, Davis, CA, USA.

Noravit Chumchim, DNA Technologies and Expression Analysis Core Laboratory, Genome Center, University of California, Davis, Davis, CA, USA.

Ruta Sahasrabudhe, DNA Technologies and Expression Analysis Core Laboratory, Genome Center, University of California, Davis, Davis, CA, USA.

Colin W Fairbairn, Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA 95064, USA.

Richard E Green, Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA.

William E Seligmann, Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA 95064, USA.

Milton Love, Marine Science Institute, University of California, Santa Barbara, Santa Barbara, CA, USA.

Peter H Sudmant, Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA.

Author contributions

Rishi De-Kayne (Data curation, Formal analysis, Investigation, Resources, Software, Validation, Visualization, Writing—original draft, Writing—review & editing), Stacy Li (Conceptualization, Data curation, Formal analysis, Project administration, Resources, Software, Writing—original draft, Writing—review & editing), Merly Escalona (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing—original draft, Writing—review & editing), Runyang Lou (Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing—original draft, Writing—review & editing), Gregory Owens (Validation, Visualization, Writing—review & editing), Sree Rohit Raj Kolora (Conceptualization, Data curation, Investigation, Methodology, Resources, Writing—review & editing), Conner Jainese (Investigation, Methodology, Resources), Katelin Seeto (Data curation, Investigation, Methodology), Merit McCrea (Data curation, Methodology, Resources), Oanh Nguyen (Data curation, Investigation, Methodology, Resources), Noravit Chumchim (Data curation, Investigation, Methodology, Resources), Ruta Sahasrabudhe (Data curation, Methodology, Resources), Colin W. Fairbairn (Investigation, Methodology, Resources), Richard Green (Investigation, Methodology, Project administration), William Seligmann (Investigation, Methodology, Resources), Milton Love (Data curation, Funding acquisition, Methodology, Project administration, Supervision, Writing—review & editing), and Peter Sudmant (Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing—original draft, Writing—review & editing).

Funding

This work was supported by the California Conservation Genomics Project, with funding provided to the University of California by the State of California, State Budget Act of 2019 [UC Award ID RSI-19-690224] to PHS, NIH NIGMS award R35GM13798 to PHS, and a North Pacific Research Board Award (NPRB P2112).

Data availability

Data generated for this study are available under NCBI BioProject PRJNA720569. Raw sequencing data for sample SEB-726 (NCBI BioSample SAMN36697770) are deposited in the NCBI Sequence Read Archive (SRA) under SRX25151838 (PacBio HiFi) and SRX25151839 and SRX25151840 (Dovetail Omni-C). Assembly scripts and other data for the analyses presented can be found at the following GitHub repository: www.github.com/ccgproject/ccgp_assembly

References

  1. Abdennur N, Fudenberg G, Flyamer IM, Galitsyna AA, Goloborodko A, Imakaev M, Venev SV; Open2C. Pairtools: from sequencing data to chromosome contacts. PLoS Comput Biol. 2024:20:e1012164.
  2. Abdennur N, Mirny LA. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. 2020:36:311–316.
  3. Bernatchez L, Wellenreuther M, Araneda C, Ashton DT, Barth JMI, Beacham TD, Maes GE, Martinsohn JT, Miller KM, Naish KA, et al. Harnessing the power of genomics to secure the future of seafood. Trends Ecol Evol. 2017:32:665–680.
  4. Bernos TA, Jeffries KM, Mandrak NE. Linking genomics and fish conservation decision making: a review. Rev Fish Biol Fish. 2020:30:587–604.
  5. Buonaccorsi VP, Kimbrell CA, Lynn EA, Hyde JR. Comparative population genetic analysis of bocaccio rockfish Sebastes paucispinis using anonymous and gene-associated simple sequence repeat loci. J Hered. 2012:103:391–399.
  6. Challis R, Kumar S, Sotero-Caio C, Brown M, Blaxter M. Genomes on a Tree (GoaT): a versatile, scalable search engine for genomic and sequencing project metadata across the eukaryotic tree of life. Wellcome Open Res. 2023:8:24.
  7. Challis R, Richards E, Rajan J, Cochrane G, Blaxter M. BlobToolKit—interactive quality assessment of genome assemblies. G3 (Bethesda). 2020:10:1361–1374.
  8. Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021:18:170–175.
  9. Drake JS, Berntson EA, Gustafson RG, Holmes EE, Levin PS, Tolimieri N, et al. Status review of five rockfish species in Puget Sound, Washington: bocaccio (Sebastes paucispinis), canary rockfish (S. pinniger), yelloweye rockfish (S. ruberrimus), greenstriped rockfish (S. elongatus), and redstripe rockfish (S. proriger). U.S. Dept. Commer., NOAA Tech. Memo. NMFS-NWFSC-108; 2010. 234 p.
  10. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012:28:3150–3152.
  11. Ghurye J, Pop M, Koren S, Bickhart D, Chin C-S. Scaffolding of long read assemblies using long range contact information. BMC Genomics. 2017:18:527.
  12. Ghurye J, Rhie A, Walenz BP, Schmitt A, Selvaraj S, Pop M, Phillippy AM, Koren S. Integrating Hi-C links with assembly graphs for chromosome-scale assembly. PLoS Comput Biol. 2019:15:e1007273.
  13. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013:29:1072–1075.
  14. He X, Field JC. Stock assessment update: status of bocaccio, Sebastes paucispinis, in the Conception, Monterey and Eureka INPFC areas for 2017. Pacific Fishery Management Council, Portland, Oregon; 2017.
  15. Kerpedjiev P, Abdennur N, Lekschas F, McCallum C, Dinkla K, Strobelt H, Luber JM, Ouellette SB, Azhir A, Kumar N, et al. HiGlass: web-based visual exploration and analysis of genome interaction maps. Genome Biol. 2018:19:125.
  16. Kolora SRR, Owens GL, Vazquez JM, Stubbs A, Chatla K, Jainese C, Seeto K, McCrea M, Sandel MW, Vianna JA, et al. Origins and evolution of extreme life span in Pacific Ocean rockfishes. Science. 2021:374:842–847.
  17. Korlach J, Gedman G, Kingan SB, Chin C-S, Howard JT, Audet J-N, Cantin L, Jarvis ED. De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads. GigaScience. 2017:6:1–16.
  18. Lee S, Bakker CR, Vitzthum C, Alver BH, Park PJ. Pairs and Pairix: a file format and a tool for efficient storage and retrieval for Hi-C read pairs. Bioinformatics. 2022:38:1729–1731.
  19. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv [q-bioGN]. 2013.
  20. Love M, Schroeder D, Lenarz W. Distribution of bocaccio (Sebastes paucispinis) and cowcod (Sebastes levis) around oil platforms and natural outcrops off California with implications for larval production. Bull Mar Sci. 2005:77:397–408.
  21. Love MS, Bizzarro JJ, Cornthwaite AM, Frable BW, Maslenikov KP. Checklist of marine and estuarine fishes from the Alaska–Yukon Border, Beaufort Sea, to Cabo San Lucas, Mexico. Zootaxa. 2021:5053:1–285.
  22. Love MS, Passarelli JK. Miller and Lea’s guide to the coastal marine fishes of California, 2nd ed. Davis, California, USA: UCANR Publications; 2020.
  23. Love MS, Yoklavich M, Schroeder DM. Demersal fish assemblages in the Southern California Bight based on visual surveys in deep water. Environ Biol Fishes. 2009:84:55–68.
  24. Love MS, Yoklavich M, Thorsteinson LK. The Rockfishes of the Northeast Pacific. Berkeley and Los Angeles, California, USA: University of California Press; 2002.
  25. Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol. 2021:38:4647–4654.
  26. Matala AP, Gray AK, Gharrett AJ, Love MS. Microsatellite variation indicates population genetic structure of bocaccio. N Am J Fish Manag. 2004:24:1189–1202.
  27. Orr JW, Brown MA, Baker DC, Alaska Fisheries Science Center (U.S.). Guide to rockfishes (Scorpaenidae) of the genera Sebastes, Sebastolobus, and Adelosebastes of the Northeast Pacific Ocean. NOAA technical memorandum NMFS-AFSC; 1998. https://repository.library.noaa.gov/view/noaa/26628
  28. Ramírez F, Bhardwaj V, Arrigoni L, Lam KC, Grüning BA, Villaveces J, Habermann B, Akhtar A, Manke T. High-resolution TADs reveal DNA sequences underlying genome organization in flies. Nat Commun. 2018:9:189.
  29. Ranallo-Benavidez TR, Jaron KS, Schatz MC. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun. 2020:11:1432.
  30. Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S, Uliano-Silva M, Chow W, Fungtammasan A, Kim J, et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature. 2021:592:737–746.
  31. Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020:21:245.
  32. Shaffer HB, Toffelmier E, Corbett-Detig RB, Escalona M, Erickson B, Fiedler P, Gold M, Harrigan RJ, Hodges S, Luckau TK, et al. Landscape genomics to enable conservation actions: The California Conservation Genomics Project. J Hered. 2022:113:577–588.
  33. Sim SB, Corpuz RL, Simmonds TJ, Geib SM. HiFiAdapterFilt, a memory efficient read processing pipeline, prevents occurrence of adapter sequence in PacBio HiFi reads and their negative impacts on genome assembly. BMC Genomics. 2022:23:157.
  34. Smit AFA, Hubley R. RepeatModeler Open-1.0, 2008. Available from http://www.repeatmasker.org
  35. Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. 2013–2015, 2015.
  36. Storer J, Hubley R, Rosen J, Wheeler TJ, Smit AF. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob DNA. 2021:12:2.
  37. Williams GD, Levin PS, Palsson WA. Rockfish in Puget Sound: an ecological history of exploitation. Mar Policy. 2010:34:1010–1020.

Articles from Journal of Heredity are provided here courtesy of Oxford University Press