PLOS ONE. 2020 Oct 29;15(10):e0240935. doi: 10.1371/journal.pone.0240935

The sockeye salmon genome, transcriptome, and analyses identifying population defining regions of the genome

Kris A Christensen 1,2,*, Eric B Rondeau 1,2,3, David R Minkley 1,2, Dionne Sakhrani 1, Carlo A Biagi 1, Anne-Marie Flores 2, Ruth E Withler 3, Scott A Pavey 4, Terry D Beacham 3, Theresa Godin 5, Eric B Taylor 6, Michael A Russello 7, Robert H Devlin 1, Ben F Koop 2,*
Editor: Zuogang Peng 8
PMCID: PMC7595290  PMID: 33119641

Abstract

Sockeye salmon (Oncorhynchus nerka) is a commercially and culturally important species to the people that live along the northern Pacific Ocean coast. There are two main sockeye salmon ecotypes—the ocean-going (anadromous) ecotype and the fresh-water ecotype known as kokanee. The goal of this study was to better understand the population structure of sockeye salmon and identify possible genomic differences among populations and between the two ecotypes. In pursuit of this goal, we generated the first reference sockeye salmon genome assembly and an RNA-seq transcriptome data set to better annotate features of the assembly. Resequenced whole-genomes of 140 sockeye salmon and kokanee were analyzed to understand population structure and identify genomic differences between ecotypes. Three distinct geographic and genetic groups were identified from analyses of the resequencing data. Nucleotide variants in an immunoglobulin heavy chain variable gene cluster on chromosome 26 were found to differentiate the northwestern group from the southern and upper Columbia River groups. Several candidate genes were found to be associated with the kokanee ecotype. Many of these genes were related to ammonia tolerance or vision. Finally, the sex chromosomes of this species were better characterized, and an alternative sex-determination mechanism was identified in a subset of upper Columbia River kokanee.

Introduction

Sockeye salmon (Oncorhynchus nerka) are one of eight species of Pacific salmon and trout native to the North Pacific Ocean where they are of tremendous economic and cultural significance. The return of sockeye salmon from the Pacific Ocean to rivers and lakes is part of an ancient series of migrations that began with the emergence of the species several million years ago (reviewed in [1–5]). Spawning sockeye salmon, with their bright red bodies, pulse at various times during summer and fall through streams from the Columbia River to the Mackenzie River (Northwest Territories, Canada) in North America, and from Hokkaido, Japan to the Chukotka Peninsula in Asia [6]. The largest concentrations of sockeye salmon, and where most commercial catches are taken, centre around Bristol Bay (Alaska, USA), the Fraser River (BC, Canada), and the Kamchatka Peninsula (Kamchatka, Russia) [7].

Sockeye salmon can be anadromous (ocean-going) or remain as freshwater populations known as kokanee [6]. Populations of sockeye salmon and kokanee can be broadly divided into northwestern and southern phylogenetic groups based on allozyme, minisatellite, and mtDNA loci (a third glacial refugium has also been suggested) [8,9]. This split between northwestern and southern phylogenetic groups is consistent with other Pacific salmon species and suggests two common North American glacial refugia during the Last Glacial Maximum [8–14]. Modern populations of Pacific salmon are thought to be derived from the colonization of fish from these refugia. Kokanee from both phylogenetic groups have diverged from the ocean-going ecotype multiple times (i.e. the kokanee ecotypes are polyphyletic) since the Last Glacial Maximum except in some locations (e.g. the Fraser and Columbia Rivers) [2,8,15], where multiple populations of kokanee are more closely related to each other than to sympatric sockeye salmon. At one time the Fraser and Columbia Rivers may have been connected, which could explain how kokanee from these rivers could be monophyletic [15,16].

Several studies have previously identified genomic regions underlying the various sockeye ecotypes (including spawning habitat ecotypes not discussed earlier) [17–23]. As noted in Pritchard et al. (2018), one gene that was identified in some of these studies comparing sockeye salmon and kokanee or other ecotypes (e.g. shore-spawning vs. stream-spawning) was the leucine-rich repeat-containing 9 gene [17,20,21,24]. This gene is proximal to the six homeobox 6 gene that is a candidate gene under strong selection in differing Atlantic salmon (Salmo salar) populations (associated with upstream catchment) [24]. Larson et al. (2014 and 2019) also discovered that the MHC class II peptide-binding region in sockeye salmon was under directional selection based on spawning habitat ecotype [18,25].

Sockeye salmon have an XY sex-determination system (with sdY as the sex-determining gene on the Y-chromosome [26–29]). Interestingly, the sockeye salmon Y-chromosome has fused with an autosome making it an X1X2Y system [27,30,31]. The X1 and X2 chromosomes correspond to linkage groups 9b and 9a, respectively (from Limborg et al. (2015) [32]). Not all populations of sockeye salmon appear to have a strong association of sdY to sex [27], suggesting that an alternative sex-determination mechanism(s) may exist in certain populations. SdY-positive females in Atlantic salmon have been identified previously and explained as possible mosaicism, but sdY-negative males are less common and require another explanation [33].

Salmonid sex chromosomes are generally not conserved between genera or species even though linkage groups are often conserved between species otherwise [29,34–36]. In Atlantic salmon, sdY has translocated between chromosomes at least twice since Atlantic salmon speciation, which suggests that sex determination in salmon can be “evolutionarily fluid” [37,38]. The sdY gene is surrounded by repetitive sequence and is small, which may allow or possibly facilitate these translocation events and generation of novel sex chromosomes in salmonids [29,34,39].

In this study, we generated the first sockeye salmon reference genome assembly. With the large RNA-seq data sets we produced, the National Center for Biotechnology Information (NCBI) generated a standardized gene annotation of this genome assembly. We also resequenced the genomes of 140 sockeye salmon and kokanee samples from along the northern Pacific Ocean to better understand population structure and genomic loci underlying divergence between populations and ecotypes. Finally, we were better able to characterize the sex-chromosomes of this species.

Materials and methods

Samples

See S1 Methods for sampling strategy.

DNA extractions, RNA extractions, libraries, and sequencing

DNA was extracted from tissue samples preserved in ethanol (from Eric Taylor’s, Scott Pavey’s, and Michael Russello’s labs) with a DNeasy Blood & Tissue Kit (QIAGEN) following the manufacturer’s protocol, or the DNA was already extracted. DNA was extracted from tissue samples (from Fisheries and Oceans Canada and the Freshwater Fisheries Society of British Columbia) preserved in RNAlater (ThermoFisher) following the manufacturer’s protocol [40]. RNA was extracted from the Pitt Lake sockeye salmon tissue samples (On170719-1) preserved in RNAlater using a RNeasy Mini Kit (QIAGEN).

Overlapping paired-end library preparation and sequencing (used for genome assembly) was performed at McGill University and Génome Québec Innovation Centre. These libraries were generated following the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England BioLabs). The Pippin Prep (SAGE Science) was used for size selection (peak of distribution: ~488 bp based on 2100 BioAnalyzer (Agilent Technologies)) and the library was sequenced on an Illumina HiSeq 2500 in Rapid mode (PE250).

Mate-pair libraries were prepared at McGill University and Génome Québec Innovation Centre following the Nextera Mate Pair Sample Prep Kit (Illumina). Size selection was performed on the libraries using a 0.5% agarose gel for the following sizes: 3, 5, 10 kbp. These libraries were sequenced on an Illumina HiSeq 2500 (PE125).

For whole-genome resequencing, libraries were produced at McGill University and Génome Québec Innovation Centre using a NxSeq AmpFREE Low DNA Library Kit and NxSeq Adaptors (Lucigen) after DNA passed QC (Quant-iT PicoGreen dsDNA Assay Kit (Life Technologies) and gel electrophoresis). They were then sequenced on an Illumina HiSeq X (PE150) after libraries passed QC (Quant-iT PicoGreen dsDNA Assay Kit, Kapa Illumina GA with Revised Primers-SYBR Fast Universal Kit (Kapa Biosystems), and LabChip Gx (PerkinElmer)).

RNA-seq libraries were generated at McGill University and Génome Québec Innovation Centre after the RNAs passed QC (NanoDrop Spectrophotometer ND-1000 (NanoDrop Technologies, Inc.) and 2100 Bioanalyzer (Agilent Technologies)). Total RNA (250 ng) was enriched for mRNA using the NEBNext Poly(A) Magnetic Isolation Module (New England BioLabs), and then cDNA synthesis followed using the NEBNext RNA First-Strand Synthesis and NEBNext Ultra Directional RNA Second Strand Synthesis Modules (New England BioLabs). NEBNext Ultra II DNA Library Prep Kit for Illumina (New England BioLabs) was used to finish the library preparation. Libraries were sequenced after passing QC (same as above) on an Illumina HiSeq 4000 (PE100).

PacBio DNA libraries were prepared with the Pacific Biosciences 20 kb Template Preparation Using BluePippin Size-Selection System instructions at McGill University and Génome Québec Innovation Centre. Briefly, Covaris g-TUBES (Covaris) were used to shear high molecular weight DNA at 4000 RPM for 60 s (each direction). DNA damage repair, end repair, and SMRT bell ligation followed manufacturer’s protocol using the SMRTbell Template Prep Kit 1.0 reagents (Pacific Biosciences). Size selection (9 kb-50 kb) was then performed on a BluePippin system (Sage Science). Sequencing was performed on a Sequel (Sequel Sequencing Plate 2.1, SMRT cells 1M v2).

The 10X Chromium shotgun libraries (for genome assembly) were prepared and sequenced at McGill University and Génome Québec Innovation Centre using the Chromium Genome Reagent kits v2 User Guide RevB protocol after BluePippin size selection for 40 kb–80 kb DNA fragments. The Chromium library was sequenced on an Illumina HiSeq X Ten.

Genome assembly

The overlapping paired-end 250 bp reads (PE250) and all the reads from the mate-pair libraries were checked for quality using FastQC (default settings) [41]. The PE250 reads were then trimmed using trimmomatic version 0.36 [42] with the following parameters: ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 (i.e. only adaptors were removed based on review of the output from FastQC). All the mate-pair libraries were also trimmed with trimmomatic (ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 LEADING:28 TRAILING:28 SLIDINGWINDOW:4:15 MINLEN:50). PacBio long reads were error corrected with the paired-end data using LoRDEC [43] with the following parameters: -k 21 -s 3 -T 50.

Initial contigs and scaffolds were produced using ALLPATHS-LG [44] with the following parameters: coverage of 64x from the PE250 reads (32x from two lanes), 14x coverage from each of the three mate-pair libraries, genome size set to 2.6 Gbp, ploidy set to one. To increase the contig lengths, a custom pipeline was used to fill gaps in the scaffolds. First, the corrected PacBio reads were aligned to the assembly using BLAST [45] (-task megablast, -evalue 1E-16, -max_hsps 25, -word_size 42, -perc_identity 85, -max_target_seqs 4, -outfmt 6). The alignments were then filtered with custom software (S1 File) that filters based on linear alignments (maximum distance between BLAST high-scoring segment pairs (hsps) was 15 kbp, minimum length of all hsps was set to 1000, the minimum average percent identity of all the hsps was 87 with a minimum of 85 for an individual hsp). Each PacBio read was only allowed to have one best location. The script gap_finder.pl from LR_Gapcloser [46] was used to identify the locations of all of the gaps in the assembly. The corrected PacBio reads were then used to fill these gaps if the aligned reads spanned the gap as a single hsp or as two flanking hsps. In either case, the distance of the hsps (one spanning or two flanking) from the gap was allowed to be at most 100 bp, the size of the sequence filling the gap was allowed to be at most 100x larger than the predicted gap (the prediction was made in the ALLPATHS-LG assembly from mate-pair data), and the minimum gap size to fill was 9 bp.
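The gap-filling rules described above can be sketched as a small predicate. This is only an illustration and not the authors' custom software from S1 File: the function name, argument names, and data layout are hypothetical, and reading the limits as upper bounds (flank distance no more than 100 bp, fill length no more than 100 times the predicted gap) is an assumption.

```python
def can_fill_gap(predicted_gap, fill_len, flank_distances,
                 max_flank_dist=100, max_fill_ratio=100, min_gap=9):
    """Decide whether a read-derived sequence may fill a scaffold gap.

    predicted_gap   -- gap size (bp) predicted by the assembler
    fill_len        -- length (bp) of the read sequence that would fill the gap
    flank_distances -- distance (bp) from each hsp edge to the gap edge,
                       one value per side of the gap (hypothetical layout)
    """
    # Gaps smaller than the minimum size are not filled.
    if predicted_gap < min_gap:
        return False
    # The filling sequence may not be excessively larger than the prediction.
    if fill_len > max_fill_ratio * predicted_gap:
        return False
    # The alignment(s) must sit close to the gap on every side.
    return all(d <= max_flank_dist for d in flank_distances)
```

For example, `can_fill_gap(50, 200, [10, 30])` accepts a 200 bp fill for a predicted 50 bp gap, whereas a predicted gap of 5 bp is rejected regardless of the alignment.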

The assembly was then polished with Pilon [47] using all of the trimmed paired-end data (bwa mem aligned with -M parameter, Samtools [48] sorted, and default Pilon parameters), and then scaffolded using the 10x data with the Arcs/Tigmint/Links pipeline [49–51]. The following parameters were used for the Arcs/Tigmint/Links pipeline: -l 5, -a 0.5, -c 3, -e 30000. After scaffolding with the 10x data, another custom gap filling was performed (same as before, except that the maximum distance between hsps was 100 kbp).

The assembly was then error corrected with Arrow [52,53] using the ArrowGrid pipeline [54,55] using all of the PacBio reads before correction (default settings) and then with Pilon again (same as above). Scaffolds smaller than 500 bp were then removed using the seqtk [56] function seq. Finally, the sequences were all made uppercase using Unix commands.
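The last two clean-up steps (dropping scaffolds shorter than 500 bp and converting all sequence to uppercase) were done with seqtk and Unix commands. A rough Python equivalent is sketched below for illustration only; the function names are hypothetical and the parser assumes a plain, well-formed FASTA input.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_and_uppercase(records, min_len=500):
    """Keep records of at least min_len bp and uppercase their sequence."""
    return [(name, seq.upper()) for name, seq in records if len(seq) >= min_len]
```

Calling `filter_and_uppercase(read_fasta(open_file))` on an open FASTA file handle would return the retained scaffolds as a list of name and sequence pairs.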

Scaffolds were ordered and oriented onto pseudomolecules/chromosomes using the methodology described in Christensen et al. (2018) [57]. Briefly, scaffolds were aligned to the Atlantic salmon (Salmo salar, GCF_000233375.1 [58]), coho salmon (Oncorhynchus kisutch, GCF_002021735.1), Chinook salmon (Oncorhynchus tshawytscha, GCF_002872995.1 [57]), rainbow trout (Oncorhynchus mykiss, GCF_002163495.1 [59]), Arctic charr (Salvelinus alpinus, GCF_002910315.2 [60]), and Northern pike (Esox lucius, GCF_000721915.3 [61]) genome assemblies using BLAST (-outfmt 6, -word_size 48, -perc_identity 94, -max_hsps 100, -max_target_seqs 10, -evalue 1E-16; for Northern pike: -outfmt 6, -max_hsps 400, -max_target_seqs 10, -evalue 1E-11). These alignments were then filtered (e.g. off-target or repetitive elements) using the scripts from Christensen et al. (2018) (-max1/2 0.5, -maxActual1/2 100K, -minl1 0.25, -minl2 0, -minaln 1K, -avgminper 94, -minper 94, -pidVar 4; for Atlantic salmon and Arctic charr: -avgminper 93 and -minper 93; for Northern pike: -minl1 0.2, -avgminper 86, -minper 85). The best placement for each scaffold was then found (-filtMinLen 1K -minWinSize 10K -minSizeLarger 10K). Marey maps (graphs with genetic map positions on one axis and genomic positions on the other axis) were generated for each of the previously published sockeye salmon genetic maps [32,62,63] using the methodology from Christensen et al. (2018). The syntenic information was combined with the Marey maps using custom scripts (S1 File). The scaffolds were then manually ordered in LibreOffice Calc using the synteny and Marey map information. The order of the scaffolds was visualized against other genomes using custom scripts (S1 File) and ggplot2 [64] in R [65]. A BUSCO [66] analysis was performed for quality assurance. The actinopterygii_odb9 BUSCO dataset (parameters: -m geno, -c 10, -sp zebrafish) was used to find the fraction of genes that could be identified in the genome assembly to assess the quality of the assembly.

Circos plot

Duplicated genomic regions were identified with default settings of SyMap [67] from a modified copy of the genome fasta file. The lower-case sequences of the genome (sequences masked by WindowMasker [68] by the NCBI) were first replaced with “N’s” using a Unix command (sed -e '/^>/! s/[[:lower:]]/N/g'). Also, only sequences assigned to chromosomes were retained with the Samtools faidx command. The orientation of the blocks generated from SyMap was found using the script Analyze_Symap_Block_Orientation.py from Christensen et al. (2018) [57]. The percent identity between duplicated genomic regions was found using the output from SyMap and the Analyze_Symap_Linear_Alignments.py script [57]. The percent of repetitive elements was identified in genomic blocks from the modified genomic fasta file and the Percent_Repeat_Genome_Fasta.py script [57]. Circos software was then used to generate the Circos plot [69].
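The sed command above converts soft-masked (lowercase) bases into N while leaving FASTA header lines untouched. A minimal Python sketch of the same idea follows; the function name is hypothetical, and it assumes ASCII sequence characters (the sed character class `[[:lower:]]` is locale-dependent, whereas `[a-z]` here is not).

```python
import re


def hard_mask(fasta_lines):
    """Replace lowercase bases with N on sequence lines; keep headers as-is."""
    masked = []
    for line in fasta_lines:
        if line.startswith(">"):
            masked.append(line)                        # header: unchanged
        else:
            masked.append(re.sub(r"[a-z]", "N", line))  # sequence: mask
    return masked
```

Headers are skipped so that lowercase letters in sequence names are not overwritten, which mirrors the `/^>/!` address in the sed expression.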

Gene annotation

RNA-seq libraries (see above) of red muscle (SRA accession: SRX5621463), hind gut (SRX5621462), stomach (SRX5621461), ovaries (SRX5621460), gill (SRX5621459), spleen (SRX5621458), pituitary (SRX5621457), white muscle (SRX5621456), pyloric caeca (SRX5621455), adipose (SRX5621454), heart (SRX5621453), liver (SRX5621452), brain (SRX5621451), mid gut (SRX5621450), left eye (SRX5621449), upper jaw (SRX5621448), head kidney (SRX5621447), and lower jaw (SRX5621446) were submitted to GenBank to improve the NCBI genome annotation. The number of paired-end reads per tissue ranged from 98.7 to 147.5 million.

Variant calling

The GATK version 3.8 best practices pipeline [70–72] was used as a framework to call variants on the whole genome resequencing data. First the raw paired-end reads were aligned to the genome with bwa mem (-M parameter) version 0.7.17 [73], and Samtools sort version 1.9 was used to sort the resulting alignments. The Picard (version 2.18.9) command AddOrReplaceReadGroups was used to add information about the experiments to the alignment file (validation stringency parameter was set to lenient). Samtools was then used to index the alignment files, and MarkDuplicates from Picard added information about possible PCR duplicates (lenient validation stringency). Samples that were split between sequencing lanes were merged with the MarkDuplicates command and read group information was changed using the Picard command ReplaceSamHeader.

GATK’s HaplotypeCaller generated GVCF files (--genotyping_mode DISCOVERY, --emitRefConfidence GVCF) for each individual. These GVCF files were then genotyped using GATK’s GenotypeGVCFs command for 10 Mbp intervals (generated by a custom script, S1 File). The resulting VCF files were then merged using the GATK command CatVariants. The merged VCF file was then sorted using the VCFtools [74] command vcf-sort. The sorted VCF file was then compressed and indexed with the Bgzip and Tabix programs [75].
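Splitting the genome into 10 Mbp intervals for genotyping can be sketched as below. This is not the custom script from S1 File; the function name and the example chromosome name are hypothetical, and the output uses the 1-based, inclusive `chrom:start-end` interval notation that GATK accepts.

```python
def make_intervals(chrom_lengths, window=10_000_000):
    """Split each chromosome into 1-based, inclusive intervals of <= window bp."""
    intervals = []
    for chrom, length in chrom_lengths.items():
        start = 1
        while start <= length:
            end = min(start + window - 1, length)  # last interval may be shorter
            intervals.append(f"{chrom}:{start}-{end}")
            start = end + 1
    return intervals
```

For a hypothetical 25 Mbp chromosome named `chr1`, this yields three intervals, the last of which covers the remaining 5 Mbp.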

Variants in the VCF file were filtered using the GATK command VariantFiltration (--filterExpression “QD < 2.0 || FS > 60.0 || SOR < 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0”). This filtered VCF file was then compressed and indexed with Bgzip and Tabix and was used as a training set. SNPs from Veale and Russello (2017), Nichols et al. (2016), and Larson et al. (2016) were used as truth sets (as described in the GATK documentation, a truth data set is a set of variants assumed to be real) [17,19,20]. To generate the truth sets, sequences and SNP positions were manually extracted from supplemental files and converted to 1-based positions (instead of 0-based), and mapped to the genome with bwa mem (these SNPs previously did not have genomic positions, but positions relative to a sequenced read). The Samtools sorted sam file was then processed by snp-placer [76] to locate the genomic position of these SNPs (this included a filter step and soft-clipped alignments were removed with command-line tools). The VCF file produced by snp-placer was manually filtered for missing locations and alignment quality scores below 10, and then used to identify the same positions in the unfiltered VCF file produced by GATK using the vcf-isec command from VCFtools (parameters used: -n = 2 -f). The resultant VCF files (for each study compared) contained the intersection of the truth SNPs and the unfiltered variants.

Variant recalibration was performed using the GATK function VariantRecalibrator with the training and truth datasets (parameters used: -resource:training,prior = 12.0, -resource:training,truth,prior = 15.0, -mode SNP, -an QD, -an MQ, -an MQRankSum, -an ReadPosRankSum, -an FS, -an SOR, -an DP, -an InbreedingCoeff). After variant recalibration the unfiltered variants were filtered using the ApplyRecalibration function of GATK (--ts_filter_level 99.5 -mode SNP).

Three additional filters were applied depending on the type of analysis used to evaluate the variants/populations. The first filter with VCFtools (parameters: --maf 0.05, --max-alleles 2, --min-alleles 2, --max-missing 0.9, --remove-filtered-all, --remove-indels) was used in all analyses. This first filter removed indels and variants that were not biallelic, had missing data in more than 10% of the individuals, were marked as failed by GATK, or had a minor allele frequency below 0.05. The second filter additionally removed variants with allelic imbalance (with ratios of the lower count allele to higher count allele less than 0.2, referred to as allele balance filter later) in any of the individuals using custom scripts (S1 File). The final filter additionally removed variants that were in high linkage disequilibrium (LD), which might bias phylogenetic analyses. BCFtools [77] was used to filter variants within a 20 kbp window for high LD, and only allowed two variants within that window to remain with high LD (parameters: +prune, -w 20kb, -l 0.4, -n 2). The number of missing genotypes and average depth was calculated for each individual from the variants after the third filter using a custom script (S1 File).
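The logic of the first two filters can be illustrated with a short sketch. It is not the VCFtools implementation or the authors' custom script: the function names and the in-memory genotype representation are hypothetical, and applying the allele balance test only to heterozygous calls is an assumption, since the text does not say which genotypes were tested.

```python
def passes_first_filter(genotypes, min_maf=0.05, min_called=0.9):
    """genotypes: one entry per individual, None if missing, otherwise a
    tuple of allele indices such as (0, 1)."""
    called = [g for g in genotypes if g is not None]
    # At least 90% of individuals must have a genotype call.
    if len(called) < min_called * len(genotypes):
        return False
    alleles = [a for g in called for a in g]
    # Only the reference (0) and one alternate allele (1) may be observed.
    if set(alleles) != {0, 1}:
        return False
    alt_freq = alleles.count(1) / len(alleles)
    return min(alt_freq, 1 - alt_freq) >= min_maf


def passes_allele_balance(het_allele_depths, min_ratio=0.2):
    """het_allele_depths: (ref_depth, alt_depth) for each heterozygous call.
    The variant fails if any individual is below the ratio threshold."""
    for ref_depth, alt_depth in het_allele_depths:
        low, high = sorted((ref_depth, alt_depth))
        if high == 0 or low / high < min_ratio:
            return False
    return True
```

A variant would be retained for the second filter set only if both functions return True for it.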

Clustering individuals (population stratification)

Three methods were employed to cluster the sockeye and kokanee samples: 1) discriminant analysis of principal components (DAPC) [78,79], 2) Admixture (model-based estimation) [80], and 3) a maximum likelihood analysis (phylogenetic tree) [81]. To reduce the effects of high LD, variants that had been filtered for LD were used in the three clustering methods. DAPC was completed in R with the adegenet [79] and vcfR [82] libraries. Clusters were identified using the find.clusters function (the cluster with the lowest Bayesian information criterion was chosen, 3 in this case), and the optimum number of principal components was found using the optim.a.score function (6 were chosen). Both eigenvalues were retained for the discriminant analysis.

For the admixture analysis, chromosome names were changed in the vcf file using a custom script (S1 File), and the vcf file was converted to an appropriate format using PLINK v1.9 [83,84] (parameters: --double-id, --chr-set 29 no-xy). A cross-validation analysis using the admixture software pointed to a cluster of three having the lowest error, and so a K of three was chosen using default settings. Admixture plots were created in R using the ggplot2 and reshape2 [85] libraries.

The maximum likelihood tree was generated with snpPhylo with a bootstrap value of 1000 and filtering turned off as the data had already been filtered for LD (parameters: -B 1000 -r -m 0.0 -M 0.0). The tree was visualized using the interactive tree of life software [86]. The colour of the groups was chosen to match the DAPC analysis.

Chromosomal variation underlying population structure

To identify the regions of the genome underlying population structure, eigenGWA (eigen genome wide association) [87] was performed using the egwas command from the GEAR software [88]. EigenGWA identifies which regions of the genome are associated with the given eigenvalues and corrects for genetic drift (via the genomic inflation factor) to identify ancestry informative variants/markers. In this case, the LD1 (similar to PC1 from a PCA) values from the DAPC analysis were used in the eigenGWA. Genomic associations were identified in a pairwise fashion between the three clusters found in the DAPC analysis using variants that were not filtered for LD (second filter). A Bonferroni correction was used to limit false positives (ɑ = 0.01 before correction) and peaks were examined only if there were multiple significant variants found in the peak to limit false-positive associations and uncover only the most robust associations (a minimum of five significant variants within 100 kbp of each other). Spurious alignments might cause a single or even a few variants to appear highly significant, but they may not show LD to other proximal variants, which would be expected at short distances.
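The significance criteria described above (a Bonferroni-corrected threshold, then a requirement of at least five significant variants close together) can be sketched as follows. The function name and input layout are hypothetical, and interpreting "within 100 kbp of each other" as a chain in which each significant variant lies within 100 kbp of the previous one is an assumption.

```python
def significant_peaks(variants, alpha=0.01, min_variants=5, max_gap=100_000):
    """variants: list of (chrom, pos, p_value) for every variant tested.
    Returns (chrom, start, end, n_significant) for each retained peak."""
    if not variants:
        return []
    threshold = alpha / len(variants)  # Bonferroni correction
    hits = sorted((c, p) for c, p, pv in variants if pv < threshold)

    peaks, cluster = [], []

    def flush():
        # Keep the current cluster only if it has enough significant variants.
        if len(cluster) >= min_variants:
            peaks.append((cluster[0][0], cluster[0][1], cluster[-1][1], len(cluster)))

    for chrom, pos in hits:
        if cluster and (chrom != cluster[-1][0] or pos - cluster[-1][1] > max_gap):
            flush()
            cluster = []
        cluster.append((chrom, pos))
    flush()
    return peaks
```

Isolated significant variants, which might reflect spurious alignments, are dropped because they never form a cluster of the required size.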

The admixture ancestry values from Fraser River drainage sockeye salmon (n = 14) and kokanee (n = 12) were also analyzed using eigenGWA. This was done because there was an obvious genetic divergence between the two groups seen from the admixture analysis. The same criteria were used for significance as other eigenGWAs. An eigenGWA was also used to identify loci responsible for a latitudinal cline seen from admixture ancestry values. Again, the same significance criteria were used as above.

Linkage disequilibrium (LD)

LD (r2) was identified between every variant of a chromosome using VCFtools with the allele balance filtered variants with r2 minimum values of 0.5 (parameters: --geno-r2, --min-r2 0.5). They were visualized in R using the scales [89], ggplot2, and plotly [90] libraries. For regions of interest found from an eigenGWA analysis, the variant with the lowest p-value was used to identify all the markers in that region in LD (minimum r2 of 0.3) using VCFtools. This was done to simply identify all variants that were in LD with significant GWA and to be able to visualize the genomic distance of this LD block. These variants were extracted from the vcf file using the --positions option of VCFtools. These variants were then visualized with IGV [91].
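As an illustration of the statistic, a genotype-based r2 between two variants can be computed as the squared Pearson correlation of alternate-allele counts (0, 1, or 2) across individuals. The sketch below is a simplified stand-in rather than the VCFtools code; the function name and the use of None for missing genotypes are assumptions.

```python
def geno_r2(g1, g2):
    """Squared correlation between two vectors of alternate-allele counts.
    Individuals with a missing genotype (None) at either variant are skipped."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return float("nan")
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in pairs)
    var1 = sum((a - mean1) ** 2 for a, _ in pairs)
    var2 = sum((b - mean2) ** 2 for _, b in pairs)
    if var1 == 0 or var2 == 0:
        return float("nan")  # undefined when a variant is monomorphic
    return cov * cov / (var1 * var2)
```

Two variants whose genotypes always change together give a value of 1, while independent variants give values near 0, so thresholds such as 0.5 or 0.3 select progressively weaker associations.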

Genome-wide association (GWA)

Association tests (logistic regression) were performed using the PLINK v1.9 software with sex and sex-determining gene presence/absence as the traits under investigation. For both phenotypes, the variant set filtered for allele balance was employed, and the DAPC groupings and eigenvalues were used as covariates to account for population structure (parameters: --allow-extra-chr, --logistic, --allow-no-sex, --covar). The sex-determining gene, sdY [26], was scored manually (present/absent) from alignments produced for variant calling in IGV for all the individuals. A Bonferroni correction was used to control false positives (ɑ = 0.01).

An association test (logistic regression) was also used to identify regions of the genome associated with ecotype (sockeye salmon vs. kokanee). DAPC values were again used to control for population structure. SNPs with the least stringent filtering were used in this analysis. A permutation test with 1000 permutations was used to identify significance (ɑ = 0.01, with 5 significant variants within 100 kbp of each other).

Finally, an association test was used to identify an alternative sex determination gene(s) for upper Columbia River kokanee that were sdY-negative using the variants with the least stringent filtering. As only 31 individuals were used in this analysis and over 4 million variants were interrogated, it was expected that no variants would pass multiple testing correction. Only the peak with the lowest p-values (with more than five proximal variants with low p-values) is shown for hypothesis generation and for a future candidate gene approach to follow. A permutation test with 1000 permutations was used to assess significance (ɑ = 0.01).

Individual genomic diversity

Runs of homozygosity were identified from the variants that had been filtered for allele balance using PLINK v1.9 (parameters: --homozyg). The total lengths of the runs of homozygosity were plotted in LibreOffice Calc, and tested for significance in R using the aov and TukeyHSD functions. The number of heterozygous genotypes and alternative homozygous genotypes per individual were counted from the variants with minimal filtering using a custom script (S1 File). Heterozygotes per kbp was calculated as the number of heterozygous genotypes divided by the total nucleotides in the genome (1,927,125,257) multiplied by 1 kbp. The heterozygosity ratio was calculated as the number of heterozygous genotypes divided by the number of alternative homozygous genotypes [92,93].
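The two per-individual diversity measures defined above reduce to simple ratios, sketched here for clarity. The function names are hypothetical and the counts in the usage note are made-up illustration values, not results from the study; only the genome size is taken from the text.

```python
GENOME_SIZE_BP = 1_927_125_257  # total nucleotides in the genome, from the text


def heterozygotes_per_kbp(n_het, genome_size=GENOME_SIZE_BP):
    """Heterozygous genotypes per 1,000 bp of genome."""
    return n_het / genome_size * 1000


def heterozygosity_ratio(n_het, n_alt_hom):
    """Heterozygous genotypes divided by alternative homozygous genotypes."""
    return n_het / n_alt_hom
```

For instance, a hypothetical individual with 300 heterozygous and 200 alternative homozygous genotypes would have a heterozygosity ratio of 1.5.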

Orthology between species

Orthologous genes were identified between sockeye salmon and two other salmon species, coho and Chinook salmon using the methodology of Christensen et al. (2018) [60]. Briefly, the sockeye salmon genome assembly was individually aligned to the coho and Chinook salmon genome assemblies with BLAST (-task megablast, -evalue 0.000001, -max_target_seqs 3, -max_hsps 20000, -outfmt 6, -word_size 40, -perc_identity 96, -lcase_masking, -softmasking false). The resulting alignment files were filtered with the Compare_Genome_2_Other_Genome_blastfmt6_ver1.0.py (-minl 0.01 -minal 30000) and Filter_Linear_Alignment.v1.0.py (default) Python scripts [60]. NCBI annotated proteins were downloaded for each genome assembly and the sockeye salmon proteins were aligned against the other two protein data sets with BLASTP (-max_target_seqs 3, -max_hsps 20, -evalue 0.01, -outfmt 6). The protein alignment files were then filtered using the Filter_Alignments_Blast_Fmt6_Protein_ver1.0.py script (-min_per 80, -min_aln_per 80) and orthologs were identified between species using the Orthology_Between_Genomes.v1.1.py Python script [60]. If an orthologous gene was not identified between sockeye salmon and one of the other two species, it was considered missing, but this could occur for several other reasons (e.g. an annotation error in a region of the genome, paralogs obscuring clear orthologous assignment, and poor genome assembly), besides an actual gene loss or gain between species. Missing orthologs were identified with a script (S1 File) and plotted by their genomic positions in R using the ggplot2 library. This was done to identify if there were any regions with an increase in missing orthologs that might indicate a problem with the corresponding region of the sockeye salmon genome assembly.

Results

Genome assembly

Before trimming, genome coverage with mate-pair and paired-end Illumina data was ~159x assuming a genome size of 2.4 Gbp. After trimming adaptors and low-quality reads, the coverage dropped to ~87x. PacBio data coverage was ~22x with an average read length of ~7276 bp. The genome assembly was submitted to the NCBI (GenBank assembly accession: GCA_006149115.1, BioProject: PRJNA530256). The metrics reported on the NCBI website were: total sequence length ~1.9 Gbp, number of scaffolds: 38,027, scaffold N50 ~1 Mbp, contig N50 ~330 kbp. The BUSCO analysis identified 88.8% complete, 2.9% fragmented, and 8.3% missing BUSCOs or genes. Like other salmonid genomes [57,58,60], the sockeye salmon genome has extensive homology between duplicated chromosomes (i.e. homeologous regions) generated from the salmonid-specific whole-genome duplication [94] (Fig 1). Some regions retain high nucleotide sequence similarity (> 90%) between homeologous regions after the roughly 90 million years since the genome duplication in an ancestral species [94,95] (Fig 1). Qualitatively, these high sequence similarity homeologous regions appear reduced in length when compared to other salmonids [57,58], but this reduction likely reflects issues with the current assembly quality rather than differences between species (discussed below). It is often difficult to distinguish very high similarity sequences during assembly and these regions tend to remain in small and unplaced contigs.
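For readers unfamiliar with the assembly metrics quoted above, the sketch below shows how fold coverage and N50 are conventionally computed. It is a generic illustration with hypothetical function names, not the procedure used by the NCBI or the authors.

```python
def coverage(total_bases, genome_size):
    """Fold coverage: sequenced bases divided by the assumed genome size."""
    return total_bases / genome_size


def n50(lengths):
    """Length L such that sequences of length >= L contain at least half of
    the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

Because coverage depends on the assumed genome size, the same read set gives a different fold coverage under a 2.4 Gbp assumption than under a 2.6 Gbp one.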

Fig 1. Sockeye salmon circos plot.

Interior links were generated by SyMap between homeologous regions (only blocks larger than 2 Mbp are shown). Circle A) Larson et al. (2016) [62] female genetic map markers plotted against the corresponding chromosomal positions. Centromeres from Larson et al. (2016) are shown in blue below the genetic map. Circle B) The percent identity between homeologous regions in 1 Mbp intervals and weighted by alignment length (scale: 75–100%). Percent identities above 90% are highlighted with orange. Circle C) The fraction of repetitive sequences in 1 Mbp windows ranging from zero to one (fractions above 0.65 are shown in orange). Circle D) Log-transformed counts of variants with LD (r2 ≥ 0.5) to other markers ≥ 100 kbp away in 1 Mbp windows. Window counts between 100–999 are green, while those with counts greater than 999 are orange.

Regions of high LD were most commonly found around centromeres (Fig 1). Other regions with high LD were found on the sex chromosomes 9a and 9b. There are also large regions with high LD on LG13, LG22, and LG27 which do not appear to be related to centromeres (Fig 1).
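The long-range LD counts plotted in Circle D of Fig 1 can be illustrated with a small Python sketch. It assumes that r2 is estimated as the squared Pearson correlation of genotype dosages (0, 1, or 2 copies of the alternative allele), which is one common estimator; the estimator actually used in this study is not stated in this section. The function names and toy genotypes are hypothetical.

```python
from statistics import mean


def genotype_r2(a, b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic site carries no LD information
    return cov * cov / (var_a * var_b)


def count_long_range_ld(variants, min_r2=0.5, min_dist=100_000):
    """Count variants that have r2 >= min_r2 with at least one other
    variant located >= min_dist bp away on the same chromosome.

    variants: list of (position, dosages) tuples.
    """
    count = 0
    for i, (pos_i, geno_i) in enumerate(variants):
        for j, (pos_j, geno_j) in enumerate(variants):
            if i == j or abs(pos_i - pos_j) < min_dist:
                continue
            if genotype_r2(geno_i, geno_j) >= min_r2:
                count += 1
                break
    return count


# Toy data: the first two variants are perfectly correlated and 149 kbp apart;
# the third is uncorrelated with the first and too close to the second.
toy_variants = [
    (1_000, [0, 1, 2, 0, 1, 2]),
    (150_000, [0, 1, 2, 0, 1, 2]),
    (160_000, [2, 0, 1, 1, 0, 2]),
]
toy_count = count_long_range_ld(toy_variants)
```

The all-pairs loop is quadratic in the number of variants, so a real analysis would restrict comparisons to a window or use a dedicated tool.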

Gene annotation

Over 2 billion reads from 18 tissues were submitted to GenBank for gene annotation of the genome assembly. Standardized NCBI annotation using our submitted RNA-seq data, the reference genome, and other publicly available RNA sequence data was used to identify 38,468 protein-coding genes, 5,185 non-coding genes, and 64,416 fully supported mRNAs (from a total of 9.4 billion reads) [96]. This is fewer than the 42,483 protein-coding genes identified in Chinook salmon (O. tshawytscha) (8 billion reads) [97], the 42,884 protein-coding genes in rainbow trout (O. mykiss) (7.4 billion reads) [98], and the 41,269 protein-coding genes in coho salmon version 2 (O. kisutch) (7.4 billion reads) [99]. The exact reason for these differences is not known, but it is likely related to a quality difference between the genome assemblies or to differences in RNA-seq data sets. A similar difference in protein-coding gene counts was observed between versions 1 and 2 of the coho salmon genome assembly (version 1: 36,425 vs. version 2: 41,269) and likely reflects quality differences rather than any biological reason. This first atlas of annotated genes is still a valuable resource as it provides an important data set for linking genetic variants, phenotypes, and genes.

Variant calling

A total of 25,728,393 variants in 140 individuals were filtered to remove indels, variants with more than two alleles, and variants with maf < 0.05, and to retain only variants genotyped in more than 90% of samples, leaving 4,533,143 variants. After the second filter (for allele balance), 564,684 variants remained, and after the third filter (LD filter) there were 124,663 variants. The number of uncalled variants and the average depth of the variants were calculated from the variants remaining after the third filter. There was an average of 1,866 missing variants per individual with a standard deviation of 2,021. The average depth per variant was 11.10x with a standard deviation of 6.79x.
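The first set of site-level filters described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the software or commands used in the study: the function name, the genotype representation (tuples of allele indices, `None` for an uncalled sample), and the treatment of any non-single-base allele as an indel are all hypothetical choices made for the example. The allele balance and LD filters are not shown.

```python
def passes_site_filters(ref, alts, genotypes, min_maf=0.05, min_call_rate=0.90):
    """Return True if a variant site survives the first-pass filters.

    ref:       reference allele string, e.g. "A"
    alts:      list of alternative allele strings, e.g. ["G"]
    genotypes: one entry per sample; (0, 1)-style allele-index tuples,
               or None when the sample was not genotyped at this site
    """
    if not genotypes:
        return False
    if len(alts) != 1:
        return False  # more than two alleles
    if len(ref) != 1 or len(alts[0]) != 1:
        return False  # indel (assumption: any non-single-base allele)
    called = [g for g in genotypes if g is not None]
    if len(called) / len(genotypes) <= min_call_rate:
        return False  # must be genotyped in MORE than 90% of samples
    alt_freq = sum(sum(g) for g in called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf  # sites with maf < 0.05 are removed


# Toy sites with 10 samples each (hypothetical data).
one_het = [(0, 1)] + [(0, 0)] * 9           # maf = 1/20 = 0.05
one_missing = [None] + [(0, 1)] * 9         # call rate = 0.9, not > 0.9

keep_snp = passes_site_filters("A", ["G"], one_het)
drop_low_call = passes_site_filters("A", ["G"], one_missing)
drop_indel = passes_site_filters("A", ["AT"], one_het)
drop_multiallelic = passes_site_filters("A", ["G", "T"], one_het)
```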

Clustering individuals (population stratification)

DAPC and admixture optimal clustering supported three groups, and the phylogenetic tree appeared to have three main clusters (Figs 2 and 3). DAPC group 1 was composed of kokanee from the upper Columbia River drainage (Kootenay Lake, Arrow Lake, Whatshan Reservoir, and Koocanusa Reservoir). Please note that the samples from the Clearwater Trout Hatchery, also in DAPC group 1, were of Columbia River ancestry. This group was well supported in every clustering technique. DAPC group 2 was composed of fish from multiple drainages on the BC coast and interior and appears to exhibit clinal variation in admixture values from DAPC groups 1 and 3 (Fig 3). This group also displays variation between sockeye and kokanee in proximal locations from the Fraser River drainage based on admixture ancestry values (Fig 3). DAPC group 3 included all populations either north or west of the Babine River in central British Columbia. The largest uncertainty between the phylogenetic tree and the DAPC analysis was where to differentiate between groups 2 and 3 (Fig 2).

Fig 2. Clustering sockeye and kokanee individuals (population stratification).

This figure shows the clustering of individuals based on: A) DAPC and B) maximum likelihood analysis (phylogenetic tree). Both analyses began with the same 124,663 variants, which had already been filtered for common factors (e.g. maf 0.05) and were specifically filtered for linkage disequilibrium (LD) to reduce the effects of a single genomic location in high LD overwhelming all other signals. A) DAPC analysis clustering with the optimal group number [3] and optimal number of PCAs chosen [6]. The axes represent the first two linear discriminants (LD1 and LD2). The gray, coral (red/orange), and teal (blue) colours correspond to the different clusters. B) An unrooted maximum-likelihood phylogenetic tree, with bootstrap values (based on 1,000 bootstraps) shown as green dots with the larger dots representing greater bootstrap values (min: 0.1 max: 100). Only 7,357 variants remained after default SNPhylo filters. Colours are consistent with the DAPC analysis (DAPC group names shown for comparison). Please note that in each of the Klukshu and Hansen groups, one individual from the other group was found in the grouping and likely represents a switched sample (see S1 Table). Also, there is one Takla kokanee in the Hansen grouping. Sockeye are represented by (S) and kokanee are represented by (K).

Fig 3. Population stratification relative to location.

This figure shows the map of the sample sites and the admixture and DAPC clustering analyses. A) The sample site locations with the DAPC assignments overlaid. Specific locations from group 2 are shown with lines. Red lines represent kokanee samples. Only group 2 and 3 body of water names (with drainage in parentheses) are displayed for clarity. The insert shows greater detail of Skeena, Bella Coola, Fraser, and Columbia River bodies of water and is linked by lines to Fig 3B. B) An admixture analysis with k = 3. The colours are consistent with the DAPC analysis and DAPC groups are shown. From the DAPC group 2, there appears to be differentiation between proximal sockeye and kokanee samples based on the admixture ancestry values (in the Fraser River drainage). There also appears to be a latitudinal cline of the gray admixture ancestry values for some sites of DAPC groups 2 and 3.

Chromosomal variation underlying population structure

To identify regions of the genome differentiating groups 2 and 3, an eigenGWA was performed using the LD1 values from the DAPC analysis. The largest and statistically significant signal came from a region of chromosome/linkage group 26 (NC_042560.1 at 26,695,436 bp) that contains a cluster of immunoglobulin heavy chain genes (Fig 4) (referred to as immunoglobulin heavy chain variable gene cluster). This region shows evidence for large haplotypes, with many individuals being homozygous for a particular haplotype (Fig 4C). However, some individuals are heterozygous for these entire blocks.

Fig 4. Genomic regions associated with eigenvalues from DAPC groups 2 and 3.

A) A Manhattan plot of an eigenGWA using the LD1 values from the DAPC analysis after accounting for the genomic inflation factor. Only the individuals from groups 2 and 3 were included in this analysis to specifically find genomic regions underlying clustering differences between DAPC groups 2 and 3. The horizontal blue line represents the Bonferroni correction at the 0.05 alpha level (variants interrogated = 450,868) and the red line at the 0.01 alpha level (0.01 was chosen as the level of significance for this study). Only peaks with at least 5 significant variants within 100 kbp of each other were considered. B) A screenshot from IGV showing genetic markers in LD (r2 ≥ 0.3) with the variant with the lowest p-value from the eigenGWA (distance ~200 kb). The dark blue colour represents homozygous reference variants (HomRef), the green colour represents homozygous alternative variants (HomVar), and the light blue colour represents heterozygous variants (Het). The locations of the immunoglobulin heavy chain genes and of the variant with the lowest p-value are shown near the bottom of Fig 4B. Below that are examples of the haplotypes seen in the data (see Fig 4B legend: dark blue, homozygous for the reference allele; green, homozygous for the alternative allele; light blue, heterozygous). These example data are shown to illustrate that this region is often homozygous for all the alleles in this block.
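The significance lines and the peak criterion in the Fig 4 legend can be made concrete with a short Python sketch. The Bonferroni cutoff is simply the alpha level divided by the number of variants interrogated (450,868). The peak rule below is one possible reading of "at least 5 significant variants within 100 kbp of each other", namely that five significant variants must fall inside a single 100 kbp span; the function names and this interpretation are assumptions made for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-variant p-value cutoff controlling the family-wise error rate."""
    return alpha / n_tests


def has_supported_peak(significant_positions, min_variants=5, window_bp=100_000):
    """True if at least `min_variants` significant variants on one
    chromosome fall within a span of `window_bp` base pairs."""
    pos = sorted(significant_positions)
    for i in range(len(pos) - min_variants + 1):
        if pos[i + min_variants - 1] - pos[i] <= window_bp:
            return True
    return False


# Thresholds for the 450,868 variants interrogated in the eigenGWA.
blue_line = bonferroni_threshold(0.05, 450_868)  # roughly 1.1e-7
red_line = bonferroni_threshold(0.01, 450_868)   # roughly 2.2e-8

# Toy positions (hypothetical): a tight cluster versus scattered hits.
clustered = has_supported_peak([10_000, 20_000, 35_000, 60_000, 90_000])
scattered = has_supported_peak([0, 200_000, 400_000, 600_000, 800_000])
```

On a Manhattan plot these cutoffs appear on the -log10 scale, so the red line sits above the blue line.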

The consistent haplotypes and complete heterozygous haplotypes are suggestive of an inversion similar to what has been seen in other salmonid species [100,101]. While there was not an obvious signal of an inversion from aligned reads in this region (e.g. paired-end reads align in the same orientation with a large insert size), there was an excess of paired-end reads with the same orientation in this region relative to the surrounding sequence, possibly indicating gene rearrangements rather than an inversion.

Another statistically significant peak (with at least five variants in peak above ɑ = 0.01) was found on chromosome 16 (NC_042550.1 at 15,084,422 bp) (Fig 4). Calcium channel, voltage-dependent, T type, alpha 1G subunit was identified as a candidate gene for this association.

Eight peaks were significant at ɑ = 0.01 (after Bonferroni correction and with at least five significant variants in the peak) from the eigenGWA analysis between groups 1 and 2 (Table 1, S2 Fig), and ten were found between DAPC groups 1 and 3 (Table 1, S3 Fig). Candidate genes found in these comparisons include: talin 2, “calcium channel, voltage-dependent, T type, alpha 1G subunit”, regulator of G-protein signaling 6, dipeptidyl-peptidase 6a, Mtr4 exosome RNA helicase, “aldehyde dehydrogenase 9 family, member A1a, tandem duplicate 1”, GREB1-like protein, and lin-28 homolog B (Table 1).

Table 1. Summary of significant eigenGWAS underlying population structure.

DAPC 1 vs. 2 DAPC 1 vs. 3 DAPC 2 vs. 3 Chromosome/Scaffold Position Candidate Gene Symbol Accession
+ + 9a 20731646 TLN2 LOC115134248
+ 10 61506478 dcps dcps
+ + 12 4475796 lncRNA LOC115138883
+ + 16 15084422 CACNA1G cacna1g
+ 18 31984093 RGS6 LOC115146242
+ 22 5021144 DPP6A dpp6a
+ + 22 72378343 MTREX LOC115105969
+ + 24 60038155 ALDH9A1A.1 aldh9a1a.1
+ 25 29879399 BANP LOC115109541
+ + 26 26674671 IgHC* NA
+ 27 20438447 EMID1 LOC115111655
+ 9b 29720123 GREB1L LOC115114584
+ NW_021791234.1 120629 uncharacterized** LOC115118739
+ NW_021786671.1 58477 LIN28B LOC115116852

*Immunoglobulin heavy chain variable gene cluster.

**Similar to immunoglobulin gene.

Closest candidate gene to lowest p-value variant, but not well supported.

Some of these associations may be related to inversions found between populations because large haploblocks have been identified in some of these regions. For example, the variant with the lowest p-value in the chromosome 24 peak was found in a haploblock larger than 1 Mbp (S4 Fig). No obvious inversions were seen from aligned reads to this region of the genome. Other mechanisms can generate large haploblocks (e.g. selection, reduced recombination, and inbreeding) and further investigation will be needed to differentiate these mechanisms.

There were nine peaks identified from the eigenGWA comparing the Fraser River sockeye salmon and kokanee (Table 2, S5 Fig). Four candidate genes identified from this analysis were: complement C3-like (LOC115103919), carboxypeptidase A6 (cpa6), cone cGMP-specific 3',5'-cyclic phosphodiesterase subunit alpha'-like (LOC115106380), and SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily A-like protein 1 (LOC115126495). No significant peaks were identified from the latitudinal cline eigenGWA.

Table 2. Significant eigenGWAS underlying Fraser River kokanee and sockeye ecotypes.

Chromosome/Scaffold | Position | Candidate Gene Symbol | Accession
9a | 27555132 | SEPT7 | LOC115134365
20 | 20896927 | PLXNA2* | LOC115102444
21 | 10837054 | C3* | LOC115103919
22 | 5976979 | CPA6* | cpa6
22 | 51934852 | P2RX5 | LOC115105525
23 | 6251055 | PDE6C | LOC115106380
25 | 30482158 | sspn | sspn
NW_021803831.1 | 67980 | unknown | NA
NW_021814461.1 | 17077 | SMARCAL1 | LOC115126495

Closest candidate gene to lowest p-value variant, but not well supported.

* Genes related to ammonia tolerance [102].

Genomic associations with kokanee ecotype

Ten loci were associated with the kokanee ecotype (i.e. comparing sockeye salmon and kokanee) (Table 3). With five of these loci it was difficult to identify candidate genes for the following reasons: 1) the association did not overlap with any annotated features, 2) there were no genes on the associated scaffold, 3) there were multiple possible candidate genes, or 4) the genotype information suggested that the association was an artifact of misalignment (personal observation). Five of the associations had a clear candidate gene: neuregulin 3, FKBP prolyl isomerase 6, delta-sarcoglycan-like, and two uncharacterized genes—one that had sequence similarity to an immunoglobulin and the other a non-coding RNA.

Table 3. Genomic locations of ecotype associations.

Chromosome/Scaffold | Position | Candidate Gene Symbol | Accession
3 | 52034002 | VTCN1 | vtcn1
4 | 52069691 | PHACTR4 | LOC115128539
7 | 37872810 | ncRNA | LOC115132798
10 | 5469301 | NRG3 | nrg3
12 | 41953339 | JAG2 | LOC115138579
22 | 47522975 | FKBP6 | fkbp6
22 | 50912490 | SGCD | LOC115105509
NW_021813758.1 | 2362 | uncharacterized (ncRNA) | LOC115125940
NW_021814090.1 | 26917 | uncharacterized (diverse immunoglobulin domain) | LOC115126197
NW_021817479.1 | 7319 | unknown | NA

Closest candidate gene to lowest p-value variant, but not well supported.

Sockeye salmon and kokanee sex chromosomes

GWAs were employed to better characterize regions of the genome responsible for sex-determination (Fig 5). Two peaks were observed for phenotypic sex (chromosomes 9a and 9b) and the sex-determining gene presence/absence GWAs (Fig 5). The associations of the sdY presence/absence analysis but not phenotypic sex reached significance after Bonferroni correction (ɑ = 0.01). These regions on the female sex-chromosomes (as this is a female genome assembly, only the female sex-chromosomes are shown) have extensive LD blocks (Fig 5D).

Fig 5. Sockeye salmon and kokanee sex chromosomes.

A) A GWA between all known male and female sockeye salmon and kokanee salmon with the DAPC LD1 values as covariates. The two peaks (Chr. 9a and 9b) did not reach significance. B) A GWA between all individuals scored for the presence or absence of the sex-determining gene (red line indicates significance threshold). C) A depiction of the proposed X1 (9b), X2 (9a), and Y chromosomes. D) IGV was used to visualize variants on chromosome 9a with only minimal filtering for known females (no upper Columbia River kokanee), males (no upper Columbia River kokanee), Freshwater Fisheries Society of British Columbia female kokanee, and Freshwater Fisheries Society of British Columbia male kokanee. E) On the left, LD plot of chromosome 9a from 4.6 Mbp to the end of the chromosome (~2 Mbp) with a minimum LD threshold of r2 = 0.5 for both plots. On the right, LD plot of chromosome 9b from 3.5 Mbp to 11.5 Mbp.

Another GWA was implemented to identify genomic locations for sex-determination in samples of kokanee from the upper Columbia River drainage (Kootenay Lake, Arrow Lake, Whatshan Reservoir, and Koocanusa Reservoir). These samples were missing the sex-determining gene in male samples. No significant peaks were detected from the GWA. The peak with the lowest p-value from this analysis (multiple variants around krüppel-like factor 5 on LG1 6,248,507 bp) is shown for hypothesis generation and follow-up testing (S6 Fig).

Individual genomic diversity

The total length of runs of homozygosity ranged from ~6.4 Mbp to ~1.4 Gbp (S7 Fig) with a median of ~35.5 Mbp. The individual with the greatest extent of runs of homozygosity was the artificially produced doubled haploid female. The only population that appeared to have markedly elevated total lengths of runs of homozygosity was from Cultus Lake (S7 Fig). Total lengths of runs of homozygosity from Cultus Lake samples were not significantly different with a one-way ANOVA test unless the doubled haploid individual was removed from the analysis (p < 0.001). Samples from all other bodies of water were significantly different from Cultus Lake samples (p < 0.001) using a post-hoc Tukey’s test (other comparisons were also significant, but the difference between Cultus Lake and all other bodies of water was the largest).

The average number of heterozygous genotypes per kbp was 0.67 and varied little between individuals (standard deviation = 0.08, S7 Fig). The heterozygotes/kbp statistic can vary depending on coverage and other factors that were not controlled for. The heterozygous ratio was similar to the heterozygotes/kbp statistic except for the dam of the doubled haploid used to sequence the reference genome (S7 Fig). This individual had far fewer alternative homozygous alleles, which inflated the heterozygous ratio. The dam of the individual used in the construction of the reference genome assembly is expected to have lower levels of alternate homozygous alleles as half of its genome was used to generate the reference genome assembly. Excluding this individual, the correlation coefficient between the heterozygous ratio and the heterozygotes/kbp statistic was 0.91.
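The behaviour of the two diversity statistics described above can be illustrated with a brief Python sketch. It assumes that the heterozygous ratio is the number of heterozygous genotypes divided by the number of alternative homozygous genotypes, which is consistent with the observation that fewer alternative homozygous alleles inflate the ratio, but the exact definition is given in the Methods rather than in this section. The function name and all counts below are hypothetical.

```python
def heterozygosity_stats(n_het, n_hom_alt, callable_bp):
    """Return (heterozygotes per kbp, heterozygous ratio) for one individual.

    n_het:       number of heterozygous genotypes
    n_hom_alt:   number of alternative homozygous genotypes
    callable_bp: number of base pairs over which genotypes were called
    Assumption: heterozygous ratio = n_het / n_hom_alt.
    """
    het_per_kbp = n_het / (callable_bp / 1000)
    het_ratio = n_het / n_hom_alt
    return het_per_kbp, het_ratio


# Two toy individuals with identical heterozygosity per kbp. The second has
# half as many alternative homozygous genotypes, as might be expected for a
# close relative of the individual the reference was built from.
typical = heterozygosity_stats(n_het=670, n_hom_alt=400, callable_bp=1_000_000)
reference_relative = heterozygosity_stats(n_het=670, n_hom_alt=200, callable_bp=1_000_000)
```

In this toy case the heterozygotes/kbp values are equal while the heterozygous ratio doubles, which mirrors the pattern reported for the dam of the doubled haploid.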

Orthology between species

We were able to identify 18,625 orthologs between sockeye and coho salmon and 18,143 orthologs between sockeye and Chinook salmon (S2 File). There were 17,800 coho salmon genes for which we were unable to confidently identify an ortholog, and these genes mapped disproportionately to the telomeric ends of coho salmon chromosomes (S8 Fig). The distal segments of salmon chromosomes are often more difficult to assemble due to an ancestral autopolyploid genome duplication in salmon [94,103]. For this reason, the quality of a genome assembly will likely suffer the most in these regions, which may explain the discrepancy in the number of annotated genes between assemblies. Higher-quality assemblies would be able to assemble these regions better and recover gene annotations missed in lower-quality assemblies.

Discussion

Genome assembly and variant calling

The present sockeye salmon reference genome assembly adds to the growing number of available Pacific salmon genome assemblies. This version of the sockeye salmon genome assembly has lower contiguity metrics and BUSCO scores than the previous Pacific salmon reference genome assemblies. It also has fewer annotated genes and fewer identified orthologs near the ends of chromosomes. Despite these differences, this genome assembly still provides an excellent resource. With this assembly and transcriptome, the NCBI was able to annotate 38,468 protein-coding genes, we were able to identify millions of nucleotide variants, and we were able to identify regions of the genome underlying population structure and the kokanee ecotype.

Clustering and chromosomal variation underlying population structure

There are three well-supported sample clusters, one of which was composed of kokanee samples from the upper Columbia River (Kootenay Lake, Arrow Lake, Whatshan Reservoir, and Koocanusa Reservoir). These results support a closer genetic relationship among kokanee in this region than between these kokanee and sympatric sockeye. This cluster was not found in some previous surveys [8,9,13], but was found in others [15,104]. In one report, the Okanagan population (composed of sockeye salmon and kokanee ecotypes) formed one cluster and all the other bodies of water in the upper Columbia formed a second [104]. This is consistent with results from the current work. In another study, kokanee from central British Columbia (Okanagan Lake, Kootenay Lake, South Thompson River, and middle Fraser River) clustered together while sockeye from this region clustered with sockeye and kokanee from other regions [15]. Taken together, these results suggest the existence of an upper Columbia kokanee population distinct from surrounding populations of kokanee and sockeye (discussed below). Beacham and Withler (2017) suggested that this monophyletic kokanee grouping from central British Columbia was an effect of deglaciation, through the formation of large lakes that connected the Columbia and Fraser Rivers [15,16,105]. We suggest that, in addition, there was likely isolation of these kokanee from other sockeye salmon and kokanee that survived in the Columbia River glacial refugia, which is reflected in DAPC grouping, admixture analysis, sex determination mechanism, and a number of genomic regions associated specifically with these kokanee. Isolation of upper Columbia River kokanee (specifically Kootenay River tributaries) has lasted for at least 80 years [106], but might have started up to 10,000 years ago [107,108].

The two remaining clusters are congruent with previous studies, which also found a northwestern and southern grouping [8,9]. Where one grouping ends and the other begins was inconsistent among studies and likely reflects the latitudinal cline seen in the admixture analysis of the current study as well as variation in which specific locations were included in each study. Wood et al. (1994) found a third group in their analysis by splitting the central British Columbia coast from the southern coast [9], which again likely reflects the latitudinal cline seen in the current study.

A major difference between the northwestern group and the southern group was the ancestry informative immunoglobulin heavy chain variable gene cluster found through an eigenGWA. There are two immunoglobulin heavy chain loci in Atlantic salmon and these duplicated loci likely reflect the salmonid-specific whole-genome duplication [94,109]. In sockeye salmon, the homeologous immunoglobulin heavy chain locus was found on chromosome 21 (~28.3 Mbp—28.6 Mbp) using alignments of the genes on chromosome 26. There were no obvious peaks found on chromosome 21 (Fig 4). As suggested in the Results section, this region might be an inversion between the northwestern and southern groups. Further investigation will be required to better understand this region of the genome and if it is an inversion. Recombination between heterozygous haploblocks is expected to be reduced at an inversion (reviewed in [110]). If the different haplotypes conferred local adaptation (northern vs. southern glacial refugia) that was fixed during glacial isolation, underdominance and lack of recombination may continue to help maintain population structure between the northwestern and southern groups [110]. Multiple pathogens are influenced by temperature and several outbreaks in fish farms have been related to temperature (reviewed in [111]). It may be that fish from the northwestern group were exposed to a different pathogenic community than those in the southern group due to temperature differences. This region of the genome may reflect that difference in pathogen communities and may confer a selective advantage based on location.

Another significant difference between the northern and southern groups was seen on chromosome 16, near the candidate gene calcium voltage-gated channel subunit alpha1 G. This gene belongs to the T-type calcium channel ɑ1 subunit of the voltage-gated calcium channels that activates in response to low voltage to mediate calcium influx into various excitable cell types [112]. In Atlantic salmon, the T-type and L-type voltage-gated calcium channels were shown to be involved in sperm motility [113,114]. Sperm related genes have been previously found to be under selection between human populations and may represent episodic diversifying selection driven by sperm competition [115].

There were many eigenGWAS peaks differentiating the group from the upper Columbia River drainage from the northwestern and southern groups. This divergence from the two other groups is likely driven by isolation of the upper Columbia River group for possibly up to 10,000 years. One of the main differences between this group and the two others was an apparent change in sex-determination (discussed below). Other candidate genes identified as ancestry informative between the upper Columbia River group and the other two groups included: talin 2 (TLN2), “calcium channel, voltage-dependent, T type, alpha 1G subunit” (CACNA1G), regulator of G-protein signaling 6 (RGS6), dipeptidyl-peptidase 6a (DPP6A), Mtr4 exosome RNA helicase (MTREX), “aldehyde dehydrogenase 9 family, member A1a, tandem duplicate 1” (ALDH9A1A.1), GREB1-like protein (GREB1L), and lin-28 homolog B (LIN28B).

Talin 2 is a large gene with multiple transcript isoforms with tissue specific expression, including a spermatid specific isoform [116]. This implicates two possible sperm related genes (talin 2 and calcium channel, voltage-dependent, T type, alpha 1G subunit) in divergence between the three sockeye salmon clusters, possibly associated with diversifying selection driven by sperm competition.

The regulator of G-protein signaling 6 gene is part of a family of regulators that modulate G protein-coupled receptor signaling pathways [117]. It is involved in many biological pathways from cardiovascular development to alcohol dependence in mammals [118,119]. It is unclear why this gene would be ancestry informative in sockeye salmon and kokanee.

The dipeptidyl-peptidase 6 gene helps to regulate cerebellar granule neuronal cell resting and firing patterns [120,121]. Dipeptidyl-peptidase 6 knockout mice had reduced memory and exhibited slower learning along with brain morphology differences [122]. Interestingly, dipeptidyl-peptidase 6 has previously been found as an outlier locus between a northern and a southern snail population, for which the author suggested that divergence could be a result of thermally driven selection [123]. More research would need to be conducted to test this hypothesis in sockeye salmon populations, but we note that sockeye salmon in different geographic regions are subjected to significant variation in thermal habitats.

Mtr4 exosome RNA helicase is a member of the nuclear exosome-targeting complex, which functions as a cofactor of the RNA exosome and monitors for aberrant noncoding RNAs [124,125]. The nuclear exosome-targeting complex has been found to respond to stress [126], and viruses can co-opt its machinery to initiate viral transcription and increase infectivity [127]. Variants of the Mtr4 exosome RNA helicase might influence infectivity of viruses, and the divergence at this locus may reflect different pathogenic environments and local adaptation of the upper Columbia River kokanee.

The aldehyde dehydrogenase 9 family, member A1a, tandem duplicate 1 candidate gene on chromosome 24 was found in a large haploblock and we suggest that this may be an example of an inversion or multiple inversions between populations. To our knowledge, this is the first putative large inversion found between sockeye salmon populations. Further research will be needed to confirm if this is an inversion or if other mechanisms have driven the relatively large haplotypes in this region of the genome.

GREB1-like protein is a well-known gene that influences the timing of some species of salmonids returning from the ocean to spawn (i.e. run-timing) [128,129]. GREB1-like protein is involved in kidney and genital tract development in mice and zebrafish [130,131], which may help explain the gene’s association with run-timing and consequently with maturation in salmon. In sockeye salmon, run-timing and maturation are linked, but maturation can occur in the ocean prior to migration or after migration to freshwater [132]. It may be that the upper Columbia River group is predominantly of one run-time, which would explain the strong association with this gene. Run-timing data were not collected for this study but should be examined in future investigations now that this association has been identified in sockeye salmon.

Lin-28 homolog B is a key regulator of stem cell self-renewal in many organisms, and the paralog of Lin-28 homolog B plays a further role in primordial germline stem cell development [133]. With its link to spermatogonia, it is possible that, like talin-2 and calcium channel, voltage-dependent, T type, alpha 1G subunit, Lin-28 homolog B could represent diversifying selection driven by sperm competition.

Genomic associations with kokanee ecotype

In the present study, 19 loci appeared to have an association with ecotype (sockeye vs. kokanee). Four of these associations have previously been identified in other studies [17,19,20]. As discussed in the Introduction, one genomic region that was identified in some association studies comparing sockeye salmon and kokanee was the LG12 region around the leucine-rich repeat-containing 9 gene (start position of gene: NC_042546.1 41,184,975 bp) [17,19,20,24]. This was previously found to be associated with shore vs. stream spawning and sockeye vs. kokanee ecotypes [17,19]. This gene is proximal to the six homeobox 6 gene (start position of gene: NC_042546.1 41,338,065 bp) that is a candidate gene under strong selection in differing Atlantic salmon (Salmo salar) populations [24]. From the current analysis, the closest ecotype association to this region was at LG12 (NC_042546.1) 41,938,693 bp, which is around 600 kbp away (135 kbp between the lowest p-value variant in the current study and the 22357 RAD tag from Veale and Russello (2017) [19]). Strong linkage disequilibrium, lack of phenotype information (i.e. spawning habitat information that was used in previous research), and far fewer genetic markers in previous studies might be responsible for the distance between peaks. The other three variants previously identified were located on LG20 (kokanee vs. sockeye, RAD tag = 24539 [17], distance from lowest p-value = 153 kbp), LG21 (kokanee vs. sockeye, RAD tag = 91349 [19], distance from lowest p-value = 93 kbp), and LG25 (kokanee vs. sockeye, RAD tag = 58166 [17], distance from lowest p-value = 6 kbp). These correspond to two candidate genes we are not confident about and complement C3-like (on LG21, discussed below). The association identified on LG20 was closest to a possible candidate gene, plexin-A2-like (LOC115102444), discussed below. The variant with the lowest p-value on LG25 was within the sarcospan gene (sspn, discussed below).
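The distances on LG12 quoted above can be checked with simple arithmetic. The Python sketch below uses only the three coordinates given in the paragraph; the variable and function names are illustrative.

```python
# Coordinates on LG12 (NC_042546.1) quoted in the paragraph above, in bp.
LRRC9_START = 41_184_975          # leucine-rich repeat-containing 9 gene
SIX6_START = 41_338_065           # six homeobox 6 gene
ECOTYPE_ASSOCIATION = 41_938_693  # closest ecotype association in this study


def distance_kbp(pos_a, pos_b):
    """Absolute distance between two positions on one chromosome, in kbp."""
    return abs(pos_a - pos_b) / 1000


to_six6 = distance_kbp(ECOTYPE_ASSOCIATION, SIX6_START)
to_lrrc9 = distance_kbp(ECOTYPE_ASSOCIATION, LRRC9_START)
```

Under this arithmetic the association lies about 601 kbp from the start of the six homeobox 6 gene and about 754 kbp from the start of the leucine-rich repeat-containing 9 gene, so the "around 600 kbp" figure matches the former.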

One of the conserved associations (found with the GWA between all sockeye and kokanee) in the current study was with the candidate gene neuregulin 3. Neuregulin 3 is a member of the epidermal growth factor-like signaling molecule family of genes and plays a role in the central nervous system development (reviewed in [134]). It has been associated with various behaviours and psychiatric disorders in humans and mice [134]. It is unclear at this time how this gene might influence ecotype or why it is associated with ecotype.

Delta-sarcoglycan, another conserved candidate gene (i.e. an association identified in the analysis between all sockeye and kokanee), is a component of the sarcoglycan subcomplex that stabilizes skeletal muscle fiber sheaths among other functions (reviewed in [135]). Variants of this gene are associated with muscular dystrophy in humans [136], and this gene plays a role in the longevity of the retina in mice [137]. The sarcospan gene, another candidate gene, associates with the sarcoglycan subcomplex [138]. Proteins from these genes are components of the sarcoglycan-sarcospan complex and are expressed in the retina, likely in Müller and ganglion cells [139]. How the sarcoglycan-sarcospan complex might relate to sockeye salmon ecotype requires a brief explanation of smoltification.

Smoltification is critical to ocean-going sockeye salmon as it prepares the parr (developmental stage before smoltification) for the challenges of a marine environment. Unlike most sockeye salmon, kokanee remain in fresh-water environments. Smoltification alters metabolism, osmoregulation, growth, colour, behaviour, and other traits of the young parr to prepare for marine environments (reviewed in [140]). In landlocked Atlantic salmon (analogous to kokanee), some populations still have many of the elements of smoltification while others have lost key components, including osmoregulatory ability, brain structure development, and metabolism [140]. Kokanee can go through smoltification, but like landlocked Atlantic salmon, it appears that the process has been altered from ocean-going sockeye salmon in at least one population [107]. Smoltification is energetically costly and is possibly maladaptive in landlocked salmon [140].

One of the environmental cues for smoltification comes from changes in day length detected by the retina and the light-brain-pituitary axis; changes to this system, such as continuous daylight, may interrupt smoltification [140]. Activation of the light-brain-pituitary axis by day length appears to be disrupted in some landlocked Atlantic salmon smolts [140]. One possible explanation for this disruption is that it reduces the chance of smoltification in these landlocked populations and offers an evolutionary advantage because energetic resources are not used in a process no longer needed. If the sarcoglycan-sarcospan complex is involved in maintaining the retina in sockeye salmon, then it may play an indirect role in smoltification and in a similar disruption of the light-brain-pituitary axis.

Another gene associated with sockeye ecotype for a subset of the samples was the cone cGMP-specific 3',5'-cyclic phosphodiesterase subunit alpha’ gene (PDE6C). This gene is a vital component of the phototransduction pathway (reviewed in [141]) and is differentially expressed in the brains of resident and migratory rainbow trout along with many other phototransduction genes [142]. It has also been previously identified as under local selection in Atlantic salmon [24] or associated with domestication (as indicated in Pritchard et al. 2018 [24,143]). Again, this gene may play a role in smoltification and the disruption of the light-brain-pituitary axis, which may be favourable for kokanee and landlocked populations.

Three candidate genes (complement C3, carboxypeptidase A6, and plexin A2) were previously identified as candidate genes for ammonia tolerance in orange-spotted grouper (Epinephelus coioides) [102]. All three of these candidate genes were also located near outlier loci identified in previous studies comparing kokanee and sockeye populations (plexin A2-like and complement C3 outlier loci were within 200 kbp of the current associations and the carboxypeptidase A6 outlier locus was within 1 Mbp, RAD tag: 14532 [17]). These candidate genes and their link with ammonia tolerance suggest that a driving force in divergence between Fraser River sockeye salmon and kokanee might be environmental ammonia. Sockeye salmon experience varied levels of environmental ammonia, which is expected to be highest in estuaries where run-off from agriculture and other human activities accumulates [144]. Sockeye salmon smolts, out of all the Pacific salmon, spend the least amount of time in estuaries, only around 5 days [145]. If estuary ammonia levels are responsible for the divergence between sockeye salmon and kokanee at these loci, it suggests that these are recent and human-induced adaptations. Further research will be needed to test this hypothesis.

Similar to differences between the DAPC groups, a sperm-related gene differentiates sockeye and kokanee ecotypes. The FKBP6 prolyl isomerase protein functions as a testis-specific component of the synaptonemal complex and is essential for sperm production in mice [146]. This region has previously been identified as an outlier locus between Okanagan Lake kokanee and Okanagan River sockeye salmon and also between Redfish Lake kokanee and sockeye (within 1 Mbp; RAD tag: 54636 [19], RAD tag: 47864 [17]). Again, this result suggests diversifying selection driven by sperm competition.

Sockeye and kokanee sex chromosomes

The Y-chromosome in sockeye salmon is a metacentric chromosome formed from the fusion of the acrocentric Y-chromosome with another acrocentric autosome [31]. The sex phenotype maps to the putative centromere of this fusion [27]. In females, the X1 and X2 chromosomes correspond to chromosomes 9b and 9a respectively [32]. Both female chromosomes show association with the sex phenotype, sdY (sex-determining gene in salmon [26] and sockeye salmon [27]), and have large haploblocks based on sex in this study. This is consistent with a Robertsonian translocation and reduced recombination common to sex-determining regions.

Male kokanee from the upper Columbia River drainage were sdY-negative (personal communication DS and RHD), which is consistent with the resequencing data aligned to the sdY sequence. It is also congruous with a previous study finding a high percentage of sdY-negative sockeye salmon males (~30%) in an upper Columbia River hatchery [27], and another study with similar findings in samples collected from Asian populations [147]. Atlantic salmon females have been identified with sdY, but likely have autosomal pseudocopies rather than a bona fide functional sex-determining copy (bioRχiv [148]) [33]. This has been noted in other salmonid species as well [147]. There were no significant associations from a GWA of the sex phenotype with any markers, but there was a ~100 kbp peak at the krüppel-like factor 5 gene (LG1 6,248,507–6,256,452 bp). This peak encompassed the markers with the lowest p-values with the sex phenotype. Future studies of mid or upper Columbia River sockeye salmon and kokanee will be needed to improve our understanding of alternative sex determination in salmonids. From this and a previous study, there is a clear alternative sex-determination mechanism to the canonical sdY pathway in a potentially large percentage of upper Columbia River sockeye salmon and kokanee.

The sdY gene arose from a gene duplication of an immune-related gene that diverged to be able to interact with the Forkhead box domain of the female-determining transcription factor and eventually disrupt female differentiation [149]. Thus, sdY was able to “hijack” a conserved sex differentiation cascade by interacting with one of the members of this cascade [149]. Krüppel-like factors (Krüppel-like factor 4 directly) have been shown to interact with this differentiation cascade [150,151] and they could have likewise been co-opted to serve as sex-determining genes in kokanee.

Individual genomic diversity

Overall, genomic diversity was similar among all samples except in a doubled haploid individual, the dam of the individual used in the construction of the reference genome assembly, and samples from Cultus Lake. The Cultus Lake population is considered endangered and has declined in abundance precipitously since the 1970s [152]. In a previous study examining six microsatellite markers, Cultus Lake samples had the lowest mean heterozygosity score (0.57) among Fraser River drainage samples, which otherwise had uniform heterozygosity scores [153]. This is consistent with the high total length of runs of homozygosity found in the Cultus Lake samples in this study, which was not seen in samples from any other body of water. These baseline metrics of genomic diversity may play an important role in discussions of conservation of threatened populations of sockeye salmon.
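The per-individual diversity metrics discussed here can be illustrated with a minimal Python sketch. It is not the authors' implementation: the genotype encoding (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternative), the function names, the definition of the heterozygous ratio as heterozygous calls divided by homozygous-alternative calls, and the naive run-of-homozygosity rule are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Call = Tuple[int, int]  # (position in bp, genotype code 0/1/2) -- assumed encoding


def diversity_metrics(calls: List[Call], genome_length_bp: int) -> Dict[str, float]:
    """Heterozygous genotypes per kbp and heterozygous ratio for one individual."""
    het = sum(1 for _, g in calls if g == 1)
    hom_alt = sum(1 for _, g in calls if g == 2)
    return {
        "het_per_kbp": het / (genome_length_bp / 1000),
        # Assumed definition: heterozygous calls / homozygous-alternative calls.
        "het_ratio": het / hom_alt if hom_alt else float("nan"),
    }


def total_roh_length(calls: List[Call], min_length_bp: int) -> int:
    """Naive total length of runs of consecutive non-heterozygous calls.

    A run ends at the first heterozygous call; only runs spanning at least
    min_length_bp (first to last call in the run) are counted.
    """
    total = 0
    run_start = None
    run_end = None
    for pos, genotype in sorted(calls):
        if genotype == 1:
            if run_start is not None and run_end - run_start >= min_length_bp:
                total += run_end - run_start
            run_start = None
        else:
            if run_start is None:
                run_start = pos
            run_end = pos
    if run_start is not None and run_end - run_start >= min_length_bp:
        total += run_end - run_start
    return total


# Hypothetical toy data for one individual on a 2 kbp region.
example_calls = [(100, 0), (200, 2), (300, 2), (400, 1), (500, 0), (1500, 2), (1600, 1)]
metrics = diversity_metrics(example_calls, genome_length_bp=2000)
roh_bp = total_roh_length(example_calls, min_length_bp=150)
```

Under these assumptions, an individual from a population with reduced diversity would be expected to show fewer heterozygous genotypes per kbp and a larger total run-of-homozygosity length than individuals from other sites.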

Conclusions

In this study, we generated a reference genome assembly for sockeye salmon, a useful RNA-seq data set for annotation of this and future sockeye salmon genome assemblies, and identified regions of the genome underlying population structure and sockeye salmon ecotype. We found that an immunoglobulin heavy chain locus was a major ancestry-informative region of the genome differentiating two of the three key genetic groups of sockeye salmon and kokanee from this study. We were able to identify regions of the genome that appear to differentiate sockeye salmon from kokanee, and these regions implicate ammonia tolerance and vision as possible indicators of ecotype. Finally, we were able to improve understanding of the sex chromosomes in sockeye salmon, and to confirm a novel sex determination mechanism in kokanee.

Supporting information

S1 Fig. Sample site locations of sockeye salmon and kokanee.

Map generated with the maps library in R [154].

(TIF)

S2 Fig. EigenGWA between the DAPC groups 1 and 2.

A Manhattan plot where eigenvalues from the DAPC analysis were used to identify regions of the genome with ancestry informative genes (e.g. under selection) between groups 1 and 2. The red horizontal line is the threshold of significance for ɑ = 0.01 after Bonferroni correction. The blue line is for ɑ = 0.05.

(TIF)

S3 Fig. EigenGWA between the DAPC groups 1 and 3.

A Manhattan plot where eigenvalues from the DAPC analysis were used to identify regions of the genome with ancestry informative genes (e.g. under selection) between groups 1 and 3. The red horizontal line is the threshold of significance for ɑ = 0.01 after Bonferroni correction. The blue line is for ɑ = 0.05.

(TIF)
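As a small illustration of how the threshold lines in Manhattan plots such as S2 and S3 Fig can be placed, the following Python sketch computes the height of a Bonferroni-corrected significance line on the -log10(p) axis. The function name and the number of variants in the usage lines are hypothetical and are not taken from this study.

```python
import math


def bonferroni_line(alpha: float, n_tests: int) -> float:
    """Return the -log10(p) height of a Bonferroni-corrected threshold.

    The per-test threshold is alpha / n_tests, so the line sits at
    -log10(alpha / n_tests) on a Manhattan plot.
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return -math.log10(alpha / n_tests)


# Hypothetical number of tested variants, for illustration only.
n_variants = 1_000_000
red_line = bonferroni_line(0.01, n_variants)   # stricter threshold
blue_line = bonferroni_line(0.05, n_variants)  # more permissive threshold
```

Because the stricter ɑ gives a smaller per-test threshold, the red line always lies above the blue line on the -log10(p) axis.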

S4 Fig. Chromosome 24 (NC_042558.1) genotypes and putative inversion(s).

A) On the left of this figure is the admixture ancestry plot with the DAPC group assignments. On the right is a screenshot of chromosome 24 from IGV from 58 Mbp to 62 Mbp (only variants with r² values ≥ 0.3 with the variant with the lowest p-value from the eigenGWA in this peak are shown). This region of the genome was found from an eigenGWA to be associated with inferred population structure between DAPC groups 1 and 2. The dark blue genotypes are homozygous for the reference allele (HomRef), the green genotypes are homozygous for an alternative allele (HomVar), and the light blue are heterozygous (Het). B) A scatterplot of variants with r² values ≥ 0.5 on the top shows areas with high LD. Below is a smaller version of the genotypes with the putative inversions highlighted.

(TIF)
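The filtering of variants by linkage disequilibrium with a lead variant, as described for S4 Fig, can be sketched in Python as follows. This is an illustrative approximation only: r² is computed here as the squared Pearson correlation of unphased genotype dosages (0, 1, 2), and the function names and toy genotypes are hypothetical. The software and settings actually used to compute LD are not restated in this section.

```python
from statistics import mean
from typing import Dict, List, Sequence


def r_squared(x: Sequence[int], y: Sequence[int]) -> float:
    """Squared Pearson correlation between two genotype dosage vectors."""
    if len(x) != len(y):
        raise ValueError("genotype vectors must have the same length")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return (sxy * sxy) / (sxx * syy)


def variants_in_ld(lead: Sequence[int],
                   others: Dict[str, Sequence[int]],
                   threshold: float) -> List[str]:
    """Names of variants whose r^2 with the lead variant meets the threshold."""
    return [name for name, genotypes in others.items()
            if r_squared(lead, genotypes) >= threshold]


# Hypothetical dosages for six individuals.
lead_variant = [0, 1, 2, 0, 1, 2]
candidates = {
    "var_a": [0, 1, 2, 0, 1, 2],  # identical to the lead variant
    "var_b": [2, 1, 0, 2, 1, 0],  # same pattern with alleles swapped
    "var_c": [1, 1, 1, 1, 1, 1],  # monomorphic
    "var_d": [0, 2, 1, 1, 0, 2],  # weakly correlated
}
kept = variants_in_ld(lead_variant, candidates, threshold=0.3)
```

Note that r² does not depend on which allele is labelled as the reference, which is why a variant with swapped alleles is treated the same as an identical one.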

S5 Fig. Sockeye salmon vs. kokanee eigenGWA.

A) The eigenGWA is shown between Fraser River sockeye salmon (n = 14) and kokanee (n = 12) with putative genes highlighted at the peaks (with at least 5 variants with LD). The red line represents a Bonferroni correction at ɑ = 0.01 and after correction for the genomic inflation factor. The blue line represents a Bonferroni correction at ɑ = 0.05 and was chosen as the minimum value of significance. B) An IGV plot of all the variants used in the eigenGWA for the region around the peak on chromosome 23. The genotypes are: dark blue—homozygous reference, green—homozygous alternative, and light blue—heterozygous. The top IGV plot is the kokanee used in this analysis and the sockeye are below. Below the IGV plot, thick lines represent NCBI annotated genes in this region. The putative ancestry informative gene is highlighted in green and named. The variants with the lowest p-values from the eigenGWA are shown as dotted-lines (1st represents the variant with the lowest p-value, 2nd represents the variant with the second lowest p-value, etc.). The p-values, in combination with the genotypes, were used to identify the most likely ancestry informative gene in this region.

(TIF)
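The correction for the genomic inflation factor mentioned for S5 Fig can be illustrated with a standard genomic-control calculation. The Python sketch below is an assumption about the general technique, not a reproduction of the analysis in this study: each p-value is converted to a 1-degree-of-freedom chi-square statistic, the inflation factor λ is estimated as the median statistic divided by the expected median (about 0.4549), and the statistics are rescaled by λ before being converted back to p-values. All function names are hypothetical.

```python
from statistics import NormalDist, median
from typing import List, Tuple

_STD_NORMAL = NormalDist()
CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def p_to_chi2(p: float) -> float:
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic."""
    z = _STD_NORMAL.inv_cdf(p / 2.0)  # lower-tail quantile; squaring removes the sign
    return z * z


def chi2_to_p(chi2: float) -> float:
    """Convert a 1-df chi-square statistic back to a two-sided p-value."""
    return 2.0 * _STD_NORMAL.cdf(-(chi2 ** 0.5))


def genomic_control(p_values: List[float]) -> Tuple[float, List[float]]:
    """Return (lambda, corrected p-values) using median-based genomic control."""
    chi2 = [p_to_chi2(p) for p in p_values]
    lam = median(chi2) / CHI2_1DF_MEDIAN
    corrected = [chi2_to_p(c / lam) for c in chi2]
    return lam, corrected


# Hypothetical inflated p-values, for illustration only.
raw = [((i + 0.5) / 1000) ** 2 for i in range(1000)]
lam, adjusted = genomic_control(raw)
```

A λ close to 1 indicates little inflation; when λ is greater than 1, the corrected p-values are larger (less significant) than the raw values, which lowers peaks relative to the Bonferroni lines.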

S6 Fig. Visualization of the variants with the greatest association to the sex phenotype in kokanee lacking the sdY gene.

Variants on chromosome 1 (NC_042535.1) shown in IGV with the female variants on top and the male variants on the bottom. The variant with the greatest association was found in the 3’ UTR of the krüppel-like factor 5 gene.

(TIF)

S7 Fig. Individual genomic diversity.

A) A map of the sampling sites. B) Three measures of individual genomic diversity: 1) total length of runs of homozygosity, 2) heterozygous genotypes per kbp, and 3) heterozygous ratio.

(TIF)

S8 Fig. Density plot of sockeye salmon genes for which we were unable to identify an ortholog in the coho salmon genome.

The x-axis is positions along the chromosome and the points represent the start position of a “missing” gene. The y-axis is the density of missing genes along the chromosome.

(PDF)

S1 Table. Sample information.

(XLSX)

S1 File. Compressed archive file with various custom Python scripts used in this study and readme files.

(XZ)

S2 File. List of orthologous genes between sockeye salmon and other salmon species.

(XLSX)

S1 Methods. Sampling strategy.

(DOCX)

Acknowledgments

We would like to acknowledge and thank McGill University and Génome Québec Innovation Centre for their extensive sample preparation and sequencing services. We would also like to thank and acknowledge the generous support and resources from Compute Canada (www.computecanada.ca). We would also like to thank Fisheries and Oceans Canada and the University of Victoria for the facilities and personnel needed for this study. Finally, the authors also appreciate the many Fisheries and Oceans Canada research and watershed management staff who collected samples for analysis in this study.

Data Availability

Raw data has been deposited to the National Center for Biotechnology Information (NCBI) under BioProject PRJNA530256 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA530256/). Custom scripts and sample information can be found in supplemental files.

Funding Statement

Funding for this study was provided by Fisheries and Oceans Canada (https://www.dfo-mpo.gc.ca/index-eng.htm) under the Canadian Regulatory System for Biotechnology to RHD. BFK’s, MAR’s, and EBT’s research is supported by the Natural Sciences and Engineering Research Council of Canada (https://www.nserc-crsng.gc.ca/index_eng.asp) (EBT - Discovery and Equipment grants programs). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Crête-Lafrenière A, Weir LK, Bernatchez L. Framing the Salmonidae Family Phylogenetic Portrait: A More Complete Picture from Increased Taxon Sampling. PLOS ONE. 2012. October 5;7(10):e46662. doi: 10.1371/journal.pone.0046662
  • 2. Wood CC, Bickham JW, John Nelson R, Foote CJ, Patton JC. Recurrent evolution of life history ecotypes in sockeye salmon: implications for conservation and future evolution. Evol Appl. 2008. May;1(2):207–21. doi: 10.1111/j.1752-4571.2008.00028.x
  • 3. Devlin RH. Sequence of Sockeye Salmon Type 1 and 2 Growth Hormone Genes and the Relationship of Rainbow Trout with Atlantic and Pacific Salmon. Can J Fish Aquat Sci. 1993. August 1;50(8):1738–48.
  • 4. Koop BF, von Schalburg KR, Leong J, Walker N, Lieph R, Cooper GA, et al. A salmonid EST genomic study: genes, duplications, phylogeny and microarrays. BMC Genomics. 2008. November 17;9:545. doi: 10.1186/1471-2164-9-545
  • 5. McKay SJ, Devlin RH, Smith MJ. Phylogeny of Pacific salmon and trout based on growth hormone type-2 and mitochondrial NADH dehydrogenase subunit 3 DNA sequences. Can J Fish Aquat Sci. 1996. May 1;53(5):1165–76.
  • 6. Beamish RJ. The Ocean Ecology of Pacific Salmon and Trout. American Fisheries Society; 2018. 1147 p.
  • 7. Behnke R, McGuane T. Trout and Salmon of North America. 1 edition New York: Free Press; 2002. 384 p.
  • 8. Taylor EB, Foote CJ, Wood CC. Molecular Genetic Evidence for Parallel Life-History Evolution within a Pacific Salmon (Sockeye Salmon and Kokanee, Oncorhynchus nerka). Evolution. 1996;50(1):401–16. doi: 10.1111/j.1558-5646.1996.tb04502.x
  • 9. Wood CC, Riddell BE, Rutherford DT, Withler RE. Biochemical Genetic Survey of Sockeye Salmon (Oncorhynchus nerka) in Canada. Can J Fish Aquat Sci. 1994. December 19;51(S1):114–31.
  • 10. Taylor EB, Beacham TD, Kaeriyama M. Population Structure and Identification of North Pacific Ocean Chum Salmon (Oncorhynchus keta) Revealed by an Analysis of Minisatellite DNA Variation. Can J Fish Aquat Sci. 1994. June 1;51(6):1430–42.
  • 11. Varnavskaya NV, Beacham TD. Biochemical genetic variation in odd-year pink salmon (Oncorhynchus gorbuscha) from Kamchatka. Can J Zool. 1992. November 1;70(11):2115–20.
  • 12. Beacham TD, Candy JR, Le KD, Wetklo M. Population structure of chum salmon (Oncorhynchus keta) across the Pacific Rim, determined from microsatellite analysis. Fish Bull. 2009;107(2):244–60.
  • 13. Beacham TD, McIntosh B, MacConnachie C, Miller KM, Withler RE, Varnavskaya N. Pacific Rim Population Structure of Sockeye Salmon as Determined from Microsatellite Analysis. Trans Am Fish Soc. 2006;135(1):174–87.
  • 14. Beacham TD, Jonsen KL, Supernault J, Wetklo M, Deng L, Varnavskaya N. Pacific Rim Population Structure of Chinook Salmon as Determined from Microsatellite Analysis. Trans Am Fish Soc. 2006;135(6):1604–21.
  • 15. Beacham TD, Withler RE. Population structure of sea-type and lake-type sockeye salmon and kokanee in the Fraser River and Columbia River drainages. PLOS ONE. 2017. September 8;12(9):e0183713. doi: 10.1371/journal.pone.0183713
  • 16. McPhail J, Lindsey C. Zoogeography of the freshwater fishes of Cascadia (the Columbia system and rivers north to the Stikine) In: The zoogeography of North American freshwater fishes. New York: John Wiley & Sons; 1986. p. 615–37.
  • 17. Nichols KM, Kozfkay CC, Narum SR. Genomic signatures among Oncorhynchus nerka ecotypes to inform conservation and management of endangered Sockeye Salmon. Evol Appl. 2016. December;9(10):1285–300. doi: 10.1111/eva.12412
  • 18. Larson WA, Seeb JE, Dann TH, Schindler DE, Seeb LW. Signals of heterogeneous selection at an MHC locus in geographically proximate ecotypes of sockeye salmon. Mol Ecol. 2014. November 1;23(22):5448–61. doi: 10.1111/mec.12949
  • 19. Veale AJ, Russello MA. Genomic Changes Associated with Reproductive and Migratory Ecotypes in Sockeye Salmon (Oncorhynchus nerka). Genome Biol Evol. 2017. October 1;9(10):2921–39. doi: 10.1093/gbe/evx215
  • 20. Larson WA, Limborg MT, McKinney GJ, Schindler DE, Seeb JE, Seeb LW. Genomic islands of divergence linked to ecotypic variation in sockeye salmon. Mol Ecol. 2017;26(2):554–70. doi: 10.1111/mec.13933
  • 21. Veale AJ, Russello MA. An ancient selective sweep linked to reproductive life history evolution in sockeye salmon. Sci Rep. 2017. May 11;7(1):1–10.
  • 22. Lemay MA, Russello MA. Genetic evidence for ecological divergence in kokanee salmon. Mol Ecol. 2015;24(4):798–811. doi: 10.1111/mec.13066
  • 23. Lemay MA, Russello MA. Neutral Loci Reveal Population Structure by Geography, not Ecotype, in Kootenay Lake Kokanee. North Am J Fish Manag. 2012. April 1;32(2):282–91.
  • 24. Pritchard VL, Mäkinen H, Vähä J-P, Erkinaro J, Orell P, Primmer CR. Genomic signatures of fine-scale local selection in Atlantic salmon suggest involvement of sexual maturation, energy homeostasis and immune defence-related genes. Mol Ecol. 2018;27(11):2560–75. doi: 10.1111/mec.14705
  • 25. Larson WA, Dann TH, Limborg MT, McKinney GJ, Seeb JE, Seeb LW. Parallel signatures of selection at genomic islands of divergence and the major histocompatibility complex in ecotypes of sockeye salmon across Alaska. Mol Ecol. 2019;28(9):2254–71. doi: 10.1111/mec.15082
  • 26. Yano A, Guyomard R, Nicol B, Jouanno E, Quillet E, Klopp C, et al. An Immune-Related Gene Evolved into the Master Sex-Determining Gene in Rainbow Trout, Oncorhynchus mykiss. Curr Biol. 2012. August 7;22(15):1423–8. doi: 10.1016/j.cub.2012.05.045
  • 27. Larson WA, McKinney GJ, Seeb JE, Seeb LW. Identification and Characterization of Sex-Associated Loci in Sockeye Salmon Using Genotyping-by-Sequencing and Comparison with a Sex-Determining Assay Based on the sdY Gene. J Hered. 2016. November 1;107(6):559–66. doi: 10.1093/jhered/esw043
  • 28. Thorgaard GH. Heteromorphic sex chromosomes in male rainbow trout. Science. 1977. May 20;196(4292):900–2. doi: 10.1126/science.860122
  • 29. Phillips RB. Evolution of the Sex Chromosomes in Salmonid Fishes. Cytogenet Genome Res. 2013;141(2–3):177–85. doi: 10.1159/000355149
  • 30. Faber-Hammond J, Phillips RB, Park LK. The sockeye salmon neo-Y chromosome is a fusion between linkage groups orthologous to the coho Y chromosome and the long arm of rainbow trout chromosome 2. Cytogenet Genome Res. 2012;136(1):69–74. doi: 10.1159/000334583
  • 31. Thorgaard GH. Sex chromosomes in the sockeye salmon: a Y-autosome fusion. Can J Genet Cytol J Can Genet Cytol. 1978. September;20(3):349–54. doi: 10.1139/g78-039
  • 32. Limborg MT, Waples RK, Allendorf FW, Seeb JE. Linkage Mapping Reveals Strong Chiasma Interference in Sockeye Salmon: Implications for Interpreting Genomic Data. G3 Genes Genomes Genet. 2015. November 1;5(11):2463–73. doi: 10.1534/g3.115.020222
  • 33. Brown MS, Evans BS, Afonso LOB. Discordance for genotypic sex in phenotypic female Atlantic salmon (Salmo salar) is related to a reduced sdY copy number. Sci Rep. 2020. June 15;10(1):9651. doi: 10.1038/s41598-020-66406-x
  • 34. Faber-Hammond JJ, Phillips RB, Brown KH. Comparative Analysis of the Shared Sex-Determination Region (SDR) among Salmonid Fishes. Genome Biol Evol. 2015. July 1;7(7):1972–87. doi: 10.1093/gbe/evv123
  • 35. Woram RA, Gharbi K, Sakamoto T, Hoyheim B, Holm L-E, Naish K, et al. Comparative Genome Analysis of the Primary Sex-Determining Locus in Salmonid Fishes. Genome Res. 2003. January 2;13(2):272–80. doi: 10.1101/gr.578503
  • 36. Phillips RB, Konkol NR, Reed KM, Stein JD. Chromosome painting supports lack of homology among sex chromosomes in Oncorhynchus, Salmo, and Salvelinus (Salmonidae). Genetica. 2001. January 1;111(1):119–23.
  • 37. Kijas J, McWilliam S, Naval Sanchez M, Kube P, King H, Evans B, et al. Evolution of Sex Determination Loci in Atlantic Salmon. Sci Rep. 2018. April 4;8(1):5664. doi: 10.1038/s41598-018-23984-1
  • 38. Eisbrenner WD, Botwright N, Cook M, Davidson EA, Dominik S, Elliott NG, et al. Evidence for multiple sex-determining loci in Tasmanian Atlantic salmon (Salmo salar). Heredity. 2014. July;113(1):86–92. doi: 10.1038/hdy.2013.55
  • 39. Lubieniecki KP, Lin S, Cabana EI, Li J, Lai YYY, Davidson WS. Genomic Instability of the Sex-Determining Locus in Atlantic Salmon (Salmo salar). G3 Bethesda Md. 2015. November;5(11):2513–22.
  • 40. Genomic DNA Preparation from RNAlater™ Preserved Tissues - CA [Internet]. [cited 2019 Dec 19]. https://www.thermofisher.com/ca/en/home/references/protocols/nucleic-acid-purification-and-analysis/rna-protocol/genomic-dna-preparation-from-rnalater-preserved-tissues.html.
  • 41. Andrews S. FastQC [Internet]. Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data. 2016 [cited 2017 Dec 19]. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
  • 42. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014. August 1;30(15):2114–20. doi: 10.1093/bioinformatics/btu170
  • 43. Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014. December 15;30(24):3506–14. doi: 10.1093/bioinformatics/btu538
  • 44. Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci. 2011. January 25;108(4):1513–8. doi: 10.1073/pnas.1017351108
  • 45. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009. December 15;10(1):1–9. doi: 10.1186/1471-2105-10-421
  • 46. Xu G-C, Xu T-J, Zhu R, Zhang Y, Li S-Q, Wang H-W, et al. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. GigaScience. 2019. 01;8(1). doi: 10.1093/gigascience/giy157
  • 47. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLOS ONE. 2014. November 19;9(11):e112963. doi: 10.1371/journal.pone.0112963
  • 48. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinforma Oxf Engl. 2009. August 15;25(16):2078–9. doi: 10.1093/bioinformatics/btp352
  • 49. Yeo S, Coombe L, Warren RL, Chu J, Birol I. ARCS: scaffolding genome drafts with linked reads. Bioinforma Oxf Engl. 2018. 01;34(5):725–31. doi: 10.1093/bioinformatics/btx675
  • 50. Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJM, et al. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience. 2015. August 4;4:35. doi: 10.1186/s13742-015-0076-3
  • 51. Jackman SD, Coombe L, Chu J, Warren RL, Vandervalk BP, Yeo S, et al. Tigmint: correcting assembly errors using linked reads from large molecules. BMC Bioinformatics. 2018. October 26;19(1):393. doi: 10.1186/s12859-018-2425-6
  • 52. PacificBiosciences/GenomicConsensus [Internet]. Pacific Biosciences; 2019 [cited 2019 Dec 13]. https://github.com/PacificBiosciences/GenomicConsensus.
  • 53. Chin C-S, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods. 2013. June;10(6):563–9. doi: 10.1038/nmeth.2474
  • 54. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017. January 5;27(5):722–36. doi: 10.1101/gr.215087.116
  • 55. Koren S. skoren/ArrowGrid [Internet]. 2019 [cited 2019 Dec 13]. https://github.com/skoren/ArrowGrid.
  • 56. Li H. lh3/seqtk [Internet]. 2019 [cited 2019 Dec 13]. https://github.com/lh3/seqtk.
  • 57. Christensen KA, Leong JS, Sakhrani D, Biagi CA, Minkley DR, Withler RE, et al. Chinook salmon (Oncorhynchus tshawytscha) genome and transcriptome. PLOS ONE. 2018. April 5;13(4):e0195461. doi: 10.1371/journal.pone.0195461
  • 58. Lien S, Koop BF, Sandve SR, Miller JR, Kent MP, Nome T, et al. The Atlantic salmon genome provides insights into rediploidization. Nature. 2016. May 12;533(7602):200–5. doi: 10.1038/nature17164
  • 59. Berthelot C, Brunet F, Chalopin D, Juanchich A, Bernard M, Noël B, et al. The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nat Commun [Internet]. 2014. April 22 [cited 2015 Mar 18];5 Available from: http://www.nature.com/ncomms/2014/140422/ncomms4657/full/ncomms4657.html. doi: 10.1038/ncomms4657
  • 60. Christensen KA, Rondeau EB, Minkley DR, Leong JS, Nugent CM, Danzmann RG, et al. The Arctic charr (Salvelinus alpinus) genome and transcriptome assembly. PLOS ONE. 2018. September 13;13(9):e0204076. doi: 10.1371/journal.pone.0204076 [Retracted]
  • 61. Rondeau EB, Minkley DR, Leong JS, Messmer AM, Jantzen JR, von Schalburg KR, et al. The Genome and Linkage Map of the Northern Pike (Esox lucius): Conserved Synteny Revealed between the Salmonid Sister Group and the Neoteleostei. PLOS ONE. 2014. July 28;9(7):e102089. doi: 10.1371/journal.pone.0102089
  • 62. Larson WA, McKinney GJ, Limborg MT, Everett MV, Seeb LW, Seeb JE. Identification of Multiple QTL Hotspots in Sockeye Salmon (Oncorhynchus nerka) Using Genotyping-by-Sequencing and a Dense Linkage Map. J Hered. 2016. March 1;107(2):122–33. doi: 10.1093/jhered/esv099
  • 63. Everett MV, Miller MR, Seeb JE. Meiotic maps of sockeye salmon derived from massively parallel DNA sequencing. BMC Genomics. 2012. October 3;13:521. doi: 10.1186/1471-2164-13-521
  • 64. Wickham H. ggplot2: Elegant Graphics for Data Analysis 1st ed 2009. Corr. 3rd printing 2010 edition. New York: Springer; 2010. 213 p.
  • 65. R Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna, Austria: R Foundation for Statistical Computing; 2018. https://www.R-project.org/.
  • 66. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015. October 1;31(19):3210–2. doi: 10.1093/bioinformatics/btv351
  • 67. Soderlund C, Bomhoff M, Nelson WM. SyMAP v3.4: a turnkey synteny system with application to plant genomes. Nucleic Acids Res. 2011. May;39(10):e68. doi: 10.1093/nar/gkr123
  • 68. Morgulis A, Gertz EM, Schäffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinforma Oxf Engl. 2006. January 15;22(2):134–41. doi: 10.1093/bioinformatics/bti774
  • 69. Krzywinski MI, Schein JE, Birol I, Connors J, Gascoyne R, Horsman D, et al. Circos: An information aesthetic for comparative genomics. Genome Res [Internet]. 2009. June 18 [cited 2015 May 21]; Available from: http://genome.cshlp.org/content/early/2009/06/15/gr.092759.109. doi: 10.1101/gr.092759.109
  • 70. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010. September;20(9):1297–303. doi: 10.1101/gr.107524.110
  • 71. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011. May;43(5):491–8. doi: 10.1038/ng.806
  • 72. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinforma. 2013;43:11.10.1–11.10.33. doi: 10.1002/0471250953.bi1110s43
  • 73. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv13033997 Q-Bio [Internet]. 2013 Mar 16 [cited 2017 Dec 19]; http://arxiv.org/abs/1303.3997.
  • 74. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011. August 1;27(15):2156–8. doi: 10.1093/bioinformatics/btr330
  • 75. Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011. March 1;27(5):718–9. doi: 10.1093/bioinformatics/btq671
  • 76. GitHub - CNuge/snp-placer: Take information about snps on short sequence reads and accurately place the snps in a reference genome [Internet]. [cited 2019 Dec 12]. https://github.com/CNuge/snp-placer.
  • 77. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011. November 1;27(21):2987–93. doi: 10.1093/bioinformatics/btr509
  • 78. Jombart T, Devillard S, Balloux F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 2010. October 15;11(1):94. doi: 10.1186/1471-2156-11-94
  • 79. Jombart T. adegenet: a R package for the multivariate analysis of genetic markers. Bioinforma Oxf Engl. 2008. June 1;24(11):1403–5.
  • 80. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009. September;19(9):1655–64. doi: 10.1101/gr.094052.109
  • 81. Lee T-H, Guo H, Wang X, Kim C, Paterson AH. SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics. 2014. February 26;15(1):162. doi: 10.1186/1471-2164-15-162
  • 82. Knaus BJ, Grünwald NJ. vcfr: a package to manipulate and visualize variant call format data in R. Mol Ecol Resour. 2017. January;17(1):44–53. doi: 10.1111/1755-0998.12549
  • 83. PLINK 1.9 [Internet]. [cited 2018 Jun 1]. http://www.cog-genomics.org/plink/1.9/.
  • 84. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience [Internet]. 2015. December 1 [cited 2020 Feb 21];4(1). Available from: https://academic.oup.com/gigascience/article/4/1/s13742-015-0047-8/2707533.
  • 85. Wickham H. Reshaping Data with the reshape Package. J Stat Softw. 2007. November 13;21(1):1–20.
  • 86. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 2019. July 2;47(W1):W256–9. doi: 10.1093/nar/gkz239
  • 87. Chen G-B, Lee SH, Zhu Z-X, Benyamin B, Robinson MR. EigenGWAS: finding loci under selection through genome-wide association studies of eigenvectors in structured populations. Heredity. 2016. July;117(1):51–61. doi: 10.1038/hdy.2016.25
  • 88.gc5k/GEAR [Internet]. GitHub. [cited 2020 Feb 21]. https://github.com/gc5k/GEAR.
  • 89.Wickham H, Seidel D, RStudio. scales: Scale Functions for Visualization [Internet]. 2019 [cited 2020 Feb 21]. https://CRAN.R-project.org/package=scales.
  • 90.Sievert C. Interactive web-based data visualization with R, plotly, and shiny [Internet]. [cited 2020 Feb 21]. https://plotly-r.com/.
  • 91.Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013. January 3;14(2):178–92. 10.1093/bib/bbs017 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Samuels DC, Wang J, Ye F, He J, Levinson RT, Sheng Q, et al. Heterozygosity Ratio, a Robust Global Genomic Measure of Autozygosity and Its Association with Height and Disease Risk. Genetics. 2016. November 1;204(3):893–904. 10.1534/genetics.116.189936 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Guo Y, Ye F, Sheng Q, Clark T, Samuels DC. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform. 2014. November;15(6):879–89. 10.1093/bib/bbt069 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Allendorf FW, Thorgaard GH. Tetraploidy and the Evolution of Salmonid Fishes In: Turner BJ, editor. Evolutionary Genetics of Fishes [Internet]. Springer; US; 1984. [cited 2015 Mar 17]. p. 1–53. (Monographs in Evolutionary Biology). http://link.springer.com/chapter/10.1007/978-1-4684-4652-4_1. [Google Scholar]
  • 95.Macqueen DJ, Johnston IA. A well-constrained estimate for the timing of the salmonid whole genome duplication reveals major decoupling from species diversification. Proc R Soc B Biol Sci [Internet]. 2014. March 7 [cited 2015 Mar 17];281(1778). Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3906940/. 10.1098/rspb.2013.2881 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Oncorhynchus nerka Annotation Report [Internet]. [cited 2020 Jun 24]. https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Oncorhynchus_nerka/100/.
  • 97.Oncorhynchus tshawytscha Annotation Report [Internet]. [cited 2018 May 9]. https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Oncorhynchus_tshawytscha/100/.
  • 98.Oncorhynchus mykiss Annotation Report [Internet]. [cited 2018 May 9]. https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Oncorhynchus_mykiss/100/.
  • 99.Oncorhynchus kisutch Annotation Report [Internet]. [cited 2020 Jun 24]. https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Oncorhynchus_kisutch/101/.
  • 100.Pearse DE, Barson NJ, Nome T, Gao G, Campbell MA, Abadía-Cardoso A, et al. Sex-dependent dominance maintains migration supergene in rainbow trout. Nat Ecol Evol. 2019. December;3(12):1731–42. 10.1038/s41559-019-1044-6 [DOI] [PubMed] [Google Scholar]
  • 101.McKinney G, McPhee MV, Pascal C, Seeb JE, Seeb LW. Network Analysis of Linkage Disequilibrium Reveals Genome Architecture in Chum Salmon. G3 GenesGenomesGenetics. 2020. March 12;10(5):1553–61. 10.1534/g3.119.400972 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Xu T, Zhang X, Ruan Z, Yu H, Chen J, Jiang S, et al. Genome resequencing of the orange-spotted grouper (Epinephelus coioides) for a genome-wide association study on ammonia tolerance. Aquaculture. 2019. October 15;512:734332. [Google Scholar]
  • 103.Davidson WS, Koop BF, Jones SJM, Iturra P, Vidal R, Maass A, et al. Sequencing the genome of the Atlantic salmon (Salmo salar). Genome Biol. 2010;11(9):403 10.1186/gb-2010-11-9-403 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Russello M. Kootenay Lake Kokanee Genetics Support Tool Development. University of British Columbia, Okanagan Campus, Department of Biology; 2017. Report No.: 2016–2017.
  • 105.Winans GA, Aebersold PB, Waples RS. Allozyme Variability of Oncorhynchus nerka in the Pacific Northwest, with Special Consideration to Populations of Redfish Lake, Idaho. Trans Am Fish Soc. 1996. September 1;125(5):645–63. [Google Scholar]
  • 106.Nelson JS. Distribution and Nomenclature of North American Kokanee, Oncorhynchus nerka. J Fish Res Board Can. 1968. February 1;25(2):409–14. [Google Scholar]
  • 107.Foote CJ, Wood CC, Clarke WC, Blackburn J. Circannual Cycle of Seawater Adaptability in Oncorhynchus nerka: Genetic Differences between Sympatric Sockeye Salmon and Kokanee. Can J Fish Aquat Sci. 1992. January 1;49(1):99–109. [Google Scholar]
  • 108.Northcote T. Some impacts of man on Kootenay Lake and its salmonids. Ann Arbor, Michigan: Great Lakes Fishery Commission; 1973. Report No.: Techn. Rep. Nr. 2.
  • 109.Yasuike M, de Boer J, von Schalburg KR, Cooper GA, McKinnel L, Messmer A, et al. Evolution of duplicated IgH loci in Atlantic salmon, Salmo salar. BMC Genomics. 2010. September 2;11(1):486 10.1186/1471-2164-11-486 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Kirkpatrick M. How and Why Chromosome Inversions Evolve. PLoS Biol [Internet]. 2010. September 28 [cited 2020 Jul 21];8(9). Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2946949/. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 111.Guijarro JA, Cascales D, García-Torrico AI, García-Domínguez M, Méndez J. Temperature-dependent expression of virulence genes in fish-pathogenic bacteria. Front Microbiol [Internet]. 2015. July 9 [cited 2020 Jul 16];6 Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4496569/. 10.3389/fmicb.2015.00700 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 112.Liao X, Li Y. Genetic associations between voltage-gated calcium channels and autism spectrum disorder: a systematic review. Mol Brain. 2020. June 22;13(1):96 10.1186/s13041-020-00634-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113.Beltrán JF, Belén LH, Lee-Estevez M, Figueroa E, Dumorné K, Farias JG. The voltage-gated T-type Ca2+ channel is key to the sperm motility of Atlantic salmon (Salmo salar). Fish Physiol Biochem [Internet]. 2020. June 6 [cited 2020 Jul 27]; Available from: 10.1007/s10695-020-00829-1. [DOI] [PubMed] [Google Scholar]
  • 114.Lissabet JFB, Belén LH, Lee-Estevez M, Farias JG. Role of voltage-gated L-type calcium channel in the spermatozoa motility of Atlantic salmon (Salmo salar). Comp Biochem Physiol A Mol Integr Physiol. 2020. March 1;241:110633 10.1016/j.cbpa.2019.110633 [DOI] [PubMed] [Google Scholar]
  • 115.Schaschl H, Wallner B. Population-specific, recent positive directional selection suggests adaptation of human male reproductive genes to different environmental conditions. BMC Evol Biol. 2020. February 13;20(1):27 10.1186/s12862-019-1575-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 116.Debrand E, El Jai Y, Spence L, Bate N, Praekelt U, Pritchard CA, et al. Talin 2 is a large and complex gene encoding multiple transcripts and protein isoforms. Febs J. 2009. March;276(6):1610–28. 10.1111/j.1742-4658.2009.06893.x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 117.De Vries L, Zheng B, Fischer T, Elenko E, Farquhar MG. The Regulator of G Protein Signaling Family. Annu Rev Pharmacol Toxicol. 2000;40(1):235–71. 10.1146/annurev.pharmtox.40.1.235 [DOI] [PubMed] [Google Scholar]
  • 118.Chakravarti B, Yang J, Ahlers‐Dannen KE, Luo Z, Flaherty HA, Meyerholz DK, et al. Essentiality of Regulator of G Protein Signaling 6 and Oxidized Ca2+/Calmodulin‐Dependent Protein Kinase II in Notch Signaling and Cardiovascular Development. J Am Heart Assoc Cardiovasc Cerebrovasc Dis [Internet]. 2017. October 27 [cited 2020 Jul 29];6(11). Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5721783/. 10.1161/JAHA.117.007038 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 119.Stewart A, Maity B, Anderegg SP, Allamargot C, Yang J, Fisher RA. Regulator of G protein signaling 6 is a critical mediator of both reward-related behavioral and pathological responses to alcohol. Proc Natl Acad Sci. 2015. February 17;112(7):E786–95. 10.1073/pnas.1418795112 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 120.Nadin BM, Pfaffinger PJ. A New TASK for Dipeptidyl Peptidase-like Protein 6. PLOS ONE. 2013. April 9;8(4):e60831 10.1371/journal.pone.0060831 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 121.Nadin BM, Pfaffinger PJ. Dipeptidyl Peptidase-Like Protein 6 Is Required for Normal Electrophysiological Properties of Cerebellar Granule Cells. J Neurosci. 2010. June 23;30(25):8551–65. 10.1523/JNEUROSCI.5489-09.2010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 122.Lin L, Murphy JG, Karlsson R-M, Petralia RS, Gutzmann JJ, Abebe D, et al. DPP6 Loss Impacts Hippocampal Synaptic Development and Induces Behavioral Impairments in Recognition, Learning and Memory. Front Cell Neurosci [Internet]. 2018. March 29 [cited 2020 Jul 29];12 Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5884885/. 10.3389/fncel.2018.00084 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 123.Gleason LU. Ecological genomics of an intertidal marine snail: Population structure and local adaptation to heat stress in Chlorostoma (formerly Tegula) funebralis [Internet]. undefined. 2015 [cited 2020 Jul 29]. /paper/Ecological-genomics-of-an-intertidal-marine-snail%3A-Gleason/562fe995ce36074bdcc90125ba70cfd5f21c34d7.
  • 124.Puno MR, Lima CD. Structural basis for MTR4–ZCCHC8 interactions that stimulate the MTR4 helicase in the nuclear exosome-targeting complex. Proc Natl Acad Sci. 2018. June 12;115(24):E5506–15. 10.1073/pnas.1803530115 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 125.Lubas M, Christensen MS, Kristiansen MS, Domanski M, Falkenby LG, Lykke-Andersen S, et al. Interaction Profiling Identifies the Human Nuclear Exosome Targeting Complex. Mol Cell. 2011. August 19;43(4):624–37. 10.1016/j.molcel.2011.06.028 [DOI] [PubMed] [Google Scholar]
  • 126.Tiedje C, Lubas M, Tehrani M, Menon MB, Ronkina N, Rousseau S, et al. p38MAPK/MK2-mediated phosphorylation of RBM7 regulates the human nuclear exosome targeting complex. RNA. 2015. January 2;21(2):262–78. 10.1261/rna.048090.114 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 127.Rialdi A, Hultquist J, Jimenez-Morales D, Peralta Z, Campisi L, Fenouil R, et al. The RNA Exosome Syncs IAV-RNAPII Transcription to Promote Viral Ribogenesis and Infectivity. Cell. 2017. May 4;169(4):679–692.e14. 10.1016/j.cell.2017.04.021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 128.Prince DJ, O’Rourke SM, Thompson TQ, Ali OA, Lyman HS, Saglam IK, et al. The evolutionary basis of premature migration in Pacific salmon highlights the utility of genomics for informing conservation. Sci Adv. 2017. August 1;3(8):e1603198 10.1126/sciadv.1603198 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 129.Micheletti SJ, Hess JE, Zendt JS, Narum SR. Selection at a genomic region of major effect is responsible for evolution of complex life histories in anadromous steelhead. BMC Evol Biol. 2018. September 15;18(1):140 10.1186/s12862-018-1255-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 130.Sanna-Cherchi S, Khan K, Westland R, Krithivasan P, Fievet L, Rasouly HM, et al. Exome-wide Association Study Identifies GREB1L Mutations in Congenital Kidney Malformations. Am J Hum Genet. 2017. November 2;101(5):789–802. 10.1016/j.ajhg.2017.09.018 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 131.De Tomasi L, David P, Humbert C, Silbermann F, Arrondel C, Tores F, et al. Mutations in GREB1L Cause Bilateral Kidney Agenesis in Humans and Mice. Am J Hum Genet. 2017. November 2;101(5):803–14. 10.1016/j.ajhg.2017.09.026 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 132.Quinn TP, McGinnity P, Reed TE. The paradox of “premature migration” by adult anadromous salmonid fishes: patterns and hypotheses. Can J Fish Aquat Sci. 2015. November 25;73(7):1015–30. [Google Scholar]
  • 133.Shyh-Chang N, Daley GQ. Lin28: Primal Regulator of Growth and Metabolism in Stem Cells. Cell Stem Cell. 2013. April 4;12(4):395–406. 10.1016/j.stem.2013.03.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 134.Avramopoulos D. Neuregulin 3 and its roles in schizophrenia risk and presentation. Am J Med Genet B Neuropsychiatr Genet. 2018;177(2):257–66. 10.1002/ajmg.b.32552 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 135.Tarakci H, Berger J. The sarcoglycan complex in skeletal muscle. Front Biosci Landmark Ed. 2016. January 1;21:744–56. 10.2741/4418 [DOI] [PubMed] [Google Scholar]
  • 136.Duggan DJ, Manchester D, Stears KP, Mathews DJ, Hart C, Hoffman EP. Mutations in the delta-sarcoglycan gene are a rare cause of autosomal recessive limb-girdle muscular dystrophy (LGMD2). Neurogenetics. 1997. May;1(1):49–58. 10.1007/s100480050008 [DOI] [PubMed] [Google Scholar]
  • 137.Perez-Ortiz AC, Peralta-Ildefonso MJ, Lira-Romero E, Moya-Albor E, Brieva J, Ramirez-Sanchez I, et al. Lack of Delta-Sarcoglycan (Sgcd) Results in Retinal Degeneration. Int J Mol Sci. 2019. January;20(21):5480 10.3390/ijms20215480 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 138.Crosbie RH, Lebakken CS, Holt KH, Venzke DP, Straub V, Lee JC, et al. Membrane Targeting and Stabilization of Sarcospan Is Mediated by the Sarcoglycan Subcomplex. J Cell Biol. 1999. April 5;145(1):153–65. 10.1083/jcb.145.1.153 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 139.Fort P, Estrada F-J, Bordais A, Mornet D, Sahel J-A, Picaud S, et al. The sarcoglycan-sarcospan complex localization in mouse retina is independent from dystrophins. Neurosci Res. 2005. September;53(1):25–33. 10.1016/j.neures.2005.05.007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 140.Stefansson S, Björnsson B, Ebbesson L, McCormick S. Smoltification In: Finn R, Kapoor B, editors. Fish Larval Physiology. Enfield, NH, USA: Science Publishers; 2008. p. 639–81. [Google Scholar]
  • 141.Iribarne M. Zebrafish Photoreceptor Degeneration and Regeneration Research to Understand Hereditary Human Blindness. Vis Impair Blind [Internet]. 2019 Aug 20 [cited 2020 Mar 16]; https://www.intechopen.com/online-first/zebrafish-photoreceptor-degeneration-and-regeneration-research-to-understand-hereditary-human-blindn.
  • 142.McKinney GJ, Hale MC, Goetz G, Gribskov M, Thrower FP, Nichols KM. Ontogenetic changes in embryonic and brain gene expression in progeny produced from migratory and resident Oncorhynchus mykiss. Mol Ecol. 2015. April;24(8):1792–809. 10.1111/mec.13143 [DOI] [PubMed] [Google Scholar]
  • 143.Gutierrez AP, Yáñez JM, Davidson WS. Evidence of recent signatures of selection during domestication in an Atlantic salmon population. Mar Genomics. 2016. April 1;26:41–50. 10.1016/j.margen.2015.12.007 [DOI] [PubMed] [Google Scholar]
  • 144.Eddy FB. Ammonia in estuaries and effects on fish. J Fish Biol. 2005;67(6):1495–513. [Google Scholar]
  • 145.Moore JW, Gordon J, Carr-Harris C, Gottesfeld AS, Wilson SM, Russell JH. Assessing estuaries as stopover habitats for juvenile Pacific salmon. Mar Ecol Prog Ser. 2016. November 9;559:201–15. [Google Scholar]
  • 146.Crackower MA, Kolas NK, Noguchi J, Sarao R, Kikuchi K, Kaneko H, et al. Essential Role of Fkbp6 in Male Fertility and Homologous Chromosome Pairing in Meiosis. Science. 2003. May 23;300(5623):1291–5. 10.1126/science.1083022 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 147.Podlesnykh AV, Brykov VA, Kukhlevsky AD. Unstable Linkage of Molecular Markers with Sex Determination Gene in Pacific Salmon (Oncorhynchus spp.). J Hered. 2017. May 1;108(3):328–33. 10.1093/jhered/esx001 [DOI] [PubMed] [Google Scholar]
  • 148.Ayllon F, Solberg MF, Besnier F, Fjelldal PG, Hansen TJ, Wargelius A, et al. Sex determining gene transposition as an evolutionary platform for chromosome turnover. bioRxiv. 2020. March 16;2020.03.14.991026. [Google Scholar]
  • 149.Bertho S, Herpin A, Branthonne A, Jouanno E, Yano A, Nicol B, et al. The unusual rainbow trout sex determination gene hijacked the canonical vertebrate gonadal differentiation pathway. Proc Natl Acad Sci. 2018. December 11;115(50):12781–6. 10.1073/pnas.1803826115 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 150.Choi H, Ryu K-Y, Roh J. Krüppel-like factor 4 plays a role in the luteal transition in steroidogenesis by downregulating Cyp19A1 expression. Am J Physiol-Endocrinol Metab. 2019. April 2;316(6):E1071–80. 10.1152/ajpendo.00238.2018 [DOI] [PubMed] [Google Scholar]
  • 151.Godmann M, Katz JP, Guillou F, Simoni M, Kaestner KH, Behr R. Krüppel-like factor 4 is involved in functional differentiation of testicular Sertoli cells. Dev Biol. 2008. March 15;315(2):552–66. 10.1016/j.ydbio.2007.12.018 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 152.DFO. Science Response Process of July 18, 2018 on the Review of Science Information to Inform Consideration of Risks to Cultus Lake Sockeye Salmon in 2018. DFO Can. Sci. Advis. Sec. Sci. Resp.; 2018. Report No.: 2018/052.
  • 153.Withler RE, Le KD, Nelson RJ, Miller KM, Beacham TD. Intact genetic structure and high levels of genetic diversity in bottlenecked sockeye salmon (Oncorhynchus nerka) populations of the Fraser River, British Columbia, Canada. Can J Fish Aquat Sci. 2000. October 1;57(10):1985–98. [Google Scholar]
  • 154.Becker RA, Wilks AR, Brownrigg R, Minka TP, Deckmyn A. maps: Draw Geographical Maps [Internet]. 2018. https://CRAN.R-project.org/packages=maps.

Decision Letter 0

Zuogang Peng

16 Jun 2020

PONE-D-20-12784

The sockeye salmon genome, transcriptome, and analyses identifying population defining regions of the genome and sex chromosome characterization

PLOS ONE

Dear Dr. Christensen,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Your manuscript has been reviewed by three referees. Although the external referees express interest in the general subject area of the paper, they also express a series of reservations that preclude publication of the paper in PLOS ONE in its current form. However, if you feel that you can suitably address the concerns and issues raised by the referees, I would be willing to consider a revised manuscript. Also, please be advised that the revised manuscript may be subject to re-review.

Please submit your revised manuscript by Jul 31 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Zuogang Peng, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. To comply with PLOS ONE submissions requirements, please provide methods of sacrifice in the Methods section of your manuscript.

3. We note that you are reporting an analysis of a microarray, next-generation sequencing, or deep sequencing data set. PLOS requires that authors comply with field-specific standards for preparation, recording, and deposition of data in repositories appropriate to their field. Please upload these data to a stable, public repository (such as ArrayExpress, Gene Expression Omnibus (GEO), DNA Data Bank of Japan (DDBJ), NCBI GenBank, NCBI Sequence Read Archive, or EMBL Nucleotide Sequence Database (ENA)). In your revised cover letter, please provide the relevant accession numbers that may be used to access these data. For a full list of recommended repositories, see http://journals.plos.org/plosone/s/data-availability#loc-omics or http://journals.plos.org/plosone/s/data-availability#loc-sequencing.

4. In your Methods section, please provide additional location information of the sampling sites, including geographic coordinates for the data set if available.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In this MS, the authors sequenced and assembled the first sockeye salmon reference genome assembly. The genomes of 140 sockeye salmon and kokanee from various bodies of water along the northern Pacific Ocean were resequenced to understand population structure and genomic loci underlying that population structure. Three distinct groups were identified from the individuals. An immunoglobulin heavy chain variable gene cluster on chr. 26 was identified that differentiated the samples from the northwestern region of the sampling area from those to the south. They also explore the sex chromosomes of this species. An alternative sex-determination mechanism was identified in a subset of upper Columbia River kokanee.

Generally, the manuscript is well organized and written nicely. The methods were clearly described. The data analysis and discussion were appropriately made. The figures are appropriate and clear. I have just a few issues that need to be addressed.

1. Lines 374-376: please add references.

2. Line 372: sockeye salmon and coho salmon are closely related species. Why were far fewer protein-coding genes annotated in this species than in the other salmon and trout? Does this indicate poor quality of the assembled genome, or some characteristic of this species? Please discuss this in detail. In addition, which genes were not identified in this species?

3. The authors identify Chr. 9a/9b as the sex chromosomes. However, no data were provided to support this conclusion, because no marker was developed from the differentiated regions on this chromosome to distinguish genetic males from females. Does the sockeye salmon have an XX/XY or a ZZ/ZW sex-determination system? Please indicate this in the introduction.

4. The authors analyze genomic associations with the kokanee ecotype. Some candidate genes, including aquaporin-3, trim45, etc., were identified and discussed in three populations, but no functional experiments were performed to demonstrate their role. In addition, does the expression of these genes show different patterns in different populations?

5. Figure 2B is reused in Figures 3, 4, 5, and 6. Please delete it.

6. Species names in the References should be in italics.

Reviewer #2: Christensen et al. describe a reference genome for sockeye salmon and conduct genomic analysis of sockeye salmon from across their range. The resource is highly valuable, the population genetic results are compelling, and the paper was well written. I recommend that the paper be accepted with very minor revisions. I only have a few comments.

I thought the finding of the large region of divergence between lineages on chromosome 26 was very interesting. I’m wondering if this could be an inversion or if some other mechanism is contributing to the large peak. Maybe this could be similar to inversions between lineages in Atlantic salmon? (e.g. https://doi.org/10.1111/mec.15065) I recommend the authors add a few sentences discussing potential mechanisms that could explain this large peak.

Line 330: I think “covariants” should be “covariates”

Line 419: Hanzen should be Hansen

Line 688: loci to locus

Line 716: I thought the inclusion of the double haploid in the runs of homozygosity analysis was unnecessary. I recommend removing this individual from the analyses and taking the section describing these results out of the discussion.

Reviewer #3: Comments to authors

This manuscript describes the acquisition and analyses of genome sequence data from a large number of sockeye salmon from throughout the species range. In addition, RNA-seq data were generated to help annotate the composite genome. The genome data were then analyzed for polymorphisms that were used to help answer questions important to the evolutionary ecology of sockeye salmon. The authors have generated a lot of sequence data and conducted many analyses. Constructing a genome sequence, especially in a species with residual tetrasomy, is challenging. However, I have many concerns, both with regard to the writing and clarity of the manuscript and with the analyses chosen. Below I split my concerns and comments into those I deem to be major revisions and those that are more minor in nature.

Major comments

1) I feel that the manuscript suffers from an identity crisis. Is the main message to present the genome/protein coding parts of the genome or is the main aim of the manuscript to describe the analyses of the genetic data to answer the genetic basis of interesting questions? These needn’t be mutually exclusive, but I think the manuscript would read better if the authors could focus the writing on one of these two big picture aims.

2) The writing is poor and in places very difficult to read. Please give the manuscript a thorough re-read before resubmitting. Below in the minor comments I will detail some specific line numbers and sentences that were especially difficult for me to follow.

3) The quality of the figures MUST be addressed. The DPI was too low to read the figures accurately, and this greatly diminished my enthusiasm for the manuscript. In addition, many of the figures seemed unnecessary, and I would recommend moving them to the supplemental data.

4) Some analyses should be completed before resubmission. Specifically, I would like to see more information on how the authors dealt with the homeologous regions of the sockeye genome. Were they removed from this assembly? A circos plot, similar to those for previous salmonid genomes, would be very helpful in that regard.

Minor comments/specific concerns

Abstract

Lines 28-30, arguably, this is a common feature of many different salmon and trout (e.g., rainbow trout and chinook). Maybe remove the “most complex and fascinating life histories” and instead say that “Repeatedly, a resident form known as kokanee…” and continue with the rest of lines 29-30.

Line 34: do a better job of linking these sentences; the polymorphisms within the immunoglobulin heavy chain are what is causing a large part of the differentiation between the three groups.

Introduction

Paragraph starting on line 52 is confusingly written and much of the information seems unnecessary. The take home message is that sockeye exist as multiple different life history ecotypes and one of these is the freshwater resident kokanee. I’d recommend deleting this paragraph, taking the important message (i.e., what I say above) and adding this to the paragraph starting on line 61.

Line 65: I’d suggest replacing “the hypothesis of…” with “is believed to be due to two common North American…”

Lines 69-71: this sentence must be clearer! The way I read it the kokanee appear to be monophyletic with respect to multiple rivers from the same area. Is that correct?

Lines 75-76: selection with respect to what in Atlantic salmon? As in how is variation in this gene associated with selection?

Line 81: delete “assembly”.

Lines 83-84: delete “various bodies of water”.

Lines 84-85: replace “that population structure” with ecotype divergence.

Materials and methods

Samples

For the two fish used in genome sequencing and transcriptome sequencing please state what population these samples originated from. How old were the samples?

Line 108/ figure 1: I’d recommend removing figure 1 as it is difficult to determine how many samples from each population there are and instead displaying that information as a table with the location, latitude and longitude, sample size, number of each sex, and the number of kokanee versus anadromous sockeye. This would be a useful resource for the reader as they progress reading the manuscript.

Lines 137-142: I am guessing that the samples were barcoded to allow pooling when sequenced? I’m assuming yes, but I can’t find the specific details in the methods either here or in the variant calling section. Some mention for how samples were barcoded and how sequences were separated by sample should be made.

Line 144: delete “that were sent”.

Line 147: the NEBNext RNA first strand synthesis is a way of synthesizing cDNA not for enriching extracted RNA for mRNA.

Line 165: what quality filters were used?

Line 169-170: I don’t understand the “and using paired-end data” add on. I have a feeling this should be a separate sentence.

Line 174: “corrected” PacBio reads, should that be filtered or quality filtered?

Line 198: “found” in Christensen et al should be “described”.

Line 205: why were alignments filtered? What were the authors trying to remove?

Line 210: please add sockeye to “previously published genetic maps”.

Line 248: please replace “truth datasets” with to validate candidate SNPs.

Line 249: please replace “the truth without errors” with real.

Line 281-283: this sentence needs a rewrite. I’d recommend “Each of the methods used filtered variants to reduce the effects of high LD on subsequent analyses”.

Line 299-300: “from the clustering methodologies” (see clustering individuals section) should be deleted.

Lines 304-307: I’m confused by what was done here and why. What do the authors mean by “allele balance”? why wasn’t LD filtered?

Line 307: what was the p-value cutoff for the Bonferroni correction?

Lines 310-312: I am confused as to what data were used for this analysis and why.

LD section

Please check the superscript formatting of R2 it seems off.

I’m concerned about filtering for minimum R2 value around regions that might be in high LD. Would doing so give an inflated LD? In other words, by having these minimum cutoffs are the authors getting an accurate estimation of LD for the region of the genome under study?

Individual genomic diversity

I understand why the authors looked at runs of homozygosity, but I don’t think it adds much to the story. I’d suggest removing to supplemental information.

Results

Gene annotation

Is there a link for interested researchers to download the annotated gene information?

Variant calling

Rewrite the first sentence. Maybe something like this “A total of 25,728,393 variants in 140 individuals were filtered to remove indels, SNPs with more than two alleles, maf <0.05, and were genotyped in more than 90% of samples to leave 4,533,143. These variants were further filtered to…”

Was there no limitation based on the number of reads necessary for scoring SNPs? Was there a minimum number of reads necessary for determining heterozygotes from homozygous genotypes? What did the authors do with homeologous regions?

Figure legends. This really confused me and I’d request that the authors move the figure legends to before the figures rather than embedding them in the results section.

Line 463: the wording is clumsy here. The nearness of the lowest p-value SNP to aldh9a1 is the important point.

Line 505: why do the authors think it’s unclear if cpa6 is the most likely gene under this association peak?

Line 555: is this one SNP or several? Where is it in terms of the genome, is it near any other genes?

Figures and Tables

Figure 1: again, please change this to a table. I think it would be easier to read and interpret.

Figure 2: the resolution is extremely bad and needs improving. The x and y axes in part A are confusing. Are these principal components 1 and 2? What are the two inlay figures? It is unclear from the figure legend. Parts B and C are interesting, but again it’s nearly impossible to read the text associated with the figures.

Figure 3: again, the resolution is poor and I’m not convinced the figure is necessary. There are a lot of figures that are showing, essentially, the same thing. I’d recommend removing figure 3 to the supplemental info.

Figure 4: part A: it’s difficult to read, but it seems that the Bonferroni significant cut off is above 15. How are the authors determining this? I thought Bonferroni significance is determined by dividing alpha by the sample size. In this case 0.05 / total number of SNPs (450868) should give a negative log10 P value of 6.955. I’m also confused about part C: The authors want to show how LD breaks down around the SNP with the strongest association but I’m struggling to see how LD breaks down in this figure. My suggestion would be to do a more traditional LD heatmap.

Figure 5: again, the quality is poor although the results are interesting. I think these data might be better presented in a table that reports on the comparison being made, i.e., Group 3 versus Group 2, the p-value, the gene, and the location of the gene on the sockeye genome.

Figure 6: The possibility of an inversion that appears to be associated with the groupings is very interesting. However, it’s difficult to interpret the figure and make the link between the data and how they support the hypothesis of an inversion. Would not a simple LD heatmap show the same data in an easier way? In addition, the possible inversion on chr 24 was not discussed. Are there other datasets that might point towards an inversion on chr24, or is this completely novel? If novel this needs to be better characterized.

Figure 7: part A seems unnecessary. Please tell the reader how many samples were included in this analysis and how many were kokanee versus anadromous sockeye. Part B is interesting but again I’m unsure how the lines for significance were drawn. Part C is interesting but needs clarity, what does the 1st, 2nd, 3rd, 4th refer to? I could not deduce that from the figure legend. I also think this is not the best way to present these data. A simple bar graph that shows the proportion of kokanee that are ancestral, heterozygous, and homozygous for the alternative allele and the same with anadromous sockeye would be far simpler.

Table 1: confusing and badly formatted. Please add gene abbreviations to a new column.

Figure 8: please format the axes in part E so that the distance is presented as MB rather than bases. I’d also recommend changing the scale bar for LD to make the distinction between areas with high LD and low LD clearer.

Figure 9: does not add to the paper, I’d suggest removing or moving to the supplemental information.

Discussion

The first sentence needs a rewrite. Put the contribution of the sockeye genome in the bigger context better. Something like “adds to a growing number of completed salmonid genomes”

Lines 601-603: what does this suggest?

Lines 609-610: take out the “with a slight discrepancy between…” it doesn’t add to the sentence.

Lines 613-614: what does this suggest with respect to the number of genetic populations and the potential isolation of samples from British Columbia?

Line 618: what does this suggest with respect to separation of kokanee and sockeye?

Lines 631-636: the separation of populations on the basis of immunoglobulin heavy chain is interesting and needs more exploration. Why might there be a difference at this locus? What about the gene duplication between chr21 and chr26 are these homeologous in sockeye?

Line 644: three loci have also been found to be associated with ecotype diversity in other studies? If so, how? Between kokanee and sockeye? Or between beach and stream spawning?

The whole section on genes associated with life history ecotype development needs work. For example, the authors mention neuregulin 3, but make no effort to discuss why that gene might be different between kokanee and sockeye. Same with the other genes connected to phototransduction, skeletal development, and immunity.

Line 674: what do the authors mean by conserved ecotype associations? Conserved between studies or between populations?

Line 691: why might aquaporin-3 alleles be associated with ecotype development?

Lines 694-695: is this different from how the Y chromosome formed in other salmonids? How do the homologs compare with Y chromosomes in other salmonids?

The lack of sdY in some populations of sockeye salmon is interesting and warrants further discussion.

Line 706: which chromosome is the kruppel-like factor 5 gene on?

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Oct 29;15(10):e0240935. doi: 10.1371/journal.pone.0240935.r002

Author response to Decision Letter 0


18 Aug 2020

Editor’s Comments:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

This was checked.

2. To comply with PLOS ONE submissions requirements, please provide methods of sacrifice in the Methods section of your manuscript.

Added euthanasia protocols for samples taken for this study.

3. We note that you are reporting an analysis of a microarray, next-generation sequencing, or deep sequencing data set. PLOS requires that authors comply with field-specific standards for preparation, recording, and deposition of data in repositories appropriate to their field. Please upload these data to a stable, public repository (such as ArrayExpress, Gene Expression Omnibus (GEO), DNA Data Bank of Japan (DDBJ), NCBI GenBank, NCBI Sequence Read Archive, or EMBL Nucleotide Sequence Database (ENA)). In your revised cover letter, please provide the relevant accession numbers that may be used to access these data. For a full list of recommended repositories, see http://journals.plos.org/plosone/s/data-availability#loc-omics or http://journals.plos.org/plosone/s/data-availability#loc-sequencing.

All raw sequences had already been submitted to NCBI’s SRA. The assembly already had an accession as well, but we added the NCBI BioProject ID so all 170 SRA accessions can be identified (e.g. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA530256/) for the project.

4. In your Methods section, please provide additional location information of the sampling sites, including geographic coordinates for the data set if available.

Added - Table with approximate geographic coordinates.

Reviewer #1 comments

1. Line 374-376, please add references.

Added - NCBI annotation reports (NCBI website)

2. Line 372, the sockeye salmon, sockeye salmon, and coho salmon are closely related species. Why were far fewer protein-coding genes annotated in this species than in salmon and trout? Does this indicate poor quality of the assembled genome or some characteristic of this species? Please discuss it in detail. In addition, which genes were not identified in this species?

Added to the results section - gene count discrepancy between species, likely due to assembly quality differences and/or differences in RNA-seq data sets. Count discrepancies were found between version 1 and 2 of the coho salmon genome assemblies as well (version 1: 36,425 vs. version 2: 41,269) and likely represent quality differences. This is discussed further in the new results section (Orthology between species).

Identifying missing genes between species is difficult as genes are inconsistently named between species with names like LOC123456789 in one species and the ortholog named LOC234567890 in another. This requires assignment of orthologs between species first.

Added to the method section: Method for identifying orthologs between sockeye salmon and two other species. Supplemental figures show that “missing” genes are disproportionately found in the telomeric ends, supporting that they are missing because of difficulties in assembling complex (highly repetitive) regions of the chromosomes, which has been discussed previously (citation in text).

3. The authors identify the Chr.9a/9b as the sex chromosomes. However, no data were provided to support this conclusion because no marker was developed based on the differentiated regions on this chromosome to distinguish genetic males from females. Does the sockeye salmon have an XX/XY or ZZ/ZW sex determination system? Please indicate this in the introduction.

Chr 9a and 9b were identified as the X-chromosomes in a previous study. This is made more clear in the Introduction and more information is given about sex-determination in sockeye salmon.

4. The authors analyze the genomic associations with kokanee ecotype. Some candidate genes, including aquaporin-3, trim45, etc., were identified and discussed in three populations. But no functional experiments were performed to demonstrate it. In addition, did the expression of these genes show different patterns in different populations?

While it would be great to test these candidate genes, this manuscript’s scope was to identify populations and identify candidate genes that might be under selection. Future work will be to validate or eliminate these candidate genes as this work would take much longer to perform.

5. In Figure 2B, this figure was used in Figure 3, 4, 5 and 6. Please delete it.

Figure 2B was removed and the text was updated based on this removal.

6. Species name in the References should be in italics.

Changed.

Reviewer #2 comments

I thought the finding of the large region of divergence between lineages on chromosome 26 was very interesting. I’m wondering if this could be an inversion or if some other mechanism is contributing to the large peak. Maybe this could be similar to inversions between lineages in Atlantic salmon? (e.g. https://doi.org/10.1111/mec.15065) I recommend the authors add a few sentences discussing potential mechanisms that could explain this large peak.

Added to the results section that this is possibly an inversion based on haplotypes. Added to the discussion that this may be a local adaptation that became fixed in glacial refugia and that is maintained between populations by reduced recombination and underdominance. Based on paired-end alignments, there wasn’t strong evidence that this was an inversion. This is discussed in the text and includes alternative mechanisms.

Line 330: I think “covariants” should be “covariates”

Changed to covariates (found twice in the manuscript and changed both).

Line 419: Hanzen should be Hansen

Changed to Hansen.

Line 688: loci to locus

Changed loci to locus in two locations in the manuscript.

Line 716: I thought the inclusion of the double haploid in the runs of homozygosity analysis was unnecessary. I recommend removing this individual from the analyses and taking the section describing these results out of the discussion.

We agree that the doubled haploid result is obvious and that the manuscript would read better without this individual, but we think it is important to include. It isn’t completely clear from previous research that doubled haploids retain the DNA from only one parent (please see article: Isogenic lines in fish – a critical review). This whole genome analysis of a doubled haploid might be useful for future comparisons and isogenic line development. However, we have reduced this discussion and it is only briefly mentioned now with most of the results and discussion removed.

Reviewer #3 comments

1) I feel that the manuscript suffers from an identity crisis. Is the main message to present the genome/protein coding parts of the genome or is the main aim of the manuscript to describe the analyses of the genetic data to answer the genetic basis of interesting questions? These needn’t be mutually exclusive, but I think the manuscript would read better if the authors could focus the writing on one of these two big picture aims.

We have added a goal statement in the abstract to try to give the reader a better sense of the focus and goals. This should make the thesis of the manuscript more explicit. We have also tried to narrow the focus by moving several figures to supplemental data and removing unnecessary sentences guided by all of the reviewers’ comments.

2) The writing is poor and in places very difficult to read. Please give the manuscript a thorough re-read before resubmitting. Below in the minor comments I will detail some specific line numbers and sentences that were especially difficult for me to follow.

Edits were made to simplify sentences and increase the ease of reading. Specific comments were addressed below.

3) The quality of the figures MUST be addressed. The DPI was too low to accurately read figures and this greatly diminished my enthusiasm for the manuscript. In addition, many of the figures seemed unnecessary and I would recommend moving to the supplemental data.

The figures have gone through the PACE software for this journal and are at the recommended DPI. It may be that the figures need to be downloaded by the reviewer. Journals will often have low quality figures in auto-generated PDFs for review. They are meant to be downloaded separately. Figures 1, 5, 6, 7, and 9 were moved to supplementary.

4) Some analyses should be completed before resubmission. Specifically, I would like to see more information on how the authors dealt with the homeologous regions of the sockeye genome. Were they removed from this assembly? A circos plot, similar to previous salmonid genomes, would be very helpful in that regard.

A circos plot was generated with common metrics shown. Homeologous regions are briefly addressed in the results section. The entire genome has homeologous regions and for the most part it appears they were successfully differentiated. It is difficult to both assemble and place contigs onto chromosomes for regions where the sequence similarity is still very high between homeologous regions. In lower-quality genome assemblies, these regions are typically left as contigs because there is not enough information to put them in the correct place on chromosomes. None of these regions were removed, but may have been left as unplaced contigs. Later versions of the genome assembly will be using long-read technology to address these issues.

Lines 28-30, arguably, this is a common feature of many different salmon and trout (e.g., rainbow trout and chinook). Maybe remove the “most complex and fascinating life histories” and instead say that “Repeatedly, a resident form known as kokanee…” and continue with the rest of lines 29-30.

This line was removed.

Line 34: do a better job of linking these sentences, the polymorphisms within the immunoglobulin heavy chain are what’s causing a large part of the differentiation between the three groups.

This sentence was altered to better link the two sentences and make it clear that it is the variants rather than the gene itself.

Paragraph starting on line 52 is confusingly written and much of the information seems unnecessary. The take home message is that sockeye exist as multiple different life history ecotypes and one of these is the freshwater resident kokanee. I’d recommend deleting this paragraph, taking the important message (i.e., what I say above) and adding this to the paragraph starting on line 61.

Paragraph deleted and sentence added as requested.

Line 65: I’d suggest replacing “the hypothesis of…” with “is believed to be due to two common North American…”

Changed.

Lines 69-71: this sentence must be clearer! The way I read it the kokanee appear to be monophyletic with respect to multiple rivers from the same area. Is that correct?

That is correct. We have added a clarifying sentence.

Lines 75-76: selection with respect to what in Atlantic salmon? As in how is variation in this gene associated with selection?

Added (associated with upstream catchment).

Line 81: delete “assembly”.

While it makes the sentence sound better if we remove assembly, it makes the statement inaccurate. The assembled sequence is a genome assembly and not a genome (for example: https://uswest.ensembl.org/Help/Faq?id=216). Instead, we removed “sequenced and assembled” and added generated in their place.

Lines 83-84: delete “various bodies of water”.

Removed.

Lines 84-85: replace “that population structure” with ecotype divergence.

Changed with a slight modification to let the reader know that we looked at divergence between populations and ecotype.

For the two fish used in genome sequencing and transcriptome sequencing please state what population these samples originated from. How old were the samples?

These are Pitt Lake sockeye salmon (please see Samples section). The age was added to the Samples section.

Line 108/ figure 1: I’d recommend removing figure 1 as it is difficult to determine how many samples from each population there are and instead displaying that information as a table with the location, latitude and longitude, sample size, number of each sex, and the number of kokanee versus anadromous sockeye. This would be a useful resource for the reader as they progress reading the manuscript.

Figure 1 was moved to supplemental material, but kept as many of these locations would not be known to most people and the map gives an easier depiction of location than a table can. A table was added with aggregate information.

Lines 137-142: I am guessing that the samples were barcoded to allow pooling when sequenced? I’m assuming yes, but I can’t find the specific details in the methods either here or in the variant calling section. Some mention for how samples were barcoded and how sequences were separated by sample should be made.

The NxSeq Adaptors were used to barcode samples from the NxSeq AmpFREE Low DNA Library Kit (please see text). All fastq files were received as single files (per individual) from McGill University and Génome Québec Innovation Centre. From my understanding, these files are generated with the standard Illumina software (please see: https://support.illumina.com/content/dam/illumina-support/documents/documentation/system_documentation/hiseqx/hiseq-x-system-guide-15050091-07.pdf). This is a standard practice if custom barcodes are not used.

Line 144: delete “that were sent”.

Removed.

Line 147: the NEBNext RNA first strand synthesis is a way of synthesizing cDNA not for enriching extracted RNA for mRNA.

This sentence has been modified to clarify.

Line 165: what quality filters were used?

This is made more clear in the text (i.e. only adaptors were removed based on review of the output from FastQC)

Line 169-170: I don’t understand the “and using paired-end data” add on. I have a feeling this should be a separate sentence.

Paired-end data (which has low error-rates) is used to correct the PacBio reads. This was made more clear in the manuscript.

Line 174: “corrected” PacBio reads, should that be filtered or quality filtered?

PacBio reads are error prone and they are corrected—not filtered. For example: https://www.pacb.com/publications/lordec-accurate-and-efficient-long-read-error-correction/

Line 198: “found” in Christensen et al should be “described”.

Changed.

Line 205: why were alignments filtered? What were the authors trying to remove?

The following text was added to describe what was being filtered: “(e.g. off-target or repetitive elements).”

Line 210: please add sockeye to “previously published genetic maps”.

Added.

Line 248: please replace “truth datasets” with to validate candidate SNPs.

Truth set is the correct nomenclature and having that nomenclature is important for reproducibility. They were not used to validate candidate SNPs, they were used as a truth set in a model to score other variants. Please see: https://gatk.broadinstitute.org/hc/en-us/articles/360035890831-Known-variants-Training-resources-Truth-sets.

Line 249: please replace “the truth without errors” with real.

Changed.

Line 281-283: this sentence needs a rewrite. I’d recommend “Each of the methods used filtered variants to reduce the effects of high LD on subsequent analyses”.

The sentence was rewritten: “To reduce the effects of high LD, variants that had been filtered for LD were used in the three clustering methods.”

Line 299-300: “from the clustering methodologies” (see clustering individuals section) should be deleted.

Removed.

Lines 304-307: I’m confused by what was done here and why. What do the authors mean by “allele balance”? why wasn’t LD filtered?

Reworded the sentence to simplify and added a mention of allele balance in the methods section to explain what it means. Allele balance is the ratio between the read counts of the two alleles. If, for example, you have the A/B genotype, you might have 100 A counts and 100 B counts, which is a 1:1 ratio (and expected based on chromosome segregation). If you have 10 A counts and 100 B counts, that would suggest something is wrong with this locus.

For association studies, you typically have one phenotype that you are comparing at a time. In this case we are using LD1 values from the DAPC analysis for the phenotype. In order to tell what is specifically different between groups, only two groups could be compared at a time.

LD wasn’t filtered because that isn’t typically done and it would remove the peaks that are commonly found in association studies. For example, if we had a causal variant at 10,000 bp we would expect an association with this variant and other variants near it, but the p-values would increase as the distance increased from the causal variant due to recombination. If we filtered on LD, we would remove all but one of the variants in this region and possibly remove the causal variant or a variant that had a lower p-value than the one kept.
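The allele balance example in this response can be sketched in a few lines of Python. This is an illustration only, not the code used for the manuscript: the function names and the 0.2 cutoff are hypothetical, and allele balance is expressed here as the fraction of reads supporting the less common allele (an assumption, since tools define the metric in slightly different ways).

```python
def allele_balance(a_reads: int, b_reads: int) -> float:
    """Fraction of reads at a heterozygous call that support the less common allele.

    A value near 0.5 corresponds to the expected 1:1 ratio; a value near 0
    means one allele is supported by very few reads.
    """
    total = a_reads + b_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return min(a_reads, b_reads) / total


def is_balanced(a_reads: int, b_reads: int, min_balance: float = 0.2) -> bool:
    # min_balance is a hypothetical cutoff chosen for illustration,
    # not a value taken from the manuscript.
    return allele_balance(a_reads, b_reads) >= min_balance


print(allele_balance(100, 100))            # 0.5, the expected 1:1 ratio
print(round(allele_balance(10, 100), 3))   # 0.091, a suspicious locus
```

With these definitions, the 100:100 heterozygote passes the check and the 10:100 heterozygote is flagged.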

Line 307: what was the p-value cutoff for the Bonferroni correction?

Added that the alpha value was 0.01 before correction.
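For readers checking the arithmetic, the sketch below shows how a Bonferroni cutoff translates to the -log10(p) scale of a Manhattan plot. It uses the 450,868 SNPs cited in the reviewer's Figure 4 comment purely as an example; the number of tests in any given analysis may differ, and the function names are hypothetical.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff: alpha divided by the number of tests."""
    return alpha / n_tests


def neg_log10(p: float) -> float:
    """Convert a p-value to the -log10 scale used on Manhattan plots."""
    return -math.log10(p)


n_snps = 450_868  # example count taken from the reviewer's comment
for alpha in (0.05, 0.01):
    cutoff = bonferroni_threshold(alpha, n_snps)
    print(alpha, round(neg_log10(cutoff), 3))
# prints:
# 0.05 6.955
# 0.01 7.654
```

The first line reproduces the reviewer's value of 6.955 for alpha = 0.05; an alpha of 0.01 with the same number of tests moves the line to about 7.65.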

Lines 310-312: I am confused as to what data were used for this analysis and why.

In this section, we were looking for chromosomal variation underlying population structure. There is an explanation of what EigenGWA is used for earlier in the paragraph. In the previous section, it was said that there were three grouping methods, DAPC, admixture, and a phylogenetic tree. In these lines (310-312), instead of the DAPC values used to identify groups, we used the admixture values to compare sockeye and kokanee from a subset of the samples. That was done because there was a difference in admixture values between kokanee and sockeye. This means there is a clear genetic difference between kokanee and sockeye in these geographic regions. By looking at this subset of individuals we are isolating that difference and by using the admixture values we are specifically looking for the part of the genome that best underlies this difference.

LD section

Please check the superscript formatting of R2 it seems off.

This will be checked in preparation for publication. It looks good in the LibreOffice Writer text editor.

I’m concerned about filtering for minimum R2 value around regions that might be in high LD. Would doing so give an inflated LD? In other words, by having these minimum cutoffs are the authors getting an accurate estimation of LD for the region of the genome under study?

This was done for visualization and not used in any analyses. We were not trying to estimate LD in these regions only to show the markers that are associated with the population.
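The reviewer's concern can be illustrated with a toy example: if only marker pairs above a minimum R2 are kept, the average R2 of what remains is higher than the average over all pairs, so the filtered set should be read as a display of associated markers rather than as an estimate of LD. The sketch below assumes R2 is computed as the squared Pearson correlation of genotype dosages (0, 1, 2); the genotypes and the 0.5 cutoff are made up for illustration and do not come from the manuscript.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic marker: treated as uninformative here
    return cov * cov / (var_x * var_y)


# Toy data: one focal SNP and four neighbouring SNPs in six individuals.
focal = [0, 0, 1, 1, 2, 2]
neighbours = [
    [0, 0, 1, 1, 2, 2],  # identical to the focal SNP
    [0, 1, 0, 1, 2, 2],  # partially correlated
    [1, 0, 2, 0, 1, 2],  # weakly correlated
    [1, 1, 1, 1, 1, 1],  # no variation
]

r2_values = [r_squared(focal, snp) for snp in neighbours]
min_r2 = 0.5  # hypothetical display cutoff
kept = [v for v in r2_values if v >= min_r2]

mean_all = sum(r2_values) / len(r2_values)
mean_kept = sum(kept) / len(kept)
print(mean_all, mean_kept)  # the mean after filtering is the larger of the two
```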

Individual genomic diversity

I understand why the authors looked at runs of homozygosity, but I don’t think it adds much to the story. I’d suggest removing to supplemental information.

Figure 9 was moved to supplemental data. These metrics are an important aspect of sockeye salmon biology and the first attempt at finding them on this scale for the whole genome. They are also important for conservation and regulatory purposes and should therefore be reported to as wide of an audience as possible. Moving them to supplemental data would be counterproductive to this purpose.

Results

Gene annotation

Is there a link for interested researchers to download the annotated gene information?

Added to references.

Variant calling

Rewrite the first sentence. Maybe something like this “A total of 25,728,393 variants in 140 individuals were filtered to remove indels, SNPs with more than two alleles, maf <0.05, and were genotyped in more than 90% of samples to leave 4,533,143. These variants were further filtered to…”

Changed.
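The filtering steps named in that sentence (no indels, no SNPs with more than two alleles, minor allele frequency of at least 0.05, genotyped in more than 90% of samples) can be sketched as a single predicate. This is a simplified illustration on toy records, not the pipeline used for the manuscript; the `Variant` class and `passes_filters` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    ref: str          # reference allele
    alts: list        # alternate alleles
    genotypes: list   # per-sample alternate-allele counts (0, 1, 2), None if missing


def passes_filters(v: Variant, min_maf: float = 0.05, min_call_rate: float = 0.9) -> bool:
    # Keep biallelic sites only (removes SNPs with more than two alleles).
    if len(v.alts) != 1:
        return False
    # Keep single-base substitutions only (removes indels).
    if len(v.ref) != 1 or len(v.alts[0]) != 1:
        return False
    # Require genotypes in more than min_call_rate of the samples.
    called = [g for g in v.genotypes if g is not None]
    if not called or len(called) / len(v.genotypes) <= min_call_rate:
        return False
    # Minor allele frequency among called diploid genotypes.
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq) >= min_maf


# Toy records for 20 samples each.
variants = {
    "snp_ok": Variant("A", ["G"], [0] * 14 + [1] * 4 + [2] * 2),
    "indel": Variant("A", ["AT"], [0] * 14 + [1] * 4 + [2] * 2),
    "multiallelic": Variant("A", ["G", "T"], [0] * 14 + [1] * 4 + [2] * 2),
    "rare": Variant("A", ["G"], [0] * 19 + [1]),
    "low_call_rate": Variant("A", ["G"], [None] * 3 + [0] * 10 + [1] * 7),
}

kept = [name for name, v in variants.items() if passes_filters(v)]
print(kept)  # ['snp_ok']
```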

Was there no limitation based on the number of reads necessary for scoring SNPs? Was there a minimum number of reads necessary for determining heterozygotes from homozygous genotypes?

GATK works by generating a model of “good” variants that it then uses to score the rest of the variants. It may use read depth information, but the user does not supply a threshold and it wouldn’t be a set threshold in the results because other metrics might increase or decrease the score besides this value.

What did the authors do with homeologous regions?

The entire genome is composed of homeologous regions, but for homeologous regions with very high sequence similarity we didn’t do anything different or special. What we did notice is that these regions tended to have far fewer variants. This could occur for a few reasons. The first reason is that reads would have a much lower mapping score because they could be mapped to multiple locations in the genome equally well. The GATK model would likely treat these regions as poorly supported and call fewer variants in them. Another reason could be that these regions have more recombination between homologous and homeologous chromosomes that could influence variant retention. After all, these regions have retained high sequence similarity > 95% (for regions that were able to be placed, and this likely underrepresents the true value) for around 90 million years. There must be a mechanism that reduces the accumulation of mutations in these regions to maintain that high sequence similarity for so long.

Figure legends. This really confused me and I’d request that the authors move the figure legends to before the figures rather than embedding them in the results section.

This is a journal requirement.

Line 463: the wording is clumsy here. The nearness of the lowest p-value SNP to aldh9a1 is the important point.

Changed to only say where the variant with the lowest p-value was found.

Line 505: why do the authors think it’s unclear if cpa6 is the most likely gene under this association peak?

Added that the variant with the second lowest p-value is located in a nearby gene.

Line 555: is this one SNP or several? Where is it in terms of the genome, is it near any other genes?

Added that this was multiple variants, and the approximate region and candidate gene.

Figures and Tables

Figure 1: again, please change this to a table. I think it would be easier to read and interpret.

Changed.

Figure 2: the resolution is extremely bad and needs improving. The x and y axes in part A are confusing. Are these principal components 1 and 2? What are the two inlay figures? It is unclear from the figure legend. Parts B and C are interesting, but again it’s nearly impossible to read the text associated with the figures.

We believe the resolution issue is from the review manuscript quality generated by the journal. Journals often ask for high-quality figures to be downloaded separately for review (the same issue applies to all figures). Added what the axes are (linear discriminants, which are analogous to principal components). An error was found in this figure, with the DAPC groups mixed up on the DAPC figure. This was fixed throughout the manuscript.

Figure 3: again, the resolution is poor and I’m not convinced the figure is necessary. There are a lot of figures that are showing, essentially, the same thing. I’d recommend removing figure 3 to the supplemental info.

Figure 1 was moved to supplemental data. This figure is the only one left for readers to be able to see where each group is relative to one another and is vital for understanding clustering. It would be very difficult to imagine this figure from latitude and longitude positions alone.

Figure 4: part A: it’s difficult to read, but it seems that the Bonferroni significant cut off is above 15. How are the authors determining this? I thought Bonferroni significance is determined by dividing alpha by the sample size. In this case 0.05 / total number of SNPs (450868) should give a negative log10 P value of 6.955. I’m also confused about part C: The authors want to show how LD breaks down around the SNP with the strongest association but I’m struggling to see how LD breaks down in this figure. My suggestion would be to do a more traditional LD heatmap.

This was a coding error: log was used to draw the lines instead of log10. This was addressed in all figures and in the manuscript. This mistake made it necessary to edit several sections of the manuscript and to increase our criteria for a real peak. There were many more associations when considering the lower threshold. To analyze only the most robust results, we increased the number of nearby variants required for a peak to be considered real from 3 to 5. We thank the reviewer for catching this. It changed the results and discussion quite a bit. For part C, we were trying to show haploblocks; this is now better explained in the manuscript and shown in better detail in the figure.
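
The arithmetic behind this exchange can be checked with a short sketch. It uses the reviewer's figures (ɑ = 0.05 and 450,868 SNPs) and compares the base-10 and natural-log scales. The response does not say which language or function the authors used, so reading "log" as the natural logarithm is an assumption, although it is consistent with the reviewer seeing a cut-off above 15.

```python
import math

alpha = 0.05
n_tests = 450_868  # number of SNPs quoted by the reviewer

# Bonferroni-corrected p-value threshold: alpha divided by the number of tests.
p_threshold = alpha / n_tests

# Scale normally used on the y-axis of a Manhattan plot.
neg_log10 = -math.log10(p_threshold)

# Natural log of the same threshold (assumed source of the misplaced line).
neg_ln = -math.log(p_threshold)

print(round(neg_log10, 3))  # 6.955
print(round(neg_ln, 2))     # 16.01
```

The two values differ by a factor of ln(10) ≈ 2.303, so a line drawn on the natural-log scale sits well above the intended threshold.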

Figure 5: again, the quality is poor although the results are interesting. I think these data might be better presented in a table that reports on the comparison being made, i.e., Group 3 versus Group 2, the p-value, the gene, and the location of the gene on the sockeye genome.

This was changed to a table.

Figure 6: The possibility of an inversion that appears to be associated with the groupings is very interesting. However, it’s difficult to interpret the figure and make the link between the data and how they support the hypothesis of an inversion. Would not a simple LD heatmap show the same data in an easier way? In addition, the possible inversion on chr 24 was not discussed. Are there other datasets that might point towards an inversion on chr24, or is this completely novel? If novel this needs to be better characterized.

A scatterplot with r2 values was added to the now supplemental figure to better visualize the block; in addition, the possible inverted region(s) were highlighted. As far as we can tell, this possible inversion has not been mentioned in the literature before. We tried to confirm the potential inversion with paired-end data, but no convincing evidence was found. We are now using software to call inversions, but it may take much longer to resolve this with certainty, and we believe this is outside the scope of this manuscript. For now we have discussed paired-end alignments and other mechanisms that might cause large haploblocks.
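
For readers unfamiliar with the r2 values mentioned here, one common way to approximate linkage disequilibrium between two biallelic variants is the squared Pearson correlation of their genotypes coded as alternative-allele counts (0, 1, 2). The sketch below illustrates that approximation only. The response does not state which software or estimator the authors used, and the function name is hypothetical.

```python
from typing import Sequence


def genotype_r2(x: Sequence[int], y: Sequence[int]) -> float:
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # A monomorphic variant carries no information about LD.
    if var_x == 0 or var_y == 0:
        return 0.0
    return (cov * cov) / (var_x * var_y)
```

Values near 1 indicate that the two variants are inherited together across samples, which is the pattern expected within a large haploblock.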

Figure 7: part A seems unnecessary. Please tell the reader how many samples were included in this analysis and how many were kokanee versus anadromous sockeye. Part B is interesting but again I’m unsure how the lines for significance were drawn. Part C is interesting but needs clarity, what does the 1st, 2nd, 3rd, 4th refer to? I could not deduce that from the figure legend. I also think this is not the best way to present these data. A simple bar graph that shows the proportion of kokanee that are ancestral, heterozygous, and homozygous for the alternative allele and the same with anadromous sockeye would be far simpler.

Part A was removed, and the sample number (sockeye salmon n=14, kokanee n=12) was added to both the methods section and the figure legend (now in supplementary). The threshold lines were redrawn with the correct values. For part C, the 1-4th lines referred to the significant variants with the lowest p-values. This is clarified in the text. We agree that a bar graph would be simpler, but we prefer this format as it allows the readers to see the haplotypes in this region for themselves. We reduced the size of the screenshot to simplify viewing.

Table 1: confusing and badly formatted. Please add gene abbreviations to a new column.

This table was reformatted and gene abbreviations were added.

Figure 8: please format the axes in part E so that the distance is presented as MB rather than bases. I’d also recommend changing the scale bar for LD to make the distinction between areas with high LD and low LD clearer.

Changed.

Figure 9: does not add to the paper, I’d suggest removing or moving to the supplemental information.

Moved Figure 9 to supplementary figures. It is still discussed in the main text because these metrics are important for management, comparisons between species, and understanding the amount of genetic variation that can be commonly expected in this species.

The first sentence needs a rewrite. Put the contribution of the sockeye genome in the bigger context better. Something like “adds to a growing number of completed salmonid genomes”

Changed.

Lines 601-603: what does this suggest?

Added that it suggests that it is of lower quality.

Lines 609-610: take out the “with a slight discrepancy between…” it doesn’t add to the sentence.

Removed.

Lines 613-614: what does this suggest with respect to the number of genetic populations and the potential isolation of samples from British Columbia?

Added speculation as to why this might have occurred.

Line 618: what does this suggest with respect to separation of kokanee and sockeye?

Added speculation as to why this might have occurred.

Lines 631-636: the separation of populations on the basis of immunoglobulin heavy chain is interesting and needs more exploration. Why might there be a difference at this locus? What about the gene duplication between chr21 and chr26 are these homeologous in sockeye?

Added a possible explanation for why the immunoglobulin heavy chain might be involved. Added that chr21 and chr26 are homeologous.

Line 644: three loci have also been found to be associated with ecotype diversity in other studies? If so, how? Between kokanee and sockeye? Or between beach and stream spawning?

Added to the discussion that the markers from the previous studies were aligned to the genome to enable comparison (genomic positions and marker names were already included in the discussion). Added the ecotype under comparison and analysis.

The whole section on genes associated with life history ecotype development needs work. For example, the authors mention neuregulin 3, but make no effort to discuss why that gene might be different between kokanee and sockeye. Same with the other genes connected to phototransduction, skeletal development, and immunity.

Added discussion of why these genes might be connected to ecotype.

Line 674: what do the authors mean by conserved ecotype associations? Conserved between studies or between populations?

Added to the sentence: (i.e. an association identified in the analysis between all sockeye and kokanee)

Line 691: why might aquaporin-3 alleles be associated with ecotype development?

This section was removed because aquaporin-3 did not pass the new threshold of 5 significant variants.

Lines 694-695: is this different from how the Y chromosome formed in other salmonids? How do the homologs compare with Y chromosomes in other salmonids?

The lack of sdY in some populations of sockeye salmon is interesting and warrants further discussion.

Added a section to the Introduction discussing salmonid Y chromosomes and comparisons. The lack of sdY is expounded on in the Discussion by adding details from other studies that have found sdY negative populations. The origin of the sdY gene is also discussed and how KLF5 might also influence sex.

Line 706: which chromosome is the kruppel-like factor 5 gene on?

Added “(LG1 6,248,507 - 6,256,452)” to the manuscript.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Zuogang Peng

30 Sep 2020

PONE-D-20-12784R1

The sockeye salmon genome, transcriptome, and analyses identifying population defining regions of the genome

PLOS ONE

Dear Dr. Christensen,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The manuscript has been reviewed by one of the previous referees. I agree with the referee that some minor revisions are still needed. I invite you to submit a revised version that addresses all the concerns raised.

Please submit your revised manuscript by Nov 14 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Zuogang Peng, Ph.D.

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3: Comments to the authors

The authors have addressed all of my comments and I feel the manuscript is now suitable for publication, albeit with a couple of small changes. I would like to mention that I am impressed with the improvement in the quality of the writing and the authors should be commended in that regard.

Major comments

The only major comment I have concerns the length of the new version of the manuscript. I tried to find a section or two that could be moved to supplemental, but the only good candidate was the sampling details on lines 93-129 (as well as table one).

Minor comments

Lines 53-54: start the sentence with “This split between…” and then add “suggests” between “salmon species and” and “two common North American…”

Line 57: add “the phenotype is” to the text within brackets, “(i.e., the phenotypes are polyphyletic)”

I think adding some text to line 58 might help the reader understand why the Fraser and Columbia Rivers are different to the rest of the sockeye range. Perhaps something like “where multiple populations of kokanee are more closely related to each other than sympatric sockeye salmon”.

Line 140: add how large the size selected fragment of DNA was.

Line 174: add the parameters used for filtering with FastQC.

Paragraph titled “clustering and chromosomal variation underlying population structure”

This paragraph is very interesting, but the important points get a bit lost as the reader tries to keep which population is which straight. I’d suggest re-writing this section perhaps to emphasize that what’s interesting here is the discrepancy in the normal dogma that sympatric kokanee and sockeye are more closely related to each other than either is to allopatric populations of the same ecotype. Perhaps, the authors should start by saying “There are three clusters, one of which was composed of samples from the upper Columbia” and then go on to talk about the discrepancy between studies with regard to some kokanee forming a monophyletic group.

Lines 750-753: the information in brackets could be deleted as the methods used are well explained in the methods section.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Oct 29;15(10):e0240935. doi: 10.1371/journal.pone.0240935.r004

Author response to Decision Letter 1


30 Sep 2020

Comments from reviewer 3

Major comments

The only major comment I have concerns the length of the new version of the manuscript. I tried to find a section or two that could be moved to supplemental, but the only good candidate was the sampling details on lines 93-129 (as well as table one).

The sample section was moved to S1 Methods.

Minor comments

Lines 53-54: start the sentence with “This split between…” and then add “suggests” between “salmon species and” and “two common North American…”

This was changed.

Line 57: add “the phenotype is” to the text within brackets, “(i.e., the phenotypes are polyphyletic)”

I think adding some text to line 58 might help the reader understand why the Fraser and Columbia Rivers are different to the rest of the sockeye range. Perhaps something like “where multiple populations of kokanee are more closely related to each other than sympatric sockeye salmon”.

This was changed.

Line 140: add how large the size selected fragment of DNA was.

Added that the peak was around 488 bp.

Line 174: add the parameters used for filtering with FastQC.

Added that default settings were used.

Paragraph titled “clustering and chromosomal variation underlying population structure”

This paragraph is very interesting, but the important points get a bit lost as the reader tries to keep which population is which straight. I’d suggest re-writing this section perhaps to emphasize that what’s interesting here is the discrepancy in the normal dogma that sympatric kokanee and sockeye are more closely related to each other than either is to allopatric populations of the same ecotype. Perhaps, the authors should start by saying “There are three clusters, one of which was composed of samples from the upper Columbia” and then go on to talk about the discrepancy between studies with regard to some kokanee forming a monophyletic group.

Changed to make this result clearer.

Lines 750-753: the information in brackets could be deleted as the methods used are well explained in the methods section.

Removed.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Zuogang Peng

6 Oct 2020

The sockeye salmon genome, transcriptome, and analyses identifying population defining regions of the genome

PONE-D-20-12784R2

Dear Dr. Christensen,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Zuogang Peng, Ph.D.

Academic Editor

PLOS ONE

Acceptance letter

Zuogang Peng

9 Oct 2020

PONE-D-20-12784R2

The sockeye salmon genome, transcriptome, and analyses identifying population defining regions of the genome

Dear Dr. Christensen:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Zuogang Peng

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Sample site locations of sockeye salmon and kokanee.

    Map generated with the maps library in R [154].

    (TIF)

    S2 Fig. EigenGWA between the DAPC groups 1 and 2.

    A Manhattan plot where eigenvalues from the DAPC analysis were used to identify regions of the genome with ancestry informative genes (e.g. under selection) between groups 1 and 2. The red horizontal line is the threshold of significance for ɑ = 0.01 after Bonferroni correction. The blue line is for ɑ = 0.05.

    (TIF)

    S3 Fig. EigenGWA between the DAPC groups 1 and 3.

    A Manhattan plot where eigenvalues from the DAPC analysis were used to identify regions of the genome with ancestry informative genes (e.g. under selection) between groups 1 and 3. The red horizontal line is the threshold of significance for ɑ = 0.01 after Bonferroni correction. The blue line is for ɑ = 0.05.

    (TIF)

    S4 Fig. Chromosome 24 (NC_042558.1) genotypes and putative inversion(s).

    A) On the left of this figure is the admixture ancestry plot with the DAPC group assignments. On the right is a screenshot of chromosome 24 from IGV from 58 Mbp to 62 Mbp (only variants with r2 values ≥ 0.3 with the variant with the lowest p-value from the eigenGWA in this peak are shown). This region of the genome was found from an eigenGWA to be associated with inferred population structure between DAPC groups 1 and 2. The dark blue genotypes are homozygous for the reference allele (HomRef), the green genotypes are homozygous for an alternative allele (HomVar), and the light blue are heterozygous (Het). B) A scatterplot of variants with r2 values ≥ 0.5 on the top shows areas with high LD. Below is a smaller version of the genotypes with the putative inversions highlighted.

    (TIF)

    S5 Fig. Sockeye salmon vs. kokanee eigenGWA.

    A) The eigenGWA is shown between Fraser River sockeye salmon (n = 14) and kokanee (n = 12) with putative genes highlighted at the peaks (with at least 5 variants with LD). The red line represents a Bonferroni correction at ɑ = 0.01 and after correction for the genomic inflation factor. The blue line represents a Bonferroni correction at ɑ = 0.05 and was chosen as the minimum value of significance. B) An IGV plot of all the variants used in the eigenGWA for the region around the peak on chromosome 23. The genotypes are: dark blue—homozygous reference, green—homozygous alternative, and light blue—heterozygous. The top IGV plot is the kokanee used in this analysis and the sockeye are below. Below the IGV plot, thick lines represent NCBI annotated genes in this region. The putative ancestry informative gene is highlighted in green and named. The variants with the lowest p-values from the eigenGWA are shown as dotted-lines (1st represents the variant with the lowest p-value, 2nd represents the variant with the second lowest p-value, etc.). The p-values, in combination with the genotypes, were used to identify the most likely ancestry informative gene in this region.

    (TIF)

    S6 Fig. Visualization of the variants with the greatest association to the sex phenotype in kokanee lacking the sdY gene.

    Variants on chromosome 1 (NC_042535.1) shown in IGV with the female variants on top and the male variants on the bottom. The variant with the greatest association was found in the 3’ UTR of the krüppel-like factor 5 gene.

    (TIF)

    S7 Fig. Individual genomic diversity.

    A) A map of the sampling sites. B) Three measures of individual genomic diversity: 1) total length of runs of homozygosity, 2) heterozygous genotypes per kbp, and 3) heterozygous ratio.

    (TIF)

    S8 Fig. Density plot of genes that we were unable to identify an ortholog for on the coho salmon genome.

    The x-axis is positions along the chromosome and the points represent the start position of a “missing” gene. The y-axis is the density of missing genes along the chromosome.

    (PDF)

    S1 Table. Sample information.

    (XLSX)

    S1 File. Compressed archive file with various custom Python scripts used in this study and readme files.

    (XZ)

    S2 File. List of orthologous genes between sockeye salmon and other salmon species.

    (XLSX)

    S1 Methods. Sampling strategy.

    (DOCX)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    Raw data has been deposited to the National Center for Biotechnology Information (NCBI) under BioProject PRJNA530256 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA530256/). Custom scripts and sample information can be found in supplemental files.


    Articles from PLoS ONE are provided here courtesy of PLOS
