Proceedings of the Royal Society B: Biological Sciences. 2016 May 11;283(1830):20153064. doi: 10.1098/rspb.2015.3064

Genetic basis of adult migration timing in anadromous steelhead discovered through multivariate association testing

Jon E Hess 1, Joseph S Zendt 2, Amanda R Matala 1, Shawn R Narum 1
PMCID: PMC4874702  PMID: 27170720

Abstract

Migration traits are presumed to be complex and to involve interaction among multiple genes. We used both univariate analyses and a multivariate random forest (RF) machine learning algorithm to conduct association mapping of 15 239 single nucleotide polymorphisms (SNPs) for adult migration-timing phenotype in steelhead (Oncorhynchus mykiss). Our study focused on a model natural population of steelhead that exhibits two distinct migration-timing life histories with high levels of admixture in nature. Neutral divergence was limited between fish exhibiting summer- and winter-run migration owing to high levels of interbreeding, but a univariate mixed linear model found three SNPs from a major effect gene to be significantly associated with migration timing (p < 0.000005) that explained 46% of trait variation. Alignment to the annotated Salmo salar genome provided evidence that all three SNPs localize within a 46 kb region overlapping GREB1-like (an oestrogen target gene) on chromosome Ssa03. Additionally, multivariate analyses with RF identified that these three SNPs plus 15 additional SNPs explained up to 60% of trait variation. These candidate SNPs may provide the ability to predict adult migration timing of steelhead to facilitate conservation management of this species, and this study demonstrates the benefit of multivariate analyses for association studies.

Keywords: genome-wide association studies, migratory species, anadromous fishes

1. Introduction

Fitness trade-offs may serve to differentiate migrating individuals and cause them to exhibit particular strategies that vary by timing [1], speed [2], distance [3] and destination of migration [4]. The attributes of migration behaviour are probably fine-tuned within populations to confer optimal fitness for a particular habitat. Genetic mechanisms can operate and drive variation in migratory behaviour, and the study of this architecture has begun to enhance our understanding of a variety of different migratory insects, birds and fishes [5]. A group of anadromous fishes, salmonids (Family Salmonidae), represent an opportunity to expand this understanding given their life-history diversity of migratory traits [6] and the availability of resources for study of these species owing to their high cultural and economic value. Specifically, the bioinformatic resources for salmonids now include whole-genome sequences of Salmo salar [7] and Oncorhynchus mykiss [8]. These resources increase the likelihood that candidate loci identified by genome-wide association studies (GWAS) of salmonids will be annotated and thereby provide insight into homologous functioning of genetic architecture of migratory behaviour.

In this study, we test for genome-wide association of a migratory trait in steelhead (O. mykiss), which is the anadromous form of this salmonid species. Steelhead exhibit a high degree of life-history diversity throughout their geographical range, including two divergent freshwater migration-timing life histories of adults known as winter- and summer-run. Both run types spend similar periods of time rearing in freshwater and the ocean (1–4 years for each habitat). However, summer-run steelhead return to their natal freshwater streams between May and October when they are relatively sexually immature, and winter-run steelhead return to freshwater between December and May as sexually mature fish [9]. Both summer- and winter-run steelhead hold in freshwater until they spawn together each year on the ascending limb of flows during annual spring run-off [10]. The alternative life-history strategies exhibited by the summer- and winter-run steelhead of the Columbia River Basin are probably associated with fitness trade-offs that may explain the differences in their respective geographical distributions [11]. For example, summer-run steelhead display an optimal migration timing for passage over partial barriers such as waterfalls that are relatively easier to navigate during summer flows. However, the fitness cost of this strategy may be relatively high pre-spawning mortality owing to the protracted freshwater residence associated with this early migration timing [12].

Genetic analysis of native, sympatric summer and winter runs indicates that these two migratory forms are more similar within drainages than either type is to other populations of the same migratory type [13,14]. However, the degree of gene flow between the two migratory types can vary depending on location. For the Columbia River, only summer-run steelhead are distributed upstream of the major historic barrier of Celilo Falls (near the eastern edge of the Cascade Crest; electronic supplementary material, figure S1). However, downstream of this point both runs have historically coexisted. In some coastal tributary populations where hatchery stocking has occurred, separation in their spawn timing and habitat preference has generally reinforced some degree of reproductive isolation [15–17]. This isolation can cause an accumulation of divergence throughout the genome, which would present an added challenge to uncovering the genetic basis of the trait of interest because any alleles that vary between subpopulations can lead to spurious associations with the trait [18]. However, our study focused on steelhead from a tributary of the Columbia River, the Klickitat River, which is located in an intermediate geographical location between coastal and inland lineages where neutral genetic divergence is very low for native anadromous steelhead [19] and both migratory life-history types occur in sympatry. Therefore, this represents a model system for association testing of migratory life histories with little confounding neutral structure.

Traits underlying migration behaviour that appear bimodal can be polygenic, involving interaction among multiple genes of minor effect [20], and so we included both univariate and multivariate models for association tests of 15 239 SNPs with adult migration timing in steelhead. This combination of approaches takes advantage of the testing framework of univariate analysis to evaluate independent associations of SNPs with our trait of interest, along with the benefits of multivariate analyses with random forest (RF), such as the ability to capture the complexity of epistatic interactions that are typically involved among genes underlying quantitative traits [21]. Further, we examined whether information (the degree of association each SNP has with the trait) from the univariate analyses could be useful as a prior to help narrow the search for candidate SNPs in the RF approach and increase efficiency for identifying a set of SNPs with greatest prediction power for the trait of interest.

2. Material and methods

(a). Fish collection, sample pre-screening and DNA analysis

Adult steelhead were trapped throughout each year of a consecutive 3-year period (2007–2009) in the Klickitat River at the Lyle Falls fish ladder (Rkm 2.4; electronic supplementary material, figure S1) for non-lethal collection of biological data, and then released back into the stream. Biological data included migration date, fork length, sex and a clip of tissue from the caudal fin. DNA was extracted and genotypes from 180 TaqMan SNP assays were analysed initially to filter out non-target samples (non-local or hatchery-origin steelhead) via STRUCTURE v. 2.3.4 analysis [22] (electronic supplementary material, figure S2 and supplemental methods S1). This set of 180 SNPs has been highly vetted for Hardy–Weinberg equilibrium across the steelhead range in the Columbia River Basin and is a standard marker set for characterizing population structure in this region [23]. Only fish determined to be natural-origin steelhead from the Klickitat River were retained for further inclusion in this study (electronic supplementary material, figure S3). This included a total of 320 fish over three migration years (n = 83 for 2007; n = 93 for 2008; n = 144 for 2009).

The trait we tested was migration timing in units of ordinal day as the phenotype, which we ordered to reflect the biological sequence of annual steelhead migrations (figure 1). For both summer- and winter-run life-history types, spawning coincides temporally in April of each year for all steelhead that returned to the Klickitat River during all previous seasons of the migration year, as well as those that returned during January through April of the current year. Therefore, we reordered April 30 as ordinal day 365 and May 1 as ordinal day 1 to represent the end of one spawning cycle and the beginning of the next migration and spawning cycle, respectively (figure 1). The summer-run migration typically lasts from May to October of the same calendar year. The winter-run migration typically begins around December and continues through April of the following calendar year. Thus, there are two periods where the two migratory types are expected to overlap: primarily in the transition from summer-run to winter-run, which is thought to occur in the autumn months (September through November); and secondarily at the end of the spawning season in April, where late winter-run fish and early summer-run fish are migrating.
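The re-ordering of ordinal day described above can be illustrated with a minimal Python sketch. The function name `migration_ordinal_day` is hypothetical and is not taken from the study. The sketch also makes an assumption about leap years: a migration cycle that contains 29 February yields day 366 for April 30, and how the original analysis treated that case is not stated here.

```python
from datetime import date


def migration_ordinal_day(passage_date: date) -> int:
    """Return the re-ordered ordinal day of a passage date.

    May 1 is day 1 of a migration cycle, and April 30 of the following
    calendar year is day 365 (366 if the cycle spans 29 February; this
    leap-year handling is an assumption of the sketch).
    """
    # Dates from January through April belong to the cycle that began
    # on May 1 of the previous calendar year.
    if (passage_date.month, passage_date.day) >= (5, 1):
        cycle_start = date(passage_date.year, 5, 1)
    else:
        cycle_start = date(passage_date.year - 1, 5, 1)
    return (passage_date - cycle_start).days + 1


# A summer-run date sorts before a winter-run date of the same cycle.
summer = migration_ordinal_day(date(2008, 7, 15))
winter = migration_ordinal_day(date(2009, 1, 15))
print(summer, winter)
```

With this ordering, late winter-run fish passing in April receive the largest values and early summer-run fish passing in May receive the smallest, so the two periods of overlap fall at the middle and at the ends of the scale.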

Figure 1.

Mean monthly counts of adult steelhead passing Lyle Falls during 2007–2014. Ordinal ‘day’ was re-ordered to begin with the summer-run period and end after the spawning period. The bi-modal distribution represents the peaks of the summer- and winter-run steelhead migration periods (horizontal bars) and includes a ‘transitional’ period of overlap in distributions (hatched bars). Sample sizes (total n = 237) analysed from the three time periods are indicated.

DNA from each of the 320 steelhead and a set of 10 doubled haploid O. mykiss individuals (included to identify paralogous sequence variants, PSVs) was quantified in 96-well assay plates using the Quant-iT dsDNA pico-green assay kit (Life Technologies, Grand Island, NY, USA) and a Perkin Elmer Victor 5 plate reader. RAD-seq libraries were prepared for Illumina sequencing with a protocol that has been previously published [24] and subsequently modified [25]. Prior to sequencing, RAD libraries were quantified by q-PCR and Illumina library quantification standards (Kappa Biosystems Inc, Woburn, MA, USA) on an ABI 7900HT Sequence Detection System (Life Technologies). Libraries were sequenced with single-end 100 bp reads on an Illumina HiSeq 1500 (Illumina Inc., San Diego, CA, USA).

(b). Bioinformatics

Bioinformatics for de novo SNP discovery and genotyping were completed with STACKS [26], and subsequently filtered to remove markers and individuals that failed quality thresholds (electronic supplementary material, supplemental methods S2). The final dataset included 15 239 SNPs with greater than 3% MAF that were successfully genotyped across more than 80% of 237 individuals. The decrease in sample size from the initial 320 individuals was largely due to tissue degradation. These failed tissues appeared evenly distributed across run times and were unlikely to cause sample bias. Sequence for each RAD tag was used to align to the O. mykiss and the S. salar reference genomes using BOWTIE2 [27]. The positions from the alignment to the O. mykiss genome were used to order the loci by chromosome number. Many of the tags were found to occur within an unknown chromosome and so were simply ordered by their scaffold position number after the markers on the last chromosome (i.e. after the sex chromosome). Coding regions of the O. mykiss genome were identified within 5 kb of each SNP locus, and for comparison, coding regions of the annotated S. salar genome were also examined within 5 kb of the putative candidate SNPs identified in this study. To identify gene functions and annotations, coding sequences were then queried against the NCBI nucleotide sequence database using the software program Blast2GO [28]. We used program search parameters Blastx (Blast program), nr (Blast database), retrieve 10 Blast hits and Blast expectation value = 10^−3 (the default). A Fisher's exact test in the FatiGO [29] package that is integrated within Blast2GO was used to test for significant annotation differences between two sets of loci (i.e. a group of putative candidate loci and the reference set of all loci that were annotated). We generated visual aids for annotation of regions where candidate SNPs were found to localize using the Oncorhynchus genome browser developed by Genoscope (https://www.genoscope.cns.fr/trout/cgi-bin/gbrowse/truite/) and the S. salar genome browser by NCBI (http://www.ncbi.nlm.nih.gov/genome/browse/).
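The two marker thresholds stated above (MAF greater than 3% and genotypes called in more than 80% of individuals) can be sketched as follows. This is an illustration only and is not the STACKS pipeline or the filtering described in supplemental methods S2. The function names and the encoding of genotypes as alternate-allele counts (0, 1, 2, or `None` for missing) are assumptions made for the example.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency from alternate-allele counts (0, 1, 2).

    Missing genotypes are given as None and are ignored.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def call_rate(genotypes):
    """Fraction of individuals with a non-missing genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def passes_filters(genotypes, min_maf=0.03, min_call_rate=0.80):
    """Apply both thresholds as strict inequalities, as worded in the text."""
    return (minor_allele_frequency(genotypes) > min_maf
            and call_rate(genotypes) > min_call_rate)


# Hypothetical SNP genotyped in all five individuals: retained.
print(passes_filters([0, 0, 1, 2, 0]))
# Monomorphic SNP: removed because its MAF is zero.
print(passes_filters([0] * 10))
```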

(c). Univariate genome-wide association studies

We performed univariate analyses using a general linear model (GLM) and a mixed linear model (MLM) with TASSEL v. 5.1.0 [30]. The GLM is a fixed effects linear model that is used in TASSEL to identify significant associations between phenotypes and genotypes. TASSEL takes population structure into account by using membership in underlying populations as covariates in the model. The MLM is similar to GLM but includes both fixed (i.e. gender, year, population structure and genetic marker) and random effects (i.e. relationships among individuals), and can thus account for both population structure and kinship to reduce false positive associations [31]. Equations for the GLM and MLM are described in the TASSEL manual [30]. A kinship matrix using the ‘scaled IBS’ method [32] based on all 15 239 SNPs was generated in TASSEL to represent cryptic familial relationships. The MLM was implemented using default options (i.e. ‘P3D’ [33] parameter option and the ‘optimum compression’ option). The GLM effectively represents a ‘maximum compression’ option, and thus provides contrast to the MLM. Permutation tests (1000) were used to calculate p-values to determine significant associations of SNPs with traits. Owing to the potential to identify false associations when many SNPs, traits and individuals are included, a Bonferroni correction was applied to α level 0.05 (i.e. corrected α = 3.28 × 10^−6) to reduce false positives and stringently control for Type I errors. The association tests using a GLM were performed with covariates of gender, year and population structure. For population structure, we used coefficients of ancestry for nine STRUCTURE clusters (only 9 of the K = 10 clusters from the STRUCTURE ‘filter’ analysis were used, to remove linear dependency as recommended in the TASSEL manual). This dataset was also analysed using an MLM, and the kinship matrix was included as an additional covariate. Some individuals (4%) were missing gender data, which was imputed in TASSEL by using default parameter settings to average the nearest three neighbours.
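The Bonferroni-corrected threshold quoted above follows from dividing the nominal α by the number of SNPs tested. The short sketch below reproduces that arithmetic. The function name is hypothetical, and the sketch assumes one test per SNP for a single trait.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


corrected_alpha = bonferroni_threshold(0.05, 15239)
print(f"{corrected_alpha:.2e}")  # 3.28e-06

# The same threshold on the -log10(p) scale used in Manhattan plots.
print(round(-math.log10(corrected_alpha), 2))
```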

(d). Random forest

We performed the RF multivariate analysis on our dataset by using two different approaches (henceforth referred to as ‘RF-rank’ and ‘MLM-rank’), and within each approach we employed ‘coarse-sweep’ and ‘fine-sweep’ analyses to identify the number of SNPs required to explain the most trait variation. For the first approach (‘RF-rank’), our ‘coarse-sweep’ involved an iterative process (electronic supplementary material, supplemental methods S3) to build a set of predictive models for the migration trait (dependent variable) based on subsets of the 15 239 SNP loci (independent variables). The RF model's predictive accuracy is estimated by using a portion (approx. 2/3) of the data that are randomly selected with replacement to build a predictive model, which is then used to predict the phenotypes of the remainder of the data [34]. The amount of trait variance explained by the predicted values relative to the total amount of variation in the observed phenotypes yields the percentage of trait variation explained, a measure of the model's predictive accuracy. Results from this ‘coarse-sweep’ analysis showed that the maximum phenotypic variance could be explained with the top 25 loci. Using this group of loci, we then ranked their importance using a backward purging analysis [34], which was automated with R scripts [35]. In this backward purging (i.e. ‘fine-sweep’ analysis), the least important loci are removed one by one, starting with a greater number than the initial optimum number of loci (here, the best 150 loci rather than the top 25; electronic supplementary material, supplemental methods S3). This ‘fine-sweep’ analysis was necessary to identify candidate SNPs with more precision than could be accomplished with the ‘coarse-sweep’ analysis, and this analysis helps to minimize the potential negative interactions across loci that may otherwise prevent identification of the best candidate SNPs [34,35].
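Two ideas in the paragraph above can be made concrete with a minimal sketch: the percentage of trait variation explained by held-out predictions, and the backward purging loop. The sketch is not the authors' R scripts. All function names are hypothetical, the variance-explained formula (1 − MSE/variance) is the convention assumed here, and `fit_model` is a placeholder for any routine that fits an RF on a set of loci and returns its variance explained together with per-locus importance values.

```python
def percent_variance_explained(observed, predicted):
    """100 * (1 - MSE / variance of observed), from held-out predictions."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    variance = sum((o - mean_obs) ** 2 for o in observed) / n
    return 100 * (1 - mse / variance)


def expected_in_bag_fraction(n):
    """Expected share of distinct individuals in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n


def backward_purge(loci, fit_model):
    """Remove the least important locus one at a time; keep the best set.

    fit_model(loci) must return (pct_variance_explained, importance_by_locus).
    """
    current = list(loci)
    best_set, best_score = list(current), float("-inf")
    while current:
        score, importance = fit_model(current)
        if score > best_score:
            best_set, best_score = list(current), score
        least_important = min(current, key=lambda locus: importance[locus])
        current.remove(least_important)
    return best_set, best_score


# Sampling with replacement leaves roughly two-thirds of individuals in-bag.
print(round(expected_in_bag_fraction(237), 3))
```

In this sketch the model is refitted and the importance values are recomputed after every removal. Whether the original scripts recompute importance at each step is not stated in this section.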

The migration trait data in this case were the residuals of the day of passage trait from the previous MLM univariate analysis, which minimized bias from population structure and kinship. For comparison, we also used the uncorrected trait values (i.e. re-ordered ordinal day of migration timing) to estimate the percentage of total variation that each set of SNPs could explain. Genotypes were converted to 0 (aa), 0.5 (ab) and 1 (bb), and missing data were imputed in TASSEL from the nearest three individuals. The RF analyses were performed using the RANDOMFOREST package [36] in R.
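The numeric genotype coding described above can be sketched as follows. The function name and the representation of a genotype as a pair of allele characters are assumptions made for the example. Imputation of missing genotypes, which was done in TASSEL, is not reproduced here.

```python
def encode_genotype(alleles, b_allele):
    """Code a diploid genotype as 0 (aa), 0.5 (ab) or 1 (bb).

    `alleles` is a pair of allele characters, or None when the genotype
    is missing; `b_allele` is the allele counted as 'b'.
    """
    if alleles is None:
        return None  # left for a separate imputation step
    return sum(allele == b_allele for allele in alleles) / 2


# Hypothetical SNP with alleles A (a) and G (b).
print([encode_genotype(g, "G") for g in [("A", "A"), ("A", "G"), ("G", "G"), None]])
# [0.0, 0.5, 1.0, None]
```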

For the second approach (‘MLM-rank’), we used the results of the univariate MLM analysis in order to rank all the loci based on their p-value, lowest to highest. As a ‘coarse-sweep’ analysis, we used subsets of top-ranked loci (i.e. top 3, 5, 10, 25, 50, 75, 100, 200, 300, 400, 500, 750, 1500, 3000 and 7500 SNPs) to determine the approximate number of loci necessary to explain maximal phenotypic variation. Subsequently, we conducted a ‘fine-sweep’ backward purging analysis using a starting point of the 150 top-ranked SNPs. We felt this ‘MLM-rank’ approach would make a useful comparison to the alternative ‘RF-rank’ approach, which relied solely on importance values estimated by RF analysis to rank loci and was more time-consuming due to the non-automated, iterative steps. We evaluated the two approaches for their efficiency (i.e. their ability to identify the smallest number of loci required to explain the most trait variation).
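The ‘MLM-rank’ coarse sweep amounts to sorting loci by p-value and taking nested top-ranked subsets. A minimal sketch is given below. The function name and the dictionary input are hypothetical, and the p-values in the usage example are invented for illustration. The subset sizes are those listed in the text.

```python
COARSE_SWEEP_SIZES = [3, 5, 10, 25, 50, 75, 100, 200, 300, 400,
                      500, 750, 1500, 3000, 7500]


def top_ranked_subsets(p_values, sizes=COARSE_SWEEP_SIZES):
    """Return {subset size: loci with the lowest p-values}.

    `p_values` maps a locus name to its univariate p-value. Sizes larger
    than the number of available loci are skipped.
    """
    ranked = sorted(p_values, key=p_values.get)  # lowest p-value first
    return {k: ranked[:k] for k in sizes if k <= len(ranked)}


# Invented p-values for four hypothetical loci.
example = {"snp_a": 0.5, "snp_b": 1e-7, "snp_c": 0.01, "snp_d": 0.2}
subsets = top_ranked_subsets(example, sizes=[1, 3, 10])
print(subsets)
# {1: ['snp_b'], 3: ['snp_b', 'snp_c', 'snp_d']}
```

Each subset would then be passed to the RF model, and the percentage of trait variation explained compared across subset sizes.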

(e). Ability of top candidate markers to distinguish alternative run types

The top candidate markers identified by the ‘MLM-rank’ approach were analysed in STRUCTURE to demonstrate their effectiveness for distinguishing individuals according to their migration-timing. Individuals were categorized into three groups divided by the periods May–August, September–November and December–April to capture a summer, transitional and winter-run group, respectively (figure 1). Correlation between ordinal day-of-passage and probability of ancestry to a cluster (output from STRUCTURE) was estimated using a Mantel test performed with 9999 permutations in PASSaGE v. 2 [37]. We calculated pairwise FST among the three migration-timing groups using ARLEQUIN v. 3.5.1.2 [38] with 10 000 permutations, and significance was determined at α = 0.05 level corrected for multiple comparisons using B–Y method FDR [39] (corrected α = 0.01092). For comparison, these same analyses were repeated using a set of 180 non-candidate SNPs that are part of a standard panel for genetic stock identification and population structure analyses in the Columbia River Basin [23].

3. Results

(a). Univariate genome-wide association studies

The univariate analyses performed in TASSEL indicated that MLM (figure 2) was a better fit than GLM (electronic supplementary material, figure S4) due to the inclusion of the kinship matrix as a covariate in the MLM. Three SNPs were significantly associated with migration timing in the MLM, specifically SNPs 47080_53, 52458_16 and 54772_22 (Bonferroni's adjusted α ≤ 3.28 × 10^−6). These three SNPs were highly evident in the QQ plot (figure 2a) above the 1 : 1 line and outside the significance threshold, with the remainder of markers not statistically significant and largely fitting expected p-values. We used the residual trait values from TASSEL for subsequent multivariate analyses because of the relatively good fit of this model, and its ability to minimize bias from both population structure and cryptic relatedness.

Figure 2.

Association test results for a mixed linear model (MLM) of the migration-timing trait. (a) QQ plot showing the expected −log10(p-value) versus the observed −log10(p-value). (b) Manhattan plot showing −log10(p-value) versus the O. mykiss genome position of the markers. The heavy dashed line indicates the Bonferroni-corrected α level of 0.05. (Online version in colour.)

(b). Random forest multivariate analysis

The initial ‘coarse-sweep’ using the RF-rank approach (with no prior information from MLM) demonstrated that approximately 25–50 SNPs were able to explain the largest amount of residual trait variation (approx. 44%; figure 3a). The subsequent ‘fine-sweep’ backward purge analysis corroborated this approximate number of SNPs; however, the analysis more precisely identified a set of 44 SNPs to explain 46% of residual trait variation (figure 3b) or 60% uncorrected trait variation. The coarse-sweep and fine-sweep analyses were highly concordant, because 84% of the top 44 SNPs identified in the fine sweep were included in the top 50 SNPs identified in the coarse-sweep. Further, these overlapping SNPs included the three candidate SNPs identified by the univariate analysis, and two of these candidate SNPs were ranked first and second by RF importance values.

Figure 3.

Percentage of residual variation (i.e. after all covariates were accounted for in the MLM) explained in migration timing with the number of SNPs included in the RF model. (a) ‘Coarse-sweep’ analysis: groups of markers were selected according to their rank based on either the p-value from the MLM univariate test (black) or based on the importance values using RF exclusively (grey). (b) ‘Fine-sweep’ analysis: starting with the top 150 SNPs based on their rank from either MLM p-value (black line) or the importance values using RF exclusively (grey line), a single least important SNP was purged step by step.

The MLM-rank approach using a coarse-sweep analysis in RF demonstrated that this approach was less effective than RF-rank. With the MLM-rank approach, many more SNPs (a minimum of 100 SNPs) were necessary to explain the highest residual trait variation. Additionally, the highest residual trait variation explained with the MLM-rank approach (approx. 27%; figure 3a) was much lower than that with the RF-rank approach (27% versus 44% of variation, respectively). The same top three SNPs were identified, but were only able to explain 7% residual variation (46% uncorrected variation). Overall, the RF-rank coarse-sweep approach was more efficient because it identified fewer SNPs to explain a greater amount of variation.

Despite relatively low efficiency of the MLM-rank approach in the coarse-sweep analysis, backward purging with fine-sweep analysis using the top-ranked 150 SNPs performed more comparably to the RF-rank approach. A relatively small number of SNPs (18) were identified with the MLM-rank fine-sweep analysis that could explain 44% of the residual trait variation (figure 3b) or 60% uncorrected trait variation. This level of explained residual trait variation with 18 SNPs surpassed the limit of residual trait variation (approx. 27%) that could be explained from the coarse-sweep (figure 3a), and was similar to the level achieved with the RF-rank fine-sweep analysis (46% residual variation with 44 SNPs), but with many fewer markers. Among the 18 top SNPs, 78% of them were found in common among the 44 top SNPs identified by the RF-rank approach. Further, the same three candidate SNPs identified by the MLM univariate analysis were included among the top candidate SNPs identified by both RF-rank and MLM-rank approaches.

(c). Candidate loci for migration timing

In total, 8224 of our 15 239 loci were found to be within 5 kb of at least one coding DNA sequence from the O. mykiss reference genome. Using a single coding sequence per locus, there were 8154 loci that returned at least one sequence description, gene ontology term or InterProScan result in the Blast2GO analysis (electronic supplementary material, table S1). The gene ontology enrichment analysis using a Fisher's exact test identified no statistically significant enrichments for either the 18 top MLM-rank loci or the top 44 RF-rank loci after applying FDR [40] for multiple testing. Lack of significant enrichments may be due to limited annotation of candidate SNPs (e.g. among the 18 top MLM-rank loci there were only eight SNPs determined to have a sequence description, and of those, three were identified with GO names; electronic supplementary material, table S1). Coding DNA sequence from the Atlantic salmon genome corroborated some (n = 12) of the O. mykiss genome annotation of candidate loci and identified eight additional homologous genes near these candidate loci (electronic supplementary material, table S1); however, these additions still yielded no statistically significant gene ontology enrichments.

Based on the O. mykiss genomic positions of the 18 top MLM-rank loci and the top 44 RF-rank loci, nearly half of the SNPs were on an unknown chromosome, and the remaining SNPs were distributed across multiple chromosomes and did not appear to be concentrated on any particular linkage group (electronic supplementary material, table S1). However, alignment of these candidate loci to the S. salar genome provided more complete chromosome positional information (94% of loci were positioned on chromosomes), and notably showed that a single chromosome (ssa03) included all three candidate SNPs identified by the MLM univariate analysis and two additional candidate SNPs (77429_7 and 34348_17). Among the three candidate SNPs identified by the MLM univariate analysis, we determined that one SNP (47080_53) occurs in O. mykiss chromosome 28 (figure 2b), and localizes to the gene GDF11, which was recently characterized in O. mykiss [41]. The other two SNPs, 52458_16 and 54772_22, were found to occur less than 3 kb apart in an unknown O. mykiss chromosome (i.e. ChrUn positions 366307580 and 366310211, respectively). The region spanned by these two markers contains three CDS of a predicted gene in the GREB1/GREB1-like gene family (electronic supplementary material, figure S5). The alignment to the annotated S. salar genome corroborated the close proximity of SNPs 52458_16 and 54772_22 within GREB1-like, and, further, provided evidence that all three SNPs localize within a 46 kb region on chromosome Ssa03 (electronic supplementary material, figure S5). However, based on the annotated S. salar genome, there was no evidence that SNP 47080_53 is in fact localized within a gene homologous to the O. mykiss GDF11; therefore, its close proximity to the GREB1-like gene is perhaps a more notable and substantiated characteristic.

The STRUCTURE analyses with the top 18 candidate SNPs showed support for two clusters with steep increase in mean LnP(K) between K = 1 and K = 2. The mean LnP(K) plateaued for K > 2 (electronic supplementary material, figure S6a). The proportion of ancestry to these clusters (figure 4a) was correlated with migration timing (r = 0.458, figure 4b), and the Mantel test was significant (p = 0.0001, B-Y FDR corrected α = 0.0083). By contrast, based on the standard 180 SNP marker set used for general population genetic applications, mean LnP(K) plateaued between K = 1 and 2, after which point Pr(K) declined rapidly (electronic supplementary material, figure S6b). Delta K [42] supported K = 2 for both marker sets, but the correlation of proportion of ancestry (figure 4c) versus migration timing was lower (r = 0.037) for the 180 SNP marker set (Mantel test, p = 0.0056; figure 4d). Pairwise FST values were high and significant (B-Y FDR corrected α = 0.05) [39] among migration trait categories based on the 18 candidate loci, and relatively lower (70× lower) and mostly non-significant for the 180 SNPs (electronic supplementary material, table S2).

Figure 4.

STRUCTURE plot of individual ancestry to two clusters for (a) 18 candidate SNPs and (c) 180 non-candidate SNPs. Individuals were ordered by their migration timing into the Klickitat River and grouped by the summer-run (May–Aug), transitional (Sep–Nov) and winter-run (Dec–Apr) periods. Plots of the Mantel test correlation between individual ancestry and day of migration (units are ordinal day as reordered summer–winter) are shown for (b) 18 candidate SNPs and (d) 180 non-candidate SNPs.

4. Discussion

Our study provides evidence for the genetic basis of differential migration timing exhibited by anadromous steelhead. However, we cannot conclude that these alternative life-history strategies are entirely under genetic control. Instead, this example may be similar to the polygenic class of threshold traits, or alternative migratory tactics exhibited generally across salmonid fishes [43]. Specifically, individual variation in behaviour and physiology of alternative migratory tactics (e.g. residency versus anadromy) may be controlled by developmental thresholds; and a genetic basis may be controlling the variation in proximal traits activating the development of alternative migratory tactics [43,44]. Furthermore, phenotypic plasticity may contribute to variation in this trait as has been observed in other organisms [45–47]. Our study also supports that this migration trait is polygenic, and the combination of univariate GWAS and multivariate RF analyses that we employed was helpful for uncovering multiple SNPs and their associated genomic regions that may be involved in this trait and are distributed across multiple chromosomes. Our approach to use information from the univariate analyses as a prior to narrow the search for the RF multivariate analysis may also benefit other studies by increasing their efficiency.

(a). Genetic architecture of migration timing in steelhead

Concentrated genetic architecture (i.e. few quantitative trait loci, QTL, of large effect) has been predicted to evolve under a set of conditions that include, among other factors, higher rates of gene flow between diverging populations compared with conditions leading to more diffuse genetic architecture (i.e. many QTL of small effect) [48]. Recent empirical studies demonstrate that even seemingly complex phenotypes can be controlled by concentrated architecture [49,50]. Life-history strategies for migration timing in salmonids appear to be less extreme than these examples of traits with concentrated genetic architecture, because many loci of both large and small effects have been implicated across chromosomes [35,51]. Our study of migration timing in steelhead lends support to this notion because this trait involves at least one chromosome region of relatively large effect as well as loci across multiple chromosomes with relatively minor effects. The region of large effect was identified by three SNPs that were clearly distinguished by their highly significant association with this trait using GWAS analyses and were found to localize within a 46 kb region of a single chromosome (ssa03) based on the S. salar genome alignment. Placement of these three SNPs together on a single O. mykiss chromosome is also likely given that chromosome ssa03 of S. salar shares synteny with O. mykiss chromosome 28 [52], and one of these three SNPs (47080_53) that could be assigned to an O. mykiss chromosome was found on chromosome 28.

These three SNPs were found to localize within a region that overlaps at least one gene: an oestrogen receptor cofactor, GREB1. Homologous functions of the GREB1 gene that have been characterized in other species appear to support a putative major role in this salmonid migration trait. For example, GREB1 mediates the interaction of oestrogen receptors with other proteins and plays an important role in enhancing the ability of oestrogen to act on cells to induce transcription [53]. This link to oestrogen receptors is relevant to upstream migration timing in steelhead and other salmon, because these species migrate upstream when they are sexually maturing. Further, oestrogen (steroid hormone) levels in the blood increase when salmon are maturing and affect multiple developmental pathways in sexual maturation, including oogenesis, vitellogenesis and testicular development [54].

However, the three top SNPs were found to explain a relatively small portion of the residual variation (approx. 7%) of the migration-timing trait (i.e. the variation remaining after kinship, population structure, gender and year effects were accounted for in the univariate MLM). A multivariate approach using the RF machine learning algorithm demonstrated that a much higher proportion of residual trait variation (44%) could be explained using a minimum of 18 minor candidate SNPs. We also demonstrated that the variation at these 18 candidate SNPs can distinguish the individuals in this dataset by their migration timing, in contrast with a set of 180 non-candidate SNPs that are a standard marker set for population genetic applications in the Columbia River Basin [23]. Therefore, this evidence suggests that multiple loci of both major and minor effects may be involved in this migration trait.
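
As an illustration of what a "proportion of residual trait variation explained" can mean, the following minimal sketch computes one common pseudo R-squared, 1 - MSE / Var(y), from observed and predicted values. It is not taken from the analysis reported here: the function name and the toy numbers are hypothetical, and the observed values are assumed to be phenotype residuals left after covariates have been removed.

```python
from statistics import fmean


def proportion_variance_explained(observed, predicted):
    """Return 1 - MSE / Var(observed), a pseudo R-squared.

    `observed` is assumed to hold residual trait values and `predicted`
    the model predictions for the same individuals, in the same order.
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_obs = fmean(observed)
    # Population variance of the observed values (divides by n).
    total_var = fmean([(y - mean_obs) ** 2 for y in observed])
    if total_var == 0:
        raise ValueError("observed values have zero variance")
    # Mean squared prediction error.
    mse = fmean([(y - p) ** 2 for y, p in zip(observed, predicted)])
    return 1.0 - mse / total_var


# Toy values only; these are not data from the study.
toy_observed = [0.0, 0.0, 2.0, 2.0]
toy_predicted = [0.5, 0.5, 1.5, 1.5]
toy_explained = proportion_variance_explained(toy_observed, toy_predicted)
```

A value near 1 means the predictions reproduce most of the residual variation, whereas a model that always predicts the mean scores 0.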

In addition to GREB1, the set of 18 candidate SNPs implicated five other genes identified by our Blast2GO query: transmembrane protein 230-like gene (TMEM230-like), Ras GTPase-activating-like protein (IQGAP3), angiopoietin-related protein 2-like (ANGPTL2), glutamate receptor (NMDA 2C-like) and an unnamed protein product. However, GO-enrichment analysis did not identify a significant pattern of genes involved in any particular function. The enrichment analysis was limited by the small number of genes that could be identified and by the relatively low-density genomic scan we could accomplish with 15 000 SNPs. There are a number of assumptions required when linking a SNP locus with a gene based on homology, including a subjective decision on how near a gene should be to consider it linked to a SNP [55]. We chose a conservatively narrow distance threshold (5 kb) within which to identify genes near candidate SNPs, and note concordance of gene annotation information from the S. salar and O. mykiss genomes for all five of the 18 candidate SNPs that could be compared. It may be beneficial to reassess the O. mykiss genomic positions and gene ontology of these candidate loci in the near future following continued improvements to the genomic resources for this species [56].
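
The distance-threshold decision described above can be sketched as a simple interval check. This is only an illustration under stated assumptions: the `Gene` record, the function name and all coordinates are hypothetical, and the default window of 5000 bp mirrors the 5 kb threshold mentioned in the text rather than any particular annotation tool.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # first base of the gene (inclusive)
    end: int    # last base of the gene (inclusive)


def genes_near_snp(chrom, pos, genes, window=5000):
    """Return (gene name, distance in bp) pairs within `window` of a SNP.

    Distance is 0 when the SNP lies inside the gene; otherwise it is the
    gap to the nearer gene boundary. Results are sorted by distance.
    """
    hits = []
    for gene in genes:
        if gene.chrom != chrom:
            continue
        if gene.start <= pos <= gene.end:
            distance = 0
        else:
            distance = min(abs(pos - gene.start), abs(pos - gene.end))
        if distance <= window:
            hits.append((gene.name, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical annotation; names and coordinates are invented.
toy_genes = [
    Gene("geneA", "chr1", 10_000, 20_000),
    Gene("geneB", "chr1", 24_000, 30_000),
    Gene("geneC", "chr1", 40_000, 50_000),
    Gene("geneD", "chr2", 10_000, 20_000),
]
toy_hits = genes_near_snp("chr1", 21_000, toy_genes)
```

Widening `window` links more genes to each SNP at the cost of more spurious assignments, which is the trade-off behind choosing a narrow threshold.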

(b). Utility of a combined univariate and multivariate approach in genome-wide association studies

We demonstrated high concordance between results from univariate analyses and RF: of the three SNPs (54772_22, 47080_53 and 52458_16) that were significant in the univariate MLM analysis, two held the first and second rank (54772_22 and 47080_53, respectively) in the RF analysis (RF-rank backward purging). The third SNP (52458_16) was ranked 15th by this RF analysis, probably because it is in near perfect linkage disequilibrium with the first-rank SNP. The level of concordance of rankings between the univariate MLM analysis and RF analysis decreased as larger sets of top-ranked markers were compared between methods (i.e. 29% of markers overlapped between the 150 top-ranked by p-value from the univariate analysis and the 150 top-ranked by RF importance value).
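
A concordance figure of this kind can be computed as the share of markers common to two top-k lists. The sketch below is a hedged illustration, not the procedure used in this study: the function names and toy scores are hypothetical, and the overlap is assumed to be expressed as shared markers divided by k.

```python
def top_k(scores, k, smallest_first):
    """Return the set of k marker names with the best scores.

    `scores` maps marker name to score. Use smallest_first=True for
    p-values and smallest_first=False for importance values.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of markers")
    ranked = sorted(scores, key=scores.get, reverse=not smallest_first)
    return set(ranked[:k])


def top_k_overlap(p_values, importances, k):
    """Fraction of the top-k markers shared by both rankings."""
    by_p_value = top_k(p_values, k, smallest_first=True)
    by_importance = top_k(importances, k, smallest_first=False)
    return len(by_p_value & by_importance) / k


# Toy scores only; marker names and values are invented.
toy_p_values = {"a": 0.001, "b": 0.01, "c": 0.2, "d": 0.5}
toy_importances = {"a": 9.0, "b": 1.0, "c": 8.0, "d": 2.0}
toy_overlap = top_k_overlap(toy_p_values, toy_importances, k=2)
```

Evaluating the function over a range of k values shows how agreement between the two methods changes as the compared sets grow.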

Importantly, we show that ranking of loci based on results from a univariate analysis may be helpful for narrowing the RF search. Using a starting point of a set of 150 top-ranked SNPs based on p-value (lowest to highest) from the univariate analysis, the RF backward purging steps were able to identify a set of relatively few SNPs (18) required to reach nearly maximal explained residual trait variation (44%). Therefore, when results from a univariate GWAS are used to help inform a search within RF, this step may enable one to identify an optimal set of candidate SNPs (based on their explanatory power for a trait of interest) but with fewer steps than if one were to use RF exclusively. More investigation is warranted with this relatively new application of RF for GWAS; however, RF as a stand-alone analysis may have difficulty finding the optimum solution when so many possible (albeit suboptimal) solutions are probably present in the data given the many thousands of SNP loci that are used in a typical GWAS. Further, univariate analyses can help to overcome one limitation of RF (i.e. a lack of statistical properties for evaluation of SNP variables found to have high predictive power for a trait of interest [21]). The advantage of RF for identifying sets of candidate SNPs that can operate synergistically to provide maximal predictive power for a trait of interest will probably continue to drive its adoption as an analytical tool for GWAS.
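
The general idea of backward purging from a pre-ranked starting set can be sketched as follows. This is a schematic under explicit assumptions and not the exact procedure applied here: the model is abstracted as a caller-supplied `evaluate` function that returns the explained variation and per-SNP importance values (in practice these might come from an RF fit), and `drop_per_step`, `tolerance` and the toy effect sizes are hypothetical.

```python
def backward_purge(ranked_snps, evaluate, drop_per_step=1, tolerance=0.01):
    """Iteratively drop the least important SNPs and pick a compact set.

    `evaluate(snps)` must return (explained_variation, importances),
    where `importances` maps each SNP in `snps` to a number.
    Returns the smallest SNP set whose explained variation is within
    `tolerance` of the best value seen, together with that value.
    """
    current = list(ranked_snps)
    history = []
    while current:
        explained, importances = evaluate(current)
        history.append((list(current), explained))
        if len(current) == 1:
            break
        # Order by importance, highest first, then trim the tail.
        current.sort(key=lambda snp: importances[snp], reverse=True)
        n_drop = min(drop_per_step, len(current) - 1)
        current = current[: len(current) - n_drop]
    best = max(explained for _, explained in history)
    near_best = [item for item in history if item[1] >= best - tolerance]
    return min(near_best, key=lambda item: len(item[0]))


# Hypothetical effect sizes standing in for a fitted model.
TOY_EFFECTS = {"snp1": 0.30, "snp2": 0.10, "snp3": 0.04, "snp4": 0.0, "snp5": 0.0}


def toy_evaluate(snps):
    """Toy scorer: real effects add signal, null SNPs add a small penalty."""
    signal = sum(TOY_EFFECTS[snp] for snp in snps)
    noise_penalty = 0.01 * sum(1 for snp in snps if TOY_EFFECTS[snp] == 0.0)
    return signal - noise_penalty, {snp: TOY_EFFECTS[snp] for snp in snps}


toy_selected, toy_explained_variation = backward_purge(list(TOY_EFFECTS), toy_evaluate)
```

Starting from a short, pre-ranked list (for example the top markers by univariate p-value) keeps the number of refits small, which is the efficiency argument made above.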

(c). Conservation implications

The candidate genetic markers identified in this study may help predict the adult migration timing of individual steelhead throughout the entire course of their life cycle, which would benefit long-term conservation management of this protected species. Ability to identify the genetic propensity for migratory traits in steelhead would be useful for a multitude of applications including characterizing differences associated with these adult alternative migration tactics that pertain to pre-adult life stages (e.g. juvenile migration and size-at-age), and categorizing adults on spawning grounds into migration categories. Currently, steelhead are categorized into summer- or winter-run based on the timing when they enter streams near the mouth of their natal tributary. However, steelhead may overwinter in freshwater areas outside of their natal tributary, which complicates their classification as summer- or winter-run, and therefore underscores the need for a method of genetic classification. Further research is warranted to characterize the extent to which this genetic mechanism for this migration-timing trait applies across the geographical distribution of the species.

Supplementary Material

Supplemental Figures and Methods
rspb20153064supp1.pdf (1.1MB, pdf)

Supplementary Material

Supplemental Tables
rspb20153064supp2.xls (6.5MB, xls)

Acknowledgements

We thank Marine Brieuc, Benjamin Hecht and Nate Campbell for their help with analyses. We also thank two anonymous reviewers and Fuwen Wei for suggestions to improve the original manuscript.

Ethics

Trapping and sampling were authorized by permit from the National Marine Fisheries Service under the Endangered Species Act.

Data accessibility

The data used in this study are available in Dryad: http://dx.doi.org/10.5061/dryad.62q6n.

Authors' contributions

J.E.H. conceived and designed the study, analysed the data and drafted the manuscript; J.S.Z. coordinated data collection; A.R.M. conducted bioinformatics; S.R.N. coordinated the study and helped draft the manuscript. All authors gave final approval for publication.

Competing interests

We have no competing interests.

Funding

This study was funded by Bonneville Power Administration.

References

1. Pulido F, Berthold P, Mohr G, Querner U. 2001. Heritability of the timing of autumn migration in a natural bird population. Proc. R. Soc. Lond. B 268, 953–959. (doi:10.1098/rspb.2001.1602)
2. Hanson KC, et al. 2008. Individual variation in migration speed of upriver-migrating sockeye salmon in the Fraser River in relation to their physiological and energetic status at marine approach. Physiol. Biochem. Zool. 81, 255–268. (doi:10.1086/529460)
3. Hess JE, Caudill CC, Keefer ML, McIlraith BJ, Moser ML, Narum SR. 2014. Genes predict long distance migration and large body size in a migratory fish, Pacific lamprey. Evol. Appl. 7, 1192–1208. (doi:10.1111/eva.12203)
4. Lyons JI, Pierce AA, Barribeau SM, Sternberg ED, Mongue AJ, De Roode JC. 2012. Lack of genetic differentiation between monarch butterflies with divergent migration destinations. Mol. Ecol. 21, 3433–3444. (doi:10.1111/j.1365-294X.2012.05613.x)
5. Liedvogel M, Akesson S, Bensch S. 2011. The genetics of migration on the move. Trends Ecol. Evol. 26, 561–569. (doi:10.1016/j.tree.2011.07.009)
6. Bronmark C, Hulthen K, Nilsson PA, Skov C, Hansson L-A, Brodersen J, Chapman BB. 2014. There and back again: migration in freshwater fishes. Can. J. Zool. 92, 467–479. (doi:10.1139/cjz-2012-0277)
7. Davidson WS, Koop BF, Jones SJ, Iturra P, Vidal R, Maass A, Jonassen I, Lien S, Omholt SW. 2010. Sequencing the genome of the Atlantic salmon (Salmo salar). Genome Biol. 11, 403.
8. Berthelot C, et al. 2014. The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nat. Commun. 5, 3657. (doi:10.1038/ncomms4657)
9. Myers JM, Busack C, Rawding D, Marshall AR, Teel DJ, Van Doornik DM, Maher MT. 2006. Historical population structure of Pacific salmonids in the Willamette River and lower Columbia River basins. NOAA technical memorandum NMFS-NWFSC-73. Seattle, WA: NOAA NWFSC.
10. Stanford JA, Lorang MS, Hauer FR. 2005. The shifting habitat mosaic of river ecosystems. Proc. Int. Soc. Limnol. 29, 123–136.
11. Good TP, Waples RS, Adams P. (eds). 2005. Updated status of federally listed ESUs of West Coast salmon and steelhead. NOAA technical memorandum NMFS-NWFSC-66. Seattle, WA: NOAA NWFSC.
12. Withler IL. 1966. Variability in life history characteristics of steelhead trout (Salmo gairdneri) along the Pacific Coast of North America. J. Fish. Res. Board Can. 23, 365–393. (doi:10.1139/f66-031)
13. Utter FM, Allendorf FW. 1977. Determination of the breeding structure of steelhead populations through gene frequency analysis. In Proc. of the Genetic Implications of Steelhead Management Symp., 20–21 May 1977, Special Report 77-1 (eds TJ Hassler, RR VanKirk), pp. 44–54. Arcata, CA: Coop. Fish. Res. Unit. Calif.
14. Arciniega M, Clemento AJ, Miller MR, Peterson M, Garza JC, Pearse DE. 2016. Parallel evolution of the summer steelhead ecotype in multiple populations from Oregon and Northern California. Conserv. Genet. 17, 165–175. (doi:10.1007/s10592-015-0769-2)
15. Thorgaard GH. 1983. Chromosomal differences among rainbow trout populations. Copeia 1983, 650–662. (doi:10.2307/1444329)
16. Leider SA, Chilcote MW, Loch JJ. 1984. Spawning characteristics of sympatric populations of steelhead trout (Salmo gairdneri): evidence for partial reproductive isolation. Can. J. Fish. Aquat. Sci. 41, 1454–1462. (doi:10.1139/f84-179)
17. Sharpe C, Hulett P, Wagerman C. 2000. Studies of hatchery and wild steelhead in the lower Columbia region. Washington Department of Fish and Wildlife report no. FPA 00-10. Olympia, WA: Washington Department of Fish and Wildlife.
18. Astle W, Balding DJ. 2009. Population structure and cryptic relatedness in genetic association studies. Stat. Sci. 24, 451–471. (doi:10.1214/09-STS307)
19. Narum SR, Zendt JS, Graves D, Sharp WR. 2008. Influence of landscape on resident and anadromous life history types of Oncorhynchus mykiss. Can. J. Fish. Aquat. Sci. 65, 1013–1023. (doi:10.1139/F08-025)
20. Roff DA, Fairbairn DJ. 2007. The evolution and genetics of migration in insects. Bioscience 57, 155–164. (doi:10.1641/B570210)
21. Goldstein BA, Polley EC, Briggs FBS. 2011. Random forests for genetic association studies. Stat. Appl. Genet. Mol. Biol. 10, 1–34. (doi:10.2202/1544-6115.1691)
22. Pritchard JK, Stephens M, Donnelly P. 2000. Inference of population structure using multilocus genotype data. Genetics 155, 945–959.
23. Matala AP, Ackerman MW, Campbell MR, Narum SR. 2014. Relative contributions of neutral and non-neutral genetic differentiation to inform conservation of steelhead trout across highly variable landscapes. Evol. Appl. 7, 682–701. (doi:10.1111/eva.12174)
24. Hecht BC, Campbell NR, Holecek DE, Narum SR. 2013. Genome-wide association reveals genetic basis for the propensity to migrate in wild populations of rainbow and steelhead trout. Mol. Ecol. 22, 3061–3076. (doi:10.1111/mec.12082)
25. Miller MR, et al. 2012. A conserved haplotype controls parallel adaptation in geographically distant salmonid populations. Mol. Ecol. 21, 237–249. (doi:10.1111/j.1365-294X.2011.05305.x)
26. Catchen J, Hohenlohe PA, Bassham S, Amores A, Cresko WA. 2013. Stacks: an analysis tool set for population genomics. Mol. Ecol. 22, 3124–3140. (doi:10.1111/mec.12354)
27. Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359. (doi:10.1038/nmeth.1923)
28. Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21, 3674–3676. (doi:10.1093/bioinformatics/bti610)
29. Al-Shahrour F, Minguez P, Tarraga J, Medina I, Alloza E, Montaner D, Dopazo J. 2007. FatiGO+: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res. 35, W91–W96. (doi:10.1093/nar/gkm260)
30. Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633–2635. (doi:10.1093/bioinformatics/btm308)
31. Yu J, et al. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38, 203–208. (doi:10.1038/ng1702)
32. Endelman JB, Jannink JL. 2012. Shrinkage estimation of the realized relationship matrix. Genes Genomes Genet. 2, 1405–1413. (doi:10.1534/g3.112.004259)
33. Zhang Z, et al. 2010. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 42, 355–360. (doi:10.1038/ng.546)
34. Holliday JA, Wang T, Aitken S. 2012. Predicting adaptive phenotypes from multilocus genotypes in Sitka spruce (Picea sitchensis) using Random Forest. Genes Genomes Genet. 2, 1085–1093.
35. Brieuc MSO, Ono K, Drinan DP, Naish KA. 2015. Integration of random forest with population-based outlier analyses provides insight on the genomic basis and evolution of run timing in Chinook salmon (Oncorhynchus tshawytscha). Mol. Ecol. 24, 2729–2746. (doi:10.1111/mec.13211)
36. Liaw A, Wiener M. 2002. Classification and regression by random forest. R News 2, 18–20.
37. Rosenberg MS, Anderson CD. 2011. PASSaGE: pattern analysis, spatial statistics and geographic exegesis. Version 2. Methods Ecol. Evol. 2, 229–232. (doi:10.1111/j.2041-210X.2010.00081.x)
38. Excoffier L, Lischer HEL. 2010. Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol. Ecol. Resour. 10, 564–567. (doi:10.1111/j.1755-0998.2010.02847.x)
39. Narum SR. 2006. Beyond Bonferroni: less conservative analyses for conservation genetics. Conserv. Genet. 7, 783–787. (doi:10.1007/s10592-005-9056-y)
40. Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. 57, 289–300.
41. de Mello F, Streit DP Jr, Sabin N, Gabillard JC. 2014. Identification of TGF-β, inhibin βA and follistatin paralogs in the rainbow trout genome. Comp. Biochem. Physiol. B 177, 46–55. (doi:10.1016/j.cbpb.2014.07.006)
42. Evanno G, Regnaut S, Goudet J. 2005. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol. Ecol. 14, 2611–2620. (doi:10.1111/j.1365-294X.2005.02553.x)
43. Dodson JJ, Aubin-Horth N, Thériault V, Páez DJ. 2013. The evolutionary ecology of alternative migratory tactics in salmonid fishes. Biol. Rev. 88, 602–625. (doi:10.1111/brv.12019)
44. Pearse DE, Miller MR, Abadía-Cardoso A, Garza JC. 2014. Rapid parallel evolution of standing variation in a single, complex, genomic region is associated with life history in steelhead/rainbow trout. Proc. R. Soc. B 281, 20140012. (doi:10.1098/rspb.2014.0012)
45. Gienapp P, Leimu R, Merila J. 2007. Response to climate change in avian migration time—microevolution versus phenotypic plasticity. Clim. Res. 35, 25–35. (doi:10.3354/cr00712)
46. Reveillac E, Feunteun E, Berrebi P, Gagnaire P-A, Lecomte-Finiger R, Bosc P, Robinet T. 2008. Anguilla marmorata larval migration plasticity as revealed by otolith microstructural analysis. Can. J. Fish. Aquat. Sci. 65, 2127–2137. (doi:10.1139/F08-122)
47. Quillfeldt P, Voigt CC, Masello JF. 2010. Plasticity versus repeatability in seabird migratory behaviour. Behav. Ecol. Sociobiol. 64, 1157–1164. (doi:10.1007/s00265-010-0931-2)
48. Yeaman S, Whitlock MC. 2011. The genetic architecture of adaptation under migration–selection balance. Evolution 65, 1897–1911. (doi:10.1111/j.1558-5646.2011.01269.x)
49. Barson NJ, et al. 2015. Sex-dependent dominance at a single locus maintains variation in age at maturity in salmon. Nature 528, 405–408. (doi:10.1038/nature16062)
50. Küpper C, et al. 2016. A supergene determines highly divergent male reproductive morphs in the ruff. Nat. Genet. 48, 79–83. (doi:10.1038/ng.3443)
51. O'Malley KG, Jacobson DP, Kurth R, Dill AJ, Banks MA. 2013. Adaptive genetic markers discriminate migratory runs of Chinook salmon (Oncorhynchus tshawytscha) amid continued gene flow. Evol. Appl. 6, 1184–1194. (doi:10.1111/eva.12095)
52. Phillips RB, Keatley KA, Morasch MR, Ventura AB, Lubieniecki KP, Koop BF, Danzmann RG, Davidson WS. 2009. Assignment of Atlantic salmon (Salmo salar) linkage groups to specific chromosomes: conservation of large syntenic blocks corresponding to whole chromosome arms in rainbow trout (Oncorhynchus mykiss). BMC Genet. 10, 46. (doi:10.1186/1471-2156-10-46)
53. Mohammed H, et al. 2013. Endogenous purification reveals GREB1 as a key estrogen receptor regulatory factor. Cell Rep. 3, 342–349. (doi:10.1016/j.celrep.2013.01.010)
54. Choi YJ, Kim NN, Shin HS, Choi CY. 2014. The expression of leptin, estrogen receptors, and vitellogenin mRNAs in migrating female Chum Salmon, Oncorhynchus keta: the effects of hypo-osmotic environmental changes. Asian Australas. J. Anim. Sci. 27, 479–487. (doi:10.5713/ajas.2013.13592)
55. Goodswen SJ, Gondro C, Watson-Haigh NS, Kadarmideen HN. 2010. FunctSNP: an R package to link SNPs to functional knowledge and dbAutoMaker: a suite of Perl scripts to build SNP databases. BMC Bioinform. 11, 311. (doi:10.1186/1471-2105-11-311)
56. Palti Y, Gao G, Liu S, Kent MP, Lien S, Miller MR, Rexroad CE, Moen T. 2015. The development and characterization of a 57 K single nucleotide polymorphism array for rainbow trout. Mol. Ecol. Resour. 15, 662–672. (doi:10.1111/1755-0998.12337)

Articles from Proceedings of the Royal Society B: Biological Sciences are provided here courtesy of The Royal Society
