2025 Jul 11;25(7):e70011. doi: 10.1111/1755-0998.70011

‘Highly‐Informative’ Genetic Markers Can Bias Conclusions: Examples and General Solutions

Andy Lee 1, William Hemstrom 1,2, Natalie Molea 3,4, Gordon Luikart 3,4, Mark R Christie 1,5
PMCID: PMC12415817  PMID: 40641441

ABSTRACT

High‐grading bias is the overestimation of power in a subset of loci caused by model overfitting. Using both empirical and simulated datasets, we show that high‐grading bias can cause severe overestimation of population structure, and thus mislead investigators, whenever highly informative or high‐F ST markers are chosen (i.e., ascertained) and used for subsequent assessments, a common practice in population genetic studies. This problem can occur in panmictic populations with no local adaptation. Biased results from choosing high‐F ST markers may have severe downstream implications for management and conservation, such as erroneous conservation unit delineation, which could squander limited conservation resources to protect incorrectly defined ‘populations’. Furthermore, we caution that high‐grading is not limited to F ST approaches; high‐grading bias is a concern whenever a small subset of markers is first chosen to explain differences among groups based on their degree of difference and is subsequently reused to estimate the degree of difference among those groups. For example, selecting high‐F ST loci for use in a GT‐seq panel or using differentially expressed genes to plot sample membership in multivariate space can both result in spurious structure when none exists. We illustrate that using statistically based outlier tests in place of arbitrary F ST cut‐offs can reduce bias. Alternatively, permutation tests or cross‐evaluation can be used to detect high‐grading bias. We provide an R package, PCAssess, to help researchers detect and prevent high‐grading bias in genetic datasets by automating permutation tests and principal component analyses (https://github.com/hemstrow/PCAssess).

Keywords: ecological genetics, genomics/proteomics, natural selection and contemporary evolution, population genetics—theoretical

1. Introduction

The analysis of large datasets has become increasingly common across diverse fields and applications such as medicine, manufacturing, remote sensing, and transportation (Chi et al. 2016; Li et al. 2022; Schweizer et al. 2021; Silva et al. 2020; Wyatt and Liu 2002). Due to the limitations imposed by working with large datasets in biology and related fields, data are often filtered to informative subsets that reduce the costs of future analyses and expedite downstream analysis (Hemstrom et al. 2024; Carvey et al. 2024; Glover et al. 2013; Karlsson et al. 2011; Whitaker et al. 2020). When conducting large‐scale grouping of individuals to populations or geographic locations, for example, researchers often choose (ascertain) subsets of genetic loci that are most differentiated among populations and thus most informative for assigning individuals to a population of origin (Banks et al. 2003; Manel et al. 2005; Tvedebrink 2022). With informative loci in hand, researchers can then use inexpensive targeted genotyping approaches such as SNP‐Chips, Rapture, or Genotyping‐in‐Thousands (GT‐seq) (Campbell et al. 2015; Karlsson et al. 2011; Li et al. 2008; Tosser‐Klopp et al. 2014; Ali et al. 2016). These approaches can dramatically reduce costs, even in comparison to reduced‐representation genomic approaches such as RAD‐seq (Andrews et al. 2016).

Researchers can choose informative or candidate loci using many different criteria. For example, researchers often identify putatively adaptive loci using F ST ‐outlier analyses or genotype‐environment associations (GEA) (Kess et al. 2018; Koot et al. 2021; Milano et al. 2014; Samad‐zada et al. 2023; Shen et al. 2019; Silliman 2019; Vu et al. 2020) or choose highly differentiated SNPs using arbitrary cut‐offs (e.g., F ST thresholds, Jansson et al. 2023; Kaiser et al. 2021; Martinez et al. 2017) or distances from the mean (Barr et al. 2023; Fuentes‐Pardo et al. 2024; Han et al. 2020; Weist et al. 2022). Beyond genotypic data, authors may select the most divergent differentially expressed genes (DEGs) to visualise treatment effects (e.g., micro‐habitat, temperature, disease, etc.) on a principal component analysis (PCA) in developmental (Roux et al. 2023), disease (Salis et al. 2022), and ecological (Lee et al. 2024) studies. For the purposes of this paper, we therefore hereafter refer to any subset of markers chosen from a larger set because of their particularly high power in delineating between groups (e.g., due to particularly high F ST values) as ‘highly informative’ markers.

Unfortunately, choosing informative loci from a distribution (e.g., of F ST values) can cause high‐grading bias, an overestimation of power in a subset of loci due to model overfitting (Waples 2010). Put simply, choosing informative (e.g., highly differentiated) loci from an F ST distribution without care can mislead investigators about the nature of population structure in those data. For example, if investigators use PCA or STRUCTURE to analyse populations using only loci with ‘high’ values of F ST , any structure they find may be completely different from the structure reflected by loci with lower values of F ST . This is not necessarily problematic if high F ST values at some loci are due to biological processes. However, crucially, high‐grading bias may cause researchers to overestimate structure or detect structure where none exists in cases where loci have higher F ST by chance alone (as in panmictic populations with no or very little local adaptation).

High‐grading bias can be thought of as a specific form of ascertainment bias, whereby loci that were ascertained as informative or correlated by chance in one dataset have little informative power elsewhere. This phenomenon arises because, when many loci or other explanatory variables are used in a model or to calculate statistics, some will inevitably be correlated with any given dependent variable (e.g., geographic distance, environment) by chance alone. However, when loci correlate with variables of interest by chance (rather than through a mechanistic factor), those loci should have little to no predictive power when used in a new set of samples. This is a well‐known issue in statistics (Mosteller and Tukey 1977) and machine learning (Guyon and Elisseeff 2003) and has been well described previously in the context of population assignment by Anderson (2010), in which substantial over‐estimations of statistical power for population assignment are expected when only the high‐F ST loci are genotyped on new or independent samples in the future.
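This loss of out‐of‐sample power is easy to demonstrate: loci ascertained as ‘most differentiated’ between arbitrary groups in one sample of a panmictic population show no excess differentiation in an independent sample. Below is an illustrative Python sketch; the allele‐frequency difference used here is a simple stand‐in for per‐locus F ST, and all names and parameter values are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 60, 4000
p = rng.uniform(0.2, 0.8, m)  # one shared set of allele frequencies

# Two independent samples from the SAME panmictic population (0/1/2 genotypes).
discovery = rng.binomial(2, p, size=(n, m))
validation = rng.binomial(2, p, size=(n, m))
labels = np.repeat([0, 1], n // 2)  # arbitrary, biologically meaningless groups

def group_diff(geno):
    # Absolute allele-frequency difference between the two label groups.
    fa = geno[labels == 0].mean(axis=0) / 2
    fb = geno[labels == 1].mean(axis=0) / 2
    return np.abs(fa - fb)

# Ascertain the top 5% most 'differentiated' loci in the discovery sample...
d_disc = group_diff(discovery)
top = d_disc >= np.quantile(d_disc, 0.95)

# ...then re-measure the same loci in an independent sample: the apparent
# signal evaporates, because it was never biological to begin with.
print(f"discovery,  top loci: {d_disc[top].mean():.3f}")
print(f"validation, top loci: {group_diff(validation)[top].mean():.3f}")
print(f"validation, all loci: {group_diff(validation).mean():.3f}")
```

The ascertained loci look dramatically differentiated in the discovery sample but are indistinguishable from the genome‐wide background in the validation sample, mirroring the assignment‐power inflation described by Anderson (2010).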

High‐grading bias is less well recognised in other types of genetic analysis. For example, clustering and other population structure analyses, such as principal component analyses (PCA) (Patterson et al. 2006; Price et al. 2006), STRUCTURE (Pritchard et al. 2000), and other similar methods, are all prone to high‐grading bias when researchers select the most differentiated loci based on a priori grouping of samples (e.g., by populations). Selecting highly differentiated loci may be tempting when no population structure is apparent in a dataset as a whole, but local adaptation or other subtle biological processes are suspected to drive differentiation at only a handful of loci. Re‐analysing the data, or adding new sample collection sites sequenced at a subset of ‘highly informative’ (i.e., highly differentiated) loci, can seem useful in such cases, but doing so risks high‐grading. At best, high‐grading bias will result in wasted time and money stemming from over‐confidence in the ability of the selected markers to delineate populations; at worst, it could result in mismanagement based on presumed ‘population structure’ that exists solely as a result of statistical artefacts.

Here, we explore the impacts of high‐grading bias in studies of population structure in more depth. We first illustrate the pitfalls of choosing the most highly differentiated SNPs using a priori groups and the resulting unintended consequences in PCAs and STRUCTURE‐like clustering analyses. For contrast and to demonstrate that high‐grading can affect non‐genotypic data, we next demonstrate high‐grading bias in RNA‐seq data. As a solution, we demonstrate the efficacy of both outlier detection and permutation test approaches to detect instances where high‐grading bias generates spurious clustering or where ‘highly informative’ subsets of high‐F ST loci truly remain unbiased. To make the process of detecting high‐grading bias easier for researchers, we also provide an R package, PCAssess, which automates permutation tests for high‐grading in PCAs.

2. Methods

2.1. High‐Grading Bias With Empirical Data

To quantify and illustrate high‐grading bias, we used previously published genotypic data from a single, panmictic population of North American monarch butterflies (Hemstrom et al. 2022). We filtered this dataset by (1) removing loci with a minor allele frequency of less than 0.05 to maximise power for detecting population structure (Linck and Battey 2019), (2) removing loci sequenced in less than 75% of individuals, followed by (3) removing individuals with less than 75% of loci sequenced (final n = 63,514 SNPs and 83 individuals). We next randomly assigned individuals to four ‘populations’ (A, B, C, and D). Given that these ‘populations’ were assigned randomly, they have no biological relevance and do not reflect any geographic or environmental groupings. To determine if high‐grading bias could generate spurious ‘population structure’ in such a case, we created a high‐F ST dataset by calculating global F ST (Weir and Cockerham 1984) for each locus across all four populations and then retaining the top 5% of loci by F ST , which resulted in 3176 SNPs and an F ST value equal to 0.0672. All filtering and F ST calculations were performed in the package snpR (Hemstrom and Jones 2023) in R (R Core Team 2022).
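The filtering order above can be sketched as follows. This is an illustrative Python version with toy data (the actual filtering was performed with snpR in R; only the thresholds follow the text):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy genotype matrix (individuals x loci): 0/1/2 allele counts, NaN = missing.
geno = rng.binomial(2, rng.uniform(0.02, 0.5, 300), size=(40, 300)).astype(float)
geno[rng.random(geno.shape) < 0.1] = np.nan

# (1) Remove loci with a minor allele frequency below 0.05.
p = np.nanmean(geno, axis=0) / 2
geno = geno[:, np.minimum(p, 1 - p) >= 0.05]

# (2) Remove loci sequenced in fewer than 75% of individuals.
geno = geno[:, np.isnan(geno).mean(axis=0) <= 0.25]

# (3) Remove individuals sequenced at fewer than 75% of the remaining loci.
geno = geno[np.isnan(geno).mean(axis=1) <= 0.25]

print(geno.shape)  # (individuals retained, loci retained)
```

Note that step order matters: the MAF filter is applied first, so call‐rate thresholds are evaluated only on loci that remain informative.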

We employed a range of commonly used approaches to determine the effect of high‐grading on population inference. First, we conducted a PCA using the smartPCA method (Patterson et al. 2006; Price et al. 2006). Second, we used the program STRUCTURE (Pritchard et al. 2000) to group individuals into k clusters for each k value from 1 to 4, using 20,000 burn‐in and 100,000 iterations. To determine the degree of variance across STRUCTURE runs, we ran the program 10 times for each value of k. We then used the Δk method to determine the ‘optimal’ k value (Evanno et al. 2005). Given that this method does not reliably account for higher‐order population structuring (Janes et al. 2017), we also visualised our results across all k values by condensing results across runs using the ‘greedy’ option in CLUMPP (Jakobsson and Rosenberg 2007). We used both snpR and pophelper (Francis 2017) to run, organise, and plot these results. Lastly, we used an assignment test to assign samples back to their ‘population’ IDs using the ‘self_assign()’ function with the default parameters in Rubias (Moran and Anderson 2019). In addition to these three approaches, we also evaluated the effects of high‐grading bias with: discriminant analysis of principal components (DAPC), Uniform Manifold Approximation and Projection (UMAP), t‐distributed Stochastic Neighbour Embedding (t‐SNE), and sparse Non‐negative Matrix Factorization (sNMF) as described in Methods S1. We ran each of these methods on two datasets: one containing all genomic loci that passed our quality filters, and the high‐F ST dataset, containing the top 5% of loci by F ST .
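The smartPCA method centres each locus and scales it by its expected binomial standard deviation before decomposition. Below is a minimal Python sketch of that normalisation, assuming 0/1/2 genotype coding (the function name and toy data are ours; the analyses here used the smartPCA implementation of Patterson et al. 2006):

```python
import numpy as np

def smartpca_scores(geno, k=2):
    # smartPCA-style normalisation (Patterson et al. 2006): centre each locus
    # at twice its allele frequency, then scale by sqrt(p * (1 - p)), the
    # expected binomial standard deviation for a diploid locus.
    p = geno.mean(axis=0) / 2
    x = (geno - 2 * p) / np.sqrt(p * (1 - p))
    # Principal component scores via singular value decomposition.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :k] * s[:k]

rng = np.random.default_rng(11)
# Toy data: 40 individuals x 500 loci drawn from one set of allele frequencies.
geno = rng.binomial(2, rng.uniform(0.2, 0.8, 500), size=(40, 500)).astype(float)
scores = smartpca_scores(geno)
print(scores.shape)  # (40, 2)
```

The same score matrix can then be plotted for the full SNP set and for the top‐5% F ST subset to compare apparent clustering.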

2.2. High‐Grading Bias in Non‐Genotypic (Gene Expression) Data

To explore high‐grading bias in non‐genotypic data, we randomly assigned individuals from a single, panmictic population to two pseudo‐treatment groups using a pre‐existing RNA‐seq expression dataset (Lee et al. 2024). In this dataset, the authors reared wild‐caught adult Kellet's whelks ( Kelletia kelletii ) in a common environment with no experimental treatments, and RNA‐seq was conducted on the F1 offspring. Using the largest population available in this dataset (Monterey, n = 30), we used the R package DESeq2 v.1.34 (Love et al. 2014) to identify DEGs between the two randomly assigned pseudo‐treatment groups. To test for high‐grading bias, we chose the top 1000 most divergent DEGs (as performed in Roux et al. 2023; Salis et al. 2022), measured by log2 fold change in expression between the randomly assigned groups. We also assessed outlier DEGs using a minimum significance threshold of 0.05 after false discovery rate correction via the Benjamini–Hochberg method (Benjamini and Hochberg 1995). We repeated this entire process 100 times to determine the degree to which our results were consistent across different initial datasets.
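For reference, the Benjamini–Hochberg step‐up procedure can be sketched in a few lines. This generic Python illustration is ours; in practice the correction was applied through DESeq2's adjusted p‐values:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    # Step-up rule: find the largest rank i with p_(i) <= (i / m) * alpha,
    # then call every p-value of rank <= i a discovery.
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.nonzero(below)[0])
        reject[order[: cutoff + 1]] = True
    return reject

# Mostly-null p-values plus two strong signals (invented for the example).
pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.34, 0.52, 0.61, 0.74, 0.99]
print(benjamini_hochberg(pvals))
```

Here only the two smallest p‐values survive: 0.039 fails because it exceeds its rank‐specific threshold of (3/10) × 0.05 = 0.015, which is why an FDR cut‐off is far stricter than a raw 0.05 threshold.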

2.3. High‐Grading Bias in Simulated Data

To evaluate the conditions under which high‐grading bias does and does not create spurious population structure and to evaluate the efficacy of potential solutions, we used scrm (Staab et al. 2015) and custom R scripts to create forward‐time simulations for three different sets of data (see Methods S1 for details):

  1. Panmictic: a single, large population with no selection or introgression from divergent populations.

  2. High gene flow: four populations on discrete islands with twice the carrying capacity of our panmictic population, a high migration rate (an average of 40% of individuals leave each population and migrate to the other three islands with equal probability), and strong, differential, phenotype‐based survival on each island. Phenotypes were randomly assigned from a distribution with no adaptive genetic variation, such that survival was random.

  3. High gene flow with local adaptation: the same model as our high gene flow scenario, but with adaptive loci that contribute to individual phenotypes and thus are subject to selection and local adaptation on each island. Sixty loci were randomly selected to have additive effects on phenotype, the sizes of which were randomly drawn from a normal distribution. Phenotypes were then determined for each individual by adding the cumulative genetic effects across all loci to an environmental effect drawn from a normal distribution with a variance such that narrow‐sense heritability was 0.5 in the first generation.

Individuals were sampled for analysis after 50 generations, at the end of the simulation; where applicable, sampling occurred in the parental generation after selection. For each of these datasets, we repeated our F ST selection and PCA procedures as described above.
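The phenotype construction in the local adaptation scenario can be illustrated with a short sketch: given additive genetic values, the environmental variance is tuned so that narrow‐sense heritability h² = Va/(Va + Ve) equals 0.5, which implies Ve = Va. The Python below is a toy illustration with invented effect sizes and allele frequencies (the simulations themselves used scrm and custom R scripts):

```python
import numpy as np

rng = np.random.default_rng(7)
n_ind, n_qtl = 500, 60

# Sixty loci with additive effects on phenotype (effect sizes are invented).
effects = rng.normal(0.0, 1.0, n_qtl)
geno = rng.binomial(2, 0.5, size=(n_ind, n_qtl)).astype(float)
g = geno @ effects  # cumulative genetic value per individual

# Tune environmental variance so narrow-sense heritability h2 = 0.5:
# h2 = Va / (Va + Ve)  =>  Ve = Va * (1 - h2) / h2  (= Va when h2 = 0.5).
h2 = 0.5
var_e = np.var(g) * (1 - h2) / h2
pheno = g + rng.normal(0.0, np.sqrt(var_e), n_ind)

print(np.var(g) / np.var(pheno))  # realised h2, close to 0.5
```

The realised heritability drifts slightly from 0.5 because both variance components are estimated from a finite sample.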

2.4. Permutations and Outlier Testing

We used a permutation procedure to assess the degree of bias introduced by high‐grading in our simulated datasets. To do so, we randomly reshuffled population IDs among individual multi‐locus genotypes without replacement (i.e., each ID was re‐assigned to exactly one multilocus genotype), recalculated F ST , chose high‐F ST loci, and then ran a PCA as before (Figure 2). We then quantified the change in clustering by using a Multivariate Analysis of Variance (MANOVA) test in R with the first two PCs as response variables and population ID as the explanatory variable for both datasets, and then measured the change in the resulting F‐statistic (ΔF) between the highest‐F ST and all‐SNPs datasets (Figure 2). Given that the sample sizes and degrees of freedom were equal across all permutations and that higher MANOVA F‐statistics represent cases where more of the variance in the PCA is explained by populations, ΔF measures the increase in population clustering when the highest‐F ST loci are used. Correspondingly, a large ΔF represents a large observed increase in within‐group clustering when the highest‐F ST loci are used, and a smaller or negative ΔF represents less of an increase or a decrease in clustering when the highest‐F ST loci are used. We calculated F‐statistics in R using the recommended Pillai–Bartlett test statistic (Bartlett 1939; Pillai 1955). To generate a null distribution of the ΔF expected by chance alone, we repeated the entire process of permuting population IDs, running the MANOVA, and calculating ΔF 1000 times for each dataset. We also calculated ΔF a single time from our empirical, non‐permuted dataset.
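The permutation procedure can be sketched end to end. In this illustrative Python version we substitute a simple between‐group/within‐group sum‐of‐squares ratio for the Pillai–Bartlett MANOVA F‐statistic, and the across‐group allele‐frequency spread for Weir–Cockerham F ST; all function names and data are invented for the example:

```python
import numpy as np

def pc_scores(geno, k=2):
    # Scores on the first k principal components of the centred genotypes.
    x = geno - geno.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :k] * s[:k]

def pseudo_f(scores, labels):
    # Between-group / within-group sum of squares on the PC scores: a simple
    # stand-in for the Pillai-Bartlett MANOVA F-statistic used in the paper.
    grand = scores.mean(axis=0)
    between = within = 0.0
    for lab in np.unique(labels):
        grp = scores[labels == lab]
        between += len(grp) * np.sum((grp.mean(axis=0) - grand) ** 2)
        within += np.sum((grp - grp.mean(axis=0)) ** 2)
    return between / within

def delta_f(geno, labels, top_frac=0.05):
    # Across-group allele-frequency spread, a stand-in for per-locus F_ST.
    freqs = np.array([geno[labels == lab].mean(axis=0) / 2
                      for lab in np.unique(labels)])
    spread = freqs.max(axis=0) - freqs.min(axis=0)
    top = spread >= np.quantile(spread, 1 - top_frac)
    # Change in clustering when only the most 'informative' loci are kept.
    return pseudo_f(pc_scores(geno[:, top]), labels) - pseudo_f(pc_scores(geno), labels)

rng = np.random.default_rng(3)
# One panmictic sample: 80 individuals, 1000 loci, four arbitrary group labels.
geno = rng.binomial(2, rng.uniform(0.1, 0.9, 1000), size=(80, 1000)).astype(float)
labels = rng.permutation(np.repeat(np.arange(4), 20))

observed = delta_f(geno, labels)
null = np.array([delta_f(geno, rng.permutation(labels)) for _ in range(200)])
print(observed, null.mean())
```

Because the group labels here are arbitrary, the observed ΔF is itself just another draw from the null distribution, which is exactly the situation the test is designed to flag.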

FIGURE 2.

Conceptual figure illustrating a permutation approach that can be used to quantify high‐grading bias. In each permutation, the population IDs are randomly shuffled (without replacement), F ST recalculated, high‐F ST loci chosen, and F‐statistics calculated. The empirically observed change in clustering due to high‐grading (ΔF) is compared to a null distribution of ΔF from the 1000 permutations. A large ΔF represents a large observed increase in within‐group clustering when the highest‐F ST loci are used, and a smaller or negative ΔF represents less of an increase or a decrease in clustering when the highest‐F ST loci are used. Comparing ΔF values between empirical datasets and those where population IDs are permuted thousands of times allows for the detection of high‐grading bias.

We then used an empirical cumulative distribution function in R to calculate a one‐sided p‐value for the probability that ΔF was higher with our empirical, non‐permuted population IDs than expected given the null ΔF distribution derived from the permutations. For this test, the null hypothesis posits that, on average, any increase in clustering driven by using the high‐F ST loci in an empirical dataset is no greater than the increase in clustering driven by selecting high‐F ST loci with randomly permuted population IDs. Rejecting the null hypothesis therefore means that the increase in clustering in the empirical dataset exceeds that generated with permuted population IDs, and thus that high‐grading bias alone cannot explain any increase in population structure observed using ‘highly informative’, high‐F ST loci. To determine the degree to which our test results are consistent across different starting datasets, we then repeated this entire process (from simulation to p‐value) 100 times for each model (panmictic, high gene‐flow, and high gene‐flow with local adaptation). To determine the effect of linkage disequilibrium (LD) filtering on permutation testing for high‐grading bias, we also conducted 10 runs for each of our models after filtering out loci with high LD before permutation using PLINK (Purcell et al. 2007) with 100 kb sliding windows and an LD cutoff of 0.8.
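The p‐value itself is simply the proportion of permuted ΔF values at least as large as the observed value. A minimal sketch, using a stand‐in null distribution and adding the conventional '+1' correction so the p‐value can never be exactly zero (the correction is our addition, not stated in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
null_dF = rng.normal(0.0, 1.0, 1000)  # stand-in for 1000 permuted ΔF values
observed_dF = 2.5                     # stand-in for the empirical ΔF

# One-sided permutation p-value: fraction of permuted ΔF values at least as
# large as the observed ΔF, with a +1 correction in numerator and denominator.
p = (np.sum(null_dF >= observed_dF) + 1) / (len(null_dF) + 1)
print(round(p, 4))
```

With 1000 permutations the smallest attainable p‐value is 1/1001, so reporting more extreme significance would require more permutations.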

As an alternative to choosing the top 5% of loci, we also used statistical outlier tests to discover divergent F ST loci. This approach should, in theory, avoid choosing loci which do not have a higher F ST than expected by chance alone, and thus avoid high‐grading bias. To test this idea, we ran the outlier‐detection programs pcadapt (Luu et al. 2017) with k = 2 and LD clumping with a size of 200 and a threshold of 0.1 and outFLANK (Whitlock and Lotterhos 2015a) with the default settings for the simulated ‘panmictic’, ‘high gene flow’, and ‘high gene flow with local adaptation’ datasets. We filtered the simulated datasets to a minimum minor allele frequency of 0.05 before running either program. We then plotted PCAs as above for both the full SNP set and any outliers detected.

2.5. Assessing High‐Grading Bias Management Implications in Pink Salmon

To provide an example where high‐grading bias could have substantial management implications, we used the R package we developed for assessing high‐grading bias, PCAssess, to determine if the clustering increase observed from using highly informative markers in five sub‐populations of pink salmon from Prince William Sound, Alaska (Hemstrom et al. in prep) could be explained by chance alone. To prevent possible biases from poorly sequenced genotypes (Hemstrom et al. 2024), we first removed individuals sequenced at less than 75% of loci and loci sequenced in less than 75% of individuals, and then removed any loci with a minor allele frequency less than 0.05 in any sub‐population using snpR. To avoid large computational costs and to simulate a lower‐cost sequencing approach, we selected 10,000 random remaining SNPs for analysis. We used PCAssess to randomly permute sub‐population IDs, take the top 5% of loci by global F ST (Weir and Cockerham 1984), construct PCAs, and determine changes in F‐statistics 100 times, as described above. The code for this case study provides a usage example of PCAssess and is available in Notebook S1.

3. Results

3.1. Effects of High‐Grading FST on Population Structure Inference

Using our monarch butterfly dataset, we found a strong and consistent pattern of erroneous population structure detection in an unstructured (panmictic) population, from which individuals were randomly assigned a priori to four biologically meaningless groups, when using a high‐F ST subset of loci (n = 3176 loci). Specifically, PCA revealed clearly delineated clusters for the unstructured population of monarchs (Figure 1D). Similarly, the Evanno method (Evanno et al. 2005) with STRUCTURE supported k = 4 (Figure 1E), and RUBIAS assigned individuals to the four ‘populations’ with perfect confidence when using the subset of ascertained high‐F ST loci (Figure 1F). These trends were visible whenever any method was run using the high‐F ST dataset vs. the full set of loci and were not specific to PCA, STRUCTURE, or RUBIAS (in fact, RUBIAS clearly warns users to be cognizant of high‐grading bias, Moran and Anderson 2019). In contrast, and as expected, no population structure was detected when using all loci (n = 63,514 loci) with any method (PCA, STRUCTURE, RUBIAS, Figure 1; t‐SNE, UMAP, and sNMF, Supporting Information). To reiterate, in this panmictic population of monarch butterflies, there is no real population structure or local adaptation that correlates with the randomly assigned population IDs used here. Thus, the population structure identified using the high‐F ST subset of loci is not only different from what is visible using all loci, but also misleading.

FIGURE 1.

High‐grading bias in clustering algorithms and population assignment using empirical data from a single, panmictic population of monarch butterflies (i.e., no real population structure). We randomly assigned individuals to one of four artificially created populations and detected erroneous structure when the top 5% of SNPs with the highest F ST values were used. Panels (A–C) illustrate population structure analyses using all the SNPs found in this population (63,514 loci), and panels (D–F) illustrate population structure analyses using the high‐F ST dataset, where only the top 5% of SNPs with the highest F ST were used. Panels (A) and (D) illustrate results using PCA, panels (B) and (E) illustrate results using STRUCTURE with k = 4, 20,000 burn‐in and 100,000 MCMC iterations, and panels (C) and (F) illustrate results using the ‘self_assign’ function in the Bayesian assignment R package Rubias with the default parameters.

High‐grading bias also occurred in gene expression data. We randomly assigned two pseudo‐treatment groups to a pre‐existing RNA‐seq dataset from a single population of Kellet's whelks ( K. kelletii ) that did not undergo any experimental treatment (Lee et al. 2024). Using the top 1000 most divergent DEGs (~0.6% of 167,051 expressed genes), artificial structure between the two randomly assigned pseudo‐treatments was evident in most of the 100 randomly seeded runs (Figure S2A,B).

3.2. Permutation Tests Detect Bias to Improve Genomic Analyses

Our permutation test determines if the increase in structure observed with high‐F ST SNPs is driven by high‐grading bias (Figure 2). The test randomly permutes population IDs among individuals, identifies high‐F ST markers, and runs a clustering approach, repeating this procedure to generate a null distribution for the change in population structure due to high‐grading alone. This null distribution is then compared to the empirically observed change in structure generated by choosing high‐F ST SNPs, to test whether there is any change in population‐level structuring beyond that expected by chance (Figure 2).

We tested this approach with three independently simulated datasets and found that the test was able to distinguish between scenarios where no increase in structure should be observed and scenarios where local adaptation was driving increased biological divergence at a few loci (Figure 3). Specifically, we could not reject the null hypothesis of high‐grading bias for either the ‘panmictic’ or ‘high gene flow’ scenarios, but could detect a higher‐than‐expected increase in population differentiation for the ‘high gene flow with local adaptation’ scenario when using high‐F ST loci (Figure 3A–C), consistent with a higher‐than‐expected degree of population structuring present in a small subset of strongly selected loci in the latter dataset. These results were generally consistent across replicate runs (n = 60–100, variable due to time‐outs) for each scenario (Figure S3) and when using LD filtering (Figure S4). Note that for the ‘high gene flow’ scenario, some very minor population structure was still detected with all loci, as expected, but no greater‐than‐expected increase in structure using high‐F ST loci was observed (Figure 3B,E,H). This is also expected given that no biological processes caused higher F ST at specific loci under this model, and thus some loci had higher F ST than the genome‐wide average purely by chance. Note also that simulations for the ‘high gene flow’ model often resulted in very few individuals surviving in the populations with more extreme selection, since populations were unable to adapt in this scenario.

FIGURE 3.

Statistical detection of erroneous clustering due to high‐grading bias when high‐F ST loci are used in PCAs, using simulated data for three commonly observed types of population structure: panmictic (without selection), high gene flow (without selection), and high gene flow with local adaptation (with selection). Panels (A–C) show PCAs constructed from all SNPs, and panels (D–F) show PCAs constructed from the top 5% of high‐F ST SNPs. Note that all three scenarios show population structure in the latter case, even when none should exist (as in the panmictic scenario). Panels (G–I) depict the observed increase in clustering (ΔF) between the PCAs for all and top‐5% SNPs (vertical red dashed line) alongside the expected null distribution of ΔF (solid black line) derived from permutations. Higher ΔF means a greater increase in clustering with high‐F ST loci (Figure 2). High ΔF in the observed data relative to the null distribution implies that the null hypothesis, that clustering increases no more than expected by chance alone when taking high‐F ST loci, can be rejected, as correctly seen only in the high gene flow with local adaptation scenario.

3.3. Using Outlier Tests to Identify Loci of Interest Avoids High‐Grading Bias

We also tested two statistical outlier detection approaches, pcadapt and OutFLANK (Luu et al. 2017; Whitlock and Lotterhos 2015), to determine if using these methods to identify loci of interest avoids high‐grading bias. Using the same three simulated datasets (‘panmictic’, ‘high gene flow’, and ‘high gene flow with local adaptation’), we found that both pcadapt and OutFLANK correctly identified no outlier loci, and thus avoided generating erroneous population structure estimates, in the ‘panmictic’ scenario (Figure 4D,G). However, pcadapt falsely identified outlier loci in the ‘high gene flow’ scenario (Figure 4E) and identified too many non‐selected loci in the ‘high gene flow with local adaptation’ scenario, and thus incorrectly showed no evidence of local adaptation‐driven population structure there (Figure 4F). In contrast, OutFLANK correctly identified no outliers in the ‘high gene flow’ scenario (Figure 4H) and correctly identified outlier‐specific structure correlated with selection direction and strength in the ‘high gene flow with local adaptation’ scenario (Figure 4I).

FIGURE 4.

Reduction in high‐grading bias in population structure analyses when using statistically identified outlier SNPs. We used simulated data for three types of population structure: panmictic (without selection), high gene flow (without selection), and high gene flow with local adaptation (with selection). Panels (A–C) illustrate PCAs using all SNPs, panels (D–F) use only outlier SNPs found by pcadapt, and panels (G–I) illustrate PCAs when only OutFLANK outliers were used. There were no shared SNPs between OutFLANK and pcadapt; plots with no outlier loci identified are marked. Note that for the bottom‐right plot, the small number of outliers detected by OutFLANK caused many points to be plotted atop one another in the two‐dimensional PCA, making interpretation challenging; the distribution of PC1 and PC2 scores per population is clearer and thus shown instead.

We likewise used DESeq2 (Love et al. 2014) to identify outliers between our randomly assigned treatment groups in our gene expression dataset. DESeq2 incorrectly identified outlier DEGs in 80 out of 100 runs, although in most cases it identified only a few incorrect outliers (52 runs identified 1–3 DEGs). In the 41 runs with sufficient numbers of DEGs to create PCAs, we calculated F‐statistics as above (Figure 2, Figure S2). PCAs using these outliers nonetheless showed far less treatment structure, and thus less high‐grading bias, than those constructed using the top 1000 DEGs by log2 fold change directly (Figure S2C,D).

3.4. Assessing High‐Grading Bias Management Implications in Pink Salmon

We found no evidence that the observed increase in clustering in Prince William Sound pink salmon PCAs using the top 5% of loci by global F ST was greater than expected by chance alone (p‐value from permutation tests on F‐statistics = 0.57; Figure 5). As such, we would not recommend that management decisions be based on PCA patterns observed with ‘highly informative’ loci alone in this instance.

FIGURE 5.

Management implications of high‐grading bias in five sub‐populations of pink salmon ( Oncorhynchus gorbuscha ). We used our R package PCAssess to plot PCAs using all available SNPs (left) and the top 5% of SNPs with the highest F ST values among the sub‐populations (right). PCAssess automates permutation testing to detect high‐grading bias (bottom); here, the chosen subset of loci does not provide a statistically significant, biologically relevant increase in population structure, and thus the null hypothesis of high‐grading bias cannot be rejected (p = 0.57).

4. Discussion

Our results show that intentionally choosing the most differentiated loci can cause severe overestimation of population structure whenever those markers are used for subsequent assessments or to make biological conclusions. High‐grading bias can occur in completely panmictic populations (Figure 1), with only neutral loci (no directional selection on any loci, Figure 3), and impacts all methods for detecting population structure we tested (Figure 2 and Figure S2), including PCAs, STRUCTURE, and assignment testing (as in Anderson 2010). Choosing ‘informative markers’ from a dataset with minimal obvious structure and pre‐defined subdivisions (i.e., inferred a priori subpopulations) and re‐using those markers to search for ‘hidden’ or ‘subtle’ population structure in that same dataset is therefore generally not recommended without careful consideration and testing for high‐grading bias. Typically, these loci will have little to no predictive power when used in an independent dataset (e.g., applying high‐graded loci from a pilot dataset to a larger dataset). However, if the initial dataset is used to inform management decisions, it could result in mismanagement based on a presumed ‘population structure’ that does not exist.

Crucially, high‐grading bias is not limited to F ST or SNP datasets. High‐grading bias is a concern whenever a small subset of many variables is chosen to explain differences among groups based on their degree of differentiation and is then used to re‐estimate the degree of difference between those groups. Consequently, high‐grading bias is a widespread concern well beyond population genomics. For example, we showed that high‐grading bias can arise in gene expression data when the most divergent genes are chosen to visualise treatment effects (Figure S2), a practice that is performed regularly (Roux et al. 2023; Salis et al. 2022).

In genomics, studies in high‐gene‐flow, low‐structure systems that select top loci based on F ST cut‐offs, quantiles, or thresholds are particularly susceptible to high‐grading bias (Jansson et al. 2023; Kaiser et al. 2021; Martinez et al. 2017; Lehnert et al. 2019; Barr et al. 2023; Weist et al. 2022), which may have severe downstream implications for management and conservation. For example, conservation unit delineation based on overestimated population structure could squander valuable conservation funding and time protecting ‘populations’ defined by flawed assumptions of local adaptation (Figure 5). Additionally, strong but flawed population assignments based on ‘highly informative’ markers chosen from one dataset (e.g., SNP‐chips, GT‐seq) could lead to highly confident but poor assignments of new samples (Anderson 2010). Panel development, wherein a reduced set of markers is selected from a large set of sequenced loci to reduce costs for future assignment tests on new individuals or for similar purposes, may be particularly prone to this problem, since high‐grading bias may result in an uninformative panel when applied to new samples (Anderson 2010). Using other biological knowledge (such as prior information on adaptive loci) rather than F ST alone may be helpful in such cases; truly informative, biologically meaningful loci should cross‐evaluate well and pass high‐grading tests. However, additional simulations and research are still needed to fully understand the effects of high‐grading bias during panel selection on population delineation.

It is important to note, however, that choosing and sub‐setting informative loci is not necessarily problematic: in some cases, ‘highly informative’ loci may reflect real, underlying population sub‐structure that extends beyond that expected by chance alone. This can occur, for example, when populations are subjected to selection at a small subset of loci and are thus locally adapted (Figure 3C). Detecting such instances is therefore important; outlier detection methods such as OutFLANK (Whitlock and Lotterhos 2015) are one excellent option. Outlier tests are designed to detect loci that are more divergent than one would expect by chance alone and thus account for the sampling error that causes high‐grading bias in the first place (Figure 4 and Figure S2). While outlier tests are not without their problems (Bierne et al. 2013; Fourcade et al. 2013), authors who use them, and in particular those who look for correspondence between two or more methods of outlier detection, should be able to minimise high‐grading bias (Kess et al. 2018; Koot et al. 2021; Milano et al. 2014; Samad‐zada et al. 2023; Shen et al. 2019; Silliman 2019; Vu et al. 2020). Nevertheless, more sensitive methods that can account for greater biological complexity are needed.
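As a rough sketch of the idea behind such tests (not the actual OutFLANK model, which fits a trimmed chi‐square likelihood to the F ST distribution; here we simply scale per‐locus F ST by a trimmed mean and compare to a chi‐square with one degree of freedom for two groups), consider simulated panmictic data with a handful of planted, genuinely divergent loci:

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n, n_loci, n_adaptive = 50, 2000, 10

# Neutral loci: identical allele frequencies in both groups (sampling noise only).
p = rng.uniform(0.2, 0.8, n_loci)
pa_true, pb_true = p.copy(), p.copy()
# Plant a few genuinely divergent ('locally adapted') loci at the start.
pa_true[:n_adaptive], pb_true[:n_adaptive] = 0.3, 0.7

ga = rng.binomial(2, pa_true, size=(n, n_loci))
gb = rng.binomial(2, pb_true, size=(n, n_loci))

pa, pb = ga.mean(0) / 2, gb.mean(0) / 2
pbar = (pa + pb) / 2
with np.errstate(invalid="ignore", divide="ignore"):
    fst = np.nan_to_num(((pa - pbar) ** 2 + (pb - pbar) ** 2)
                        / (2 * pbar * (1 - pbar)))

# OutFLANK-style idea: estimate the neutral mean Fst from the trimmed
# distribution, then treat fst / neutral_mean as ~ chi-square (df = 1 for two
# groups) to get per-locus p-values instead of an arbitrary Fst cut-off.
lo, hi = np.quantile(fst, [0.05, 0.95])
neutral_mean = fst[(fst >= lo) & (fst <= hi)].mean()
stat = fst / neutral_mean
pvals = np.array([math.erfc(math.sqrt(s / 2)) for s in stat])  # chi2(1) tail

outliers = pvals < 0.05 / n_loci  # Bonferroni cut-off, for illustration only
print(f"loci flagged as outliers: {outliers.sum()} (planted: {n_adaptive})")
```

Unlike a fixed top‐5% cut‐off, which flags 100 loci here no matter what, the statistical test flags roughly the planted loci and little else, because the null distribution absorbs the chance divergence that drives high‐grading bias.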

Alternatively, permutation tests can be used to determine whether a chosen subset of loci increases population structure more than expected by chance alone (Figures 3 and 5). It should be noted, however, that the permutation tests we propose here are specific to subtle population structure in otherwise unstructured systems. They can perform erratically in highly structured systems, where random population permutations will always show markedly less population structure than the original data and thus never reproduce the observed degree of clustering change, positive or negative (yielding p = 0 or p = 1 regardless of the number of permutations, depending on whether clustering increases or decreases with the ‘informative’ loci). In such systems, however, high‐grading bias will often not be an issue, given that most loci will already be ‘highly informative’ and selecting a particularly informative subset is less useful. Regardless, cross‐evaluation can be used in a similar way to evaluate candidate loci: markers that are more differentiated by chance alone should have no predictive power in novel data, even from the same populations, since spurious, ‘by chance’ correlations will not hold in new samples (Anderson 2010). For example, Lee et al. (2024) used statistically significant DEGs and DEG‐derived SNPs to find population structure, which they later validated via cross‐evaluation using an independent set of samples.
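A hypothetical sketch of such a permutation test (our simplification, not the PCAssess implementation: we use the mean F ST of the top 5% of loci as the clustering statistic and, crucially, re‐select the ‘top’ loci within every permutation so the null distribution includes the selection step itself):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: one panmictic population split into two arbitrary groups of 30.
n_ind, n_loci = 60, 2000
p = rng.uniform(0.1, 0.9, n_loci)
geno = rng.binomial(2, p, size=(n_ind, n_loci))
labels = np.array([0] * 30 + [1] * 30)

def top_fst_mean(geno, labels, q=0.95):
    """Mean per-locus Fst of the top (1 - q) quantile: the statistic that
    high-grading inflates. Loci are RE-selected for every label set."""
    a, b = geno[labels == 0], geno[labels == 1]
    pa, pb = a.mean(axis=0) / 2, b.mean(axis=0) / 2
    pbar = (pa + pb) / 2
    with np.errstate(invalid="ignore", divide="ignore"):
        fst = np.nan_to_num(((pa - pbar) ** 2 + (pb - pbar) ** 2)
                            / (2 * pbar * (1 - pbar)))
    return fst[fst >= np.quantile(fst, q)].mean()

observed = top_fst_mean(geno, labels)

# Shuffle the population labels many times and recompute the statistic; the
# p-value is the fraction of permutations at least as extreme as observed.
n_perm = 200
null = np.array([top_fst_mean(geno, rng.permutation(labels))
                 for _ in range(n_perm)])
p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
print(f"permutation p-value: {p_value:.3f}")  # typically non-significant here
```

Because the data are panmictic, the original split is itself just one random partition, so the observed top‐F ST mean falls squarely within the permuted null; a significant p‐value would instead indicate structure beyond the selection artefact.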

In general, for researchers searching for subtle population structure or signals of local adaptation in high‐gene‐flow systems, we recommend using statistically based outlier tests in place of arbitrary F ST cut‐offs, or using permutation tests to determine whether subtle population structure is statistically and biologically relevant (Figure 5). Alternatively, researchers may cross‐validate using traditional hold‐out approaches (in which loci are chosen after removing test samples; see Anderson 2010) or use novel datasets to evaluate highly differentiated markers (Lee et al. 2024). Regardless, researchers should understand how their data type, sequencing, and filtering choices affect their inferences, including high‐grading bias (Hemstrom et al. 2024; Kardos and Waples 2024). The choice of data filtering parameters (e.g., HWE, MAF, and LD filtering) could also affect the severity of high‐grading bias, although whenever loci are chosen from a random distribution of F ST (or other statistics), the bias is still likely to apply.
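The hold‐out idea can be sketched as follows (our illustration, not Anderson's original procedure; a simple maximum‐likelihood assignment on simulated panmictic data, with loci chosen only from the training samples):

```python
import numpy as np

rng = np.random.default_rng(7)

# One panmictic population arbitrarily split into two 'populations' of 60.
n_per, n_loci = 60, 3000
p = rng.uniform(0.1, 0.9, n_loci)
pops = [rng.binomial(2, p, size=(n_per, n_loci)) for _ in range(2)]

# Hold out 20 individuals per group BEFORE choosing loci (as in Anderson 2010).
train = [g[:40] for g in pops]
test = [g[40:] for g in pops]

# 'High-grade' the top 5% of loci by Fst using the training samples only.
pa, pb = train[0].mean(0) / 2, train[1].mean(0) / 2
pbar = (pa + pb) / 2
with np.errstate(invalid="ignore", divide="ignore"):
    fst = np.nan_to_num(((pa - pbar) ** 2 + (pb - pbar) ** 2)
                        / (2 * pbar * (1 - pbar)))
keep = fst >= np.quantile(fst, 0.95)

def loglik(ind, freqs):
    """Binomial log-likelihood of one genotype vector given allele frequencies."""
    f = np.clip(freqs, 1e-6, 1 - 1e-6)
    return np.sum(ind * np.log(f) + (2 - ind) * np.log(1 - f))

def panel_accuracy(samples, ref_freqs, keep):
    """Fraction of individuals assigned to their 'true' group by max likelihood."""
    hits, total = 0, 0
    for true_pop, g in enumerate(samples):
        for ind in g:
            scores = [loglik(ind[keep], ref_freqs[k][keep]) for k in range(2)]
            hits += int(np.argmax(scores) == true_pop)
            total += 1
    return hits / total

ref = [t.mean(0) / 2 for t in train]
resub = panel_accuracy(train, ref, keep)    # loci chosen ON these samples
holdout = panel_accuracy(test, ref, keep)   # loci chosen WITHOUT these samples
print(f"resubstitution accuracy: {resub:.2f}")
print(f"hold-out accuracy:       {holdout:.2f}")
```

Resubstitution accuracy is near perfect because the panel was selected to separate those very individuals, while hold‐out accuracy collapses to roughly chance, which is exactly the upward bias in selected‐subset assignment power that Anderson (2010) describes.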

In summary, our results show that high‐grading bias can occur in population genetic studies when high‐F ST SNPs are used to detect subtle population structure. We show via simulations that high‐grading bias is a particular issue in high‐gene‐flow systems (where population differentiation is low; Figure 3), that it is not limited to genotypic data (Figure S2), and that it must be assessed before proposing management decisions based on highly differentiated loci (Figure 5). To aid researchers in assessing the impact of high‐grading bias on population structure assessments, we provide the R package PCAssess, which implements and automates the permutation tests for high‐grading bias in PCAs described here. The package is available from https://github.com/hemstrow/PCAssess and will soon be on CRAN. A workflow example, which we used to produce Figure 5, is available in Notebook S1. By detecting and minimising the effects of high‐grading bias, we can advance our understanding of population structure and connectivity and improve the conservation and management of species, particularly in high‐gene‐flow systems.

Author Contributions

W.H. and A.L. conceived and designed the study, performed the analyses, and wrote and edited the manuscript. M.R.C., N.M., and G.L. helped to refine the study design, provided feedback on visualisation and presentation, and edited the MS.

Disclosure

Benefits Sharing: This study provides methodology and addresses issues intended to help population geneticists and improve analytical approaches for the broader scientific field. All collaborators are included as co‐authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

supporting information.

MEN-25-e70011-s001.zip (2.5MB, zip)

Acknowledgements

We thank K.E. Lotterhos, K. Drotos, E. Anderson, K. Ruegg, and C. Bossu for helpful comments and discussions that greatly improved this manuscript. A. Lee is supported, in part, by the Purdue University Ross‐Lynn Fellowship. This work was supported by the National Science Foundation grant number OCE‐1924505.

Handling Editor: Frederic Austerlitz

Funding: This work was supported by Directorate for Biological Sciences (OCE‐1924505).

Andy Lee and William Hemstrom contributed equally to this work.

Contributor Information

Andy Lee, Email: lee3617@purdue.edu.

William Hemstrom, Email: hemstrow@gmail.com.

Mark R. Christie, Email: christ99@purdue.edu.

Data Availability Statement

All code and data used to produce this manuscript are available at https://github.com/ChristieLab/high_grading_bias. The PCAssess package is available at https://github.com/hemstrow/PCAssess. The root simulated genome used in this study for the ‘high gene flow’ and ‘high gene flow with local adaptation’ can be accessed on Dryad (https://doi.org/10.5061/dryad.c2fqz61p8). The genomes used for the ‘panmictic’ scenario are generated automatically using the scripts linked above.

References

  1. Ali, O. A. , O'Rourke S. M., Amish S. J., et al. 2016. “Rad Capture (Rapture): Flexible and Efficient Sequence‐Based Genotyping.” Genetics 202, no. 2: 389–400. 10.1534/genetics.115.183665. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Anderson, E. C. 2010. “Assessing the Power of Informative Subsets of Loci for Population Assignment: Standard Methods Are Upwardly Biased.” Molecular Ecology Resources 10, no. 4: 701–710. 10.1111/j.1755-0998.2010.02846.x. [DOI] [PubMed] [Google Scholar]
  3. Andrews, K. R. , Good J. M., Miller M. R., Luikart G., and Hohenlohe P. A.. 2016. “Harnessing the Power of RADseq for Ecological and Evolutionary Genomics.” Nature Reviews Genetics 17, no. 2: 81–92. 10.1038/nrg.2015.28. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Banks, M. , Eichert W., and Olsen J.. 2003. “Which Genetic Loci Have Greater Population Assignment Power?” Bioinformatics (Oxford, England) 19: 1436–1438. 10.1093/bioinformatics/btg172. [DOI] [PubMed] [Google Scholar]
  5. Barr, K. , Bossu C. M., Bay R. A., et al. 2023. “Genetic and Environmental Drivers of Migratory Behavior in Western Burrowing Owls and Implications for Conservation and Management.” Evolutionary Applications 16, no. 12: 1889–1900. 10.1111/eva.13600. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Bartlett, M. S. 1939. “A Note on Tests of Significance in Multivariate Analysis.” Mathematical Proceedings of the Cambridge Philosophical Society 35, no. 2: 180–185. [Google Scholar]
  7. Benjamini, Y. , and Hochberg Y.. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B: Methodological 57, no. 1: 289–300. 10.1111/j.2517-6161.1995.tb. [DOI] [Google Scholar]
  8. Bierne, N. , Roze D., and Welch J. J.. 2013. “Pervasive Selection or Is It…? Why Are FST Outliers Sometimes So Frequent?” Molecular Ecology 22, no. 8: 2061–2064. 10.1111/mec.12241. [DOI] [PubMed] [Google Scholar]
  9. Campbell, N. R. , Harmon S. A., and Narum S. R.. 2015. “Genotyping‐In‐Thousands by Sequencing (GT‐Seq): A Cost Effective SNP Genotyping Method Based on Custom Amplicon Sequencing.” Molecular Ecology Resources 15, no. 4: 855–867. 10.1111/1755-0998.12357. [DOI] [PubMed] [Google Scholar]
  10. Carvey, Q. B. , Pavey S. A., Diamond A. W., et al. 2024. “Genetic Structure of Atlantic Puffins ( Fratercula arctica ) Breeding in Atlantic Canada.” Conservation Genetics 25: 1159–1174. 10.1007/s10592-024-01629-3. [DOI] [Google Scholar]
  11. Chi, M. , Plaza A., Benediktsson J. A., Sun Z., Shen J., and Zhu Y.. 2016. “Big Data for Remote Sensing: Challenges and Opportunities.” Proceedings of the IEEE 104, no. 11: 2207–2219. 10.1109/JPROC.2016.2598228. [DOI] [Google Scholar]
  12. Evanno, G. , Regnaut S., and Goudet J.. 2005. “Detecting the Number of Clusters of Individuals Using the Software Structure: A Simulation Study.” Molecular Ecology 14, no. 8: 2611–2620. 10.1111/j.1365-294X.2005.02553.x. [DOI] [PubMed] [Google Scholar]
  13. Francis, R. M. 2017. “pophelper: An R Package and Web App to Analyse and Visualize Population Structure.” Molecular Ecology Resources 17, no. 1: 27–32. 10.1111/1755-0998.12509. [DOI] [PubMed] [Google Scholar]
  14. Fourcade, Y. , Chaput‐Bardy A., Secondi J., Fleurant C., and Lemaire C.. 2013. “Is Local Selection So Widespread in River Organisms? Fractal Geometry of River Networks Leads to High Bias in Outlier Detection.” Molecular Ecology 22, no. 8: 2065–2073. 10.1111/mec.12158. [DOI] [PubMed] [Google Scholar]
  15. Fuentes‐Pardo, A. P. , Stanley R., Bourne C., et al. 2024. “Adaptation to Seasonal Reproduction and Environment‐Associated Factors Drive Temporal and Spatial Differentiation in Northwest Atlantic Herring Despite Gene Flow.” Evolutionary Applications 17, no. 3: e13675. 10.1111/eva.13675. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Glover, K. A. , Pertoldi C., Besnier F., Wennevik V., Kent M., and Skaala Ø.. 2013. “Atlantic Salmon Populations Invaded by Farmed Escapees: Quantifying Genetic Introgression With a Bayesian Approach and SNPs.” BMC Genetics 14, no. 1: 74. 10.1186/1471-2156-14-74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Guyon, I. , and Elisseeff A.. 2003. “An Introduction to Variable and Feature Selection.” Journal of Machine Learning Research 3: 1157–1182. [Google Scholar]
  18. Han, F. , Jamsandekar M., Pettersson M. E., et al. 2020. “Ecological Adaptation in Atlantic Herring Is Associated With Large Shifts in Allele Frequencies at Hundreds of Loci.” eLife 9: e61076. 10.7554/eLife.61076. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Hemstrom, W. , Grummer J. A., Luikart G., and Christie M. R.. 2024. “Next‐Generation Data Filtering in the Genomics Era.” Nature Reviews Genetics 25, no. 11: 750–767. 10.1038/s41576-024-00738-6. [DOI] [PubMed] [Google Scholar]
  20. Hemstrom, W. , and Jones M.. 2023. “snpR: User Friendly Population Genomics for SNP Data Sets With Categorical Metadata.” Molecular Ecology Resources 23, no. 4: 962–973. 10.1111/1755-0998.13721. [DOI] [PubMed] [Google Scholar]
  21. Hemstrom, W. B. , Freedman M. G., Zalucki M. P., Ramírez S. R., and Miller M. R.. 2022. “Population Genetics of a Recent Range Expansion and Subsequent Loss of Migration in Monarch Butterflies.” Molecular Ecology 31, no. 17: 4544–4557. 10.1111/mec.16592. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Hemstrom, W. , Gruenthal K., Shedd K., et al. (in prep). “Run‐Timing Variation in Highly Divergent Lineages of Pink Salmon is Strongly Influenced by Deeply Conserved Variation at the Gene lrrc9.”
  23. Jakobsson, M. , and Rosenberg N. A.. 2007. “CLUMPP: A Cluster Matching and Permutation Program for Dealing With Label Switching and Multimodality in Analysis of Population Structure.” Bioinformatics 23, no. 14: 1801–1806. 10.1093/bioinformatics/btm233. [DOI] [PubMed] [Google Scholar]
  24. Jansson, E. , Faust E., Bekkevold D., et al. 2023. “Global, Regional, and Cryptic Population Structure in a High Gene‐Flow Transatlantic Fish.” PLoS One 18, no. 3: e0283351. 10.1371/journal.pone.0283351. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Janes, J. K. , Miller J. M., Dupuis J. R., et al. 2017. “The K = 2 Conundrum.” Molecular Ecology 26, no. 14: 3594–3602. 10.1111/mec.14187. [DOI] [PubMed] [Google Scholar]
  26. Kaiser, T. S. , von Haeseler A., Tessmar‐Raible K., and Heckel D. G.. 2021. “Timing Strains of the Marine Insect Clunio marinus Diverged and Persist With Gene Flow.” Molecular Ecology 30, no. 5: 1264–1280. 10.1111/mec.15791. [DOI] [PubMed] [Google Scholar]
  27. Kardos, M. , and Waples R. S.. 2024. “Low‐Coverage Sequencing and Wahlund Effect Severely Bias Estimates of Inbreeding, Heterozygosity and Effective Population Size in North American Wolves.” Molecular Ecology: e17415. 10.1111/mec.17415. [DOI] [PubMed] [Google Scholar]
  28. Karlsson, S. , Moen T., Lien S., Glover K. A., and Hindar K.. 2011. “Generic Genetic Differences Between Farmed and Wild Atlantic Salmon Identified From a 7K SNP‐Chip.” Molecular Ecology Resources 11, no. s1: 247–253. 10.1111/j.1755-0998.2010.02959.x. [DOI] [PubMed] [Google Scholar]
  29. Kess, T. , Galindo J., and Boulding E. G.. 2018. “Genomic Divergence Between Spanish Littorina saxatilis Ecotypes Unravels Limited Admixture and Extensive Parallelism Associated With Population History.” Ecology and Evolution 8, no. 16: 8311–8327. 10.1002/ece3.4304. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Koot, E. , Wu C., Ruza I., et al. 2021. “Genome‐Wide Analysis Reveals the Genetic Stock Structure of Hoki ( Macruronus novaezelandiae ).” Evolutionary Applications 14, no. 12: 2848–2863. 10.1111/eva.13317. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Lee, A. , Daniels B. N., Hemstrom W., et al. 2024. “Genetic Adaptation Despite High Gene Flow in a Range‐Expanding Population.” Molecular Ecology: e17511. 10.1111/mec.17511. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Lehnert, S. J. , DiBacco C., Van Wyngaarden M., et al. 2019. “Fine‐Scale Temperature‐Associated Genetic Structure Between Inshore and Offshore Populations of Sea Scallop ( Placopecten magellanicus ).” Heredity 122, no. 1: 69–80. 10.1038/s41437-018-0087-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Li, J. , Hong D., Gao L., et al. 2022. “Deep Learning in Multimodal Remote Sensing Data Fusion: A Comprehensive Review.” International Journal of Applied Earth Observation and Geoinformation 112: 102926. 10.1016/j.jag.2022.102926. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Li, M. , Li C., and Guan W.. 2008. “Evaluation of Coverage Variation of SNP Chips for Genome‐Wide Association Studies.” European Journal of Human Genetics 16, no. 5: 635–643. 10.1038/sj.ejhg.5202007. [DOI] [PubMed] [Google Scholar]
  35. Linck, E. , and Battey C. J.. 2019. “Minor Allele Frequency Thresholds Strongly Affect Population Structure Inference With Genomic Data Sets.” Molecular Ecology Resources 19, no. 3: 639–647. 10.1111/1755-0998.12995. [DOI] [PubMed] [Google Scholar]
  36. Love, M. I. , Huber W., and Anders S.. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA‐Seq Data With DESeq2.” Genome Biology 15, no. 12: 550. 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Luu, K. , Bazin E., and Blum M. G. B.. 2017. “Pcadapt: An R Package to Perform Genome Scans for Selection Based on Principal Component Analysis.” Molecular Ecology Resources 17, no. 1: 67–77. 10.1111/1755-0998.12592. [DOI] [PubMed] [Google Scholar]
  38. Manel, S. , Gaggiotti O. E., and Waples R. S.. 2005. “Assignment Methods: Matching Biological Questions With Appropriate Techniques.” Trends in Ecology & Evolution 20, no. 3: 136–142. 10.1016/j.tree.2004.12.004. [DOI] [PubMed] [Google Scholar]
  39. Martinez, E. , Buonaccorsi V., Hyde J. R., and Aguilar A.. 2017. “Population Genomics Reveals High Gene Flow in Grass Rockfish ( Sebastes rastrelliger ).” Marine Genomics 33: 57–63. 10.1016/j.margen.2017.01.004. [DOI] [PubMed] [Google Scholar]
  40. Milano, I. , Babbucci M., Cariani A., et al. 2014. “Outlier SNP Markers Reveal Fine‐Scale Genetic Structuring Across European Hake Populations (Merluccius merluccius).” Molecular Ecology 23, no. 1: 118–135. 10.1111/mec.12568. [DOI] [PubMed] [Google Scholar]
  41. Moran, B. M. , and Anderson E. C.. 2019. “Bayesian Inference From the Conditional Genetic Stock Identification Model.” Canadian Journal of Fisheries and Aquatic Sciences 76, no. 4: 551–560. 10.1139/cjfas-2018-0016. [DOI] [Google Scholar]
  42. Mosteller, F. , and Tukey J. W.. 1977. “Data Analysis and Regression. A Second Course in Statistics.” In Addison‐Wesley Series in Behavioral Science: Quantitative Methods. Addison‐Wesley Publishing Company. https://ui.adsabs.harvard.edu/abs/1977dars.book. [Google Scholar]
  43. Patterson, N. , Price A. L., and Reich D.. 2006. “Population Structure and Eigenanalysis.” PLoS Genetics 2, no. 12: e190. 10.1371/journal.pgen.0020190. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Price, A. L. , Patterson N. J., Plenge R. M., Weinblatt M. E., Shadick N. A., and Reich D.. 2006. “Principal Components Analysis Corrects for Stratification in Genome‐Wide Association Studies.” Nature Genetics 38, no. 8: 904–909. 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]
  45. Pillai, K. S. 1955. “Some New Test Criteria in Multivariate Analysis.” Annals of Mathematical Statistics 26, no. 1: 117–121. [Google Scholar]
  46. Pritchard, J. K. , Stephens M., and Donnelly P.. 2000. “Inference of Population Structure Using Multilocus Genotype Data.” Genetics 155, no. 2: 945–959. 10.1093/genetics/155.2.945. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Purcell, S. , Neale B., Todd‐Brown K., et al. 2007. “PLINK: A Tool Set for Whole‐Genome Association and Population‐Based Linkage Analyses.” American Journal of Human Genetics 81, no. 3: 559–575. 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. R Core Team . 2022. R: A Language and Environment for Statistical Computing [Computer Software]. R Foundation for Statistical Computing. [Google Scholar]
  49. Roux, N. , Miura S., Dussenne M., et al. 2023. “The Multi‐Level Regulation of Clownfish Metamorphosis by Thyroid Hormones.” Cell Reports 42, no. 7: 112661. 10.1016/j.celrep.2023.112661. [DOI] [PubMed] [Google Scholar]
  50. Salis, P. , Peyran C., Morage T., et al. 2022. “RNA‐Seq Comparative Study Reveals Molecular Effectors Linked to the Resistance of Pinna Nobilis to Haplosporidium Pinnae Parasite.” Scientific Reports 12, no. 1: 21229. 10.1038/s41598-022-25555-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Samad‐zada, F. , Kelemen E. P., and Rehan S. M.. 2023. “The Impact of Geography and Climate on the Population Structure and Local Adaptation in a Wild Bee.” Evolutionary Applications 16, no. 6: 1154–1168. 10.1111/eva.13558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Schweizer, R. M. , Saarman N., Ramstad K. M., et al. 2021. “Big Data in Conservation Genomics: Boosting Skills, Hedging Bets, and Staying Current in the Field.” Journal of Heredity 112, no. 4: 313–327. 10.1093/jhered/esab019. [DOI] [PubMed] [Google Scholar]
  53. Shen, Y. , Wang L., Fu J., Xu X., Yue G. H., and Li J.. 2019. “Population Structure, Demographic History and Local Adaptation of the Grass Carp.” BMC Genomics 20, no. 1: 467. 10.1186/s12864-019-5872-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Silliman, K. 2019. “Population Structure, Genetic Connectivity, and Adaptation in the Olympia Oyster ( Ostrea lurida ) Along the West Coast of North America.” Evolutionary Applications 12, no. 5: 923–939. 10.1111/eva.12766. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Silva, D. D. , Sierla S., Alahakoon D., Osipov E., Yu X., and Vyatkin V.. 2020. “Toward Intelligent Industrial Informatics: A Review of Current Developments and Future Directions of Artificial Intelligence in Industrial Applications.” IEEE Industrial Electronics Magazine 14, no. 2: 57–72. 10.1109/MIE.2019.2952165. [DOI] [Google Scholar]
  56. Staab, P. R. , Zhu S., Metzler D., and Lunter G.. 2015. “Scrm: Efficiently Simulating Long Sequences Using the Approximated Coalescent With Recombination.” Bioinformatics 31, no. 10: 1680–1682. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Tosser‐Klopp, G. , Bardou P., Bouchez O., et al. 2014. “Design and Characterization of a 52K SNP Chip for Goats.” PLoS One 9, no. 1: e86227. 10.1371/journal.pone.0086227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Tvedebrink, T. 2022. “Review of the Forensic Applicability of Biostatistical Methods for Inferring Ancestry From Autosomal Genetic Markers.” Genes 13, no. 1: 141. 10.3390/genes13010141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Vu, N. T. T. , Zenger K. R., Guppy J. L., et al. 2020. “Fine‐Scale Population Structure and Evidence for Local Adaptation in Australian Giant Black Tiger Shrimp ( Penaeus monodon ) Using SNP Analysis.” BMC Genomics 21, no. 1: 669. 10.1186/s12864-020-07084-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Waples, R. S. 2010. “High‐Grading Bias: Subtle Problems With Assessing Power of Selected Subsets of Loci for Population Assignment.” Molecular Ecology 19, no. 13: 2599–2601. 10.1111/j.1365-294X.2010.04675.x. [DOI] [PubMed] [Google Scholar]
  61. Weir, B. S. , and Cockerham C. C.. 1984. “Estimating F‐Statistics for the Analysis of Population Structure.” Evolution 38, no. 6: 1358–1370. 10.1111/j.1558-5646.1984.tb05657.x. [DOI] [PubMed] [Google Scholar]
  62. Weist, P. , Jentoft S., Tørresen O. K., et al. 2022. “The Role of Genomic Signatures of Directional Selection and Demographic History in the Population Structure of a Marine Teleost With High Gene Flow.” Ecology and Evolution 12, no. 12: e9602. 10.1002/ece3.9602. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Whitaker, J. M. , Price L. E., Boase J. C., Bernatchez L., and Welsh A. B.. 2020. “Detecting Fine‐Scale Population Structure in the Age of Genomics: A Case Study of Lake Sturgeon in the Great Lakes.” Fisheries Research 230: 105646. 10.1016/j.fishres.2020.105646. [DOI] [Google Scholar]
  64. Whitlock, M. C. , and Lotterhos K. E.. 2015. “Reliable Detection of Loci Responsible for Local Adaptation: Inference of a Null Model Through Trimming the Distribution of FST.” American Naturalist 186, no. S1: S24–S36. 10.1086/682949. [DOI] [PubMed] [Google Scholar]
  65. Wyatt, J. C. , and Liu J. L. Y.. 2002. “Basic Concepts in Medical Informatics.” Journal of Epidemiology and Community Health 56, no. 11: 808–812. 10.1136/jech.56.11.808. [DOI] [PMC free article] [PubMed] [Google Scholar]



Articles from Molecular Ecology Resources are provided here courtesy of Wiley
