Abstract
Background
The inference of population structure in domestication studies is prone to biases whenever sampling is unbalanced and effective population sizes (Ne) differ across populations. Such biases can lead to the misclassification of large ancestral populations as admixed, particularly under single-origin domestication scenarios.
Results
We propose a novel parameterization strategy for the STRUCTURE software, combining the F model and alternative ancestry prior (along with a smaller initial ALPHA value), and simulations demonstrate that the strategy mitigates unbalanced sampling and unequal population size biases. We apply our strategy to the domestication history of the common walnut (Juglans regia), using whole-genome resequencing data from 298 individuals from across its range. The results support an origin of J. regia in South Asia, where walnut populations are characterized by high genetic diversity, extensive private allele content, low mutation load, and demographic stability. Building on this demographic framework, we further identify genomic regions under recent positive selection and candidate domestication genes involved in shell structure, pollen development, and lipid transport.
Conclusions
Our results clarify the long-standing debate on the geographic origin of walnut domestication and demonstrate that an optimized, model-aware use of STRUCTURE can substantially improve population-genetic inference in domestication studies and other systems characterized by complex demography.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13059-026-03959-6.
Keywords: Crop domestication, Effective population size, Parameter optimization, Population clustering, Sampling bias, STRUCTURE software
Background
Population-genetic clustering algorithms, such as those implemented in STRUCTURE [1] and ADMIXTURE [2], are widely used to characterize individuals and populations using genetic data. A key application of these tools is reconstructing the domestication history of crops. In single-domestication scenarios, domesticated species typically consist of a large ancestral or wild population at the center of origin and multiple smaller, geographically dispersed populations derived from this source [3]. Under such conditions, unequal effective population sizes (Ne) can confound ancestry inference, as individuals from large ancestral populations may show a mixture of ancestry from all the derived groups, rather than being recognized as a distinct group [4, 5]. This pattern arises because derived populations exhibit stronger drift due to founder effects or bottlenecks, quickly fixing alleles that were originally polymorphic in the ancestral gene pool. As a result, large source populations—retaining greater allelic diversity—tend to display ancestry profiles that resemble admixture [5].
Beyond these confounding effects of unequal effective population size (Ne), another challenge in population clustering arises from unbalanced sample sizes across populations. Simulation studies have demonstrated that clustering methods struggle with highly unbalanced sampling, placing underrepresented populations together even when they are not genetically close [6, 7]. Puechmaille [7] suggested subsampling as a potential solution to mitigate such imbalances. However, this strategy requires prior knowledge of population genetic structure, which is rarely available before conducting clustering analyses. In practice, uneven sampling is frequently unavoidable, as many biodiverse regions remain under-sampled due to geographic, political, and human-resource constraints. Together, unequal effective population size and unbalanced sample size introduce compounding biases that can severely affect the accuracy of population clustering inferences.
Among commonly used clustering algorithms, STRUCTURE remains the most popular due to its pioneering role and continuing refinement, offering high accuracy and flexibility [8, 9]. STRUCTURE supports a variety of prior models that allow users to account for demographic imbalance and uneven sampling across populations [10, 11], a level of user control that is limited or absent in other tools, such as ADMIXTURE. An evaluation of these prior models by Wang [11] showed that the alternative ancestry prior (POPALPHA = 1 and a smaller initial ALPHA value), which allows populations to be unequally represented in the sample, can mitigate the misclassification caused by unbalanced sampling under the default ancestry prior (POPALPHA = 0). He also observed that this improvement is further enhanced when the prior is used in conjunction with the uncorrelated allele frequency model. Another mitigation approach, the correlated allele frequency model (or F model) [10], can theoretically handle differences in Ne among populations, though this capability is rarely appreciated. Importantly, these two strategies are rarely implemented together in practice, and each addresses only a single source of bias. This presents a challenge for crop domestication research, where both sampling imbalances and unequal population sizes are presumably common. To our knowledge, no prior study has systematically addressed both sources of bias in STRUCTURE-based analyses.
These methodological challenges in population-genetic clustering are particularly relevant when studying domesticated species. An example is the common walnut (Juglans regia), which has a large geographic distribution that includes several botanically underexplored regions, suggesting the presence of sampling biases in addition to likely unequal population sizes. Walnuts have been widely gathered and traded across Asia since the Late Neolithic, as evidenced by nut remains found in an Armenian grave (~ 6200 years ago) [12], walnut shells from Kashmir (~ 4700–4000 years ago) [13], and shells from a former market site in Pakistan (~ 3200 years ago) [14]. Due to its extinction in much of Europe and parts of Asia during the Pleistocene glaciations—and botanical under-collecting in Central, South, and West Asia—the geographic origin of this crop has remained elusive. Competing hypotheses place its origin in the Irano-Anatolian region of West Asia [15, 16], the Tianshan Mountains of Central Asia [17–19], or a broader region encompassing Afghanistan, Bangladesh, Bhutan, India, Nepal, and Pakistan in South Asia [20–23].
Here, we revisit the origin of the common walnut using a STRUCTURE-based approach, complemented by additional lines of genetic evidence—including chloroplast haplotypes, bottleneck signatures, inbreeding levels, and genetic load—leveraging an expanded sampling dataset. Crucially, we propose a new parameterization strategy for STRUCTURE that integrates the F model, an alternative ancestry prior, and a smaller initial ALPHA value, aiming to simultaneously address the effects of both unequal effective population size and sample size imbalance. Following a clear definition of source and derived populations, we conduct genome-wide comparisons of each derived population against the source and identify, for the first time, genes under positive selection linked to walnut domestication.
Results
Genetic structure and phylogenetic relationship inferred from the nuclear genome
Our nuclear dataset comprises 14,950 independent non-coding single nucleotide polymorphisms (SNPs) derived from 298 Juglans regia individuals (Additional file 1: Table S1). We investigated population clustering using two model configurations in STRUCTURE: ParamSet1, which implemented an alternative ancestry prior, a small ALPHA value, and the correlated allele frequency model (F model); and ParamSet2, which applied the same ancestry prior and ALPHA but employed the uncorrelated allele frequency model as advised by Wang [11]. Under ParamSet1, clustering analysis identified K = 6 as the optimal number of populations based on the parsimony estimator [24] (Fig. 1A; Additional file 1: Table S2). Individuals were assigned to six genetically distinct groups (q > 0.75): East Asia (73 individuals), Central Asia (48), Europe (26), West Asia (51), Tibet (40), and South Asia (16). The emergence of the South Asian group at K = 6 was a key finding, as these individuals exhibited an admixed ancestry profile across the five clusters (East Asia, Central Asia, Europe, West Asia, and Tibet) observed at K = 5 (Fig. 1A). In contrast, under ParamSet2, clustering analysis identified K = 5 as the optimal number of populations (Additional file 1: Table S3), supporting five distinct groups—East Asia (74), Central Asia (48), Europe (27), West Asia (36), and Tibet (40)—again with South Asian individuals showing admixed ancestry across multiple clusters. Increasing to K = 6 did not result in the resolution of South Asia as an independent cluster (Additional file 2: Fig. S1).
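The q > 0.75 assignment rule used above can be illustrated with a minimal sketch. The function name, the group labels, and the example membership vectors are hypothetical, and the strict inequality at the threshold is an assumption based on the wording "q > 0.75".

```python
# Minimal sketch of threshold-based group assignment from STRUCTURE q-values.
# Names and example vectors are illustrative, not taken from the study's pipeline.

def assign_group(q_row, labels, threshold=0.75):
    """Return the label of the cluster with the largest membership
    coefficient if it exceeds the threshold, otherwise 'admixed'."""
    best = max(range(len(q_row)), key=lambda i: q_row[i])
    return labels[best] if q_row[best] > threshold else "admixed"


labels = ["EA", "CA", "EUR", "WA", "TIB", "SA"]
print(assign_group([0.02, 0.03, 0.01, 0.04, 0.05, 0.85], labels))  # SA
print(assign_group([0.30, 0.25, 0.05, 0.20, 0.10, 0.10], labels))  # admixed
```

Individuals that fall below the threshold in every cluster are reported as admixed, which corresponds to the black category in Fig. 1A.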
Fig. 1.
Population structure and phylogenetic analysis. A Genetic structure of Juglans regia inferred using STRUCTURE v. 2.3.4 [1] with the ParamSet1 parameter setting: alternative ancestry prior and an ALPHA value of 0.25 with the correlated allele frequency model (F model). The y-axis represents the q-value, and the x-axis shows each individual. Color coding: Europe (EUR, blue, q > 0.75), West Asia (WA, yellow, q > 0.75), Central Asia (CA, orange, q > 0.75), East Asia (EA, pink, q > 0.75), Tibet (TIB, green, q > 0.75), South Asia (SA, purple), and black (indicating admixture groups). B A phylogeny of 298 unrelated individuals inferred as a Neighbor-Joining (NJ) tree and rooted on J. mandshurica and J. nigra. Branch colors correspond to STRUCTURE clusters, where colored branches indicate individuals with q > 0.75. Bootstrap values ≥ 0.9 are shown as a red dot at nodes. C A principal component analysis (PCA) of 298 individuals of Juglans regia. D STRUCTURE clusters obtained from real data under two model settings (ParamSet1: POPALPHA = 1, ALPHA = 0.25 with the F model (the correlated allele frequency model); ParamSet2: POPALPHA = 1, ALPHA = 0.25 with the uncorrelated allele frequency model) for three populations (107 individuals including 16 South Asian, 51 West Asian, and 40 Tibetan) at K = 2 and K = 3. E Schematic representation of the models used to simulate SNP data with Fastsimcoal2. PopA, popB, and popC represent simulated populations. F STRUCTURE clusters obtained from simulated data and analyzed with the two STRUCTURE parameter settings for popA, popB, and popC (with sample sizes and SNP quantities matching the real dataset) at K = 2 and K = 3
Under the F model implemented in STRUCTURE [10], the population-specific parameter Fk quantifies the magnitude of genetic drift experienced by each population relative to the inferred ancestral allele frequencies since divergence, with larger values indicating stronger drift typically associated with smaller effective population sizes. Given similar divergence times, stronger drift leads to greater genetic differentiation, whereas reduced drift is associated with greater retention of ancestral genetic variation.
The estimated Fk values under ParamSet1 rank genetic drift from highest to lowest as follows: East Asia (0.4964) > Europe (0.3917) > Tibet (0.2738) > Central Asia (0.2322) > West Asia (0.1647) > South Asia (0.0565). This pattern indicates that South Asia has experienced the weakest genetic drift and maintained a relatively large long-term effective population size. Although all extant populations are expected to be equidistant from the ancestral population in terms of divergence time, the reduced drift observed in South Asia suggests greater retention of ancestral genetic variation. Accordingly, we treat South Asia as the source population for subsequent analyses, without implying a direct ancestral–descendant relationship.
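To make the link between Fk and drift concrete, the sketch below draws population allele frequencies around a fixed ancestral frequency for each reported Fk value. It assumes the biallelic special case of the F model, in which the Dirichlet prior of Falush et al. [10] reduces to a Beta distribution with parameters pA(1 − Fk)/Fk and (1 − pA)(1 − Fk)/Fk, so that the variance around the ancestral frequency is pA(1 − pA)Fk. The ancestral frequency of 0.3, the number of loci, and the function names are illustrative choices and not part of the study.

```python
# Sketch: how F_k controls the spread of population allele frequencies around
# an ancestral frequency (biallelic case of the F model, assumed here).
import random
from statistics import pvariance

# F_k estimates reported under ParamSet1.
FK = {
    "East Asia": 0.4964, "Europe": 0.3917, "Tibet": 0.2738,
    "Central Asia": 0.2322, "West Asia": 0.1647, "South Asia": 0.0565,
}


def drifted_frequencies(p_anc, f_k, n_loci, rng):
    """Draw n_loci population frequencies given ancestral frequency p_anc."""
    scale = (1.0 - f_k) / f_k
    return [rng.betavariate(p_anc * scale, (1.0 - p_anc) * scale)
            for _ in range(n_loci)]


def drift_variance(p_anc, f_k, n_loci=20000, seed=1):
    """Empirical variance of the drifted frequencies; expected p(1-p)F_k."""
    rng = random.Random(seed)
    return pvariance(drifted_frequencies(p_anc, f_k, n_loci, rng))


for pop, f_k in FK.items():
    expected = 0.3 * 0.7 * f_k
    print(f"{pop:13s} F_k={f_k:.4f} expected var={expected:.4f} "
          f"simulated var={drift_variance(0.3, f_k):.4f}")
```

Under these assumptions, the population with the smallest Fk (South Asia) stays closest to the ancestral frequencies, which is the sense in which it retains more ancestral variation.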
The NJ tree based on 11,803 independent SNPs (298 J. regia and two outgroups) revealed a clear population structure (Fig. 1B): Tibet formed a distinct clade; Europe was sister to West Asia; and Central Asia was sister to East Asia, which formed a single clade. In contrast to the optimal clustering of STRUCTURE under ParamSet1 (Fig. 1A), where South Asia was identified as a distinct group, South Asian individuals fell in two positions in the tree: some were placed close to the deeper nodes of the topology, consistent with South Asia retaining higher levels of ancestral genetic variation, while others clustered with the Central and East Asian clade, presumably due to the detected gene flow from East Asia into South Asia (see below).
Consistent with the optimal clustering of STRUCTURE under ParamSet1 (Fig. 1A), PCA also clearly separated individuals into six geographic groups (Fig. 1C).
Testing the effects of STRUCTURE parameter settings with real and simulated data
To evaluate how the two STRUCTURE parameter settings (see above) influence the accuracy of ancestral population inference under a domestication scenario, we selected the South Asian and the two other populations (West Asia and Tibet) to represent the possible source and derived lineages in empirical datasets. Under ParamSet1, the optimal K was 3, revealing distinct genetic clusters for each population. In contrast, ParamSet2 yielded an optimal K of 2, with West Asia and Tibet forming separate clusters, while the South Asian population exhibited admixture from both (Fig. 1D).
To facilitate direct comparison, we designed a three-population simulation under a domestication scenario, modeling South Asia, Tibet, and West Asia (Fig. 1E). The demographic histories (model history settings followed the demographic scenarios described below), sample size, and SNP count were matched to the empirical dataset. The simulation results closely paralleled the empirical findings: under ParamSet1, three distinct clusters were identified (Fig. 1F), whereas under ParamSet2, two clusters were observed, with the source population showing genetic admixture from both bottlenecked populations (Fig. 1F). These findings further support South Asia as the candidate source population. They also suggest that ParamSet1, which combines the alternative ancestry prior (POPALPHA = 1), a smaller ALPHA value, and the correlated allele frequency model (F model), may be more suitable for capturing the fine-scale genetic structure within these populations than ParamSet2, the parameter combination proposed by Wang [11].
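The two parameter settings compared above differ in a single switch. The sketch below renders them as extraparams-style lines and reports which keys differ. The keyword spellings (FREQSCORR, POPALPHAS, INFERALPHA, ALPHA) are assumed to follow the STRUCTURE 2.3 extraparams file, with POPALPHAS corresponding to the POPALPHA setting named in the text; INFERALPHA = 1 is an assumption implied by the phrase "initial ALPHA value"; and ALPHA = 0.25 is the value given in the legend of Fig. 1. The helper names are hypothetical.

```python
# Sketch: the two prior configurations as extraparams-style "#define" lines.
# Keyword spellings and INFERALPHA = 1 are assumptions (see lead-in).

PARAM_SETS = {
    # Alternative ancestry prior + small initial ALPHA + correlated frequencies (F model)
    "ParamSet1": {"FREQSCORR": 1, "POPALPHAS": 1, "INFERALPHA": 1, "ALPHA": 0.25},
    # Same ancestry prior and ALPHA, but uncorrelated allele frequencies
    "ParamSet2": {"FREQSCORR": 0, "POPALPHAS": 1, "INFERALPHA": 1, "ALPHA": 0.25},
}


def render_extraparams(name):
    """Format one parameter set as '#define KEY VALUE' lines."""
    return "\n".join(f"#define {key} {value}"
                     for key, value in PARAM_SETS[name].items())


def differing_keys(name_a, name_b):
    """List the keys whose values differ between two parameter sets."""
    a, b = PARAM_SETS[name_a], PARAM_SETS[name_b]
    return sorted(key for key in a if a[key] != b.get(key))


print(render_extraparams("ParamSet1"))
print("Differs from ParamSet2 in:", differing_keys("ParamSet1", "ParamSet2"))
```

Keeping the two settings in one table makes it explicit that the comparison isolates the allele frequency model while holding the ancestry prior and ALPHA fixed.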
Genetic diversity and differentiation among the six geographic groups
Accepting six geographic groups (East Asia, Central Asia, Europe, West Asia, Tibet, South Asia) as best matching our nuclear data, we next calculated standard measures of genetic diversity across the six groups: linkage disequilibrium (LD) decay, nucleotide diversity (π), heterozygosity, and the proportion of private SNPs. Genome-wide LD analysis revealed substantial variation in r² decay, with South Asia showing the fastest decay among all STRUCTURE groups, followed by West Asia, Central Asia, Tibet, Europe, and East Asia (Fig. 2A). Among the six groups, South Asia exhibited the highest nucleotide diversity, followed by West Asia, Tibet, Central Asia, Europe, and East Asia (Fig. 2B). This pattern was also reflected in heterozygosity estimates, with South Asia displaying the highest values, followed by West Asia, Tibet, Central Asia, Europe, and East Asia (Fig. 2C). Since private allele counts are influenced by sample size, we controlled for this effect by randomly selecting 16 individuals per group (the size of the smallest group with q > 0.75) and repeating the analysis 20 times. The adjusted estimates showed the highest private SNP proportion in South Asia (21.4%–22.7%), followed by East Asia (8.6%–20.2%), West Asia (9.0%–14.3%), Tibet (8.2%–11.2%), Europe (5.9%–7.3%), and Central Asia (5.3%–6.5%) (Fig. 2D).
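The subsampling correction for private SNPs can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: a SNP is treated as private when its alternate allele is carried by the subsample of exactly one group, the proportion is taken relative to all SNPs carried by at least one subsample, and the genotype encoding (alternate-allele counts per individual) and function names are hypothetical.

```python
# Sketch: private-SNP proportion with equal-size subsampling and replication.
# Definitions of "private" and of the denominator are assumptions (see lead-in).
import random


def private_snp_proportion(genotypes, group, rng, n_sub=16):
    """genotypes maps group -> list of individuals, each a list of
    alternate-allele counts per SNP. Returns the share of carried SNPs
    whose alternate allele occurs only in `group` after subsampling."""
    sub = {g: rng.sample(inds, n_sub) for g, inds in genotypes.items()}
    n_snps = len(sub[group][0])
    private = carried = 0
    for j in range(n_snps):
        carriers = [g for g, inds in sub.items()
                    if any(ind[j] > 0 for ind in inds)]
        if carriers:
            carried += 1
            if carriers == [group]:
                private += 1
    return private / carried if carried else 0.0


def replicate_range(genotypes, group, n_rep=20, n_sub=16, seed=0):
    """Minimum and maximum proportion over n_rep random subsamples."""
    rng = random.Random(seed)
    values = [private_snp_proportion(genotypes, group, rng, n_sub)
              for _ in range(n_rep)]
    return min(values), max(values)


# Toy data with two groups of two individuals and three SNPs.
toy = {"A": [[1, 0, 1], [2, 0, 0]], "B": [[0, 1, 1], [0, 1, 0]]}
print(replicate_range(toy, "A", n_rep=5, n_sub=2))
```

Reporting the range over replicates, as in Fig. 2D, shows how sensitive the estimate is to which individuals are drawn.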
Fig. 2.
Genetic diversity and differentiation analysis of Juglans regia across its geographic range. A Linkage disequilibrium (LD) decay. B Nucleotide diversity (π). C Heterozygosity. D Proportions of private SNPs. E Matrix of relative (FST) and absolute (DXY) divergence for pairwise group comparisons (Upper triangle: DXY; Lower triangle: FST). F Population relationship and migration among populations inferred by OrientAGraph, which incorporates an exhaustive search for Maximum Likelihood Network Orientation (MLNO) into TreeMix. Allele frequency estimates derived from STRUCTURE-defined gene pools corresponding to Central Asia, East Asia, Europe, Tibet, West Asia, and South Asia were used for the analysis. The orange arrow indicates an inferred migration event from the source (here East Asia) to the recipient population (South Asia). For panels B, C, and D, the boxplots indicate the minimum (the lower hinge), maximum (the upper hinge), and median (the middle hinge). p-values were derived using t-tests comparing each group to the South Asia group. Significance levels are indicated as “****” p < 0.0001, “***” p < 0.001, “**” p < 0.01, “*” p < 0.05, “ns” p > 0.05
DXY values ranged from 0.140 to 0.165, with the South Asia group showing the highest genetic divergence from other populations (0.161–0.165). FST values ranged from 0.074 to 0.306, with the highest between East Asia and Europe (0.306) and the lowest between West Asia and South Asia (0.074) (Fig. 2E).
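For readers less familiar with the two divergence measures, the sketch below computes them from per-SNP allele frequencies in two populations. DXY is taken as the mean probability that two alleles drawn from different populations differ. For FST, a simple heterozygosity-based (Nei-style) ratio of sums is used purely for illustration; the estimator actually used for Fig. 2E is not stated in this section, so the choice here is an assumption, as are the function names.

```python
# Sketch: absolute (DXY) and relative (FST) divergence from allele frequencies.
# The FST estimator is an illustrative choice, not necessarily the study's.

def mean_dxy(p1, p2):
    """Mean per-SNP probability that alleles from the two populations differ."""
    return sum(a * (1 - b) + b * (1 - a) for a, b in zip(p1, p2)) / len(p1)


def fst_nei(p1, p2):
    """(H_T - H_S) / H_T with heterozygosities summed over SNPs."""
    h_s = sum((2 * a * (1 - a) + 2 * b * (1 - b)) / 2 for a, b in zip(p1, p2))
    h_t = sum(2 * ((a + b) / 2) * (1 - (a + b) / 2) for a, b in zip(p1, p2))
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


# Hypothetical frequencies at four SNPs in two populations.
pop_x = [0.10, 0.50, 0.90, 0.30]
pop_y = [0.20, 0.40, 0.60, 0.30]
print(mean_dxy(pop_x, pop_y), fst_nei(pop_x, pop_y))
```

The contrast between the two measures matters for the results above: a diverse population can show high DXY to every other group while still having low FST, because FST is scaled by within-population diversity.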
Using OrientAGraph [25], we analyzed population relationships and migration patterns among six geographic groups based on allele frequency data from STRUCTURE-defined gene pools. This analysis showed South Asia as the deepest split in the topology, consistent with reduced genetic drift and long-term retention of ancestral allele frequencies. East Asia clustered with Central Asia, and Europe with West Asia. Among the migration events tested (m = 0–6), a single migration event (m = 1) best explained the sample covariance, with gene flow primarily observed from the East Asia group into the South Asia group (Fig. 2F). This inferred migration direction also explains why several South Asian samples cluster with Central and East Asian lineages in the NJ tree (Fig. 1B).
Inbreeding and deleterious mutation load across the six geographic groups
The FROH values, representing the proportion of the genome within runs of homozygosity (ROH), varied significantly across the six geographic groups. East Asia exhibited the highest FROH, followed by Europe, Tibet, Central Asia, West Asia, and South Asia (Fig. 3A). Significance testing confirmed that South Asia had significantly lower FROH values compared to all other groups.
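As a worked illustration of the FROH definition above, the following sketch sums ROH segment lengths for one individual and divides by the analyzed genome length. The segment coordinates, the genome length, and the optional minimum segment length are hypothetical inputs; the thresholds used for ROH calling in the study are not restated here.

```python
# Sketch: F_ROH as the fraction of the genome covered by runs of homozygosity.
# All numeric inputs in the example are hypothetical.

def froh(roh_segments, genome_length_bp, min_length_bp=0):
    """roh_segments: iterable of (start, end) in bp for one individual."""
    covered = sum(end - start for start, end in roh_segments
                  if end - start >= min_length_bp)
    return covered / genome_length_bp


segments = [(0, 1_000_000), (5_000_000, 7_000_000)]
print(froh(segments, genome_length_bp=10_000_000))  # 0.3
```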
Fig. 3.
Inbreeding, mutational load, and loss-of-function (LoF) variants were analyzed in six groups of Juglans regia. A Inbreeding coefficients were estimated as the average proportion of the genome in runs of homozygosity (FROH). B The ratios of derived deleterious variants to synonymous variants were calculated for each individual. C The ratios of derived LoF variants to synonymous variants were calculated per individual. All the boxplots indicate the minimum (the lower hinge), maximum (the upper hinge), and median (the middle hinge). p-values were derived using t-tests comparing each group to the South Asia group. Significance levels are indicated as “***” p < 0.001, “**” p < 0.01, “*” p < 0.05, “ns” p > 0.05
To compare patterns of mutation load across the six geographic groups, we categorized coding-sequence variants into three functional classes based on predicted effects: synonymous, deleterious, and loss-of-function (LoF). Ancestral and derived alleles for each variant were polarized using Juglans mandshurica and Juglans nigra as outgroups. Among the six groups, South Asia exhibited the lowest ratio of total derived deleterious variants to synonymous variants, followed by Central Asia, West Asia, Tibet, Europe, and East Asia (Fig. 3B). Similarly, South Asia showed the lowest ratio of total derived loss-of-function (LoF) variants to synonymous variants, with the remaining regions ranked in the same ascending order (Fig. 3C).
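The per-individual load ratios can be sketched as below. The input format (one record per coding variant with its functional class and the individual's derived-allele count) and the choice to count derived alleles rather than variant sites are assumptions made for illustration; the class labels are hypothetical strings.

```python
# Sketch: ratios of derived deleterious and LoF alleles to derived synonymous
# alleles for one individual. Counting alleles (not sites) is an assumption.
from collections import Counter


def load_ratios(variants):
    """variants: iterable of (effect_class, derived_allele_count) with
    effect_class in {'synonymous', 'deleterious', 'LoF'}."""
    totals = Counter()
    for effect, n_derived in variants:
        totals[effect] += n_derived
    syn = totals["synonymous"]
    if syn == 0:
        raise ValueError("no derived synonymous alleles to normalize by")
    return {"deleterious": totals["deleterious"] / syn,
            "LoF": totals["LoF"] / syn}


individual = [("synonymous", 2), ("synonymous", 1), ("deleterious", 1),
              ("LoF", 0), ("deleterious", 0), ("LoF", 1)]
print(load_ratios(individual))
```

Normalizing by synonymous variants places individuals from groups with different overall diversity on a comparable scale.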
Demographic history and inference of bottlenecks
We inferred the demographic history of the six geographic groups by setting the maximum recombination rate to 0.05 in the software GONE [26] and excluding inversion regions with frequencies between 0.15 and 0.85 and lengths greater than 10 Mb, as suggested by Novo et al. [27]. Both the East Asian and European groups underwent severe reductions in effective population size (Ne), reaching minima of approximately 60, and the East Asian group exhibited a larger Ne than the other groups prior to its decline. In contrast, the Central Asian and West Asian groups experienced comparatively moderate bottlenecks, with Ne declining to around 110. The Tibetan group showed a rapid and severe population contraction, while the South Asian group exhibited no evidence of a pronounced decline and maintained a stable Ne, a pattern consistent with expectations for a proposed domestication source region of Juglans regia (Fig. 4A).
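The inversion filter described above can be expressed as a short sketch: an inversion is excluded when its frequency lies between 0.15 and 0.85 and its length exceeds 10 Mb, and SNPs inside excluded regions are dropped before the demographic inference. The tuple layout, the chromosome names, and the inclusive treatment of the frequency bounds are assumptions for illustration.

```python
# Sketch: selecting inversion regions to mask before demographic inference.
# Input layout and inclusive frequency bounds are assumptions.

def excluded_regions(inversions, min_freq=0.15, max_freq=0.85,
                     min_len_bp=10_000_000):
    """inversions: iterable of (chrom, start, end, frequency)."""
    return [(chrom, start, end)
            for chrom, start, end, freq in inversions
            if min_freq <= freq <= max_freq and (end - start) > min_len_bp]


def keep_snp(chrom, pos, excluded):
    """True if the SNP lies outside every excluded region."""
    return not any(c == chrom and s <= pos <= e for c, s, e in excluded)


inversions = [("chr1", 0, 12_000_000, 0.50),   # large, intermediate frequency
              ("chr2", 0, 12_000_000, 0.05),   # large but rare
              ("chr3", 0, 5_000_000, 0.50)]    # intermediate but short
masked = excluded_regions(inversions)
print(masked)
print(keep_snp("chr1", 100, masked), keep_snp("chr2", 100, masked))
```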
Fig. 4.
Population demographic history of Juglans regia. A Demographic history of six populations—South Asia, Tibet, West Asia, Europe, Central Asia, and East Asia—inferred using GONE after excluding inversion regions with frequencies ranging from 0.15 to 0.85 and lengths exceeding 10 megabases (Mb). Colored lines represent distinct populations: South Asia (purple, popA), Tibet (green, popB), West Asia (yellow, popC), Europe (blue, popD), Central Asia (orange, popE), and East Asia (pink, popF). B Schematic of the demographic model used for simulating SNP data in Fastsimcoal2, mirroring the historical dynamics of the six populations (popA–popF) analyzed in panel A. C Each subpanel (C1–C6) corresponds to a simulated population (popA–popF), showing temporal changes in Ne inferred from SNP datasets generated under the model in panel B
To investigate differences in bottleneck histories among the six geographic populations and to assess whether the East Asian group’s higher inferred Ne prior to decline might be an artifact of severe bottlenecks, we simulated several demographic scenarios: Population A maintained a constant size with no bottlenecks (mimicking South Asia); Population B underwent two bottlenecks (mimicking Tibet); Populations C and E each experienced a single bottleneck (mimicking West Asia and Central Asia, respectively); and Populations D and F each experienced two bottlenecks (mimicking Europe and East Asia, respectively) (Fig. 4B). The simulated Ne trajectories largely matched the empirical results obtained with GONE. Notably, Population F, which experienced the most severe bottlenecks, displayed a much larger inferred historical Ne than the other populations, recapitulating the pattern observed in the East Asian group (Fig. 4C).
Phylogenetic relationships inferred from chloroplast genomes
Chloroplast haplotype diversity can point to the domestication center of a species, and we therefore also analyzed chloroplast genomic data. We reconstructed a minimum of 160,537 base pairs of the chloroplast genome per sample, identifying 106 substitutions and defining 12 haplotypes. The addition of seven chloroplast genomes from the Western Himalayan region (Afghanistan (1), India (3), Nepal (1), Pakistan (2)) generated by Yan et al. [22] resulted in 19 haplotypes (Hap 1–19; see Additional file 1: Table S4). Europe (47 samples) and West Asia (Iran, Iraq, Armenia; 81 samples) predominantly harbor haplotypes 10 and 17 and lack any unique regional haplotypes. Central Asia (Kazakhstan, Tajikistan, Xinjiang; 39 samples) has two haplotypes (Hap 10 and 17); Tibet (51 samples) has four haplotypes (Hap 7, 8, 10, and 17), with haplotype 7 being region-specific; East Asia (China, Korea, Japan; 111 samples) has six haplotypes (Hap 8, 10, 15, 17, 18, and 19), three of them endemic (Hap 15, 18, and 19); and South Asia (Afghanistan, Pakistan, India, Nepal; 31 samples) harbors the highest diversity with 14 haplotypes (Hap 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, and 16), 12 of them region-specific (Hap 1, 2, 3, 4, 5, 6, 9, 11, 12, 13, 14, and 16). Geographically, haplotypes 10 and 17 are widely distributed in the Northern Hemisphere, while haplotype 8 is found in samples from South Asia, Tibet, and a single sample from Qinghai in East Asia (Fig. 5A).
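The region-specific haplotype counts reported above follow directly from set differences on the per-region haplotype lists. The sketch below uses the haplotype numbers given in the text; as a simplification, Europe and West Asia are entered with haplotypes 10 and 17 only, since the text describes these as predominant and reports no unique haplotypes there.

```python
# Sketch: deriving region-specific chloroplast haplotypes by set difference.
# Haplotype lists are taken from the text; Europe and West Asia are simplified.

HAPLOTYPES = {
    "Europe": {10, 17},
    "West Asia": {10, 17},
    "Central Asia": {10, 17},
    "Tibet": {7, 8, 10, 17},
    "East Asia": {8, 10, 15, 17, 18, 19},
    "South Asia": {1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 16},
}


def region_specific(haplotypes):
    """For each region, haplotypes found in no other region."""
    result = {}
    for region, haps in haplotypes.items():
        others = set().union(*(h for r, h in haplotypes.items() if r != region))
        result[region] = haps - others
    return result


private = region_specific(HAPLOTYPES)
for region, haps in private.items():
    print(f"{region:12s} total={len(HAPLOTYPES[region]):2d} "
          f"region-specific={sorted(haps)}")
print("all haplotypes:", len(set().union(*HAPLOTYPES.values())))
```

Run on these lists, the calculation reproduces the 12 South Asian, three East Asian, and one Tibetan region-specific haplotypes, and a total of 19.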
Fig. 5.
Chloroplast haplotype distribution and phylogenetic relationships of Juglans regia. A Map showing the geographical distribution of 19 chloroplast haplotypes. For published studies, exact coordinates were used if available; otherwise, the capital’s coordinates or those extracted from maps were used. Colors and symbols in panel A are consistent with those in panels B and C, where haplotypes within the same clade share the same color scheme but differ in shape or color shade to distinguish individual haplotypes. B A rooted maximum likelihood (ML) tree: the red branch represents clade 1, the orange branch represents clade 2, and the blue branch represents clade 3. Bootstrap support values ≥ 90% are shown as a red dot at nodes. C A rooted Neighbor-Joining (NJ) tree: the red branch represents clade 1, the orange branch represents clade 2, and the blue branch represents clade 3. Bootstrap support values ≥ 0.9 are shown as a red dot at nodes
A maximum likelihood (ML) tree derived from whole-chloroplast genome sequences revealed a polytomy of three clades: the first contained four haplotypes (Hap 1, 2, 3, and 4) from South Asia, the second contained four haplotypes (Hap 5, 6, 7, and 8) from South Asia and Tibet, and the third contained 11 haplotypes from South Asia and other parts of the range (Hap 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, and 19) (Fig. 5B). Of the haplotypes in the third clade, haplotypes 9, 11, 12, 13, 14, and 16 were all from South Asia. A Neighbor-Joining (NJ) tree from the same data also shows a polytomy of three main clades (Fig. 5C).
Cross-population selection signatures and candidate genes under positive selection
As described above, we inferred walnut populations in South Asia as having retained more ancestral genetic variation and therefore used them as the reference to detect selective sweeps associated with domestication. Under this scenario, domestication genes would be shared among the five derived populations, showing consistent directional allele frequency changes. Given that walnut domestication likely occurred within the last ten thousand years (~ 200 generations), we applied the sensitive haplotype-based XP-EHH method [28], as implemented in selscan v.2.0.3 [29, 30], to detect signatures of positive selection associated with domestication.
After stringent filtering, we identified 45 genes under positive selection by intersecting significant SNPs showing consistent directional allele frequency changes across all derived populations relative to the source population, located within genes or in flanking regions (± 5 kb) (Methods, Additional file 1: Table S5). Walnut domestication mainly involves selection on traits such as nut size, shell thickness, kernel oil content, and flowering behavior [31]. Among 29 functionally annotated candidates, three genes stood out as prime domestication candidates: JreChr11G12281 encodes the LRR receptor-like kinase FEI2, involved in cell wall remodeling; JreChr01G11061 encodes a pectinesterase implicated in pollen tube growth and fruit softening; and JreChr06G11221 encodes the ABC transporter ABCG7, which participates in lipid transport and likely contributes to kernel composition. Notably, two non-synonymous SNPs in the uncharacterized gene JreChr03G10963 show relatively pronounced allele frequency shifts—from 0.5937 and 0.7187 in the source population to near fixation in all derived populations—highlighting the possibility that currently unannotated genes may also contribute to domestication (Additional file 1: Table S5).
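The intersection step described above can be sketched in two stages: first keep SNPs that are significant in every derived-versus-source comparison with the same direction of allele frequency change, then report genes whose body or ± 5 kb flanks contain at least one such SNP. The data structures, the population codes, and the gene identifiers in the example are hypothetical, and the significance thresholds are assumed to have been applied upstream.

```python
# Sketch: shared directional selection signals mapped to genes with 5 kb flanks.
# Inputs are hypothetical; significance filtering is assumed to be done upstream.

def shared_directional_snps(sig_by_pop):
    """sig_by_pop: {population: {(chrom, pos): direction}}, direction = +1 or -1.
    Returns SNPs significant in all populations with one consistent direction."""
    tables = list(sig_by_pop.values())
    shared = set(tables[0]).intersection(*tables[1:])
    return {snp for snp in shared if len({t[snp] for t in tables}) == 1}


def genes_near(snps, genes, flank=5000):
    """genes: {gene_id: (chrom, start, end)}. Returns sorted gene IDs with
    at least one SNP inside the gene or its flanking regions."""
    return sorted(
        gene_id for gene_id, (chrom, start, end) in genes.items()
        if any(c == chrom and start - flank <= pos <= end + flank
               for c, pos in snps)
    )


sig = {
    "EA":  {("chr1", 10_500): 1, ("chr1", 90_000): 1, ("chr2", 500): -1},
    "EUR": {("chr1", 10_500): 1, ("chr1", 90_000): -1},
    "TIB": {("chr1", 10_500): 1, ("chr1", 90_000): 1},
}
genes = {"geneA": (("chr1"), 12_000, 15_000), "geneB": ("chr1", 88_000, 91_000)}
print(genes_near(shared_directional_snps(sig), genes))  # ['geneA']
```

In the toy example, the SNP near geneB is significant everywhere but changes direction in one population, so only geneA is retained.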
Discussion
Addressing unbalanced sample sizes and unequal population sizes in STRUCTURE analyses
The use of clustering algorithms such as STRUCTURE in crop domestication studies is often challenged by two critical and distinct issues: unbalanced sample sizes and unequal effective population sizes (Ne) among populations. While the effects of unbalanced sampling have been extensively discussed [6, 7, 11], the impact of unequal Ne among constituent populations has received comparatively little attention. This oversight is particularly consequential in the context of crop domestication, where source-of-domestication populations often maintain large Ne, while derived populations frequently experience bottlenecks and founder effects. The F model (the correlated allele frequency model) of STRUCTURE introduced by Falush et al. [10] accounts for differences in Ne by allowing population-specific genetic drift (Fk) away from the ancestral allele frequencies. Although this model is now the default setting in STRUCTURE, it has rarely been explicitly employed in domestication studies for the purpose of addressing among-population differences in effective population size.
In contrast, biases introduced by unbalanced sample sizes have received more targeted methodological attention. Wang [11] proposed a strategy that combines an alternative ancestry prior, a small ALPHA value, and the uncorrelated allele frequency model to reduce misclassification caused by sampling imbalance. However, because the uncorrelated allele frequency model simply assumes allele frequencies across populations are independent, it is prone to erroneously classifying populations with low drift levels (or large Ne)—and thus preserved ancestral diversity—as admixed from populations that underwent stronger drift. Consequently, Wang’s strategy alone is insufficient under conditions where both sample size and population size are unbalanced, leading to the misclassification of the source-of-domestication population—such as the South Asian group in our study—as admixed with the derived populations (Additional file 2: Fig. S1).
To jointly address these two challenges, we propose an integrated parameterization strategy that combines Wang’s [11] ancestry prior and ALPHA adjustment with the F-model's correlated allele frequency framework, enabling accurate identification of the source population. This approach recovered South Asia as the domestication origin of Juglans regia (Fig. 1A, D-E; Additional file 2: Fig. S1). By simultaneously correcting for sampling imbalance and unequal Ne, our strategy offers a robust and broadly applicable solution in study systems where unbalanced sample sizes and unequal effective population size among populations co-occur.
Following Puechmaille [7] and Meirmans [32], we also performed a subsampling analysis as an additional test to mitigate sampling imbalances and found that our parameter configuration produced consistent and well-resolved clustering results even with subsampling (Additional file 2: Fig. S2; Additional file 1: Table S6; Additional file 1: Table S7).
Beyond STRUCTURE, other widely used clustering algorithms, such as ADMIXTURE [2], face similar limitations. ADMIXTURE does not incorporate any mechanism to address either unbalanced sampling or unequal population sizes, making it particularly vulnerable to erroneous inference in complex demographic settings [2]. The latest clustering algorithm, PopCluster [33, 34], explicitly takes sampling imbalance into consideration through a weighted likelihood framework, but it so far lacks a mechanism to explicitly accommodate unequal population size.
South Asia as the region where the common walnut was first domesticated
The geographic origin of the common walnut (Juglans regia) has long been debated, largely due to limitations in earlier population-genetic studies that failed to account for demographic heterogeneity and botanical collecting imbalance, with entire relevant regions underrepresented. With an optimized STRUCTURE-based framework and multiple lines of genomic evidence, including from previously geographically under-sampled regions, our study now consistently identifies South Asia (western Himalayas and adjacent areas) as the initial center of walnut domestication. First, the nuclear genomic structure reveals that South Asian walnuts have the lowest Fk value (i.e., low level of genetic drift), indicative of their ancestral status. Second, the NJ tree places part of the South Asian lineage close to the deeper nodes of the topology, consistent with reduced genetic drift and higher retention of ancestral genetic variation (Fig. 1B). Third, this region harbors the highest proportion of private alleles and the highest heterozygosity among the six groups (Fig. 2C, D), reflecting genetic distinctiveness. Fourth, walnuts from South Asia exhibited the lowest mutation load (Fig. 3B, C). Interestingly, when the Tibetan [35] or Chandler 2.0 [36] reference genomes were used for mutation load estimation, the lowest loads were observed in the Tibetan and European populations, respectively, rather than in South Asia. This discrepancy likely reflects reference bias, as reference genomes derived from bottlenecked populations with genomic erosion can substantially underestimate mutation load in those populations [37]. Fifth, demographic reconstruction indicates that the South Asian population has maintained a relatively stable effective population size, in contrast to repeated bottlenecks inferred in other regions (Fig. 4A). Sixth and finally, chloroplast genome analysis reveals the highest diversity of haplotypes in South Asia (Fig. 5), suggesting deep historical lineage retention.
Together, these six genomic patterns robustly support South Asia as the primary center of walnut domestication, aligning with previous insights from Aradhya et al. [20], Roor et al. [21], Yan et al. [22], and Fan et al. [23], while rejecting the Irano-Anatolian region of West Asia [15, 16] and the Tianshan Mountains of Central Asia [17–19]. Based on our simulations, we speculate that the latter inferences were affected by geographically-biased sampling and unaccounted differences in Ne. Clearly, the nuclear genomes of the West and Central Asian populations occupy more peripheral positions in the NJ topology (Fig. 1B), exhibit fewer private alleles, and have higher mutation loads compared to South Asian populations (Figs. 2B–D; 3B, C), suggesting reduced genetic distinctiveness and historical bottlenecks (Fig. 4A). Chloroplast data further reveal fewer haplotypes in Central (3) and Western (2) Asia compared to South Asia (above). In Central and Western Asia, the earliest walnut pollen records date to the Holocene [38] with anthropogenic origins in Kyrgyzstan (2000 years BP) and Juglans pollen in northern Iran (2300–2350 years BP) [39]. In contrast, South Asian records from Nepal and India indicate a longer presence, dating back to 18,000 and 30,000 years BP, respectively, suggesting refugia during the Last Glacial Maximum [40].
The GONE and STRUCTURE analyses revealed a large and stable effective population size in South Asia (Figs. 1A; 4A), providing partial support for the notion that the ‘domestication bottleneck’ may be a problematic concept [41]—particularly in woody perennial crops [41–44]. In contrast, the significant bottlenecks observed in the five derived populations are more plausibly attributed to founder effects.
The geographic spread of Juglans regia from its South Asian center of domestication
Following domestication in South Asia, walnuts dispersed across the broader Eurasian landscape, shaping the present-day genetic structure of the species (Fig. 6). There is a deep genetic differentiation between eastern and western lineages [16, 21, 45], a pattern previously interpreted as evidence of a Central Asian origin, particularly within the Irano-Anatolian region [16, 18, 19]. However, our new results, informed by expanded geographic sampling and improved STRUCTURE settings accounting for sampling and demographic biases, reveal that domesticated walnuts all trace their ancestry to South Asia.
Fig. 6.
Hypothesized domestication center and dispersal routes of the common walnut
Multiple lines of evidence are consistent with an eastward expansion of domesticated walnuts from South Asia. Patterns of genetic similarity inferred from the NJ tree, together with TreeMix results, place Central Asia outside the East Asian cluster within the same broader clade (Figs. 1B; 2F; 6), supporting Central Asia as an intermediate region during this expansion. In Asia, domesticated walnuts underwent two sequential bottlenecks: the first during their human-mediated transfer out of South Asia and the second during their subsequent expansion into East Asia. Our GONE analysis of simulated data corroborates the occurrence of these bottleneck events (Fig. 4B, C). There is a signal of gene flow from East Asia back into the South Asian source population (Fig. 2F), possibly reflecting the deliberate reintroduction of desirable genotypes by humans. However, additional evidence is required to confirm this. Among the five derived groups, the East Asian population exhibits the highest proportion of private alleles (Fig. 2D), likely due to introgression from Juglans sigillata, a species native to southwestern China that has contributed genetic material to eastern J. regia [16, 22].
Tibetan walnuts form a distinct lineage and may have been introduced from South Asia via the Southern Inner-Plateau Route (Fig. 6) [46]. In ancient Chinese, walnuts are sometimes called K’ang t’ao (meaning Tibetan walnut), with K’ang referring to Tibet [47].
Westward dispersal occurred stepwise from South Asia through West Asia before reaching Europe, also involving two sequential bottlenecks: first during the transfer from South Asia to West Asia, and second during expansion into Europe (Fig. 6). Our GONE analysis corroborates the occurrence of these bottleneck events (Fig. 4). Notably, both Fk values in STRUCTURE and the drift parameter from TreeMix (Fig. 2F) consistently indicate stronger genetic drift in Europe compared to West Asia, suggesting a more pronounced reduction in genetic diversity during westward dispersal. These successive bottlenecks likely contributed to the low genetic diversity observed in modern European walnut populations.
Identifying candidate domestication genes via XP-EHH
Our demographic analyses position South Asia as the domestication center, with five derived populations experiencing severe bottlenecks during their geographic expansion. The XP-EHH approach [28] proved optimal for detecting recent selective sweeps in this context because it requires smaller sample sizes (n ≥ 10) than iHS (n ≥ 100), accommodates unphased genotypes, and effectively captures haplotype homozygosity differentials on very short time scales [30].
The observed allele frequency trajectories—moderate in source populations but nearly fixed in derived groups—support a multi-phase domestication model with recurrent selection [43], mirroring patterns in rice [48], maize [49], and adzuki bean [50]. Although XP-EHH is often interpreted as detecting positive selection in only one of the two compared populations (or in the present context, the derived population), it fundamentally measures differences in extended haplotype homozygosity between populations [51], and thus captures differential selection intensities. In other words, the genes identified by XP-EHH might have already experienced a certain degree of human selection in the source population. It is also important to consider that some of the signals detected may reflect the effects of strong genetic drift following bottlenecks in the derived populations, rather than—or in addition to—artificial selection.
Among the candidate loci, three annotated genes—JreChr11G12281, JreChr01G11061, and JreChr06G11221—are likely involved in domestication-related traits [52, 53] (Additional file 1: Table S5). These genes participate in cell wall remodeling [54], pollen development [55], and lipid transport [56], respectively, and may have been targeted by selection for thinner shells, enhanced fertility, or increased oil accumulation—hallmark traits of cultivated walnut. Although these inferences rely on functional homology with Arabidopsis, they provide hypotheses for future validation. Strong selection signals were also observed in unannotated genes, such as JreChr03G10963, which contains non-synonymous SNPs with large allele frequency shifts (Additional file 1: Table S5).
Conclusions
Running STRUCTURE is only the starting point; reliable inference requires proper use of the software and validation of the results with other kinds of data, such as ecological or fossil data. In this study, we show through simulations that an optimized STRUCTURE framework, combining the F-model with alternative ancestry priors, can correct for biases from unequal population sizes and sampling imbalance, two major challenges in the study of domestication. Applying this strategy to Juglans regia, we identify South Asia as the center of domestication, supported by multiple lines of nuclear- and chloroplast-genomic evidence and demographic stability, and matching nut shell remains [13, 14]. A reliable identification of source and derived populations then facilitates the detection of candidate genes under positive selection via sensitive cross-population comparisons. These findings resolve a long-standing debate on walnut origins and underscore the importance of model-aware clustering in evolutionary inference. The utility of the approach proposed here extends beyond crop domestication to a range of biological contexts that involve complex demography, such as ancient DNA analyses, post-glacial recolonization, and conservation genomics of fragmented or endangered taxa.
Methods
Sampling and sequencing
We collected 39 mature J. regia individuals from Europe (3) and China (Xinjiang (2), Yunnan (3), Beijing (2), Gansu (3), Qinghai (3), and Tibet (23)). Genomic DNA was extracted from dried leaf tissue using a plant total genomic DNA kit (Tiangen, Beijing, China) and was then sequenced using paired-end libraries with an insert size of 350 bp on Illumina HiSeq X-ten instruments by NovoGene (Beijing, China), with read lengths of 150 bp. Samples were sequenced to an average depth of 30×. Additionally, we downloaded whole-genome resequencing data of J. regia from various studies: Ji et al. [31] (209 individuals), Luo et al. [57] (49 individuals), Stevens et al. [58] (20 individuals), Li et al. [59] (5 individuals), Zhang et al. [60] (6 individuals), Zhang et al. [61] (29 individuals), and Ding et al. [16] (42 J. regia individuals). These datasets encompass individuals from North America, Europe, Central Asia, Western Asia, South Asia, and East Asia, with an average depth higher than 10×.
Mapping and variant calling for the nuclear genomes
Raw reads from 399 J. regia individuals were trimmed of adapters and low-quality sequences using Trimmomatic v0.32 [62] and then aligned to the J. regia reference genome [35] using the BWA-MEM algorithm of BWA v0.7.15 [63]. Only uniquely mapped and properly paired reads were retained. SAMtools v1.19 [64] was used to convert SAM files to BAM format and remove PCR duplicates. Indel realignment and SNP calling were conducted using SENTIEON DNAseq software package v202308 [65] and SNPs were aggregated across samples. Stringent SNP filtration was applied via GATK's VariantFiltration [66], using criteria including “QD > 2.0, QUAL > 30.0, SOR < 3.0, FS < 60.0, MQ > 40.0, MQRankSum > −12.5, and ReadPosRankSum > −8.0”. We excluded SNPs with mapping depths outside one-third to triple the individual’s average, non-biallelic sites, and those with missing data. Heterozygous genotypes were determined by the proportion of non-reference alleles, set at 20–80% for depths exceeding three times the average, and 10–90% for depths at least one-third of the average; all others were classified as homozygous.
To minimize the effect of introgression from J. sigillata, we excluded 48 samples with over 10% genetic contribution from this species, as determined by STRUCTURE v2.3.4 [1], reducing the dataset to 351 individuals. To ensure genealogical independence, we used KING v2.2.7 [67] to identify related individuals, excluding one from each pair with a kinship coefficient exceeding 0.0442 (indicative of third-degree relations) and retaining the individual with the higher sequencing depth. This led to the exclusion of 53 J. regia individuals from Europe (2), East Asia (44), and North America (7), yielding a final dataset of 298 individuals for STRUCTURE analysis, PCA, and Neighbor-Joining tree construction.
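The relatedness filter described above can be illustrated with a minimal sketch. It assumes the pairwise kinship estimates have already been parsed into tuples; the function name `prune_related`, the sample identifiers, the depth values, and the choice to visit the closest pairs first are all hypothetical and are not taken from the study's scripts.

```python
from typing import Dict, List, Tuple

KINSHIP_CUTOFF = 0.0442  # third-degree threshold quoted in the text


def prune_related(
    pairs: List[Tuple[str, str, float]],
    depth: Dict[str, float],
    cutoff: float = KINSHIP_CUTOFF,
) -> List[str]:
    """Return the samples to drop so that no retained pair exceeds the cutoff."""
    dropped = set()
    # Visiting the most closely related pairs first is an assumption of this sketch.
    for a, b, kinship in sorted(pairs, key=lambda p: -p[2]):
        if kinship <= cutoff:
            continue  # pair is not considered related
        if a in dropped or b in dropped:
            continue  # pair already broken by an earlier removal
        # Drop the member with the lower sequencing depth, keep the other.
        dropped.add(a if depth[a] < depth[b] else b)
    return sorted(dropped)


# Made-up example: (sample 1, sample 2, kinship coefficient)
pairs = [("s1", "s2", 0.25), ("s2", "s3", 0.10), ("s4", "s5", 0.01)]
depth = {"s1": 32.0, "s2": 12.0, "s3": 18.0, "s4": 30.0, "s5": 11.0}
removed = prune_related(pairs, depth)
```

In this toy input, removing one low-depth sample breaks both related pairs, which is why a greedy pass can drop fewer individuals than there are flagged pairs.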
To obtain neutral and independent SNPs, we excluded SNPs located within coding sequences and their 3-kb flanking regions, following Zhao et al. [68]. We further thinned the SNPs using a distance filter of greater than 20 kb between consecutive SNPs and removed singletons to minimize false positives due to sequencing errors, resulting in a data set of 14,950 SNPs for population structure analysis.
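As an illustration of the distance filter, the sketch below keeps a SNP only when it lies more than 20 kb from the previously retained SNP on the same chromosome. The function name `thin_snps` and the toy coordinates are hypothetical, and the greedy left-to-right rule is an assumption; the original pipeline may have thinned differently.

```python
from typing import Dict, List, Tuple


def thin_snps(
    snps: List[Tuple[str, int]], min_gap: int = 20_000
) -> List[Tuple[str, int]]:
    """Greedy thinning of (chromosome, position) records sorted by position."""
    kept: List[Tuple[str, int]] = []
    last_kept: Dict[str, int] = {}
    for chrom, pos in snps:
        prev = last_kept.get(chrom)
        # Keep the first SNP on each chromosome, then only SNPs that are
        # strictly more than min_gap bp from the last retained one.
        if prev is None or pos - prev > min_gap:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept


# Made-up coordinates
snps = [
    ("chr1", 1_000),
    ("chr1", 15_000),
    ("chr1", 21_001),
    ("chr1", 41_001),
    ("chr2", 500),
]
thinned = thin_snps(snps)
```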
Population structure and phylogenetic analysis
To investigate the population structure of the 298 individuals, we performed principal component analysis (PCA) using the R package SNPRelate v1.6.2 [69] with default settings. Additionally, we used STRUCTURE v2.3.4 [1] to cluster individuals, with the number of clusters (K) ranging from 1 to 8. Clustering was conducted under the admixture model with two distinct parameter settings: the first (ParamSet1) used the alternative ancestry prior (POPALPHAS = 1) with a small ALPHA value (ALPHA = 0.25) and the correlated allele frequency model (F model, FREQSCORR = 1), while the second (ParamSet2) used the alternative ancestry prior (POPALPHAS = 1) with a small ALPHA value (ALPHA = 0.25) and the uncorrelated allele frequency model (FREQSCORR = 0). Each parameter setting was run with 100,000 burn-in steps followed by 500,000 Markov Chain Monte Carlo (MCMC) steps, and 20 replicate runs were conducted for each value of K to assess the variation in likelihood. The optimal number of clusters (K) was determined using three criteria: Ln (D|K), the final posterior probability of K [1]; Delta K, the rate of change in Ln (D|K) between successive K values [70]; and KFinder v1.0, based on the parsimony index (PI) proposed by Wang [24].
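The Delta K criterion can be sketched as follows. This is a simplified illustration, not the authors' script: it assumes the replicate Ln (D|K) values have already been collected per K, and it takes the absolute second-order difference of the replicate means divided by the standard deviation at K, which is one common way of computing the statistic. The function name `delta_k` and the likelihood values are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List


def delta_k(lnp_by_k: Dict[int, List[float]]) -> Dict[int, float]:
    """Delta K for every K that has both neighbours K-1 and K+1."""
    means = {k: mean(v) for k, v in lnp_by_k.items()}
    result: Dict[int, float] = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue  # undefined at the smallest and largest K
        sd = stdev(lnp_by_k[k])
        if sd == 0:
            continue  # avoid division by zero when replicates are identical
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / sd
    return result


# Made-up replicate log-likelihoods for K = 1..4
runs = {
    1: [-1000.0, -1002.0, -998.0],
    2: [-800.0, -801.0, -799.0],
    3: [-790.0, -795.0, -785.0],
    4: [-788.0, -792.0, -784.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
```

Because the statistic is undefined at the smallest and largest K tested, it cannot by itself support K = 1, which is one reason to combine it with the other two criteria.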
Additionally, we incorporated one individual each from J. nigra and J. mandshurica as outgroups. Using MEGA [71], we constructed a Neighbor-Joining (NJ) tree based on the best-fit substitution model selected by the software and validated with 1,000 bootstrap replicates.
To simplify the evaluation of how two STRUCTURE parameter settings (see above) influence the accuracy of ancestral population inference under a domestication scenario, we randomly selected two populations (West Asia and Tibet) from the five derived groups to represent the derived lineages. These two, together with the South Asian population, formed a three-population subset used for comparative analyses based on both empirical and simulated datasets. The sample sizes and the number of SNPs used were consistent with those in the empirical dataset representing three populations: Population A (South Asia), Population B (Tibet), and Population C (West Asia).
For the simulated datasets, we used a coalescent-based framework (Fastsimcoal2). Demographic scenarios were parameterized to reflect plausible walnut histories, including effective population sizes, divergence times, and bottleneck events, and were later confirmed by the GONE analyses (see below). Mutation (1.03 × 10⁻⁷ per site per generation) and recombination rates (2.63 cM/Mb) were set according to empirical estimates for Juglans [72]. These settings ensured that the simulated data were consistent with the empirical patterns while remaining independent of the empirical SNPs. The simulated SNP datasets were then analyzed in STRUCTURE under the two parameter settings (ParamSet1 and ParamSet2), allowing us to assess the robustness of ancestry inference under alternative model assumptions.
Genetic diversity and differentiation analysis
We used VCFtools v0.1.17 [73] to calculate a suite of genetic diversity metrics based on datasets filtered to remove missing data. The analyses included linkage disequilibrium (LD) decay, nucleotide diversity (π), heterozygosity, genetic differentiation (FST), absolute genetic divergence (DXY), and proportions of private SNPs. These metrics were assessed across six genetic groups defined by STRUCTURE and PCA analyses: Europe (26 individuals), West Asia (51 individuals), Central Asia (48 individuals), Tibet (40 individuals), East Asia (73 individuals), and South Asia (16 individuals). To account for sample size differences, we performed 20 replicates for each group by randomly subsampling 16 individuals per replicate. p-values were derived using t-tests comparing each group to the South Asia group.
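The equal-size subsampling scheme can be sketched as below, assuming a per-individual statistic (here heterozygosity) is already available for each group. The function name `subsampled_means`, the seed, and the values are hypothetical; the study's own statistics were computed with VCFtools.

```python
import random
from statistics import mean
from typing import List


def subsampled_means(
    values: List[float], n: int = 16, replicates: int = 20, seed: int = 1
) -> List[float]:
    """Mean of a per-individual statistic in repeated subsamples of size n."""
    if len(values) < n:
        raise ValueError("group is smaller than the subsample size")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Each replicate draws n individuals without replacement.
    return [mean(rng.sample(values, n)) for _ in range(replicates)]


# Made-up heterozygosity values for a hypothetical group of 26 individuals
group = [0.20 + 0.001 * i for i in range(26)]
reps = subsampled_means(group)
```

For a group that contains exactly 16 individuals, every replicate draws the whole group, so its replicate means are identical and all between-replicate variation comes from the larger groups.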
Gene flow among populations was inferred using the OrientAGraph approach [25], which optimizes Maximum Likelihood Network Orientation (MLNO) within the TreeMix framework [74]. Allele frequencies for the six groups (Central Asia, East Asia, Europe, Tibet, South Asia, and West Asia) were derived from STRUCTURE-defined gene pools.
Runs of homozygosity and mutation load of six groups
To identify runs of homozygosity (ROH), we first converted the filtered multi-individual VCF file of the six populations into a .ped file and identified ROH in PLINK v1.9 [75]. To assess the robustness of our results to the applied parameters and to potential sequencing errors, we used three sets of parameters where we varied the window size (homozyg-window-snp) and the number of heterozygous sites per window (homozyg-window-het): (1) homozyg-window-snp 100 and homozyg-window-het 1; (2) homozyg-window-snp 250 and homozyg-window-het 3 (reported in main text in Fig. 4A); (3) homozyg-window-snp 500 and homozyg-window-het 5.
All other parameters described hereafter were the same for each of the three parameter sets. If at least 5% of all windows that included a given SNP were defined as homozygous, the SNP was defined as being in a homozygous segment of a chromosome (homozyg-window-threshold 0.05). This threshold was chosen to ensure that the edges of a ROH are properly delimited. A homozygous segment was then defined as a ROH if all of the following conditions were met: the segment included ≥ 25 SNPs (homozyg-snp 25); the segment covered ≥ 100 kb (homozyg-kb 100); the minimum SNP density was one SNP per 50 kb (homozyg-density 50); the maximum distance between two neighbouring SNPs was ≤ 1,000 kb (homozyg-gap 1,000); the number of heterozygous sites within ROH was set to 750 (homozyg-het 750) to prevent sequencing errors from breaking ROH. We then calculated individual inbreeding coefficients (FROH) [76] by summing the proportion of the genome covered by ROHs (total length of ROHs/total length of genome assembly). p-values were derived using t-tests comparing each group to the South Asia group.
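The inbreeding coefficient defined above reduces to a simple ratio, sketched here under the assumption that ROH lengths per individual have already been extracted from the PLINK output. The function name `f_roh`, the segment lengths, and the genome length are hypothetical placeholders.

```python
from typing import List


def f_roh(
    roh_lengths_kb: List[float], genome_length_kb: float, min_kb: float = 100.0
) -> float:
    """Proportion of the genome assembly covered by ROH for one individual."""
    if genome_length_kb <= 0:
        raise ValueError("genome length must be positive")
    # Segments shorter than min_kb are ignored, mirroring the 100 kb
    # minimum length stated in the text.
    total = sum(length for length in roh_lengths_kb if length >= min_kb)
    return total / genome_length_kb


# Made-up values: three segments on a hypothetical 1,000 kb assembly
example_froh = f_roh([150.0, 250.0, 50.0], genome_length_kb=1_000.0)
```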
When reads were mapped to the Tibetan reference genome [35] for mutation load estimation, the Tibetan population exhibited the lowest ratio of derived deleterious and loss-of-function (LoF) variants to synonymous variants. Likewise, when reads were mapped to the Chandler 2.0 reference genome [36], which originated from France [77], the European population showed the lowest ratios (Additional file 2: Fig. S3). As Dussex et al. [37] explicitly stated, using a reference genome that has suffered from genomic erosion (i.e., genetic threats to small populations) in a bottlenecked population can significantly underestimate the genetic load of that population. This effect is corroborated by our results; to mitigate this bias, we followed Dussex et al. [37] by using the genome of Juglans sigillata [78] as the reference for mutation load estimation. The effect of SNP variants on protein-coding gene sequences was further annotated and classified into loss-of-function (LoF), missense, and synonymous variants using SnpEff v5.0 [79]. LoF variants denote those with gain and/or loss of a stop codon, or those with loss of a start codon. Missense SNPs were further predicted as deleterious (score ≤ 0.05) based on the SIFT score computed by the program SIFT4G [80]. At each SNP position, we determined the derived versus ancestral allelic state using the est-sfs software through comparison with J. mandshurica and J. nigra sequences. The total derived alleles for LoF, deleterious and synonymous variants were estimated for each individual. p-values were derived using t-tests comparing each group to the South Asia group.
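The per-individual load ratios can be sketched as follows, assuming each variant has already been assigned a category (LoF, deleterious missense, or synonymous) and a derived-allele count of 0, 1, or 2 for the individual. The function name `load_ratios`, the category labels, and the toy genotypes are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def load_ratios(variants: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Ratios of derived LoF and deleterious alleles to derived synonymous alleles."""
    totals: Counter = Counter()
    for category, n_derived in variants:
        totals[category] += n_derived  # 0, 1 or 2 derived copies per site
    synonymous = totals["synonymous"]
    if synonymous == 0:
        raise ValueError("no derived synonymous alleles to normalise by")
    return {
        "lof": totals["lof"] / synonymous,
        "deleterious": totals["deleterious"] / synonymous,
    }


# Made-up genotypes for one individual
individual = [
    ("synonymous", 2),
    ("synonymous", 1),
    ("synonymous", 1),
    ("lof", 1),
    ("deleterious", 2),
]
ratios = load_ratios(individual)
```

Normalising by synonymous derived alleles makes individuals comparable even when the total number of called sites differs among them.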
Population demographic analysis
To infer changes in effective population sizes (Ne), we used GONE [26] to analyze six groups identified through STRUCTURE and PCA analysis: Central Asia, East Asia, Europe, South Asia, Tibet, and West Asia. We assumed a constant rate of recombination of 2.63 cM/Mb for the whole genome [72] and excluded LD data with recombination rates > 0.05 to reduce the effect of sampling on the estimates as well as artefacts from recent migrants, as recommended in the GONE User’s Guide. We performed 20 replicate analyses, each including 50,000 SNPs sampled randomly from each chromosome.
To further investigate the impact of population bottlenecks on demographic inference, we simulated SNP data under the demographic models using Fastsimcoal2 [81] based primarily on the empirical data from GONE: (1) a source population model without bottleneck (population A, represents South Asia), (2) a single bottleneck model (population C, represents West Asia; population E, represents Central Asia), and (3) a model incorporating two successive bottlenecks (population B, represents Tibet; population D, represents Europe; population F, represents East Asia). Parameter values for these models—including divergence times, historical Ne changes, and bottleneck intensities—were informed by empirical estimates obtained from the GONE analysis of real data. Simulations were performed using the chromosome sizes of the Juglans regia reference genome [35], assuming a mutation rate of 1.03 × 10⁻⁷ per site per generation (with 50 years per generation) and a recombination rate of 2.63 cM/Mb [72].
The demographic model begins with a large source population (Ne = 10,000) that did not experience a bottleneck but reflects population changes associated with the initial phase of domestication. During this process, the effective population size decreased from Ne = 10,000 to Ne = 4,000, representing the transition to a managed population. This was followed by stabilization at Ne = 2,000 (defined as population A), representing the core domesticated lineage. Population A then served as the source for all subsequent derived populations.
Each subpopulation diverged from population A at specified time points and experienced distinct demographic trajectories. Population B (Tibet) split from A 100 generations ago and underwent a severe bottleneck, with effective population size reduced to 10, followed by partial recovery to 50 by 10 generations ago (sample size n = 40). Population C (West Asia) diverged 80 generations ago with an initial effective size of 50, expanding to 200 by 20 generations ago (n = 50). Similarly, population D (Europe) diverged 40 generations ago with an initial size of 20, increasing to 100 by 20 generations ago (n = 20). Population E (Central Asia) followed a trajectory analogous to that of population C, while population F (East Asia) diverged 40 generations ago with an initial size of 25, expanding to 100 by 20 generations ago (both with n = 50). Simulated datasets were subsequently analyzed using GONE with parameters set to hc = 0.05 and REPs = 40. Each scenario was replicated 20 times, and the geometric mean across replicates was taken as the final estimate of effective population size dynamics.
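Combining replicate estimates by their geometric mean can be sketched as below, assuming each replicate yields an Ne trajectory over the same sequence of generations. The function name `geometric_mean_trajectory` and the numbers are hypothetical and do not reproduce GONE output.

```python
import math
from typing import List


def geometric_mean_trajectory(replicates: List[List[float]]) -> List[float]:
    """Per-generation geometric mean of Ne across replicate trajectories."""
    if not replicates:
        raise ValueError("at least one replicate is required")
    n_gen = len(replicates[0])
    if any(len(r) != n_gen for r in replicates):
        raise ValueError("replicates must cover the same generations")
    # Average on the log scale, then back-transform.
    return [
        math.exp(sum(math.log(r[g]) for r in replicates) / len(replicates))
        for g in range(n_gen)
    ]


# Made-up Ne estimates: two replicates, two generations
combined = geometric_mean_trajectory([[10.0, 100.0], [1000.0, 100.0]])
```

Averaging on the log scale limits the influence of occasional replicates with very large Ne estimates, which would dominate an arithmetic mean.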
Chloroplast genome analysis
For the chloroplast analysis, we excluded from the 399 samples the 48 individuals with over 10% genetic contribution from J. sigillata, as determined by the STRUCTURE analysis of the nuclear data (see above), leaving 351 samples. We processed reads from the 351 J. regia individuals using Trimmomatic v0.32 [62] to trim adapters and low-quality sequences. The cleaned reads of the 351 individuals were then aligned to the J. regia chloroplast genome (NC_028617.1) using the BWA-MEM algorithm of BWA v0.7.15 [63]. Variant calling was performed with SAMtools v1.19 [64], and the identified SNPs were formatted into the Variant Call Format (VCF). We distinguished plastid from nuclear sequences by accepting bases at positions where coverage exceeded five-fold the average of the nuclear genome and consensus was achieved in over 90% of reads. Positions not meeting these criteria were designated as missing data, and indels were excluded.
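The acceptance rule for plastid bases can be illustrated with a small sketch, assuming per-position base counts are available from the alignment. The function name `call_plastid_base`, the use of "N" for missing data, and the example counts are hypothetical.

```python
from typing import Dict


def call_plastid_base(
    base_counts: Dict[str, int],
    nuclear_mean_depth: float,
    depth_factor: float = 5.0,
    min_consensus: float = 0.9,
) -> str:
    """Return the consensus base, or 'N' if the position fails either criterion."""
    depth = sum(base_counts.values())
    # Coverage must exceed five-fold the nuclear average.
    if depth == 0 or depth <= depth_factor * nuclear_mean_depth:
        return "N"
    base, count = max(base_counts.items(), key=lambda item: item[1])
    # More than 90% of reads must agree on the base.
    return base if count / depth > min_consensus else "N"


# Made-up counts with a nuclear average depth of 30x
accepted = call_plastid_base({"A": 190, "G": 10}, nuclear_mean_depth=30.0)
ambiguous = call_plastid_base({"A": 100, "G": 100}, nuclear_mean_depth=30.0)
shallow = call_plastid_base({"A": 100}, nuclear_mean_depth=30.0)
```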
In addition to the 351 chloroplast samples, seven chloroplast genomes from Yan et al. [22] were included, resulting in a total of 358 chloroplast genomes of J. regia being obtained. We included chloroplast genomes of J. nigra (NC_035967.1) and J. mandshurica (NC_033892.1) as outgroups. Sequence alignment was conducted using MAFFT v7.475 [82]. We constructed a Maximum Likelihood (ML) tree using IQ-TREE 2 [83], employing the ModelFinder Plus method and performing 1,000 bootstrap replicates. Simultaneously, a Neighbor-Joining (NJ) tree was generated using MEGA [71], which used the best-fit substitution model selected by the software, also with 1,000 bootstrap replicates. Additionally, haplotypes were identified using DnaSP v6, including sites with two nucleotide types or two plus N [84].
Cross-population selection signatures and candidate genes under positive selection
We employed the cross-population extended haplotype homozygosity (XP-EHH) method to detect signals of positive selection. XP-EHH scores were calculated using selscan (v2.0.3) [29, 30], with each of the five derived populations independently compared to the South Asian population as the reference. The calculation followed the methodology described by Sabeti et al. [28]. The analysis was performed using the following parameters: --xpehh to specify XP-EHH calculation; --unphased to allow the use of unphased genotype data; --vcf and --vcf-ref to input VCF files for the test and reference populations, respectively; and --pmap to enable physical map-based computations. A maximum inter-SNP distance of 200 kb was set using --max-gap to reduce artifacts caused by long-range linkage disequilibrium due to missing data. Rare variants were filtered out using a minor allele frequency threshold of 0.05 (--maf 0.05).
As the XP-EHH statistic approximately follows a normal distribution [28], we normalized the raw XP-EHH scores using the "norm" utility of selscan. Significant SNPs were identified based on a normalized XP-EHH score threshold (normxpehh value ≥ 2) for each test population relative to the South Asian reference. We then intersected the significant SNPs across all five derived populations, yielding a set of shared loci. Allele frequencies of these SNPs were subsequently calculated across all six populations. We retained SNPs showing a consistent directional shift in allele frequency in all five derived populations relative to the South Asian reference (e.g., all increased or all decreased). Further filtering required these SNPs to be located within genes or within 5 kb upstream or downstream of gene boundaries. Gene function was annotated using eggNOG-mapper (v2.1.9) [85] in combination with UniProt to obtain GO terms, KEGG pathways, and functional descriptions.
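The intersection and directional-shift filters can be sketched as follows, assuming the significant SNP sets and the per-population allele frequencies have already been tabulated. The function name `shared_directional_snps`, the SNP identifiers, and the frequencies are hypothetical; the toy example uses only two derived populations for brevity.

```python
from typing import Dict, List, Set


def shared_directional_snps(
    significant: Dict[str, Set[str]],
    freq: Dict[str, Dict[str, float]],
    reference: str = "South Asia",
) -> List[str]:
    """SNPs significant in every derived population with a same-sign frequency shift."""
    derived = list(significant)
    shared = set.intersection(*(set(s) for s in significant.values()))
    kept = []
    for snp in sorted(shared):
        shifts = [freq[snp][pop] - freq[snp][reference] for pop in derived]
        # Require all increases or all decreases relative to the reference.
        if all(s > 0 for s in shifts) or all(s < 0 for s in shifts):
            kept.append(snp)
    return kept


# Made-up inputs
significant = {
    "Europe": {"snp1", "snp2", "snp3"},
    "Tibet": {"snp1", "snp2"},
}
freq = {
    "snp1": {"South Asia": 0.4, "Europe": 0.9, "Tibet": 0.8},
    "snp2": {"South Asia": 0.5, "Europe": 0.9, "Tibet": 0.2},
    "snp3": {"South Asia": 0.3, "Europe": 0.9, "Tibet": 0.3},
}
candidates = shared_directional_snps(significant, freq)
```

Here `snp2` is shared but shifts in opposite directions in the two derived populations, so only `snp1` survives both filters.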
Supplementary Information
Additional file 1: Table S1–S7. This file contains all supplementary tables. Table S1. Details of the sample locations, exclusion threshold, kinship level and Q values at K = 6 in STRUCTURE. Table S2. Three methods used to determine the optimal number of clusters in the Juglans STRUCTURE analyses under ParamSet1. Table S3. Three methods used to determine the optimal number of clusters in the Juglans STRUCTURE analyses under ParamSet2. Table S4. Details of the sample chloroplast haplotypes. Table S5. The allele frequencies of SNPs identified in XP-EHH analysis and located within or ≤5 kb downstream of annotated genes in Juglans regia. Table S6. Optimal number of clusters under default setting of subsampling analysis (three methods). Table S7. Optimal number of clusters under ParamSet1 of subsampling analysis (three methods).
Additional file 2: Fig. S1–S3. Fig. S1. Population structure analysis of 298 individuals of Juglans regia. Fig. S2. STRUCTURE results under the subsampling design. Fig. S3. Mutational load and loss-of-function variants were analyzed in six groups of Juglans regia using the Tibetan reference genome and Chandler 2.0 reference genome.
Acknowledgements
For discussion and helpful comments, we thank the handling editor, three reviewers, Jian-Quan Liu, and Shou-Xian Li.
Peer review information
Martin Mascher and Wenjing She were the primary editors of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.
Authors’ contributions
W.N.B. and D.Y.Z. conceived and supervised the project; W.P.Z., A.N., and J.L. collected materials; C.J.C. and Y.Y. performed the chloroplast genome analyses; C.J.C. and Y.M.D. performed the nuclear genome analyses; X.X.P. and C.J.C. conducted the simulation analyses; W.N.B., C.J.C., and S.S.R. wrote the paper; S.S.R., W.N.B., C.J.C., B.W.Z., and D.Y.Z. revised and proofread the paper. All authors approved the final version.
Funding
This research was funded by the National Natural Science Foundation of China (32370230 and 32170398), the National Key R&D Program of China (2017YFA0605104), the “111” Program of Introducing Talents of Discipline to Universities (B13008), the Fundamental Research Funds for the Central Universities, and the China Postdoctoral Science Foundation (GZB20240286 to XXP).
Data availability
The whole-genome resequencing data newly generated in this study have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive under BioProject accession number PRJNA356989 [86] and are also available at the National Genomics Data Center (NGDC, China National Center for Bioinformation) under BioProject accession number PRJCA010540 [87]. Previously published whole-genome resequencing datasets used in this study were retrieved from public repositories as described in the original publications (see Methods), including data from Ji et al. [31], Luo et al. [57], Stevens et al. [58], Li et al. [59], Zhang et al. [60, 61], Ding et al. [16], and Yan et al. [22], with corresponding accession numbers available in those studies. All custom scripts used for data processing and analyses have been deposited on GitHub (https://github.com/chencj2599/Origin_of_regia) under the MIT license [88] and archived in Zenodo (10.5281/zenodo.17356142) [89]. Genome-wide XP-EHH scan results generated in this study have been deposited in Zenodo under DOI 10.5281/zenodo.17349969 [90].
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Susanne S. Renner, Email: srenner@wustl.edu
Da-Yong Zhang, Email: zhangdy@bnu.edu.cn
Wei-Ning Bai, Email: baiwn@bnu.edu.cn
References
- 1.Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–59.
- 2.Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655–64.
- 3.Gutaker RM, Purugganan MD. Adaptation and the geographic spread of crop species. Annu Rev Plant Biol. 2024;75:679–706.
- 4.Hahn MW. Population structure. In: Molecular population genetics. New York: Oxford University Press; 2019. p. 109–10.
- 5.Lawson DJ, van Dorp L, Falush D. A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nat Commun. 2018;9:3258.
- 6.Kalinowski ST. The computer program STRUCTURE does not reliably identify the main genetic clusters within species: simulations and implications for human population structure. Heredity (Edinb). 2011;106:625–32.
- 7.Puechmaille SJ. The program STRUCTURE does not reliably recover the correct population structure when sampling is uneven: subsampling and new estimators alleviate the problem. Mol Ecol Resour. 2016;16:608–27.
- 8.Novembre J. Pritchard, Stephens, and Donnelly on population structure. Genetics. 2016;204:391–3.
- 9.Pang XX, Zhang DY. A cautionary note on using STRUCTURE to detect hybridization in a phylogenetic context. J Syst Evol. 2025;63:1560–76.
- 10.Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164:1567–87.
- 11.Wang J. The computer program structure for assigning individuals to populations: easy to use but easier to misuse. Mol Ecol Resour. 2017;17:981–90.
- 12.Wilkinson KN, Gasparian B, Pinhasi R, Avetisyan P, Hovsepyan R, Zardaryan D, et al. Areni-1 cave, Armenia: a chalcolithic-early bronze age settlement and ritual site in the southern Caucasus. J Field Archaeol. 2012;37:20–33.
- 13.Pokharia AK, Mani BR, Spate M, Betts A, Srivastava A. Early neolithic agriculture (2700–2000 BC) and Kushan period developments (AD 100–300): macrobotanical evidence from Kanispur in Kashmir, India. Veg Hist Archaeobot. 2018;27:477–91.
- 14.Spengler RN, Tang L, Nayak A, Boivin N, Olivieri LM. The southern Central Asian mountains as an ancient agricultural mixing zone: new archaeobotanical data from Barikot in the Swat valley of Pakistan. Veg Hist Archaeobot. 2021;30:463–76.
- 15.Zohary D, Hopf M, Weiss E. Domestication of plants in the old world: the origin and spread of domesticated plants in Southwest Asia, Europe, and the Mediterranean Basin. Oxford: Oxford University Press; 2012. p. 254–61.
- 16.Ding YM, Cao Y, Zhang WP, Chen J, Liu J, Li P, et al. Population-genomic analyses reveal bottlenecks and asymmetric introgression from Persian into iron walnut during domestication. Genome Biol. 2022;23:145.
- 17.Molnar TJ, Zaurov DE, Capik JM, Eisenman SW, Ford T, Nikolyi LV, et al. Persian walnuts (Juglans regia L.) in Central Asia. In: Northern Nut Growers Association, 101st annual report. 2011:56–69.
- 18.Mapelli S, Pollegioni P, Woeste K, Chiocchini F, Lungo S, Olimpieri I, et al. Spatial genetic structure of common walnut (Juglans regia L.) in central Asia. Acta Hortic. 2016;1190:27–34.
- 19.Pollegioni P, Woeste K, Chiocchini F, Del Lungo S, Ciolfi M, Olimpieri I, et al. Rethinking the history of common walnut (Juglans regia L.) in Europe: its origins and human interactions. PLoS ONE. 2017;12:e0172541.
- 20.Aradhya M, Velasco D, Ibrahimov Z, Toktoraliev B, Maghradze D, Musayev M, et al. Genetic and ecological insights into glacial refugia of walnut (Juglans regia L.). PLoS ONE. 2017;12:e0185974.
- 21.Roor W, Konrad H, Mamadjanov D, Geburek T. Population differentiation in common walnut (Juglans regia L.) across major parts of its native range-insights from molecular and morphometric data. J Hered. 2017;108:391–404.
- 22.Yan LJ, Fan PZ, Wambulwa MC, Qi HL, Chen Y, Wu ZY, et al. Human-associated genetic landscape of walnuts in the Himalaya: implications for conservation and utilization. Divers Distrib. 2024;30:e13809.
- 23.Fan PZ, Zhu GF, Wambulwa MC, Milne RI, Wu ZY, Luo YH, et al. Genetic origins and climate-induced erosion in economically important Asian walnuts. Conserv Biol. 2025:e70125.
- 24.Wang J. A parsimony estimator of the number of populations from a STRUCTURE-like analysis. Mol Ecol Resour. 2019;19:970–81.
- 25.Molloy EK, Durvasula A, Sankararaman S. Advancing admixture graph estimation via maximum likelihood network orientation. Bioinformatics. 2021;37:i142–50.
- 26.Santiago E, Novo I, Pardinas AF, Saura M, Wang J, Caballero A. Recent demographic history inferred by high-resolution analysis of linkage disequilibrium. Mol Biol Evol. 2020;37:3642–53.
- 27.Novo I, Ordas P, Moraga N, Santiago E, Quesada H, Caballero A. Impact of population structure in the estimation of recent historical effective population size by the software GONE. Genet Sel Evol. 2023;55:86.
- 28.Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, et al. Genome-wide detection and characterization of positive selection in human populations. Nature. 2007;449:913–8.
- 29.Szpiech ZA, Hernandez RD. Selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Mol Biol Evol. 2014;31:2824–7.
- 30.Szpiech ZA. selscan 2.0: scanning for sweeps in unphased data. Bioinformatics. 2024;40:btae006.
- 31.Ji F, Ma Q, Zhang W, Liu J, Feng Y, Zhao P, et al. A genome variation map provides insights into the genetics of walnut adaptation and agronomic traits. Genome Biol. 2021;22:300.
- 32.Meirmans PG. Subsampling reveals that unbalanced sampling affects STRUCTURE results in a multi-species dataset. Heredity (Edinb). 2019;122:276–87.
- 33.Wang J. Fast and accurate population admixture inference from genotype data from a few microsatellites to millions of SNPs. Heredity (Edinb). 2022;129:79–92.
- 34.Wang J. PopCluster: a population genetics model-based toolset for simulating, inferring and visualising individual admixture and population structure. Mol Ecol Resour. 2025;25:e14058.
- 35.Zhang J, Zhang W, Ji F, Qiu J, Song X, Bu D, et al. A high-quality walnut genome assembly reveals extensive gene expression divergences after whole-genome duplication. Plant Biotechnol J. 2020;18:1848–50.
- 36.Marrano A, Britton M, Zaini PA, Zimin AV, Workman RE, Puiu D, et al. High-quality chromosome-scale assembly of the walnut (Juglans regia L.) reference genome. Gigascience. 2020;9:giaa050.
- 37.Dussex N, Morales HE, Grossen C, Dalen L, van Oosterhout C. Purging and accumulation of genetic load in conservation. Trends Ecol Evol. 2023;38:961–9.
- 38.Beer R, Kaiser F, Schmidt K, Ammann B, Carraro G, Grisa E, et al. Vegetation history of the walnut forests in Kyrgyzstan (Central Asia): natural or anthropogenic origin? Quat Sci Rev. 2008;27:621–32.
- 39.Ramezani E, Mrotzek A, Marvie Mohadjer MR, Kakroodi AA, Kroonenberg SB, Joosten H. Between the mountains and the sea: late Holocene Caspian Sea level fluctuations and vegetation history of the lowland forests of northern Iran. Quat Int. 2016;408:52–64.
- 40.Kotlia BS, Sharma C, Bhalla MS, Rajagopalan G, Subrahmanyam K, Bhattacharyya A, et al. Palaeoclimatic conditions in the late Pleistocene Wadda Lake, eastern Kumaun Himalaya (India). Palaeogeogr Palaeoclimatol Palaeoecol. 2000;162:105–18.
- 41.Allaby RG, Stevens CJ, Kistler L, Fuller DQ. Emerging evidence of plant domestication as a landscape-level process. Trends Ecol Evol. 2022;37:268–79.
- 42.Miller AJ, Gross BL. From forest to field: perennial fruit crop domestication. Am J Bot. 2011;98:1389–414.
- 43.Gaut BS, Seymour DK, Liu Q, Zhou Y. Demography and its effects on genomic variation in crop domestication. Nat Plants. 2018;4:512–20.
- 44.Gaut BS, Díez CM, Morrell PL. Genomics and the contrasting dynamics of annual and perennial domestication. Trends Genet. 2015;31:709–19.
- 45.Kairova G, Taskuzhina A, Yanin K, Ismagulova E, Oleichenko S, Sarshayeva M, et al. First evaluation of genetic diversity and population structure of wild and cultivated Juglans regia in Kazakhstan. Genet Resour Crop Evol. 2025;72:8281–92.
- 46.Zhao YC, Obie M, Stewart BA. The archaeology of human permanency on the Tibetan plateau: a critical review and assessment of current models. Quat Sci Rev. 2023;313:108211.
- 47.Laufer B. Sino-iranica: Chinese contributions to the history of civilization in ancient Iran, with special reference to the history of cultivated plants and products. Chicago: Field Museum of Natural History; 1919. p. 254–61.
- 48.Jing C, Zhang F, Wang X, Wang M, Zhou L, Cai Z, et al. Multiple domestications of Asian rice. Nat Plants. 2023;9:1221–35.
- 49.Yang N, Wang Y, Liu X, Jin M, Vallebueno-Estrada M, Calfee E, et al. Two teosintes made modern maize. Science. 2023;382:eadg8940.
- 50.Chien CC, Seiko T, Muto C, Ariga H, Wang YC, Chang CH, et al. A single domestication origin of adzuki bean in Japan and the evolution of domestication genes. Science. 2025;388:eads2871.
- 51.Abondio P, Cilli E, Luiselli D. Inferring signatures of positive selection in whole-genome sequencing data: an overview of haplotype-based methods. Genes (Basel). 2022;13:926.
- 52.Sharma RL, Kumar K. Genetic diversity and scope of walnut improvement in India. In: Progress in temperate fruit breeding. Volume 1. Edited by Schmidt H, Kellerhals M: Developments in plant breeding. 1994:447–9.
- 53.Khanal A, Timilsina S, Poon T, Adhikari B. Characterization and selection of thin-shelled walnut (Juglans regia L.) genotypes of Mustang, Nepal. Arch Agric Environ Sci. 2023;8:86–91.
- 54.Xu SL, Rahman A, Baskin TI, Kieber JJ. Two leucine-rich repeat receptor kinases mediate signaling, linking cell wall biosynthesis and ACC synthase in Arabidopsis. Plant Cell. 2008;20:3065–79.
- 55.Leroux C, Bouton S, Kiefer-Meyer MC, Fabrice TN, Mareck A, Guénin S, et al. PECTIN METHYLESTERASE48 is involved in Arabidopsis pollen grain germination. Plant Physiol. 2015;167:367–80.
- 56.Borghi L, Kang J, de Brito Francisco R. Filling the gap: functional clustering of ABC proteins for the investigation of hormonal transport in planta. Front Plant Sci. 2019;10:422.
- 57.Luo X, Zhou H, Cao D, Yan F, Chen P, Wang J, et al. Domestication and selection footprints in Persian walnuts (Juglans regia). PLoS Genet. 2022;18:e1010513.
- 58.Stevens KA, Woeste K, Chakraborty S, Crepeau MW, Leslie CA, Martinez-Garcia PJ, et al. Genomic variation among and within six Juglans species. G3 (Bethesda). 2018;8:2153–65.
- 59.Li X, Wang X, Zhang D, Huang J, Shi W, Wang J. Historical spread routes of wild walnuts in Central Asia shaped by man-made and nature. Front Plant Sci. 2024;15:1394409.
- 60.Zhang BW, Xu LL, Li N, Yan PC, Jiang XH, Woeste KE, et al. Phylogenomics reveals an ancient hybrid origin of the Persian walnut. Mol Biol Evol. 2019;36:2451–61.
- 61.Zhang WP, Cao L, Lin XR, Ding YM, Liang Y, Zhang DY, et al. Dead-end hybridization in walnut trees revealed by large-scale genomic sequence data. Mol Biol Evol. 2022;39:msab308.
- 62.Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20.
- 63.Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.
- 64.Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–93.
- 65.Weber JA, Aldana R, Gallagher BD, Edwards JS. Sentieon DNA pipeline for variant detection - software-only solution, over 20× faster than GATK 3.3 with identical results. PeerJ PrePrints. 2016;4:e1672v2.
- 66.McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.
- 67.Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–73.
- 68.Zhao X, Guo Y, Kang L, Yin C, Bi A, Xu D, et al. Population genomics unravels the holocene history of bread wheat and its relatives. Nat Plants. 2023;9:403–19.
- 69.Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012;28:3326–8.
- 70.Evanno G, Regnaut S, Goudet J. Detecting the number of clusters of individuals using the software structure: a simulation study. Mol Ecol. 2005;14:2611–20.
- 71.Stecher G, Tamura K, Kumar S. Molecular evolutionary genetics analysis (MEGA) for macOS. Mol Biol Evol. 2020;37:1237–9.
- 72.Ding YM, Pang XX, Cao Y, Zhang WP, Renner SS, Zhang DY, et al. Genome structure-based Juglandaceae phylogenies contradict alignment-based phylogenies and substitution rates vary with DNA repair genes. Nat Commun. 2023;14:617.
- 73.Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–8.
- 74.Pickrell JK, Pritchard JK. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 2012;8:e1002967.
- 75.Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.
- 76.Kardos M, Luikart G, Allendorf FW. Measuring individual inbreeding in the age of genomics: marker-based measures are better than pedigrees. Heredity (Edinb). 2015;115:63–72.
- 77.Beede RH, Hasey JK, Ramos DD. The history of the walnut in California. In: Walnut production manual. Oakland: University of California Division of Agriculture and Natural Resources; 1998. p. 8–15.
- 78.Ning D, Wu T, Lei W, Zhang S, Ma T, Pan L, et al. The telomere-to-telomere gap-free genome assembly of Juglans sigillata. Hortic Plant J. 2025;11:1551–63.
- 79.Cingolani P, Platts A, le Wang L, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6:80–92.
- 80.Vaser R, Adusumalli S, Leng SN, Sikic M, Ng PC. SIFT missense predictions for genomes. Nat Protoc. 2016;11:1–9.
- 81.Excoffier L, Foll M. Fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics. 2011;27:1332–4.
- 82.Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–80.
- 83.Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37:1530–4.
- 84.Rozas J, Ferrer-Mata A, Sanchez-DelBarrio JC, Guirao-Rico S, Librado P, Ramos-Onsins SE, et al. DnaSP 6: DNA sequence polymorphism analysis of large data sets. Mol Biol Evol. 2017;34:3299–302.
- 85.Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol. 2021;38:5825–9.
- 86.Chen CJ, et al. Whole-genome resequencing data of Juglans regia. Accession PRJNA356989. NCBI BioProject. 2025. https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA356989.
- 87.Chen CJ, et al. Whole-genome resequencing data of Juglans regia for domestication genomics. National Genomics Data Center (NGDC), China National Center for Bioinformation (CNCB); BioProject PRJCA010540 Datasets. 2025. https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA010540.
- 88.Chen CJ, et al. Custom scripts for population-genomic analyses in the walnut origin study (v1.0.0). GitHub. 2025. https://github.com/chencj2599/Origin_of_regia.
- 89.Chen CJ, et al. Custom scripts for population-genomic analyses in the walnut origin study (v1.0.0). Zenodo. 2025. 10.5281/zenodo.17356142.
- 90.Chen CJ, et al. XP-EHH genome-scan dataset of Juglans regia (South Asia vs. five derived populations). Dataset. Zenodo. 2025. 10.5281/zenodo.17349969.
Associated Data
Supplementary Materials
Additional file 1: Tables S1–S7. This file contains all supplementary tables. Table S1. Details of the sample locations, exclusion threshold, kinship level and Q values at K = 6 in STRUCTURE. Table S2. Three methods used to determine the optimal number of clusters in the Juglans STRUCTURE analyses under ParamSet1. Table S3. Three methods used to determine the optimal number of clusters in the Juglans STRUCTURE analyses under ParamSet2. Table S4. Details of the sample chloroplast haplotypes. Table S5. Allele frequencies of SNPs identified in the XP-EHH analysis and located within or ≤5 kb downstream of annotated genes in Juglans regia. Table S6. Optimal number of clusters under the default settings in the subsampling analysis (three methods). Table S7. Optimal number of clusters under ParamSet1 in the subsampling analysis (three methods).
Additional file 2: Figs. S1–S3. Fig. S1. Population structure analysis of 298 individuals of Juglans regia. Fig. S2. STRUCTURE results under the subsampling design. Fig. S3. Mutational load and loss-of-function variants in six groups of Juglans regia, analyzed using the Tibetan reference genome and the Chandler 2.0 reference genome.
Data Availability Statement
The whole-genome resequencing data newly generated in this study have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive under BioProject accession number PRJNA356989 [86] and are also available at the National Genomics Data Center (NGDC, China National Center for Bioinformation) under BioProject accession number PRJCA010540 [87]. Previously published whole-genome resequencing datasets used in this study were retrieved from public repositories as described in the original publications (see Methods), including data from Ji et al. [31], Luo et al. [57], Stevens et al. [58], Li et al. [59], Zhang et al. [60, 61], Ding et al. [16], and Yan et al. [22]; the corresponding accession numbers are given in those studies. All custom scripts used for data processing and analyses have been deposited on GitHub (https://github.com/chencj2599/Origin_of_regia) under the MIT license [88] and archived in Zenodo under DOI 10.5281/zenodo.17356142 [89]. Genome-wide XP-EHH scan results generated in this study have been deposited in Zenodo under DOI 10.5281/zenodo.17349969 [90].






