Abstract
Located in the southwestern corner of Europe, the Iberian Peninsula is separated from the rest of the continent by the Pyrenees Mountains and from Africa by the Strait of Gibraltar. This geographical position may have conditioned distinct selective pressures compared to the rest of Europe and influenced differential patterns of gene flow. In this work, we analyse 704 whole-genome sequences from the GCAT reference panel to quantify gene flow into Spain from various historical sources and identify the top signatures of positive (adaptive) selection. While we found no clear evidence of a 16th-century admixture event putatively related to the French diaspora during the Wars of Religion, we detected signals of North African admixture matching the Muslim period and the subsequent Christian Reconquista. Notably, besides finding that well-known candidate genes previously described in Eurasians also seem to be adaptive in Spain, we discovered novel top candidates for positive selection putatively associated with immunity and diet (UBL7, SMYD1, VAC14 and FDFT1). Finally, local ancestry deviation analysis revealed that the MHCIII genomic region underwent post-admixture selection following the post-Neolithic admixture with Steppe ancestry.
Keywords: Post-admixture selection, Spanish population, Selection scan, Positive selection, Human adaptation, Demography
Subject terms: Population genetics, Evolutionary biology
Introduction
During the Out-Of-Africa migration1, a small portion of modern humans dispersed from their native habitat in Africa and started to occupy a diverse range of new environments2. This major migratory event, along with subsequent local expansions around the world, explains the overall reduced genetic diversity of non-African compared to African populations. Nevertheless, the diversity of ecosystems colonized by modern humans also drove phenotypic and genetic variation across populations, providing a suitable model for studying local adaptation3–8. In recent years, numerous statistical tests have been developed to identify regions of the genome that have experienced positive (adaptive) selection9–15. Relying on the theoretical framework provided by population genetics16,17, these genome-wide selection scans can be used to pinpoint patterns of genetic variation that deviate from neutral expectations while being compatible with selection at different timescales15. Analysis of genomic footprints of selection across human populations in various biomes and under diverse selective pressures has shed light on the origin and genetic basis of adaptive traits, such as lactase persistence18, light skin pigmentation in response to lower UV radiation19, and high-altitude adaptation20, among others. Additionally, geographically restricted adaptive responses to local pathogens and other environmental stresses may explain population differences in some immune-related phenotypes and other disease-related traits such as drug response, hypertension, obesity, and diabetes, among many others8,21.
The Iberian Peninsula, located in the southwestern corner of Europe, is bordered by the Mediterranean Sea and the Atlantic Ocean and lies just ~13 km away from the North African coast at the narrowest point of the Strait of Gibraltar. Reconstructing the past demography and genetic structure of the Spanish population within the Iberian Peninsula has been challenging. However, recent advances in population genetics, combined with the availability of modern sequencing data22–25 as well as ancient DNA datasets26,27, have significantly enhanced our understanding of its population dynamics and genetic diversity over time. Like other European regions, the Iberian Peninsula has experienced genetic influence from various human populations arriving from the Levant and the Caucasus28. However, it differs in key ways from other parts of Europe. During the glacial period, the peninsula served as a refuge for western hunter-gatherers (WHG), who later also contributed to the genetic diversity of other hunter-gatherer populations across Europe. WHG ancestry is also traceable in Iberian Neolithic farmers, pointing to admixture events between Anatolian farmers and local hunter-gatherers27. Proximity to Africa has also contributed to genetic differences between the Iberian Peninsula and the rest of continental Europe29. Previous analyses of ancient DNA data have shown that Iberian individuals with high North African ancestry date back as early as 4000 years ago (I4246 from Camino de las Yeseras)26. However, these early North African contributions had a limited impact on the overall Iberian gene pool. In contrast, the ancient DNA record26 shows that during the Muslim rule of the Iberian Peninsula (eighth to fifteenth centuries CE), the proportion of North African ancestry in Iberian individuals was higher than it is today30.
Currently, the varying levels of North African admixture are probably the main factor explaining the west-to-east genetic differentiation observed within the Iberian Peninsula31. Notably, the availability of a new French dataset32 allows us to explore the genetic impact of an important, yet lesser-known migration: the French diaspora triggered by the Wars of Religion in the late sixteenth century33, which should be particularly notable in Eastern Iberia (Aragon, Catalonia, Valencia)33–36, where it amounted to up to one quarter of the local population36. Finally, the geographical structure of genetic diversity in Spain has implications for traits of clinical relevance37.
Thanks to its inclusion in the 1000 Genomes dataset38,39, the Spanish population has been extensively analysed through genome-wide scans of positive selection. However, no Spanish-specific signal has been reported to date. Here, to explore signatures of recent positive selection in the Spanish population, we take advantage of an exceptional dataset comprising 785 whole genomes from residents of Catalonia, sequenced at 30X coverage as part of the GCAT|Genomes for Life cohort40–42. Given Catalonia’s recent history of significant migratory inflows, particularly from other parts of Spain, notably Andalusia, we hypothesize that the GCAT dataset could serve as an appropriate proxy for the broader Spanish population. This approach differs from recent work focused on the microgeographical structure of autochthonous samples from the Catalan Pyrenees43. Moreover, the large sample size and high sequencing coverage of the analyzed dataset should provide enhanced statistical power to detect positive selection, including Spanish-specific signals of recent selective sweeps15. In this context, we first investigated the genetic structure and potential influence of gene flow from various external population sources, both ancient and present-day. We then applied the SDS (singleton density score)15 and XP-EHH (cross-population extended haplotype homozygosity)10 methods to detect signals of adaptive selection over a broad timeframe (up to 30,000 years ago). This approach allowed us to identify selection signals for novel as well as known candidate genes, which could be due to either our higher statistical power or gene specificity to southwestern Europe. A Catalan version of the article can be found at: 10.5281/zenodo.15261991.
Results
Genetic structure and demography
When we explored the GCAT dataset (n = 704; see details in Materials and Methods) in the context of the five major geographical regions of the 1000 Genomes Project (1000 GP) (i.e., AFR, AMR, EUR, SAS, and EAS) using principal component analysis (PCA), all GCAT samples clustered within the EUR group (Supplementary Fig. S1). When the PCA was restricted to the 1000 GP European populations (i.e., IBS, TSI, CEU, GBR, and FIN), GCAT individuals clustered closely with the Iberians (IBS) and Tuscans (TSI) (Supplementary Fig. S2). In a broader Mediterranean context, and restricting the GCAT subjects to those with grandparents originating from the same Spanish region (n = 141), the GCAT samples clearly overlapped with Catalan, French, and other European populations (Fig. 1A; Supplementary Table S1). Focusing more specifically on samples from Spain and France, these GCAT individuals clustered with Catalonian, Balearic and Valencian samples44 (Supplementary Fig. S3). Incorporating ancient data from the three major Mesolithic and Neolithic components that have shaped present-day European populations, namely western hunter-gatherers (WHG), early European farmers (EEF), and early Steppe nomads (ENS)45, showed that GCAT individuals have a closer affinity to EEF than other European populations such as Orcadian, French, or Italian, although not as close as Sardinians (Supplementary Fig. S4). ADMIXTURE analysis with contemporary neighbouring populations confirmed a major European ancestry component within the GCAT dataset, accompanied by two minor ancestry components present mostly in Middle Eastern and North African populations at K = 7 (lowest cross-validation error followed by K = 8 and K = 6) (Fig. 1B, Supplementary Fig. S5).
At K = 8, a further component, mostly characterizing Mozabites but also present in other North African populations and in Palestinians, emerged consistently in all the Spanish samples as well as in those from Provence (Bouches-du-Rhône, BdR) and Sardinia. Thus, while the residents of Catalonia sampled by GCAT mostly show a typical continental European genetic profile, they also contain a small proportion of North African (mean: 0.0442, SD: 0.0023) and Middle Eastern ancestries (mean: 0.1010, SD: 0.0019).
Fig. 1.
Genetic structure of the GCAT dataset. (A) Principal component analysis (PCA) performed with 141,849 SNPs and 141 GCAT samples from individuals whose four grandparents were all born in the same autonomous community within Spain. Each point represents an individual from a particular geographical region. Reference populations were compiled from various sources, covering Catalonia, the Balearic Islands, and the Valencia Community in Spain44, France32, and the Human Genome Diversity Panel92; for additional population details, see Supplementary Table S1. The red polygon encloses all samples from the GCAT dataset, whereas Spanish samples from Biagini et al.44 lie within the blue polygon. Abbreviations: CLM, Castilla La Mancha; CYL, Castilla y León. (B) Admixture analysis of the GCAT dataset at the lowest cross-validation errors (K = 7, followed by K = 8 and K = 6) using reference populations from Catalonia, the Balearic Islands, and the Valencia Community in Spain44, France32, North Africa93 and the Human Genome Diversity Panel (HGDP)92; for additional population details, see Supplementary Table S1. (C) Map showing the fraction of GCAT samples clustered into haplotype-based groups defined with fineSTRUCTURE, when silencing external populations. Pie charts are centered on the autonomous communities to which these samples belong. Abbreviations: CAT, Catalonia; ARA, Aragon; MED, Mediterranean; WEST, West-Central Spain.
We next explored haplotype-based clustering of individuals within the same regional GCAT dataset (n = 141) and neighbouring Mediterranean populations using fineSTRUCTURE (Supplementary Fig. S6A). The Spanish population splits into two main branches: East and West (Supplementary Fig. S6B). The East branch is further divided into two clusters, one with samples from Catalonia and Valencia (CAT, as shown in Fig. 1C) and the other with samples from Aragon (ARA). The West branch is subdivided into Eivissa (Ibiza, not shown in Fig. 1C), a Central West cluster comprising samples from Andalusia, Castile, and Extremadura (WEST), and a third cluster with some samples from Valencia (MED). The geographical distribution of these haplotype-based clusters shows an east-to-west gradient of genetic differentiation, consistent with previous analyses and the historical process of the Spanish Reconquista30,31. Additionally, four main branches were obtained for France (Supplementary Fig. S6B): French Basques, northern samples from Paris and Alsace, southern samples from Provence, Dordogne and the Pyrenees, and Breton samples from Rennes, the latter having been previously shown to be genetically distinct from other French populations32,46. The North African samples were separated into a West and an East group (Supplementary Fig. S6A), the latter having a higher Middle Eastern component, as expected29. Subsequently, admixture events for the East and West genetic clusters of the Spanish population were formally inferred using fastGLOBETROTTER47. Both clusters revealed a single admixture event between a southwestern European source (source 1, primarily consisting of samples from Provence, Dordogne, the Pyrenees, and Brittany) and a minor African-like source (source 2). The proportions were 98% and 2% for the West Spain cluster and 96% and 4% for the East Spain cluster, respectively (Supplementary Fig. S7A).
Although fastGLOBETROTTER initially suggested two admixture events in the West Spain cluster, this pattern proved non-robust after bootstrapping. Interestingly, for the West Spain cluster, the West North African cluster was inferred as a surrogate population in both sources 1 and 2, possibly reflecting ancient gene flow between the Iberian Peninsula and the Maghreb. Additionally, French Basques were identified as a surrogate population in sources 1 and 2 for both clusters, pointing to a Basque admixture event, consistent with previous reports30. The inferred dates for these admixture events differ slightly between clusters, being ~ 1153 AD (CI 95%: 1083–1242 AD) for East Spain, and ~ 1211 AD (CI 95%: 1119–1287 AD) for West Spain, assuming 29 years per generation (Supplementary Fig. S7B).
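The conversion between admixture dates in generations and calendar years can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the 29 years per generation comes from the text, whereas the reference year of 1950 and the simple linear formula are assumptions made here for illustration only.

```python
def generations_to_year(generations, years_per_generation=29.0, reference_year=1950):
    """Calendar year (CE) of an event dated `generations` before the reference year.

    reference_year=1950 is a hypothetical choice, not taken from the article.
    """
    return reference_year - generations * years_per_generation


def year_to_generations(year, years_per_generation=29.0, reference_year=1950):
    """Inverse conversion: generations elapsed between `year` and the reference year."""
    return (reference_year - year) / years_per_generation


# Point estimates reported in the text for the East and West Spain clusters.
east_generations = year_to_generations(1153)
west_generations = year_to_generations(1211)
print(round(east_generations, 1), round(west_generations, 1))  # 27.5 25.5
```

Under these assumptions, both events fall roughly 25 to 28 generations before the reference year; a different reference year shifts the calendar dates by the same number of years.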
Finding signatures of selection
Signals of selection were initially explored using SDS15 in the entire GCAT dataset (n = 704) to investigate very recent positive selection events. Additionally, we employed XP-EHH10 to detect haplotypes that have recently been swept to relatively high frequencies in the GCAT dataset, using the YRI population as a reference (Figs. 2 and S8). After annotating the corresponding genes and SNP variants for each selection peak (for details, see Materials and Methods), we further investigated putative candidate variants for selection using iSAFE12 and CLUES14 (Supplementary Tables S2–S9). As expected, most of the candidate variants confirmed by CLUES and initially identified through SDS exhibit lower allele frequencies and correspond to more recent selective sweeps compared to those identified by XP-EHH (Supplementary Figs. S9-S25). Similarly, whereas 12 out of the 40 SDS selection peaks described in the GCAT dataset comprise well-known candidate regions for positive selection previously identified in Europeans (Supplementary Table S3), the majority of signals detected with XP-EHH in GCAT correspond to selective sweeps already described in both Europeans and Asians when compared with the YRI population (Supplementary Table S7). As discussed below, a significant number of candidate regions for positive selection identified in the GCAT dataset are related to lighter skin pigmentation, diet, and immune response (Table 1).
Fig. 2.
Manhattan plot of signatures of recent positive selection in the GCAT dataset. The y-axis indicates the –log10 (p-value) of the singleton density score (SDS) statistic calculated from the WGS data of 704 Spanish individuals in the GCAT dataset. The 40 selection peaks highlighted in orange represent SNPs above the top 99.99% SDS values accompanied by at least 10 SNPs above the 99.995% SDS values within a 1 Mb region (for details on SDS values, genes, and SNP annotations, see Supplementary Tables S2-S5). Genes associated with plausible adaptive biological functions are shown in black. Well-known candidate genes previously associated with adaptive evolutionary traits are in bold.
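The kind of peak-calling rule described in this caption (a lead SNP above one quantile threshold, supported by a minimum number of nearby SNPs above a second threshold within a 1 Mb region) can be sketched as below. This is only an illustrative sketch: the function names, the centred window, the merging of nearby lead SNPs, and the toy quantiles are assumptions, and the actual procedure is the one given in Materials and Methods.

```python
from bisect import bisect_left, bisect_right
import math


def quantile(values, q):
    """Linear-interpolation quantile of a list, with 0 <= q <= 1."""
    s = sorted(values)
    h = (len(s) - 1) * q
    lo = math.floor(h)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (h - lo) * (s[hi] - s[lo])


def call_peaks(snps, q_lead, q_support, min_support=10, window=1_000_000):
    """Call peaks from (position, score) pairs on a single chromosome.

    A lead SNP scores above the q_lead quantile and has at least `min_support`
    other SNPs above the q_support quantile within a window centred on it.
    Lead SNPs closer than `window` are merged into one (start, end) peak.
    """
    snps = sorted(snps)
    scores = [s for _, s in snps]
    lead_thr = quantile(scores, q_lead)
    support_thr = quantile(scores, q_support)
    support_pos = [p for p, s in snps if s > support_thr]  # already sorted
    half = window // 2

    kept = []
    for p, s in snps:
        if s <= lead_thr:
            continue
        n = bisect_right(support_pos, p + half) - bisect_left(support_pos, p - half)
        if s > support_thr:
            n -= 1  # do not count the lead SNP as its own support
        if n >= min_support:
            kept.append(p)

    peaks = []
    for p in kept:
        if peaks and p - peaks[-1][1] <= window:
            peaks[-1][1] = p
        else:
            peaks.append([p, p])
    return [tuple(pk) for pk in peaks]


# Toy chromosome: 20,000 SNPs spaced 1 kb apart with bounded background scores,
# plus one artificial cluster of 30 high-scoring SNPs around 5 Mb.
snps = [(i * 1000, math.sin(i)) for i in range(20000)]
for i in range(5000, 5030):
    snps[i] = (snps[i][0], snps[i][1] + 8.0)

peaks = call_peaks(snps, q_lead=0.999, q_support=0.995)
print(peaks)
```

On this toy input, the only peak returned lies inside the artificial cluster; the quantiles used here are toy values chosen for a small example, not the thresholds of the study.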
Table 1.
Top candidate genes for positive selection in the GCAT dataset.
Function | Genes | SNP | Type | Method | LogLR | s |
---|---|---|---|---|---|---|
Pigmentation | SLC45A2 | rs183671 | Intronic | SDS | 8.06 | 0.00427 |
Pigmentation | OCA2-HERC2 | rs7183877 | Intronic | SDS | 8.75 | 0.00104 |
Pigmentation | KITLG | rs556861 | Intronic, non-coding transcript | XP-EHH | 6.53 | 0.00342 |
Pigmentation | GRM5-TYR | rs7119749 | Intronic | SDS | 9.81 | 0.00371 |
Pigmentation | SLC12A1-DUT | rs9920281 | Intronic | XP-EHH | 5.39 | 0.01865 |
Pigmentation | MLPH-RAB17 | rs10176842 | Intronic | XP-EHH | 5.02 | 0.00255 |
Diet | MCM6-LCT | rs4988235 | Intronic | SDS | 22.52 | 0.00862 |
Diet | SLC22A4 | rs1050152 | Missense | SDS | 11.68 | 0.00503 |
Diet | ABCC1 | rs212086 | Intronic | SDS | 3.71 | 0.00430 |
Diet | FDFT1* | rs1296025 | Intronic | SDS | 4.72 | 0.00576 |
Diet | CYP3A4 | rs2404955 | Downstream gene variant | XP-EHH | 7.90 | 0.00383 |
Immune response | UBL7* | rs750607 | Downstream gene variant | SDS | 9.46 | 0.00361 |
Immune response | SMYD1* | rs35662596 | Intronic | SDS | 6.12 | 0.01294 |
Immune response | PRAG1* | rs55852693 | Downstream gene variant | SDS | 5.25 | 0.00413 |
Immune response | PBX2 | rs204991 | Upstream gene variant | SDS | 6.68 | 0.00774 |
Immune response | TMEM232 | rs10038763 | Intronic, non-coding transcript | XP-EHH | 9.24 | 0.00374 |
Immune response | MTSS2-VAC14* | rs11075777 | Intronic | SDS | 3.58 | 0.00222 |
Selection in pigmentation genes
Three of the top SDS peaks identified in the GCAT dataset overlap with well-known candidate genes for selection related to lighter skin pigmentation in Europeans: SLC45A248, OCA2-HERC211, and the GRM5-TYR11 region (Table 1, Fig. 2). In SLC45A2 we found strong evidence for selection acting on variant rs183671, which is associated with several skin/hair/eye pigmentation traits according to the NHGRI-EBI GWAS Catalog (https://www.ebi.ac.uk/gwas/; Supplementary Table S2). Using CLUES to estimate the past frequency trajectory of the selected derived allele, we detected a selective sweep within the last 500 generations (Supplementary Fig. S9). In the OCA2 genomic region, we found strong evidence for selection at the HERC2 intronic variant rs7183877, also associated with skin, hair, and eye pigmentation (Supplementary Table S2, Supplementary Fig. S10). Another SDS peak potentially related to lighter skin pigmentation comprised a putatively selected intronic variant in the GRM5 gene (rs7119749, Supplementary Fig. S11). This variant is located upstream of the TYR gene, which encodes the enzyme involved in the first step of melanin synthesis. In the XP-EHH analysis comparing YRI and GCAT, we detected up to three additional candidate regions that may be related to lighter skin pigmentation and adaptation to lower UV radiation, as described in previous studies49: SLC12A1-DUT (located near the SLC24A5 gene), MLPH-RAB17, and the KITLG region (Table 1, Supplementary Figs. S12-S14). The SLC24A5 gene encodes a sodium-calcium exchanger associated with pigmentation in zebrafish and humans, possibly by facilitating ion transport in melanosomes50,51, whereas the melanophilin gene (MLPH) plays a key role in melanosome transport and prostate cancer susceptibility52,53. KITLG encodes the ligand of the KIT tyrosine-kinase receptor, which has been shown to influence pigmentation by regulating melanocyte proliferation and melanin distribution54.
Selection in genes related to diet
Two selection peaks obtained with SDS contain well-known candidate regions for dietary adaptation: LCT-MCM6 and the region containing the SLC22A4 and SLC22A5 genes (Fig. 2, Table 1). As expected for a European population, we found robust evidence supporting positive selection for the derived allele of the intronic rs4988235 MCM6 variant associated with lactase persistence in Europeans. Using CLUES on the genome-wide genealogies inferred from the GCAT dataset, we estimate this selection event to have started within the last 200 generations (Supplementary Fig. S15). In the ergothioneine transporter gene SLC22A4, we find strong evidence for positive selection acting on the missense variant rs1050152 (encoding the L503F substitution) within the last 300 generations (Supplementary Fig. S16). This gene is thought to exhibit a selection signal due to adaptation to the low dietary levels of ergothioneine among early Neolithic farmers in the Fertile Crescent55,56. Interestingly, the selected variant is associated with reduced expression of the SLC22A5 gene according to GTEx (Supplementary Table S2), resulting in lower levels of the OCTN2 carnitine transporter, which is important for the transport and oxidation of fatty acids in mitochondria57. Another selection signal identified with SDS, possibly related to adaptive detoxification, was observed in the ABCC1 gene on chromosome 16, where we detected positive selection acting on the intronic variant rs212086, although with lower confidence (Table 1, Supplementary Fig. S17). Notably, ABCC1 encodes a well-known multidrug resistance protein (MRP1), which plays a role in the biliary detoxification of various anti-cancer drugs58 and has previously been detected under positive selection in the CEU population59. Additionally, we found moderate evidence for positive selection acting on the downstream gene variant rs2404955 in the CYP3A4 gene (Supplementary Fig. S18), which is similarly involved in detoxification, as well as in the metabolism of numerous therapeutic drugs60.
Selection in immune response genes
A high proportion of the candidate regions for positive selection identified in the GCAT dataset contain genes putatively related to the immune response (Fig. 2, Table 1). For example, a large candidate region detected with SDS, spanning ~1.474 Mb on chromosome 6, comprises the MHC III region, which includes multiple immune-related genes such as HCG20, HCG21, AIF1, GPANK1, ABHD16A, LY6G6F, C2, and PBX2. Using CLUES to analyse several putative functional variants in this region, we detected moderate evidence of selection acting within the last 150 generations on rs204991, a SNP upstream of PBX2 (Supplementary Fig. S19). According to GTEx, this SNP strongly influences the expression of complement components 4A and 4B in various tissues. Additionally, using the XP-EHH statistic, we replicated a previously known selection signal at the TMEM232 gene in Eurasians61, which has been shown to promote inflammatory response in atopic dermatitis62. Among several candidate variants in the TMEM232 region, the strongest evidence for selection was found for the non-coding transcript exon SNP rs10038763 (Supplementary Fig. S20).
New candidates for adaptation
Our analysis of the top SDS signals in the GCAT dataset (Fig. 2) also revealed several previously undocumented candidate regions and new putatively selected variants (Table 1). For instance, we found moderate evidence of selection within the last 6000 years for the SMYD1 intronic SNP rs35662596 (Supplementary Fig. S21), which reaches its highest global frequency in the IBS population (14%). Notably, mutations in SMYD1 can lead to the absence of the AnWj antigen on the surface of red blood cells63, a receptor for Haemophilus influenzae64. The amount of AnWj correlates with the ability of H. influenzae to adhere to epithelial cells65. We also identified moderate evidence of selection acting on the FDFT1 intronic SNP rs1296025 (Supplementary Fig. S22), whose highest allele frequencies in the 1000 GP are found in CEU (17%), GBR (17%) and IBS (21%) populations. Notably, FDFT1 encodes the first specific enzyme in the cholesterol biosynthesis pathway66 and is a downstream target of the fasting response67, whereas rs1296025 has been associated with non-HDL cholesterol levels68. Another novel candidate for positive selection lies within the UBL7 region, harbouring the regulatory variant rs750607 (Supplementary Fig. S23). According to Ensembl69, this SNP is associated with differential expression of several genes across the region, including Semaphorin 7A (SEMA7A) in neutrophils, which is involved in immune response, inflammation70, and the regulation of NK cells71, T cells72, and mast cells73. Within the PRAG1 region, weaker evidence for selection was detected for the downstream gene variant rs55852693, which is predicted to disrupt a transcription factor binding site (Supplementary Fig. S24) and has been associated with enjoyment of spicy food74.
Although allele frequencies for rs55852693 are similar across all European populations within the 1000 GP, which suggests it may not correspond to a Spanish-specific adaptation, PRAG1 has also been identified by a selection scan of an admixed Brazilian population, in a genomic region enriched for Native American ancestry75,76. Finally, within the MTSS2-VAC14 SDS peak, we found weak evidence of positive selection acting on the MTSS2 intronic rs11075777 SNP (Supplementary Fig. S25). The selected allele is an eQTL for the VAC14 gene, which plays a role in susceptibility to bacteremia77,78. However, given that its frequency is approximately 50% across all European populations in the 1000 GP, it might also represent a broader European signature of selection.
Local ancestry deviations
Next, we explored the genomic regions presenting local ancestry deviations (LAD) associated with specific ancient or modern ancestries that overlap with selection peaks identified via SDS or XP-EHH, as these regions could potentially represent cases of post-admixture selection. In the regional GCAT subset (n = 141), modern LADs were assessed using proxy samples from North Africa for North African ancestry, Palestine for Middle Eastern ancestry, and Southern France as the closest European population to represent the autochthonous background of the GCAT (for details, see Materials and Methods). We identified three overlapping LAD regions (SD > 4.42) on chromosome 6, pointing to an excess of North African and Middle Eastern ancestries, overlapping with three SDS peaks of positive selection (Supplementary Table S10). Two of these SDS peaks displayed LAD for both North African and Middle Eastern ancestries and overlapped with several genes in the MHC III region, including the upstream PBX2 SNP rs204991, which regulates C4A and C4B expression (Fig. 3).
Fig. 3.
Post-admixture selection in the MHC III region. (A) Proportion of North African (NA) ancestry across chromosome 6 in the GCAT dataset. (B) Proportion of Palestinian (PAL) ancestry across chromosome 6 in the GCAT dataset. Black line, mean genomic LAD; red line, significant LAD (4.42 SD). (C) Proportion of early Steppe nomad (ENS) ancestry across chromosome 6 in North Africa. (D) Proportion of ENS ancestry in Palestinians. (E) Proportion of ENS ancestry in the GCAT dataset. Black line, mean genomic LAD; red line, significant LAD (3 SD). (F) Transformed SDS scores after FDR correction, and putative candidate genes overlapping the LAD region detected on chromosome 6. For details and genomic positions, see Supplementary Table S10. The red line indicates a statistical significance cut-off of 0.05.
To investigate LAD regions resulting from extensive admixture events during the Mesolithic and Neolithic periods in Europe, we leveraged the Mesoneo Dataset45, using the EEF, WHG, and ENS samples as proxies to estimate the corresponding ancient ancestry proportions of each GCAT individual (n = 704). This analysis revealed an overall pattern similar to that observed in Bronze Age Iberian individuals (Supplementary Fig. S26). Using a strict cut-off of 4.42 SD79, no LAD regions were found for these ancient ancestries, but when considering a more lenient cut-off of 3 SD, we detected up to five regions with an excess of WHG ancestry, two with ENS ancestry, and one with EEF ancestry (Supplementary Table S10 and Supplementary Figs. S27-S28). Among these LAD regions, only three overlapped with the top 40 signatures of selection detected with SDS in the GCAT dataset. Notably, these included the LCT-MCM6 peak (Supplementary Fig. S29) as well as the third SDS selection peak on chromosome 6 comprising the MHC III region (Fig. 3), both displaying ENS ancestry deviation. The third ancient LAD overlapping an SDS peak contains a cluster of zinc finger genes on chromosome 19 and shows an excess of WHG ancestry (Supplementary Fig. S29). Interestingly, although not presenting recent signatures of positive selection in the GCAT dataset, the only LAD associated with EEF ancestry comprised the FADS2 gene. This gene has been previously identified as a target of strong selection across diverse ancestries, including Eastern and Western hunter-gatherers as well as Anatolian farmer populations6.
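The logic of flagging local ancestry deviations at a given SD cut-off can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes one ancestry proportion per genomic window (already averaged across individuals), uses the genome-wide mean and population standard deviation of those windows, and flags only an excess of ancestry. The function name and the toy values are hypothetical.

```python
from statistics import mean, pstdev
import math


def flag_lad_windows(ancestry, cutoff_sd=4.42):
    """Return indices of windows whose ancestry proportion exceeds the
    genome-wide mean by more than `cutoff_sd` standard deviations."""
    mu = mean(ancestry)
    sd = pstdev(ancestry)
    if sd == 0:
        return []  # no variation across windows, nothing can deviate
    return [i for i, x in enumerate(ancestry) if (x - mu) / sd > cutoff_sd]


# Toy data (not from the study): 1000 windows fluctuating around 5% ancestry,
# with three adjacent windows artificially raised to 8%.
ancestry = [0.05 + 0.001 * math.sin(i) for i in range(1000)]
for i in (400, 401, 402):
    ancestry[i] = 0.08

strict = flag_lad_windows(ancestry, cutoff_sd=4.42)
lenient = flag_lad_windows(ancestry, cutoff_sd=3.0)
print(strict, lenient)  # [400, 401, 402] [400, 401, 402]
```

In real data, the lenient 3 SD cut-off would be expected to flag more windows than the strict 4.42 SD cut-off, which is why overlap with independent selection signals is used as supporting evidence.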
Discussion
Our analyses demonstrate that the GCAT cohort not only captures the genetic characteristics of a typical European population but also serves as a robust proxy for the general Spanish population within the Iberian Peninsula. As our samples covered regions across Spain, we were able to detect the well-documented genetic contribution from North Africa, which is differentially structured between eastern and western Iberia30. However, we found no detectable contribution of the French diaspora linked to the 16th-century Wars of Religion to the current Spanish gene pool. Given the short genetic distance between NE Iberia and S France (which mirrors the linguistic and surname similarities that facilitated the assimilation of these newcomers) and the small number of individuals that could be assigned to a particular region based on the birthplace of their four grandparents, our design may have been underpowered to capture this genetic contribution.
Leveraging the enhanced statistical power provided by the large sample size of the GCAT dataset, we were able to identify new candidate genes under positive selection and replicate several well-known cases of adaptive selection related to pigmentation, diet, and the immune response, previously described using different selection methods. As expected, the candidate regions we replicated using the XP-EHH statistic generally correspond to selection signatures shared across various non-African populations, probably reflecting common environmental pressures following the Out-Of-Africa migration. In contrast, the signatures identified with the SDS statistic are predominantly shared with other European populations or, in some cases, are specific to the Spanish population, indicating more recent and localized selective events. Thus, the combined use of SDS and XP-EHH in the GCAT enabled us to uncover a comprehensive set of genomic selection footprints, shaped by the diverse evolutionary pressures experienced by the ancestors of contemporary Spanish populations across different time periods. While many of the identified selection signatures and well-known candidates are shared with other European populations, differences in the timing and selection strength were clearly observed in the GCAT dataset. These differences may stem from varying demographic histories, external influences, and environmental conditions. For instance, the selection coefficient estimated here for rs4988235 at LCT (s = 0.00862) is slightly lower than (though consistent with) previous estimates (s = 0.0194, s = [0.01019, 0.01056])6,80. Moreover, the lactase persistence allele seems to have emerged earlier in northern Europe than in the Iberian Peninsula26 and its frequency is lower in the Spanish population. Conversely, the current frequency of the selected rs1050152 allele at SLC22A4 seems to be higher in the Spanish population (45%) compared to other European populations in the 1000GP (36–41%). 
This difference may be attributable to the higher EEF ancestry component inferred in the GCAT dataset. Interestingly, while Irving-Pease et al.6 inferred that selection on variant rs1050152 ceased ~ 1500 years ago, our analysis indicates a continuous rise in allele frequency until very recent times, consistent with findings by Mathieson et al.81. Similar high-resolution studies of positive selection in Italy have revealed differential adaptive signatures between northern and southern Italian populations, although the most plausible mechanism involved is probably varying levels of genetic drift82. Unfortunately, the GCAT cohort does not include enough individuals with all four grandparents sharing the same birthplace to perform a robust latitudinal and environmental exploration of positive selection in Spain.
Like other continental European regions, the Iberian Peninsula has been influenced by the cultural transitions and demographic changes of the Holocene. The shift from hunter-gatherer societies to Neolithic agricultural systems, followed by the arrival of pastoralist-nomadic groups, not only transformed the genomic landscape of the Iberian population but also probably introduced new selective pressures, leaving distinct genomic imprints. Moreover, as these incoming populations were likely well-adapted to their respective lifestyles and cultural practices, they may have also contributed adaptive variants to the Spanish gene pool through admixture. Our analysis of the GCAT dataset shows three regions with significant LAD towards the WHG and ENS components, overlapping with recent signals of selection. Interestingly, the LAD region on chromosome 19 (Supplementary Fig. S29), enriched for the WHG ancestry component, harbours a cluster of zinc finger genes. However, the exact function and possible adaptive phenotypes associated with these zinc finger genes remain unknown and require further research to enhance our understanding of pre-Neolithic genetic adaptations in WHG populations. Additionally, LAD analysis showed that the SDS selection peak at the MHC III region was significantly enriched for ENS ancestry. When only contemporary populations were considered, this region displayed parallel enrichment for North African and Middle East ancestry components. Notably, the two populations used as proxies for such external components also displayed significant LAD for the ENS ancestry component, suggesting a shared selective pressure potentially related to domestication and zoonotic transmissions83–85. It should be noted that the high genetic diversity of the HLA region can potentially confound local ancestry inference methods, as observed in research on post-admixture selection in Latin American populations, which showed an excess of African ancestry at the HLA signals86. 
Nevertheless, this bias is not expected to affect the comparison with ENS in our study. Finally, the LCT-MCM6 region also displayed significant deviation for ENS ancestry in the GCAT dataset. This pattern agrees with previous reports that the signatures of positive selection on LCT can be traced to Steppe ancestry6.
While our study provides valuable insights, it is important to acknowledge some inherent limitations and potential biases. First, the GCAT cohort exhibits a geographic imbalance, with an overrepresentation of individuals from southern and eastern Spain compared to northern and western Spain. This asymmetry limits our ability to thoroughly analyse the different periods of North African admixture across the Iberian Peninsula, as the timing of the transition from Muslim to Christian rule varied significantly between regions. As a result, regional differences in admixture patterns arising from distinct historical interactions with North African populations may not have been fully captured. Second, the characterization of top candidate regions for selection (or selection peaks) relies on two user-defined parameters, which could introduce biases: namely, the length of the window used to detect high-scoring alleles and the number of alleles with genome-wide significant statistical values required to classify a genomic region as part of a peak. These arbitrary thresholds may influence both the identification and interpretation of selection signals detected in genome-wide scans performed using SDS and XP-EHH. To address this, we aimed to validate all candidate regions for selection using iSAFE, enabling the identification of favoured alleles12. Additionally, we applied CLUES to visualize past allele frequency trajectories of each putative selected allele and to estimate the corresponding selection coefficient and likelihood of positive selection15. Finally, the small sample size of the populations used as North African proxies may have limited the power of our LAD analyses and admixture time estimates. 
A more comprehensive understanding of the North African admixture process could be achieved by incorporating high-coverage whole genome data from Northern African populations, particularly Amazigh groups, who played a major role in the Islamic Conquest of Iberia, as documented in historical records such as the Muqaddimah by Ibn Khaldūn87.
In conclusion, our findings demonstrate that several candidate genes previously identified as adaptive in other parts of Europe were subject to positive selection in the ancestral populations of present-day Spaniards. Additionally, we identified novel candidate genes for positive selection, which may be due to the larger sample size used in our study or their specificity to southwestern Europe. However, these genes should be considered as provisional candidates until the functionality of their genetic variation and evolutionary relevance are thoroughly characterized and understood.
Materials and methods
GCAT cohort and genome data processing
VCF files for Illumina 30X whole-genome sequences (WGS) of 785 present-day individuals from the GCAT cohort42 were obtained from the European Genome-phenome Archive (EGA) under accession number EGAD00001007774. The GCAT cohort, which consists of 19,140 registrants, was recruited (2014–2018) from residents of Catalonia aged 40–65 with access to the national public healthcare system. The characteristics of the cohort40,41,88 and sequenced dataset are described elsewhere42. Complete genome sequences were available for 785 volunteers; of those, registered metadata indicated that 141 had all four grandparents born in the same Spanish region, whether Catalonia or any other (see details in Supplementary Tables S1 and S11). Quality control and filtering of admixed individuals were performed previously and are described elsewhere42. To focus on biallelic SNPs, BCFtools was used to exclude structural variants and indels. Additionally, 81 GCAT individuals with self-reported non-Caucasian ethnicity were removed (although the meaning of “Caucasian” in the Spanish context is unclear; Supplementary Table S12). The final sample size was therefore 704 (Dataset A).
The VCF files were then lifted over to hg38 using Picard tools89 and merged with the 1000 GP Phase 339 using the isec command in BCFtools90 (Dataset B). Subsequently, rare variants were removed, SNPs displaying strong Hardy–Weinberg deviations were filtered out, and we pruned for linkage disequilibrium (LD) using PLINK291 in sliding windows of 200 kb, a step size of 25 SNPs, and a squared correlation coefficient (r2) threshold of 0.5 (--maf 0.05 --max-maf 0.95 --hwe 1e-50 --indep-pairwise 200 25 0.5). At this point, the dataset consisted of 679,677 variants and 3,208 individuals. Principal component analyses (PCA) were performed using the smartPCA tool from the EIGENSOFT package92 and the EIGENSTRAT93 correction without outlier removal (Supplementary Figs. S7-S2).
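As a rough illustration of the filtering and pruning step described above, the following Python sketch assembles a PLINK 2 command line with the thresholds reported in the text. The input and output prefixes (`dataset_B`, `dataset_B_pruned`), the choice of `--bfile` as input format, and the helper name are hypothetical; the sketch only builds the argument list and does not run PLINK. Note that the flag values as printed carry no `kb` suffix, so PLINK would read the window as a count of variants; the sketch reproduces the printed values unchanged.

```python
import shlex


def build_prune_command(in_prefix, out_prefix, maf=0.05, max_maf=0.95,
                        hwe=1e-50, window=200, step=25, r2=0.5):
    """Return a PLINK 2 argument list with the reported QC and LD-pruning thresholds.

    in_prefix and out_prefix are hypothetical file prefixes chosen for illustration.
    """
    return [
        "plink2",
        "--bfile", in_prefix,            # assumed input format (binary PLINK fileset)
        "--maf", str(maf),               # drop variants with minor allele frequency < 0.05
        "--max-maf", str(max_maf),
        "--hwe", str(hwe),               # drop strong Hardy-Weinberg deviations
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", out_prefix,
    ]


cmd = build_prune_command("dataset_B", "dataset_B_pruned")
print(shlex.join(cmd))
```

Building the command as a list keeps each threshold in one named place, which makes it easier to apply the same settings to Dataset B and Dataset C.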
ADMIXTURE and fineSTRUCTURE analyses
The GCAT dataset was filtered to retain individuals whose four grandparents were all born in the same autonomous community (first-level administrative division in Spain) and merged with available datasets containing suitable proxies for detecting external contributions to the Iberian Peninsula. Accordingly, the 141 samples with all four grandparents from the same autonomous community were initially merged with the HGDP panel94 using the same procedure as for Dataset B. Additional SNP array genotyping data from France32, Catalonia44, and North Africa95 were also included. No samples related up to the third degree were found.
Rare variants were removed, SNPs with strong Hardy–Weinberg deviation were filtered, and LD pruning was performed using PLINK2 in sliding windows of 200 kb, a step size of 25 SNPs, and an r2 threshold of 0.5 (--maf 0.05 --max-maf 0.95 --hwe 1e-50 --indep-pairwise 200 25 0.5), retaining 215,178 variants. PCA was conducted using the smartPCA tool from the EIGENSOFT package with the EIGENSTRAT correction but without outlier removal. Samples from South America, Oceania, and East Asia were excluded due to their lack of relevance to the ancestry components in GCAT, as revealed in previous studies24,30. Thus, the final dataset (Dataset C) consisted of 1,181 individuals (see Supplementary Table S1 for details). Individual ancestries in this pruned dataset were explored with ADMIXTURE 1.322 in 10 runs using the unsupervised mode and tested from K = 1 to K = 12. We used PONG96 to plot the ADMIXTURE results.
The unpruned dataset, containing 426,650 variants across 1,181 individuals, was phased using SHAPEIT 4.1.397 and the 1000 GP haplotype reference panel39. Subsequently, CHROMOPAINTER98 was run in all-versus-all mode for chromosomes 1, 4, 17, and 20 to estimate the switch rate parameter (Ne) and global mutation rate (M) using 10 iterations of CHROMOPAINTER’s expectation–maximization algorithm (EM). Using these values, we reran CHROMOPAINTER in all-versus-all mode, specifying that all individuals should copy from any other individual for all chromosomes. The fineSTRUCTURE98 Markov Chain Monte Carlo (MCMC) method was then applied to assign each individual to a genetic cluster using 1,000,000 burn-in iterations (parameter -x), and 2,000,000 sample iterations (parameter -y) from which we only retained every 10,000th iteration (parameter -z). Additionally, fineSTRUCTURE was rerun using the force file (-F) to fix clusters outside Spain, France, and Italy as continental groups.
The two major inferred Iberian genetic clusters (West and East) were analysed further using fastGLOBETROTTER47 in donor-vs-recipient mode, excluding the Spanish genetic clusters (West, East and Ibiza) as donors and considering all the clusters as recipients (except the non-target Iberian clusters). Subsequently, CHROMOPAINTER was run in donor-vs-recipient mode, using only the target Spanish cluster as the recipient. Next, fastGLOBETROTTER was used with the prop.ind: 1 option to infer and date admixture events. To account for disequilibrium patterns that could confound admixture signals, the null.ind: 1 option was enabled. A second fastGLOBETROTTER run was performed to conduct bootstrap analysis and estimate a confidence interval around the inferred admixture date.
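Admixture dates from this kind of analysis are inferred in generations and must be converted to calendar years before they can be compared with historical events. The sketch below shows one common conversion. The 28-year generation time matches the value used elsewhere in these methods, whereas the reference year of 1950 and the "+1" offset are assumptions borrowed from common GLOBETROTTER practice, not settings confirmed for this study. The input values in the example are purely illustrative.

```python
def generations_to_year(generations, generation_time=28.0, reference_year=1950):
    """Convert an admixture date in generations before present to a calendar year (CE).

    reference_year and the +1 offset are assumed conventions, not values from the study.
    """
    return reference_year - (generations + 1) * generation_time


def interval_to_years(lower_gen, upper_gen, **kwargs):
    """Map a bootstrap interval in generations to (earliest, latest) calendar years."""
    # More generations ago means an earlier calendar year, so the bounds swap.
    return (generations_to_year(upper_gen, **kwargs),
            generations_to_year(lower_gen, **kwargs))


# Illustrative input only: a point estimate of 30 generations with a 25-35 interval.
print(generations_to_year(30))
print(interval_to_years(25, 35))
```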
Ancient ancestry components
To explore the genetic structure of the GCAT dataset in the context of older admixture, we used a publicly available ancient DNA dataset from Allentoft et al.45. The genetic clusters inferred by Allentoft et al.45 as proxies for the three principal ancient populations that explain the genetic diversity in present-day Europe were used: western hunter-gatherers (WHG), early Steppe nomads (ENS) and early European farmers (EEF). We used a preprocessing approach following recommended guidelines: discarding low-coverage individuals and keeping sites passing the 1000G genomic masks with MAF > 0.05 and INFO ≥ 0.8. This filtering strategy resulted in a dataset containing 2,997,159 SNPs. Restricting the analysis to transversion sites only yielded a dataset with 966,986 SNPs. A PCA was performed using the same procedure and pipeline as for Dataset C. Given the demonstrated accuracy of imputation for the ancient dataset relative to modern-day sequence data45, PCA projection was not applied.
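The transversion-only restriction mentioned above can be illustrated with a short Python sketch. Such a filter is commonly motivated by post-mortem deamination in ancient DNA, which mimics transitions (C↔T and G↔A). The function names and the tuple layout of the example sites are hypothetical and serve only to show the classification rule.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transversion(ref, alt):
    """Return True if a biallelic SNP swaps a purine for a pyrimidine or vice versa."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a biallelic SNP: {ref}>{alt}")
    # Transitions stay within one chemical class; transversions cross classes.
    return (ref in PURINES) != (alt in PURINES)


def keep_transversions(sites):
    """Filter (chrom, pos, ref, alt) tuples down to transversion sites."""
    return [s for s in sites if is_transversion(s[2], s[3])]


# Hypothetical example sites, for illustration only.
sites = [("1", 1000, "A", "G"), ("1", 2000, "A", "C"), ("2", 500, "C", "T"), ("2", 900, "G", "T")]
print(keep_transversions(sites))
```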
Analyses of positive selection
To investigate candidate regions under positive selection in the GCAT dataset, we employed two statistics: (i) the Singleton Density Score (SDS) to identify very recent selection events; and (ii) the Cross Population Extended Haplotype Homozygosity (XP-EHH) to detect selective sweeps where favoured variants have recently reached high frequencies (or fixation) in the GCAT relative to the YRI population.
To compute the SDS, SNPs in Dataset A were polarized based on their ancestral alleles using custom scripts and the Ensembl EPO fasta file (http://May2024.archive.ensembl.org/info/genome/compara/mlss.html?mlss=2006). Singletons were extracted into separate files. Test SNPs were processed by excluding rare variants (--maf 0.05 --max-maf 0.95) using PLINK and keeping sites with three genotypes, resulting in a dataset of 5,251,738 sites. Centromeric regions were also excluded from the analysis. We treated the observability of each variant as equal. Gamma shapes were inferred using a European population model based on Tennessen et al.99 (implemented in the SDS github repository) with a sample size of 1408 chromosomes for allele frequencies ranging from 0.05 to 0.95 (in 0.01 steps). Raw results were normalized by bins of derived allele frequencies of 0.05–0.95 (0.01 steps) and p-values were then computed. SNPs were classified as being in a candidate region for positive selection if they were in the 99.99% quantile of SDS values, accompanied by at least 10 additional variants within the 99.995% quantile in a 1 Mb genomic window.
For the XP-EHH analysis, we first phased Dataset B using SHAPEIT 4.1.397 and the 1000 GP reference panel39 and then computed normalized XP-EHH values using selscan v1.3.0100. SNPs were considered to be in a candidate region for positive selection if they fell within the 99.99% quantile of XP-EHH values and were accompanied by at least 10 additional variants in the 99.995% quantile within a 1 Mb genomic window.
SNPs in candidate regions for positive selection were functionally annotated using the Ensembl Variant Effect Predictor (VEP101) and each region was then manually explored for putative candidate variants (Supplementary Tables S2 and S6). In addition, we ran iSAFE using windows of 400 kb around the SNP with the highest selection signal in each candidate region using the --IgnoreGaps flag and the YRI population from the 1000 GP as the outgroup (Supplementary Tables S4 and S8). Putative candidate variants for selection were further validated using CLUES. To this end, we obtained genome-wide genealogies for each site with Relate using previously inferred coalescent times13 on a subset of the merged dataset with 1000 GP including all European, Han Chinese, and Yoruban populations. Population size was then estimated using --threshold 0.5 for the GCAT population to obtain specific coalescences for the GCAT data. Subsequently, we re-estimated the genealogy branch lengths using RelateCoalescentRate (--mode ReEstimateBranchLengths). Ancestral recombination graphs (ARGs) were sampled with the SampleBranchLengths script, assuming generation times of 28 years with 100 samples. Next, we used CLUES14 to estimate the selection coefficients (Supplementary Tables S5 and S9) and their corresponding allele frequency trajectories (Supplementary Figs. S9 to S25) using the previously inferred coalescence times and all the European samples from the ancient DNA dataset without excluding non-transversion sites. Selection inference was restricted to the oldest time sampled (i.e. 528 generations in San Teodoro 3 – ST3, from Sicily, Italy102).
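The frequency-binned normalization of raw SDS values can be sketched as follows. This is a minimal illustration, assuming that scores are standardized to zero mean and unit variance within each 1% derived-allele-frequency bin and that p-values are two-sided under a standard normal; the function names are hypothetical and the toy inputs are not real data.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def daf_bin(daf, width=0.01):
    """Index of the derived-allele-frequency bin (small epsilon guards float rounding)."""
    return int(math.floor(daf / width + 1e-9))


def normalize_sds(raw_sds, dafs, width=0.01):
    """Standardize raw SDS within each DAF bin (assumed scheme: z-score per bin)."""
    bins = defaultdict(list)
    for i, daf in enumerate(dafs):
        bins[daf_bin(daf, width)].append(i)
    z = [0.0] * len(raw_sds)
    for idx in bins.values():
        values = [raw_sds[i] for i in idx]
        mu, sd = mean(values), pstdev(values)
        for i in idx:
            z[i] = (raw_sds[i] - mu) / sd if sd > 0 else 0.0
    return z


def two_sided_p(z):
    """Two-sided p-value of a standard normal deviate."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Toy values for illustration only.
raw = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
dafs = [0.100, 0.101, 0.105, 0.500, 0.502, 0.509]
z_scores = normalize_sds(raw, dafs)
print([round(v, 3) for v in z_scores])
```

Normalizing within bins removes the dependence of raw SDS on allele frequency, so that scores from common and less common variants can be ranked on one scale.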
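The peak-calling rule used for both SDS and XP-EHH can be sketched in Python as below. The sketch assumes a window centred on the focal SNP (±500 kb) and a nearest-rank empirical quantile; neither detail is specified in the text, and the function names are hypothetical. The two cut-offs are passed in explicitly so that the quantiles stated above can be supplied as given.

```python
import math
from bisect import bisect_left, bisect_right


def empirical_quantile(values, q):
    """Nearest-rank empirical quantile: smallest value with at least a fraction q at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def candidate_snps(positions, scores, focal_cut, support_cut,
                   window=1_000_000, min_support=10):
    """Indices of SNPs on one chromosome that satisfy the peak rule.

    positions must be sorted; the window is assumed to be centred on the focal SNP.
    """
    support_pos = [p for p, s in zip(positions, scores) if s >= support_cut]
    half = window // 2
    hits = []
    for i, (pos, score) in enumerate(zip(positions, scores)):
        if score < focal_cut:
            continue
        lo = bisect_left(support_pos, pos - half)
        hi = bisect_right(support_pos, pos + half)
        n_support = hi - lo
        if score >= support_cut:
            n_support -= 1  # the focal SNP does not count as its own support
        if n_support >= min_support:
            hits.append(i)
    return hits


# Toy example: 12 clustered high-scoring SNPs and one isolated high-scoring SNP.
positions = [1000 * k for k in range(1, 13)] + [5_000_000]
scores = [10.0] * 13
print(candidate_snps(positions, scores, focal_cut=5.0, support_cut=8.0))
```

In the toy example only the clustered SNPs are returned, because the isolated SNP has no supporting variants within its window.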
Local ancestry deviations (LAD)
We explored LAD in the GCAT dataset using contemporary external populations and ancient genetic data. To investigate LAD with contemporary external data, we first ran RFmix103 on Dataset C using the following flags: -e 5 -n 5 --reanalyze-reference to apply the EM iteration algorithm and correct for admixed individuals in the reference populations. As reference samples we used the fineSTRUCTURE inferred genetic clusters of Southern France (SUD, comprising mostly samples from Provence and Dordogne) downsampled to 30 samples, Western North Africa (WNA, comprising Algerian, Tunisian and Moroccan samples from the Lazaridis dataset)95, and Eastern North Africa (ENA, comprising Egyptian and Bedouin samples) in one run (Supplementary Fig. S30). To check whether the detected signals could arise from a prior Levantine migration, we repeated the analysis, merging ENA and WNA into a single North African cluster and using the Palestinian (PAL) cluster as a Middle Eastern proxy. The --reanalyze-reference flag was used to account for possible admixture in the reference panel, which is expected in North African populations. To check whether the detected signals arose from a common ancestral source from Neolithic or post-Neolithic times, we repeated the local ancestry inference across the entire GCAT dataset, as well as the North African (ENA and WNA) and PAL clusters, using the EEF, ENS, and WHG inferred ancient genetic clusters45 as proxies.
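One common way to summarize local ancestry deviations is to compare the mean ancestry proportion in each genomic window with the genome-wide distribution. The sketch below illustrates that idea only; the z-score formulation, the threshold of 3 standard deviations, and the function names are assumptions made for illustration and are not taken from the analysis described above.

```python
from statistics import mean, pstdev


def local_ancestry_deviation(ancestry_by_window):
    """Z-score of each window's mean ancestry proportion against the genome-wide distribution."""
    genome_mean = mean(ancestry_by_window)
    genome_sd = pstdev(ancestry_by_window)
    if genome_sd == 0:
        return [0.0] * len(ancestry_by_window)
    return [(a - genome_mean) / genome_sd for a in ancestry_by_window]


def deviating_windows(ancestry_by_window, threshold=3.0):
    """Indices of windows whose deviation meets an (assumed) threshold in SD units."""
    z = local_ancestry_deviation(ancestry_by_window)
    return [i for i, v in enumerate(z) if abs(v) >= threshold]


# Toy input: per-window mean proportion of one ancestry component, averaged over individuals.
toy = [0.30] * 20 + [0.60]
print(deviating_windows(toy))
```

Here each input value would be the proportion of haplotypes assigned to a given reference cluster in one window, as produced by a local ancestry inference tool.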
Acknowledgements
We thank Evan Irving-Pease and Laura Vilà-Valls for their technical advice. This study makes use of data generated by the GCAT Genomes for Life, a cohort study of the Genomes of Catalonia, Fundació IGTP. IGTP is part of the CERCA Program / Generalitat de Catalunya. The authors of this study would like to acknowledge all GCAT project investigators who contributed to the generation of the GCAT data. A full list of the investigators is available from www.genomesforlife.com/. We thank the Blood and Tissue Bank from Catalonia (BST) and all the GCAT volunteers that participated in the study.
Author contributions
E.B. and F.C. conceived the study. J.G. performed all the computational analyses. S.A.B. and R.C. advised on genetic analyses, interpretation and discussions. J.G., E.B. and F.C. wrote the manuscript. All authors reviewed the manuscript.
Funding
This work was supported by PID2023-147621NB-I00 funded by MICIU/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”. JGC was supported with an FPI-MCIN/AEI PhD contract (PRE2020-095762). GCAT was funded by Acción de Dinamización del ISCIII-MINECO and the Ministry of Health of the Generalitat of Catalunya [ADE 10/00026] and is supported by the Agència de Gestió d’Ajuts Universitaris i de Recerca (AGAUR) [SGR 01537]. SAB was also supported by the Czech Ministry of Education, Youth and Sports (CZ.02.01.01/00/22_008/0004593, RES-HUM: Ready for the Future: Understanding the Long-Term Resilience of Human Culture grant).
Data availability
WGS data for the GCAT cohort are available at EGA (https://ega-archive.org/) under accession number EGAD00001007774.
Code availability
No software was written specifically for this project. Code used in the project can be downloaded from the following github repository: https://github.com/JorgeGarciaC/SelDem-GCAT
Declarations
Competing interests
The authors declare no competing interests.
Ethical approval
The study was approved by the institutional review board CEIm—PSMAR (reference number 2021/9767/I) and by the institutional Committee for Ethical Review of Projects (CIREP) at Universitat Pompeu Fabra (reference number 296).
Footnotes
Contributor Information
Francesc Calafell, Email: francesc.calafell@upf.edu.
Elena Bosch, Email: elena.bosch@upf.edu.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-98272-w.
References
- 1. Montinaro, F., Pankratov, V., Yelmen, B., Pagani, L. & Mondal, M. Revisiting the out of Africa event with a deep-learning approach. Am. J. Hum. Genet. 108, 2037–2051 (2021).
- 2. Bergström, A., Stringer, C., Hajdinjak, M., Scerri, E. M. L. & Skoglund, P. Origins of modern human ancestry. Nature 590, 229–237 (2021).
- 3. Fumagalli, M. et al. Signatures of environmental genetic adaptation pinpoint pathogens as the main selective pressure through human evolution. PLoS Genet. 7, e1002355 (2011).
- 4. Fumagalli, M. et al. Greenlandic Inuit show genetic signatures of diet and climate adaptation. Science 349, 1343–1347 (2015).
- 5. Caro-Consuegra, R. et al. Uncovering signals of positive selection in Peruvian populations from three ecological regions. Mol. Biol. Evol. 39, msac158 (2022).
- 6. Irving-Pease, E. K. et al. The selection landscape and genetic legacy of ancient Eurasians. Nature 625, 312–320 (2024).
- 7. Sinigaglia, B. et al. Exploring adaptive phenotypes for the human calcium-sensing receptor polymorphism R990G. Mol. Biol. Evol. 41, msae015 (2024).
- 8. Rees, J. S., Castellano, S. & Andrés, A. M. The genomics of human local adaptation. Trends Genet. 36, 415–428 (2020).
- 9. Garud, N. R., Messer, P. W., Buzbas, E. O. & Petrov, D. A. Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genet. 11, e1005004 (2015).
- 10. Sabeti, P. C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913–918 (2007).
- 11. Voight, B. F., Kudaravalli, S., Wen, X. & Pritchard, J. K. A map of recent positive selection in the human genome. PLoS Biol. 4, e72 (2006).
- 12. Akbari, A. et al. Identifying the favored mutation in a positive selective sweep. Nat. Methods 15, 279–282 (2018).
- 13. Speidel, L., Forest, M., Shi, S. & Myers, S. R. A method for genome-wide genealogy estimation for thousands of samples. Nat. Genet. 51, 1321–1329 (2019).
- 14. Stern, A. J., Wilton, P. R. & Nielsen, R. An approximate full-likelihood method for inferring selection and allele frequency trajectories from DNA sequence data. PLoS Genet. 15, e1008384 (2019).
- 15. Field, Y. et al. Detection of human adaptation during the past 2000 years. Science 354, 760–764 (2016).
- 16. Smith, J. M. & Haigh, J. The hitch-hiking effect of a favourable gene. Genet. Res. 23, 23–35 (1974).
- 17. Kaplan, N. L., Hudson, R. R. & Langley, C. H. The ‘hitchhiking effect’ revisited. Genetics 123, 887–899 (1989).
- 18. Enattah, N. S. et al. Identification of a variant associated with adult-type hypolactasia. Nat. Genet. 30, 233–237 (2002).
- 19. Jablonski, N. G. & Chaplin, G. Epidermal pigmentation in the human lineage is an adaptation to ultraviolet radiation. J. Hum. Evol. 65, 671–675 (2013).
- 20. Huerta-Sánchez, E. et al. Altitude adaptation in Tibetans caused by introgression of Denisovan-like DNA. Nature 512, 194–197 (2014).
- 21. Farré, X. et al. Skin phototype and disease: A comprehensive genetic approach to pigmentary traits pleiotropy using PRS in the GCAT Cohort. Genes 14, 149 (2023).
- 22. Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
- 23. Loh, P.-R. et al. Inferring admixture histories of human populations using linkage disequilibrium. Genetics 193, 1233–1254 (2013).
- 24. Hellenthal, G. et al. A genetic atlas of human admixture history. Science 343, 747–751 (2014).
- 25. Salter-Townshend, M. & Myers, S. Fine-scale inference of ancestry segments without prior knowledge of admixing groups. Genetics 212, 869–889 (2019).
- 26. Olalde, I. et al. The genomic history of the Iberian Peninsula over the past 8000 years. Science 363, 1230–1234 (2019).
- 27. Villalba-Mouco, V. et al. Survival of Late Pleistocene hunter-gatherer ancestry in the Iberian Peninsula. Curr. Biol. 29, 1169-1177.e7 (2019).
- 28. Lazaridis, I. et al. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature 513, 409–413 (2014).
- 29. Moorjani, P. et al. The history of African gene flow into Southern Europeans, Levantines, and Jews. PLoS Genet. 7, e1001373 (2011).
- 30. Bycroft, C. et al. Patterns of genetic differentiation and the footprints of historical migrations in the Iberian Peninsula. Nat. Commun. 10, 551 (2019).
- 31. Hernández, C. L. et al. Human genomic diversity where the Mediterranean joins the Atlantic. Mol. Biol. Evol. 37, 1041–1055 (2020).
- 32. Biagini, S. A., Ramos-Luis, E., Comas, D. & Calafell, F. The place of metropolitan France in the European genomic landscape. Hum. Genet. 139, 1091–1105 (2020).
- 33. Salas Ausens, J. A. En busca de El Dorado: inmigración francesa en la España de la Edad Moderna (Universidad del País Vasco, 2009).
- 34. Millàs i Castellví, C. Els altres catalans dels segles XVI i XVII: la immigració francesa al Baix Llobregat en temps dels Àustria. (2005).
- 35. Rumech, R. S. i. Quan la terra promesa era al sud. La immigració francesa al Maresme als segles XVI i XVII. Fundació Huro. Paratge 132–132 (2015).
- 36. Barquer I Cerdà, A., Congost I Colomer, R. & Mutos Xicola, C. El reto de reconstituir procesos migratorios. Diferentes modelos de migraciones francesas en la diócesis de Girona en la época moderna. Revista de Demografía Histórica-Journal of Iberoamerican Population Studies 40, 61–88 (2022).
- 37. Dopazo, J. et al. 267 Spanish exomes reveal population-specific differences in disease-related genetic variation. Mol. Biol. Evol. 33, 1205–1218 (2016).
- 38. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
- 39. Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426-3440.e19 (2022).
- 40. Obón-Santacana, M. et al. GCAT|Genomes for life: A prospective cohort study of the genomes of Catalonia. BMJ Open 8, e018324 (2018).
- 41. Galván-Femenía, I. et al. Multitrait genome association analysis identifies new susceptibility genes for human anthropometric variation in the GCAT cohort. J. Med. Genet. 55, 765–778 (2018).
- 42. Valls-Margarit, J. et al. GCAT|Panel, a comprehensive structural variant haplotype map of the Iberian population from high-coverage whole-genome sequencing. Nucleic Acids Res. 50, 2464–2479 (2022).
- 43. Fibla, J. et al. The power of geohistorical boundaries for modeling the genetic background of human populations: The case of the rural Catalan Pyrenees. Front. Genet. 13, 1100440 (2023).
- 44. Biagini, S. A. et al. People from Ibiza: An unexpected isolate in the Western Mediterranean. Eur. J. Hum. Genet. 27, 941–951 (2019).
- 45. Allentoft, M. E. et al. Population genomics of post-glacial western Eurasia. Nature 625, 301–311 (2024).
- 46. Flores-Bello, A. et al. Genetic origins, singularity, and heterogeneity of Basques. Curr. Biol. 31, 2167-2177.e4 (2021).
- 47. Wangkumhang, P., Greenfield, M. & Hellenthal, G. An efficient method to identify, date, and describe admixture events using haplotype information. Genome Res. 32, 1553–1564 (2022).
- 48. Norton, H. L. et al. Genetic evidence for the convergent evolution of light skin in Europeans and East Asians. Mol. Biol. Evol. 24, 710–722 (2007).
- 49. Pickrell, J. K. et al. Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 19, 826–837 (2009).
- 50. Lamason, R. L. et al. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science 310, 1782–1786 (2005).
- 51. Ginger, R. S. et al. SLC24A5 encodes a trans-golgi network protein with potassium-dependent sodium-calcium exchange activity that regulates human epidermal melanogenesis. J. Biol. Chem. 283, 5486–5495 (2008).
- 52. Myung, C. H., Lee, J. E., Jo, C. S., Park, J. & Hwang, J. S. Regulation of Melanophilin (Mlph) gene expression by the glucocorticoid receptor (GR). Sci. Rep. 11, 16813 (2021).
- 53. Ermini, L. et al. Evolutionary selection of alleles in the melanophilin gene that impacts on prostate organ function and cancer risk. Evol. Med. Public Health 9, 311–321 (2021).
- 54. Cario-André, M., Pain, C., Gauthier, Y., Casoli, V. & Taieb, A. In vivo and in vitro evidence of dermal fibroblasts influence on human epidermal pigmentation. Pigment. Cell Res. 19, 434–442 (2006).
- 55. Huff, C. D. et al. Crohn’s disease and genetic hitchhiking at IBD5. Mol. Biol. Evol. 29, 101–111 (2012).
- 56. Mathieson, S. & Mathieson, I. FADS1 and the timing of human adaptation to agriculture. Mol. Biol. Evol. 35, 2957–2970 (2018).
- 57. Longo, N., Frigeni, M. & Pasquali, M. Carnitine transport and fatty acid oxidation. Biochim. Biophys. Acta Mol. Cell Res. 1863, 2422–2435 (2016).
- 58. Lam, Y. W. F. Chapter 1 - Principles of Pharmacogenomics: Pharmacokinetic, Pharmacodynamic, and Clinical Implications. In Pharmacogenomics (Second Edition) (eds Lam, Y. W. F. & Scott, S. A.) 1–53 (Academic Press, 2019). 10.1016/B978-0-12-812626-4.00001-2.
- 59. Wang, Z. et al. Signatures of recent positive selection at the ATP-binding cassette drug transporter superfamily gene loci. Hum. Mol. Genet. 16, 1367–1380 (2007).
- 60. Yang, X. et al. Systematic genetic and genomic analysis of cytochrome P450 enzyme activities in human liver. Genome Res. 20, 1020–1036 (2010).
- 61.Liu, X. et al. Detecting and characterizing genomic signatures of positive selection in global populations. Am. J. Hum. Genet.92, 866–881 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Han, J. et al. TMEM232 promotes the inflammatory response in atopic dermatitis via the nuclear factor-κB and signal transducer and activator of transcription 3 signalling pathways. Br. J. Dermatol.189, 195–209 (2023). [DOI] [PubMed] [Google Scholar]
- 63.Yahalom, V. et al. SMYD1 is the underlying gene for the AnWj-negative blood group phenotype. Eur. J. Haematol.101, 496–501 (2018). [DOI] [PubMed] [Google Scholar]
- 64.Poole, J. & Van Alphen, L. Haemophilus influenzae receptor and the AnWj antigen. Transfusion28, 289–289 (1988). [DOI] [PubMed] [Google Scholar]
- 65.van Alphen, L., van Ham, M., Geelen-van den Broek, L. & Pieters, T. Relationship between secretion of the Anton blood group antigen in saliva and adherence of Haemophilus influenzae to oropharynx epithelial cells. FEMS Microbiol. Lett.47, 357–362 (1989). [DOI] [PubMed] [Google Scholar]
- 66. Brusselmans, K. et al. Squalene synthase, a determinant of raft-associated cholesterol and modulator of cancer cell proliferation. J. Biol. Chem. 282, 18777–18785 (2007).
- 67. Weng, M. et al. Fasting inhibits aerobic glycolysis and proliferation in colorectal cancer via the Fdft1-mediated AKT/mTOR/HIF1α pathway suppression. Nat. Commun. 11, 1869 (2020).
- 68. Graham, S. E. et al. The power of genetic diversity in genome-wide association studies of lipids. Nature 600, 675–679 (2021).
- 69. Harrison, P. W. et al. Ensembl 2024. Nucleic Acids Res. 52, D891–D899 (2024).
- 70. Körner, A. et al. Sema7A is crucial for resolution of severe inflammation. Proc. Natl. Acad. Sci. USA 118, e2017527118 (2021).
- 71. Ghofrani, J., Lucar, O., Dugan, H., Reeves, R. K. & Jost, S. Semaphorin 7A modulates cytokine-induced memory-like responses by human natural killer cells. Eur. J. Immunol. 49, 1153–1166 (2019).
- 72. Gras, C. et al. Semaphorin 7A protein variants differentially regulate T-cell activity. Transfusion 53, 270–283 (2013).
- 73. Holmes, S. et al. Sema7A is a potent monocyte stimulator. Scand. J. Immunol. 56, 270–275 (2002).
- 74. May-Wilson, S. et al. Large-scale GWAS of food liking reveals genetic determinants and genetic correlations with distinct neurophysiological traits. Nat. Commun. 13, 2743 (2022).
- 75. Secolin, R. et al. Distribution of local ancestry and evidence of adaptation in admixed populations. Sci. Rep. 9, 13900 (2019).
- 76. Secolin, R. et al. Exploring a region on chromosome 8p23.1 displaying positive selection signals in Brazilian admixed populations: Additional insights into predisposition to obesity and related disorders. Front. Genet. 12, 636542 (2021).
- 77. Alvarez, M. I. et al. Human genetic variation in VAC14 regulates Salmonella invasion and typhoid fever through modulation of cholesterol. Proc. Natl. Acad. Sci. USA 114, E7746–E7755 (2017).
- 78. Gilchrist, J. J. et al. Genetic variation in VAC14 is associated with bacteremia secondary to diverse pathogens in African children. Proc. Natl. Acad. Sci. USA 115, E3601–E3603 (2018).
- 79. Bhatia, G. et al. Genome-wide scan of 29,141 African Americans finds no evidence of directional selection since admixture. Am. J. Hum. Genet. 95, 437–444 (2014).
- 80. Hejase, H. A., Mo, Z., Campagna, L. & Siepel, A. A deep-learning approach for inference of selective sweeps from the ancestral recombination graph. Mol. Biol. Evol. 39, msab332 (2021).
- 81. Mathieson, I. et al. Genome-wide patterns of selection in 230 ancient Eurasians. Nature 528, 499–503 (2015).
- 82. Sazzini, M. et al. Genomic history of the Italian population recapitulates key evolutionary dynamics of both Continental and Southern Europeans. BMC Biol. 18, 51 (2020).
- 83. Key, F. M. et al. Emergence of human-adapted Salmonella enterica is linked to the Neolithization process. Nat. Ecol. Evol. 4, 324–333 (2020).
- 84. L’Hôte, L. et al. An 8000 years old genome reveals the Neolithic origin of the zoonosis Brucella melitensis. Nat. Commun. 15, 6132 (2024).
- 85. Barrie, W. et al. Elevated genetic risk for multiple sclerosis emerged in steppe pastoralist populations. Nature 625, 321–328 (2024).
- 86. Mendoza-Revilla, J. et al. Disentangling signatures of selection before and after European colonization in Latin Americans. Mol. Biol. Evol. 39, msac076 (2022).
- 87. Arezki Ferrad, M. Repaso de la historia de los amazighes en al-Ándalus desde la conquista hasta los reinos taifas. In Los bereberes en la Península Ibérica: contribución de los Amazighes a la historia de al-Ándalus 81–104 (Editorial Universidad de Granada, 2021). ISBN 978-84-338-6790-2.
- 88. Blay, N. et al. Disease prevalence, health-related and socio-demographic factors in the GCAT cohort. A comparison with the general population of Catalonia. Preprint at https://doi.org/10.1101/2023.09.08.23295239 (2023).
- 89. Picard toolkit. Broad Institute, GitHub repository (2019).
- 90. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
- 91. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, s13742-015-0047-8 (2015).
- 92. Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLOS Genet. 2, e190 (2006).
- 93. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
- 94. Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).
- 95. Lazaridis, I. et al. Genomic insights into the origin of farming in the ancient Near East. Nature 536, 419–424 (2016).
- 96. Behr, A. A., Liu, K. Z., Liu-Fang, G., Nakka, P. & Ramachandran, S. pong: Fast analysis and visualization of latent clusters in population genetic data. Bioinformatics 32, 2817–2823 (2016).
- 97. Delaneau, O., Zagury, J.-F., Robinson, M. R., Marchini, J. L. & Dermitzakis, E. T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 10, 5436 (2019).
- 98. Lawson, D. J., Hellenthal, G., Myers, S. & Falush, D. Inference of population structure using dense haplotype data. PLOS Genet. 8, e1002453 (2012).
- 99. Tennessen, J. A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012).
- 100. Szpiech, Z. A. & Hernandez, R. D. selscan: An efficient multithreaded program to perform EHH-based scans for positive selection. Mol. Biol. Evol. 31, 2824–2827 (2014).
- 101. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
- 102. Scorrano, G. et al. Genomic ancestry, diet and microbiomes of Upper Palaeolithic hunter-gatherers from San Teodoro cave. Commun. Biol. 5, 1–13 (2022).
- 103. Maples, B. K., Gravel, S., Kenny, E. E. & Bustamante, C. D. RFMix: A discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278–288 (2013).
Data Availability Statement
WGS data for the GCAT cohort are available at the EGA (https://ega-archive.org/) under accession number EGAD00001007774.
No software was written specifically for this project. The code used in the project can be downloaded from the following GitHub repository: https://github.com/JorgeGarciaC/SelDem-GCAT